Project Notes
Project Title: Speeding up Genome Sequence Alignment Using Volunteer Computing
Name: Jay Nagpaul


Note Well: There are NO SHORT-cuts to reading journal articles and taking notes from them. Comprehension is paramount. You will most likely need to read it several times, so set aside enough time in your schedule.

Contents:
Important Notes
Knowledge Gaps
Literature Search Parameters
Article #0 Notes: Title
Article #1 Notes: A Multi-Modal Feature Embedding Approach to Diagnose Alzheimer Disease from Spoken Language
Article #2 Notes: Efficient genomic read alignment in an in-memory database
Article #3 Notes: Faster and More Accurate Sequence Alignment with SNAP
Article #4 Notes: The Cost of Sequencing a Human Genome
Article #5 Notes: Fast Mapping of Short Sequences with Mismatches, Insertions and Deletions Using Index Structures
Article #6 Notes: System and method for grid and cloud computing
Article #7 Notes: Accelerating Bowtie2 with a lock-less concurrency approach and memory affinity
Article #8 Notes: bíogo: a simple high-performance bioinformatics toolkit for the Go language
Article #9 Notes: Best Practices for Scientific Computing
Article #10 Notes: Biopython: freely available Python tools for computational molecular biology and bioinformatics
Article #11 Notes: BioJava: an open-source framework for bioinformatics
Article #12 Notes: On the performance and design of BioSequences compared to the Seq language
Article #13 Notes: Title

Important Notes:

Notes:
● Rust would be a great candidate for the project
  ○ Memory safe
  ○ Fast
  ○ Concurrent
Article: 7
Date resolved: 10/8/2020

Knowledge Gaps:

This list provides a brief overview of the major knowledge gaps for this project, how they were resolved and where to find the information.
Knowledge Gap | Resolved By | Information is located | Date resolved

Literature Search Parameters:

These searches were performed between (Start Date of reading) and XX/XX/2019.

List of keywords and databases used during this project.

Database/search engine | Keywords | Summary of search
Google Patents | bowtie2 |
Arxiv | Faster and More Accurate Sequence Alignment with SNAP |
Google scholar | Genome sequencing |
Google scholar | bowtie |
Google scholar | biogo |

Article #0 Notes: Title

Article notes should be on separate sheets
KEEP THIS BLANK AND USE AS A TEMPLATE

Source Title:
Source citation (APA Format):
Original URL:
Source type:
Keywords:
Summary of key points (include methodology):
Research Question/Problem/Need:
Important Figures:
Notes:
Cited references to follow up on:
Follow up Questions:

Article #1 Notes: A Multi-Modal Feature Embedding Approach to Diagnose Alzheimer Disease from Spoken Language

Article notes should be on separate sheets

Source Title: A Multi-Modal Feature Embedding Approach to Diagnose Alzheimer Disease from Spoken Language
Source citation (APA Format): Zargarbashi, S. S. H., & Babaali, B. (2019). A Multi-Modal Feature Embedding Approach to Diagnose Alzheimer Disease from Spoken Language. https://arxiv.org/abs/1910.00330v1
Original URL: https://arxiv.org/abs/1910.00330v1
Source type: ArXiv paper
Keywords: Alzheimer’s, Machine Learning, Computer Science, Language
Summary of key points (include methodology):
- Alzheimer’s early diagnosis crucial
- Current tests are costly, slow, and time-consuming
- Enter: machine learning
- Algorithm which analyzes spoken language of patient
  - Looking for semantic errors, irregular tone/acoustics, and other irregularities
- 3 models were tested:
  - I-vector & x-vector
    - Operated on the sound itself
    - No semantic understanding
  - N-gram
    - Operates on word semantics and ordering
- Combination of 3 models diagnosed with 83.6% accuracy
Research Question/Problem/Need: How can we diagnose Alzheimer’s earlier for more people?
Important Figures: N/A
Notes:
- Specific details on the algo were hard to grasp. <- possible knowledge gap?
- Article results were quite light. <- Inconclusive research?
- Low resource usage
- Able to be implemented in a variety of languages
  - This is especially promising for my initial idea
- This model isn’t specific to vocal analysis
  - The actual details of the model can be incorporated in other data
Cited references to follow up on:
P. F. Brown, P. V. Desouza, R. L. Mercer, V. J. D. Pietra, and J. C. Lai, “Class-based n-gram models of natural language,” Computational linguistics, vol. 18, no. 4, pp. 467–479, 1992. <- Possible explanation for the algo details
Follow up Questions:
What are the specific details of the algorithm?
How effective is this today?
What is the false positives percentage? Answered in results: 83.6% success rate
How general purpose are the vector models?
Are these vocal samples public?
Why were random noise samples added to the data?
Is the model overfitted? 83.6% seems like an unrealistic level of accuracy.

Article #2 Notes: Efficient genomic read alignment in an in-memory database

Source Title: Efficient genomic read alignment in an in-memory database
Source citation (APA Format): Plattner, H., Schapranow, M., & Ziegler, E. (2014). European Patent No. EP2759952A1. Munich, Germany. European Patent Office.
Original URL: https://patents.google.com/patent/EP2759952A1/en
Source type: Patent
Keywords: Search term: Bowtie2, bioinformatics, possible project
Summary of key points (include methodology):
● Software for genomic alignment
  ○ Faster than existing solutions
● Sequencing becoming more ubiquitous in modern research
● Patent sequences faster and cheaper
● Processes in parallel!
  ○ Validation for my idea
● Goes on to describe details about the design itself
  ○ Not necessary for my project specifically
● Chunks data
  ○ Chunks are processed concurrently
● Uses a query system
  ○ I.e. Rust’s RLS new development strategy
  ○ Based on in-memory database infrastructure (IMDB)
● Distributes using multiple computer cores
  ○ And possibly multiple computers
Research Question/Problem/Need: Genome alignment is an expensive and slow process. Can we develop a better alternative?
Important Figures:
Notes:
● Super pleased about this patent
  ○ Verifies my assumption that parallelization would lead to beneficial improvements in genome alignments
● Paper also mentioned Bowtie2 as a possible tool which could work with their software
● Truly no unique ideas anymore
  ○ Will have to consult with Sontheimer for other ideas
  ○ Although this one isn’t open source
Cited references to follow up on:
LANGMEAD, B.; SALZBERG, S.L.: "Fast gapped-read alignment with Bowtie 2", NATURE METHODS, vol. 9, no. 4, April 2012 (2012-04-01), pages 357 - 9, XP002715401, DOI: doi:10.1038/nmeth.1923
STEVE HOFFMANN ET AL: "Fast Mapping of Short Sequences with Mismatches, Insertions and Deletions Using Index Structures", PLOS COMPUTATIONAL BIOLOGY, vol. 5, no. 9, 11 September 2009 (2009-09-11), pages e1000502, XP055070487, DOI: 10.1371/journal.pcbi.1000502 *
ZAHARIA, M.; BOLOSKY, W.J.; CURTIS, K.; FOX, A.; PATTERSON, D.; SHENKER, S.; STOICA, I.; KARP, R.M.; SITTLER, T., FASTER AND MORE ACCURATE SEQUENCE ALIGNMENT WITH SNAP, November 2011 (2011-11-01)
Follow up Questions:
● Parallelization is a technique I’m familiar with when applied to projects such as these
  ○ This is the first time I’m hearing about IMDB
  ○ Is it a common use case?
    ■ (look into badger- GoLang)
    ■ ANSWERED https://docs.oracle.com/en/database/oracle/oracle-database/19/vldbg/inmemory-parallel-exec.html
      ● Yes, it’s an underlying model behind many RDBMS databases today
● How does this system differ from Bowtie’s?
● The patent repeatedly alludes to a new algorithm: their secret sauce
  ○ Claims it takes advantage of shortcuts, but is simple
  ○ Would a project I develop be focused more on new algorithmic design rather than utilizing modern techniques?
    ■ Similar but not quite the same
      ● i.e. rewriting algorithms vs implementing PVC network

Article #3 Notes: Faster and More Accurate Sequence Alignment with SNAP

Source Title: Faster and More Accurate Sequence Alignment with SNAP
Source citation (APA Format): Zaharia, M., Bolosky, W. J., Curtis, K., Fox, A., Patterson, D., Shenker, S., Stoica, I., Karp, R. M., & Sittler, T. (2011). Faster and more accurate sequence alignment with snap. ArXiv:1111.5572 [Cs, q-Bio]. http://arxiv.org/abs/1111.5572
Original URL: https://arxiv.org/abs/1111.5572
Source type: ArXiv Paper
Keywords: Bowtie2, Sequence alignment, optimization
Summary of key points (include methodology):
● New sequence aligner (alternative to Bowtie2)
● Uses new algorithm
  ○ Not BWA (B2’s alg)
● 10-100x speed up
  ○ Cheaper to run
    ■ $2 AWS unit
● Tested on "m2.4xlarge" Amazon EC2 machine with 68 GB RAM.
● Optimized for more error prone data
  ○ A notable area which is weak in existing tools
Research Question/Problem/Need: How can we speed up nucleotide alignment while preserving a high accuracy percentage?
Important Figures: Can see the sheer magnitude of improvement
Notes:
● Github: https://github.com/amplab/snap
  ○ Hm, seems less popular than bowtie2
    ■ Odd considering its enormous speed improvements
● Optimized for both short and long read alignments
● Results necessarily show an improvement in accuracy
  ○ BWA: 93.0% reads 0.05% error 662 reads/s
  ○ SNAP: 94.1% 0.05% error 34100 reads/s.
  ○ Will this hold in all cases?
● 40 GitHub issues
  ○ Large amount of duplicate bug reports
  ○ Paper is light on flaws of SNAP?
● Last paragraph confirms my suspicions about how to approach the project
  ○ Reconsidering the algorithms at the core of such projects can lead to interesting speed ups and benefits
Cited references to follow up on:
R. Li, C. Yu, Y. Li, T.-W. W. Lam, S.-M. M. Yiu, K. Kristiansen, and J. Wang. SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics, 25(15):1966–1967, Aug. 2009.
Z. Ning, A. J. Cox, and J. C. Mullikin. SSAHA: A fast search method for large DNA databases. Genome Research, 11(10):1725–1729, 2001.
NHGRI data on DNA Sequencing Costs. http://www.genome.gov/sequencingcosts/.
Follow up Questions:
● Why isn’t SNAP more popular if it’s such a considerable upgrade over bowtie2?
  ○ ANSWERED
    ■ Github reveals a shocking number of bugs
    ■ 40 open issues vs Bowtie2’s 7.
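
A quick sanity check on the throughput figures recorded in the Article #3 notes (BWA at 662 reads/s, SNAP at 34100 reads/s). This is a minimal Python sketch, not taken from the paper: it computes the implied speedup and how long each rate would need for a given workload. The function names and the 100-million-read workload are assumptions chosen only for illustration.

```python
# Throughput figures as recorded in the Article #3 notes (reads per second).
BWA_READS_PER_SEC = 662
SNAP_READS_PER_SEC = 34_100


def speedup(baseline_rate: float, new_rate: float) -> float:
    """Return how many times faster new_rate is than baseline_rate."""
    return new_rate / baseline_rate


def hours_to_align(num_reads: int, reads_per_sec: float) -> float:
    """Return the hours needed to align num_reads at a constant rate."""
    seconds = num_reads / reads_per_sec
    return seconds / 3600


if __name__ == "__main__":
    # Hypothetical workload size, used only to make the comparison concrete.
    example_reads = 100_000_000

    ratio = speedup(BWA_READS_PER_SEC, SNAP_READS_PER_SEC)
    print(f"SNAP vs BWA speedup: {ratio:.1f}x")
    print(f"BWA:  {hours_to_align(example_reads, BWA_READS_PER_SEC):.2f} hours")
    print(f"SNAP: {hours_to_align(example_reads, SNAP_READS_PER_SEC):.2f} hours")
```

The ratio comes out to roughly 51.5x, which sits inside the 10-100x range recorded in the summary. It assumes a constant rate on a single machine, so it says nothing about whether the result holds in all cases.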