Developing Bioinformatics Computer Skills.Pdf

Total Page:16

File Type:pdf, Size:1020Kb

Developing Bioinformatics Computer Skills
Cynthia Gibas, Per Jambeck
Publisher: O'Reilly
First Edition April 2001
ISBN: 1-56592-664-1, 446 pages

Developing Bioinformatics Computer Skills will help biologists, researchers, and students develop a structured approach to biological data and the computer skills they'll need to analyze it. The book covers the Unix file system, building tools and databases for bioinformatics, computational approaches to biological problems, an introduction to Perl for bioinformatics, data mining, data visualization, and tips for tailoring data analysis software to individual research needs.

Delivered for Maurice ling. Last updated on 10/30/2001. Swap Option Available: 7/15/2002. Developing Bioinformatics Computer Skills, © 2002, O'Reilly & Associates, Inc. http://safari.oreilly.com/main.asp?bookname=bioskills [6/2/2002 8:49:35 AM]

Copyright © 2001 O'Reilly & Associates, Inc. All rights reserved. Printed in the United States of America. Published by O'Reilly & Associates, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472. O'Reilly & Associates books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://safari.oreilly.com). For more information contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.
The O'Reilly logo is a registered trademark of O'Reilly & Associates, Inc. Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and O'Reilly & Associates, Inc. was aware of a trademark claim, the designations have been printed in caps or initial caps. The association between the image of a Caenorhabditis elegans and the topic of bioinformatics is a trademark of O'Reilly & Associates, Inc. While every precaution has been taken in the preparation of this book, the publisher assumes no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.

Table of Contents

Preface
  Audience for This Book
  Structure of This Book
  Our Approach to Bioinformatics
  URLs Referenced in This Book
  Conventions Used in This Book
  Comments and Questions
  Acknowledgments

I: Introduction

1. Biology in the Computer Age
  1.1 How Is Computing Changing Biology?
  1.2 Isn't Bioinformatics Just About Building Databases?
  1.3 What Does Informatics Mean to Biologists?
  1.4 What Challenges Does Biology Offer Computer Scientists?
  1.5 What Skills Should a Bioinformatician Have?
  1.6 Why Should Biologists Use Computers?
  1.7 How Can I Configure a PC to Do Bioinformatics Research?
  1.8 What Information and Software Are Available?
  1.9 Can I Learn a Programming Language Without Classes?
  1.10 How Can I Use Web Information?
  1.11 How Do I Understand Sequence Alignment Data?
  1.12 How Do I Write a Program to Align Two Biological Sequences?
  1.13 How Do I Predict Protein Structure from Sequence?
  1.14 What Questions Can Bioinformatics Answer?

2. Computational Approaches to Biological Questions
  2.1 Molecular Biology's Central Dogma
  2.2 What Biologists Model
  2.3 Why Biologists Model
  2.4 Computational Methods Covered in This Book
  2.5 A Computational Biology Experiment

II: The Bioinformatics Workstation

3. Setting Up Your Workstation
  3.1 Working on a Unix System
  3.2 Setting Up a Linux Workstation
  3.3 How to Get Software Working
  3.4 What Software Is Needed?

4. Files and Directories in Unix
  4.1 Filesystem Basics
  4.2 Commands for Working with Directories and Files
  4.3 Working in a Multiuser Environment

5. Working on a Unix System
  5.1 The Unix Shell
  5.2 Issuing Commands on a Unix System
  5.3 Viewing and Editing Files
  5.4 Transformations and Filters
  5.5 File Statistics and Comparisons
  5.6 The Language of Regular Expressions
  5.7 Unix Shell Scripts
  5.8 Communicating with Other Computers
  5.9 Playing Nicely with Others in a Shared Environment

III: Tools for Bioinformatics

6. Biological Research on the Web
  6.1 Using Search Engines
  6.2 Finding Scientific Articles
  6.3 The Public Biological Databases
  6.4 Searching Biological Databases
  6.5 Depositing Data into the Public Databases
  6.6 Finding Software
  6.7 Judging the Quality of Information

7. Sequence Analysis, Pairwise Alignment, and Database Searching
  7.1 Chemical Composition of Biomolecules
  7.2 Composition of DNA and RNA
  7.3 Watson and Crick Solve the Structure of DNA
  7.4 Development of DNA Sequencing Methods
  7.5 Genefinders and Feature Detection in DNA
  7.6 DNA Translation
  7.7 Pairwise Sequence Comparison
  7.8 Sequence Queries Against Biological Databases
  7.9 Multifunctional Tools for Sequence Analysis

8. Multiple Sequence Alignments, Trees, and Profiles
  8.1 The Morphological to the Molecular
  8.2 Multiple Sequence Alignment
  8.3 Phylogenetic Analysis
  8.4 Profiles and Motifs

9. Visualizing Protein Structures and Computing Structural Properties
  9.1 A Word About Protein Structure Data
  9.2 The Chemistry of Proteins
  9.3 Web-Based Protein Structure Tools
  9.4 Structure Visualization
  9.5 Structure Classification
  9.6 Structural Alignment
  9.7 Structure Analysis
  9.8 Solvent Accessibility and Interactions
  9.9 Computing Physicochemical Properties
  9.10 Structure Optimization
  9.11 Protein Resource Databases
  9.12 Putting It All Together

10. Predicting Protein Structure and Function from Sequence
  10.1 Determining the Structures of Proteins
  10.2 Predicting the Structures of Proteins
  10.3 From 3D to 1D
  10.4 Feature Detection in Protein Sequences
  10.5 Secondary Structure Prediction
  10.6 Predicting 3D Structure
  10.7 Putting It All Together: A Protein Modeling Project
  10.8 Summary

11. Tools for Genomics and Proteomics
  11.1 From Sequencing Genes to Sequencing Genomes
  11.2 Sequence Assembly
  11.3 Accessing Genome Information on the Web
  11.4 Annotating and Analyzing Whole Genome Sequences
  11.5 Functional Genomics: New Data Analysis Challenges
  11.6 Proteomics
  11.7 Biochemical Pathway Databases
  11.8 Modeling Kinetics and Physiology
  11.9 Summary

IV: Databases and Visualization

12. Automating Data Analysis with Perl
  12.1 Why Perl?
  12.2 Perl Basics
  12.3 Pattern Matching and Regular Expressions
  12.4 Parsing BLAST Output Using Perl
  12.5 Applying Perl to Bioinformatics

13. Building Biological Databases
  13.1 Types of Databases
  13.2 Database Software
  13.3 Introduction to SQL
  13.4 Installing the MySQL DBMS
  13.5 Database Design
  13.6 Developing Web-Based Software That Interacts with Databases

14. Visualization and Data Mining
  14.1 Preparing Your Data
  14.2 Viewing Graphics
  14.3 Sequence Data Visualization
  14.4 Networks and Pathway Visualization
  14.5 Working with Numerical Data
  14.6 Visualization: Summary
  14.7 Data Mining and Biological Information

Bibliography
  Unix
  SysAdmin
  Perl
  General Reference
  Bioinformatics Reference
  Molecular Biology/Biology Reference
  Protein Structure and Biophysics
  Genomics
  Biotechnology
  Databases
  Visualization
  Data Mining

Colophon

Preface

Computers and the World Wide Web are rapidly and dramatically changing the face of biological research. These days, the term "paradigm shift" is used to describe everything from new business trends to new flavors of cola, but biological science is in the midst of a paradigm shift in the classical sense. Theoretical and computational biology have existed for decades on the "fringe" of biological science. But within just a few short years, the flood of new biological data produced by genomics efforts and, by necessity, the application of computers to the analysis of this genomic data, has begun to affect every aspect of the biological sciences.
Research that used to start in the laboratory now starts at the computer, as scientists search databases for information that might suggest new hypotheses. In the last two decades, both personal computers and supercomputers have become accessible to scientists across all disciplines. Personal computers have developed from expensive novelties with little real computing power into machines that are as powerful as the supercomputers of 10 years ago. Just as they've replaced the author's typewriter and the accountant's ledger, computers have taken their place in controlling and collecting data from lab equipment.