GAME-CHANGERS Some Papers Have a Profound and Obvious Influence on Future Research and Industry Applications

INNOVATION | NATURE INDEX 2017

Some papers have a profound and obvious influence on future research and industry applications. Patents citing these life science papers indicate their bearing on developments which have widespread health implications.

Photo: David Lipman, who co-authored the BLAST paper, which has been cited in at least 4,900 inventions according to patent documents in the Lens database. Credit: Bill Reitzel.

Lists of the most highly cited academic articles garner considerable attention from the research community. But articles that are highly cited in patents don't receive the same attention. This is surprising given the demand from governments that scientists demonstrate the societal or economic value of their research. Citations of articles in patents are a general indicator of the dynamic between science and technology, and can imply that a piece of research has influenced an invention (see Patently clear).

Here, the index profiles three life science articles that have been highly cited in patents. Each article has had profound impact on industry and, eventually, consumers. These papers were selected from the Lens platform, based on articles cited in patents. Each paper had been cited in more than 1,000 patent families by 2016. Patent families represent a single invention. Inventors often file patents in multiple countries, which is why the number of citing patents is larger than the number of patent families.

THE GOOGLE OF GENOMES

Basic Local Alignment Search Tool, published in the Journal of Molecular Biology in 1990.
› CITED IN 4,900 PATENT FAMILIES

Inferring the function of a protein 40 years ago required finding a related protein with a known function. To determine the similarity between two proteins meant comparing their amino acid sequences using a time-consuming algorithm.

In 1983, biologist David Lipman and colleague W. John Wilbur reported a faster method to identify the similarity between two unrelated sections of DNA or protein. A year later, a global team of scientists used the technique to show that amino acid sequences from a human growth factor closely resembled sequences from a cancer gene in a chicken virus. The paper marked significant progress in the basic understanding of cancer development and revealed the value of computational tools for making biological discoveries.

Lipman, then based at the National Institute of Arthritis, Diabetes, and Digestive and Kidney Diseases, says this unexpected discovery prompted him to find how to detect more distant relationships between proteins. In 1990, Lipman and colleagues Stephen Altschul and Warren Gish, along with collaborators Webb Miller at Pennsylvania State University and Eugene Myers at the University of Arizona, published details of a more advanced algorithm (1). The Basic Local Alignment Search Tool (BLAST) used pattern recognition software to perform faster sequence comparisons. It could calculate the statistical level of similarity between two sequences, says Lipman, who recently left the NCBI after 28 years as its director.

Within 10 years automated DNA sequencing machines had also gained a foothold and some 50,000 nucleotide sequences from plants and animals were stored in the NCBI-owned genetic sequence database known as GenBank. BLAST enabled researchers to search for, or compare, DNA sequences. Just five years after its release, BLAST was handling about 200,000 queries a week. These comparisons could yield several types of clues: what organism the sequence probably came from, its evolutionary origin, and potential function.

BLAST has since evolved into a family of free web-based bioinformatics search tools that are still widely used. Since 1990, the BLAST paper has been cited by at least 4,900 new inventions, according to patent documents in the Lens database. Chemical giant DuPont and several of its subsidiaries own the most patents that cite the paper.

The NCBI team have created a new programme within the BLAST toolkit that they hope to publish by September that will assist with finding small genetic variations in bacteria, which may be linked to traits such as antibiotic resistance. ■

By Branwen Morgan

1. Altschul et al. Journal of Molecular Biology 215, 403-410 (1990)

ANTIBODIES AS THERAPY

Replacing the complementarity-determining regions in a human antibody with those from a mouse, published in Nature in 1986.
› CITED IN 2,089 PATENT FAMILIES

In 1986, Greg Winter and colleagues at the UK's Medical Research Council (MRC) described in Nature a method for swapping pieces of a mouse antibody with those from a human to create a chimeric antibody. This was the second essential step in the development of antibody-based therapies for human disease, which represented more than 40% of total sales of biopharmaceutical products in 2016.

The first step occurred a decade earlier when Nobel prize winning researchers Georges Köhler and César Milstein developed mouse antibodies that recognize a single foreign molecule. While such monoclonal antibodies had a wide range of applications in medical research and diagnostics, their use in medicine was limited. Mouse antibodies are different from human ones, even when they target and bind to the same part of a protein. "Antibodies that are generated in another species cause side-effects in humans," says vaccinologist Ursula Wiedermann, of the Medical University of Vienna. Indeed, the first US Food and Drug Administration (FDA) approved therapeutic antibody, CD3, is no longer used for this reason.

Winter's method for creating chimeric antibodies meant they were less likely to be recognised as a foreign protein and destroyed by the human immune system. But it was more than ten years after the Nature paper that chimeric therapeutic antibodies were approved for treating conditions such as rheumatoid arthritis and non-Hodgkin lymphoma.

... officer of Australian biotech company Imugene, says much of the current interest in monoclonal antibodies comes from the rapid development of cancer immunotherapies. "We are in the middle of an exciting era," she says. "A lot has changed in the last 15 years." Major ...

... in cancer research, he faced a dilemma: "The question was, what's the right experiment to do to really highlight its potential?" If Golub's group, working out of MIT's Whitehead Institute, chose a problem that was too easy to demonstrate, no-one in the research community would care. If they made the problem too hard, they might diminish their chances of success.

The team, which included geneticist and mathematician Eric Lander, chose to look at how using the microarrays to profile gene expression could help classify cancers. It was among the earliest attempts to move the identification of specific cancers beyond symptoms, responsiveness to treatment, and appearance.

To achieve this, the researchers selected two cancers, acute myeloid leukaemia and acute lymphoblastic leukaemia, for which there were existing diagnostic tests. If gene expression profiling enabled them to distinguish between the two malignancies, they would be able to show the accuracy of the new method, says Golub, now director of the MIT/Harvard Broad Institute.

Golub and his colleagues were successful, and could distinguish between the cancers on the basis of the expression of just 50 genes. Nearly 20 years later, the experiment they settled on has become one of the most cited life sciences papers in global patents in the Lens database. Patents range from diagnostic technology for cancer, to skin substitutes for industry use, anti-inflammatory treatments, and algorithms for genomic data analysis.

The study provided a framework for the use of the new DNA microarray technology to classify other diseases besides cancers, includ-

PATENTLY CLEAR
How patents and their citations work

• Patent filings signal the holder's intent to commercialize an invention, or to stop others from commercializing such a product.
• Patents are only valid in the jurisdiction in which they're sought and granted. There are no international patents.
• Patents are expensive to prepare, file, and prosecute (advance to granting), and require regular payments to maintain their validity during their typical 20-year lifetime. How much an applicant is willing to pay indicates how much they think the invention is worth.
• Patent applications are published 18 months after submission with all supporting files. They are typically open access and not copyrighted.
• Front pages of most patents include bibliographic information, including citations to earlier patents or scientific documents.
• Citations to science literature provide evidence that the invention is in some way related to (or initiated or stimulated by) research activities. However, cited papers are rarely the key source of the idea that led to the invention.
• Applicants or patent attorneys and examiners can add citations to patent documents, although studies have found most non-patent citations (NPCs) were supplied by the applicant/inventor.
• There is no standard method for references in patents between jurisdictions, which may affect qualitative analysis of citations.

Source: Richard and Osmat Jefferson; Robert Tijssen; Michael Meyer

PAPERS IN PATENTS

[Figure: number of citing patents per year, 1988 to 2016, with one line each for Lipman-citing, Winter-citing and Golub-citing patents.]

Lines represent the number of patent documents that cited one of three life science papers by 2016. The number of patent families refers to the number of new inventions.

• The 1990 paper by Lipman and colleagues has been cited in 6,807 patent documents to 2016, totalling 4,900 new inventions.
• The 1986 paper by Winter and colleagues has been cited in 3,182 patent documents, totalling 2,089 new inventions.
• The 1999 paper by Golub has been cited in 2,031 patent documents, totalling 1,278 new inventions.

Figures represent applications and granted patents from 95 jurisdictions. Source: Lens.org
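
The patent-document and patent-family counts quoted for the three papers can be compared directly. Because one invention is often filed in several countries, each family tends to contain more than one citing document. The short Python sketch below works out that ratio from the article's own figures. The dictionary labels and the function name `documents_per_family` are illustrative choices, not anything taken from Lens.org.

```python
# Counts quoted in the article (Lens data to 2016).
# Dictionary keys and the function name are illustrative assumptions.
papers = {
    "Lipman and colleagues (1990)": {"documents": 6807, "families": 4900},
    "Winter and colleagues (1986)": {"documents": 3182, "families": 2089},
    "Golub (1999)": {"documents": 2031, "families": 1278},
}


def documents_per_family(documents: int, families: int) -> float:
    """Average number of citing patent documents per patent family (invention)."""
    return documents / families


if __name__ == "__main__":
    for name, counts in papers.items():
        ratio = documents_per_family(counts["documents"], counts["families"])
        print(f"{name}: {ratio:.2f} citing documents per family")
```

On these figures each family corresponds to roughly 1.4 to 1.6 citing documents, consistent with inventors filing the same invention in more than one jurisdiction.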