Ontology-Based Methods for Analyzing Life Science Data

Total Page:16

File Type:pdf, Size:1020Kb

Ontology-Based Methods for Analyzing Life Science Data Habilitation a` Diriger des Recherches pr´esent´ee par Olivier Dameron Ontology-based methods for analyzing life science data Soutenue publiquement le 11 janvier 2016 devant le jury compos´ede Anita Burgun Professeur, Universit´eRen´eDescartes Paris Examinatrice Marie-Dominique Devignes Charg´eede recherches CNRS, LORIA Nancy Examinatrice Michel Dumontier Associate professor, Stanford University USA Rapporteur Christine Froidevaux Professeur, Universit´eParis Sud Rapporteure Fabien Gandon Directeur de recherches, Inria Sophia-Antipolis Rapporteur Anne Siegel Directrice de recherches CNRS, IRISA Rennes Examinatrice Alexandre Termier Professeur, Universit´ede Rennes 1 Examinateur 2 Contents 1 Introduction 9 1.1 Context ......................................... 10 1.2 Challenges . 11 1.3 Summary of the contributions . 14 1.4 Organization of the manuscript . 18 2 Reasoning based on hierarchies 21 2.1 Principle......................................... 21 2.1.1 RDF for describing data . 21 2.1.2 RDFS for describing types . 24 2.1.3 RDFS entailments . 26 2.1.4 Typical uses of RDFS entailments in life science . 26 2.1.5 Synthesis . 30 2.2 Case study: integrating diseases and pathways . 31 2.2.1 Context . 31 2.2.2 Objective . 32 2.2.3 Linking pathways and diseases using GO, KO and SNOMED-CT . 32 2.2.4 Querying associated diseases and pathways . 33 2.3 Methodology: Web services composition . 39 2.3.1 Context . 39 2.3.2 Objective . 40 2.3.3 Semantic compatibility of services parameters . 40 2.3.4 Algorithm for pairing services parameters . 40 2.4 Application: ontology-based query expansion with GO2PUB . 43 2.4.1 Context . 43 2.4.2 Objective . 44 2.4.3 Semantic expansion . 45 2.4.4 Query generation . 45 2.5 Synthesis ........................................ 47 3 Reasoning based on classification 51 3.1 Principle......................................... 51 3.1.1 OWL Classes . 52 3.1.2 Union and intersection of classes . 53 3.1.3 Disjoint classes . 53 3.1.4 Negation: complement of a class . 54 3.1.5 Existential and universal restrictions . 54 3.1.6 Cardinality restrictions . 55 3.1.7 Property chains . 55 3 3.1.8 Synthesis . 55 3.2 Methodology: Description-logics representation of anatomy . 57 3.2.1 Context . 57 3.2.2 Objective . 58 3.2.3 Converting the FMA into OWL-DL . 58 3.2.4 Addressing expressiveness and application-independence: OWL-Full . 60 3.2.5 Pattern-based generation of consistency constraints . 61 3.3 Methodology: diagnosis of heart-related injuries . 64 3.3.1 Context . 64 3.3.2 Objective . 65 3.3.3 Reasoning about coronary artery ischemia . 65 3.3.4 Reasoning about pericardial effusion . 68 3.4 Optimization: modeling strategies for estimating pacemaker alerts severity . 73 3.4.1 Context . 73 3.4.2 Objective . 74 3.4.3 CHA2DS2VASc score . 74 3.4.4 Modeling strategies . 76 3.4.5 Comparison of the strategies' performances . 77 3.5 Synthesis ........................................ 80 4 Reasoning with incomplete information 83 4.1 Principle......................................... 84 4.2 Methodology: grading tumors . 84 4.2.1 Context . 84 4.2.2 Objective . 85 4.2.3 Why the NCIT is not up to the task . 85 4.2.4 An ontology of glioblastoma based on the NCIT . 86 4.2.5 Narrowing the possible grades in case of incomplete information . 89 4.3 Methodology: clinical trials recruitment . 91 4.3.1 Context . 91 4.3.2 Objective . 92 4.3.3 The problem of missing information . 92 4.3.4 Eligibility criteria design pattern . 95 4.3.5 Reasoning . 96 4.4 Synthesis ........................................ 98 5 Reasoning with similarity and particularity 101 5.1 Principle......................................... 101 5.1.1 Comparing elements with independent annotations . 102 5.1.2 Taking the annotations underlying structure into account . 103 5.1.3 Synthesis . 108 5.2 Methodology: semantic particularity measure . 109 5.2.1 Context . 109 5.2.2 Objective . 110 5.2.3 Definition of semantic particularity . 111 5.2.4 Formal properties of semantic particularity . 111 5.2.5 Measure of semantic particularity . 112 5.2.6 Use case: Homo sapiens aquaporin-mediated transport . 113 5.3 Methodology: threshold determination for similarity and particularity . 116 5.3.1 Context . 117 4 5.3.2 Objective . 118 5.3.3 Similarity threshold . 118 5.3.4 Particularity threshold . 125 5.3.5 Evaluation of the impact of the new threshold on HolomoGene . 126 5.4 Synthesis ........................................ 128 6 Conclusion and research perspectives 129 6.1 Producing and querying linked data . 130 6.1.1 Representing our data as linked data . 131 6.1.2 Querying linked data . 134 6.2 Analyzing data . 135 6.2.1 Selecting relevant candidates when reconstructing metabolic pathways . 136 6.2.2 Analyzing TGF-β signaling pathways . 136 6.2.3 Data analysis method combining ontologies and formal concept analysis . 137 Bibliography 139 Curriculum Vitæ 163 5 Acknowledgments I and indebted to Mark Musen, Anita Burgun and Anne Siegel for allowing me to join their team, for their help and their encouragement, as well as for setting up such exciting environments. I am most grateful to Michel Dumontier, Christine Froidevaux and Fabien Gandon for kindly accepting to review this document, and also to Anita Burgun, Marie-Dominique Devignes, Anne Siegel and Alexandre Termier for accepting to be parts of the committee. I am thankful to all my colleagues for all the work we have accomplished together, as summarized by Figure 1 on the facing page. The PhD students I co-supervised have brought significant contributions to most of the works presented in this manuscript: Nicolas Lebreton, Elodie´ Roques, Charles Bettembourg, Philippe Finet, Jean Coquet and Yann Rivault. Close collaborations with colleagues from whom I learnt a lot were very stimulating: Daniel Rubin, Natasha Noy, Anita Burgun, Julie Chabalier, Andrea Splendiani, Paolo Besana, Lynda Temal, Pierre Zweigenbaum, Cyril Grouin, Annabel Bourd´e,Oussama Zekri, Pascal van Hille, Jean-Fran¸cois Ethier,´ Bernard Gibaud, R´egine Le Bouquin-Jeann`es, Anne Siegel, Jacques Nico- las, Guillaume Collet, Sylvain Prigent, Geoffroy Andrieux, Nathalie Th´eret, Fabrice Legeai, Anthony Bretaudeau, Olivier Filangi and again Charles Bettembourg. Collaborating with colleagues from INRA was a great opportunity to tackle \real problems" and gave a great perspective on what this is all about: Christian Diot, Fr´ed´eric H´erault, Denis Tagu, M´elanie Jubault, Aur´elie Evrard,´ Cyril Falentin Pierre-Yves Lebail and all the ATOL curators: J´er^omeBugeon, Alice Fatet, Isabelle Hue, Catherine Hurtaud, Matthieu Reichstadt, Marie-Christine Salaun,¨ Jean Vernet, L´eaJoret. Similarly, collaborations with colleagues from EHESP provided the \human" counterpart: Nolwenn Le Meur, Yann Rivault. More broadly, I benefited from the interactions with many other colleagues: Gwena¨elle Marquet, Fleur Mougin, Ammar Mechouche, Mika¨el Roussel, as well as the whole Symbiose group at IRISA, and particularly Pierre Peterlongo. Working on workflows with Olivier Collin, Yvan Le Bras, Alban Gaignard and Audrey Bihou´ee has been fun and I am eager to see what will come out of it. In addition to research all these years were also the occasion of great teaching-related en- counters: Christian Delamarche, Emmanuelle Becker, Emmanuel Giudice, Annabelle Monnier, Yann le Cunff, C´edric Wolf. Thank you all, I enjoyed all of it. 6 Figure 1: Graph of my co-authors. Two authors are linked if they share at least one publication. Node size is proportional to the number of articles. 7 8 Chapter 1 Introduction This document summarizes my research activities since the defense of my PhD in Decem- ber 2003. This work has been carried initially as a postdoctoral fellow at Stanford University with Mark Musen's Stanford Medical Informatics group (now BMIR1), and then as an associate professor at University of Rennes 1, first with the UPRES-EA 3888 (which became UMR 936 INSERM { University of Rennes 1 in 2009) from 2005 to 2012, and then with the Dyliss team at IRISA since 2013. First, I will present the context in which my research takes place. We will see that the traditional approaches for analyzing life science data do not scale up and cannot handle their increasing quantity, complexity and connectivity. It has become necessary to develop automatic tools not only for performing the analyses, but also for helping the experts do it. Yet, processing the raw data is so difficult to automate that these tools usually hinge on annotations and metadata as machine-processable proxies that describe the data and the relations between them. Second, I will identify the main challenges. While generating these metadata is a challenge of its own that I will not tackle here, it is only the first step. Even if metadata tend to be more compact than the original data, each piece of data is typically associated with many metadata, so the problem of data quantity remains. These metadata have to be reused and combined, even if they have been generated by different people, in different places, in different contexts, so we also have a problem of integration. Eventually, the analyses require some reasoning on these metadata. Most of these analyses were not possible before the data deluge, so we are inventing and improving them now. This also means that we have to design new reasoning methods for answering life science questions using the opportunities created by the data deluge while not drowning in it. Arguably, biology has become an information science. Third, I will summarize the contributions presented in the document. Some of the reasoning methods that we develop rely on life science background knowledge. Ontologies are the formal representations of the symbolic part of this knowledge. The Semantic Web is a more general effort that provides an unified framework of technologies and associated tools for representing, sharing, combining metadata and pairing them with ontologies.
Recommended publications
  • D3.3 Initial Technical & Logical Architecture and Future Work
    D3.3 Initial Technical & Logical Architecture and future work recommendations ECP 2008 DILI 558001 Europeana v1.0 D3.3 Initial Technical & Logical Architecture and future work recommendations Deliverable number D3.3 Dissemination level Public Delivery date 30 July 2010 Status Final Makx Dekkers, Stefan Gradmann, Jan Author(s) Molendijk eContentplus This project is funded under the eContentplus programme1, a multiannual Community programme to make digital content in Europe more accessible, usable and exploitable. 1 OJ L 79, 24.3.2005, p. 1. 1/14 D3.3 Initial Technical & Logical Architecture and future work recommendations 1. Introduction This deliverable has two tasks: To characterise the technical and logical architecture of Europeana as a system in its 1.0 state (that is to say by the time of the ‘Rhine’ release To outline the future work recommendations that can reasonably be made at that moment. This also provides a straightforward and logical structure to the document: characterisation comes first followed by the recommendations for future work. 2. Technical and Logical Architecture From a high-level architectural point of view, Europeana.eu is best characterized as a search engine and a database. It loads metadata delivered by providers and aggregators into a database, and uses that database to allow users to search for cultural heritage objects, and to find links to those objects. Various methods of searching and browsing the objects are offered, including a simple and an advanced search form, a timeline, and an openSearch API. It is also important to describe what Europeana.eu does not do, even though people sometimes expect it to.
    [Show full text]
  • A Knowledge Discovery Approach
    Semantic XML Tagging of Domain-Specific Text Archives: A Knowledge Discovery Approach Dissertation zur Erlangung des akademisches Grades Doktoringenieur (Dr.-Ing.) angenommen durch die Fakult¨at fur¨ Informatik der Otto-von-Guericke-Universit¨at Magdeburg von Diplom-Kaufmann Peter Karsten Winkler, geboren am 1. Oktober 1971 in Berlin Gutachterinnen und Gutachter: Prof. Dr. Myra Spiliopoulou Prof. Dr. Gunter Saake Prof. Dr. Stefan Conrad Ort und Datum des Promotionskolloquiums: Magdeburg, 22. Januar 2009 Karsten Winkler. Semantic XML Tagging of Domain-Specific Text Archives: A Knowl- edge Discovery Approach. Dissertation, Faculty of Computer Science, Otto von Guericke University Magdeburg, Magdeburg, Germany, January 2009. Contents List of Figures v List of Tables vii List of Algorithms xi Abstract xiii Zusammenfassung xv Acknowledgments xvii 1 Introduction 1 1.1 TheAbundanceofText ............................ 1 1.2 Defining Semantic XML Markup . 3 1.3 BenefitsofSemanticXMLMarkup . 9 1.4 ResearchQuestions ............................... 12 1.5 ResearchMethodology ............................. 14 1.6 Outline...................................... 16 2 Literature Review 19 2.1 Storage, Retrieval, and Analysis of Textual Data . ....... 19 2.1.1 Knowledge Discovery in Textual Databases . 19 2.1.2 Information Storage and Retrieval . 23 2.1.3 InformationExtraction. 25 2.2 Discovering Concepts in Textual Data . 26 2.2.1 Topic Discovery in Text Documents . 27 2.2.2 Extracting Relational Tuples from Text . 31 2.2.3 Learning Taxonomies, Thesauri, and Ontologies . 35 2.3 Semantic Annotation of Text Documents . 39 2.3.1 Manual Semantic Text Annotation . 40 2.3.2 Semi-Automated Semantic Text Annotation . 43 2.3.3 Automated Semantic Text Annotation . 47 2.4 Schema Discovery in Marked-Up Text Documents .
    [Show full text]
  • A Logical Framework for Modularity of Ontologies∗
    A Logical Framework for Modularity of Ontologies∗ Bernardo Cuenca Grau, Ian Horrocks, Yevgeny Kazakov and Ulrike Sattler The University of Manchester School of Computer Science Manchester, M13 9PL, UK {bcg, horrocks, ykazakov, sattler }@cs.man.ac.uk Abstract ontology itself by possibly reusing the meaning of external symbols. Hence by merging an ontology with external on- Modularity is a key requirement for collaborative tologies we import the meaning of its external symbols to de- ontology engineering and for distributed ontology fine the meaning of its local symbols. reuse on the Web. Modern ontology languages, To make this idea work, we need to impose certain con- such as OWL, are logic-based, and thus a useful straints on the usage of the external signature: in particular, notion of modularity needs to take the semantics of merging ontologies should be “safe” in the sense that they ontologies and their implications into account. We do not produce unexpected results such as new inconsisten- propose a logic-based notion of modularity that al- cies or subsumptions between imported symbols. To achieve lows the modeler to specify the external signature this kind of safety, we use the notion of conservative exten- of their ontology, whose symbols are assumed to sions to define modularity of ontologies, and then prove that be defined in some other ontology. We define two a property of some ontologies, called locality, can be used restrictions on the usage of the external signature, to achieve modularity. More precisely, we define two no- a syntactic and a slightly less restrictive, seman- tions of locality for SHIQ TBoxes: (i) a tractable syntac- tic one, each of which is decidable and guarantees tic one which can be used to provide guidance in ontology a certain kind of “black-box” behavior, which en- editing tools, and (ii) a more general semantic one which can ables the controlled merging of ontologies.
    [Show full text]
  • Proquest Dissertations
    Automated learning of protein involvement in pathogenesis using integrated queries Eithon Cadag A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy University of Washington 2009 Program Authorized to Offer Degree: Department of Medical Education and Biomedical Informatics UMI Number: 3394276 All rights reserved INFORMATION TO ALL USERS The quality of this reproduction is dependent upon the quality of the copy submitted. In the unlikely event that the author did not send a complete manuscript and there are missing pages, these will be noted. Also, if material had to be removed, a note will indicate the deletion. UMI Dissertation Publishing UMI 3394276 Copyright 2010 by ProQuest LLC. All rights reserved. This edition of the work is protected against unauthorized copying under Title 17, United States Code. uest ProQuest LLC 789 East Eisenhower Parkway P.O. Box 1346 Ann Arbor, Ml 48106-1346 University of Washington Graduate School This is to certify that I have examined this copy of a doctoral dissertation by Eithon Cadag and have found that it is complete and satisfactory in all respects, and that any and all revisions required by the final examining committee have been made. Chair of the Supervisory Committee: Reading Committee: (SjLt KJ. £U*t~ Peter Tgffczy-Hornoch In presenting this dissertation in partial fulfillment of the requirements for the doctoral degree at the University of Washington, I agree that the Library shall make its copies freely available for inspection. I further agree that extensive copying of this dissertation is allowable only for scholarly purposes, consistent with "fair use" as prescribed in the U.S.
    [Show full text]
  • Molecular Biology for Computer Scientists
    CHAPTER 1 Molecular Biology for Computer Scientists Lawrence Hunter “Computers are to biology what mathematics is to physics.” — Harold Morowitz One of the major challenges for computer scientists who wish to work in the domain of molecular biology is becoming conversant with the daunting intri- cacies of existing biological knowledge and its extensive technical vocabu- lary. Questions about the origin, function, and structure of living systems have been pursued by nearly all cultures throughout history, and the work of the last two generations has been particularly fruitful. The knowledge of liv- ing systems resulting from this research is far too detailed and complex for any one human to comprehend. An entire scientific career can be based in the study of a single biomolecule. Nevertheless, in the following pages, I attempt to provide enough background for a computer scientist to understand much of the biology discussed in this book. This chapter provides the briefest of overviews; I can only begin to convey the depth, variety, complexity and stunning beauty of the universe of living things. Much of what follows is not about molecular biology per se. In order to 2ARTIFICIAL INTELLIGENCE & MOLECULAR BIOLOGY explain what the molecules are doing, it is often necessary to use concepts involving, for example, cells, embryological development, or evolution. Bi- ology is frustratingly holistic. Events at one level can effect and be affected by events at very different levels of scale or time. Digesting a survey of the basic background material is a prerequisite for understanding the significance of the molecular biology that is described elsewhere in the book.
    [Show full text]
  • Deciding Semantic Matching of Stateless Services∗
    Deciding Semantic Matching of Stateless Services∗ Duncan Hull†, Evgeny Zolin†, Andrey Bovykin‡, Ian Horrocks†, Ulrike Sattler†, and Robert Stevens† School of Computer Science, Department of Computer Science, † ‡ University of Manchester, UK University of Liverpool, UK Abstract specifying automated reasoning algorithms for such stateful service descriptions is basically impossible in the presence We present a novel approach to describe and reason about stateless information processing services. It can of any expressive ontology (Baader et al. 2005). Stateless- be seen as an extension of standard descriptions which ness implies that we do not need to formulate pre- and post- makes explicit the relationship between inputs and out- conditions since our services do not change the world. puts and takes into account OWL ontologies to fix the The question we are interested in here is how to help the meaning of the terms used in a service description. This biologist to find a service he or she is looking for, i.e., a allows us to define a notion of matching between ser- service that works with inputs and outputs the biologist can vices which yields high precision and recall for service provide/accept, and that provides the required functionality. location. We explain why matching is decidable, and The growing number of publicly available biomedical web provide biomedical example services to illustrate the utility of our approach. services, 3000 as of February 2006, required better match- ing techniques to locate services. Thus, we are concerned with the question of how to describe a service request Q Introduction and service advertisements Si such that the notion of a ser- Understanding the data generated from genome sequenc- vice S matching the request Q can be defined in a “useful” ing projects like the Human Genome Project is recognised way.
    [Show full text]
  • Bibliography [Abiteboul and Kanellakis, 1989] Serge Abiteboul and Paris Kanellakis
    507 Bibliography [Abiteboul and Kanellakis, 1989] Serge Abiteboul and Paris Kanellakis. Object identity as a query language primitive. In Proc. of the ACM SIGMOD Int. Conf. on Management of Data, pages 159–173, 1989. [Abiteboul et al., 1995] Serge Abiteboul, Richard Hull, and Victor Vianu. Foundations of Databases. Addison Wesley Publ. Co., Reading, Massachussetts, 1995. [Abiteboul et al., 1997] Serge Abiteboul, Dallan Quass, Jason McHugh, Jennifer Widom, and Janet L. Wiener. The Lorel query language for semistructured data. Int. J. on Digital Libraries, 1(1):68–88, 1997. [Abiteboul et al., 2000] Serge Abiteboul, Peter Buneman, and Dan Suciu. Data on the Web: from Relations to Semistructured Data and XML. Morgan Kaufmann, Los Altos, 2000. [Abiteboul, 1997] Serge Abiteboul. Querying semi-structured data. In Proc. of the 6th Int. Conf. on Database Theory (ICDT’97), pages 1–18, 1997. [Abrahams et al., 1996] Merryll K. Abrahams, Deborah L. McGuinness, Rich Thomason, Lori Alperin Resnick, Peter F. Patel-Schneider, Violetta Cavalli-Sforza, and Cristina Conati. NeoClassic tutorial: Version 1.0. Technical report, Artificial Intelligence Prin- ciples Research Department, AT&T Labs Research and University of Pittsburgh, 1996. Available as http://www.bell-labs.com/project/classic/papers/NeoTut/NeoTut. html. [Abrett and Burstein, 1987] Glen Abrett and Mark H. Burstein. The KREME knowledge editing environment. Int. J. of Man-Machine Studies, 27(2):103–126, 1987. [Abrial, 1974] J. R. Abrial. Data semantics. In J. W. Klimbie and K. L. Koffeman, editors, Data Base Management, pages 1–59. North-Holland Publ. Co., Amsterdam, 1974. [Achilles et al., 1991] E. Achilles, B.
    [Show full text]
  • Ploidetect Enables Pan-Cancer Analysis of the Causes and Impacts of Chromosomal Instability
    bioRxiv preprint doi: https://doi.org/10.1101/2021.08.06.455329; this version posted August 8, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. Ploidetect enables pan-cancer analysis of the causes and impacts of chromosomal instability Luka Culibrk1,2, Jasleen K. Grewal1,2, Erin D. Pleasance1, Laura Williamson1, Karen Mungall1, Janessa Laskin3, Marco A. Marra1,4, and Steven J.M. Jones1,4, 1Canada’s Michael Smith Genome Sciences Center at BC Cancer, Vancouver, British Columbia, Canada 2Bioinformatics training program, University of British Columbia, Vancouver, British Columbia, Canada 3Department of Medical Oncology, BC Cancer, Vancouver, British Columbia, Canada 4Department of Medical Genetics, Faculty of Medicine, Vancouver, British Columbia, Canada Cancers routinely exhibit chromosomal instability, resulting in tumors mutate, these variants are considerably more difficult the accumulation of changes in the abundance of genomic ma- to detect accurately compared to other types of mutations terial, known as copy number variants (CNVs). Unfortunately, and consequently they may represent an under-explored the detection of these variants in cancer genomes is difficult. We facet of tumor biology. 20 developed Ploidetect, a software package that effectively iden- While small mutations can be determined through base tifies CNVs within whole-genome sequenced tumors. Ploidetect changes embedded within aligned sequence reads, CNVs was more sensitive to CNVs in cancer related genes within ad- are variations in DNA quantity and are typically determined vanced, pre-treated metastatic cancers than other tools, while also segmenting the most contiguously.
    [Show full text]
  • Description Logics
    Description Logics Franz Baader1, Ian Horrocks2, and Ulrike Sattler2 1 Institut f¨urTheoretische Informatik, TU Dresden, Germany [email protected] 2 Department of Computer Science, University of Manchester, UK {horrocks,sattler}@cs.man.ac.uk Summary. In this chapter, we explain what description logics are and why they make good ontology languages. In particular, we introduce the description logic SHIQ, which has formed the basis of several well-known ontology languages, in- cluding OWL. We argue that, without the last decade of basic research in description logics, this family of knowledge representation languages could not have played such an important rˆolein this context. Description logic reasoning can be used both during the design phase, in order to improve the quality of ontologies, and in the deployment phase, in order to exploit the rich structure of ontologies and ontology based information. We discuss the extensions to SHIQ that are required for languages such as OWL and, finally, we sketch how novel reasoning services can support building DL knowledge bases. 1 Introduction The aim of this section is to give a brief introduction to description logics, and to argue why they are well-suited as ontology languages. In the remainder of the chapter we will put some flesh on this skeleton by providing more technical details with respect to the theory of description logics, and their relationship to state of the art ontology languages. More detail on these and other matters related to description logics can be found in [6]. Ontologies There have been many attempts to define what constitutes an ontology, per- haps the best known (at least amongst computer scientists) being due to Gruber: “an ontology is an explicit specification of a conceptualisation” [47].3 In this context, a conceptualisation means an abstract model of some aspect of the world, taking the form of a definition of the properties of important 3 This was later elaborated to “a formal specification of a shared conceptualisation” [21].
    [Show full text]
  • BIOINFORMATICS APPLICATIONS NOTE Pages 380-381
    Vol. 14 no. 4 1998 BIOINFORMATICS APPLICATIONS NOTE Pages 380-381 MView: a web-compatible database search or multiple alignment viewer NigelP. Brown, ChristopheLeroy and Chris Sander European Bioinformatics Institute (EMBLĆEBI), Wellcome Genome Campus, CambridgeCB10 1SD, UK Received on December 10, 1997; revised and accepted on January 15, 1998 Abstract may be hyperlinked to the SRS system (Etzold et al., 1996), a text field, a field of scoring information from searches, and Summary: MView is a tool for converting the results of a a field reporting the per cent identity of each sequence with sequence database search into the form of a coloured multiple respect to a preferred sequence in the alignment, usually the alignment of hits stacked against the query. Alternatively, an query in the case of a search. existing multiple alignment can be processed. In either case, Multiple alignments require minimal parsing and are the output is simply HTML, so the result is platform independent and does not require a separate application or subjected only to formatting stages. Search hits are first applet to be loaded. stacked against the ungapped query sequence and require Availability: Free from http://www.sander.ebi.ac.uk/mview/ special processing. Ungapped search (e.g. BLAST) hit subject to copyright restrictions. fragments are assembled into a single string by overlaying Contact: [email protected] them preferentially by score onto a template string, while gapped search (e.g. FASTA) hits have columns corresponding Often when running FASTA (Pearson, 1990) or BLAST to query gaps excised. Consequently, the stacked alignment is (Altschul et al., 1990), it is desired to visualize the database a patchwork of reconstituted sequences that nevertheless is hits stacked against the query sequence.
    [Show full text]
  • PREDICTD: Parallel Epigenomics Data Imputation with Cloud-Based Tensor Decomposition
    bioRxiv preprint doi: https://doi.org/10.1101/123927; this version posted April 4, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license. PREDICTD: PaRallel Epigenomics Data Imputation with Cloud-based Tensor Decomposition Timothy J. Durham Maxwell W. Libbrecht Department of Genome Sciences Department of Genome Sciences University of Washington University of Washington J. Jeffry Howbert Jeff Bilmes Department of Genome Sciences Department of Electrical Engineering University of Washington University of Washington William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University of Washington April 4, 2017 Abstract The Encyclopedia of DNA Elements (ENCODE) and the Roadmap Epigenomics Project have produced thousands of data sets mapping the epigenome in hundreds of cell types. How- ever, the number of cell types remains too great to comprehensively map given current time and financial constraints. We present a method, PaRallel Epigenomics Data Imputation with Cloud-based Tensor Decomposition (PREDICTD), to address this issue by computationally im- puting missing experiments in collections of epigenomics experiments. PREDICTD leverages an intuitive and natural model called \tensor decomposition" to impute many experiments si- multaneously. Compared with the current state-of-the-art method, ChromImpute, PREDICTD produces lower overall mean squared error, and combining methods yields further improvement. We show that PREDICTD data can be used to investigate enhancer biology at non-coding human accelerated regions. PREDICTD provides reference imputed data sets and open-source software for investigating new cell types, and demonstrates the utility of tensor decomposition and cloud computing, two technologies increasingly applicable in bioinformatics.
    [Show full text]
  • Are You an Invited Speaker? a Bibliometric Analysis of Elite Groups for Scholarly Events in Bioinformatics
    Are You an Invited Speaker? A Bibliometric Analysis of Elite Groups for Scholarly Events in Bioinformatics Senator Jeong, Sungin Lee, and Hong-Gee Kim Biomedical Knowledge Engineering Laboratory, Seoul National University, 28–22 YeonGeon Dong, Jongno Gu, Seoul 110–749, Korea. E-mail: {senator, sunginlee, hgkim}@snu.ac.kr Participating in scholarly events (e.g., conferences, work- evaluation, but it would be hard to claim that they have pro- shops, etc.) as an elite-group member such as an orga- vided comprehensive lists of evaluation measurements. This nizing committee chair or member, program committee article aims not to provide such lists but to add to the current chair or member, session chair, invited speaker, or award winner is beneficial to a researcher’s career develop- practices an alternative metric that complements existing per- ment.The objective of this study is to investigate whether formance measures to give a more comprehensive picture of elite-group membership for scholarly events is represen- scholars’ performance. tative of scholars’ prominence, and which elite group is By one definition (Jeong, 2008), a scholarly event is the most prestigious. We collected data about 15 global “a sequentially and spatially organized collection of schol- (excluding regional) bioinformatics scholarly events held in 2007. We sampled (via stratified random sampling) ars’ interactions with the intention of delivering and shar- participants from elite groups in each event. Then, bib- ing knowledge, exchanging research ideas, and performing liometric indicators (total citations and h index) of seven related activities.” As such, scholarly events are communica- elite groups and a non-elite group, consisting of authors tion channels from which our new evaluation tool can draw who submitted at least one paper to an event but were its supporting evidence.
    [Show full text]