dieGOS - GO & Species Enrichment Analysis

by Diego Moncayo M.

Supervisors: Dr Mikael Bodén, Dr Minh Duc Cao
Department of Electrical and Computer Engineering, University of Queensland
Submitted for the degree of Master of Computer Science
November 2013

32/1 Mitre St.
St. Lucia QLD 4072
Tel. (04)3102 6180

November 18, 2013

The Dean
School of Engineering
University of Queensland
St Lucia, Q 4072

Dear Professor Paul Strooper,

In accordance with the requirements of the degree of Master of Computer Science in the School of Information Technology and Electrical Engineering, I present the following thesis entitled "dieGOS - GO & Species Enrichment Analysis". This work was performed under the supervision of Dr Mikael Bodén and Dr Minh Duc Cao.

I declare that the work submitted in this thesis is my own, except as acknowledged in the text and footnotes, and has not been previously submitted for a degree at the University of Queensland or any other institution.

Yours sincerely,
Diego Moncayo M.

To my beloved family, especially to my parents.

Acknowledgments

I wish to express my sincere gratitude to the main supervisor of this thesis, Dr Mikael Bodén, for giving me the opportunity to do this project and for his valuable guidance, advice, and help. I would also like to thank him for his support and assistance during the recovery following my accident.

I would also like to thank Dr Minh Duc Cao for his recommendations and comments during his supervision of this thesis work, and all of the Bioinformatics Research Group, especially Dr Yosephine Gumulya and PhD student Julian Zaugg, for their suggestions about the tool.

Abstract

This thesis addresses the lack of tools for determining the statistical significance of a user-specified sequence motif in relation to the protein sequences of a species and to the annotations associated with those sequences. The tool, called dieGOS, is published as a web service on the Internet. It enables the user to input different types of sequence motif, to select the species of interest, to choose the preferred statistical significance test, and to obtain tabular and visual results.

The web service helps researchers investigate the significance of a particular motif across a group of species and among the Gene Ontology terms associated with those species. Obtaining these results usually requires developing or applying several separate tools to large files. dieGOS integrates all of this functionality in one simple view, while its advanced input and output visual components keep the options powerful and the user interaction simple.

Because the user experience depends on the algorithms that process the data, these have been optimized for effectiveness and efficiency so that results are returned in an acceptable time frame. The results obtained by applying the tool to data from a real study differed slightly from those of the study. However, they are consistent with the numbers involved in the statistical tests.

Contents

Acknowledgments
Abstract
List of Figures
List of Tables

1 Introduction
  1.1 Challenges
2 Background
  2.1 Sequence Motif
    2.1.1 PROSITE Pattern notation
    2.1.2 Matrices
  2.2 Protein Localisation Signal
  2.3 UniProtKB
  2.4 Taxonomy
    2.4.1 NCBI Taxonomy database
  2.5 Gene Ontology
    2.5.1 Ontology
    2.5.2 Gene Ontology Project
    2.5.3 Structure of GO
  2.6 Statistical Test
    2.6.1 Fisher's exact test
    2.6.2 Mann-Whitney-Wilcoxon test
3 Previous Work
  3.1 Repository
  3.2 Gene Ontology enrichment tools
4 Materials
  4.1 Data
    4.1.1 Protein Sequences
    4.1.2 Gene Ontology
    4.1.3 Taxonomy
    4.1.4 Statistics of Database Files
  4.2 Technology
    4.2.1 Computer
    4.2.2 Programming Language
    4.2.3 Integrated Development Environment
    4.2.4 Libraries
    4.2.5 Server
5 Methods
  5.1 Motif to Protein algorithm
    5.1.1 Loading and accessing protein sequence data
    5.1.2 Sequence Matching and Scoring
  5.2 GO Associations
    5.2.1 Mapping sequences and annotations
    5.2.2 Extracting GO terms from DAG
  5.3 Taxonomy Tree
    5.3.1 Generating Taxonomy
  5.4 Statistical Test
    5.4.1 Species Enrichment
    5.4.2 Gene Ontology Enrichment
  5.5 Web Service
    5.5.1 dieGOS
    5.5.2 Input
    5.5.3 Species Tree
    5.5.4 Statistical Test
    5.5.5 Results
    5.5.6 Visualization
6 Results and Discussion
  6.1 Results Comparison
  6.2 Similar tools Comparison
7 Conclusions
  7.1 Possible future work
  7.2 Comments

Appendices

A Similar Tools
  A.1 AmiGO
  A.2 BiNGO
  A.3 Bioconductor
  A.4 DAVID
  A.5 ermineJ
  A.6 FuSSiMeG
  A.7 g:Profiler
  A.8 GOrilla
  A.9 GREAT
  A.10 GOBU
  A.11 GOFFA
  A.12 GeneMANIA
  A.13 GeneMerge
  A.14 GeneTools
  A.15 TermFinder
  A.16 GOTermMapper
  A.17 GoBean
  A.18 GraphWeb
  A.19 GoMiner
  A.20 NOA
  A.21 Onto-Express
  A.22 OE2GO
  A.23 PiNGO
  A.24 ProteInOn
  A.25 STEM
  A.26 StRAnGER
  A.27 ToppGene
  A.28 agriGO
B Package Documentation
  B.1 Sequence
    B.1.1 Sequence.java
    B.1.2 PWMScore.java
    B.1.3 RegExp.java
  B.2 stats
  B.3 go
  B.4 net
  B.5 taxonomy
  B.6 data
  B.7 bean
  B.8 controller

List of Figures

2.1 Flow diagram of the UniProtKB annotation process.
2.2 Screenshot from the software OBO-Edit, showing a small set of terms from the ontology.
5.1 Timeline for each stage.
5.2 10 runs readFastaFile.
5.3 Used heap after read fasta file.
5.4 10 runs RegExp.search.
5.5 dieGOS Logo