Introduction to Genome Informatics

Genome Informatics
Systems Biology and the Omics Cascade (Course 2143)
Day 3, June 11th, 2008
Kiyoko F. Aoki-Kinoshita

Introduction
Genome informatics covers the computer-based modeling and data processing of genome-based data.
This includes databases and resources for genomic analysis.
You were introduced to KEGG on Day 2.
Some other useful databases and resources will be covered today.

But first! Data formats
It is usually not enough to simply look at the data provided by databases.
To actually use the data for analysis, one often needs to save the retrieved data.
This requires knowledge about the data formats used by each database.
So we will cover the major data formats used in bioinformatics.

Data Formats
Major data formats:
– GenBank
– EMBL/UniProt
– FASTA
– PDB
Formats suited for programming:
– ASN.1 (Abstract Syntax Notation One)
– XML (eXtensible Markup Language)

GenBank format
Each line starts with a keyword in capital letters.
Each keyword is followed by whitespace (spaces that pad it to a fixed column) and the information corresponding to it.
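The keyword layout just described can be illustrated with a minimal Python sketch. It assumes only that a keyword starts at the beginning of a line, is separated from its value by whitespace (a tab or spaces), and that lines beginning with whitespace continue the previous value. The function name `parse_keyword_lines` and the sample record are hypothetical and serve only as an illustration; this is not a complete GenBank parser.

```python
from typing import List, Tuple


def parse_keyword_lines(text: str) -> List[Tuple[str, str]]:
    """Split GenBank-style lines into (keyword, value) pairs.

    Assumption for this sketch: a line that begins with whitespace
    continues the value of the previous keyword.
    """
    entries: List[Tuple[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if line[0].isspace():
            # Continuation line: append to the most recent value.
            if entries:
                key, value = entries[-1]
                entries[-1] = (key, value + " " + line.strip())
            continue
        # Keyword line: the first whitespace run separates keyword and value.
        parts = line.split(None, 1)
        keyword = parts[0]
        value = " ".join(parts[1].split()) if len(parts) > 1 else ""
        entries.append((keyword, value))
    return entries


# Hypothetical record, made up purely to show the layout.
record = """\
LOCUS       EXAMPLE01    60 bp    DNA
DEFINITION  Hypothetical record used only to illustrate
            the keyword layout.
ACCESSION   EXAMPLE01
"""

entries = parse_keyword_lines(record)
for keyword, value in entries:
    print(keyword, "=>", value)
```

Splitting on any whitespace run, rather than on one specific separator, keeps the sketch usable whether a file pads keywords with spaces or with a tab.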
Some keywords are hierarchical, with sub-keywords indented beneath a main keyword.

EMBL format
Similar to GenBank, except that keywords are two-letter IDs.
UniProt's format is similar to this format.

FASTA sequence format
>Randseq1 first randomly generated seq
GGTGGTTACTAACCGTAAGAGATGATGTCGCCGTGGTCGCGTGGC
GCCGCGGACCCAGATTGTACTTCTCTGAGTCGTTCTAGATCGACC
AGTCTTCTAGCTTGCCCGTGAGGTATGGGG
AGCCGCATATTGCCCACAAT
>Randseq2 second randomly generated seq
GCGACGCGTCTCTACACCAGACGCTTCTGTTGAGGAAGAGTGCCT
GAGTGCAGGTCCTCGAGAACCCACTGGAACTTGAAGGGCGCGTCT
CACTGGTCGTGAGAAGGCTCCGTCGATACG
AAAGTCCATGCCAAGGACAT
>Randseq3 third randomly generated seq
GGCGAGTCTGAACTCACAAATATTGCACGAGAGTTTAGTGTATGT
TCCTCTTAGGCTGATAACAATAGTTTAGTGAGCGGAAATGCAACC
GCGAGGCGGTCCCCTGCGCTTGTAATGGCC
ACCTGTTGCCCGTCGGATAT

Nucleic acid code for FASTA
A → adenosine            M → A C (amino)
C → cytidine             S → G C (strong)
G → guanine              W → A T (weak)
T → thymidine            B → G T C
U → uridine              D → G A T
R → G A (purine)         H → A C T
Y → T C (pyrimidine)     V → G C A
K → G T (keto)           N → A G C T (any)
- → gap of indeterminate length

Amino acid code for FASTA
A  alanine                   P  proline
B  aspartate or asparagine   Q  glutamine
C  cysteine                  R  arginine
D  aspartate                 S  serine
E  glutamate                 T  threonine
F  phenylalanine             U  selenocysteine
G  glycine                   V  valine
H  histidine                 W  tryptophan
I  isoleucine                Y  tyrosine
K  lysine                    Z  glutamate or glutamine
L  leucine                   X  any
M  methionine                *  translation stop
N  asparagine                -  gap of indeterminate length

PDB format
Similar to GenBank, using different keywords.
Includes 3-dimensional coordinates of amino acids.

ASN.1 format
Hierarchical data format.
Groups are delineated by curly brackets.
Data type names precede the brackets.
Data within a group are separated by commas.

XML format
Hierarchical data format.
Tags define the type of data:
– Opening tag: <name>
– Closing tag: </name>
Data are delineated by opening and closing tags.

Data format converter
READSEQ
– URL: http://thr.cit.nih.gov/molbio/readseq/

Types of databases
Data resources of multiple types of data
– EBI (European Bioinformatics Institute)
– NCBI (National Center for Biotechnology Information)
– KEGG (Kyoto Encyclopedia of Genes and Genomes)
Gene and protein information
– GenBank, UniProt, and PDB
– Species specific: FlyBase, dictyBase, etc.
Ontological data
– Gene Ontology
Pathway data
– KEGG PATHWAY, Reactome, BRENDA, etc.
Protein-protein interaction data
– IntAct, BioGRID, etc.

EBI
http://www.ebi.ac.uk/
European base of molecular biology information, including genomic, gene expression, and literature information.

NCBI
http://www.ncbi.nlm.nih.gov/
Contains public databases of molecular biology information including genomes, microarray gene expression, protein sequence domains, etc.
Develops software for analyzing genome data, including BLAST.
Provides PubMed, an archive of biomedical and life science journals.

Gene and protein databases
UniProt
– Universal Protein Resource
– http://beta.uniprot.org/
– Consists of three components:
  The UniProt Knowledgebase (UniProtKB) is the central access point for extensive curated protein information, including function, classification, and cross-reference.
  The UniProt Reference Clusters (UniRef) databases combine closely related sequences into a single record to speed searches.
  The UniProt Archive (UniParc) is a comprehensive repository, reflecting the history of all protein sequences.
The UniRef databases provide clustered sets of sequences from UniProt (including splice variants and isoforms) and selected UniParc records, in order to obtain complete coverage of sequence space at several resolutions while hiding redundant sequences (but not their descriptions) from view.
The UniRef100 database combines identical sequences and sub-fragments with 11 or more residues (from any organism) into a single UniRef entry, displaying the sequence of a representative protein, the accession numbers of all the merged entries, and links to the corresponding UniProtKB and UniParc records.

Gene and protein databases
GenBank
– A part of NCBI
– http://www.ncbi.nlm.nih.gov/
– Search can be performed through the Entrez interface, which searches for the query in all databases available at the NCBI

Gene and protein databases
PDB: Protein Data Bank
– Contains 3-dimensional protein structures
– http://www.rcsb.org
– Data submitted by individual researchers

Databases of Model Organisms
Model organisms are those whose genomes have been studied extensively enough that the workings of biological phenomena in more complex organisms can be inferred from them.
For example, the mouse is the organism most often studied to understand other mammalian systems.
For plant species, Arabidopsis thaliana is most often used as a model organism.

Databases of model organisms
Arabidopsis (mustard plant)
– TAIR: The Arabidopsis Information Resource http://www.arabidopsis.org/
– The Carnegie Institution of Washington, the National Center for Genome Resources
C. elegans
– WormBase http://www.wormbase.org/
– Cold Spring Harbor Laboratory
Dictyostelium (slime mold)
– dictyBase http://dictybase.org/
– Northwestern University

Databases of model organisms
Drosophila (Fruit fly)
– FlyBase http://www.flybase.org/
– Indiana University
Mouse
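
Looking back at the FASTA sequence format covered under the data formats, saving and reusing retrieved sequences usually starts with reading such a file into memory. The following is a minimal Python sketch under stated assumptions: a record begins with a `>` header line, the first word of the header is taken as the record ID, and all following lines up to the next header are sequence. The function name `read_fasta` and the two example records are hypothetical and not part of the course material.

```python
from typing import Dict, Iterable, Optional


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {record_id: sequence} from FASTA-formatted lines."""
    records: Dict[str, str] = {}
    current_id: Optional[str] = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # Header line: use the first word after '>' as the record ID.
            header = line[1:].strip()
            current_id = header.split()[0] if header else ""
            records[current_id] = ""
        elif current_id is not None:
            # Sequence line: concatenate, dropping any internal spaces.
            records[current_id] += line.replace(" ", "")
    return records


# Hypothetical records, made up purely to show the layout.
example = """\
>seqA hypothetical example record
ACGTAC
GTTA
>seqB another hypothetical record
GGNNCC
"""

records = read_fasta(example.splitlines())
for name, seq in records.items():
    print(name, len(seq), seq)
```

Keeping the sequence as a plain string leaves ambiguity codes such as N untouched, so they can be interpreted later with the nucleic acid code table.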