The Uniprot Knowledgebase the Uniprot Knowledgebase Uniprotkb Flat-File Format

Total Page:16

File Type:pdf, Size:1020Kb

The Uniprot Knowledgebase the Uniprot Knowledgebase Uniprotkb Flat-File Format Introduction to bioinformatics The UniProt Knowledgebase The UniProt Knowledgebase UniProtKB Flat-file format A CRITICAL GUIDE 1 Version: 1 August 2018 A Critical Guide to UniProtKB’s flat-file format UniProtKB’s flat-file format Overview This Critical Guide briefly presents the need for biological databases and for a standard format for storing and organising biological data. Web-based interfaces have made databases more user-friendly, but knowledge of the underlying file format offers a deeper un- derstanding of how to navigate and mine the information they contain, so that humans and machines can get the most out of them. This Guide explores the file format that underpins one of today’s most popular biological databases – UniProtKB. Teaching Goals & Learning Outcomes This Guide introduces the concept of database ‘flat-files’, and examines features of the UniProtKB flat-file format. On reading this Guide, you will be able to: • identify key fields within UniProtKB/Swiss-Prot and /TrEMBL flat-files; • explain what these fields mean, what information they contain and what the information is used for; • analyse the information in different fields and infer structural and functional features of a sequence; • examine and investigate the provenance of annotations; and • compare annotations at different time-points and evaluate the likely impact of annotation changes. This Guide introduces the EMBL flat-file format (henceforth, simp- 1 Introduction ly termed the EMBL format), which underpinned many of the earli- est bio-databases. Using UniProtKB/Swiss-Prot1 as an exemplar, the Guide explores some of the key database fields, and illustrates how Protein and genome sequencing were major technological some of these can be used to create links to other resources. achievements that are now being used to generate vast amounts of data. However, just gathering more data doesn’t make us miracu- lously more knowledgeable or confer instant understanding of the 2 About this Guide information we’re amassing. Gaining new biological insights from raw genomic data is a complex process that requires, for example: The following sections outline the basic structure of the EMBL • genes to be located and their structures to be assembled format. Some features of the UniProtKB Web interface are refer- • coding regions to be identified and translated enced (www.uniprot.org), but this isn’t a general interface guide – • functions to be assigned to genes and their products the front-end often changes, but the underlying flat-file is ‘standard’ • disease associations to be identified, etc. (a high-level content overview of the database is explored in the To make raw data useful and re-usable, all this information must companion Critical Guide, The UniProt Knowledgebase - UniProtKB). be stored systematically in databases. The first protein and nucleo- Exercises are provided to help understand where to retrieve particu- tide sequence databases were created in the 1980s. Access to data- lar types of information, and how to navigate entry histories to dis- bases was important not only because it gave scientists the ability to cover the provenance of their annotations. Throughout the text, key use public data in their own research, but also because it facilitated terms – rendered in bold type – are defined in boxes. Additional connections to data in related resources via their annotations. information is provided in supplementary boxes. Annotations are the ‘intelligence’ or clues we attach to raw data KEY TERMS to make them meaningful to, and re-usable by, other researchers: linear strings of nucleotide bases or amino acid residues are virtually Annotation: notes that make database entries informative & re-usable Coding region: region of an mRNA that is translatable into a polypeptide useless on their own; however, allied with information about their EMBL: European Molecular Biology Laboratory evolutionary relationships, biological functions, roles, interactions, Flat-file: a plain-text file containing a number of data entries that lack disease associations, and so on, they become building-blocks of structured inter-relationships knowledge and spring-boards for future discoveries. Raw data: experimental data that bear no annotation The first biological databases were simple formatted text or flat- Record: an entry in a database pertaining to a specific datum files. These were organised such that ‘records’ within them could be Sequencing: the lab process of determining the sequence of amino easily identified and linked. Although database architectures and acids or nucleotide bases in a protein or DNA sequence, respectively their interfaces have become more sophisticated as the complexity UniProtKB/Swiss-Prot: a largely hand-annotated protein sequence and types of available data have grown, awareness and understand- database; given the breadth & quality of the information it contains, ing of those original flat-file formats is still relevant. it is often regarded as the ‘gold standard’ for database annotation 2 A Critical Guide to The UniProtKB flat-file format ID PRIO_HUMAN STANDARD; PRT; 253 AA. 3 Flat-file databases AC P04156; DT 01-NOV-1986 (REL. 03, CREATED) DT 01-NOV-1986 (REL. 03, LAST SEQUENCE UPDATE) DT 01-NOV-1991 (REL. 20, LAST ANNOTATION UPDATE) Database annotations add value to raw data, in principle allow- DE MAJOR PRION PROTEIN PRECURSOR (PRP) (PRP27-30) (PRP33-35C) GN PRNP. ing them to be re-used quickly and conveniently. The more annota- OS HOMO SAPIENS (HUMAN). OC EUKARYOTA; METAZOA; CHORDATA; VERTEBRATA; TETRAPODA; MAMMALIA; tions provided, the richer the database content. The problem is, the OC EUTHERIA; PRIMATES. RN [1] more information added, the greater the need for disciplined ap- RP SEQUENCE FROM N.A. RM 86300093 proaches to data archiving – if computers are to be able to access RA KRETZSCHMAR H.A., STOWRING L.E., WESTAWAY D., STUBBLEBINE W.H., RA PRUSINER S.B., DEARMOND S.J.; particular annotations reliably, they must be stored in a structured RL DNA 5:315-324(1986). RN [2] way. This begs two questions: RP SEQUENCE OF 8-253 FROM N.A. RM 86261778 RA LIAO Y.-C.J., LEBO R.V., CLAWSON G.A., SMUCKLER E.A.; • What kinds of annotation are vital? RL SCIENCE 233:364-367(1986). … • How should those annotations be organised? CC -!- FUNCTION: THE FUNCTION OF PRP IS NOT KNOWN. PRP IS ENCODED IN THE CC HOST GENOME AND IS EXPRESSED BOTH IN NORMAL AND INFECTED CELLS. If we think about sequence data, adding notes about their biolog- CC -!- DISEASE: PRP IS FOUND IN HIGH QUANTITY IN THE BRAIN OF HUMANS AND CC ANIMALS INFECTED WITH THE DEGENERATIVE NEUROLOGICAL DISEASES KURU, ical relationships, functions, structures, etc., is clearly useful. Data- CC CREUTZFELDT-JACOB DISEASE (CJD), GERSTMANN-STRAUSSLER SYNDROME CC (GSS), SCRAPIE, BOVINE SPONGIFORM ENCEPHALOPATHY (BSE), ETC. base-specific details, like when the sequence was submitted and CC -!- SUBUNIT: PRP HAS A TENDENCY TO AGGREGATE YIELDING POLYMERS CALLED CC "RODS". when the database entry was last updated, add further value; links CC -!- PRP CONTAINS 5 TANDEM REPEATS OF AN OCTAPEPTIDE P-H-G-G-G-W-G-Q. CC -!- SUBCELLULAR LOCATION: PRP IS ATTACHED TO THE EXTRACELLULAR SIDE OF to the literature (e.g., to articles describing the function and disease- CC THE CELL MEMBRANE BY A GPI-ANCHOR. DR EMBL; M13667; HSPRP0A. associations of a sequence) and cross-references to information in DR EMBL; M13899; HSPRP. DR PIR; A05017; A05017. related databases are also helpful. DR PIR; A24173; A24173. DR MIM; 176640; NINTH EDITION. Structuring such information systematically to facilitate computer DR MIM; 123400; NINTH EDITION. DR MIM; 13744O; NINTH EDITION. access is challenging. Plain text is meaningful to humans, but not to DR MIM; 245300; NINTH EDITION. DR PROSITE; PS00291; PRION. computers; nevertheless, the earliest databases were created as KW PRION; BRAIN; GLYCOPROTEIN; GPI-ANCHOR; TANDEM REPEAT; SIGNAL. FT SIGNAL 1 22 flat-files. This required bits of information to be pinpointed with FT CHAIN 23 253 PRION PROTEIN. FT DOMAIN 90 234 PRP27-30 (PROTEASE RESISTANT CORE). specific tags to allow the data being stored in particular database FT CARBOHYD 181 181 PROBABLE. FT CARBOHYD 197 197 PROBABLE. locations to be identified, as illustrated in Figure 1. FT DISULFID 179 214 BY SIMILARITY. FT VARIANT 102 102 P -> L (LINKED TO DEVELOPMENT OF ATAXIC FT GSS). FT VARIANT 117 117 A -> V (LINKED TO DEVELOPMENT OF FT DEMENTING GSS). FT VARIANT 129 129 M -> V. FT VARIANT 200 200 E -> K (IN SOME CJD PATIENTS). FT REPEAT 51 59 FT REPEAT 60 67 FT REPEAT 68 75 FT REPEAT 76 83 FT REPEAT 84 91 SQ SEQUENCE 253 AA; 27661 MW; 354945 CN; MANLGCWMLV LFVATWSDLG LCKKRPKPGG WNTGGSRYPG QGSPGGNRYP PQGGGGWGQP HGGGWGQPHG GGWGQPHGGG WGQPHGGGWG QGGGTHSQWN KPSKPKTNMK HMAGAAAAGA VVGGLGGYML GSAMSRPIIH FGSDYEDRYY RENMHRYPNQ VYYRPMDEYS NQNNFVHDCV NITIKQHTVT TTTKGENFTE TDVKMMERVV EQMCITQYER ESQAYYQRGS SMVLFSSPPV ILLISFLIFL IVG // Figure 2 Key features of the flat-file format of a UniProtKB/Swiss- Prot entry. Shown here is an excerpt from the human prion protein entry. The file begins with an identifying (ID) code and an accession (AC) number: the AC number (here, P04156) is computer readable; the ID code is more intelligible to humans – here, PRIO_HUMAN denotes the human prion protein. Entry information Figure 1 Creating a flat-file database. Multiple plain-text or flat-files ID PRIO_HUMAN STANDARD; PRT; 253 AA. are appended to create a flat-file database. Within the database, each AC P04156; DT 01-NOV-1986 (REL. 03, CREATED) file, or record, contains various data fields, each identified by a specific DT 01-NOV-1986 (REL. 03, LAST SEQUENCE UPDATE) tag. The zoomed-in region shows several tags found in the EMBL format, DT 01-NOV-1991 (REL. 20, LAST ANNOTATION UPDATE) shown here with a range of fields typical of UniProtKB entries. The AC and ID codes specify a particular database entry. In princi- Inevitably, several disparate flat-file formats evolved to store dif- ple, the AC number is invariant, so a sequence should always be ferent types of biological data.
Recommended publications
  • Toolboxes for a Standardised and Systematic Study of Glycans
    Campbell et al. BMC Bioinformatics 2014, 15(Suppl 1):S9 http://www.biomedcentral.com/1471-2105/15/S1/S9 RESEARCH Open Access Toolboxes for a standardised and systematic study of glycans Matthew P Campbell1, René Ranzinger2, Thomas Lütteke3, Julien Mariethoz4, Catherine A Hayes5, Jingyu Zhang1, Yukie Akune6, Kiyoko F Aoki-Kinoshita6, David Damerell7,11, Giorgio Carta8, Will S York2, Stuart M Haslam7, Hisashi Narimatsu9, Pauline M Rudd8, Niclas G Karlsson4, Nicolle H Packer1, Frédérique Lisacek4,10* From Integrated Bio-Search: 12th International Workshop on Network Tools and Applications in Biology (NETTAB 2012) Como, Italy. 14-16 November 2012 Abstract Background: Recent progress in method development for characterising the branched structures of complex carbohydrates has now enabled higher throughput technology. Automation of structure analysis then calls for software development since adding meaning to large data collections in reasonable time requires corresponding bioinformatics methods and tools. Current glycobioinformatics resources do cover information on the structure and function of glycans, their interaction with proteins or their enzymatic synthesis. However, this information is partial, scattered and often difficult to find to for non-glycobiologists. Methods: Following our diagnosis of the causes of the slow development of glycobioinformatics, we review the “objective” difficulties encountered in defining adequate formats for representing complex entities and developing efficient analysis software. Results: Various solutions already implemented and strategies defined to bridge glycobiology with different fields and integrate the heterogeneous glyco-related information are presented. Conclusions: Despite the initial stage of our integrative efforts, this paper highlights the rapid expansion of glycomics, the validity of existing resources and the bright future of glycobioinformatics.
    [Show full text]
  • Life with 6000 Genes Author(S): A. Goffeau, B. G. Barrell, H. Bussey, R
    Life with 6000 Genes Author(s): A. Goffeau, B. G. Barrell, H. Bussey, R. W. Davis, B. Dujon, H. Feldmann, F. Galibert, J. D. Hoheisel, C. Jacq, M. Johnston, E. J. Louis, H. W. Mewes, Y. Murakami, P. Philippsen, H. Tettelin and S. G. Oliver Source: Science, Vol. 274, No. 5287, Genome Issue (Oct. 25, 1996), pp. 546+563-567 Published by: American Association for the Advancement of Science Stable URL: https://www.jstor.org/stable/2899628 Accessed: 28-08-2019 13:31 UTC REFERENCES Linked references are available on JSTOR for this article: https://www.jstor.org/stable/2899628?seq=1&cid=pdf-reference#references_tab_contents You may need to log in to JSTOR to access the linked references. JSTOR is a not-for-profit service that helps scholars, researchers, and students discover, use, and build upon a wide range of content in a trusted digital archive. We use information technology and tools to increase productivity and facilitate new forms of scholarship. For more information about JSTOR, please contact [email protected]. Your use of the JSTOR archive indicates your acceptance of the Terms & Conditions of Use, available at https://about.jstor.org/terms American Association for the Advancement of Science is collaborating with JSTOR to digitize, preserve and extend access to Science This content downloaded from 152.3.43.41 on Wed, 28 Aug 2019 13:31:29 UTC All use subject to https://about.jstor.org/terms by FISH across the whole genome. A subset of the 37. G. D. Billingsleyet al., Am. J. Hum. Genet. 52, 343 (1994); L.
    [Show full text]
  • ELIXIR Poster Numbers: P El001 - 037 Application Posters: P El034 - 037
    POSTER LIST ORDERED ALPHABETICALLY BY POSTER TITLE GROUPED BY THEME/TRACK THEME/TRACK: ELIXIR Poster numbers: P_El001 - 037 Application posters: P_El034 - 037 Poster EasyChair Presenting Author list Title Abstract Theme/track Topics number number author P_El001 714 Joan Segura, Daniel Joan Segura 3DBIONOTES: Unifying molecular biology With the advent of next generation sequencing methods, the amount of proteomic and genomic information is growing faster than ever. Several projects have been undertaken to annotate the ELIXIR poster ELIXIR Tabas Madrid, Ruben information genomes of most important organisms, including human. For example, the GENECODE project seeks to enhance all human genes including protein-coding loci with alternatively splices Sanchez-Garcia, Jesús variants, non-coding loci and pseudogenes. Another example is the 1000 genomes, a repository of human genetic variations, including SNPs and structural variants, and their haplotype Cuenca, Carlos Oscar contexts. These projects feed most relevant biological databases as UNIPROT and ENSEMBL, extending the amount of available annotation for genes and proteins.Genomic and proteomic Sánchez Sorzano, Ardan annotations are a valuable contribution in the study of protein and gene functions. However, structural information is an essential key for a deeper understanding of the molecular properties Patwardhan and Jose that allow proteins and genes to perform specific tasks. Therefore, depicting genomic and proteomic information over structural data would offer a very complete picture in order to understand Maria Carazo how proteins and genes behave in the different cellular processes.In this work we present the second version of a web platform -3DBIONOTES- that aims to merge the different levels of molecular biology information, including genomics, proteomics and interactomics data into a unique graphical environment.
    [Show full text]
  • Comprehensive Analysis of CRISPR/Cas9-Mediated Mutagenesis in Arabidopsis Thaliana by Genome-Wide Sequencing
    International Journal of Molecular Sciences Article Comprehensive Analysis of CRISPR/Cas9-Mediated Mutagenesis in Arabidopsis thaliana by Genome-Wide Sequencing Wenjie Xu 1,2 , Wei Fu 2, Pengyu Zhu 2, Zhihong Li 1, Chenguang Wang 2, Chaonan Wang 1,2, Yongjiang Zhang 2 and Shuifang Zhu 1,2,* 1 College of Plant Protection, China Agricultural University, Beijing 100193 China 2 Institute of Plant Quarantine, Chinese Academy of Inspection and Quarantine, Beijing 100176, China * Correspondence: [email protected] Received: 9 July 2019; Accepted: 21 August 2019; Published: 23 August 2019 Abstract: The clustered regularly interspaced short palindromic repeats (CRISPR)/CRISPR-associated protein (Cas) system has been widely applied in functional genomics research and plant breeding. In contrast to the off-target studies of mammalian cells, there is little evidence for the common occurrence of off-target sites in plants and a great need exists for accurate detection of editing sites. Here, we summarized the precision of CRISPR/Cas9-mediated mutations for 281 targets and found that there is a preference for single nucleotide deletions/insertions and longer deletions starting from 40 nt upstream or ending at 30 nt downstream of the cleavage site, which suggested the candidate sequences for editing sites detection by whole-genome sequencing (WGS). We analyzed the on-/off-target sites of 6 CRISPR/Cas9-mediated Arabidopsis plants by the optimized method. The results showed that the on-target editing frequency ranged from 38.1% to 100%, and one off target at a frequency of 9.8%–97.3% cannot be prevented by increasing the specificity or reducing the expression level of the Cas9 enzyme.
    [Show full text]
  • Introduction to Bioinformatics (Elective) – SBB1609
    SCHOOL OF BIO AND CHEMICAL ENGINEERING DEPARTMENT OF BIOTECHNOLOGY Unit 1 – Introduction to Bioinformatics (Elective) – SBB1609 1 I HISTORY OF BIOINFORMATICS Bioinformatics is an interdisciplinary field that develops methods and software tools for understanding biologicaldata. As an interdisciplinary field of science, bioinformatics combines computer science, statistics, mathematics, and engineering to analyze and interpret biological data. Bioinformatics has been used for in silico analyses of biological queries using mathematical and statistical techniques. Bioinformatics derives knowledge from computer analysis of biological data. These can consist of the information stored in the genetic code, but also experimental results from various sources, patient statistics, and scientific literature. Research in bioinformatics includes method development for storage, retrieval, and analysis of the data. Bioinformatics is a rapidly developing branch of biology and is highly interdisciplinary, using techniques and concepts from informatics, statistics, mathematics, chemistry, biochemistry, physics, and linguistics. It has many practical applications in different areas of biology and medicine. Bioinformatics: Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data. Computational Biology: The development and application of data-analytical and theoretical methods, mathematical modeling and computational simulation techniques to the study of biological, behavioral, and social systems. "Classical" bioinformatics: "The mathematical, statistical and computing methods that aim to solve biological problems using DNA and amino acid sequences and related information.” The National Center for Biotechnology Information (NCBI 2001) defines bioinformatics as: "Bioinformatics is the field of science in which biology, computer science, and information technology merge into a single discipline.
    [Show full text]
  • Embnet Conference 2020 Programme
    EMBnet Conference 2020 Bioinformatics Approaches to Precision Research 23-24 September 2020 Zoom Platform - Time 3 pm to 8 pm (CEST) Registration: https://us02web.zoom.us/meeting/register/tZEvd-mupjwiE911qtAnYln4TrNwQlxDMYm4 Programme 1st Day: 23 September 9, 2020 15:00-15:10 Welcome & Programme Presentation Domenica D’Elia & Erik Bongcam-Rudloff 15:10-15:30 Exosomes in breast milk; a beneficial genetic trojan horse from mother to child Dimitrios Vlachakis, Aspasia Efthimiadou, Flora Bacopoulou, Elias Eliopoulos, George P. Chrousos 15:30-15:50 Precision dairy farming – A Phenomenal opportunity Tomas Klingström 15:50-16:10 Genome regulation by long non-coding RNAs Katerina Pierouli, George N. Goulielmos, Elias Eliopoulos, Dimitrios Vlachakis European Molecular Biology Network (EMBnet) and Global Bioinformatics Network since 1988 16:10-16:30 Precision Epidemiology of Multi-drug resistance bacteria: bioinformatics tools J.Donato, L. Lugo, H. Perez, H. Ballen, D. Talero, S. Prada, F.Brion, V. Rincon, L. Falquet, M.T. Reguero, E. Barreto-Hernandez 16:30-16:50 Mechanisms of epigenetic inheritance in children following exposure to abuse E. Damaskopoulou, G.P. Chrousos, E. Eliopoulos, D. Vlachakis 16:50-17:00 Break 17:00-17:20 Genome-wide association studies (GWAS) in an effort to provide insights into the complex interplay of nuclear receptor transcriptional networks and their contribution to the maintenance of homeostasis Thanassis Mitsis, G.P. Chrousos, E. Eliopoulos, D.Vlachakis 17:20-17:40 Computer aided drug design and pharmacophore modelling towards the discovery of novel anti-ebola agents Kalliopi Io Diakou, G.P. Chrousos, E. Eliopoulos, D. Vlachakis 17:40-18:00 3′-Tag RNA-sequencing Andreas Gisel 18:00-18:20 Expression profiling of non-coding RNA in coronaviruses provides clues for virus RNA interference with immune system response Arianna Consiglio, V.F.
    [Show full text]
  • The Uniprot Knowledgebase
    Bringing bioinformatics into the classroom The UniProtIntroducing Knowledgebase UniProtKB Computer-Aided Drug Design A PRACTICAL GUIDE 1 Version: 30 November 2020 A Practical Guide to Computer-Aided Drug Design Designing tomorrow’s drugs Overview This Practical Guide outlines basic computational approaches used in drug discovery. It highlights how bioinformatics can be harnessed to design drug candidates, to predict their affinity for their targets, their fate inside the body, their toxicity and possible side-effects. Teaching Goals & Learning Outcomes This Guide introduces bioinformatics tools for designing candidate drug molecules, and for predicting their likely target protein(s) and their drug-like properties. On reading the Guide and completing the exercises, you will be able to: • design drug-candidate molecules using the structures of known drugs as templates, and dock them to known protein targets; • compare the protein target-binding strengths of drug candidates with those of known drugs; • calculate properties of drug candidates and infer whether they need chemical modification to make them more drug-like; • predict the protein(s) that a drug candidate is likely to target; • create molecular fingerprints for known drugs, and use these to quantify their similarity. simple computational methodologies to conceive and evaluate mol- 13 1 Introduction ecules for their potential to become drugs . Although macromolec- ular entities (e.g., like antibodies) can act as therapeutic agents, here we consider drugs as small organic molecules (less than ~100 Over the past century, the design and production of drugs has had atoms) that activate or inhibit the functions of proteins, ultimately a beneficial impact on life expectancy and quality1,2.
    [Show full text]
  • Concepts, Historical Milestones and the Central Place of Bioinformatics in Modern Biology: a European Perspective
    1 Concepts, Historical Milestones and the Central Place of Bioinformatics in Modern Biology: A European Perspective Attwood, T.K.1, Gisel, A.2, Eriksson, N-E.3 and Bongcam-Rudloff, E.4 1Faculty of Life Sciences & School of Computer Science, University of Manchester 2Institute for Biomedical Technologies, CNR 3Uppsala Biomedical Centre (BMC), University of Uppsala 4Department of Animal Breeding and Genetics, Swedish University of Agricultural Sciences 1UK 2Italy 3,4Sweden 1. Introduction The origins of bioinformatics, both as a term and as a discipline, are difficult to pinpoint. The expression was used as early as 1977 by Dutch theoretical biologist Paulien Hogeweg when she described her main field of research as bioinformatics, and established a bioinformatics group at the University of Utrecht (Hogeweg, 1978; Hogeweg & Hesper, 1978). Nevertheless, the term had little traction in the community for at least another decade. In Europe, the turning point seems to have been circa 1990, with the planning of the “Bioinformatics in the 90s” conference, which was held in Maastricht in 1991. At this time, the National Center for Biotechnology Information (NCBI) had been newly established in the United States of America (USA) (Benson et al., 1990). Despite this, there was still a sense that the nation lacked a “long-term biology ‘informatics’ strategy”, particularly regarding postdoctoral interdisciplinary training in computer science and molecular biology (Smith, 1990). Interestingly, Smith spoke here of ‘biology informatics’, not bioinformatics; and the NCBI was a ‘center for biotechnology information’, not a bioinformatics centre. The discipline itself ultimately grew organically from the needs of researchers to access and analyse (primarily biomedical) data, which appeared to be accumulating at alarming rates simultaneously in different parts of the world.
    [Show full text]
  • The Uniprot Knowledgebase BLAST
    Introduction to bioinformatics The UniProt Knowledgebase BLAST UniProtKB Basic Local Alignment Search Tool A CRITICAL GUIDE 1 Version: 1 August 2018 A Critical Guide to BLAST BLAST Overview This Critical Guide provides an overview of the BLAST similarity search tool, Briefly examining the underlying algorithm and its rise to popularity. Several WeB-based and stand-alone implementations are reviewed, and key features of typical search results are discussed. Teaching Goals & Learning Outcomes This Guide introduces concepts and theories emBodied in the sequence database search tool, BLAST, and examines features of search outputs important for understanding and interpreting BLAST results. On reading this Guide, you will Be aBle to: • search a variety of Web-based sequence databases with different query sequences, and alter search parameters; • explain a range of typical search parameters, and the likely impacts on search outputs of changing them; • analyse the information conveyed in search outputs and infer the significance of reported matches; • examine and investigate the annotations of reported matches, and their provenance; and • compare the outputs of different BLAST implementations and evaluate the implications of any differences. finding short words – k-tuples – common to the sequences Being 1 Introduction compared, and using heuristics to join those closest to each other, including the short mis-matched regions Between them. BLAST4 was the second major example of this type of algorithm, From the advent of the first molecular sequence repositories in and rapidly exceeded the popularity of FastA, owing to its efficiency the 1980s, tools for searching dataBases Became essential. DataBase searching is essentially a ‘pairwise alignment’ proBlem, in which the and Built-in statistics.
    [Show full text]
  • ISB Newsletter
    ISB June 2012 . INTERNATIONAL SOCIETY ISB FOR BIOCURATION June, 2012 In this issue: - Save the Date: ISB2013 Save the Date - B3CB Workshop Report - ISB Membership Renewal for ISB2013! - 2012 Elections: ISB Executive Committee The 6th International Biocuration Conference - Jack Leunissen will be held on April 7-10, - Latest Advances from ISB Research Teams 2013, at Churchill College, - Job Opportunities Cambridge, UK. Training Activities: B3CB Workshop Report B3CB was a workshop aimed at developing common resources and infrastructures for training of experts as well as users of bioinformatics tools and resources. It was held in Uppsala (Sweden), and chaired by Terri Attwood (Manchester University, ISB Executive Committee member). Pascale Gaudet introduced the ISB. We believe that this initiative will help the ISB provide more support for training, a part of the our mission statement. The workshop brought together members of the executive committees of several ISB Membership Renewal organizations in Bioinformatics, Biotechnology, Biocuration and Computational Biology (B3CB): For many of us, membership fees are due - EMBnet (Global Bioinformatics Network) – Terri Attwood this month! Notices for renewal of ISB - ISCB (International Society for Computational Biology) – Reinhard Schneider Memberships will be reaching inboxes in - APBioNet (Asia-Pacific Bioinformatics Network) – Christian Schönbach the coming days. Your ISB Membership - ASBCB (African Society for Bioinformatics & Computational Biology) – Nicky Mulder goes to support activities such as the - SoIBio (Iberoamerican Society for Bioinformatics) – Oswaldo Trelles International Biocuration Conference - ISB (International Society for Biocuration) – Pascale Gaudet (travel support was provided for 10 - EBI/ELIXIR (European Bioinformatics Institute) – Cath Brooksbank attendees to ISB2012), BioDBCore (in - NBIC (Netherlands Bioinformatics Centre) – Celia van Gelder collaboration with BioSharing), and to - SeqAhead (Next-Gen Sequencing Data Analysis Network) – Erik Bongcam-Rudloff establish links with other groups.
    [Show full text]
  • Integrative Analysis of Transcriptomic Data for Identification of T-Cell
    www.nature.com/scientificreports OPEN Integrative analysis of transcriptomic data for identifcation of T‑cell activation‑related mRNA signatures indicative of preterm birth Jae Young Yoo1,5, Do Young Hyeon2,5, Yourae Shin2,5, Soo Min Kim1, Young‑Ah You1,3, Daye Kim4, Daehee Hwang2* & Young Ju Kim1,3* Preterm birth (PTB), defned as birth at less than 37 weeks of gestation, is a major determinant of neonatal mortality and morbidity. Early diagnosis of PTB risk followed by protective interventions are essential to reduce adverse neonatal outcomes. However, due to the redundant nature of the clinical conditions with other diseases, PTB‑associated clinical parameters are poor predictors of PTB. To identify molecular signatures predictive of PTB with high accuracy, we performed mRNA sequencing analysis of PTB patients and full‑term birth (FTB) controls in Korean population and identifed diferentially expressed genes (DEGs) as well as cellular pathways represented by the DEGs between PTB and FTB. By integrating the gene expression profles of diferent ethnic groups from previous studies, we identifed the core T‑cell activation pathway associated with PTB, which was shared among all previous datasets, and selected three representative DEGs (CYLD, TFRC, and RIPK2) from the core pathway as mRNA signatures predictive of PTB. We confrmed the dysregulation of the candidate predictors and the core T‑cell activation pathway in an independent cohort. Our results suggest that CYLD, TFRC, and RIPK2 are potentially reliable predictors for PTB. Preterm birth (PTB) is the birth of a baby at less than 37 weeks of gestation, as opposed to the usual about 40 weeks, called full term birth (FTB)1.
    [Show full text]
  • Bioinformatics: a Practical Guide to the Analysis of Genes and Proteins, Second Edition Andreas D
    BIOINFORMATICS A Practical Guide to the Analysis of Genes and Proteins SECOND EDITION Andreas D. Baxevanis Genome Technology Branch National Human Genome Research Institute National Institutes of Health Bethesda, Maryland USA B. F. Francis Ouellette Centre for Molecular Medicine and Therapeutics Children’s and Women’s Health Centre of British Columbia University of British Columbia Vancouver, British Columbia Canada A JOHN WILEY & SONS, INC., PUBLICATION New York • Chichester • Weinheim • Brisbane • Singapore • Toronto BIOINFORMATICS SECOND EDITION METHODS OF BIOCHEMICAL ANALYSIS Volume 43 BIOINFORMATICS A Practical Guide to the Analysis of Genes and Proteins SECOND EDITION Andreas D. Baxevanis Genome Technology Branch National Human Genome Research Institute National Institutes of Health Bethesda, Maryland USA B. F. Francis Ouellette Centre for Molecular Medicine and Therapeutics Children’s and Women’s Health Centre of British Columbia University of British Columbia Vancouver, British Columbia Canada A JOHN WILEY & SONS, INC., PUBLICATION New York • Chichester • Weinheim • Brisbane • Singapore • Toronto Designations used by companies to distinguish their products are often claimed as trademarks. In all instances where John Wiley & Sons, Inc., is aware of a claim, the product names appear in initial capital or ALL CAPITAL LETTERS. Readers, however, should contact the appropriate companies for more complete information regarding trademarks and registration. Copyright ᭧ 2001 by John Wiley & Sons, Inc. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic or mechanical, including uploading, downloading, printing, decompiling, recording or otherwise, except as permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without the prior written permission of the Publisher.
    [Show full text]