The Uniprot Knowledgebase the Uniprot Knowledgebase Uniprotkb Flat-File Format

Introduction to bioinformatics The UniProt Knowledgebase The UniProt Knowledgebase UniProtKB Flat-file format A CRITICAL GUIDE 1 Version: 1 August 2018 A Critical Guide to UniProtKB’s flat-file format UniProtKB’s flat-file format Overview This Critical Guide briefly presents the need for biological databases and for a standard format for storing and organising biological data. Web-based interfaces have made databases more user-friendly, but knowledge of the underlying file format offers a deeper understanding of how to navigate and mine the information they contain, so that humans and machines can get the most out of them. This Guide explores the file format that underpins one of today’s most popular biological databases – UniProtKB. Teaching Goals & Learning Outcomes This Guide introduces the concept of database ‘flat-files’, and examines features of the UniProtKB flat-file format. On reading this Guide, you will be able to: • identify key fields within UniProtKB/Swiss-Prot and /TrEMBL flat-files; • explain what these fields mean, what information they contain and what the information is used for; • analyse the information in different fields and infer structural and functional features of a sequence; • examine and investigate the provenance of annotations; and • compare annotations at different time-points and evaluate the likely impact of annotation changes. This Guide introduces the EMBL flat-file format (henceforth, simp- 1 Introduction ly termed the EMBL format), which underpinned many of the earliest bio-databases. Using UniProtKB/Swiss-Prot1 as an exemplar, the Guide explores some of the key database fields, and illustrates how Protein and genome sequencing were major technological some of these can be used to create links to other resources. achievements that are now being used to generate vast amounts of data. However, just gathering more data doesn’t make us miracu- lously more knowledgeable or confer instant understanding of the 2 About this Guide information we’re amassing. Gaining new biological insights from raw genomic data is a complex process that requires, for example: The following sections outline the basic structure of the EMBL • genes to be located and their structures to be assembled format. Some features of the UniProtKB Web interface are refer- • coding regions to be identified and translated enced (www.uniprot.org), but this isn’t a general interface guide – • functions to be assigned to genes and their products the front-end often changes, but the underlying flat-file is ‘standard’ • disease associations to be identified, etc. (a high-level content overview of the database is explored in the To make raw data useful and re-usable, all this information must companion Critical Guide, The UniProt Knowledgebase - UniProtKB). be stored systematically in databases. The first protein and nucleo- Exercises are provided to help understand where to retrieve particu- tide sequence databases were created in the 1980s. Access to data- lar types of information, and how to navigate entry histories to dis- bases was important not only because it gave scientists the ability to cover the provenance of their annotations. Throughout the text, key use public data in their own research, but also because it facilitated terms – rendered in bold type – are defined in boxes. Additional connections to data in related resources via their annotations. information is provided in supplementary boxes. Annotations are the ‘intelligence’ or clues we attach to raw data KEY TERMS to make them meaningful to, and re-usable by, other researchers: linear strings of nucleotide bases or amino acid residues are virtually Annotation: notes that make database entries informative & re-usable Coding region: region of an mRNA that is translatable into a polypeptide useless on their own; however, allied with information about their EMBL: European Molecular Biology Laboratory evolutionary relationships, biological functions, roles, interactions, Flat-file: a plain-text file containing a number of data entries that lack disease associations, and so on, they become building-blocks of structured inter-relationships knowledge and spring-boards for future discoveries. Raw data: experimental data that bear no annotation The first biological databases were simple formatted text or flat- Record: an entry in a database pertaining to a specific datum files. These were organised such that ‘records’ within them could be Sequencing: the lab process of determining the sequence of amino easily identified and linked. Although database architectures and acids or nucleotide bases in a protein or DNA sequence, respectively their interfaces have become more sophisticated as the complexity UniProtKB/Swiss-Prot: a largely hand-annotated protein sequence and types of available data have grown, awareness and understand- database; given the breadth & quality of the information it contains, ing of those original flat-file formats is still relevant. it is often regarded as the ‘gold standard’ for database annotation 2 A Critical Guide to The UniProtKB flat-file format ID PRIO_HUMAN STANDARD; PRT; 253 AA. 3 Flat-file databases AC P04156; DT 01-NOV-1986 (REL. 03, CREATED) DT 01-NOV-1986 (REL. 03, LAST SEQUENCE UPDATE) DT 01-NOV-1991 (REL. 20, LAST ANNOTATION UPDATE) Database annotations add value to raw data, in principle allow- DE MAJOR PRION PROTEIN PRECURSOR (PRP) (PRP27-30) (PRP33-35C) GN PRNP. ing them to be re-used quickly and conveniently. The more annota- OS HOMO SAPIENS (HUMAN). OC EUKARYOTA; METAZOA; CHORDATA; VERTEBRATA; TETRAPODA; MAMMALIA; tions provided, the richer the database content. The problem is, the OC EUTHERIA; PRIMATES. RN [1] more information added, the greater the need for disciplined ap- RP SEQUENCE FROM N.A. RM 86300093 proaches to data archiving – if computers are to be able to access RA KRETZSCHMAR H.A., STOWRING L.E., WESTAWAY D., STUBBLEBINE W.H., RA PRUSINER S.B., DEARMOND S.J.; particular annotations reliably, they must be stored in a structured RL DNA 5:315-324(1986). RN [2] way. This begs two questions: RP SEQUENCE OF 8-253 FROM N.A. RM 86261778 RA LIAO Y.-C.J., LEBO R.V., CLAWSON G.A., SMUCKLER E.A.; • What kinds of annotation are vital? RL SCIENCE 233:364-367(1986). … • How should those annotations be organised? CC -!- FUNCTION: THE FUNCTION OF PRP IS NOT KNOWN. PRP IS ENCODED IN THE CC HOST GENOME AND IS EXPRESSED BOTH IN NORMAL AND INFECTED CELLS. If we think about sequence data, adding notes about their biolog- CC -!- DISEASE: PRP IS FOUND IN HIGH QUANTITY IN THE BRAIN OF HUMANS AND CC ANIMALS INFECTED WITH THE DEGENERATIVE NEUROLOGICAL DISEASES KURU, ical relationships, functions, structures, etc., is clearly useful. Data- CC CREUTZFELDT-JACOB DISEASE (CJD), GERSTMANN-STRAUSSLER SYNDROME CC (GSS), SCRAPIE, BOVINE SPONGIFORM ENCEPHALOPATHY (BSE), ETC. base-specific details, like when the sequence was submitted and CC -!- SUBUNIT: PRP HAS A TENDENCY TO AGGREGATE YIELDING POLYMERS CALLED CC "RODS". when the database entry was last updated, add further value; links CC -!- PRP CONTAINS 5 TANDEM REPEATS OF AN OCTAPEPTIDE P-H-G-G-G-W-G-Q. CC -!- SUBCELLULAR LOCATION: PRP IS ATTACHED TO THE EXTRACELLULAR SIDE OF to the literature (e.g., to articles describing the function and disease- CC THE CELL MEMBRANE BY A GPI-ANCHOR. DR EMBL; M13667; HSPRP0A. associations of a sequence) and cross-references to information in DR EMBL; M13899; HSPRP. DR PIR; A05017; A05017. related databases are also helpful. DR PIR; A24173; A24173. DR MIM; 176640; NINTH EDITION. Structuring such information systematically to facilitate computer DR MIM; 123400; NINTH EDITION. DR MIM; 13744O; NINTH EDITION. access is challenging. Plain text is meaningful to humans, but not to DR MIM; 245300; NINTH EDITION. DR PROSITE; PS00291; PRION. computers; nevertheless, the earliest databases were created as KW PRION; BRAIN; GLYCOPROTEIN; GPI-ANCHOR; TANDEM REPEAT; SIGNAL. FT SIGNAL 1 22 flat-files. This required bits of information to be pinpointed with FT CHAIN 23 253 PRION PROTEIN. FT DOMAIN 90 234 PRP27-30 (PROTEASE RESISTANT CORE). specific tags to allow the data being stored in particular database FT CARBOHYD 181 181 PROBABLE. FT CARBOHYD 197 197 PROBABLE. locations to be identified, as illustrated in Figure 1. FT DISULFID 179 214 BY SIMILARITY. FT VARIANT 102 102 P -> L (LINKED TO DEVELOPMENT OF ATAXIC FT GSS). FT VARIANT 117 117 A -> V (LINKED TO DEVELOPMENT OF FT DEMENTING GSS). FT VARIANT 129 129 M -> V. FT VARIANT 200 200 E -> K (IN SOME CJD PATIENTS). FT REPEAT 51 59 FT REPEAT 60 67 FT REPEAT 68 75 FT REPEAT 76 83 FT REPEAT 84 91 SQ SEQUENCE 253 AA; 27661 MW; 354945 CN; MANLGCWMLV LFVATWSDLG LCKKRPKPGG WNTGGSRYPG QGSPGGNRYP PQGGGGWGQP HGGGWGQPHG GGWGQPHGGG WGQPHGGGWG QGGGTHSQWN KPSKPKTNMK HMAGAAAAGA VVGGLGGYML GSAMSRPIIH FGSDYEDRYY RENMHRYPNQ VYYRPMDEYS NQNNFVHDCV NITIKQHTVT TTTKGENFTE TDVKMMERVV EQMCITQYER ESQAYYQRGS SMVLFSSPPV ILLISFLIFL IVG // Figure 2 Key features of the flat-file format of a UniProtKB/Swiss- Prot entry. Shown here is an excerpt from the human prion protein entry. The file begins with an identifying (ID) code and an accession (AC) number: the AC number (here, P04156) is computer readable; the ID code is more intelligible to humans – here, PRIO_HUMAN denotes the human prion protein. Entry information Figure 1 Creating a flat-file database. Multiple plain-text or flat-files ID PRIO_HUMAN STANDARD; PRT; 253 AA. are appended to create a flat-file database. Within the database, each AC P04156; DT 01-NOV-1986 (REL. 03, CREATED) file, or record, contains various data fields, each identified by a specific DT 01-NOV-1986 (REL. 03, LAST SEQUENCE UPDATE) tag. The zoomed-in region shows several tags found in the EMBL format, DT 01-NOV-1991 (REL. 20, LAST ANNOTATION UPDATE) shown here with a range of fields typical of UniProtKB entries. The AC and ID codes specify a particular database entry. In princi- Inevitably, several disparate flat-file formats evolved to store dif- ple, the AC number is invariant, so a sequence should always be ferent types of biological data.

The Uniprot Knowledgebase the Uniprot Knowledgebase Uniprotkb Flat-File Format

Toolboxes for a Standardised and Systematic Study of Glycans

Life with 6000 Genes Author(S): A. Goffeau, B. G. Barrell, H. Bussey, R

ELIXIR Poster Numbers: P El001 - 037 Application Posters: P El034 - 037

Comprehensive Analysis of CRISPR/Cas9-Mediated Mutagenesis in Arabidopsis Thaliana by Genome-Wide Sequencing

Introduction to Bioinformatics (Elective) – SBB1609

Embnet Conference 2020 Programme

The Uniprot Knowledgebase

Concepts, Historical Milestones and the Central Place of Bioinformatics in Modern Biology: a European Perspective

The Uniprot Knowledgebase BLAST

ISB Newsletter

Integrative Analysis of Transcriptomic Data for Identification of T-Cell

Bioinformatics: a Practical Guide to the Analysis of Genes and Proteins, Second Edition Andreas D