Introduction to

The UniProt Knowledgebase The UniProt Knowledgebase UniProtKB Flat-file format

A CRITICAL GUIDE

1

Version: 1 August 2018 A Critical Guide to UniProtKB’s flat-file format

UniProtKB’s flat-file format

Overview This Critical Guide briefly presents the need for biological databases and for a standard format for storing and organising biological data. Web-based interfaces have made databases more user-friendly, but knowledge of the underlying file format offers a deeper un- derstanding of how to navigate and mine the information they contain, so that humans and machines can get the most out of them. This Guide explores the file format that underpins one of today’s most popular biological databases – UniProtKB.

Teaching Goals & Learning Outcomes This Guide introduces the concept of ‘flat-files’, and examines features of the UniProtKB flat-file format. On reading this Guide, you will be able to: • identify key fields within UniProtKB/Swiss-Prot and /TrEMBL flat-files; • explain what these fields mean, what information they contain and what the information is used for; • analyse the information in different fields and infer structural and functional features of a sequence; • examine and investigate the provenance of annotations; and • compare annotations at different time-points and evaluate the likely impact of annotation changes.

This Guide introduces the EMBL flat-file format (henceforth, simp- 1 Introduction ly termed the EMBL format), which underpinned many of the earli- est bio-databases. Using UniProtKB/Swiss-Prot1 as an exemplar, the Guide explores some of the key database fields, and illustrates how Protein and genome were major technological some of these can be used to create links to other resources. achievements that are now being used to generate vast amounts of data. However, just gathering more data doesn’t make us miracu- lously more knowledgeable or confer instant understanding of the 2 About this Guide information we’re amassing. Gaining new biological insights from raw genomic data is a complex process that requires, for example: The following sections outline the basic structure of the EMBL • genes to be located and their structures to be assembled format. Some features of the UniProtKB Web interface are refer- • coding regions to be identified and translated enced (www..org), but this isn’t a general interface guide – • functions to be assigned to genes and their products the front-end often changes, but the underlying flat-file is ‘standard’ • disease associations to be identified, etc. (a high-level content overview of the database is explored in the To make raw data useful and re-usable, all this information must companion Critical Guide, The UniProt Knowledgebase - UniProtKB). be stored systematically in databases. The first protein and nucleo- Exercises are provided to help understand where to retrieve particu- tide sequence databases were created in the 1980s. Access to data- lar types of information, and how to navigate entry histories to dis- bases was important not only because it gave scientists the ability to cover the provenance of their annotations. Throughout the text, key use public data in their own research, but also because it facilitated terms – rendered in bold type – are defined in boxes. Additional connections to data in related resources via their annotations. information is provided in supplementary boxes. Annotations are the ‘intelligence’ or clues we attach to raw data KEY TERMS to make them meaningful to, and re-usable by, other researchers: linear strings of nucleotide bases or amino acid residues are virtually Annotation: notes that make database entries informative & re-usable Coding region: region of an mRNA that is translatable into a polypeptide useless on their own; however, allied with information about their EMBL: European Molecular Laboratory evolutionary relationships, biological functions, roles, interactions, Flat-file: a plain-text file containing a number of data entries that lack disease associations, and so on, they become building-blocks of structured inter-relationships knowledge and spring-boards for future discoveries. Raw data: experimental data that bear no annotation The first biological databases were simple formatted text or flat- Record: an entry in a database pertaining to a specific datum files. These were organised such that ‘records’ within them could be Sequencing: the lab process of determining the sequence of amino easily identified and linked. Although database architectures and acids or nucleotide bases in a protein or DNA sequence, respectively their interfaces have become more sophisticated as the complexity UniProtKB/Swiss-Prot: a largely hand-annotated protein sequence and types of available data have grown, awareness and understand- database; given the breadth & quality of the information it contains, ing of those original flat-file formats is still relevant. it is often regarded as the ‘gold standard’ for database annotation

2 A Critical Guide to The UniProtKB flat-file format

ID PRIO_HUMAN STANDARD; PRT; 253 AA. 3 Flat-file databases AC P04156; DT 01-NOV-1986 (REL. 03, CREATED) DT 01-NOV-1986 (REL. 03, LAST SEQUENCE UPDATE) DT 01-NOV-1991 (REL. 20, LAST ANNOTATION UPDATE) Database annotations add value to raw data, in principle allow- DE MAJOR PRION PROTEIN PRECURSOR (PRP) (PRP27-30) (PRP33-35C) GN PRNP. ing them to be re-used quickly and conveniently. The more annota- OS HOMO SAPIENS (HUMAN). OC EUKARYOTA; METAZOA; CHORDATA; VERTEBRATA; TETRAPODA; MAMMALIA; tions provided, the richer the database content. The problem is, the OC EUTHERIA; PRIMATES. RN [1] more information added, the greater the need for disciplined ap- RP SEQUENCE FROM N.A. RM 86300093 proaches to data archiving – if computers are to be able to access RA KRETZSCHMAR H.A., STOWRING L.E., WESTAWAY D., STUBBLEBINE W.H., RA PRUSINER S.B., DEARMOND S.J.; particular annotations reliably, they must be stored in a structured RL DNA 5:315-324(1986). RN [2] way. This begs two questions: RP SEQUENCE OF 8-253 FROM N.A. RM 86261778 RA LIAO Y.-C.J., LEBO R.V., CLAWSON G.A., SMUCKLER E.A.; • What kinds of annotation are vital? RL SCIENCE 233:364-367(1986). … • How should those annotations be organised? CC -!- FUNCTION: THE FUNCTION OF PRP IS NOT KNOWN. PRP IS ENCODED IN THE CC HOST GENOME AND IS EXPRESSED BOTH IN NORMAL AND INFECTED CELLS. If we think about sequence data, adding notes about their biolog- CC -!- DISEASE: PRP IS FOUND IN HIGH QUANTITY IN THE BRAIN OF HUMANS AND CC ANIMALS INFECTED WITH THE DEGENERATIVE NEUROLOGICAL DISEASES KURU, ical relationships, functions, structures, etc., is clearly useful. Data- CC CREUTZFELDT-JACOB DISEASE (CJD), GERSTMANN-STRAUSSLER SYNDROME CC (GSS), SCRAPIE, BOVINE SPONGIFORM ENCEPHALOPATHY (BSE), ETC. base-specific details, like when the sequence was submitted and CC -!- SUBUNIT: PRP HAS A TENDENCY TO AGGREGATE YIELDING POLYMERS CALLED CC "RODS". when the database entry was last updated, add further value; links CC -!- PRP CONTAINS 5 TANDEM REPEATS OF AN OCTAPEPTIDE P-H-G-G-G-W-G-Q. CC -!- SUBCELLULAR LOCATION: PRP IS ATTACHED TO THE EXTRACELLULAR SIDE OF to the literature (e.g., to articles describing the function and disease- CC THE CELL MEMBRANE BY A GPI-ANCHOR. DR EMBL; M13667; HSPRP0A. associations of a sequence) and cross-references to information in DR EMBL; M13899; HSPRP. DR PIR; A05017; A05017. related databases are also helpful. DR PIR; A24173; A24173. DR MIM; 176640; NINTH EDITION. Structuring such information systematically to facilitate computer DR MIM; 123400; NINTH EDITION. DR MIM; 13744O; NINTH EDITION. access is challenging. Plain text is meaningful to humans, but not to DR MIM; 245300; NINTH EDITION. DR PROSITE; PS00291; PRION. computers; nevertheless, the earliest databases were created as KW PRION; BRAIN; GLYCOPROTEIN; GPI-ANCHOR; TANDEM REPEAT; SIGNAL. FT SIGNAL 1 22 flat-files. This required bits of information to be pinpointed with FT CHAIN 23 253 PRION PROTEIN. FT DOMAIN 90 234 PRP27-30 (PROTEASE RESISTANT CORE). specific tags to allow the data being stored in particular database FT CARBOHYD 181 181 PROBABLE. FT CARBOHYD 197 197 PROBABLE. locations to be identified, as illustrated in Figure 1. FT DISULFID 179 214 BY SIMILARITY. FT VARIANT 102 102 P -> L (LINKED TO DEVELOPMENT OF ATAXIC FT GSS). FT VARIANT 117 117 A -> V (LINKED TO DEVELOPMENT OF FT DEMENTING GSS). FT VARIANT 129 129 M -> V. FT VARIANT 200 200 E -> K (IN SOME CJD PATIENTS). FT REPEAT 51 59 FT REPEAT 60 67 FT REPEAT 68 75 FT REPEAT 76 83 FT REPEAT 84 91 SQ SEQUENCE 253 AA; 27661 MW; 354945 CN; MANLGCWMLV LFVATWSDLG LCKKRPKPGG WNTGGSRYPG QGSPGGNRYP PQGGGGWGQP HGGGWGQPHG GGWGQPHGGG WGQPHGGGWG QGGGTHSQWN KPSKPKTNMK HMAGAAAAGA VVGGLGGYML GSAMSRPIIH FGSDYEDRYY RENMHRYPNQ VYYRPMDEYS NQNNFVHDCV NITIKQHTVT TTTKGENFTE TDVKMMERVV EQMCITQYER ESQAYYQRGS SMVLFSSPPV ILLISFLIFL IVG // Figure 2 Key features of the flat-file format of a UniProtKB/Swiss- Prot entry. Shown here is an excerpt from the human prion protein entry. The file begins with an identifying (ID) code and an accession (AC) number: the AC number (here, P04156) is computer readable; the ID code is more intelligible to humans – here, PRIO_HUMAN denotes the human prion protein.

Entry information Figure 1 Creating a flat-file database. Multiple plain-text or flat-files ID PRIO_HUMAN STANDARD; PRT; 253 AA. are appended to create a flat-file database. Within the database, each AC P04156; DT 01-NOV-1986 (REL. 03, CREATED) file, or record, contains various data fields, each identified by a specific DT 01-NOV-1986 (REL. 03, LAST SEQUENCE UPDATE) tag. The zoomed-in region shows several tags found in the EMBL format, DT 01-NOV-1991 (REL. 20, LAST ANNOTATION UPDATE) shown here with a range of fields typical of UniProtKB entries. The AC and ID codes specify a particular database entry. In princi- Inevitably, several disparate flat-file formats evolved to store dif- ple, the AC number is invariant, so a sequence should always be ferent types of biological data. Of these, one particular format be- traceable in any version of the database: in practice, however, this came popular, essentially because its structure was relatively sim- isn’t always the case – AC numbers do change. In situations where ple; it was therefore adapted for a variety of data-types by a number of databases that are still in use today (e.g., UniProtKB/Swiss-Prot, KEY TERMS UniProtKB/TrEMBL1, PROSITE2). This simple flat-file format was the EMBL data library: the core European nucleotide one originally devised to store nucleotide sequences in the EMBL Prion protein: a membrane protein that, when mis-folded, aggregates data library3. in brain & neural tissues, leading to neurodegenerative diseases 3.1 Flat-file fields & tags PROSITE: a diagnostic database of protein families & functional sites The EMBL format uses a series of marginal two-letter tags to de- Tag: a code identifying the type of data that appears in a given database scribe the type of data stored on each line of the file, as illustrated field (e.g., AC, which identifies an entry’s accession number) in the inset in Figure 1 and shown in more detail in Figure 24. UniProtKB/TrEMBL: a computer-annotated protein sequence database, the companion resource to UniProtKB/Swiss-Prot

3 A Critical Guide to UniProtKB’s flat-file format multiple AC numbers point to the same sequence, they are listed in Database cross-references the AC field, the first being the primary accession number. Following DR EMBL; M13667; HSPRP0A. the ID and AC fields, the date (DT) field contains information about DR EMBL; M13899; HSPRP. DR PIR; A05017; A05017. when the sequence entered the database, and when changes were DR PIR; A24173; A24173. DR MIM; 176640; NINTH EDITION. last made to its entry. This example shows that the entry was creat- DR MIM; 123400; NINTH EDITION. ed on 1 November 1986 (when its sequence was deposited), and its DR MIM; 13744O; NINTH EDITION. DR MIM; 245300; NINTH EDITION. annotation was later updated on 1 November 1991. DR PROSITE; PS00291; PRION. Information about the name of the protein, and synonyms and abbreviations it has, is reflected in the DE line (here, the major pro- To further expedite file processing, and database searching, some tein precursor, also referred to as PRP). Gene and locus names are of the terms used in the entry are also included as a controlled vo- given in the GN line (here, prnp). More precise species and taxo- cabulary of keywords (KW). Here, there are six keywords, including nomic specifications are given in the OS and OC lines: here, the or- prion, brain and tandem repeat. ganism species is confirmed as Homo sapiens, with classification Keywords Eukaryota…Primates. KW PRION; BRAIN; GLYCOPROTEIN; GPI-ANCHOR; TANDEM REPEAT; SIGNAL.

Name & Origin Further enriching the entry, various characteristics of the se- DE MAJOR PRION PROTEIN PRECURSOR (PRP) (PRP27-30) (PRP33-35C) quence are documented in the Feature Table (FT). In this example, GN PRNP. OS HOMO SAPIENS (HUMAN). we discover the locations of several features: a signal peptide OC EUKARYOTA; METAZOA; CHORDATA; VERTEBRATA; TETRAPODA; MAMMALIA; OC EUTHERIA; PRIMATES. (SIGNAL), from amino acids 1 to 22; possible carbohydrate attach- ment sites (CARBOHYD), at residues 181 and 197; a disulphide Each entry also includes bibliographic references. Standard tags bridge (DISULFID), between residues 179 and 214; and several oc- point to different parts of the reference structure: these include the tapeptide repeats (REPEAT). The positions of several sequence varia- number of the reference in the list (RN) – only two are shown here; tions (VARIANT) are also noted, at residues 102, 117, 129 and 200. the type of information taken from the reference (RP) – e.g., se- Where relevant, the status of such features is denoted by different quence from nucleic acid; the literature database (Medline) cross- qualifiers: e.g., PROBABLE and BY SIMILARITY indicate that a feature reference (RM) – e.g., 86300093 (note, this has since been replaced has been predicted and not been verified experimentally. by the RX tag); the authors of the paper (RA) – here, Kretzschmar et al. and Liao et al.; and finally, the place of publication (RL) – here, Feature Table the journals DNA and Science. FT SIGNAL 1 22 FT CHAIN 23 253 PRION PROTEIN. FT DOMAIN 90 234 PRP27-30 (PROTEASE RESISTANT CORE). References FT CARBOHYD 181 181 PROBABLE. RN [1] FT CARBOHYD 197 197 PROBABLE. RP SEQUENCE FROM N.A. FT DISULFID 179 214 BY SIMILARITY. RM 86300093 FT VARIANT 102 102 P -> L (LINKED TO DEVELOPMENT OF RA KRETZSCHMAR H.A., STOWRING L.E., WESTAWAY D., STUBBLEBINE W.H., FT ATAXIC GSS). RA PRUSINER S.B., DEARMOND S.J.; FT VARIANT 117 117 A -> V (LINKED TO DEVELOPMENT OF RL DNA 5:315-324(1986). FT DEMENTING GSS). RN [2] FT VARIANT 129 129 M -> V. RP SEQUENCE OF 8-253 FROM N.A. FT VARIANT 200 200 E -> K (IN SOME CJD PATIENTS). RM 86261778 FT REPEAT 51 59 RA LIAO Y.-C.J., LEBO R.V., CLAWSON G.A., SMUCKLER E.A.; FT REPEAT 60 67 RL SCIENCE 233:364-367(1986). FT REPEAT 68 75 FT REPEAT 76 83 A rich set of annotations can be found in the semi-structured FT REPEAT 84 91 comment (CC) field. This gives details, for example, of the protein’s The amino acid sequence itself is stored in the SQ field, using the function, disease associations, subunit structure, subcellular loca- standard single-letter code, alongside attributes such as its length tion, and so on – the additional -!- flags highlight specific subhead- (here, 253 amino acids) and molecular weight (here, 27,661). A com- ings, thereby facilitating computational parsing of this free-text field (this is helpful because, for well-studied proteins, the field can fill Sequence information several pages). SQ SEQUENCE 253 AA; 27661 MW; 354945 CN; MANLGCWMLV LFVATWSDLG LCKKRPKPGG WNTGGSRYPG QGSPGGNRYP PQGGGGWGQP HGGGWGQPHG GGWGQPHGGG WGQPHGGGWG QGGGTHSQWN KPSKPKTNMK HMAGAAAAGA Comments VVGGLGGYML GSAMSRPIIH FGSDYEDRYY RENMHRYPNQ VYYRPMDEYS NQNNFVHDCV NITIKQHTVT TTTKGENFTE TDVKMMERVV EQMCITQYER ESQAYYQRGS SMVLFSSPPV CC -!- FUNCTION: THE FUNCTION OF PRP IS NOT KNOWN. PRP IS ENCODED IN THE ILLISFLIFL IVG CC HOST GENOME AND IS EXPRESSED BOTH IN NORMAL AND INFECTED CELLS. // CC -!- DISEASE: PRP IS FOUND IN HIGH QUANTITY IN THE BRAIN OF HUMANS AND CC ANIMALS INFECTED WITH THE DEGENERATIVE NEUROLOGICAL DISEASES KURU, CC CREUTZFELDT-JACOB DISEASE (CJD), GERSTMANN-STRAUSSLER SYNDROME CC (GSS), SCRAPIE, BOVINE SPONGIFORM ENCEPHALOPATHY (BSE), ETC. CC -!- SUBUNIT: PRP HAS A TENDENCY TO AGGREGATE YIELDING POLYMERS CALLED KEY TERMS CC "RODS". CC -!- PRP CONTAINS 5 TANDEM REPEATS OF AN OCTAPEPTIDE P-H-G-G-G-W-G-Q. Medline: an online bibliographic database that includes information on CC -!- SUBCELLULAR LOCATION: PRP IS ATTACHED TO THE EXTRACELLULAR SIDE OF CC THE CELL MEMBRANE BY A GPI-ANCHOR. articles published in biomedical & life-science journals MIM: or OMIM (Online Mendelian Inheritance in Man), a comprehen- Links to related information in a range of other databases are sive database of human genes & genetic disorders made in the database cross-reference (DR) lines. In this example, PIR: Protein Information Resource, an integrated bioinformatics re- several cross-links are made to entries in EMBL, PIR, MIM and source supporting genomic, proteomic & systems biology research; PROSITE via their respective accession numbers and/or ID codes, the PIR-PSD (protein sequence database) grew out of the Atlas of where applicable (e.g., the associated entry in PROSITE has AC num- Protein Sequence & Structure ber PS00291 and ID code PRION). Single-letter code: individual letters used to denote the amino acids – Q for glutamine, N for asparagine, W for tryptophan, etc.

4 A Critical Guide to The UniProtKB flat-file format putation number (CN) validates the accuracy of the sequence and So far, this Guide has explored features of the EMBL format, as facilitates identification of identical sequences (this has since been exemplified by UniProtKB/Swiss-Prot. Next, we look at how some of replaced by a 64-bit Cyclic Redundancy Check (CRC64)). Finally, the the fields can be used to link to related data in other resources. entry terminates with the // symbol. EXERCISES 4 Database interoperability 1 Figure 2 shows part of the UniProtKB/Swiss-Prot entry for the human prion protein. When did the sequence first enter the database? Storing information in databases isn’t especially useful unless What is the ‘current’ date of the entry shown in the Figure? computers are able to access the data, and can help users to explore 2 To which database does the entry link? Does this da- and analyse the information they contain. However, computers can tabase confirm the identity of the protein? only effectively and efficiently access data if they are stored using 3 Which domains or features flank the octapeptide repeats? standard formats and are accessible using standard protocols. The EMBL format was originally devised to store nucleotide se- 4 Use the identifier (ID) to locate the current version of this protein in quences. At that time, an important goal was to create a common UniProtKB/Swiss-Prot. Feature Table format, so that key information could be shared with 5 Explore the entry online. What is the date of the current version, & other databases around the world (notably, with GenBank6 in the which version number is it? USA and DDBJ7 in Japan). Regulating both the content and vocabu- 3.2 The UniProtKB/Swiss-Prot Web interface lary and syntax used to describe the documented features facilitates access to, and manipulation of, the data by computers. The Web interface to UniProtKB/Swiss-Prot has changed signif- The principal way in which computers access database infor- icantly in the last 10 years. The original Swiss-Prot interface echoed mation is via the AC numbers and ID codes unique to their entries – the underlying flat-file fairly intuitively; today, however, the Uni- see Figure 3. This allows data from very different resources to be ProtKB interface is overwhelmingly complex and feature-rich, and connected, whether from a nucleotide or protein sequence data- traces of the original flat-file have virtually gone – finding relevant base, a protein family or protein structure database, a literature information quickly and efficiently can therefore be challenging. database, or whatever. The more internal cross-references an entry The current UniProtKB/Swiss-Prot entry for the human prion stores, the greater the web of connectivity that’s possible from it. protein runs to many pages (www.uniprot.org/uniprot/P04156). The features sequestered in this lengthy entry may be navigated via a menu in the left-hand margin; here, we are more interested in the set of tabs beneath the entry title: these give access to i) software tools for database searching (notably, BLAST5) and sequence align- ment; ii) different formats for viewing the entry (the text option reveals the underlying flat-file); and, arguably, most importantly iii) the complete history of an entry since its first appearance in the database. The History tab thus allows users to reconstruct the evolu- tion both of a sequence and of its annotations, which is vital for understanding how a particular entry arrived at its current form. Clicking on the History tab gives brief information about the original database entry and its current version, plus a link to Previ- ous versions. Following this link displays a table, listing each version of the entry for every release of the database. Clicking on the .txt option adjacent to each version number links directly to the flat-file for that database release. For convenience, there is an option (pe- nultimate column of the table) to compare two versions of an entry, Figure 3 Flat-file indexing. Databases are linked via their AC numbers allowing the differences between them to be swiftly reviewed. &/or ID codes. UniProtKB/Swiss-Prot is a hub of connectivity, linking to its progenitor nucleotide sequence (EMBL), related literature (Medline), 3D EXERCISES structures (PDB8), protein families and domains (InterPro9, PROSITE, 10 11 1 Via the ‘History’ tab, view flat-file version 224 of the human prion PRINTS , ), etc. For clarity, only some connections are shown. protein. How many sequences (AC numbers) has this replaced? KEY TERMS 2 How was the sequence described when it first entered the database? How is it described in version 224? BLAST: Basic Local Alignment Search Tool, a program for searching 3 How was the function of the sequence described when it first en- nucleotide or protein sequence databases with a query sequence tered the database? How is it described in version 224? DDBJ: the DNA Data Bank of Japan 4 Which metal or metals does the protein bind? Were metal-binding GenBank: the principal nucleotide sequence database of the NCBI, USA sites annotated in the original entry or in that shown in Figure 2? Index: an address or set of coordinates that allows query software to access specific parts of a flat-file database via designated tags 5 Which parts or features of the sequence bind metal ions? InterPro: the integrated database of protein families & domains 6 Which sequence did this entry replace in 2009? What was that se- PDB: , a database of 3D macromolecular structures quence (i.e., what is its description)? Pfam: a diagnostic database of protein families & domains 7 Which version is the entry illustrated in Figure 2? PRINTS: a diagnostic database that provides ‘fingerprints’ of protein 8 What slight anomaly lies in the DT lines of version 1 compared to the families, subfamilies & superfamilies version shown in Figure 2?

5 A Critical Guide to UniProtKB’s flat-file format

Flat-file databases were devised long before the advent of the dens. An altered AC number or ID code, for example, need only be Web. Then, searching for information both within and between changed once within its associated table for all tables that point to it them required their constituent fields to be indexed, and specialised to pick up the change. New tables can also be added more easily, programs to be written that were capable of exploiting those indi- without breaking the underlying data structure. Most bio-databases ces. The first widely used indexing-tool, which allowed any flat-file are now maintained within relational systems; however, many still database to be indexed to any other, was SRS, the Sequence Re- provide access to their data in flat-file formats. This is immensely trieval System. The popularity of SRS was due to the fact that the useful, because it allows users both to develop relatively simple software permitted highly focused queries across very different scripts to parse the data automatically, and, where necessary, to databases via a single interface, regardless of their specific file for- analyse and compare individual entries by hand. mats or their underlying data-types. Figure 3 illustrates how inte- grated access to diverse information across different flat-file data- EXERCISES bases is made possible via links to and from their entries’ respective 1 What was the identifier (ID) of the human prion protein sequence AC numbers and ID codes (not all formats use both AC numbers and when it first entered Swiss-Prot? What is it in version 224? ID codes; however, where they do, entries should be connected via 2 What impact is this change likely to have had on related databases their AC numbers because, designed to be machine-readable, these that connected to this entry via this ID? are generally more stable than IDs). 3 Consider again how this sequence was described when it first en- 4.1 The need for relational database management tered the database & how it’s described in version 224. What is the systems major difference in the DE fields? 4 How might this difference affect parsing software that makes use of Flat-file databases can effectively achieve interoperability if they the DE field? include standard cross-references to the contents of other resources 5 Have any lines or fields remained the same between versions 1 and and their fields have been appropriately indexed. This renders que- 224? If so, what information has stayed static in the last ~30 years? ries fast and efficient, because they can be directed to specific fields rather than searching the entire database. However, while simple, this approach is also very brittle. The example in Figure 3 shows KEY TERMS numerous database inter-connections via an assortment of AC Interoperability: the ability of software systems & databases to com- numbers and ID codes. If one of these changes in a given database municate or exchange information (i.e., to interoperate) seamlessly entry, its associated record will suddenly become invisible to all Relational database: a database in which data & their attributes are resources that connect to it; to remain visible, the new AC or ID structured & stored in non-redundant tables code must be propagated to every occurrence of this entity within Relational table: a table in which rows have unique identifying keys, every connecting database. As can be seen via the History tab in the allowing them to be linked to rows in other tables; tables represent UniProtKB/Swiss-Prot entry for the human prion protein, the ID entity ‘types’ (e.g., client, pet); each row represents an instance of code and AC numbers have changed several times since the se- that type (e.g., George, dog); & columns represent values attributed quence was first deposited in Swiss-Prot. Such changes will have to that instance (e.g., breed, weight) caused significant headaches for the curators of databases that Symmetry operators: the set of permutations that fix the origin of pointed to this entry. crystallographic unit cells in one or more coordinate directions Bio-databases were typically maintained as flat-files, partly be- cause the formats were easy to use. However, as the pace of data acquisition increased, and the accompanying body of scientific TAKE HOMES knowledge grew, keeping the data and their annotations up to date became increasingly time-consuming and error-prone (as much of 1 Molecular sequence information was originally stored in ‘flat-files’ – the work was manual). Sometimes, changes in the data required the essentially plain-text files; addition of new database fields. This either meant the flat-file for- 2 The EMBL flat-file format was adopted by several different data- mat had to be modified to accommodate the new field, with knock- bases because its structure made it easy to use & adapt; on consequences for parsing or indexing software, which would be 3 The EMBL format uses a series of two-letter tags to denote different ignorant of the change; or it ‘forced’ curators to shoe-horn the new database fields (ID & AC tags for the entry identifier & accession data into existing fields (which weren’t designed to house them), to number, DE for the descriptive title, CC for comments, FT for the avoid breaking the ‘standard format’ on whose stability other data- Feature Table, etc.); base curators rely. An example of this was the retrospective addition 4 Use of standard tags & ACs/IDs allows flat-file databases to be cross- of crystallographic symmetry operators to the PDB. To avoid break- linked to related information in other resources; ing the original format, the only way to include these data was to 5 Flat-file databases are simple to create, their contents are human- embed them in the human-readable, free-text comment field, oblig- readable & easy to understand, but they’re brittle to changes; ing parsing software to be modified to search for symmetry infor- 6 The original forms of the Swiss-Prot, TrEMBL & PROSITE databases mation in a field that, by definition, wasn’t designed to be machine- exploited the EMBL format; readable, and would therefore normally have been ignored. The rapidly growing quantities and complexity of biological data 7 Sophisticated relational database management systems can store & eventually prompted the use of relational database management manage biological data more efficiently & robustly; systems, in order to help structure the data more formally, and to 8 The origins & evolution of sequences in UniProtKB, and the prove- manage it more efficiently. Within such databases, data are organ- nance of their annotations, can be accessed & explored via the His- ised in relational tables in such a way that changes in one table can tory tab in the Web interface of each entry. be readily propagated to others, thereby easing management bur-

6 A Critical Guide to The UniProtKB flat-file format

5 References & further reading

1 The UniProt Consortium. (2017) UniProt: the universal protein knowledgebase. Nucleic Acids Res., 45(D1), D158-D169. 2 Sigrist CJ et al. (2013) New and continuing developments at PROSITE. Nucleic Acids Res., 41(Database issue), D344-D347. 3 Silvester N et al. (2018) The European Nucleotide Archive in 2017. Nucleic Acids Res., 46(D1), D36-D40. 4 Attwood TK et al. (2016) Bioinformatics Challenges at the Inter- face of Biology and Computer Science. Wiley-Blackwell. ISBN: 978-0-470-03548-1 (see Chapter 4). 5 Altschul SF et al. (1990) Basic local alignment search tool. J. Mol. Biol., 215(3), 403-410. 6 Benson DA et al. (2018) GenBank. Nucleic Acids Res., 46(D1), D41-D47. 7 Kodama Y et al. (2018) DNA Data Bank of Japan: 30th anniver- sary. Nucleic Acids Res., 46(D1), D30-D35. 8 Rose PW et al. (2017) The RCSB protein data bank: integrative view of protein, gene & 3D structural information. Nucleic Ac- ids Res., 45(D1), D271-D281. 9 Finn RD et al. (2017) InterPro in 2017 – beyond protein family and domain annotations. Nucleic Acids Res., 45(D1), D190-199. 10 Attwood TK et al. (2012) The PRINTS database: a fine-grained protein sequence annotation and analysis resource – its status in 2012. Database, 10.1093/database/base019. 11 Finn RD et al. (2016) The Pfam protein families database: to- wards a more sustainable future. Nucleic Acids Res., 44(D1), D279-D285.

6 Acknowledgements & funding

GOBLET Critical Guides marry ideas from the Higher Apprentice- ship specification for college-level students in England (www.contentextra.com/lifesciences/unit12/unit12home.aspx) with the EMBnet Quick Guide concept. The original Quick Guide to UniProtKB Swiss-Prot and TrEMBL was written by Brigitte Boeck- mann (from the SIB Swiss Institute of Bioinformatics) for distribution by EMBnet in 2006: www..org/shared/quickguides/17- guideSPTR.pdf. This Guide was developed with the support of a donation from EMBnet to the GOBLET Foundation. Design concepts and the Guide’s front-cover image were contrib- uted by CREACTIVE.

7 Licensing & availability

This Guide is freely accessible under creative commons licence CC-BY-SA 2.5. The contents may be re-used and adapted for educa- tion and training purposes. The Guide is freely available for download via the GOBLET portal (www.mygoblet.org) and EMBnet website (www.embnet.org).

8 Disclaimer

Every effort has been made to ensure the accuracy of this Guide; GOBLET cannot be held responsible for any errors/omissions it may contain, and cannot accept liability arising from reliance placed on the information herein.

7 A Critical Guide to UniProtKB’s flat-file format

About the organisations About the author

GOBLET Teresa K Attwood (orcid.org/0000-0003-2409-4235) GOBLET (Global Organisation for Bioinformatics Learning, Educa- Teresa (Terri) Attwood is a Professor of Bioinformat- tion & Training) was established in 2012 to unite, inspire and equip ics with more than 25 years’ experience teaching in- bioinformatics trainers worldwide; its mission, to cultivate the global troductory bioinformatics, in undergraduate and post- bioinformatics trainer community, set standards and provide high- graduate degree programmes, and in ad hoc courses, quality resources to support learning, education and training. workshops and summer schools, in the UK and abroad. GOBLET’s ethos embraces: With primary expertise in protein sequence analysis, she created • inclusivity: welcoming all relevant organisations & people the PRINTS protein family database and co-founded InterPro (her particular interest is in the analysis of G protein-coupled receptors). • sharing: expertise, best practices, materials, resources She has also been involved in the development of software tools for • openness: using Creative Commons Licences protein sequence analysis, and for improving links between research

• innovation: welcoming imaginative ideas & approaches data and the scientific literature (most notably, ). • tolerance: transcending national, political, cultural, social & She wrote the first introductory bioinformatics text-book; her disciplinary boundaries third book was published in 2016: Further information about GOBLET and its Training Portal can be • Attwood TK & Parry-Smith DJ. (1999) Introduction to Bioin- found at www.mygoblet.org and in the following references: formatics. Prentice Hall. • Attwood et al. (2015) GOBLET: the Global Organisation for • Higgs P & Attwood TK. (2005) Bioinformatics & Molecular Bioinformatics Learning, Education & Training. PLoS Com- Evolution. Wiley-Blackwell.

put. Biol., 11(5), e1004281. • Attwood TK, Pettifer SR & Thorne D. (2016) Bioinformatics challenges at the interface of biology and computer science: • Corpas et al. (2014) The GOBLET training portal: a global re- Mind the Gap. Wiley-Blackwell. pository of bioinformatics training materials, courses & trainers. Bioinformatics, 31(1), 140-142. Affiliation GOBLET is a not-for-profit foundation, legally registered in the School of Computer Science, The , Ox- : CMBI Radboud University, Nijmegen Medical Centre, ford Road, Manchester M13 9PL (UK). Geert Grooteplein 26-28, 6581 GB Nijmegen. For general enquiries, contact [email protected]. EMBnet

EMBnet, the Global Bioinformatics Network, is a not-for-profit or- ganisation, founded in 1988 as a network of institutions, to establish and maintain bioinformatics services across Europe. As the network grew, its reach expanded beyond European borders, creating an international membership to support and deliver bioinformatics services across the life sciences: www.embnet.org. Since its establishment, a focus of EMBnet's work has been bioin- formatics Education and Training (E&T), and the network therefore has a long track record in delivering tutorials and courses world- wide. Perceiving a need to unite and galvanise international E&T activities, EMBnet was one of the principal founders of GOBLET. For more information and general enquiries, contact [email protected].

CREACTIVE

CREACTIVE, by Antonio Santovito, specialises in communication and Web marketing, helping its customers to create and manage their online presence: www.gocreactive.com.

8