Pfam: the Protein Families Database in 2021 Jaina Mistry 1,*, Sara Chuguransky 1, Lowri Williams 1, Matloob Qureshi 1, Gustavo A

D412–D419 Nucleic Acids Research, 2021, Vol. 49, Database issue Published online 30 October 2020 doi: 10.1093/nar/gkaa913 Pfam: The protein families database in 2021 Jaina Mistry 1,*, Sara Chuguransky 1, Lowri Williams 1, Matloob Qureshi 1, Gustavo A. Salazar1, Erik L.L. Sonnhammer2, Silvio C.E. Tosatto 3, Lisanna Paladin 3, Shriya Raj 1, Lorna J. Richardson 1, Robert D. Finn 1 and Alex Bateman 1 1European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK, 2Department of Biochemistry and Biophysics, Science for Life Laboratory, Stockholm University, Box 1031, 17121 Solna, Sweden and 3Department of Biomedical Sciences, University of Padua, 35131 Padova, Italy Received September 11, 2020; Revised October 01, 2020; Editorial Decision October 02, 2020; Accepted October 06, 2020 ABSTRACT per-family gathering thresholds. Pfam entries are manually annotated with functional information from the literature The Pfam database is a widely used resource for clas- where available. sifying protein sequences into families and domains. Since Pfam release 29.0, pfamseq is based on UniPro- Since Pfam was last described in this journal, over tKB reference proteomes, whilst prior to that, it was based 350 new families have been added in Pfam 33.1 and on the whole of UniProtKB (3,4). Although the underly- numerous improvements have been made to existing ing sequence database is based on reference proteomes, all entries. To facilitate research on COVID-19, we have of the profile HMMs are also searched against UniProtKB revised the Pfam entries that cover the SARS-CoV-2 and the resulting matches are made available on the Pfam proteome, and built new entries for regions that were website and in a flatfile format. Similarly, the match data not covered by Pfam. We have reintroduced Pfam-B for representative proteomes (5) are provided in the same which provides an automatically generated supple- way. Pfam can be accessed via the website at https://pfam. xfam.org. The flatfiles and exports of the Pfam MySQL ment to Pfam and contains 136 730 novel clusters of database are provided under a CC0 license for each re- sequences that are not yet matched by a Pfam fam- lease of the database, and can be found on the FTP site ily. The new Pfam-B is based on a clustering by the ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases. MMseqs2 software. We have compared all of the re- When a Pfam entry is built, we typically search it itera- gions in the RepeatsDB to those in Pfam and have tively against pfamseq in order to find more distant homo- started to use the results to build and reﬁne Pfam logues. Pfam entries are built such that there are no over- repeat families. Pfam is freely available for browsing laps between them; this means that the same region of a se- and download at http://pfam.xfam.org/. quence should not match more than one family. This non- overlap rule proves to be an excellent quality control cri- teria which helps to avoid including false positive matches INTRODUCTION into a family. From Pfam 28.0, we relaxed this rule to allow Pfam is a database of protein families and domains that is small overlaps between families as it had become increas- widely used to analyse novel genomes, metagenomes and ingly time-consuming to resolve all such overlaps each time to guide experimental work on particular proteins and sys- we updated pfamseq (see (3) for more details). tems (1,2). Each Pfam family has a seed alignment that Sets of Pfam entries that we believe to be evolutionarily contains a representative set of sequences for the entry. related are grouped together into clans (6). Relationships A profile hidden Markov model (HMM) is automatically between entries are identified through sequence similarity, built from the seed alignment and searched against a se- structural similarity, functional similarity and/or profile- quence database called pfamseq using the HMMER soft- profile comparisons using software such as HHsearch7 ( ). ware (http://hmmer.org/). All sequence regions that satisfy Where possible, we build a single comprehensive profile a family-specific curated threshold, also known as the gath- HMM to detect all members of a family. For some of the ering threshold, are aligned to the profile HMM to create larger superfamilies where this is not possible, we build the full alignment. It is worth noting that a common mis- multiple profile HMMs and put them in the same clan. use of Pfam is to use a single E-value threshold across all As families in a clan are evolutionarily related, we allow Pfam HMMs, which results in lower sensitivity and an in- them to overlap with other members of the same clan. The crease in false positive matches when compared to using the clans are competed such that if there are multiple over- *To whom correspondence should be addressed. Tel: +44 1223 494100; Fax: +44 1223 494468; Email: [email protected] C The Author(s) 2020. Published by Oxford University Press on behalf of Nucleic Acids Research. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. Nucleic Acids Research, 2021, Vol. 49, Database issue D413 lapping profile HMM matches to families from a single FAMILY IMPROVEMENTS clan, only the match with the lowest E-value is displayed Despite almost 25 years of active curation, there remain on the website, or included in the full alignment of the many protein domains and families that are as yet unclassi- entry. fied by Pfam. New families added to Pfam are created from Here we describe Pfam 33.1 and some of the family up- a range of sources, including Pfam-B families and protein dates we have made since the last release. We also detail the structure. Pfam-B alignments in particular have been a very Pfam coverage of the SARS-CoV-2 proteome, and the new fruitful substrate for families, with historically nearly a third method to create Pfam-B which we re-introduced in Pfam of all Pfam entries being built from them. The new Pfam- 33.1. Finally we describe the results of the analysis we car- B (described below) was used to build 18 entries in Pfam ried out on comparing repeats in the Pfam database to Re- 33.1. We expect that Pfam-B will again become a very use- peatsDB. ful source of additional families in the coming years. Protein structures from databases such as the Protein PFAM 33.1 RELEASE Data Bank (PDB) (9,10) are particularly amenable as a source of new Pfam entries because the Pfam domain Pfam 33.0 was due to be released in March 2020, however boundaries can be defined precisely from the structure. due to the global COVID-19 pandemic, we delayed its re- There is often an associated paper to the structure which lease so that we could focus on improving our SARS-CoV- we use to help annotate the Pfam entry. Building families 2 models (see ‘COVID-19 updates’ section for more de- using protein structures is an ongoing activity, and some of tail). The updated SARS-CoV-2 models, and a handful of the new SARS-CoV-2 families were built in this way. Ad- other new Pfam entries were added to Pfam 33.0 to create ditionally, we have made 37 new families based on clus- Pfam 33.1, which was released in May 2020. Release 33.0 ters of sequences from the MGnify metagenomic protein was never officially released on the Pfam website, although database (11). The MGnify clusters may be another large the database files and flat files are made available onthe source from which we could build families from in the fu- ftp site. ture. Using metagenomics clusters can help to cover regions Pfam 33.1 contains 18 259 families and 635 clans. Since of sequence space that are not well covered by existing ge- Pfam 32.0, we have built 355 new families, deleted 25 fami- nomic sequencing projects. lies and built 8 new clans. Just over 39% of all Pfam families Pfam has created numerous entries for domains of un- are placed within a clan. Of the sequences in UniProtKB, known function (DUF) and uncharacterized protein fami- 77.0% have at least one match to a Pfam entry, and 53.2% lies (UPF). Over time the functions of some of these are dis- residues in UniProtKB fall within a Pfam entry. These fig- covered. We continue to scan the literature to identify newly ures, termed the sequence and residue coverage respectively, identified functions as well as receiving updates from the have remained fairly constant over the past five Pfam re- community via our helpdesk. Many functions are also iden- leases despite a 240% increase in UniProtKB size during tified by the InterPro database12 ( ) update process which that time (see Figure 1). Since 2015, highly redundant bac- checks whether the UniProtKB/Swiss-Prot descriptions of terial proteomes have been identified and removed from proteins in each InterPro entry have changed between re- UniProtKB (8). This process ensures a level of diversity in leases. When these changes identify a function, Pfam is no- the sequences added to UniProtKB, and prevents, for ex- tified. To date we have changed the identifiers of 1132 DUF ample, multiple strains of a particular species being added. or UPF families indicating that a function has been iden- Pfam is able to maintain a sequence coverage of ∼77%, and tified for these families.

Pfam: the Protein Families Database in 2021 Jaina Mistry 1,*, Sara Chuguransky 1, Lowri Williams 1, Matloob Qureshi 1, Gustavo A

Enhanced Representation of Natural Product Metabolism in Uniprotkb

The EMBL-European Bioinformatics Institute the Hub for Bioinformatics in Europe

Tunca Doğan , Alex Bateman , Maria J. Martin Your Choice

Evolution and Function of Drososphila Melanogaster Cis-Regulatory Sequences

Molecular Genetics & Genomics

Rapid Identification of Novel Protein Families Using Similarity Searches [Version 1; Peer Review: 2 Approved]

Structure-Based Realignment of Non-Coding Rnas in Multiple Whole Genome Alignments

Generating Functional Protein Variants with Variational Autoencoders

Genome Informatics

Enhanced Protein Domain Discovery by Using Language Modeling Techniques from Speech Recognition

Introduction

CV Burkhard Rost