The Pfam Protein Families Database

D290–D301 Nucleic Acids Research, 2012, Vol. 40, Database issue
Published online 29 November 2011
doi:10.1093/nar/gkr1065

The Pfam protein families database

Marco Punta1,*, Penny C. Coggill1, Ruth Y. Eberhardt1, Jaina Mistry1, John Tate1, Chris Boursnell1, Ningze Pang1, Kristoffer Forslund2, Goran Ceric3, Jody Clements3, Andreas Heger4, Liisa Holm5, Erik L. L. Sonnhammer2, Sean R. Eddy3, Alex Bateman1 and Robert D. Finn3

1Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton CB10 1SA, UK, 2Stockholm Bioinformatics Center, Swedish eScience Research Center, Department of Biochemistry and Biophysics, Science for Life Laboratory, Stockholm University, Box 1031, SE-17121 Solna, Sweden, 3HHMI Janelia Farm Research Campus, 19700 Helix Drive, Ashburn, VA 20147, USA, 4Department of Physiology, Anatomy and Genetics, MRC Functional Genomics Unit, University of Oxford, Oxford, OX1 3QX, UK and 5Institute of Biotechnology and Department of Biological and Environmental Sciences, University of Helsinki, PO Box 56 (Viikinkaari 5), 00014 Helsinki, Finland

Received October 7, 2011; Revised October 26, 2011; Accepted October 27, 2011

*To whom correspondence should be addressed. Tel: +44 1223 497399; Fax: +44 1223 494919; Email: [email protected]

© The Author(s) 2011. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

ABSTRACT

Pfam is a widely used database of protein families, currently containing more than 13 000 manually curated protein families as of release 26.0. Pfam is available via servers in the UK (http://pfam.sanger.ac.uk/), the USA (http://pfam.janelia.org/) and Sweden (http://pfam.sbc.su.se/). Here, we report on changes that have occurred since our 2010 NAR paper (release 24.0). Over the last 2 years, we have generated 1840 new families and increased coverage of the UniProt Knowledgebase (UniProtKB) to nearly 80%. Notably, we have taken the step of opening up the annotation of our families to the Wikipedia community, by linking Pfam families to relevant Wikipedia pages and encouraging the Pfam and Wikipedia communities to improve and expand those pages. We continue to improve the Pfam website and add new visualizations, such as the ‘sunburst’ representation of taxonomic distribution of families. In this work we additionally address two topics that will be of particular interest to the Pfam community. First, we explain the definition and use of family-specific, manually curated gathering thresholds. Second, we discuss some of the features of domains of unknown function (also known as DUFs), which constitute a rapidly growing class of families within Pfam.

INTRODUCTION

Pfam is a database of protein families, where families are sets of protein regions that share a significant degree of sequence similarity, thereby suggesting homology. Similarity is detected using the HMMER3 (http://hmmer.janelia.org/) suite of programs.

Pfam contains two types of families: high quality, manually curated Pfam-A families and automatically generated Pfam-B families. The latter are derived from clusters produced by the ADDA algorithm (1), followed by the subtraction of overlapping Pfam-A regions at each release. Pfam-A families are built following what is, in essence, a four-step process:

(i) building of a high-quality multiple sequence alignment (the so-called seed alignment);
(ii) constructing a profile hidden Markov model (HMM) from the seed alignment (using HMMER3);
(iii) searching the profile HMM against the UniProtKB sequence database (2) and
(iv) choosing family-specific sequence and domain gathering thresholds (GAs); all sequence regions that score above the GAs are included in the full alignment for the family (GAs are described in detail in a later section of this paper).

In addition to providing matches to UniProtKB, Pfam also provides matches for the NCBI non-redundant database, as well as a collection of metagenomic samples. We generate a variety of data downstream, including, among others, a family sequence-conservation logo based on the HMM, a description of domain architectures, where all co-occurrences with other domains are reported, and a species tree summarizing the taxonomic range in the family.

The quality of the seed alignment is the crucial factor in determining the quality of the Pfam resource, influencing not only all data generated within the database but also the outcome of external searches that use our profile HMMs, e.g. to assign domains to proteins which are part of newly sequenced genomes. For this reason, a considerable curatorial effort goes into seed alignment generation.

Members of the same Pfam family are expected to share a common evolutionary history and thus at least some functional aspect. Ideally, our families should represent functional units, which, when combined in different ways, can generate proteins with unique functions. The ultimate goal of Pfam is to create a collection of functionally annotated families that is as representative as possible of protein sequence-space, such that our families can be used effectively for both genome-annotation and small-scale protein studies. It must be stressed, however, that homology is no guarantee of functional similarity and transfer of functional annotation based solely on family membership should always be undertaken with caution. On the other hand, additional data that are available from Pfam, such as conservation of family signature residues or conservation of common domain architectures, can increase confidence in a given functional hypothesis. For more background on how to query and use our web interface please refer to Coggill et al. (3).

In this paper, we report on the most recent Pfam release (26.0) as well as on important changes that have been introduced over the last 2 years, since our 2010 NAR database issue paper (where we presented release 24.0) (4). Arguably, the change carrying the most significant philosophical implications has been the decision to follow the lead of the Rfam database (5) and out-source functional annotation of Pfam families to Wikipedia. We will discuss the background to this decision and give details of the progress towards Wikipedia coverage of Pfam families. Another important development has been the adoption of the iterative sequence-search program jackhmmer (6) as our principal tool for generating new families. In addition, we have extended our mechanism for family curation, which now allows trained and trusted external collaborators to create and add their own families to Pfam. Finally, we will take this opportunity to address and present fresh analysis on two topics that we consider of particular importance: family-specific GAs and Domains of Unknown Function (DUFs).

WHAT’S NEW

Community annotation

[...] give users an overview of the function(s) of the family or domain. However, rather than have the Pfam curators describe our families, we would strongly prefer to have annotations written by those who know the proteins and families best, namely the biologists and informaticians who work with them on a day-to-day basis. Harnessing the knowledge of these experts remains a significant challenge.

One recent approach has been to use Wikipedia as a source of scientific information (7,8). Wikipedia is the world’s largest online encyclopedia, with over 3.7 million English language articles, and is widely acknowledged to be the most popular general reference work on the internet. A cornerstone of Wikipedia is that anyone can edit the content.

The Rfam database moved all of its family annotation into articles in Wikipedia in 2009, thereby allowing anyone to freely edit and improve their content. This experiment has proved successful, engaging the wider scientific community to provide expert annotations and improving the overall quality of the Rfam annotation (7). In light of this positive experience, we decided to adopt the same approach and use Wikipedia as the primary source of Pfam annotation (Figure 1A).

The Rfam database included around 600 families when the switch to Wikipedia was made. This made it feasible to assign existing articles where possible and to generate new ‘stub’ articles in Wikipedia for any family that still lacked any relevant article. These stubs have since been gradually expanded and improved by the Rfam and Wikipedia communities. At the time when Pfam began using Wikipedia for annotations, release 25.0, there were 12 273 Pfam-A families. We initially identified existing articles that described protein domains or families and which provided useful information about the Pfam family. These articles are now assigned as the primary annotation for the appropriate families. Given the number of families that remain without an article, however, it is simply not feasible to manually generate articles for all of them. For these families we continue to show the original annotation comments, which were written by the curator of the family, while encouraging our users to tell us about appropriate Wikipedia articles or to create them on our behalf. Furthermore, it is likely that there will be some families that are not sufficiently notable for inclusion in Wikipedia and we anticipate that many of these will remain without Wikipedia annotations for the long term, perhaps indefinitely.

As of release 26.0, there are 4909 Pfam families that link to 1016 Wikipedia articles. We invite readers and users of the Pfam website to edit and improve these articles in Wikipedia. Mapping of Pfam-A families to Wikipedia articles is available in JSON format, from: http://pfamsrv.sanger.ac.uk/cgi-bin/mapping.cgi?db=pfam.
