Perspective The Rat Genome Database Curators: Who, What, Where, Why

Mary Shimoyama*, G. Thomas Hayman, Stanley J. F. Laulederkind, Rajni Nigam, Timothy F. Lowry, Victoria Petri, Jennifer R. Smith, Shur-Jen Wang, Diane H. Munzenmaier, Melinda R. Dwinell, Simon N. Twigger, Howard J. Jacob, the RGD Team" Rat Genome Database, Human and Molecular Genetics Center, Medical College of Wisconsin, Milwaukee, Wisconsin, United States of America

Biological databases have become ubiq- Who Are the Curators? curation projects for several months before uitous as demonstrated by the 1,170 curating independently. Curators begin molecular biology databases catalogued Currently, RGD has seven full-time with a single data type, such as in the 2009 Nucleic Acids Research online curators who bring diverse educational curation, and a single type of ontology- Molecular Biology Database Collection specialties and research expertise to the based biological annotation in order to [1]. While researchers have come to rely curation process. Five of the curators have develop adequate skills in this area before on the valuable data in these resources, PhDs and two have master’s degrees. training in other data types. In addition to there often is little understanding of how Educational specialties include molecular the curation manual and one-on-one the data are acquired and integrated, and and cellular biology, physiology, biochem- training, there is a weekly curation meet- what roles professional curators play in istry, microbiology, experimental patholo- ing in which standards, policies, and new these processes. The increasing number of gy, and organic chemistry. As is common data types are discussed in order to curators, their importance to the research in the field of curation, the RGD curators maintain existing standards and develop world, and growing recognition of their bring a vast amount of experience in wet new ones. Standards for new data types professional contributions have resulted in lab research ranging from 9 to 26 years are developed through consultations with two important events this year: the official with a mean of 16.9 years. They are well researchers having expertise in those creation of the International Society for versed in research methods for cell and areas, other database Biocuration [2] and the launch of DA- tissue culture, protein biochemistry, mo- groups, and the curation group at RGD. TABASE–The Journal of Biological Da- lecular biology, large and small animal Inter-curator disagreements, which do tabases and Curation by Oxford Journals physiology and surgery, developmental occur even between highly experienced [3]. These will provide valuable profes- biology, infectious disease, microbiology, curators, are discussed in weekly meetings sional platforms to publish the research, virology, biophysics, and bio-computation. with pertinent examples from the litera- data developments, and other advances Because of the wide array of experiences of ture and, when necessary, consultation created by thousands of biocurators curators, RGD developed a comprehen- with outside experts. Consensus decisions around the world and to educate the sive training program and curation man- are added to the curation manual. There public about this vital, developing profes- ual to ensure that curators follow rigorous are additional mechanisms through which sion. One database where such curation standards in identifying data elements, curators can submit new terms to specific work takes place is the Rat Genome assigning nomenclature, and annotating ontologies, often hosted at SourceForge to Database (RGD; http://rgd.mcw.edu) biological information to , QTLs, accommodate data not covered by current [4]. and strains. RGD adheres to the Gene procedures. Periodically all curators at Established in 1999, RGD is the Ontology (GO) (www.geneontology.org) RGD curate the same paper and these premier repository of rat genomic and guidelines and has adapted similar guide- results are compared and discussed in the genetic data and currently houses over lines for other types of biological informa- meeting to ensure consistency across 33,421 rat genes as well as human and tion. Biological annotations are based on curators. It takes approximately 1 year mouse homologs, 1,698 rat (all that have experimental results published in peer- for a curator to become competent in been published to date) and 1,579 human reviewed journals. New curators are curating one area of data and several years quantitative trait loci (QTL), and 2,114 rat trained by a senior curator and do test for real expertise to develop. While two of strains. Biological information curated for these data elements includes disease asso- Citation: Shimoyama M, Hayman GT, Laulederkind SJF, Nigam R, Lowry TF, et al. (2009) The Rat Genome ciations, phenotype data, pathways, mo- Database Curators: Who, What, Where, Why. PLoS Comput Biol 5(11): e1000582. doi:10.1371/journal.pcbi. lecular functions, biological processes, and 1000582 cellular components. Curators are in- Editor: Philip E. Bourne, University of California San Diego, United States of America volved in acquiring and validating data Published November 26, 2009 elements, attaching biological information Copyright: ß 2009 Shimoyama et al. This is an open-access article distributed under the terms of the to elements, identifying pathways, and Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any making connections among data types. medium, provided the original author and source are credited. The following glimpse at the curators and Funding: RGD is funded by grant HL64541 from the National Heart, Lung, and Blood Institute (http://www. curation processes at RGD is designed nhlbi.nih.gov) on behalf of the NIH. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. to illustrate who curators are, what they do, where their results can be seen, and Competing Interests: The authors have declared that no competing interests exist. why their efforts make researchers’ lives * E-mail: [email protected] easier. " Members of the Rat Genome Database are noted in the Acknowledgments.

PLoS | www.ploscompbiol.org 1 November 2009 | Volume 5 | Issue 11 | e1000582 The Rat Genome Database Curators

RGD’s curators are new to the field, the dures ensure proper gene identification, strains, this accurate tracking of strains other five have between 2 and 8.5 years of presented on gene report pages available and their history will become more curation experience with an average of 5.7 to users on the RGD Web site (Figure 2A), important as researchers attempt to link years. Traits that RGD curators cite as and allow the curator to assign official phenotype data from multiple studies with valuable in being a curator include broad nomenclature (Figure 2B), proper orthol- the correct genome data. knowledge of scientific research and re- ogy relationships (Figure 2C), and provide search methods, attention to detail, perse- accurate alignment with data at other Biological Data Curation verance, and the ability to collaborate as a sources (Figure 2D). Gene identification Providing comprehensive biological in- group. Many point out that curation is a and nomenclature are exchanged with formation on rat genes, QTLs, and strains rewarding alternative career for scientists resources such as Entrez Gene, IPI, Mouse is a formidable task both in number of to one in bench research. While curators Genome Database (MGD), HGNC, and elements to be curated and the amount of are trained to serve in multiple roles and others to provide a shared accurate rat literature available. Of the more than handle various types of data, they do gene dataset. This allows researchers to 35,400 genes in RGD, 31,304 have specialize in particular areas to promote retrieve the same genes of interest at associated sequence data or RefSeqs, and standardization and facilitate acquisition multiple sites, to look at the genes across of these, 15,607 are protein-coding with of greater knowledge and expertise. Fol- species, and to navigate from RGD to curated RefSeqs [4,5]. In addition there lowing is a discussion of curation and outside resources with confidence that are 2,114 strains with approximately 200 other functions performed by RGD cura- they are reviewing the same genes. added per year and 1,698 QTLs with over tors with an emphasis on what is done, Because the same gene may have been 180 added per year. A search at PubMed where these results can be seen, and why referred to by a variety of symbols and (http://www.ncbi.nlm.nih.gov/Literature/) these efforts are important to researchers. names over the years, and may have reveals there are over 1,267,000 rat manu- multiple sequences and identifiers attached scripts. In order to provide functional What Curators Do and Where to it, it would be extremely difficult for information for as many genomic ele- Their Work Can Be Accessed researchers to determine whether a gene ments and strains as possible and to referenced in a paper or at a database is Genomic Element Identification– leverage the information in the vast identical to one named elsewhere without amount of rat literature, RGD has imple- Pipelines the efforts by RGD curators to validate mented a number of processes. RGD uses Curators play an important role in the identification and ensure accuracy. four ontologies to standardize biological identification and verification of genomic information for genes, QTLs, and strains. elements such as genes and QTLs, and of QTL and Strain Identification These include the GO [6], the Mamma- the multiple types of rat strains used in QTLs and strains are generally extract- lian Phenotype Ontology (MP) [7], a research today. While many are familiar ed from the literature or are submitted to disease ontology (DO) based on the with the role curators play in extracting RGD by researchers. Using data submis- Medical Subject Headings (MeSH) [8], data from the literature, less understood is sion forms on the RGD Web site, and the Pathway Ontology developed at the role they play in automated data researchers specify whether the work they RGD (PW) (Figure 3) [4]. In general, acquisition processes. At RGD, a number are submitting is published or not. If the RGD’s primary focus is manual annota- of automated pipelines are used for data latter is the case, they can elect to have tion of rat genes, while GO annotations acquisition from other databases and for their submitted data displayed only after for mouse and human orthologs are quality control. Examples include pipe- its publication. Credit is given for data brought in from the MGD and the Gene lines to import basic rat, mouse, and displayed prior to publication as a refer- Ontology Annotation Database (GOA), human gene and sequence data from the ence indicating data submitted via person- respectively. Disease and pathway anno- searchable gene database Entrez Gene, al communication by the researcher to tations, however, are added to the mouse and one which imports an orthology RGD. After publication, credit is given on and human orthologs during the manual relationship dataset shared by RGD, the report page as the cited reference curation process. Biological annotations Mouse Genome Informatics (MGI), and containing the data. As with genes, the consist of several parts including the the HUGO (Human Genome Organiza- curators ensure identity and determine ontology term, the evidence code, and tion) Gene Nomenclature Committee definitions and descriptions for QTLs and the reference. Evidence codes for the GO (HGNC; Figure 1). Other pipelines bring strains, as well as assign nomenclature. indicate that the annotation is based on in identifiers and links to genes and Curators record the flanking and peak experimental results, direct assays, physi- proteins from the International Protein QTL markers to ensure accurate genomic cal interactions, mutant phenotypes, ge- Index (IPI) and Ensemble, among others. positions, and indicate the associated netic interactions, or expression patterns. Curators identify the data sources, design traits, as well as the strains used in the All of this information, plus links to the quality control measures to be includ- cross. This provides an unambiguous additional related data, appears on the ed in the pipeline, and test the pipeline identity for each QTL, allowing research- gene report pages, accessible via searches output for validation. In addition, each ers to view QTLs on map tools and to from the RGD home page. weekly run of the pipeline creates a file of differentiate one QTL from another On average the curators at RGD data conflicts which is reviewed by a knowing with confidence whether a QTL extract information from approximately curator with the conflicts being resolved referenced in one paper is the same one 5,000 papers per year. From these, manually. mentioned in another paper. Curators also multiple types of biological information Approximately 1,500 gene conflicts are provide nomenclature and history for are annotated for approximately 2,000 resolved annually as more sequence be- strains and substrains to identify the genes, 180 rat QTLs, and 200 strains per comes available, gene models are refined, correct genetic background. As next gen- year. While QTL and strain data are and orthology relationships reviewed and eration sequencing makes it possible to curated as papers are published, RGD validated. These comprehensive proce- produce genome sequences for individual targets specific gene datasets for curation.

PLoS Computational Biology | www.ploscompbiol.org 2 November 2009 | Volume 5 | Issue 11 | e1000582 The Rat Genome Database Curators

Figure 1. Quality Control (QC) for orthology relationship pipeline. Overview of the automated pipeline decision-making process used to generate relationships between rat, human, and mouse genes. doi:10.1371/journal.pcbi.1000582.g001

PLoS Computational Biology | www.ploscompbiol.org 3 November 2009 | Volume 5 | Issue 11 | e1000582 The Rat Genome Database Curators

Figure 2. Gene report page showing curator contributions to gene identification and verification. doi:10.1371/journal.pcbi.1000582.g002

These include genes related to specific in Man (OMIM), disease and mutation aliases, and protein names, as well as those disease areas (which generate the most specific databases, and the literature. In a for the human and mouse orthologs. literature), gene families or pathways, as recent initiative, the first step of developing Audits have shown that 75%–85% of well as those done in conjunction with the searches and compiling, verifying, and papers returned in a search are associated Reference Genomes pro- processing the gene list took nearly with the targeted gene. A number of ject [9]. To commence curation, a gene list 7 hours. The list of genes is prioritized papers can be discarded by title and is created from searches in databases such and distributed among curators who another segment through review of the as GeneCards, the Genetic Association conduct a PubMed search for rat literature abstract. The remaining papers are Database, Online Mendelian Inheritance for each gene using symbols and names, opened and read. Some of these do not

PLoS Computational Biology | www.ploscompbiol.org 4 November 2009 | Volume 5 | Issue 11 | e1000582 The Rat Genome Database Curators

Figure 3. Curated biological annotations for Gene Ontology, diseases, phenotypes, and pathways. doi:10.1371/journal.pcbi.1000582.g003 result in annotations because the organism as their interactions with the community Why–Facilitating Research was not clearly indicated or the paper did often lead to proposals for new data Saving the Researcher Time not contain relevant data. An example of components and tools and new features To assess the value to researchers of the process of curating elements from a on existing tools. In other cases, the curator curation work done for genes, RGD manuscript is shown in Figure 4. Using is responsible for the primary design and undertook two audits. In the first, a review customized software, these manual anno- implementation process. For example, a of a subset of cancer genes showed that 3 tations are added to the report pages for curator creates and populates the interac- to 43 papers were curated per gene with the genes being curated. tive pathway diagrams that RGD is an average of 19 papers per gene. In the publishing using a content management second study, for a subset of 20 User Interfaces and Software Tools system and the Ariadne Pathway Studio genes, more comprehensive statistics were The RGD curators bring extensive, software. Curators play a major role in the tracked. Over 16,000 papers were re- diverse research backgrounds to the design development of all aspects of RGD tech- turned in searches, 410 abstracts reviewed, process for user interfaces and software nology from the home page design to the and 104 full papers read. This produced tools available at the RGD Web site. search functions and accompanying results an average of 20 abstracts and 5 full Curators map out use case scenarios for layouts to the genome tools. To coordinate papers read per gene. Of these, ontology searches, data analysis tools, and other Web curation and informatics projects, there is a annotations were made from 3–4 papers site functions based on the points of view of weekly RGD operations meeting attended per gene. Time spent reading abstracts multiple types of researchers. This allows by all curators, software developers, the and full papers averaged 106 min per RGD to create tools and interfaces that program manager, and co-PIs to address gene. If we extrapolate for the cancer fulfill the needs of naı¨ve users as well as deadlines and priorities. To minimize genes in the first study, this would have seasoned users and bioinformaticians. Cu- conflict and ensure timely completion of meant approximately 380 min for each of rators work closely with the projects, tracking and documentation is these more highly published genes. Al- staff to develop the conceptual and func- accomplished through Microsoft Project though three of the targeted genes did not tional design of curation software and they and Mindjet MindManager software. Tool receive any annotations during the pro- also test all new tools and features for bugs and issues are recorded and tracked cess, the remaining 17 received 195 usability and accuracy of results before they through JIRA software. Documentation of annotations. Because each annotation is are made available to the user via links from requirements, statements of work, and linked to the reference from which it was the RGD home page. In addition, their release procedures minimize confusion taken (Figure 5), the user can go directly to familiarity with emerging literature, new and conflict between curators and bioinfor- papers of most interest with little time research techniques, and data types, as well matics staff. spent on searches, reviewing titles, or

PLoS Computational Biology | www.ploscompbiol.org 5 November 2009 | Volume 5 | Issue 11 | e1000582 The Rat Genome Database Curators

researchers for nomenclature assignment to genes, QTLs, and strains, as well as construction of customized datasets.

The Curators’ Wish List The curators at RGD are representative of curators around the world. They are highly trained and experienced research- ers dedicated to interpreting and present- ing critical information to other research- ers via innovative data mining and presentation tools in order to facilitate and enhance their colleagues’ work. Their presence and contributions can be seen on RGD Web pages, in the software tools, and in the accurate, well represented data that will lead other researchers to impor- tant discoveries. There are, however, a number of developments that would not only make the lives of curators at RGD and around the world better, but improve the curated data they compile and most importantly have a strong, positive impact on all other databases and all researchers at large. Some of these are highlighted here.

External Requests 1. Timely access to full text papers. Even institutions with comprehensive journal subscription programs are not able to provide easy access to every pertinent paper, resulting in delays due to requests for interlibrary loans and other methods used to access these papers. PubMed Central (http://www. pubmedcentral.nih.gov/) currently provides access to full text papers from 450 journals plus those submitted under the NIH Public Access Policy. Figure 4. Manuscript curation example. The highlighted manuscript data are annotated as This policy requires authors funded by the indicated Gene and Disease ontology terms. doi:10.1371/journal.pcbi.1000582.g004 NIH to submit full papers published in peer-reviewed journal to PubMed Cen- tral within 12 months of publication abstracts. As can be seen by the amount of Us button on each page, and through [10]. While these developments have time spent in curation, the time savings for telephone calls and the Rat Community made access easier, immediate open researchers are substantial. Forum. RGD has published seven tutorial access at time of publication would videos at SciVee (http://www.scivee.tv/), a allow curators to provide more timely Education and Outreach Web site for video publications for re- updates to data. An essential part of the curator’s job is to search, and videos are available at RGD as provide education and training for RGD well. Curators present tutorials at major 2. Author submission of data prior users and potential users. This is accom- conferences such as Experimental Biology, to publication. Those rat researchers plished in several ways. The RGD Web site Society of Toxicology, Neuroscience and who submit data to RGD prior to contains a ‘‘Help’’ section developed by the Rat Genomics and Models, as well as publication receive RGD IDs for their curators and which is accessible from all individual rat research laboratories. In data and nomenclature review and pages. This component contains a Glossary addition, RGD is well represented at major approval. This ensures that the infor- of Terms, general information on how to conferences such as Biology of Genomes, mation in their papers is synchronized use the searches and tools, a Frequently Genome Informatics, Intelligent Systems with that at RGD and is made Asked Questions (FAQ) section, and a for Molecular Biology, and the Interna- available to the public at the time of component that walks users through typical tional Mammalian Genome Conference publication. use case scenarios. Curators also handle with presentations and posters highlighting 3. Use of unique identifiers in pub- individual questions through the user new tools and datasets. Other outreach lications. A time-consuming activity request system accessible via the Contact activities involve contact with individual undertaken by curators around the

PLoS Computational Biology | www.ploscompbiol.org 6 November 2009 | Volume 5 | Issue 11 | e1000582 The Rat Genome Database Curators

Figure 5. An example of Biological Process annotations with links to references. doi:10.1371/journal.pcbi.1000582.g005

world is verifying identity of genomic their publications or presentations their allowing them to incorporate the data such as genes and QTLs, as well use of the data and tools made necessary functions in ways most useful as rat strains. Use of unique identifiers available through the work of curators. to them. It would allow them to query from recognized, international data- Referencing the sources of the data and multiple databases simultaneously, in- bases and correct nomenclature would tools used in publications would pro- corporate sophisticated text mining make this a much easier task as well as vide recognition of the professional tools, and run quality control checks ensure that data are not assigned to the accomplishments of curators just as at will. wrong gene, QTL, or strain on the referencing a colleague’s paper pro- 7.2. Text mining tools. Text mining basis of an incorrect identifier used in a vides acknowledgement of the use of tools are becoming more sophisticated publication. his or her work. and have gone beyond simple literature 4. Correct identification of organ- search tools. Current work has been ism used. As mentioned above, some focusing on identifying and highlight- papers are rejected for curation be- Internal Requests ing ontology terms and other data cause the organism used is not identi- 6.1. More sophisticated curation soft- within the papers. Taking this one step fied. ware. Each development in curation further, an example of a dream tool 5. Acknowledgment of their work. software at RGD has resulted in would be one that takes the annota- Thousands of researchers worldwide substantial increases in productivity tions that already exist for a gene and use data from databases in their and accuracy. The next generation matches those against emerging litera- research on a daily basis. Only a small software would allow curators to cus- ture to identify novel information, percentage of them acknowledge in tomize interfaces for particular tasks, more specific information, or data with

PLoS Computational Biology | www.ploscompbiol.org 7 November 2009 | Volume 5 | Issue 11 | e1000582 The Rat Genome Database Curators

better evidence and then would alert date data and tools to help them make Acknowledgments the curator to update the gene’s exciting discoveries. Implementing the information. developments listed above would contrib- We wish to thank RGD members ute to increased coverage, comprehensive- Elizabeth A. Worthey, Jeffrey De Pons, Each time researchers access RGD, ness, and ease of use of RGD and Alexander Stoddard, Burcu Bakir Gungor, they are presented with the results of databases worldwide. and Stacy Zacher for their contributions. countless hours of work by curators dedicated to providing accurate, up-to-

References 1. Galperin MY, Cochrane GR (2009) Nucleic 5. Sayers EW, Barrett T, Benson DA, Bryant SH, 8. Nelson SJ, Johnston D, Humphreys BL (2001) Acids Research annual Database Issue and the Canese K, et al. (2009) Database resources of the Relationships in Medical Subject Headings. In NAR online Molecular Biology Database Collec- National Center for Biotechnology Information. Bean CA, Green R, eds (2001) Relationships in tion in 2009. Nucleic Acids Res 37: D1–D4. Nucleic Acids Res 37: D417–D422. the organization of knowledge. NY: Kluwer 2. The International Society for Biocuration, Avail- 6. Blake JA, Harris MA (2008) The Gene Ontology Academic Publishers. pp 171–184. able: http://www.biocurator.org. (GO) project: structured vocabularies for molec- 9. Gene Ontology Consortium (2008) The Gene 3. Available: http://www.oxfordjournals.org/ ular biology and their application to genome and Ontology project in 2008. Nucleic Acids Res 36: our_journals/databa. January 2009. expression analysis. Curr Protc Bioinformatics, D440–D444. 4. Dwinell MR, Worthey EA, Shimoyama M, Bakir- Chapt 7, Unit 7.2. 10. NIH Public Access policy. Available: http:// Gungor B, DePons J, et al. (2009) The Rat 7. Smith CL, Goldsmith CA, Eppig JT (2005) The publicaccess.nih.gov. Accessed April 2009. Genome Database 2009: variation, ontologies Mammalian Phenotype Ontology as a tool for and pathways. Nucleic Acids Res 37: annotating, analyzing and comparing phenotypic D744–D749. information. Genome Biology 6: R7.

PLoS Computational Biology | www.ploscompbiol.org 8 November 2009 | Volume 5 | Issue 11 | e1000582