Developing and Implementing an Institute-Wide Data Sharing Policy Stephanie OM Dyke and Tim JP Hubbard*

Dyke and Hubbard Genome Medicine 2011, 3:60 http://genomemedicine.com/content/3/9/60 CORRESPONDENCE Developing and implementing an institute-wide data sharing policy Stephanie OM Dyke and Tim JP Hubbard* Abstract HapMap Project [7], also decided to follow HGP practices and to share data publicly as a resource for the The Wellcome Trust Sanger Institute has a strong research community before academic publications des- reputation for prepublication data sharing as a result crib ing analyses of the data sets had been prepared of its policy of rapid release of genome sequence (referred to as prepublication data sharing). data and particularly through its contribution to the Following the success of the first phase of the HGP [8] Human Genome Project. The practicalities of broad and of these other projects, the principles of rapid data data sharing remain largely uncharted, especially to release were reaffirmed and endorsed more widely at a cover the wide range of data types currently produced meeting of genomics funders, scientists, public archives by genomic studies and to adequately address and publishers in Fort Lauderdale in 2003 [9]. Meanwhile, ethical issues. This paper describes the processes the Organisation for Economic Co-operation and and challenges involved in implementing a data Develop ment (OECD) Committee on Scientific and sharing policy on an institute-wide scale. This includes Tech nology Policy had established a working group on questions of governance, practical aspects of applying issues of access to research information [10,11], which principles to diverse experimental contexts, building led to a Declaration on access to research data from enabling systems and infrastructure, incentives and public funding [12], and later to a set of OECD guidelines collaborative issues. based on commonly agreed principles [13]. ese initiatives, and those of other fora, firmly established data sharing as a priority in the minds of individuals involved, Introduction and in particular led to the development of funders’ e Wellcome Trust Sanger Institute (WTSI) played an policies in the UK and USA [14-17]. important role in the international public effort to However, by 2003 genomic science had diversified with sequence the human genome, the Human Genome Project a range of different data types being collected across (HGP), which has become a symbol of the benefits of multiple species. Funders were beginning to look at policies on early release of scientific data. e HGP data standards for large-scale data in other fields of the life release policy, known as the ‘Bermuda Agreement’, was sciences [18]. As WTSI shifted focus from a few large agreed to in 1996 by a group of genomic scientists and sequencing projects to multiple endeavors, coordination funders that included leaders from WTSI and the on data sharing for studies that involved different Wellcome Trust, and built on successful practices that funders, different technologies and diverse institutions had been in operation in other fields of genetics (for became increasingly complex. Efforts to maintain the example, the Caenorhabditis elegans Genome Project principles associated with HGP data release therefore led [1-3]). Other WTSI sequencing projects, whose structure to a range of project-specific adaptations. is approach easily fits the specifics of the HGP data release policy, worked well for large-scale studies that had sufficient followed suit and adopted similar practices that rapidly resources to manage data sharing plans, such as e became WTSI policy [4]. Large-scale international colla- Encyclopedia of DNA Elements (ENCODE; 2003 and bora tions, such as the SNP Consortium [5], Mouse 2008 [19,20]), Wellcome Trust Case Control Consortium Genome Sequencing Consortium [6] and International (WTCCC; 2005 [21]), Database of Chromosomal Imbalance and Phenotype in Humans Using Ensembl *Correspondence: [email protected] Resources (DECIPHER; 2006 [22]), 1000 Genomes Project Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, (2008 [23]), International Cancer Genome Consortium Cambridge CB10 1SA, UK (ICGC; 2008 [24]) and MalariaGen (2008 [25]), but led to disparities in adherence to data sharing for smaller © 2010 BioMed Central Ltd © 2011 BioMed Central Ltd projects. Dyke and Hubbard Genome Medicine 2011, 3:60 Page 2 of 8 http://genomemedicine.com/content/3/9/60 Furthermore, projects were starting to use human data adoption of recognized data and interoperability stand‑ sets that engendered additional ethical considerations. ards, including submission to designated public As it became possible to study genomic data for large repositories. numbers of individuals, the genomics community, with For many aspects of data sharing policy, best practice its evolving data sharing standards, began to interact for implementation remained to be established. While more with the human genetics community, whose carrying out the review of data sharing policy, the practices placed greater emphasis on data confidentiality. Institute began to devote resources to support the imple‑ It became accepted that a reasonable way to ensure the mentation of the Wellcome Trust policy on open and benefits of data sharing, while managing the risks, was to unrestricted access to research articles (in brief: papers share data with controls to limit access to approved users describing research carried out at or in collaboration for approved purposes. In 2006, a purpose‑built ‘managed with WTSI must be made publicly available through UK access’ database, the database of Genotypes and Pheno‑ PubMed Central (UKPMC) as soon as possible and in types (dbGaP), was established in the USA for storing any event within 6 months of the journal publisher’s and sharing genotypes and associated phenotypes that official date of final publication [31]). This effort focused could not be published through existing public archives on the development of ‘how‑to‑comply’ guidelines, [26]. In 2007, a similar repository was set up at the European including information for collaborators [32] and institut‑ Bioinformatics Institute (EBI): the European Genome‑ ing records of submissions and compliance tracking, with phenome Archive (EGA) [27]. WTSI has con tinued to support from research administrators and library staff. actively participate in relevant policy discus sions with the Based on this experience, it was agreed that successful Wellcome Trust and other funders, such as the Toronto policy implementation would depend on working out International Data Release Workshop in 2009, which led to detailed requirements (guidance), devoting efforts and the development of the Toronto State ment [28]. resources to alleviate disincentives (facilitation), institut‑ In summary, at the same time as these complexities ing monitoring processes (oversight), and leadership. evolved, it became more widely accepted that increased These are discussed in detail below in the following data sharing was important. It has become recognized sections: Guidance, Facilitation and Oversight. that data sharing enables research, accelerates translation, safeguards good research conduct, and helps inform Guidance policy and regulation, thereby fostering a public climate A major challenge was to work out what the principles in which research can flourish. Being committed to these outlined in the text of the policy meant in practice for benefits spurred the Institute to develop and implement individual projects. Decisions were guided by the need to an institute‑wide data sharing policy. ensure that anticipated benefits from making data available would outweigh the costs associated with long‑ Developing and implementing the policy term archiving and the effort involved in preparing data A review of data sharing policy at WTSI, including a for submission. Timelines for submission were deter‑ consultation to identify issues of concern, was under‑ mined by evaluating the length of time required to allow taken. This allowed an institute‑wide data sharing policy adequate quality control to ensure value over time. For to be drafted that covers the diverse work being carried example, reference genome sequence data are valuable out. A working group that included faculty members with minimal quality control. The value of the draft representing every area of WTSI science was set up to human genome sequence data shared within 24 h of steer this effort. The process of review and policy revision sequencing is testament to this approach. On the other took a year and the drafting of policy followed a standard hand, certain cellular assays captured through sequencing course that has been described previously [29]. (for example, ChIP‑seq) may have little value if the The policy that resulted from this process addresses experiment failed and this may not be realized until ethical issues and differences in experimental contexts initial analysis has been carried out. and data types [30]. It includes a commitment to rapid The appropriate resolution of raw data submitted was sharing of data sets of use to the research community also considered in this way. Summary data sets can be (which include primary and processed data sets, research much smaller than the raw data sets they derive from, articles and software code), and encompasses elements to and in many cases satisfy the needs of other users. On the address the following: (1) protection of research partici‑ other hand, storing raw data is more important if samples pants; (2) promotion of respect for rights for data genera‑ are rare or

Developing and Implementing an Institute-Wide Data Sharing Policy Stephanie OM Dyke and Tim JP Hubbard*

VISPA: a Computational Pipeline for the Identification and Analysis Of

Sequencing Alignment I Outline: Sequence Alignment

Comparative Analysis of Multiple Sequence Alignment Tools

Chapter 6: Multiple Sequence Alignment Learning Objectives

How to Generate a Publication-Quality Multiple Sequence Alignment (Thomas Weimbs, University of California Santa Barbara, 11/2012)

"Phylogenetic Analysis of Protein Sequence Data Using The

Aligning Reads: Tools and Theory Genome Transcriptome Assembly Mapping Mapping

Alignment of Next-Generation Sequencing Data

INTELLIGENT MEDICINE the Wings of Global Health

ENCODE Regions Identification of Higher-Order Functional Domains in the Human

Cancer Genome Interpreter Annotates the Biological and Clinical Relevance of Tumor Alterations David Tamborero1,2, Carlota Rubio-Perez1, Jordi Deu-Pons1,2, Michael P

Estimating the Number of Unseen Variants in the Human Genome