Open Science and the Role of Publishers in Reproducible Research
Total Page:16
File Type:pdf, Size:1020Kb
Open science and the role of publishers in reproducible research Authors: Iain Hrynaszkiewicz (Faculty of 1000), Peter Li (BGI), Scott Edmunds (BGI) Chapter summary Reproducible computational research is and will be facilitated by the wide availability of scientific data, literature and code which is freely accessible and, furthermore, licensed such that it can be reused, inteGrated and built upon to drive new scientific discoveries without leGal impediments. Scholarly publishers have an important role in encouraGing and mandating the availability of data and code accordinG to community norms and best practices, and developinG innovative mechanisms and platforms for sharinG and publishinG products of research, beyond papers in journals. Open access publishers, in particular the first commercial open access publisher BioMed Central, have played a key role in the development of policies on open access and open data, and increasing the use by scientists of leGal tools – licenses and waivers – which maximize reproducibility. Collaborations, between publishers and funders of scientific research, are vital for the successful implementation of reproducible research policies. The genomics and, latterly, other ‘omics communities historically have been leaders in the creation and wide adoption of policies on public availability of data. This has been throuGh policies, such as Fort Lauderdale and the Bermuda Principles; infrastructure, such as the INSDC databases; and incentives, such as conditions of journal publication. We review some of these policies and practices, and how these events relate to the open access publishinG movement. We describe the implementation and adoption of licenses and waivers prepared by Creative Commons, in science publishinG, with a focus on licensing of research data published in scholarly journals and data repositories. Also, we describe how some publishers are evolvinG the copyriGht system to ensure published data are in the public domain under the Creative Commons CC0 waiver. Other cases where CC0 has been successfully implemented in science are discussed in particular by BGI, the world’s larGest genomics orGanisation. BGI have developed an advanced platform for publishing executable research objects – includinG larGe data packaGes and code – which is inteGrated with open-access article publishinG throuGh GigaScience and its database, GiGaDB. We look at journal and publisher policies which aim to encouraGe reproducible research, and the comparative influence and success of these policies. We discuss specific problems faced in data sharinG and reproducible research such as data standardization, and some of the solutions. Finally, we review the state of the art in scientific workflows and larGe-scale computation platforms – includinG Galaxy and myExperiment, and how current and future collaborations between the scientific and publishinG communities utilizinG these innovative tools will further drive reproducibility in science. The evolution of policies on open access and open data in the life sciences When we read about the claims made in scientific papers, we tend to believe that they have been written by their authors in good faith. The process of science therefore demands the hiGhest ethics and quality in order for the content of a scientific paper to be taken at face value. However, the increasinG number of retractions in the scientific literature sugGest that peer review is not of sufficient riGour to assess whether the results reported in papers can in fact be reproduced1. It is often only the results and conclusions of a study which are examined, whilst the methodoloGy, raw data and the source code used to Generate the results of a paper are usually not fully evaluated. Open access publishing, BioMed Central, and the literature as a resource for science Openness enables reproducibility, and reproducible computational research requires openness in all products of the research. Open data and code must be supported by full and accurate descriptions of the experiments as initially proposed (protocols) and as eventually carried out (methods and results)2. Openness in scientific papers (journal articles) is encompassed by open access publishinG. Open access to scholarly articles is Generally achieved throuGh two mechanisms. Firstly, scholars or their institutions can “self archive” and share peer-reviewed pre-publication versions of papers, which have been accepted for publication in journals. This happened before, and after, diGital scholarship was possible, and is known as “green” open access. Secondly, scholars can publish their papers in open access journals, known as Gold open access3, which is the focus of this chapter. The first free-to-access online journals emerGed soon after the introduction of the World Wide Web4 but characteristics of open access publishinG in the 21st century have helped the literature itself to become a scientific resource. Open access to journal articles enables them to be read online without a subscription but open access is about more than accessibility. Open access to journal articles is also about reusability, which means considerinG the format in which the literature is available and the copyriGht license under which it is published. This emerGed from three overlappinG definitions of open access resultinG from three influential meetinGs (in Budapest, February 2002, Bethesda, June 2003, and Berlin, October 2003), which subsequently released public statements3. The Budapest definition states: “By ‘open access’ to this literature, we mean its free availability on the public internet, permitting any users to read, download, copy, distribute, print, search, or link to the full texts of these articles, crawl them for indexing, pass them as data to software, or use them for any other lawful purpose, without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself”5. This is the most pertinent definition when deriving reproducibility from the published literature. The Budapest, Bethesda and Berlin (BBB) definitions of open access unified policies on copyriGht in scholarly works, and unified practices in how electronic literature should be formatted and structured. These policies and practices had already been put into practice, in 2000, by the first commercial open access publisher BioMed Central, and the full text open access repository funded by the National Institutes of Health (NIH), PubMed Central. The idea for creatinG an online, open access life-science publisher emerged in 1998 from a meetinG between the publishing entrepreneur Vitek Tracz, Chairman of the Science NaviGation Group in London, UK and David Lipman, Director of the NCBI in the US. Lipman’s responsibilities include infrastructure for implementation of data sharinG policies in genomics – databases such as GenBank – and the bibliographic database PubMed. After Lipman discussed the idea with the then director of the NIH Harold Varmus, a proposal emerged for E-BIOMED, an NIH-sponsored free, full text research publishinG platform6. However, a larGe research funder’s potential conflict of interest in becominG, or being seen to be, a publisher meant that the NIH could only ever provide a repository for full text open access articles oriGinally published elsewhere. As a result, BioMed Central was conceived as a publisher of online biology and medical journals to support deposition of content in the repository. The repository was launched, as PubMed Central, in February 20007. BioMed Central beGan acceptinG its first submissions in May 2000, with an inclusive editorial and peer-review policy focusinG on scientific accuracy rather than interest, including publication of neGative results and single experiments8. Both PubMed Central and BioMed Central publish the full text of articles in an open standardized XML format with a document type definition (DTD) to enable efficient filtering and querying of content. This approach, under open access, enables the rapid development of computational analysis tools – so the literature itself becomes a scientific resource. Licensing the literature for reuse Efficient reuse of published research requires that the appropriate leGal tools – copyriGht licenses – be put in place by publishers and riGhts holders. LeGal restrictions, engrained in traditional copyright transfer aGreements, on the sharing and reuse of the products of scientific research are another barrier to reproducibility. In its first author license aGreement, BioMed Central authors retained copyriGht in their work, with the publisher actinG as a provider of layout, archivinG and peer-review co-ordination services. Authors were free to redistribute their work as they wished, with the only requirement being attribution of the oriGinal publisher. This was – and generally still is – in contrast to the traditional model of science publishing, where researchers typically work for several years on a piece of research and then hand exclusive riGhts to display and distribute that work to a publisher who controls access to the work. In 2004, BioMed Central’s license aGreement was made consistent with the Creative Commons Attribution License9 (CC-BY), which has emerGed as the gold standard for licensinG journal articles under an open access model in STM publishinG. There are several derivatives of the Creative Commons Attribution License, with CC-BY being the most liberal. The only requirement for sharinG, redistribution, reproduction, remixinG, reuse and translation