Community Page

The Open Knowledge Foundation: Open Data Means Better Science

Jennifer C. Molloy* Department of Zoology, University of Oxford, Oxford,

Data provides the evidence for the published body of scientific knowledge, which is the foundation for all scientific progress. The more data is made openly available in a useful manner, the greater the level of transparency and reproducibility and hence the more efficient the scientific process becomes, to the benefit of society. This viewpoint is becoming mainstream among many funders, publishers, scientists, and other stakeholders in research, but barriers to achieving widespread publication of open data remain. The Open Data in Science working group at the Open Knowledge Foundation is a community that works to develop tools, applications, datasets, and guidelines to promote the open sharing of scientific data. This article focuses on the Open Knowledge Definition and the Panton Principles for Open Data in Science. We also discuss some of the tools the group has developed to facilitate the generation and use of open data and the potential uses that we hope will encourage further movement towards an open scientific knowledge commons.

Citation: Molloy JC (2011) The Open Knowledge Foundation: Open Data Means Better Science. PLoS Biol 9(12): e1001195. doi:10.1371/journal.pbio.1001195

Published December 6, 2011

Copyright: © 2011 Jennifer C. Molloy. This is an open-access article distributed under the terms of the [...] Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: The author received no specific funding for this work.

Competing Interests: I have read the journal's policy and have the following conflicts: I volunteer with the OKF as the Coordinator of the Open Data in Science Working Group.

Abbreviations: BMC, BioMedCentral; IIOD?, Is It Open Data?; OA, open access; OKD, Open Knowledge Definition; OKF, Open Knowledge Foundation; WDTK?, What Do They Know?

* E-mail: [email protected]

The Community Page is a forum for organizations and societies to highlight their efforts to enhance the dissemination and value of scientific knowledge.

PLoS Biology | www.plosbiology.org | December 2011 | Volume 9 | Issue 12 | e1001195

Introduction

Science is built on data: its collection, analysis, publication, reanalysis, critique, and reuse. However, the current system of scientific publishing works against maximum dissemination of the scientific data underlying publications. Barriers include inability to access data, restrictions on usage applied by publishers or data providers, and publication of data that is difficult to reuse, for example, because it is poorly annotated or "hidden" in unmodifiable tables in PDF documents. In addition, there is a cultural reluctance to publish data openly, for multiple reasons, from researchers' fears about releasing data "into the wild" where they lack control over its usage to a lack of incentive or credit for doing so.

In response to these problems, multiple individuals, groups, and organisations are involved in a major movement to reform the process of scientific communication. The promotion of [...] and open data and the development of platforms that reduce the cost and difficulty of data handling play a principal role in this.

One such organisation is the Working Group on Open Data in Science (also known as the Open Science Working Group) at the Open Knowledge Foundation (OKF). The OKF is a community-based organisation that promotes open knowledge, which encompasses open data, free culture, the public domain, and other areas of the [...]. Founded in 2004, the organisation has grown into an international network of communities that develop tools, applications, and guidelines enabling the opening up of data, and subsequently the discovery and use of that data. Its working groups are in fields as broad as government, development, science, economics, archaeology, and geodata. However, all are united by the same organisational values and principles, and share a common understanding of [...], as set out in the Open Knowledge Definition (OKD; http://www.opendefinition.org/okd/).

The OKF Working Group on Open Data in Science (http://science.okfn.org/About/) began in 2009 with the purpose of developing guidelines, tools, and applications to promote open data in the [...] and enable scientists to maximise the use and impact of that data. It is now a diverse and international community of scientists, data wranglers, lawyers, and other individuals with interests in both open data and the broader concept of [...].

The Open Knowledge Definition

The definition of "open", crystallised in the OKD, means the freedom to use, reuse, and redistribute without restrictions beyond a requirement for attribution and share-alike. Any further restrictions make an item closed knowledge. It also emphasises the importance of usability and access to the entire dataset or knowledge work:

"The work shall be available as a whole and at no more than a reasonable reproduction cost, preferably downloading via the Internet without charge. The work must also be available in a convenient and modifiable form."

This is an important consideration for scientific data where in some cases data is accessible, for example, in online supplements to published papers, but is not licensed to be reusable; or it's accessible and reusable but in a form that inhibits capture and modification. Prior to online supplementary materials, requesting and obtaining permissions and data was an extremely time-consuming process, but even with instant downloads, deciding what rights one has to reuse data can be confusing due to a lack of licensing and clear terms of use. In some cases, the supplementary data associated with papers is open even if the article itself is not; but this is often not explicit. Clear labelling and licensing is vital to save scientists the many hours they may spend discovering the openness or otherwise of datasets and becomes even more imperative as computerised analysis of the [...] increases, for example via data and text mining. Websites such as the crystallography data aggregator CrystalEye (http://wwmm.ch.cam.ac.uk/crystaleye/) prominently display an Open Data web button on their website and link to the Public Domain Dedication and License (PDDL) as well as the OKD (Figure 1).

Figure 1. Screenshot of the CrystalEye entry for the structure of coenzyme cob(II)alamin with a copy of the OKF Open Data button displayed on the site. doi:10.1371/journal.pbio.1001195.g001

Deciding what constitutes open is particularly pertinent to the movement in science towards open access, or OA, which is related to open data but has different immediate goals. OA is defined in the Bethesda Statement (http://www.earlham.edu/~peters/fos/bethesda.htm) in terms that embrace open data. However, non-OA publishers often use the term to mean "free" access to publications. An important distinction is drawn within the open community between libre, "free as in freedom", as expressed in the OKD, and gratis, "free as in beer". The majority of OA journals appear to be gratis rather than libre: as of August 2011 only 1,549 (22%) of the 6,922 journals in the Directory of Open Access Journals (DOAJ) were licensed under Creative Commons, and some of these licenses contained non-commercial or non-derivative clauses. Therefore, the reader may not be free to do what they wish with the text or data as per the OKD.

To reduce confusion about what open data should look like, there was a need to extend the OKD with a new set of principles specific to the scientific field.

The Panton Principles for Open Data in Science

In collaboration with [...] of Creative Commons, key members of the OKF, namely Rufus Pollock (University of Cambridge), Peter Murray-Rust (University of Cambridge), and [...] (STFC), spent two years developing a set of principles for publishing open scientific data, using the OKD and the Science Commons' Protocol for Implementing Open Access Data (http://sciencecommons.org/projects/publishing/open-access-data-protocol/) as precedents and guides. The result was the Panton Principles (see Box 1; http://www.pantonprinciples.org/), named after the Panton Arms pub in Cambridge where the majority of the drafting sessions occurred. The principles were officially launched in February 2010 and have since gained more than 150 endorsers.

Box 1. Panton Principles in Summary

1. When publishing data, make an explicit and robust statement of your wishes.
2. Use a recognised copyright waiver or license that is appropriate for data.
3. If you want your data to be effectively used and added to by others, it should be open as defined by the Open Knowledge/Data Definition; in particular, non-commercial and other restrictive clauses should not be used.
4. Explicit dedication of data underlying published science into the public domain via PDDL (http://opendatacommons.org/licenses/pddl/1-0/) or CCZero (http://creativecommons.org/publicdomain/zero/1.0/) is strongly recommended and ensures compliance with both the Science Commons Protocol for Implementing Open Access Data and the Open Knowledge/Data Definition.

The scope of the principles covers all primary experimental data published within or alongside research papers, including the data content of any table or graph and all images, audio, or video acting as the primary mechanism of data capture, e.g., protein gels or animal vocalisation recordings. The crux of the Panton message is that all such data, with very few exceptions, should be placed explicitly in the public domain. Good reasons for not releasing data would include the risk of violating patient privacy or revealing the precise location of an endangered species.

The Open Data Movement in Science

The Panton Principles are not an isolated initiative but part of a wider movement to promote open data in science that is gathering momentum. Historically, scientific data has not been openly available, for a great variety of reasons. Some are technological (paper is not an efficient form of sharing datasets), but the web has opened up not just new possibilities for sharing, collaboration, and analysis, but also for exploring new forms of scientific enquiry. For example, automated text and data mining of large swathes of the published corpus of scientific knowledge is now feasible if such material is accessible.

Encouraging scientists to share their data is a challenge, even when it directly supports published work. A 2008 report by the Research Information Network [1] found that some researchers were unwilling to share their data openly due to fears of exploitation, particularly for datasets where they felt they could extract multiple publications; another problem is the lack of career rewards, recognition, or incentives to publish data, which makes it difficult for researchers to justify the time and effort required to make data available. However, there is top-down pressure to move towards open data publication from funders such as the Wellcome Trust and the United Kingdom Research Councils as well as the United States National Institutes of Health (NIH), which published a joint statement to that end in February 2011 [2]. The European Commission and the Royal Society are both leading major enquiries into the future of the communication of scientific information, with reports due later this year. Open data in science has even appeared on government agendas; a recent report from the UK House of Commons Select Committee on Science and Technology examined research integrity and the [...] process and concluded that:

"Access to data is fundamental if researchers are to reproduce, verify and build on results that are reported in the literature … The presumption must be that, unless there is a strong reason otherwise, data should be fully disclosed and made publicly available. In line with this principle, where possible, data associated with all publicly funded research should be made widely and freely available … The work of researchers who expend time and effort adding value to their data, to make it usable by others, should be acknowledged as a valuable part of their role" [3].

Implementing open data more widely necessitates new infrastructure to support data archiving, as well as a change to how data fits within scientific publishing. Major OA publishers and their non-OA colleagues are joining forces to discuss these issues through groups like the Publishing Open Data Working Group led by BioMedCentral (BMC). Some journals are participating in a Joint Data Archiving Policy (JDAP), which requires deposition of data underlying papers in appropriate public repositories such as Dryad (http://datadryad.org/). Alternatively, direct publishing of data as a peer-reviewed "data paper" is now possible in the fields of biodiversity (http://www.gbif.org/; [4]) and ecology and environmental science (http://www.pangaea.de/ and http://www.earth-system-science-data.net).

There is also a role for individuals and communities to drive the open data message forward. Veli Vikberg, David R. Smith, and Jean-Luc Boevé won the 2011 BMC Open Data Award for their efforts in publishing the full ecological background data associated with a paper on the ecological phylogenetics of plant-feeding insects [5], which was above and beyond the DNA sequences that are the norm for such publications. Vikberg admitted that "credit … must go to a persistent, anonymous referee … who demanded – twice – that we also publish the background data" [6].

A single individual's persistence led to the open publication of data that would otherwise have been more difficult for researchers to obtain, which Vikberg acknowledges will aid reanalysis as new and improved models emerge in the ecological phylogenetics field. In addition, the research team gained recognition and reward from BMC and the members of the Open Data in Science working group on the judging panel.

Networks such as the OKF working group and other open data initiatives can play an important role in bringing enthusiastic individuals together to effect change. Further to encouraging researchers to publish data openly, we are dedicated to developing practical assistance in the form of tools and applications via our community of scientists who provide the problems and suggest possible solutions, and the developers who build them.

Is It Open Data?

Requesting data from other researchers can be a tortuous and sometimes fruitless process. In a 2006 survey, 50.8% of US researchers reported that data withholding had exerted a negative effect on the progress of their research [7]. This problem could be overcome by sharing data freely online, but as discussed previously, discovering the terms of use of data can be a difficult and time-consuming task as this information is often not explicitly stated at the point of data viewing or download.

With this in mind, one of the first tools that the Open Data in Science working group created was "Is It Open Data?" (IIOD?; http://www.isitopendata.org/), a web application based on civil society websites such as What Do They Know? (WDTK?; http://www.whatdotheyknow.com). WDTK? allows users to make Freedom of Information requests for public sector or government information in the UK and records the resulting correspondence as a permanent and visible record in the public domain. In much the same way, IIOD? enables interested parties to request the open or closed status of data and data licensing details from providers such as academic publishers, research organisations, nongovernmental organisations, and all others making data available online.

It has already been used to contact major publishers regarding the status of data in the supplementary documentation associated with published papers, and we would encourage others to contact their own journals of choice where data policies are unclear. In our first round of enquiries, the openness of data in Public Library of Science (PLoS) and BMC publications was confirmed, while the [...] Publishing Group also stated that raw data extracted from their publications may be used as open data, with limited caveats. Over time, extensive and systematic requests to journals and other data providers are expected to build up a collection of position statements on data reuse that are currently unavailable without searching through the journal or publisher's websites individually. We hope this will result in fewer duplicated requests and save researchers valuable time.

What the Reuse of Open Data Might Achieve

There is little point in opening up data if it is not used; it does not intrinsically lead to better science in and of itself, although it could be argued that the open publication of datasets will directly discourage fraud. It would be useful to evaluate the reuse of current open data, but evidence is limited due to issues in tracking data citations. However, it does appear that publicly sharing your data increases citation rate, at least in cancer microarray experiments [8], which is positive encouragement that open biological data is being reused.

Evidence is also emerging that data archiving leads to an impressive scientific return per research dollar [9], which corroborates the obvious benefits of shared data in established databases such as GenBank and the Protein Data Bank (PDB) that have had such a huge impact on the biological field. To maximise this discovery and reuse, tools are required to assist in locating open data and making it usable, for example, extracting data from unmodifiable formats like PDF.

A current collaboration between the Open Data in Science working group, the Joint Information Systems Committee (JISC) funded DevCSI project, and Semantic Web Applications and Tools for Life Sciences (SWAT4LS) is a free workshop to generate semantic tools for the biological sciences (http://www.ukoln.ac.uk/events/devcsi/life-sciences-hackdays/index.html). As part of this we hope to create some Open Research Reports on infectious diseases: collections of open publications and datasets brought together using open bibliographic data and crowd-sourced summaries of non-open content. This would be fully searchable and semantically linked and would enable discovery of open research by academics and others, with particular public interest likely to stem from patient groups. Open Research Reports are primarily being developed by David Shotton and Tanya Gray (University of Oxford), and we hope that this project will expand in scope and grow into a valuable resource for the life sciences, fuelled by the increasing availability of open data and content.

Additionally, the working group has several members researching technologies that will use open data to seek new scientific discoveries, which nicely illustrate its potential. In the semantic web community, much effort has been made to link life sciences data together in a way that machines can understand the semantic links between objects in datasets. This will not only assist in keeping track of the rapidly expanding scientific literature, but also will enable novel analyses to be performed and new connections discovered, for example, linked open drug data aims to connect previously unlinked results from clinical trials, gene expression assays, and chemical testing [10]. This enables researchers to more rapidly answer complex queries using a single interface rather than manually searching through the literature; one example would be to discover possible targets of a medicine by searching for the possible targets of drugs with shared ingredients. Drawing together diverse datasets for reuse in this manner becomes complicated where their terms of use are restrictive or not interoperable, making openness a valuable attribute.

The Open Data in Science working group has a common goal of achieving a world in which scientific data is open by default according to the Panton Principles, with limited exceptions. As a diverse collection of individuals, the aims, objectives, and means to achieve this are a matter of healthy debate and we encourage others to join the discussion.

In terms of our primary aim of providing tools, apps, and datasets for generating, discovering, and reusing open data, ideas are flowing continuously but require the input of the wider scientific community in identifying the problems they face in publishing, discovering, and reusing data online and requesting assistance in solving them. The working group aims to provide a community and network that can respond to these needs and a hub for access to the resulting tools, which we hope all stakeholders in scientific data will find valuable. Better science (in terms of transparency, reproducibility, increased efficiency, and ultimately a greater benefit to society) depends on open data.

Acknowledgments

I am grateful to members of the OKF Open Data in Science working group for making this article possible and for their much appreciated input to the [...]. Particular thanks must go to Daniel Mietchen for his helpful comments and suggestions.

References

1. Research Information Network (2008) To share or not to share: research data outputs. Available: http://www.rin.ac.uk/our-work/data-management-and-curation/share-or-not-share-research-data-outputs. Accessed 27 October 2011.
2. Walport M, Brest P (2011) Sharing research data to improve public health. Lancet 377: 537–539. doi:10.1016/S0140-6736(10)62234-9.
3. House of Commons Science and Technology Committee (2011) Science and Technology Committee – eighth report. Peer review in scientific publications. Available: http://www.publications.parliament.uk/pa/cm201012/cmselect/cmsctech/856/85602.htm. Accessed 27 October 2011.
4. Global Biodiversity Information Facility (2011) New incentive for biodiversity data publishing [press release]. Available: http://www.gbif.org/communications/news-and-events/showsingle/article/new-incentive-for-biodiversity-data-publishing. Accessed 27 October 2011.
5. Nyman T, Vikberg V, Smith DR, Boevé J (2010) How common is ecological speciation in plant-feeding insects? A 'Higher' Nematinae perspective. BMC Evolutionary Biology 10: 266. doi:10.1186/1471-2148-10-266.
6. Nyman T (25 May 2011) On the unbearable lightness of mandatory data sharing. BioMed Central Blog. Available: http://blogs.openaccesscentral.com/blogs/bmcblog/entry/on_the_unbearable_lightness_of. Accessed 27 October 2011.
7. Vogeli C, Yucel R, Bendavid E, Jones L, Anderson M, et al. (2006) Data withholding and the next generation of scientists: results of a national survey. Academic Medicine 81: 128–136. doi:10.1097/00001888-200602000-00007.
8. Piwowar HA, Day RS, Fridsma DB (2007) Sharing detailed research data is associated with increased citation rate. PLoS ONE 2: e308. doi:10.1371/journal.pone.0000308.
9. Piwowar HA, Vision TJ, Whitlock MC (2011) Data archiving is a good investment. Nature 473: 285. doi:10.1038/473285a.
10. Samwald M, Jentzsch A, Bouton C, Stie Kallesøe C, Willighagen E, et al. (2011) Linked open drug data for pharmaceutical research and development. J Cheminform 3: 19. doi:10.1186/1758-2946-3-19.
