R eproducible Research

The Legal Framework for Reproducible Scientific Research Licensing and Copyright

As computational researchers increasingly make their results available in a reproducible way, questions naturally arise regarding copyright, subsequent use and citation, and ownership rights in general. The author proposes the Reproducible Research Standard for all components of scientific researchers’ scholarship, which should encourage replicable scientific investigation through attribution, the facilitation of greater collaboration, and the promotion of the engagement of the larger community in scientific learning and discovery.

n the US, when scientists put their original computational work because researchers often research on the Web, it automatically falls need more than just the published article to re- under copyright. However, copyright is produce the results. an unsuitable legal structure for scientific Although this trend seems to be changing, the Iworks. Scientific norms guide scientists to repro- typical journal publishing format doesn’t allow for duce and build on others’ research, and default transmission of supporting files such as accom- copyright law by its very purpose runs counter panying images, source code, or demonstrations to these goals. In this article, I present a meth- of work to interested readers—although some odology for scientists to rescind copyright from journals are beginning to require that code or their work in such a way that it realigns scientific data be released as a precondition for publication, information sharing with long-established scien- (for example, Nature [www.nature.com/authors/ tific norms. editorial_policies/availability.html], The Insight Journal [www.insight-journal.org], and Annals of Options for Researchers Internal Medicine [www.annals.org/cgi/content/ Computational research is becoming more per- full/0000605-200703200-00154v1]). Evidence ex- vasive across a growing number of fields—for ex- ists that reproducible research receives more cita- ample, in the June 1996 issue of The Journal of the tions than nonreproducible work,1,2 and releasing American Statistical Association, nine of 20 articles research on the Web is a growing trend that seems were computational, whereas a decade later, in to be gathering institutional support. On 12 Feb- the June 2006 issue, 33 of 35 were computation- ruary 2008, Harvard University’s faculty of arts al. Different journals have different agreements and sciences adopted a policy that requires faculty with article authors, but most require authors to members to let the university make their scholarly relinquish their ownership rights to articles, in- cluding copyright. Thus authors have very little 1521-9615/09/$25.00 © 2009 IEEE or—more typically—no say in how their work is Co p u b l i s h e d b y t h e IEEE CS a n d t h e AIP used after publication; they also find that it’s fre- Victoria Stodden quently bound away in journals that can be very expensive to access. This is especially tough for Harvard Law School

Co m p u t i n g in Sc i e n c e & En g i n e e r i n g This article has been peer-reviewed. 35 articles available freely online (rights are turned the underlying idea or discovery (www.copyright. over to the university, nonexclusively): gov/title17). Another option, quite different from copyright, is the patent, granted by the US Patent Each Faculty member grants to the President and Office only after reviewing an invention to make Fellows of Harvard College permission to make sure it’s relevant, useful, and not obvious. Where- available his or her scholarly articles and to exer- as authors don’t need to apply for copyright pro- cise the copyright in those articles. In legal terms, tection because it “follows the author’s pen across the permission granted by each Faculty member the page,”3 inventors must apply for a patent. Just is a nonexclusive, irrevocable, paid-up, world- as a copyright protects an author from plagiarism, wide license to exercise any and all rights under a patent’s raison d’être is to open the knowledge copyright relating to each of his or her scholarly publicly while granting the right to exclude oth- articles, in any medium, and to authorize others ers from making, using, offering for sale, or sell- to do the same, provided that the articles aren’t ing the invention in the United States (35 USC sold for a profit (www.fas.harvard.edu/~secfas/ 154(a)(1)). However, copyright and patent laws February_2008_Agenda.pdf, p. 3). work counter to prevailing scientific norms— copyright was intended to give authors of creative However, Stuart M. Shieber, the Harvard works (literature and music, for example) exclusive University computer science professor who pro- rights, such as the right to be credited, to deter- posed the new policy, said in a press release that mine who may adapt or perform the work, or who the decision “should be a very powerful message may benefit financially from it. A derivative work to the academic community that we want and is one that’s “based upon one or more preexist- should have more control over how our work is ing works” and gives the original copyright holder the exclusive right to prepare such works (17 USC Copyright and patent laws work counter to 106(2)). A scientific contribution is considered valuable if, among other things, researchers can prevailing scientific norms—copyright was reproduce the results successfully (verifiability), and the work is built upon it, thus uncovering new intended to give authors of creative works scientific discoveries. Copyright stands in the way of both actions. exclusive rights. As explained earlier, open licenses offer an op- tion to override default copyright law. Two of the most common types of open licenses that rescind used and disseminated” (www.news.harvard.edu/ copyright are those designed for code (for ex- gazette/2008/02.14/99-fasvote.html). Stanford’s ample, the GNU Public License or GPL and the School of Education soon followed suit with a Berkeley Software Distribution or BSD license) or mandate for : All faculty members media (for example, the family of Creative Com- will deposit a copy of their published work in an mons licenses). open access repository as of 26 July 2008 (www. news.harvard.edu/gazette/2008/02.14/99-fasvote. Licenses for Code html and www.earlham.edu/~peters/fos/2008/06/ Because copyright extends to code, Richard Stall- oa-mandate-at-stanford-school-of-ed.html). man began the movement in the General concern clearly exists about ownership early 1980s to encourage programmers to release rights to scholarly research. Tension arises be- their source code along with the software com- cause the scientific ethos guides scientists to both piled for end users (www.gnu.org/gnu/thegnu reproduce previous results and build on them, project.html). His license, which became the thereby generating further scientific understand- GPL, has two main components: ing. Copyright stands as a bar preventing the open sharing, dissemination, and use of work. To re- •• if publicly distributed, all software subject to scind copyright on scientific research, you must the license must also have its source code re- actively choose to do so, and this is typically done leased, and with a license. •• once the license is attached to code, it also at- taches to any body of code that uses the origi- Licensing and Rescinding Copyright nal code. Copyright is a set of rights that attach by default to “original works of authorship” although not to In brief, Stallman’s license has a viral effect de-

36 Co m p u t i n g in Sc i e n c e & En g i n e e r i n g signed to propagate the release of all source code. the same or similar license to this one (see http:// This means that if you use GPL-licensed code in creativecommons.org/licenses/by-nc-sa/2.0, for the development of another body of code, your example). The CC BY license entire work must also carry the GPL unless you ensures attribution but doesn’t contain the Share negotiate an alternative with the original’s copy- Alike component. right holders. This is the Share Alike provision of the license. The Share Alike Provision BSD license is an attribution license and doesn’t in the Scientific Context contain the Share Alike provision. Software under The Share Alike concept is inappropriate in the a BSD license retains the original license when it’s scientific context because it can impose limits on used in a derivative work, but the entire derivative the use and reuse of others’ work, which in the work doesn’t necessarily become BSD-licensed scientific context, should be avoided whenever unless the downstream author chooses to do so. possible. This would render the research un- Typically, the computational researcher re- available for reuse by people who don’t want to leases code that consists of instruction scripts use the same license as the original research for for a proprietary compiled language, rather than their own resulting research compendia. Ideally, a compiled binary, although it might require downstream researchers will choose to license the proprietary binary code to run. There has been original components of their compendia so that a marked increase in the use of these types of researchers, even those working within a propri- quantitative programming environments such as etary context, can build upon the work without Matlab, SPSS, SAS, R, and Stata for experimenta- legal encumbrance, but scientific research should tion and data display across a variety of fields. To be encouraged over particular license use. rescind copyright from these instruction scripts, Furthermore, expanding the license to cover a license for code would be used. But all of a com- the entire derivative work product makes attempts putational scientist’s research can be considered at attribution more difficult. Under Share Alike, code—for example, figures, articles, data struc- it’s no longer clear how to give credit to upstream tures, and even pseudocode descriptions wouldn’t work in a derivative product because a single attri- be classified as code. Stallman created the GNU bution scheme could subsume and conflate work Free Documentation license to cover the docu- by different authors. In the scientific context, the mentation that accompanies code, but it wasn’t license should attach only to the derivative work’s intended to extend to media beyond text. components that the original author actually car- ried out, which isn’t possible under a Share Alike Creative Commons provision. Scientists should be free to license their In 2001, Larry Lessig founded Creative Commons compendia’s components as they see fit, and they to enable greater sharing of works and informa- shouldn’t be restricted in licensing, say, the fig- tion. The Creative Commons attribution license ures from their work in a particular way because (CC BY) was designed for nonsoftware works to they used or modified another researcher’s code to “share your creations with others and use music, build them. The Science Commons Open Access movies, images, and text online that’s been marked Data Protocol (sciencecommons.org/projects/ with a ” (http://creative publishing/open-access-data-protocol/) embodies commons.org/learnmore). Creative ­Commons many of these arguments. states explicitly that its licenses aren’t intended to cover code (see www.fsf.org/licensing/license/ The Gaping Hole agpl-3.0.html). Because Creative Commons li- The US National Science Foundation (NSF) censes are designed for media and not to code, funds a large proportion of scientific academic they make no reference to source code. Among research: In 2006, federally funded science and other things, this is so as not to create license engineering R&D comprised 63 percent of total incompatibilities and because of the lack of pat- academic research and development support.4 ent provisions in the Creative Commons licenses The NSF requires that researchers make available (patent is not an issue generally associated with any data and other supporting materials for the media). Many Creative Commons licenses also research it funds to other researchers at no more incorporate Stallman’s notion of viral attach- than incremental cost.5 So how do we, as academ- ment through their own Share Alike concept: ics, comply with our funding mandate to release If you alter, transform, or build upon this work, our research publicly? And less broadly, how do you can distribute the resulting work only under we, as computational researchers, comply with the

Ja n ua r y/Fe b r u a r y 2009 37 validation and verification demands as part of the ducible research: the lack of reward for producing scientific method? reproducible work (norms in the scientific com- Jon Claerbout, a Stanford geophysics profes- munity) and the legal obstacle to the full shar- sor, established a principle that many prominent ing of methodologies, writing, code, papers, and researchers are now following: “An article about data (copyright law). I propose the Reproducible computational science in a scientific publication Research Standard (RRS) to address the second isn’t the scholarship itself, it’s merely advertis- block: problems of copyright and reproducibility ing of the scholarship. The actual scholarship in scientific research.8 is the complete software development environ- ment and the complete set of instructions which The Reproducible Research Standard: generated the figures.”6 David Donoho and his Enabling Reproducibility research group have practiced this idea of repro- Computational research produces an entire re- ducible research for the past 15 years with the re- search compendium,9 which comprises lease of the Matlab toolboxes Wavelab (www-stat. stanford.edu/~wavelab), Beamlab (www-stat. •• the research paper—including all the source files stanford.edu/~beamlab), Symmlab (www-stat. from which the manuscript was built (for exam- stanford.edu/~symmlab), and Sparselab (http:// ple, LaTex, Word, or WordPerfect files); sparselab.stanford.edu/). These software packag- •• the data—including documentation completely es let anyone with access to Matlab reproduce fig- describing the data (sources, components, and ures from their articles, inspect the source code, interpretation); a description of how the data change parameters, and access their datasets. This was brought into the form used in the research; isn’t yet a common phenomenon, but researchers the code and instructions used to bring the data into the form used in the research; and docu- The Reproducible Research Standard (RRS) mentation of any code used for data processing; •• the experiment—the code and instructions used encourages scientific research by rescinding the in the experiment, including all source code; documentation of any code used, including aspects of copyright that prevent scientists from pseudocode; a clear listing of the parameters, settings, and operating system dependencies sharing important research information. used to achieve the results described in the pa- per; and a clear description of the experimental methodology; are increasingly proceeding this way, including •• the results of the experiment—any figures, data, Jalal Fadili at ENSI Caen (www.greyc.ensicaen. or the like produced from the experiment; any fr/~jfadili/software.html) and scientists at the Au- illustration source files; and documentation and dio Visual Communications Lab (LCAV) at Ecole explanation of the processing of the experimen- Polytechnique Fédérale de Lausanne, (www. tal results; and lcavwww.epfl.ch/reproducible_research/). •• any auxiliary material—materials used for presen- A tenet of the scientific method holds that ev- tation on the Web or an interface to the data or ery research finding should be reproducible before results; and documentation of any auxiliary code. it becomes accepted as a genuine contribution to human knowledge. But how to encourage this? Typically, researchers release only the research One way could be for scientists to use a license paper, which is all that traditional journals usu- that covers their entire research—for example, all ally publish. In contrast, to encourage scientists the components required for reproducibility. As to release their entire compendium, there could computational research becomes more pervasive, be a licensing methodology that applies to every details of the work often remain unpublished, so aspect except the data (discussed in the next sec- opportunity to hide poor scholarship increases. tion) and ensures attribution for any compendi- Without full publication of “a careful description um elements used in derivative scientific research of the methods used, in sufficient detail that oth- (meaning that any papers published using compo- ers can attempt to repeat the experiment,” com- nents of the research compendium must attribute putational research, a key to progress in modern the original author). Encouraging the release of science, could end up undermining the scientific the entire research compendium is the purpose of process and become “the last refuge of the scien- the RRS. I argue this is best done through ensur- tific scoundrel.”7 Two blocks exist to truly repro- ing attribution.

38 Co m p u t i n g in Sc i e n c e & En g i n e e r i n g Because citations are important evidence of im- The RRS Defined pact for scientists and often play a role in hiring The RRS suggests both licenses for the different and promotion decisions, the RRS is consistent aspects of the full research compendium and re- with scientific norms. It can also offset research- leasing data into the public domain. I believe that ers’ fears that parts of their released compendium the appropriate choices in the scientific context will be “stolen” or new publications made without are: attaching CC BY to the compendium’s media attribution. The RRS requires attribution for any components, modified BSD to code components, part of the upstream compendium used in deriva- and the Science Commons Database Proto- tive research and those (US) components of the col (http://commons.org/projects/publishing/ downstream research carry the license. This is less open-access-data-protocol) to the data if scientists restrictive than the GPL’s Share Alike component choose to release their data to the public domain. or some of the Creative Commons licenses in that The goal is to encourage scientists to release all the RRS doesn’t require all comingled works to components of their research, through viral at- carry the same license. One goal of the RRS is to tribution to help ensure that researchers receive ensure research attribution in any derivative com- credit for their work and provide a mechanism pendium. The reason for not requiring the en- defining and promoting the idea of full compen- tire derivative compendium to carry the RRS, as dium release. would be required by the Share Alike component, As an umbrella licensing algorithm, the RRS is to encourage scientific research that builds on is easier to use than the alternative—each time previous research without restriction and to re- scientists release scholarship, they would have to main consistent with the scientific ethos of attri- fashion a combination of licenses from a spec- bution solely for work done. trum of choices or accept the default full copy- right status of their work. A corollary benefit to Data Under the RRS the RRS’s relaxation of the Share Alike com- Raw data aren’t copyrightable, and thus it’s ponent is that it becomes easier for industry to meaningless to apply a copyright rescinding li- employ research as part of a technology without cense to them. However, original selection and ar- having all the (possibly) proprietary work come rangement of the data are copyrightable, as are under the RRS. the original metadata associated with dataset production such as documentation, arrangement What Does this Mean explanations, or data cleaning.10,11 Thus, the for Scientific Researchers? RRS can rescind copyright from these aspects of Rescinding copyright requires scientists to take the data process. active steps in securing the appropriate license A license that applies to a database’s selection for their compendia, but they can easily do this and arrangement, in a virally attributive way, by placing a notification on their Web pages that can encourage scientists to release the datasets the compendia are under the RRS. They can also they’ve compiled by providing a legal framework add machine-readable tags to the HTML that for attribution. A license that would protect their hosts the compendium’s components to facilitate claim to authorship is an important tool to as- machine readability of attribution and other re- suage researchers’ concerns about loss of attri- search facets, simplifying search and compendia bution and provide for greater transparency in element attribution.12 dataset construction. Copyright doesn’t attach to data, so it doesn’t An adequate licensing structure doesn’t exist that make sense to attach a license rescinding copy- intentionally applies to the structures that house the right to the data themselves. However the data used. Although the raw facts aren’t copyright- RRS is a compilation of other licenses and can able, often researchers put a phenomenal amount of treat individual compendium elements sepa- work into dataset preparation for research. Precisely rately. This means, for example, that scientists how researchers generated or gathered the data, any wouldn’t have to release private medical data, processing they did to clean or verify them, and the but they can still apply the appropriate license data’s current layout are all vital pieces of informa- to release the other aspects of their work from tion for a scientist to reproduce or understand the copyright. Because the RRS is an amalgamation final result. The fact that the RRS emphasizes the of commonly used existing licenses, there’s no importance of transmitting dataset construction license proliferation with its introduction, and details dovetails neatly with Claerbout’s aspirations it’s as compatible with other licenses as its com- of really reproducible research.8 ponent licenses.

Ja n ua r y/Fe b r u a r y 2009 39 omputational research is at a turn- required the release of research compendia that ing point—will it embrace the sci- qualify under the RRS. This would satisfy the entifi c values of reproducibility and requirement for researchers to make their pub- verifi ability? In such a young fi eld, licly funded work available to the public as well as Can immediate opportunity exists to set standards establish important structures and standards for for quality and replicability in our work, but at the nature of computational research. the same time, we face the problem of working within a copyright structure that wasn’t designed acknowledgments with scientifi c research in mind. The RRS is, in I’m very grateful for invaluable discussion with David part, a tool to explain the meaning of reproduc- Donoho, Danny Hillis, Larry Lessig, , ible research in the computational sciences and, Wendy Seltzer, and Melanie Dulong de Rosnay. I as a result, help communicate scientifi c standards also thank for his comments at the Yale for acceptable practice. The RRS is also an im- Law School CyberScholar Series on 22 April 2008. Of portant tool for encouraging scientifi c research course, any errors are mine alone. by rescinding the aspects of copyright that pre- vent scientists from sharing important research references information. If the scientifi c community adopts 1. D. Donoho, “How to Be a Highly Cited Author in the Math- the license, we can solve the dual problems of ematical Sciences,” in-cites, Mar. 2002; www.in-cites.com/ scientists/DrDavidDonoho.html. standards for computational science and ensure 2. P. Vandewalle et al., “Experiences with Reproducible that the scientifi c ethos in communicating and Research in Various Facets of Signal Processing Research,” disseminating our work continues. A step for- Proc. IEEE Conf. Acoustics, Speech, and Signal Processing, vol. ward for the computational science fi eld would 4, 2007, pp. 1253–1256. be if grant-giving agencies, such as the NSF, 3. E. von Hippel, Democratizing Innovation, MIT Press, 2005, p. 85. 4. Division of Science Resources Statistics, “Survey of Research and Development Expenditures at Universities and Colleges, FY 2006,” US Nat’l Science Foundation, Sept. 2007; www. nsf.gov/statistics/infbrief/nsf07336. 5. “Grant General Conditions (GC-1) #38—Sharing of Find- ings, Data, and Other Research Products,” US Nat’l Science Foundation, 1 June 2007; www.nsf.gov/pubs/policydocs/ gc1_607.pdf. 6. J. Buckheit and D. Donoho, Wavelab and Reproducible Re- search, tech. report, Dept. of Statistics, Stanford Univ., 1995. 7. R. LeVeque, “Wave Propagation Software, Computational Science, and Reproducible Research,” Proc. Int’l Congress Mathematicians, M Sanz-Sole et al., eds., Aug. 22–30, 2006, p. 1227–1254 IEEE Annals of the History 8. V. Stodden, Licensing in Scientifi c Research, Stanford Law School, Nov. 2007. of Computing is an active 9. R. Gentleman and D. Lang, “Statistical Analyses and Repro- ducible Research,” J. Computational & Graphical Statistics, center for the collection vol. 16, no. 1, 2007 , p. 1–23. 10. Feist Publications, Inc., v. Rural Telephone Service Company, and dissemination of 499 U.S. 340, 1991, pp. 363–364. 11. M. Bitton, “A New Outlook on the Economic Dimension information on historical of the Database Protection Debate,” IDEA: The Intellectual Property Rev., vol. 47, June 2006, p. 93. projects and organizations, 12. H. Abelson et al., “ccREL: The Creative Commons Rights Expression Language,” Mar. 2008; http://wiki.creative oral history activities, and commons.org/images/d/d6/Ccrel-1.0.pdf. international conferences. victoria stodden is a research fellow at the Berkman Center for Internet and Society at Harvard Universi- ty. Her current research includes understanding how www.computer.org/ new technologies and standards affect societal decision-making and welfare. Stodden has annals a PhD in statistics from Stanford University and an MLS from Stanford Law School. Contact her at vcs@ stodden.net; http://blog.stodden.net.

40 Co m p u t i n g in SC i e n C e & en g i n e e r i n g