Noname manuscript No. (will be inserted by the editor)

Dataverse 4.0: Defining Data Publishing

Merc`eCrosas

Received: date / Accepted: date

Abstract The research community needs reliable, stan- landscape is tremendously different. In 1665, the small dard ways to make the data produced by scientific re- amounts of data accompanying a research claim could search available to the community, while getting credit be in most cases easily described and included in the as data authors. As a result, a new form of scholarly publication in the form of drawings or tables. Now, the publication is emerging: data publishing. Data pub- vast amounts of digital data associated with most re- lishing - or making data long-term accessible, reusable search claims cannot any longer be printed or part of and citable - is more involved than simply providing the paper publication, and are usually stored in a lap- a link to a data file or posting the data to the re- top, server or the cloud. To continue making scholarly searchers web site. In this paper, we define what is publishing inclusive of all research products, as well as needed for proper data publishing and describe how the transparent and verifiable, we need access to those data. open-source Dataverse software helps define, enable and This has led to a new field within scholarly publication, enhance data publishing for all. data publishing, which is necessary and possible due to the technology advances of the last decades. Keywords research data · data publishing · reposito- ries · · Dataverse Here we propose a set of features that help define and enable data publishing. The first three are the pil- 1 Introduction lars of data publishing: 1) information about the data (metadata) to find, understand and reuse them, 2) a Three hundred fifty years ago (at the time of writing formal data citation to reference and find the data set, this paper), when the world’s first scientific journal was and 3) a trusted data repository to host the data and published (the first volume of the Philosophical Trans- provide long-term access. The next three are required in actions of the Royal Society of London in 1665, [1]), some contexts: 4) support not only for public but also data associated with the scientific findings reported in restricted data, 5) support for data publishing work- an article were usually included in the printed publica- flows and versioning of data sets, and 6) support for tion. The Royal Society’s motto was “Nullius in Verba” Application Programming Interface (API) to deposit (or “Take Nobody’s Word for it”), a motto aligned with and get data and metadata for interoperability and ex- its mission of informing society of the latest scientific tensibility. discoveries with clarity and transparency, and aimed at reproducibility of all claims through facts from ex- These six features are described in this paper in the periments or observations ([2]). The mission of schol- context of the Dataverse software project, and in partic- arly publishing today has not changed much, but the ular, as part of the latest version of the software (v 4.0), which solidifies data publishing with rigorous, but yet M. Crosas Harvard University flexible, publication workflows and extends support for Institute for Quantitative Social Science research data across all domains. The features are gen- 1737 Cambridge Street, Cambridge eralizable and can be applied to data publishing with E-mail: [email protected] other repository frameworks. 2 Merc`eCrosas

2 The Dataverse Software as a Data Publishing To understand metadata in a Dataverse repository, Framework first it is important to understand the hierarchical orga- nization of dataverses, datasets and data files. A data- The Dataverse project started in 2006 ([3], [4]) at the verse is a container of datasets, or even of other data- Harvards Institute for Quantitative Social Science ([5]) verses. A dataset is a container of metadata and files as an open-source software application to share, cite - often data files, accompanied by documentation and and archive data. From its beginnings, Dataverse (then code files needed to interpret the data. Dataverse de- referred as the Dataverse Network) has provided a ro- fines three levels of metadata for a dataset: 1) general bust infrastructure for data stewards to host and archive descriptive metadata (which could be referred as cita- data, while offering researchers an easy way to share tion metadata); 2) domain-specific metadata; and 3) and get credit for their data. Since then, thousands file-level metadata. of researchers have taken advantage of Dataverse to The first level, general metadata, includes citation archive and promote their data, making them accessi- information and applies to all datasets, independent of ble to millions of users. There are currently more than the type of data or discipline. The second level may ten Dataverse repositories installations hosted in insti- include metadata for social science data (based on the tutions around the world. These repositories are using Data Documentation Initiative [7]), metadata for life the Dataverse software in a variety of ways, from sup- sciences data (based on the ISA-Tab data descriptor porting existing large data archives to building insti- [8]), and metadata for astronomy data (based on the tutional or public repositories for research data. One Virtual Observatory metadata schema [9]), among oth- of these Dataverse repositories is the Harvard Data- ers. Even though following standards for each domain verse, which is open to all researchers across all re- community is strongly recommended, when a research search fields to submit and publish their own data, domain does not support a metadata standard, Data- in a collaboration between the Harvard Library, Har- verse can be extended with a custom set of metadata vard University Information Technology and the In- fields for that domain. stitute for Quantitative Social Science. The Harvard Finally, file-level metadata are specific to each data Dataverse alone hosts more than 1000 dataverses (con- type. For example, for tabular data files, the file-level tainers of datasets) owned and managed by either indi- metadata are generated automatically by Dataverse from vidual scholars, research groups, organizations, depart- data file formats from statistical packages (such as SPSS, ments or journals [6], with a total of more than 58,000 STATA and R) which include rich metadata. These file- datasets. The Harvard Dataverse has so far served more level metadata include information about each column than a million data downloads, allowing researchers around (or variable) in the data file. This is important for sev- the world to reuse those data, re-analyze or merge them eral reasons: 1) it allows deep search on the data file, 2) to obtain new findings, or verify previous work. the column-level metadata enable other applications to While the Dataverse project started from the social understand the data file, and have sufficient informa- sciences for the social sciences, it has now expanded tion about the contents and characteristics of each col- to benefit a wide range of scientific disciplines and do- umn (or variable) to enable visualization, exploration, mains, leveraging our progress in the social science do- or analysis of the data, and 3) separating column-level main to define and enhance data publishing across all metadata from the data values allows to re-format the research communities. As part of the new Dataverse data files to other formats, independent of the software release (version 4.0), we have evaluated the features that they were originally generated. Another example is needed in data publishing so data can be properly shared, astronomy data files (in Flexible Image Transport Sys- found, accessed and reused. These features are described tem or FITS format), for which metadata about the ob- below. servatory, the object and wavelength and other informa- tion about the observation are obtained automatically from the header of the file. Dataverse currently does not 3 Towards Standardizing Data Publishing extract much metadata (except for basic information about the size and mime-type of the file) or re-format 3.1 The three Pillars of Data Publishing other data types. However, rich file-level metadata can be extended to other data types, based on the specifics 3.1.1 Metadata of the software or instrument that generates the data files. Dataverse 4.0 has made significant accommodations to Metadata are also essential for discoverability. That enable the usage of rich metadata. is why Dataverse indexes all its metadata. As we will Dataverse 4.0: Defining Data Publishing 3 see below, in some cases, data files must be restricted data citation. It took another twenty years for metadata and only accessible when granted permissions by the and bibliographic standards to start including support data owner. Even in those cases, the dataset should be for datasets, when, in 1979 MARC records were used for discoverable and their should be sufficient information data and the American Standard for Bibliographic Ref- to be able to decide whether to pursue data access. erencing included data type. However, it was not until Metadata are the descriptions of the data, but where the last decade that data citation started to have some do metadata end and data begin? Sometimes there is weight, driven by the increase in size and number of not a clear divide. What is clear though is that data data repositories in the last decade, and emergent stan- values (numbers and characters, for example) without dards for data citation ([10]). These initial efforts cul- metadata have little value. Access to columns and rows minated with the Joint Declaration of Data Citations of numbers without knowledge of what the columns Principles in 2014 ([14]), eight principles synthesized and rows refer to has little value; access to an image by more than twenty-five organizations that state the from an instrument without knowledge of what is be- importance of data citation and propose guidelines for ing observed has little value. That is, without context publishers and data repositories to implement data ci- describing the dataset - what it is about, how it was tation. As a result of these principles, the international collected, from what area in the world, or organism in standard NISO - Journal Article Tag System has now nature or place in the sky it came - the dataset is not included support for citing data ([15]). Thus, publish- interpretable. ers and journals can now properly categorize citations in the bibliographic list as data citations, and citation 3.1.2 Data Citation metrics systems such as crossRef can differentially iden- tify data citations. From its beginning, Dataverse has automatically gen- erated a data citation for each dataset deposited in 3.1.3 Trusted Data Repositories the repository. The citation format is based on Alt- man and King’s proposed standard ([10]), and in Data- To support data publishing, it is not sufficient that a verse 4.0 its implementation has evolved to be compli- persistent identifier is given to the data set. It is nec- ant with the Joint Declaration of Data Citations Princi- essary to have a responsible entity that ensures that ples ([11]). Compliance means that the citation created the persistent identifier will link (usually referred as re- by the Dataverse: solve) to a page in a repository from which the data can be accessed. The responsible entity behind the repos- – provides attribution to the data authors and providers; itory needs to ensure that the data will be accessible – includes a persistent identifier (such as a Digital Ob- (although not necessarily available for immediate ac- ject Identifier or DOI, register in DataCite [12]) that cess) or otherwise should provide information about links to a landing page in the repository where there the status and accessibility of the data. The responsible is information about the dataset and how the data entity behind the repository also needs to ensure that can be accessed; the metadata and data will still be readable over time, – specifies the version of the dataset referenced, and or allow to export them to another repository, when that version can be accessed in the repository even that entity cannot be anymore responsible of the repos- when a more recent version is available; itory. Certifications, such as Data Seal of Approval, fa- – persists in the repository even in the rare case that cilitate to ensure trustworthiness. The Harvard Data- the data themselves are not available anymore (for verse, as well as other Dataverse repositories such as example, the data files have to be deaccessioned for the ODUM Dataverse at the University of North Car- legal issues); olina, are trusted repositories either by being compliant – and finally includes metadata in commonly used in- to preservation certifications or providing well-defined terchangeable formats (XML, JSON) so that the in- preservation policies. formation can be easily accessed by other systems.

The concept of data citation is not a new one ([13]). 3.2 Additional Features for Enhanced Data Publishing Scholarly citation was formally defined in 1906 by the Chicago Manual Style to give credit and attribution to 3.2.1 Public versus Restricted Data the creator and distributor of a published work. Fifty years later the first digital data repositories (such as The Dataverse project and its community support and the World Data Center and ICPSR, refs) realized that encourage easy access to data, and when possible, datasets scholarly tradition did not serve the emerging needs for should be made public to the world. However, data can 4 Merc`eCrosas be complex and come with associated sensitive infor- Dataverse, it cannot be unpublished2. This is necessary mation or agreements that make it difficult to publish because a published dataset has a public data citation them under full open access conditions. For this rea- with a persistent identifier (a DOI, in this case) reg- son, Dataverse enables use of the data while maintain- istered and released to the world. At that point, the ing compliance with such constraints. At the most open dataset can be cited by any other published work, and end of the spectrum, datasets can be made fully pub- therefore the citation needs to be able link to a public lic, under the Creative commons CC0 data waiver, with page with information about the dataset. Unpublishing both metadata and files public. When desired by data the dataset would break that link. owners or required by prior agreements, datasets can Unlike traditional scholarly published articles, pub- be public, including public metadata and data files, but lished datasets are not necessarily static. Data often with terms that define the conditions under which the change with time, new versions being created, with pub- data can be used. In this case, a CC0 waiver cannot be lished data files becoming outdated, and new data files applied to the dataset. At the most restrictive end, the being made available. Dataverse supports versioning of data files can be private, and access may be granted datasets, so after a dataset is published, a new version to collaborators or those who apply for access to the can be published, and the old version gets archived. The data. In this case, even though the files are private, data citation for that dataset includes the version, and the dataset can still be published under a CC0 waiver an old version can still be accessed from the dataset because CC0 allows any form of access to the data, page. although requires no restrictions on how the data are reused. For all cases, even when data files are private, 3.2.3 APIs and Interoperability with Data Applications once a dataset is published in Dataverse, the metadata are public1. This is necessary so when the dataset is ref- Dataverse 4.0 introduces a rich Application Program- erenced using the data citation, sufficient information ming Interface (API) for search, accessing metadata is provided to determine that the the dataset exists and and data and depositing data. In addition, the new how it may be accessed. Dataverse native API allows to perform programati- At the time of writing this paper, Dataverse is in cally any function that can be done through the user the process of integrating with the DataTags system to interface. Dataverse currently integrates with other ap- facilitate sharing sensitive data with confidence ([16], plications through the data deposit API, for example [17]). DataTags defines a set of labels to standardize the Framework ([18]) and the Open Jour- handling, storing and access of sensitive data. Each nal System ([19]). In this case, authors using those ap- DataTag lives with its dataset as a policy or set of re- plications have the option to deposit and publish their quirements that the repository must enforce. It also in- data directly into a Dataverse repository, and a perma- cludes datatagging tools, legal modeling and data redac- nent link between their project or paper with the data tion tools. DataTags will make possible in the future to gets established. easily publish sensitive research datasets in a Dataverse The basic features for a comprehensive data pub- repository. lishing software application are covered above, but data publishing can be enhanced further with features that 3.2.2 Publishing Workflows and Versioning enable data exploration, analysis and visualization. Data- verse integrates with two applications thus far for that Finally, a comprehensive data publishing software plat- purpose: 1) TwoRavens ([20]), a statistical data explo- form needs to support workflows for data deposit, peer ration and analysis web tool that integrates with R and review, curation and eventually dissemination to the the Zelig statistical package ([21]) to provide descrip- world. Dataverse supports this entire workflow, allow- tive statistics and access to a wide range of statistical ing different set of users to contribute at different stages models to any tabular data file in Dataverse, and 2) in the process based on their role (from original data WorldMap, an application that allows to build and vi- author, to a reviewer, to an editor or archivist). Data- sualize maps from a geospatial data file ([22]). verse also supports a workflow that allows authors to Dataverse will continue integrating with other sys- self-curate and publish data on their own. That is, an tems, as it expands its API to deposit, find and access author can deposit the data to her own dataverse, re- data. view the draft dataset and, when ready, publish the dataset to the world. Once a dataset is published in 2 Once published, a dataset can only be retracted or deac- cessioned in rare cases when there might be a conflict or legal 1 File-level metadata might need to be kept private in some issues with the data, and the repository cannot continue host- cases if it contains sensitive information ing the data) Dataverse 4.0: Defining Data Publishing 5

4 Conclusions Christoph Steinbeck, Anne Trefethen, Bryn Williams- Jones, Katherine Wolstencroft, Ioannis Xenarion, and Win- Data publishing is a reality of ston Hide. (2012) Toward interoperable bioscience data. Na- ture Genetics 44, 121126. doi:10.1038/ng.1054 in the 21st century. The Dataverse open-source project 9. Hanisch, R. (2007) Resource Meta- in the last decade has enabled researchers, journals and data for the Virtual Observatory. institutions to share and archive their data, while keep- http://www.ivoa.net/documents/latest/RM.html ing control about who and how to share them with and 10. Altman, M, and G. King. (2007) . A proposed standard for the scholarly citation of quantitative data. D-lib Maga- getting credit through data citations. zine 13.3/4. With Dataverse 4.0, the Dataverse repository soft- 11. Starr J, Castro E, Crosas M, Dumontier M, Downs RR, ware defines data publishing by supporting rigorous Duerr R, Haak LL, Haendel M, Herman I, Hodson S, publishing workflows (for example, a dataset cannot Hourcl J, Kratz JE, Lin J, Nielsen LH, Nurnberger A, Proell S, Rauber A, Sacchi S, Smith A, Taylor M, Clark T. be unpublished once published, and data can be peer- (2015) Achieving human and machine accessibility of cited reviewed before publication), while providing flexibility data in scholarly publications. PeerJ Computer Science 1:e1 to support heterogeneous data across scientific domains https://dx.doi.org/10.7717/peerj-cs.1 with rich and extensible metadata. 12. DataCite, (2013). DataCite Metadata for the Publication and Citation of Research Data doi:10.5438/0008 13. Altman M, Crosas M. 2013. The evolution of data cita- Acknowledgements I thank Gary King, director of the In- tion: from principles to implementation. IAssist Quarterly stitute for Quantitative Social Science (IQSS) for the initial (Spring)62-70 vision of and continuous support for the Dataverse project, 14. Data Citation Synthesis Group, (2014). and the IQSS Data Science team who made Dataverse 4.0 pos- Joint Declaration of Data Citation Principles, sible (in alphabetical order): Leonid Andreev, Sonia Barbosa, http://www.force11.org/datacitation Michael Bar-Sinai, Eleni Castro, Christine Choirat, Kevin 15. National Information Standards Organization. 2012. Condon, Vito D’Orazio, Gustavo Durand, Phil Durbin, Michael ANSI/NISO Z39.96-2012 JATS: Journal Article Tag Suite. Heppler, James Honaker, Ellen Kraffmiller, Steve Kraffmiller, 978-1-937522-09-4 (HTML) 978-1-937522-10-0 (PDF) Dwayne Liburd, Raman Prasad, Elizabeth Quigley, Robert 16. Crosas M, Honaker J, King G, Sweeney L. 2015. Treacy. I also thank S¨unjeDallmeier-Tiessen for her many Automating Open Science for Big Data. ANNALS of useful comments and suggestions. the American Academy of Political and Social Science, 659(1):260-273 17. Sweeney L, Crosas M, and Bar-sinai M. Sharing Sensitive References Data with Confidence: the DataTags System. Forthcoming 18. Nosek, B. and Yoav Bar-Anan. 2012. Scientific Utopia: I. Opening Scientific Communication. Psychological Inquiry: 1. Phil. Trans. 1665 1 rstl.1665.0001; An International Journal for the Advancement of Psycho- doi:10.1098/rstl.1665.0001 logical Theory, Volume 23, Issue 3, 2012 2. Cavendish, R (2012) The Royal Society’s First Charter. 19. Castro E, Garnett A. Building a Bridge Between Jour- History Today Volume 62 Issue 7 July 2012 nal Articles and Research Data: The PKP-Dataverse Inte- 3. King, G. (2007). An Introduction to the Dataverse Net- gration Project. International Journal of Digital Curation. work as an Infrastructure for . Sociological 2014;9(1):176-184. Methods and Research 36 (2): 17399. 20. Honaker, James, and Vito DOrazio. 2014. Statistical 4. Crosas, M. (2011). The Dataverse Network: An Open- modeling by gesture: A graphical, browser-based statisti- Source Application for Sharing, Discovering and Preserving cal interface for data repositories. In Extended Proceedings Data. D-Lib Magazine 17 (12). of ACM Hypertext. New York, NY: ACM. 5. King, G. (2014). Restructuring the Social Sciences: Reflec- 21. Choirat, Christine, James Honaker, Kosuke Imai, Gary tions from Harvards Institute for Quantitative Social Sci- King, and Olivia Lau. 2015. Zelig: Everyones statistical soft- ence. PS: Political Science and Politics 47, no. 1: 165-172. ware. Available from http://zeligproject.org. 6. Crosas, M. (2013). A Data Sharing Story. Journal of 22. Center for Geographic Analysis, eScience Librarianship 1 (3):17379. http://worldmap.harvard.edu 7. Data Documentation Initiative. 2012. Data documentation initiative specification. http://www.ddialliance.org/Specification 8. Susanna-Assunta Sansone, Philippe Rocca-Serra, Dawn Field, Eamonn Maguire, Chris Taylor, Oliver Hofmann, Hong Fang, Steffen Neumann, Weida Tong, Linda Amaral- Zettler, Kimberly Begley, Tim Booth, Lydie Bougueleret, Gully Burns, Brad Chapman, Tim Clark, Lee-Ann Cole- man, Jay Copeland, Sudeshna Das, Antoine de Daru- var, Paula de Matos, Ian Dix, Scott Edmunds, Chris T Evelo, Mark J Forster, Pascale Gaudet, Jack Gilbert, Ca- role Goble, Julian L Griffin, Daniel Jacob, Jos Kleinjans, Lee Harland, Kenneth Haug, Henning Hermjakob, Shan- nan J Ho Sui, Alain Laederach, Shaoguang Liang, Stephen Marshall, Annette McGrath, Emily Merrill, Dorothy Reilly, Magali Roux, Caroline E Shamu, Catherine A Shang,