Wikidata for Digital Preservation
Total Page:16
File Type:pdf, Size:1020Kb
Wikidata for Digital Preservation Katherine Thornton Kenneth Seals-Nutt Yale University Yale University [email protected] Euan Cochrane Carl Wilson Yale University Open Preservation Foundation ABSTRACT CCS CONCEPTS The Wikidata knowledge base provides a public infrastructure for • Human-centered computing → Wikis; • Information sys- machine-readable metadata about computing resources. To opti- tems → Digital libraries and archives; mize the Wikidata knowledge base for digital preservation pro- fessionals, we need to create additional structured descriptions KEYWORDS about file formats and software titles. To facilitate this process, we Wikidata, digital preservation, metadata introduce the Wikidata for Digital Preservation portal. This free ACM Reference Format: software portal allows people to browse, and to contribute data to, Katherine Thornton, Kenneth Seals-Nutt, Euan Cochrane, and Carl Wilson. the Wikidata knowledge base. The interface guides users in con- 2018. Wikidata for Digital Preservation. In Proceedings of iPres Conference tributing data in alignment with current data models for the domain (iPres’18). ACM, New York, NY, USA, 10 pages. https://doi.org/ of computing, which are collaboratively created by members of the Wikidata community. 1 INTRODUCTION Structured data about file formats, the many versions of soft- Wikidata is becoming a centralized repository for machine-readable, ware titles, and computing environments, are already available in structured data about resources in the domain of computing. Wiki- Wikidata. The content of Wikidata is licensed under the Creative data allows us to store cross-domain data, and provides a powerful Commons Zero (CC0) license, meaning that anyone can reuse the query interface. data for any purpose. The content in Wikidata is available in more than 350 human languages. The data in Wikidata is FAIR data, and it is linked open data. Our portal provides a streamlined interface designed for the needs of the digital preservation community. When using the Wikidata for Digital Preservation portal, users will be enriching a multilingual repository of data that is open for reuse across institutional boundaries. The vision we share of using Wiki- data as a registry of technical metadata for the domain of computing is a vision of semantic integration of data from multiple sources [6]. We publish the source code for the Wikidata for Digital Preser- vation portal under the Gnu General Public License, v31 to ensure that this tool will remain in the public domain. We aligned our license decision with those of other tools in the Wikidata ecosys- tem, and the enabling software infrastructure of Wikidata itself, to ensure the sustainability of this infrastructure over time. Active developer communities maintain and provision this public infras- tructure, and cultural heritage organizations are able to freely reuse the descriptive and technical metadata for software, file formats, Figure 1: A screenshot of the Wikidata item for Matroska. and all other resources in the domain of digital preservation in hundreds of human languages. Metadata about software, file formats and computing environ- ments is necessary for the identification and management of these 1https://github.com/WikiDP/portal-proto/blob/master/LICENSE.GPL entities. Creating machine-readable metadata about resources in the domain of computing allows digital preservation practitioners to then automate programmatic interactions with these entities. Permission to make digital or hard copies of part or all of this work for personal or People working in digital preservation have a shared need for accu- classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation rate, reusable, technical and descriptive metadata about the domain on the first page. Copyrights for third-party components of this work must be honored. of computing. For all other uses, contact the owner/author(s). The data about resources in the domain of computing within iPres’18, September, Cambridge, MA USA © 2018 Copyright held by the owner/author(s). Wikidata fulfills the functions of a metadata registry [33]. A tech- nical registry in the domain of digital preservation is a data store iPres’18, September, Cambridge, MA USA Thornton et al. of descriptions of file formats, software used to create or interact resources in a flexible way. SPARQL queries enable us to search for with files, configured environments, operating systems and sus- file formats by media type, to search for software titles by their read- tainability factors. The Wikidata community has added structural, able file formats, or to search for software titles published within descriptive and technical metadata about these resources to the specific windows of time. We can use SPARQL queries to search for knowledge base. The combination of these different types of meta- software titles by genre, to search for technical specifications that data in a single repository enables us to query the data in more describe a particular file format, or to search for a digitized copy flexible ways. In Figure 1, there is a screenshot of a Wikidata item of a user guide for a piece of legacy software. Because Wikidata is and the properties used to describe the item. The metadata for file a cross-domain knowledge base, the range of data combinations formats in Wikidata includes information about who developed allow users to query data that span technical metadata as well as the format, how it is related to other formats, and the file format descriptive metadata aspects of these resources. Wikidata is the first signature. We contribute data about the domain of computing to Wikidata because it is multilingual, and anyone can contribute to the project. Many researchers and practitioners in the field of digital preserva- tion have identified the fragmented nature of technical registries [13]. When registry data is stored in multiple systems, communi- cation also takes place in distributed settings. In response to the fragmented landscape, several groups have presented plans for how to centralize and unify technical registry information [14, 27, 28]. Both GDFR and UDFR have concluded work. The sustainability of a technical registry is important set of factors to consider. The creation of a sustainability plan for the infrastructure of the registry is a crucial component to consider. In this paper, we explore an alternative to creating infrastructure for a repository of technical registry data through use of existing, independently supported, infrastructure provided by the knowledge base of structured data, Figure 2: Wikidata item and data organization. Wikidata. information system with a graphical user interface that supports 2 WHY WIKIDATA? public read and write access to the semantic web [4, 9]. This means Wikidata went live in late 2012 [30]. The infrastructure of Wikidata that anyone with information to contribute about file formats, soft- is collaboratively built via commons-based peer production [2, 3, ware, etc. can add that information directly to Wikidata, and it will 16]. Commons-based peer production is the name given to open be available as linked open data for others to use immediately after collaboration systems where users are creating content under the clicking the publish button. This also means that rather than in- agreement that all content will remain in the public domain. This dependently curating this information within our own institutions, means that all of the work products of the community are free to be we can collaboratively curate these resources, which reduces the reused by others. The peer production aspect refers to how users metadata workload for all. coordinate work themselves. Wikidata is edited by volunteers from In our previous work, we described the data modeling process all over the world in more than 350 languages [7]. that we participate in to collaboratively create data models for the In addition to a free software infrastructure, the Wikidata com- domain of computing [29]. The Wikidata for Digital Preservation munity also publishes all content in the knowledge base under a portal mirrors these data models, and we update the portal on a Creative Commons Zero License. The Wikidata community makes monthly basis as new properties get added to Wikidata, or changes dumps of previous versions of the content of the knowledge base are made to the constraint system. available. The infrastructure of the Wikidata knowledge base is maintained by an international community of people. For cultural 3 DATA MODEL heritage institutions who find the structured data in Wikidata use- The Wikidata data model is minimal, consisting of items and proper- ful for work flows, this mans that there will be much less staff ties. Properties are used to express statements about items. In Figure time necessary to design, build and maintain infrastructure for this 2, we see a screenshot of the Wikidata page for entity Q1676669 data. For cultural heritage institutions will limited digital preserva- JPEG File Interchange Format, version 1.02. Each item is tion budgets, this means that they can now access descriptive and allotted a page in Wikidata and has a unique identifier, with prefix technical metadata for tens of thousands software titles and more Q plus a string of numbers, ex. Q1676669, which is assigned to this than two thousand file formats without having to create manage item. The two properties instance of, and is part of, are used or maintain that data locally. to assert statements about the item. A claim and its references are The Wikidata community maintains a public SPARQL endpoint. considered to be a statement. A description of the Wikidata data As of October, 2017 the endpoint was consistently handling 8.5 model can be found in [31]. Sets of properties are not pre-defined for million SPARQL queries per day [20]. Writing SPARQL queries classes in Wikidata. Editors may freely select properties to express for the Wikidata endpoint allows users to search for data about statements on Wikidata items.