for Digital Preservation

Katherine Thornton Kenneth Seals-Nutt Yale University Yale University [email protected] Euan Cochrane Carl Wilson Yale University Open Preservation Foundation ABSTRACT CCS CONCEPTS The Wikidata knowledge base provides a public infrastructure for • Human-centered computing → Wikis; • Information sys- machine-readable metadata about computing resources. To opti- tems → Digital libraries and archives; mize the Wikidata knowledge base for digital preservation pro- fessionals, we need to create additional structured descriptions KEYWORDS about file formats and software titles. To facilitate this process, we Wikidata, digital preservation, metadata introduce the Wikidata for Digital Preservation portal. This free ACM Reference Format: software portal allows people to browse, and to contribute data to, Katherine Thornton, Kenneth Seals-Nutt, Euan Cochrane, and Carl Wilson. the Wikidata knowledge base. The interface guides users in con- 2018. Wikidata for Digital Preservation. In Proceedings of iPres Conference tributing data in alignment with current data models for the domain (iPres’18). ACM, New York, NY, USA, 10 pages. https://doi.org/ of computing, which are collaboratively created by members of the Wikidata community. 1 INTRODUCTION Structured data about file formats, the many versions of soft- Wikidata is becoming a centralized repository for machine-readable, ware titles, and computing environments, are already available in structured data about resources in the domain of computing. Wiki- Wikidata. The content of Wikidata is licensed under the Creative data allows us to store cross-domain data, and provides a powerful Commons Zero (CC0) license, meaning that anyone can reuse the query interface. data for any purpose. The content in Wikidata is available in more than 350 human languages. The data in Wikidata is FAIR data, and it is linked open data. Our portal provides a streamlined interface designed for the needs of the digital preservation community. When using the Wikidata for Digital Preservation portal, users will be enriching a multilingual repository of data that is open for reuse across institutional boundaries. The vision we share of using Wiki- data as a registry of technical metadata for the domain of computing is a vision of semantic integration of data from multiple sources [6]. We publish the source code for the Wikidata for Digital Preser- vation portal under the Gnu General Public License, v31 to ensure that this tool will remain in the public domain. We aligned our license decision with those of other tools in the Wikidata ecosys- tem, and the enabling software infrastructure of Wikidata itself, to ensure the sustainability of this infrastructure over time. Active developer communities maintain and provision this public infras- tructure, and cultural heritage organizations are able to freely reuse the descriptive and technical metadata for software, file formats, Figure 1: A screenshot of the Wikidata item for Matroska. and all other resources in the domain of digital preservation in hundreds of human languages. Metadata about software, file formats and computing environ- ments is necessary for the identification and management of these

1https://github.com/WikiDP/portal-proto/blob/master/LICENSE.GPL entities. Creating machine-readable metadata about resources in the domain of computing allows digital preservation practitioners to then automate programmatic interactions with these entities. Permission to make digital or hard copies of part or all of this work for personal or People working in digital preservation have a shared need for accu- classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation rate, reusable, technical and descriptive metadata about the domain on the first page. Copyrights for third-party components of this work must be honored. of computing. For all other uses, contact the owner/author(s). The data about resources in the domain of computing within iPres’18, September, Cambridge, MA USA © 2018 Copyright held by the owner/author(s). Wikidata fulfills the functions of a metadata registry [33]. A tech- nical registry in the domain of digital preservation is a data store iPres’18, September, Cambridge, MA USA Thornton et al. of descriptions of file formats, software used to create or interact resources in a flexible way. SPARQL queries enable us to search for with files, configured environments, operating systems and sus- file formats by media type, to search for software titles by their read- tainability factors. The Wikidata community has added structural, able file formats, or to search for software titles published within descriptive and technical metadata about these resources to the specific windows of time. We can use SPARQL queries to search for knowledge base. The combination of these different types of meta- software titles by genre, to search for technical specifications that data in a single repository enables us to query the data in more describe a particular file format, or to search for a digitized copy flexible ways. In Figure 1, there is a screenshot of a Wikidata item of a user guide for a piece of legacy software. Because Wikidata is and the properties used to describe the item. The metadata for file a cross-domain knowledge base, the range of data combinations formats in Wikidata includes information about who developed allow users to query data that span technical metadata as well as the format, how it is related to other formats, and the file format descriptive metadata aspects of these resources. Wikidata is the first signature. We contribute data about the domain of computing to Wikidata because it is multilingual, and anyone can contribute to the project. Many researchers and practitioners in the field of digital preserva- tion have identified the fragmented nature of technical registries [13]. When registry data is stored in multiple systems, communi- cation also takes place in distributed settings. In response to the fragmented landscape, several groups have presented plans for how to centralize and unify technical registry information [14, 27, 28]. Both GDFR and UDFR have concluded work. The sustainability of a technical registry is important set of factors to consider. The creation of a sustainability plan for the infrastructure of the registry is a crucial component to consider. In this paper, we explore an alternative to creating infrastructure for a repository of technical registry data through use of existing, independently supported, infrastructure provided by the knowledge base of structured data, Figure 2: Wikidata item and data organization. Wikidata.

information system with a graphical user interface that supports 2 WHY WIKIDATA? public read and write access to the semantic web [4, 9]. This means Wikidata went live in late 2012 [30]. The infrastructure of Wikidata that anyone with information to contribute about file formats, soft- is collaboratively built via commons-based peer production [2, 3, ware, etc. can add that information directly to Wikidata, and it will 16]. Commons-based peer production is the name given to open be available as linked open data for others to use immediately after collaboration systems where users are creating content under the clicking the publish button. This also means that rather than in- agreement that all content will remain in the public domain. This dependently curating this information within our own institutions, means that all of the work products of the community are free to be we can collaboratively curate these resources, which reduces the reused by others. The peer production aspect refers to how users metadata workload for all. coordinate work themselves. Wikidata is edited by volunteers from In our previous work, we described the data modeling process all over the world in more than 350 languages [7]. that we participate in to collaboratively create data models for the In addition to a free software infrastructure, the Wikidata com- domain of computing [29]. The Wikidata for Digital Preservation munity also publishes all content in the knowledge base under a portal mirrors these data models, and we update the portal on a Creative Commons Zero License. The Wikidata community makes monthly basis as new properties get added to Wikidata, or changes dumps of previous versions of the content of the knowledge base are made to the constraint system. available. The infrastructure of the Wikidata knowledge base is maintained by an international community of people. For cultural 3 DATA MODEL heritage institutions who find the structured data in Wikidata use- The Wikidata data model is minimal, consisting of items and proper- ful for work flows, this mans that there will be much less staff ties. Properties are used to express statements about items. In Figure time necessary to design, build and maintain infrastructure for this 2, we see a screenshot of the Wikidata page for entity Q1676669 data. For cultural heritage institutions will limited digital preserva- JPEG File Interchange Format, version 1.02. Each item is tion budgets, this means that they can now access descriptive and allotted a page in Wikidata and has a unique identifier, with prefix technical metadata for tens of thousands software titles and more Q plus a string of numbers, ex. Q1676669, which is assigned to this than two thousand file formats without having to create manage item. The two properties instance of, and is part of, are used or maintain that data locally. to assert statements about the item. A claim and its references are The Wikidata community maintains a public SPARQL endpoint. considered to be a statement. A description of the Wikidata data As of October, 2017 the endpoint was consistently handling 8.5 model can be found in [31]. Sets of properties are not pre-defined for million SPARQL queries per day [20]. Writing SPARQL queries classes in Wikidata. Editors may freely select properties to express for the Wikidata endpoint allows users to search for data about statements on Wikidata items. Wikidata for Digital Preservation iPres’18, September, Cambridge, MA USA

3.1 Items 4.1 Technology Stack The Wikidata for Digital Preservation portal supports the creation First, we make the initial request using WikidataIntegrator (WDI), of items. When a user performs a search in the portal system, they a python module developed by the Su Lab, whose item search en- are presented with an overview of results which they can quickly gine model creates a great basis for searching across any domain2 scan to determine if the item they need already exists in Wikidata. This module has built-in capabilities of returning JSON and Python If a user does not find the item, they can then create a new item dictionaries inside the object at the same time, making WikidataIn- which will be published to Wikidata. tegrator our central source for handling the retrieval, processing, and manipulation of data from Wikidata. We further extend WDI’s searching capabilities by also using the Python module PyWikiBot3 3.2 Properties to retrieve results that WDI might overlook, as the acronym-heavy nature of searching in this domain proves to be quite a challenge for The Wikidata for Digital Preservation portal does not support the tools designed to search any topic. Because the API for Wikidata has creation of properties. This is due to the property proposal process, been linked to PyWikBot, we decided to use Python-Flask4 as the which requires community deliberation within the Wikidata system. web application framework to allow users to interact directly with Any member of the Wikidata community can propose a property for the Wikidata API in a controlled environment. PyWikBot provides creation. The property proposal process involves using a template excellent sandbox functionality that proved to be critical in our to describe the desired property, and then posting it to a specific development from debugging to staging releases. Our approach, wiki page in the Wikidata project where other members of the blending aspects of WDI and PyWikiBot allows us to combine community can read it, provide feedback and commentary, or ask the speed and ease of use of WDI with the extensive processing questions. This coordination and discussion phase is designed to PyWikiBot provides. build agreement among community members as to the meaning and extent of property definitions. As new properties are created 4.2 Enhanced Search Functionality that have relevance for the domain of computing, we add them to the property checklists presented in the Wikidata for Digital We incorporate a second search approach that makes direct calls to Preservation portal. the Wikidata API to provide additional functionality in the search- bar of the portal. For example, in the Wikidata interface, users are currently not able to use Wikidata’s native search to lookup an item by its PRONOM Unique Identifier. Instead, if a user wants tore- 3.3 Portals on top of Wikidata trieve items using PUIDs, they must write their own SPARQL query An established strategy to address consistent data modeling for a and run that query on the Wikidata Query Service5. In order to given domain is the creation of a portal on top of Wikidata, such implement this functionality, we validate the user input alongside as that of the WikiGenomes project [21, 22]. The WikiGenomes a python regular expression for PUIDs that shows our searching portal provides the relevant set of Wikidata entities so that basic re- script that the user is specifically requesting items by this external searchers who would like to engage in data curation can contribute identifier. Once it passes validation, we leverage a SPARQL query via the portal interface. Creating tools that reduce the complexity that returns all items mapped to this PUID value. This query is of knowledge representation, while supporting the work of domain converted into a python class, for which we designed methods al- experts will increase the number of people who can contribute to lowing us to execute future SPARQL queries and process the results data curation for the semantic web [5]. The strategy of creating in the event that we extend retrieval based other common search a domain-specific portal is one approach to increasing the consis- parameters such as MIME-TYPE. Not only does the portal lower tency of how data is being structured, as the portal system design the threshold for users getting familiar with Wikidata, the portal can guide users to contribute data according to the current set of has built-in queries familiar to users of other digital preservation best practices as understood by the developers of the portal. En- databases. couraging the use of the data models that are most frequently used In order to aggregate the many search functions, we return a list in Wikidata will improve data consistency. of up to 10 items, ranked based on commonality across all individual function output and those most similar to the rest of the result set. Not only is is possible for users to click on a search result to be 4 PORTAL ARCHITECTURE redirected to a page showing every statements currently available in Wikidata for that item, but the entire result set stays cached with The Wikidata for Digital Preservation Portal facilitates search across the user’s experience in case there are multiple items with similar the current content in Wikidata. The user experience begins with a labels that the user can contribute to or the user wants to preview simple search field directing the user to search for domain-specific another, similarly labeled item. content with minimal other options on the page, a design choice that we find to be the most consistently digestible introduction into 2Wikidata Integrator is published under the MIT license and is available at https: our application. This initial searching action goes beyond the native //github.com/SuLab/WikidataIntegrator. 3 search functionality built into the API. Instead, we aggregate these The library is published under the MIT license and is available at https://github.com/ wikimedia/pywikibot. result alongside the following methods to ensure a more robust 4http://flask.pocoo.org/ engine for selecting items specific to digital preservationists. 5The Wikidata Query Service is the public SPARQL endpoint for Wikidata. It is avail- able athttps://query.wikidata.org/. iPres’18, September, Cambridge, MA USA Thornton et al.

Figure 3: A screenshot of the Wikidata for Digital Preservation search results page.

4.3 Streamlined Property Selection to find a balance of guiding the user through this step while also still Upon previewing the item and reviewing the set of current state- requiring the data type rules to be integrated with our application. ments, the user can choose to contribute to the item with statements To achieve this goal, we color-code the statements based on the of their own. In the statement construction view of the portal, the data type for the property, and use client-side JavaScript to adapt user has access to several features that make the process more in- the input field whenever a new property is selected. As different tuitive. To begin, we created a property checklist on the sidebar properties require specific data types, the input field handles the that is dynamically filled based on to the selected item’s super-set. validation in each case differently. The most common value type Here, the portal checks if the item is described using either a P279 is “Wikidata item", which means that users often need to enter the Subclass of or P31 Instance of statement which indicates the correct text string for the label used to describe that item or the item’s class membership. For example, the class file format is a correct QID for the item. As memorizing QIDs is not realistic for the superclass for JPEG. Depending on this classification, our checklist majority of users, we created a quick look-up feature. We support presents the user with a predetermined list of appropriate proper- name-to-identifier conversion without leaving the active window. ties. Reducing the 4,000 different properties available in Wikidata to We again point back to our extensive searching capabilities during the 20 most relevant, makes the statement construction approach- this view as well where there is an Item Lookup panel that sends able for a first-time contributor. Furthermore, one area of friction AJAX calls allowing users to make a quick search showing quick for a new user is determining missing properties. While it makes reference descriptions and the item number for efficient access. The sense to contribute all of the information about an item that you user then builds a list of their contributed statements and can write know, in practice it is not feasible to expect users to recall what them directly to the item of their choice upon submission. all they know without some guidance explaining which properties we would like to see used in statements for the item. Thus, in our 4.4 SPARQL queries checklist we also included a count of the number of statements Users of the Wikidata Query Service SPARQL endpoint can request currently made about the item that use the given property and subsets of the data contained in Wikidata that match desired pat- highlight all properties with no instances. At a glance, the user terns. Users can design powerful queries such as the examples in now can see what statements are available as well as a list of what Figure 4 below. statements have yet to be contributed. We reduce the complexity of determining the data type formats for statement values. Currently, each Wikidata property has a data 5 WIKIDATA INTEGRATION WITH type. There are 13 data types in Wikidata, which can require some EXTERNAL REPOSITORIES exploration to fully understand6. For example, some properties Connecting Wikidata items to resources in external databases or require another Wikidata item as their value, while others require systems allows for software agents to discover related content au- a number, URL, string, etc.In Figure 2, the two statements shown tomatically. This allows us to benefit from complementary work have the same datatype, item, because the values need to be the and provides infrastructure for connecting information that was QIDs of other Wikidata items. In order to account for this, we strive previously fragmented across multiple systems. In the domain of software preservation, connecting the work of the Internet Archive 6The full list of data types is available at https://www.wikidata.org/wiki/Special: ListDatatypes. to Wikidata helps more people discover information about legacy Wikidata for Digital Preservation iPres’18, September, Cambridge, MA USA

Figure 4a: A screenshot of a SPARQL query to request software titles that can read .dxf files. Try this query! Code for this query.

Figure 4b: A screenshot of a SPARQL query to request file formats used for 3D data. Try this query! Code for thisquery.

Figure 4c: A screenshot of a SPARQL query to request sequence alignment software with date of publication, programming language and license. Try this query! Code for this query.

Figure 4d: A screenshot of a SPARQL query to request the a list of file formats described by ISO standards. Try this query! Code for this query.

Figure 4e: A screenshot of a SPARQL query to request the file format signature of the .stl file format. Try this query! Codefor this query. iPres’18, September, Cambridge, MA USA Thornton et al. software. The Old School Emulation Center (TOSEC)7 has created Label Wikidata Property detailed descriptions of many legacy computing platforms and pro- Arch package P3454 vides disk images for a wide variety of software titles. TOSEC also Debian package P3442 provides in-browser emulation of a large number of software titles. Fedora package P3463 Through the use of P724 ‘‘Internet Archive ID’’ the pages Gentoo package P3499 dedicated to these resources can be connected to their correspond- Ubuntu package P3473 ing items in Wikidata. Items in Wikidata are connected to external databases or collec- Table 1: A table of properties that indicate the name of a soft- tions through the use of properties that have the data type “external ware package in an external package manager. id”. More than half of all properties in Wikidata are external id prop- erties. Connecting Wikidata items to other resources in this way is a powerful feature allowing us to fulfill the promise of linked open data [10]. By following external id links, users can discover more information about the item of interest. We prioritize connecting to multiple external projects in our curation activities. For software items we add P4107 ‘‘Framalibre ID’’ to con- nect to identifiers from the Framalibre software descriptions cre- ated by Framasoft8. We add P2537 ‘‘Free Software Directory entry ID’’ to connect to identifiers from the Free Software Foun- dation’s wiki describing free software resources9. We add P2749 ‘‘PRONOM software ID" for software titles that have been de- 10 Figure 5: A screenshot of a subset of file formats with PUID, scribed in PRONOM . We also add identifiers for software titles in LOC FDD, and Just Solve different Gnu/Linux package management systems as seen in Table 1. For file formats there are three external repositories that we properties to identify concepts in a growing number of controlled prioritize. We have added P2748 ‘‘PRONOM ID’’ to more than vocabularies and ontologies, which facilitates data integration from 11 1,100 file format items described in the PRONOM repository . disparate sources [4, 9]. More than 2,600 file format items have been connected to Just Solve Wikidata is a multilingual knowledge base, leveraging the con- the File Format Problem through the use of P3381 ‘‘File format cept mappings created through years of conceptual alignment wiki page ID’’. We have used P3266 ‘‘Library of Congress among the different language versions of Wikipedia4 [ ]. This means Format Description Document ID’’ to link to the descriptions that more users will have access to metadata related to the domain of file formats for preservation created by the Library of Congress of computing in their language, an important step in reducing the 12 of the United States as seen in Figure 5. dominance of the English language which disadvantages other As more external identifiers are published to the Wikidata knowl- linguistic communities. edge base it grows in prominence as a cross-switch for identifiers The data contributed to Wikidata is compliant with the FAIR and vocabularies [33]. The Confederation of Open Access Reposi- data principles [32]. By creating data that aligns with the FAIR tories (COAR) emphasized the importance of integrating persistent data principles, we ensure that this metadata is easy to find and identifiers in their 2015 Roadmap18 [ ]. Wikidata is becoming a hub easy to reuse. This technical metadata describing the software and of persistent identifiers17 [ ].As users contribute additional data to file formats all digital preservation professionals must be ableto Wikidata it will become even more valuable. identify and refer to, will be more complete if we distribute our effort. Redundant, fragmented descriptions in siloed repositories 6 FAIR DATA are frustratingly incomplete. Many governmental bodies and in- Long-term preservation and governance of metadata for the domain ternational consortia have endorsed the FAIR data principles as of computing is an important issue for the digital preservation a key aspect of their open science or open data initiatives [15]. community [13, 14]. Centralization of technical metadata for the The data contributed to Wikidata is linked open data 14. Experts domain of computing benefits all creators and users of this metadata. from libraries, archives, museums and technologists of the World The fact that the data in Wikidata’s knowledge base is licensed Wide Web Consortium (W3C) recommend linked data for library under a Creative Commons 0 (CC0) license13 means that anyone metadata published on the web[1]. can reuse this data for any purpose. This means that this data will FAIR is an acronym for findable, accessible, interoperable and be appealing for data consumers. Wikidata contains specialized reusable. Metadata for the domain of computing that we contribute to Wikidata are findable in that Wikidata items are indexed by all 7https://www.tosecdev.org/ large search engines. The Qids assigned to Wikidata items are their 8https://framalibre.org/ unique, persistent identifiers. 9 https://directory.fsf.org/wiki/Main_Page These metadata are accessible because the entity data associ- 10https://www.nationalarchives.gov.uk/PRONOM/Default.aspx 11https://www.nationalarchives.gov.uk/PRONOM/Default.aspx ated with their unique ids (all statements and references asserted 12http://www.digitalpreservation.gov/formats/intro/intro.shtml 13https://creativecommons.org/ 14https://www.w3.org/DesignIssues/LinkedData.html Wikidata for Digital Preservation iPres’18, September, Cambridge, MA USA about an item) are dereferencable via the HTTP protocol. They We conclude that reusing the infrastructure of the Wikidata are interoperable in that they link to many other databases and knowledge base, which has been assigned to the public domain, systems through the collection of external ids as seen in Figure 6. is a compelling model for cultural heritage institutions looking These metadata are reusable due to the use of the CCO license for a centralized repository for metadata. Commons-based peer for the content of Wikidata. Anyone can reuse Wikidata data for any production of infrastructure allows for distributed stakeholders to purpose. Publishing data in the Wikidata knowledge base fulfills collaboratively maintain the infrastructure [3]. The wiki software the most complete degree of FAIRness, level F, “FAIR data, Open is developed by engineers who have agreed to make their work Access, Functionally Linked", as described in [15]. available to all by releasing it under free software licenses. The Wikidata for Digital Preservation Portal provides direct links The Wikidata community maintains a public SPARQL endpoint to items in Wikidata. If a user would like to consult Wikidata to for the knowledge base. This SPARQL endpoint allows users to view additional information related to vocabularies that have been write flexible, powerful queries to retrieve subsets of the data inthe stored, they may consult the item of interest by following the links knowledge base. In contrast, maintaining a public SPARQL endpoint provided in the Portal. is not often feasible for a software or format registry developed within an institution or project context. The Wikidata community 7 DISCUSSION: A CENTRALIZED has enhanced the SPARQL endpoint by providing multiple visualiza- REPOSITORY OF FAIR, LINKED OPEN DATA tion options for the data returned in queries, from bubble charts to graphs of many varieties. The developers who work on the SPARQL Wikidata is growing. We have been participating in Wikidata struc- endpoint have also created an interface that supports users who do turing data in the domain of computing since August, 2016. In the not yet know SPARQL in writing or modifying SPARQL queries17. years of our participation we have seen growth in Wikidata as a whole, and improved data coverage for computing topics. Item 7.2 Collaboration counts as of April 1, 2018 for several classes relevant to digital preservation are presented in Figure 7. The structure of Wikidata allows the crowd to collaborate. A bound- ary object is a tool for thinking that allows people from different communities of practice to use a shared form to bridge the differ- ences in their experiences and effectively collaborate25 [ ]. When multiple boundary objects are used in conjunction they can be- come parts of systems of boundary objects [24]. Star and Bowker introduced the concept of “boundary infrastructure" to theorize about systems of boundary objects. Boundary infrastructure allows for collaboration without consensus [24]. Wikidata, the knowledge base of structured data that anyone can edit, is an example of bound- ary infrastructure that allows people from many communities of practice, from many walks of life, specialists and non-specialists across many domains, to effectively collaborate to structure data, and make it available for reuse. Figure 7: A table of classes in the domain of digital preser- vation with the number of Wikidata items currently in that 7.3 Sustainabilty class. Digital Preservation is a expensive activity for many institutions. Institutions with limited budgets for digital preservation can reuse this data at no cost. The boundary infrastructure of Wikidata pro- Wikidata is a project of Wikimedia Deutschland15 and has been vides a means for digital preservation professionals from different supported by the chapter budget, grant awards and donations. In parts of the world, working in different languages, to collaborate 2016 the announced that it would begin by creating structured data in the knowledge base. This reduces funding the software engineering activity for Wikidata 16. This is a the risk of redundant effort to describe the same file format in strong signal that the infrastructure of Wikidata will continue to numerous local format registries. The boundary infrastructure of be supported in the future. the knowledge base also supports contributions from the crowd, people who have interest in, and information about, the domain 7.1 Infrastructure and Maintenance of computing. This allows for collaborations that otherwise might The Wikidata for Digital Preservation portal can support the work not happen without the boundary infrastructure that facilitates of many users, and only one local developer [8] with knowledge of communication in a community of practice. Wikidata is required. Local developer inspects the data models and Members of the general public will also have access to this in- the infrastructure of Wikidata in order to make recommendations formation. Having this information in an accessible, structured about the user interface and the interaction design of the portal, repository will allow more people to consult it, which could lead effectively articulating work for many users. to people making different computing choices in their lives, for example choosing an open format, which could impact the work of 15https://wikimedia.de/wiki/Hauptseite 16https://blog.wikimedia.org/2016/10/04/supporting-the-future-of-wikidata/ 17https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service iPres’18, September, Cambridge, MA USA Thornton et al.

Figure 6: A screenshot of the Wikidata Item for the Sterolithography file format showing the collection of links to external resources that also describe the format. future generations of digital preservation professionals. Wikidata’s the fact that our portal is a layer of infrastructure on top of Wikidata, CC0 license ensures that this data will have an equalizing force, as there is a risk that users of the portal will not be learning about the it will not be controlled by any single institution, or even any con- infrastructure of Wikidata itself, and thus may not become members sortium of institutions. Anyone with access to the internet will be of the Wikidata community. We view the Wikidata for Digital able to inspect and reuse this data for their own projects or systems. Preservation Portal as part of the larger Wikimedia ecosystem, and Institutions that do not yet have budgets for digital preservation we hope that interest in the infrastructure of this portal will still will have access to this metadata and will not have to recreate it in lead to interest in the infrastructure of Wikidata itself. We would their local systems. also be delighted to see this portal become unnecessary if people decide they would like to edit Wikidata using other tools. 8 LIMITATIONS Wikidata is still a relatively young project [30]. Although there The Wikidata for Digital Preservation Portal is designed to grow are efforts to mass import data from other sources, including Free- and change in response to developments in the larger Wikidata base, there are still gaps in the data [19]. The data coverage for the ecosystem. Thus it is a work in progress. There are several limita- domain of computing is not yet complete, and our vision relies on tions of the system at the time of this writing. The portal interface is uptake of the portal; we do not yet know if it will be accepted by only available in English at the moment. We would like to translate the digital preservation community. the interface into other language in the future. We do our best to guide the user, but it is still possible to con- Not all aspects of editing are possible in the current version of tribute data that is unexpected or uses different semantics. We do the portal. Currently it is not possible to merge items. We have not not yet know how people will use the portal. Through future user yet built in support for adding qualifiers or references. testing we will be able to observe user behavior and adjust future There is a risk that using the portal could hinder a users under- iterations of the interface. standing of the Wikidata community. The Wikidata knowledge base Incorrect data could potentially be added via the portal, just as is a wiki just as the sister projects of the Wikimedia Foundation. is the case in Wikidata itself. However, this does not pose a large There is an active community of practice of people who use the threat to Wikidata. The community quickly responds to errors by wiki affordances, talk pages, requests for comments, the property flagging, and often correcting, such problems. proposal process, and their own user pages to communicate with one another and to articulate the work of collaboratively creating 9 FUTURE WORK Wikidata [11]. We identify two complementary areas for future work: (a) modi- Using the portal may create a barrier to the process of moving fications to and extension of the portal, and (b) work within the from legitimate peripheral participation [12] to active member of Wikidata community such as extensions of the data models in Wiki- this community of practice. One of the dimensions of infrastructure data, and contributing additional structured data to Wikidata itself. introduced by Star and Ruhleder is that infrastructure is learned as The Wikidata for Digital Preservation Portal is designed so that part of membership in a community of practice [23, 26]. Newcomers we can respond quickly to changes to the data models in Wiki- to a community must familiarize themselves with infrastructure data. For example, if a new property is created that is useful for used by the community in order to acculturate themselves. Due to the domain of computing, we can add that property to our list of Wikidata for Digital Preservation iPres’18, September, Cambridge, MA USA supported properties, and make it available for reuse in the portal. Wikidata. To review the source code for this project please visit our We will continue to customize the interface as the property list Github repository20. We appreciate the informative conversations changes and additional qualifiers are created. we had with participants of Wikicite 2016, Wikicite 2017. This work The multilingual aspect of Wikidata allows users to interact is supported by the Council on Library and Information Resources, with Wikidata in their language of preference. By changing the Andrew W. Mellon Foundation, Alfred P. Sloan Foundation, and language of the words used in the framing of the interface to another the Open Preservation Foundation. language, for example replacing the text of the contribute button to contribuer, we could provide a French language version of the REFERENCES portal. With the help of other members of the Wikidata community [1] Thomas Baker, Emmanuelle Bermès, Karen Coyle, Gordon Dunsire, Antoine with language expertise we could extend this approach to additional Isaac, Peter Murray, and M Zeng. 2011. Library linked data incubator group final report. report, W3C Incubator Group, October 25 (2011). languages. [2] Yochai Benkler. 2002. Coase’s Penguin, or, Linux and The Nature of the Firm. Contributing data to Wikidata allows for more entities to be Yale Law Journal (2002), 369–446. [3] Yochai Benkler, Aaron Shaw, and Benjamin Mako Hill. 2013. Peer production: a described. We extend data models through participating in the modality of collective intelligence. Collective Intelligence (2013). property proposal process and engaging in discussions with other [4] Sebastian Burgstaller-Muehlbacher, Andra Waagmeester, Elvira Mitraka, Julia members of WikiProject Informatics. We work to link out to existing Turner, Tim Putman, Justin Leong, Chinmay Naik, Paul Pavlidis, Lynn Schriml, Benjamin M Good, et al. 2016. Wikidata as a semantic framework for the Gene repositories that describe software. We work to increase the number Wiki initiative. Database 2016 (2016), baw015. of controlled vocabularies and ontologies related to computing with [5] Vania Dimitrova, Ronald Denaux, Glen Hart, Catherine Dolbear, Ian Holt, and which Wikidata is interoperable. Our team is also planning to create Anthony G Cohn. 2008. Involving domain experts in authoring OWL ontologies. In International Semantic Web Conference. Springer, 1–16. a generic portal framework that others could fork and customize [6] Biswanath Dutta. 2017. Examining the interrelatedness between ontologies and for additional domains by inserting their own relevant property Linked Data. Library Hi Tech 35, 2 (2017), 312–331. [7] Fredo Erxleben, Michael Günther, Markus Krötzsch, Julian Mendez, and Denny lists. Vrandečić. 2014. Introducing Wikidata to the linked data web. In The Semantic Web–ISWC 2014. Springer, 50–65. 10 CONCLUSION [8] Michelle Gantt and Bonnie A Nardi. 1992. Gardeners and gurus: patterns of cooperation among CAD users. In Proceedings of the SIGCHI conference on Human The Wikidata for Digital Preservation Portal facilitates increased factors in computing systems. ACM, 107–117. communication between members of the Wikidata community and [9] Benjamin M Good, Sebastian Burgstaller-Muehlbacher, Elvira Mitraka, Timothy Putman, Andrew I Su, and Andra Waagmeester. 2016. Opportunities and Chal- the international digital preservation community. lenges Presented by Wikidata in the Context of Biocuration.. In ICBO/BioCreative. Users contributing data to the Wikidata for Digital Preservation [10] Eero Hyvönen. 2012. Publishing and using cultural heritage linked data on the semantic web. Synthesis Lectures on the Semantic Web: Theory and Technology 2, portal contribute in alignment with the data models for the domain 1 (2012), 1–159. of computing without needing to first learn the details of these [11] Dariusz Jemielniak. 2014. Common knowledge?: An ethnography of . models. The interface guides users so that they can contribute Stanford University Press. [12] Jean Lave and Etienne Wenger. 1991. Situated learning: Legitimate peripheral quickly and easily, lowering the barrier to entry into the Wikidata participation. Cambridge university press. community. [13] Peter McKinney, Steve Knight, Jay Gattuso, David Pearson, Libor Coufal, David Centralizing the metadata for the domain of computing, whether Anderson, Janet Delve, Kevin De Vorsey, Ross Spencer, and Jan Hutař. 2014. Reimagining the Format Model: Introducing the Work of the NSLA Digital Preser- technical metadata or descriptive metadata, eliminates redundant vation Technical Registry. New Review of Information Networking 19, 2 (2014), labor of individual institutions creating structured data within their 96–123. [14] Peter McKinney, David Pearson, David Anderson, Jan Hutař, Steve Knight, Libor local systems. When we collaboratively create metadata and publish Coufal, Janet Delve, Jay Gattuso, Kevin DeVorsey, and Ross Spencer. 2014. A it in Wikidata, anyone can reuse it. This allows metadata profession- next generation technical registry: moving practice forward. iPRES 2014: 11th als to focus on the administrative, preservation, and use metadata International Conference on Digital Preservation (2014). [15] Barend Mons, Cameron Neylon, Jan Velterop, Michel Dumontier, Luiz pertinent to their local settings. Olavo Bonino da Silva Santos, and Mark D Wilkinson. 2017. Cloudy, increasingly Infrastructure supported by the Wikimedia foundation, build FAIR; revisiting the FAIR Data guiding principles for the European Open Science and maintained by an active community of tens of thousands of con- Cloud. Information Services & Use Preprint (2017), 1–8. [16] Claudia Müller-Birn, Benjamin Karran, Janette Lehmann, and Markus Luczak- tributors is a new option for cultural heritage institutions. The fact Rösch. 2015. Peer-production system or collaborative ontology engineering effort: that this infrastructure is built in conformance to open standards What is Wikidata?. In Proceedings of the 11th International Symposium on Open Collaboration. ACM, 20. and is comprised of free software which we can inspect means that [17] Joachim Neubert. 2017. Wikidata as a Linking Hub for Knowledge Organization we can audit this system and ensure that we can continue to trust Systems? Integrating an Authority Mapping into Wikidata and Learning Lessons it to store and access our data. for KOS Mappings. In Proceedings of the 17th European Networked Knowledge Organization Systems Workshop co-located with the 21st International Conference on Theory and Practice of Digital Libraries 2017 (TPDL 2017), Thessaloniki, Greece, ACKNOWLEDGMENTS September 21st, 2017. 14–25. http://ceur-ws.org/Vol-1937/paper2.pdf [18] Confederation of Open Access Repositories. 2015. COAR Roadmap Future Direc- This portal was inspired by the Su Lab and their Wikigenomes tions for Repository Interoperability. (2015). https://www.coar-repositories.org/ portal. We thank Wikimedia Deutschland18 for supporting the files/Roadmap_final_formatted_20150203.pdf. [19] Thomas Pellissier Tanon, Denny Vrandečić, Sebastian Schaffert, Thomas Steiner, Wikidata project. We thank the Wikidata community for sharing and Lydia Pintscher. 2016. From freebase to wikidata: The great migration. In thoughts, feedback, and data modeling work. Specifically, would like Proceedings of the 25th international conference on world wide web. International to thank all participants of WikiProject Informatics19 for engaging World Wide Web Conferences Steering Committee, 1419–1428. [20] Lydia Pintscher. [n. d.]. State of the Project. WikidataCon 2017 ([n. with us as we work to describe the domain of digital preservation in d.]). https://upload.wikimedia.org/wikipedia/commons/b/b0/WikidataCon_ 2017-_State_of_the_Project.pdf, year=2017. 18https://wikimedia.de/wiki/Hauptseite 19https://www.wikidata.org/wiki/Wikidata:WikiProject_Informatics 20https://github.com/WikiDP iPres’18, September, Cambridge, MA USA Thornton et al.

[21] Tim E Putman, Sebastian Burgstaller-Muehlbacher, Andra Waagmeester, Chunlei Wu, Andrew I Su, and Benjamin M Good. 2016. Centralizing content and dis- tributing labor: a community model for curating the very long tail of microbial genomes. Database 2016 (2016), baw028. [22] Tim E Putman, Sebastien Lelong, Sebastian Burgstaller-Muehlbacher, Andra Waagmeester, Colin Diesh, Nathan Dunn, Monica Munoz-Torres, Gregory S Stupp, Chunlei Wu, Andrew I Su, et al. 2017. WikiGenomes: an open web application for community consumption and curation of gene annotation data in Wikidata. Database 2017 (2017), bax025. [23] Susan Leigh Star. 1999. The ethnography of infrastructure. American behavioral scientist 43, 3 (1999), 377–391. [24] Susan Leigh Star. 2010. This is not a boundary object: Reflections on the origin of a concept. Science, Technology and Human Values 35, 5 (2010), 601âĂŞ617. [25] Susan Leigh Star and James R Griesemer. 1989. Institutional ecology,translations’ and boundary objects: Amateurs and professionals in Berkeley’s Museum of Vertebrate Zoology, 1907-39. Social studies of science 19, 3 (1989), 387–420. [26] Susan Leigh Star and Karen Ruhleder. 1996. Steps toward an ecology of infras- tructure: Design and access for large information spaces. Information systems research 7, 1 (1996), 111–134. [27] Global Digital Format Registry Team. 2008. Global Digital Format Registry. http://library.harvard.edu/preservation/digital-preservation_gdfr.html. (2008). Visited Nov 16, 2016. [28] Unified Digital Format Registry Team. 2012. Unified Digital Format Registry. http://www.udfr.org/. (2012). Visited Nov 16, 2016. [29] Katherine Thornton, Euan Cochrane, Thomas Ledoux, Bertrand Caron, and Carl Wilson. 2017. Modeling the Domain of Digital Preservation in Wikidata. iPRES 2017: 14th International Conference on Digital Preservation (2017). [30] Denny Vrandečić. 2012. Wikidata: A new platform for collaborative data col- lection. In Proceedings of the 21st International Conference Companion on World Wide Web. ACM, 1063–1064. [31] Wikidata. 2015. DataModel. (2015). https://www.mediawiki.org/wiki/Wikibase/ DataModel [32] Mark D Wilkinson, Michel Dumontier, IJsbrand Jan Aalbersberg, Gabrielle Apple- ton, Myles Axton, Arie Baak, Niklas Blomberg, Jan-Willem Boiten, Luiz Bonino da Silva Santos, Philip E Bourne, et al. 2016. The FAIR Guiding Principles for scientific data management and stewardship. Scientific data 3 (2016), 160018. [33] Marcia Lei Zeng and Jian Qin. 2016. Metadata. American Library Association.