Urobe: a Prototype for Wiki Preservation
Niko Popitsch, Robert Mosser, Wolfgang Philipp
University of Vienna, Faculty of Computer Science

© 2010 Austrian Computer Society (OCG).

ABSTRACT

More and more information that is considered for digital long-term preservation is generated by Web 2.0 applications like wikis, blogs or social networking tools. However, there is little support for the preservation of these data today. Currently they are preserved like regular Web sites, without taking the flexible, lightweight and mostly graph-based data models of the underlying Web 2.0 applications into consideration. By this, valuable information about the relations within these data and about links to other data is lost. Furthermore, information about the internal structure of the data, e.g., expressed by wiki markup languages, is not preserved entirely.

We argue that this currently neglected information is of high value in a long-term preservation strategy for Web 2.0 data and describe our approach for the preservation of wiki contents, which is based on Semantic Web technologies. In particular, we describe the distributed architecture of our wiki preservation prototype (Urobe), which implements a migration strategy for wiki contents and is based on Semantic Web and Linked Data principles. Further, we present a first vocabulary for the description of wiki core elements, derived from a combination of established vocabularies/standards from the Semantic Web and digital preservation domains, namely Dublin Core, SIOC, VoiD and PREMIS.

1 INTRODUCTION

Users of Web 2.0 applications like wikis, blogs or social networking tools generate highly interlinked data of public, corporate and personal interest that are increasingly considered for long-term digital preservation. The term Web 2.0 can be regarded as referring to "a class of Web-based applications that were recognized ex post facto to share certain design patterns", like being user-centered, collaborative and Web-based [6]. Web 2.0 applications are usually based on flexible, lightweight data models that interlink their core elements (e.g., users, wiki articles or blog posts) using hyperlinks and expose these data on the Web for human consumption as HTML.

The current practice for the preservation of these data is to treat this layer like a regular Web site and archive the HTML representations (usually by crawling them) rather than the core elements themselves. By this, some irrelevant information (e.g., automatically generated pages) is archived, while some valuable information about the semantics of relationships between these elements is lost or archived in a way that is not easily processable by machines (cf. http://jiscpowr.jiscinvolve.org/wp/2009/03/25/arch-wiki/). For example, a wiki article is authored by many different users, and the information about who authored what and when is reflected in the (simple) data model of the wiki software. This information is required to access and integrate these data with other data sets in the future. However, archiving only the HTML version of a history page in Wikipedia makes it hard to extract this information automatically.

Another issue is that the internal structure of particular core elements (e.g., wiki articles) is currently not preserved adequately. Wiki articles are authored using a particular wiki markup language. These simple description languages contain explicit information about the structure of the text (e.g., headings, emphasized phrases, lists and tables). This internal structure is lost to some extent if digital preservation strategies consider only the HTML version of such articles rendered by a particular wiki software, as this rendering step is not entirely reversible in many cases.

In summary, we state that the current practice for the preservation of Web 2.0 data preserves only one particular (HTML) representation of the considered data instead of preserving the core elements of the respective data models themselves. However, we consider these core elements and their relations crucial for future data migration and integration tasks. In the following we introduce our system Urobe, which is capable of preserving the core elements of data created using various wiki software.
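To make the point about markup structure concrete, the following toy Python sketch (not part of Urobe; the MediaWiki-style markup and the regular expressions are purely illustrative) extracts headings and internal links directly from wiki markup. The same information would have to be reverse-engineered, often lossily, from the rendered HTML.

```python
import re

# A MediaWiki-style revision text; structure (headings, internal links,
# emphasis) is explicit in the markup itself.
markup = """== History ==
The article was started by '''Alice''' and links to [[Digital preservation]].
== See also ==
* [[Wiki software]]
"""

# Headings: lines wrapped in '== ... =='
headings = re.findall(r"^==\s*(.+?)\s*==\s*$", markup, flags=re.MULTILINE)

# Internal links: '[[Target]]' or '[[Target|label]]'
links = [m.split("|")[0] for m in re.findall(r"\[\[(.+?)\]\]", markup)]

print(headings)  # ['History', 'See also']
print(links)     # ['Digital preservation', 'Wiki software']
```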
2 UROBE: A WIKI PRESERVATION TOOL

We are currently developing a prototype (Urobe) for the long-term preservation of data created by wiki users. One particular problem when considering wiki preservation is that there is not one single wiki implementation but a large number of different ones (for example, http://www.wikimatrix.org/ lists over 100 popular wiki engines), each using its own wiki markup language. This makes a general emulation approach for preserving wiki contents infeasible, as it would require establishing appropriate emulation environments for each wiki engine. After further analyzing several popular wiki engines, we have identified the following components required for implementing a long-term, migration-based wiki preservation strategy:

1. An abstract, semantic vocabulary/schema for the description of core elements and their relations stored in a wiki, namely: users, articles and revisions, their contents, links, and embedded media.

2. Software components able to extract these data from wiki implementations (a minimal interface is sketched below).

3. A scalable infrastructure for harvesting and storing these data.

4. Migration services for migrating contents expressed in a wiki markup language into standardized formats.

5. Migration services for the semantic transformation of the meta data stored in this system to newer formats (i.e., services for vocabulary evolution).

6. Software interfaces to existing digital preservation infrastructures using preservation meta data standards.

7. An effective user interface for accessing these data.
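As an illustration of components 2 and 3 above, the sketch below shows what a minimal, engine-agnostic extractor interface could look like. The class and method names are hypothetical and are not taken from Urobe's actual code; they merely indicate that each supported wiki engine needs an adapter exposing the same core elements (users, articles, revisions) to a common harvester.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable


@dataclass
class Revision:
    article_title: str
    author: str          # wiki user name
    timestamp: datetime
    markup: str          # raw wiki markup of this revision


class WikiExtractor(ABC):
    """Hypothetical adapter interface: one implementation per wiki engine."""

    @abstractmethod
    def users(self) -> Iterable[str]: ...

    @abstractmethod
    def revisions(self) -> Iterable[Revision]: ...


class MediaWikiExtractor(WikiExtractor):
    """Sketch of an engine-specific adapter (details omitted)."""

    def __init__(self, api_url: str):
        self.api_url = api_url  # e.g., the engine's HTTP API endpoint

    def users(self) -> Iterable[str]:
        raise NotImplementedError  # would call the engine's API here

    def revisions(self) -> Iterable[Revision]:
        raise NotImplementedError


def harvest(extractor: WikiExtractor) -> list[Revision]:
    """A harvester depends only on the abstract interface (component 3)."""
    return list(extractor.revisions())
```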
2.1 Benefits of Semantic Web Technologies

Urobe is implemented using Semantic Web technologies: it extracts data stored in a wiki and archives it in the form of named RDF graphs [4]. The resources and properties in these graphs are described using a simple OWL vocabulary (http://www.w3.org/2004/OWL/). Resources are highly interlinked with other resources due to structural relationships (e.g., article revisions are linked with the user that authored them) but also semantic relationships (e.g., user objects stemming from different wikis that are preserved by Urobe are automatically linked when they share the same e-mail address). We decided to make use of Semantic Web technologies for the representation of preserved data and meta data for the following reasons:

Flexibility. The data model for representing the preserved wiki contents is likely to change over time to meet new requirements, and it is not predictable at present how this data model will evolve in the future. In this context, modelling the data with the flexible graph-based RDF data model seems a good choice to us: migrating to a newer data model can be seen as an ontology matching problem, for which tools and methods are constantly being developed in Semantic Web research [5, 11].

High semantic expressiveness. In order to read and interpret digital content in the future, it is necessary to preserve its semantics. As a consequence of the continuous evolution of data models, knowledge about data semantics disappears quickly if not specified explicitly [11]. To face this problem, we make use of well-defined, standardized Semantic Web vocabularies to define the semantics of our data explicitly.

Existing inference support. Inference makes it possible to find relations between items that were not specified explicitly. By this it is possible to generate additional knowledge about the preserved data that might improve future access to and migration of the data.

Expressive query language. One of the key goals of digital preservation systems is to enable users to re-find and access the data stored in such an archive. This often requires that a preservation storage support complex and sophisticated queries on its data. Data in RDF graphs can be queried using SPARQL, a rich, expressive, and standardized query language that meets these requirements [8].

Furthermore, we decided to publish the archived data as Linked Data [2] in order to exchange them between de-centralized components. Linked Data means that (i) resources are identified using HTTP URIs, (ii) de-referencing (i.e., accessing) a URI returns a meaningful representation of the respective resource (usually in RDF), and (iii) these representations include links to other related resources. Data published in this way can easily be accessed and integrated into existing Linked Data.

This highly flexible data representation can be accessed via the Urobe Web interface and can easily be converted to existing preservation meta data standards in order to integrate it with an existing preservation infrastructure.

2.2 A Vocabulary for Describing Wiki Contents

We have developed an OWL Lite vocabulary for the description of wiki core elements by analyzing popular wiki engines as well as common meta data standards from the digital preservation domain and vocabularies from the Semantic Web domain. The core terms of our vocabulary are depicted in Figure 1. Our vocabulary builds upon three common Semantic Web vocabularies: (i) DCTERMS for terms maintained by the Dublin Core Metadata Initiative (http://purl.org/dc/terms/), (ii) SIOC for describing online communities and their data (http://sioc-project.org/), and (iii) VoiD for describing datasets in a Web of Data (http://vocab.deri.ie/void/).

The vocabulary was designed to be directly mappable to the PREMIS Data Dictionary 2.0 [9]. We have implemented such a mapping and enable external tools to access the data stored in an Urobe archive as PREMIS XML descriptions (in compliance with the Linked Data recommendations, access to these representations is possible via content negotiation). This PREMIS/XML interface makes any Urobe instance a PREMIS-enabled storage that can easily be integrated into other PREMIS-compatible preservation infrastructures.
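To illustrate Sections 2.1 and 2.2, the following sketch uses the rdflib Python library to describe one article revision with DCTERMS and SIOC terms and to query the resulting graph with SPARQL. The namespaces, resource URIs and the particular choice of terms are assumptions made for this example and are not taken from Urobe's actual vocabulary.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import DCTERMS, RDF, XSD

# Hypothetical namespaces/identifiers for illustration only.
SIOC = Namespace("http://rdfs.org/sioc/ns#")
EX = Namespace("http://archive.example.org/wiki1/")

g = Graph()
g.bind("dcterms", DCTERMS)
g.bind("sioc", SIOC)

revision = EX["article/Main_Page/rev/42"]
user = EX["user/alice"]

# Describe one article revision and link it to the account that authored it.
g.add((revision, RDF.type, SIOC.Post))
g.add((revision, DCTERMS.title, Literal("Main Page")))
g.add((revision, DCTERMS.created,
       Literal("2010-03-25T12:00:00", datatype=XSD.dateTime)))
g.add((revision, SIOC.has_creator, user))
g.add((user, RDF.type, SIOC.UserAccount))

# An expressive query over the archived graph (SPARQL, cf. [8]):
# who authored which revision, and when?
q = """
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX sioc:    <http://rdfs.org/sioc/ns#>
SELECT ?rev ?creator ?date WHERE {
    ?rev a sioc:Post ;
         sioc:has_creator ?creator ;
         dcterms:created ?date .
}
"""
for row in g.query(q):
    print(row.rev, row.creator, row.date)
```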