UROBE: A PROTOTYPE FOR WIKI PRESERVATION

Niko Popitsch, Robert Mosser, Wolfgang Philipp
University of Vienna, Faculty of Computer Science

ABSTRACT

More and more information that is considered for digital long-term preservation is generated by Web 2.0 applications like wikis, blogs or social networking tools. However, there is little support for the preservation of these data today. Currently they are preserved like regular Web sites, without taking the flexible, lightweight and mostly graph-based data models of the underlying Web 2.0 applications into consideration. By this, valuable information about the relations within these data and about links to other data is lost. Furthermore, information about the internal structure of the data, e.g., expressed by wiki markup languages, is not preserved entirely.

We argue that this currently neglected information is of high value in a long-term preservation strategy for Web 2.0 data and describe our approach for the preservation of wiki contents that is based on Semantic Web technologies. In particular, we describe the distributed architecture of our wiki preservation prototype (Urobe), which implements a migration strategy for wiki contents and is based on Semantic Web and Linked Data principles. Further, we present a first vocabulary for the description of wiki core elements derived from a combination of established vocabularies/standards from the Semantic Web and digital preservation domains, namely Dublin Core, SIOC, VoiD and PREMIS.

1 INTRODUCTION

Users of Web 2.0 applications like wikis, blogs or social networking tools generate highly interlinked data of public, corporate and personal interest that are increasingly considered for long-term digital preservation. The term Web 2.0 can be regarded as referring to "a class of Web-based applications that were recognized ex post facto to share certain design patterns", like being user-centered, collaborative and Web-based [6]. Web 2.0 applications are usually based on flexible, lightweight data models that interlink their core elements (e.g., users, wiki articles or blog posts) using hyperlinks and expose these data on the Web for human consumption as HTML. The current practice for the preservation of these data is to treat this layer like a regular Web site and archive the HTML representations (usually by crawling them) rather than the core elements themselves. By this, some irrelevant information (like, e.g., automatically generated pages) is archived, while some valuable information about the semantics of relationships between these elements is lost or archived in a way that is not easily processable by machines¹. For example, a wiki article is authored by many different users, and the information who authored what and when is reflected in the (simple) data model of the wiki software. This information is required to access and integrate these data with other data sets in the future. However, archiving only the HTML version of a wiki's history page makes it hard to extract this information automatically.

Another issue is that the internal structure of particular core elements (e.g., wiki articles) is currently not preserved adequately. Wiki articles are authored using a particular wiki markup language. These simple description languages contain explicit information about the structure of the text (e.g., headings, emphasized phrases, lists and tables). This internal structure is lost to some extent if digital preservation strategies consider only the HTML version of such articles rendered by a particular wiki software, as this rendering step is not entirely reversible in many cases.

In summary, the current practice for the preservation of Web 2.0 data preserves only one particular (HTML) representation of the considered data instead of preserving the core elements of the respective data models themselves. However, we consider these core elements and their relations crucial for future data migration and integration tasks. In the following we introduce our system Urobe, which is capable of preserving the core elements of data that are created using various wiki software.

¹ Cf. http://jiscpowr.jiscinvolve.org/wp/2009/03/25/arch-wiki/
² For example, the website http://www.wikimatrix.org/ lists over 100 popular wiki engines.

© 2010 Austrian Computer Society (OCG).

2 UROBE: A WIKI PRESERVATION TOOL

We are currently developing a prototype (Urobe) for the long-term preservation of data created by wiki users. One particular problem when considering wiki preservation is that there exists not one single but a large number of different wiki implementations², each using its own wiki markup language. This is what makes a general emulation approach for preserving wiki contents unfeasible, as it would require establishing appropriate emulation environments for each wiki software. After further analyzing several popular wiki engines, we have identified the following required components for implementing a long-term, migration-based wiki preservation strategy:

1. An abstract, semantic vocabulary/schema for the description of core elements and their relations stored in a wiki, namely: users, articles and revisions, their contents, links, and embedded media.

2. Software components able to extract these data from wiki implementations.

3. A scalable infrastructure for harvesting and storing these data.

4. Migration services for migrating contents expressed in a wiki markup language into standardized formats.

5. Migration services for the semantic transformation of the meta data stored in this system to newer formats (i.e., services for vocabulary evolution).

6. Software interfaces to existing digital preservation infrastructures using preservation meta data standards.

7. An effective user interface for accessing these data.

2.1 Benefits of Semantic Web Technologies

Urobe is implemented using Semantic Web technologies: it extracts data stored in a wiki and archives them in the form of named RDF graphs [4]. The resources and properties in these graphs are described using a simple OWL³ vocabulary. Resources are highly interlinked with other resources due to structural relationships (e.g., article revisions are linked with the user that authored them) but also semantic relationships (e.g., user objects stemming from different wikis that are preserved by Urobe are automatically linked when they share the same e-mail address). We decided to make use of Semantic Web technologies for the representation of preserved data and meta data for the following reasons:

Flexibility. The data model for representing the preserved wiki contents is likely to change over time to meet new requirements, and it is not predictable at present how this data model will evolve in the future. In this context, modelling the data with the flexible graph-based RDF data model seems a good choice to us: migrating to a newer data model can be seen as an ontology matching problem for which tools and methods are constantly being developed in Semantic Web research [5, 11].

High semantic expressiveness. In order to read and interpret digital content in the future, it is necessary to preserve its semantics. As a consequence of the continuous evolution of data models, knowledge about data semantics disappears quickly if not specified explicitly [11]. To face this problem, we make use of well-defined, standardized Semantic Web vocabularies to define the semantics of our data explicitly.

Existing inference support. Inference makes it possible to find relations between items that were not specified explicitly. By this it is possible to generate additional knowledge about the preserved data that might improve future access to and migration of the data.

Expressive query language. One of the key goals of digital preservation systems is to enable users to re-find and access the data stored in such an archive. This often requires a preservation storage to support complex and sophisticated queries on its data. Data in RDF graphs can be queried using SPARQL, a rich, expressive, and standardized query language that meets these requirements [8].

Furthermore, we decided to publish the archived data as Linked Data [2] in order to exchange them between de-centralized components. Linked Data means that (i) resources are identified using HTTP URIs, (ii) de-referencing (i.e., accessing) a URI returns a meaningful representation of the respective resource (usually in RDF) and (iii) these representations include links to other related resources. Data published in this way can easily be accessed and integrated into existing Linked Data.

This highly flexible data representation can be accessed via the Urobe Web interface and can easily be converted to existing preservation meta data standards in order to integrate it with an existing preservation infrastructure.

2.2 A Vocabulary for Describing Wiki Contents

We have developed an OWL Light vocabulary for the description of wiki core elements by analyzing popular wiki engines as well as common meta data standards from the digital preservation domain and vocabularies from the Semantic Web domain. The core terms of our vocabulary are depicted in Figure 1. Our vocabulary builds upon three common Semantic Web vocabularies: (i) DCTERMS for terms maintained by the Dublin Core Metadata Initiative⁴, (ii) SIOC for describing online communities and their data⁵ and (iii) VoiD for describing datasets in a Web of Data⁶.

Figure 1: Core terms of the Urobe vocabulary for describing wiki contents. The vocabulary is available at http://urobe-info.mminf.univie.ac.at/vocab. [Diagram: classes WikiEntity, ArticleEntity, MediaEntity, Snapshot, Revision, Link, Representation, ContentObject, ContentFormat and Migration, related to sioc:Site, sioc:Wiki, sioc:Container, sioc:Post, sioc:User and void:Dataset via properties such as has_host, has_container, has_link, has_source, has_target, has_representation, has_format, has_migration_source and has_migration_target.]

The vocabulary was designed to be directly mappable to the PREMIS Data Dictionary 2.0 [9]. We have implemented such a mapping and enable external tools to access the data stored in an Urobe archive as PREMIS XML descriptions⁷. This PREMIS/XML interface makes any Urobe instance a PREMIS-enabled storage that can easily be integrated into other PREMIS-compatible preservation infrastructures.

PREMIS is extensible by design: as RDF graphs can be serialized to XML⁸, they are directly embeddable in PREMIS descriptions (using the objectCharacteristicsExtension semantic unit). Further, it is possible to describe media objects (i.e., images, videos, documents) that are embedded in wiki pages using appropriate semantic vocabularies like the COMM multimedia ontology [1] (an MPEG-7 based OWL DL ontology that covers most parts of the MPEG-7 standard) or the Music Ontology⁹ (which provides a formal framework for describing various music-related information, including editorial, cultural and acoustic information). These descriptions can then be embedded in/mapped to PREMIS descriptions. These meta data could partly be extracted from the object's content itself (e.g., from ID3v2 tags or XMP headers) but also be retrieved directly from the Web of Data (e.g., from the MusicBrainz database, cf. [10]), which could enhance the quality of these meta data considerably.

³ http://www.w3.org/2004/OWL/
⁴ http://purl.org/dc/terms/
⁵ http://sioc-project.org/
⁶ http://vocab.deri.ie/void/
⁷ In compliance with the Linked Data recommendations, access to these representations is possible via content negotiation.

2.3 Migration of Wiki Articles

Some time ago, the wiki research community started a first standardization attempt for wiki markup languages¹⁰ which led to a first stable recommendation (Creole 1.0). We therefore decided to implement tools for migrating the source code of wiki articles from their original wiki markup language to Creole 1.0 as soon as they are integrated into our preservation storage. So far we have implemented migration tools for the markup languages of MediaWiki and JspWiki based on components from the WikiModel project¹¹.

Creole is a wiki markup that contains common elements of many existing wiki engines. However, it is not able to express all specialized elements that are available in the various markup languages¹². This means that converting wiki articles to Creole is often a lossy migration step. Therefore Urobe additionally preserves the original article source code in its original markup language to enable less lossy migration in the future. However, some loss is unavoidable in such a migration, although it might concern mostly features of minor importance (such as, e.g., specialized coloring of table headings or borders around embedded images). If such features have to be preserved, storing the HTML representation of wiki articles is unavoidable. However, even in this case we consider the preservation of a wiki's core elements beneficial, as it enables integration of the data with other data but also direct reasoning on the archived contents.

Further, it is notable that preserving the article source code instead of its rendered HTML version saves a lot of space in a preservation storage: when we compared the raw byte sizes of HTML and plain source code representations of random Wikipedia articles, we found that the source code representation uses less than 10% of the HTML size in most cases. Thus, the storage requirements for a wiki archive could be reduced considerably if the mentioned migration loss is considered acceptable in a particular wiki preservation strategy.

As mentioned before, not only the data themselves but also their semantics, which are expressed using our OWL vocabulary, will have to be migrated in the future. We have not yet developed tools for the migration of our vocabulary, but are confident that this can be achieved by using tools and methodologies from ontology matching research.

2.4 Modularized, Distributed Architecture

Urobe is a distributed Web application that comprises three central components:

Proxy components access particular wiki implementations, convert their data to RDF and expose these RDF graphs as Linked Data. Proxies know how to access the data stored by a particular wiki software (e.g., by directly interacting with the database the wiki stores its data in).

The format registry stores descriptive and administrative meta data about particular file formats, including descriptions of various wiki markup languages.

The preservation server periodically accesses the proxy components using HTTP requests and harvests all data that were created since the proxy was last accessed.
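The markup migration described in Section 2.3 can be sketched as follows. This is a deliberately tiny, regex-based illustration covering only bold and italic text; Urobe's actual migration is built on the WikiModel parser components, not on regular expressions:

```python
# Sketch: migrating two MediaWiki constructs to Creole 1.0
# (bold and italics only; a real migration needs a full parser).
import re

def mediawiki_to_creole(src: str) -> str:
    # MediaWiki bold '''text''' becomes Creole **text**.
    out = re.sub(r"'''(.+?)'''", r"**\1**", src)
    # MediaWiki italics ''text'' becomes Creole //text//.
    out = re.sub(r"''(.+?)''", r"//\1//", out)
    return out

migrated = mediawiki_to_creole("== History ==\nThe '''first''' revision was ''small''.")
```

Note that level-2 headings (`== Heading ==`) happen to coincide in both languages and pass through unchanged; constructs with no Creole counterpart (e.g., mathematical formulas) are exactly the lossy cases discussed above.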

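The e-mail-based interlinking of user resources described in Section 2.1 can be sketched over an in-memory set of triples. The resource URIs are hypothetical examples and the flat set stands in for Urobe's named RDF graphs:

```python
# Sketch: link user resources from different archived wikis with
# owl:sameAs when they share the same e-mail address (example URIs;
# the real Urobe store operates on named RDF graphs).
from itertools import combinations

SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
MBOX = "http://xmlns.com/foaf/0.1/mbox"

def interlink_users(triples):
    """Add an owl:sameAs triple for every pair of subjects that
    share an identical mbox (e-mail) value."""
    by_mail = {}
    for s, p, o in triples:
        if p == MBOX:
            by_mail.setdefault(o, set()).add(s)
    links = set()
    for subjects in by_mail.values():
        for a, b in combinations(sorted(subjects), 2):
            links.add((a, SAME_AS, b))
    return triples | links

store = {
    ("http://wiki-a.example.org/user/alice", MBOX, "mailto:alice@example.org"),
    ("http://wiki-b.example.org/user/a.smith", MBOX, "mailto:alice@example.org"),
}
linked = interlink_users(store)
```

Since both user resources carry the same mailto: value, a single owl:sameAs triple is added; a production system would additionally have to handle case differences and unverified addresses.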
These data are stored in a triple store and interlinked with other data stemming from other wiki instances (e.g., user objects are automatically interlinked when they share a common e-mail address). Such links are then exploited, e.g., for data access via the Urobe GUI. A preservation server is able to archive multiple wikis of different types. The internal architecture of the preservation server is influenced by the reference model for an Open Archival Information System (OAIS). Migration services for various object types can be plugged into this server. Currently, object migration is done either immediately after ingestion (for wiki articles) or on demand (for media objects). A preservation workflow component is under development.

The components of Urobe are loosely coupled via HTTP interfaces (cf. Figure 2). This modular architecture and the standardized protocols and formats used by Urobe allow for the easy integration of its components into other applications.

Figure 2: Core architecture of our Urobe prototype. Solid black boxes and arrows denote Linked Data interfaces, dashed blue boxes and arrows denote HTML interfaces. [Diagram: a user accesses the preservation server, which comprises migration services and storage and consults the format registry; proxies connect Wiki 1 and Wiki 2 to the preservation server.]

2.5 A Web Interface for Accessing the Urobe Preservation Storage

Human users may access Urobe via an HTML interface (Figure 3) provided by the preservation server component. This interface enables them to search for wiki contents in the Urobe archive using full-text queries and a faceted search approach. Facets for filtering result sets include (i) the wiki(s) the user wants to search, (ii) the time interval the results were created in, (iii) content types and (iv) the size of multimedia objects. The detail view of articles/media objects presents a timeline of the preserved revisions of this item that indicates all revisions that were created within the search time frame using a different color. Users may navigate to other revisions by simply clicking into the timeline. The original source code of an article as well as all migrated representations are accessible via this screen. An HTML version that is rendered from the preserved Creole source code constitutes the default view of an article.

Machine actors may further access PREMIS/XML and RDF representations of the stored wiki contents using the Linked Data interface that exposes these data in a machine-processable format. The various representation formats are accessible via content negotiation: e.g., when the content type text/n3 is passed in the Accept header of the HTTP request, Urobe returns an N3-serialized RDF graph describing the respective resource. Urobe also provides a SPARQL endpoint for formulating complex queries over the preservation storage. As future work we further consider implementing a time-based content negotiation mechanism for accessing our preservation storage, as recently presented in [12].

3 CONCLUSIONS

We have formulated requirements and presented a first approach for the digital long-term preservation of wikis, a particular type of Web 2.0 application. Our approach strongly relies on the adoption of Semantic Web methods and technologies. Wiki contents are modeled using a graph-based data model and their semantics are described using a simple OWL ontology. The advantages of Semantic Web technologies for digital preservation tasks were also recognized by others [7, 8, 3]; especially the flexible and extensible way of data representation is considered beneficial for future data and vocabulary migration as well as for data integration tasks.

We further contribute a first vocabulary for the abstract description of wiki contents, which we consider a precondition for a general wiki preservation strategy. We envision this vocabulary to be continuously improved in the future, which requires algorithms and tools for migrating the data in a Urobe preservation storage to a new vocabulary version. As discussed, we have not yet implemented such functionality, but due to the strong application of Semantic Web technologies we can benefit directly from the ongoing research in the area of ontology matching.

In the course of the ongoing Urobe project, we aim at extending our vocabulary and implementing support for other semantic vocabularies that are able to capture additional aspects of the preserved data that are of importance in digital preservation, such as context information and provenance meta data.

Finally, our proposed way of exposing the data stored in Urobe as Linked Data enables others to link to these data in a standardized way without compromising their integrity. These externally linked data could then be exploited to harvest additional preservation meta data and ultimately to improve future content migration steps. Further, others could directly benefit from the invariant data in such an archive by being able to create stable links to particular revisions of wiki core elements.

4 ACKNOWLEDGEMENT

This research is partly funded by the Research Studios Austria program of the Federal Ministry of Economy, Family and Youth of the Republic of Austria (BMWFJ).

⁸ http://www.w3.org/TR/REC-rdf-syntax/
⁹ http://musicontology.com/
¹⁰ http://wiki.wikicreole.org/
¹¹ http://code.google.com/p/wikimodel/
¹² An example is mathematical formulas in MediaWiki; see http://en.wikipedia.org/wiki/Help:Displaying_a_formula.

Figure 3: Urobe graphical user interface. The left screenshot shows the main search screen, including the full-text search and the faceted search interface. The right screenshot shows the detail view of a preserved article: the timeline on top of the screen visualizes the revisions of the corresponding wiki article. Below, various representations of the article (XHTML, Creole, original markup) can be accessed.
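The content negotiation behaviour described in Section 2.5 can be sketched as a small dispatch function. Only the text/n3 case is taken from the paper; the other media-type mappings and the HTML fallback are illustrative assumptions:

```python
# Sketch: choosing a response serialization from the HTTP Accept
# header. text/n3 is the case described in Section 2.5; the other
# mappings and the default are assumptions for illustration.
def select_serialization(accept_header: str) -> str:
    supported = {
        "text/n3": "n3",
        "application/rdf+xml": "rdf/xml",
        "text/html": "html",
    }
    for part in accept_header.split(","):
        media_type = part.split(";")[0].strip().lower()
        if media_type in supported:
            return supported[media_type]
    return "html"  # assumed fallback for browser clients

fmt = select_serialization("text/n3;q=0.9, text/html")
```

A production implementation would also rank the accepted media types by their q-values rather than taking them in header order.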

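The PREMIS embedding via the objectCharacteristicsExtension semantic unit (Section 2.2) can be sketched with a minimal XML fragment. The element names follow the PREMIS 2.0 data dictionary as described above, but this fragment is illustrative and not schema-validated; the revision URI is a made-up example:

```python
# Sketch: embedding an RDF/XML fragment in a PREMIS object
# description via objectCharacteristicsExtension (illustrative,
# not validated against the PREMIS 2.0 schema).
import xml.etree.ElementTree as ET

PREMIS_NS = "info:lc/xmlns/premis-v2"
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

obj = ET.Element(f"{{{PREMIS_NS}}}object")
characteristics = ET.SubElement(obj, f"{{{PREMIS_NS}}}objectCharacteristics")
extension = ET.SubElement(
    characteristics, f"{{{PREMIS_NS}}}objectCharacteristicsExtension")
# Any well-formed XML may be placed inside the extension container;
# here we embed a one-resource RDF/XML description of a revision.
rdf = ET.SubElement(extension, f"{{{RDF_NS}}}RDF")
ET.SubElement(rdf, f"{{{RDF_NS}}}Description",
              {f"{{{RDF_NS}}}about": "http://example.org/revision/1"})

xml_bytes = ET.tostring(obj)
```

This is the mechanism by which the RDF graphs discussed in Section 2.1 remain accessible to PREMIS-only consumers of an Urobe archive.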
5 REFERENCES

[1] Richard Arndt, Raphaël Troncy, Steffen Staab, Lynda Hardman, and Miroslav Vacura. COMM: Designing a well-founded multimedia ontology for the web. In Proceedings of the ISWC '07, pp. 30–43, 2007.

[2] Christian Bizer, Tom Heath, and Tim Berners-Lee. Linked data - the story so far. IJSWIS, 5(3):1–22, 2009.

[3] Laura E. Campbell. Recollection: Integrating data through access. In Proceedings of the ECDL '09, pp. 396–397, 2009.

[4] Jeremy J. Carroll, Christian Bizer, Pat Hayes, and Patrick Stickler. Named graphs, provenance and trust. In Proceedings of the WWW '05, pp. 613–622, 2005.

[5] Giorgos Flouris, Dimitris Manakanatas, Haridimos Kondylakis, Dimitris Plexousakis, and Grigoris Antoniou. Ontology change: Classification and survey. Knowl. Eng. Rev., 23(2):117–152, 2008.

[6] Mark Greaves. Semantic web 2.0. IEEE Intelligent Systems, 22(2):94–96, 2007.

[7] Jane Hunter and Sharmin Choudhury. PANIC: An integrated approach to the preservation of composite digital objects using semantic web services. Int. J. Digit. Libr., 6(2):174–183, 2006.

[8] Gautier Poupeau and Emmanuelle Bermès. Semantic web technologies for digital preservation: the SPAR project. In Proceedings of the Poster and Demonstration Session at ISWC 2008, 2008.

[9] PREMIS Editorial Committee. PREMIS data dictionary for preservation metadata, version 2.0, 2008. http://www.loc.gov/standards/premis/.

[10] Yves Raimond, Christopher Sutton, and Mark Sandler. Interlinking music-related data on the web. IEEE MultiMedia, 16(2):52–63, 2009.

[11] Christoph Schlieder. Digital heritage: Semantic challenges of long-term preservation. Submitted to the Semantic Web Journal (SWJ), 2010. http://www.semantic-web-journal.net/content/new-submission-digital-heritage-semantic-challenges-long-term-preservation.

[12] Herbert Van de Sompel, Robert Sanderson, Michael Nelson, Lyudmila Balakireva, Harihar Shankar, and Scott Ainsworth. An HTTP-based versioning mechanism for Linked Data. LDOW 2010, co-located with WWW '10, 2010.

[13] W3C Semantic Web Activity, RDF Data Access Working Group. SPARQL query language for RDF. Technical report, W3C, 2008.