BlogForever: D3.1 Preservation Strategy Report
SEVENTH FRAMEWORK PROGRAMME
FP7-ICT-2009-6

BlogForever
Grant agreement no.: 269963

BlogForever: D3.1 Preservation Strategy Report

Editor: Yunhyong Kim, Seamus Ross
Revision: First Version
Dissemination Level: Public
Author(s): Yunhyong Kim, Seamus Ross, Karen Stepanyan, Ed Pinsent, Patricia Sleeman, Silvia Arango-Docio, Vangelis Banos, Ilias Trochidis, Jaime Garcia Llopis, Hendrik Kalb
Due date of deliverable: 30 September 2012
Actual submission date: 30 September 2012
Start date of project: 01 March 2011
Duration: 30 months
Lead Beneficiary name: University of Glasgow (UG)

Abstract: This report describes the preservation planning approaches and strategies recommended by the BlogForever project as a core component of a weblog repository design. We start by discussing why weblogs should be preserved in the first place and what exactly we are trying to preserve. We then review past and present work and highlight why current web archiving practices do not adequately address the needs of weblog preservation. We make three distinctive contributions in this volume: a) we propose transferable practical workflows for applying a combination of established metadata and repository standards in developing a weblog repository; b) we provide an automated approach to identifying significant properties of weblog content that uses the notion of communities, and consider how this affects previous strategies; c) we propose a sustainability plan that draws upon community knowledge through innovative repository design.
Project co-funded by the European Commission within the Seventh Framework Programme (2007-2013)

The BlogForever Consortium consists of:
Aristotle University of Thessaloniki (AUTH), Greece
European Organization for Nuclear Research (CERN), Switzerland
University of Glasgow (UG), UK
The University of Warwick (UW), UK
University of London (UL), UK
Technische Universität Berlin (TUB), Germany
Cyberwatcher, Norway
SRDC Yazilim Arastirma ve Gelistirme ve Danismanlik Ticaret Limited Sirketi (SRDC), Turkey
Tero Ltd (Tero), Greece
Mokono GmbH, Germany
Phaistos SA (Phaistos), Greece
Altec Software Development S.A. (Altec), Greece

History

Version  Date        Modified by   Modification reason
0.9      21/08/2012  Yunhyong Kim  Drafting process (version incorporating contributions made throughout WP3 Task 3.1 by the people named in the author list on the cover page)
0.91     27/08/2012  Yunhyong Kim  Drafting chapters 5 and 6
0.95     03/09/2012  Yunhyong Kim  Last stages of drafting
0.99     26/09/2012  Yunhyong Kim  Draft for review by WP3 and the project management
1.0      30/09/2012  Yunhyong Kim  First version of the deliverable
1.1      25/09/2013  Yunhyong Kim  Updated version of the deliverable

Table of Contents

EXECUTIVE SUMMARY
1 INTRODUCTION
  1.1 WHY PRESERVE WEBLOGS?
  1.2 BLOGFOREVER OBJECTIVES
  1.3 CONTRIBUTIONS OF THIS REPORT
  1.4 STRUCTURE OF THE REPORT
2 PREVIOUS WORK: REVIEW AND CRITICISM
  2.1 A BRIEF OVERVIEW OF WEB ARCHIVING IN THE CONTEXT OF DIGITAL PRESERVATION
  2.2 RELEVANT PROJECTS AND INITIATIVES
3 WEBLOGS
  3.1 WEBLOG SURVEY
    3.1.1 Digital Object Type: Structured Text
    3.1.2 Digital Object Type: Image
    3.1.3 Digital Object Type: Document
    3.1.4 Digital Object Type: Audio
    3.1.5 Digital Object Type: Moving Image
    3.1.6 Digital Object Type: Executable
    3.1.7 File Formats From the Weblog Survey
    3.1.8 Next Steps
  3.2 WEBLOG DATA MODEL AND ITS PROPERTIES
    3.2.1 Introduction
    3.2.2 Data Modelling
    3.2.3 Methods Used
    3.2.4 Outline of the Data Model
    3.2.5 Blog Core
    3.2.6 Records within the Repository
    3.2.7 Components of the Data Model
    3.2.8 Representation in XML
  3.3 SIGNIFICANT PROPERTIES OF BLOGS: BRINGING TOGETHER THE DATA MODEL AND USER REQUIREMENTS
    3.3.1 Disambiguation
    3.3.2 Related Work
    3.3.3 Significant Properties: an Attempt to Measure Preservation Performance
    3.3.4 Proposed Changes
    3.3.5 Applying the Proposal to Blogs
    3.3.6 Discussion and Conclusions
  3.4 SIGNIFICANT PROPERTIES OF EMBEDDED DIGITAL OBJECT TYPES
    3.4.1 Structured Text
    3.4.2 Image
    3.4.3 Document
    3.4.4 Audio
    3.4.5 Moving Image
  3.5 CONCLUSION
4 PRESERVATION STRATEGY TESTING
  4.1 REVISITING PRESERVATION STRATEGIES
  4.2 RISK OF INFORMATION LOSS
    4.2.1 Missing Links and Incorrect