DLM FORUM FOUNDATION

Proceedings of the DLM Forum 7th Triennial Conference
Making the information governance landscape in Europe
Lisbon, 10-14 November 2014

Editors: José Borbinha, Zoltán Szatucsek, Seamus Ross

DLM Forum 2014 - 7th Triennial Conference on Information Governance

Copyright 2014 Biblioteca Nacional de Portugal

Biblioteca Nacional de Portugal
Campo Grande, 83
1749-081 Lisboa
Portugal

Cover Design: João Edmundo

ISBN: 978-972-565-541-2
PURL: http://purl.pt/26107

CONFERENCE ORGANIZATION

Chairs
José Borbinha - IST / INESC-ID (Local chair)
Seamus Ross - University of Toronto (Scientific Committee chair)
Zoltán Szatucsek - National Archives of (Scientific Committee chair)

Scientific Committee
Janet Delve - University of Portsmouth
Ann Keen - Tessella
Silvestre Lacerda - DGLAB
Jean Mourain - RSD
Aleksandra Mrdavsic - National Archives of Slovenia
Daniel Oliveira -
Elena Cortes Ruiz - State Archives of
Jef Schram - European Commission
Lucie Verachten -

Local Committee
Ana Raquel Bairrão - IST / INESC-ID
José Borbinha - IST / INESC-ID
João Cardoso - IST / INESC-ID
João Edmundo - IST / INESC-ID
Alexandra Mendes da Fonseca - Caixa Geral de Depósitos
Bruno Fragoso - Imprensa Nacional da Casa da Moeda
Maria Rita Gago - Município de Oeiras
António Higgs Painha - IST / INESC-ID
Denise Pedro - IST / INESC-ID
Diogo Proença - IST / INESC-ID
Susana Vicente - IST / INESC-ID
Ricardo Vieira - IST / INESC-ID


CONTENTS

Session: Keynote Speaker 1
e-Residency in e-Estonia - Taavi Kotka, Janek Rozov, Liivi Karpištšenko ...... 1

Session: and Governance
Bringing legacy and physical content under Governance - Jean Mourain, François Chazalon ...... 6
Risk-based appraisal and selection of records in The Netherlands: development of new tools - Charles Jeurgens ...... 7
Building a risk based records management governance for the City of Rotterdam - Bart Ballaux and Jeroen van Oss ...... 12

Session: Ingesting Special Records & Databases
Preserving digital heritage: a network centric approach - Francisco Barbedo, Ana Rodrigues, Lucília Runa and Mário Sant'Ana ...... 17
The modernization, migration and archiving of research register - Johanna Räisä and Mirja Loponen ...... 24
Activities to facilitate the authentic interpretation of archived databases - Jože Škofljanec and Aleksandra Mrdavšič ...... 28

Workshop
Information Culture: An Essential Concept for Next Generation Records Management - Gillian Oliver and Fiorella Foscarini ...... 31

Session: Information Governance Motivation
The role of Information Governance in an Enterprise Architecture Framework - Richard Jeffrey-Cook ...... 32
One consolidated view of information management references - Ricardo Vieira, Liliana Ragageles and Jose Borbinha ...... 37
Transforming Information Governance Using SharePoint 2010 - Stephen Howard ...... 42

Session: Records Management Theory in Transition
From Casanova to MoReq2010: Ages of Records - Bogdan-Florin Popovici ...... 53
Introducing MoReq, 4th Edition, and what comes next - Jon Garde ...... 57
Search, Discovery and Harmonization of Diverse Digital Contents - Mikko Lampi, Aki Lassila and Timo Honkela ...... 58

Session: Preserving and Accessing Databases
Database Preservation Toolkit: a flexible tool to normalize and give access to databases - Luis Faria, José Carlos Ramalho and Helder Silva ...... 63

Practical experiences and challenges preserving administrative databases - Mikko Eräkaski ...... 69
Long-term access to databases the meaningful way - Kuldar Aas, Janet Delve and Rainer Schmidt ...... 71

Session: Data Protection
Data protection in the archives world - fundamental right or additional burden? - Jaroslaw Lotarski and Job Sueters ...... 76

Session: Managing Hybrid Records
Reducing complexity of hybrid data at ingest - Tarvo Kärberg ...... 78
Interdisciplinary Approach for Hybrid Records Management in Belgian Federal Administrations: The HECTOR Research Project - Marie Demoulin, Sébastien Soyez, Seth Van Hooland and Cécile de Terwangne ...... 80

Session: Archival Services and Tools
Research projects as a driving force for open source development and a fast route to market - Luis Faria, Miguel Ferreira and Helder Silva ...... 83
Open Source Archive - Anssi Jääskeläinen and Liisa Uosukainen ...... 89
Integration of records management and digital archiving systems: what can we do today? - Robert Sharpe, Pauline Sinclair and Alan Gairey ...... 92
From retention schedules to functional schemes in the French Ministry of Defence - Hélène Guichard-Spica and Anne-Sophie Maure ...... 96

Session: Information Governance in Practice
Information Governance with MoReq - Jon Garde ...... 99
A Maturity Model for Information Governance - Diogo Proença, Ricardo Vieira and Jose Borbinha ...... 100
Evidence-based Open Government: Solutions from Norway and Spain - James Lowry ...... 105

Session: Keynote Speaker 7
Can records management be automated? - James Lappin ...... 106

Session: Keynote Speaker 8
MoReq and E-ARK - Jon Garde ...... 112

Session: The Cloud, Social Data, and Big Data
Is Big Data governing future memories? - Alessia Ghezzi, Estefania Aguilar Moreno and Ângela Guimarães Pereira ...... 113
Access and Preservation in the cloud: Lessons from operating Preservica Cloud Edition - Robert Sharpe, Kevin O'Farrelly, Alan Gairey, Maïté Braud, Ann Keen and James Carr ...... 115

Session: Education and training
Recordkeeping Informatics: Building the Discipline Base - Gillian Oliver, Joanne Evans, Barbara Reed and Frank Upward ...... 126
Law and Records Management in Archival Studies: New Skills for - Marie Demoulin and Sébastien Soyez ...... 131

Author Index...... 136

Notes:
- Full papers have the title in bold
- Full-texts not provided in time have only the abstracts

e-Residency in e-Estonia

Taavi Kotka
Deputy Secretary General for Communication and State Information Systems
Ministry of Economic Affairs and Communications
Harju 11, Tallinn, Estonia
https://www.mkm.ee/en/objectives-activities/information-society/information-society-services
[email protected]

Janek Rozov
Department of Information Society Services Development
Ministry of Economic Affairs and Communications
Harju 11, Tallinn, Estonia
https://www.mkm.ee/en/objectives-activities/information-society/records-management-information-governance
[email protected]

Liivi Karpištšenko
Department of Information Society Services Development
Ministry of Economic Affairs and Communications
Harju 11, Tallinn, Estonia
https://www.mkm.ee/en/objectives-activities/information-society
[email protected]

ABSTRACT
In this paper, we give a brief overview of the basics of Estonian e-Governance, Digital Agenda 2020, and the newest approach - e-Residency.

General Terms
Management, Measurement, Documentation, Performance, Economics, Reliability, Experimentation, Security, Human Factors, Standardization, Legal Aspects.

Keywords
eGovernance, eResidency, eServices, Data Exchange.

1. INTRODUCTION
In 2014, a new term was coined in the Estonian Republic - e-residency. The concept was greeted with acclaim, as it gives equal opportunities to people residing in Estonia and in other countries to do business and use e-services in the high-level electronic environment of Estonia.

Estonia has a state information system whose architecture is a precondition for introducing and implementing such a novel idea. The Estonian state information system is designed to be flexible and secure, and to serve the purpose of collecting data only once and making it available for re-use. Thus, it allows designing the most modern e-services and making information management at the state level a success.

2. E-ESTONIA
The development of e-governance and e-services in Estonia is remarkable. Estonia has a unique system for the use of electronic ID and, therefore, it is possible to enjoy almost paper-free administration processes. Both public and private sector solutions have won international attention and provided the basis for the image of Estonia as an excellent e-state.

Development of the state information system has been the biggest strength of the current national ICT policy. The ground rules of the Estonian information policy - dispersed service-based architecture, suitable security of data and data exchange, online features, focus on e-services and the use of strong authentication measures - have been observed to achieve this result.

State information system services provide the basis for development of modern public services and successful information management at the state level. These services include the data exchange layer X-Road, the public key infrastructure and e-ID, the administrative system of the state information system (RIHA), the document exchange centre, and the information gateway eesti.ee.

3. X-ROAD
The X-Road is often called the backbone of Estonian e-Governance and public services. The X-Road, operating since 2001, is a technical and organizational environment which enables secure Internet-based data exchange between the state's information systems. Therefore, there is no need to collect the same data several times; it can instead be reused in a secure environment by various authorities. The X-Road not only allows the exchange of data, but also allows people to access the data maintained and processed in state databases.

Public and private sector enterprises and institutions can connect their information systems with the X-Road. This enables them to use X-Road services in their own electronic environment or offer their e-services via the X-Road. Joining the X-Road enables institutions to save resources, since the data exchange layer already exists. This makes data exchange more effective both inside the state institutions and in the communication between a citizen and the state.

3.1 X-Road and Services
X-Road provides a good solution for service design. Borders between different authorities are not a problem anymore. It is possible to exchange the information needed for providing services. For example, if a business needs a certain business license, it does not have to gather data from different authorities (Tax and Customs Board, local authorities, etc.) and submit different documents providing evidence: X-Road gathers the data in a matter of a second. Also, it no longer matters which authority provides a service. This means that it is possible to provide services in the places where the client really is (on the Internet, or in the service bureau where most of the clients go). Thus, X-Road gives Estonian authorities the freedom to customize services for clients. The same principle applies to every information system connected to the X-Road.

Figure 1. X-Road as the backbone of public services

In Estonia, pre-filled application forms, tax declarations, etc. that citizens and entrepreneurs can use in electronic self-service portals are a common practice. In addition, the X-Road makes it possible to introduce proactive, "invisible" services. These are services where decisions are made without any administrative burden for the end-users of the service - they do not need to submit any applications or additional information, but, at most, to confirm their acceptance of the service and the data. An example of a possible proactive service was given at the Athens meeting of the DLM Forum in June, 2014 [1].

3.2 X-Road, Transparency and Security
In the case of citizens, the X-Road enables using its services via different portals. That includes making enquiries from state databases and controlling the information related to the person himself/herself. For example, every citizen can use X-Road to submit inquiries to the Population Register about their personal data, or inquiries to the vehicle database of the Traffic Register regarding their car. In order to use the services, the end users must first authenticate themselves with an ID card or via an Internet bank. An entrepreneur's right of representation is authenticated on the basis of the data of the Commercial Register.

All of the inquiries made through X-Road possess probative value, i.e. have a legal effect. This means that inquiries made through X-Road can later be identified along with the person who submitted the inquiry, and it is possible to establish that the inquiry has been logged correctly.

The security of X-Road is at the highest, third level of the Estonian three-level baseline security system (ISKE). The connection is created between two parties through standard security servers in order to ensure the safety of the data exchange. In the data exchange, data is encrypted, and a two-step authorisation is used - an agency is authorised in the security server of X-Road, and a user is authorised in the information system of the agency.

The administration system of the state information system (RIHA) is the main asset for information management at the state level. RIHA is a complete and detailed catalogue of the state information system. RIHA administers the information systems, services, and classifications of the state, as well as semantic and XML assets. For the most part, RIHA's data is available to everyone, but can be amended only by the organisations registered in RIHA; each organisation is individually responsible for the correctness of the data it has entered.

3.3 X-Road and Public Sector Efficiency
Officials can use the X-Road services intended for them (for instance the document exchange centre) via the information systems of their own institutions. This facilitates the officials' work, since it avoids the labour-consuming processing of paper documents, large-scale data entry and data verification. Communication with other officials, entrepreneurs and citizens is faster, more accurate, and secure.

X-Road data transfer capabilities are not limited to structured, machine-readable data. Unstructured electronic records (in PDF, Word, and other such formats) are also exchanged in the secure environment of X-Road. For that purpose, a central component, the Document Exchange Centre (DEC), was created in 2006. In 2014, more than 30,000 electronic records are exchanged monthly. In addition to security, DEC has other merits. Records are transferred in SOAP envelopes with XML containers ("envelopes" of records), each of which, in turn, contains a record and an extract of its metadata. The transfer of a standard metadata set facilitates the capturing and registration of the records in the recipient's system, since the necessary metadata can be populated automatically. Via DEC, electronic records and metadata are also transferred to the National Archives for permanent preservation.

3.4 X-Road and Cross-Border Connections
On a wider scale, borders between countries are not obstacles anymore. It is also possible to exchange information between different countries. Estonia and Finland have already started cooperation in that area. For example, confidential tax information between the Finnish and Estonian tax administrations is already exchanged. At the moment there are four services where data is exchanged: payments made to employees, withheld and calculated taxes; employers of a natural person, payments and taxes subject to social tax; control of the absence of tax arrears; and a legal person's VAT data. Other opportunities arising from data exchange are being considered, e.g. faster ways of exchanging data necessary for payments of social welfare benefits (pensions, child allowances, etc.).

In cross-border data exchange, technology has not proved to be an obstacle. The problems which need to be solved in the future are usually in areas other than technology. Usually, the obstacles come from legislation (for example the right to ask for information from another country) and bureaucracy. That means there is a need to cooperate closely in these areas too, so as to fully exploit the whole potential of X-Road.

[1] 'How to Move Forward without Having to Move Back' by Janek Rozov - http://www.dlmforum.eu/files/proocedings/2014%20Athens/Presentations/11.06.2014/Session%20IV/Janek%20Rozov.pdf
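The DEC exchange pattern described in Section 3.3 (one SOAP envelope per record, whose XML container carries the record payload together with a standard metadata extract that the recipient can use to auto-populate registration fields) can be sketched roughly as follows. This is an illustrative mock-up only: the SOAP 1.1 envelope namespace is real, but the "dec" namespace and every element name (`RecordContainer`, `Metadata`, `Record`) are invented for the example and are not the actual DEC schema.

```python
import base64
import xml.etree.ElementTree as ET

# The SOAP 1.1 envelope namespace is real; the "dec" namespace and all
# element names below are invented for this illustration.
SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"
DEC_NS = "urn:example:dec"

def build_envelope(record_bytes, metadata):
    """Sender side: wrap one record and its metadata extract."""
    env = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(env, f"{{{SOAP_NS}}}Body")
    container = ET.SubElement(body, f"{{{DEC_NS}}}RecordContainer")
    meta = ET.SubElement(container, f"{{{DEC_NS}}}Metadata")
    for key, value in metadata.items():
        ET.SubElement(meta, f"{{{DEC_NS}}}{key}").text = value
    record = ET.SubElement(container, f"{{{DEC_NS}}}Record")
    record.text = base64.b64encode(record_bytes).decode("ascii")
    return ET.tostring(env, encoding="unicode")

def register_incoming(envelope_xml):
    """Recipient side: recover the record and auto-populate the
    registration fields from the standard metadata set."""
    root = ET.fromstring(envelope_xml)
    container = root.find(f".//{{{DEC_NS}}}RecordContainer")
    fields = {el.tag.split("}")[-1]: el.text
              for el in container.find(f"{{{DEC_NS}}}Metadata")}
    record = base64.b64decode(container.find(f"{{{DEC_NS}}}Record").text)
    return fields, record

envelope = build_envelope(
    b"%PDF-1.4 ...",  # an unstructured record, e.g. a PDF
    {"Title": "Business licence decision",
     "Sender": "Tax and Customs Board",
     "Created": "2014-10-21"})
fields, record = register_incoming(envelope)
```

The point of the standard metadata set is visible on the receiving side: registration fields come out of the envelope ready to use, without manual data entry.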

For clients, it does not matter in which country they are. X-Road helps to exchange information between countries and also to design cross-border services. That would be a very big leap toward the EU Single Digital Market. X-Road enables us to provide services based on a client's life or business event. So if there is a life event (for example getting married) in a client's life, different services of different countries could be combined or united so that the client would need only minimal communication with the public sector.

4. ESTONIAN DIGITAL AGENDA 2020
In Estonia, there is a common understanding that ICT is an important tool for achieving economic growth and improved quality of life. In November 2013, the Government approved the Digital Agenda 2020 for Estonia. This is an ambitious strategy that will serve as a basis for various sectoral strategies as well as for development plans that all the public sector organizations have to follow.

The digital agenda sets out the development activities to be implemented by the end of 2020:

• The construction of an ultra-fast basic Internet network will be completed - at the moment, one third of the planned 6,500 km basic network is finished. By 2020, the basic network will be finished and everyone in Estonia will be able to use fast Internet.
• The Nordic E-governance Basic Infrastructures Innovation Institute will be established at the initiative of Estonia. This is intended to be an international development centre, aiming at joint development of X-Road, digital signature and other components of the basic infrastructure.
• By 2020, 20% of the employment-age population will be using digital signatures for faster business and handling of personal issues. The implementation of the digital signature in the European Union will become one of the main goals of the external activities of Estonia and, where appropriate, also an issue to be pursued by Estonia in 2018, while holding the Presidency of the Council of the European Union.
• Coping with increasing data volumes and the loss of privacy resulting from extensive cross-database usage of data will be made easier by taking control over data usage. Conditions will be established to allow people to always be aware of who, when and for what purpose their data is being used, and to marshal that use.
• To avoid getting stuck in old technologies, a reform of the public e-services and supporting IT solutions will be introduced. Estonia's public e-services must be modernised and meet uniform quality requirements. In addition, the "no legacy" principle will be adopted, which means that in the public sector, IT solutions of material importance must never be older than 13 years.
• The state owns huge quantities of data; however, these are not used enough to make better political decisions and to offer better services. Over the years to come, the public sector's capacity to absorb analytical solutions will be considerably improved.
• Estonia will start to offer its secure and convenient services to the citizens of other countries. Virtual or e-residency will be launched - Estonia will start to issue electronic identity, in the form of digital ID, to non-residents; the services will be aiming for a position similar to that held by Swiss banking.
• The "Data Embassy" concept will be implemented. This means secure storage of digital copies of registration information and records important for the state in virtual embassies that are located in other countries. As a result of the project, Estonia will be able to ensure the running of the country using "cloud technology", regardless of whether Estonia's territorial integrity can be ensured or not; this concept is also valuable in case of many other hazards.
• Estonia will strengthen its position as a think tank of the information society, to continue the dissemination of e-governance experiences and to stand for the freedom of the Internet and the protection of privacy. A global information society think tank will be established in Estonia to achieve this goal.
• The existence of a competent, innovative and competitive ICT sector is important for both the development of public sector IT solutions and economic development. In 2020, about 50% more people than in 2013 will be working in the ICT sector. For that purpose, we will contribute to enhancing the popularity of IT specialities within the framework of a life-long learning programme and to improving the quality of the doctoral degree studies in these specialities.

5. E-RESIDENCY
The Republic of Estonia is the first country to offer e-residency. On 21 October, 2014 the Estonian Parliament passed a bill that sets out the legal basis for the issuance of a digital identity to persons who are not Estonian citizens or residents of Estonia. The bill will enter into force on 1 December, 2014. People from all over the world will now have an opportunity to get a digital identity provided by the Estonian government - in order to get secure access to world-leading digital services from wherever they might be.

As described above, Estonian citizens and residents enjoy the merits a modern and secure digital environment offers. For the residents of other countries, there have been no similar opportunities. If Estonian businesses or public sector organizations develop cooperation with non-residents, they have to use paper-based parallel processes which are time-consuming and costly. There is no economic justification for the availability of Estonian e-Governance and other electronic services to depend on the residency or citizenship of a person. Nor does this restriction facilitate the development of Estonia and the single market of the EU.

5.1 Authors of the Idea of e-Residency
The concept of e-Residency was proposed and developed by Mr Taavi Kotka (CIO of the Estonian Government, the Deputy Secretary General for Communication and State Information Systems with the Ministry of Economic Affairs and Communications), Mr Siim Sikkut (National ICT Policy Adviser with the Government Office), and Ms Ruth Annus (then Head of Migration and Border Policy with the Ministry of Interior).

The concept was approved by the Government on April 24, 2014, and awarded as the best development idea by Arengufond on June 12, 2014. Arengufond is a public institution subject to the Parliament whose aim is to contribute to the economic development of Estonia.

In July 2014, Taavi Kotka received the European ICT award "European CIO of the Year" at the ICT Spring Conference in Luxembourg. Two of the main reasons for awarding him were his global strategies and technical choices.

5.2 e-Residency of Estonia - Who and What For
E-residency gives foreigners residing abroad opportunities similar to those of the people living in Estonia. At any place in the world, a person with an Estonian ID card can:
• sign and encrypt documents within minutes;
• establish a company within an hour;
• make bank transfers within seconds;
• participate actively in the management of a company registered in Estonia;
• submit tax returns in Estonia with just a few mouse clicks.

E-residency is especially useful for entrepreneurs and others who already have some relationship to Estonia: who do business, work, study or visit here but have not become residents. However, e-residency is also launched as a platform to offer digital services to a global audience with no prior Estonian affiliation - for anybody who wants to run their business and life in the most convenient, digital way. Estonia is planning to keep adding new useful services from early 2015 onwards.

The digital ID of an e-resident will enable a foreigner to perform transactions online regardless of location. People from outside the EU can thus create a central base for themselves for the transaction of business in the EU - establish a business in Estonia and participate actively in its management while residing in Brazil or Australia, for instance. Today, the participation of a foreign investor in the active management of a company is rather complicated. The ID-card of an e-resident and the digital signature will give the necessary flexibility.

In Estonia, a business can be established and a bank account opened within a day. This is but one electronic service that Estonia can offer its e-residents - not to speak of the simple and fully digital tax system, flexible (digital) usage of a highly qualified workforce, etc. In addition, reinvested profit is not taxed in Estonia, and highly developed e-banking enables e-residents to manage their assets from a distance.

Thus, the main stakeholders of the Estonian e-residency are:
• foreign investors, and people working abroad in the companies founded by them;
• board members of companies, residing abroad;
• foreign specialists and workers of Estonian companies;
• foreign clients/partners of Estonian companies;
• foreign scientists, academics, tutors, and students;
• representatives of foreign and international organizations in the Republic of Estonia;
• family members of the aforementioned.

There are, however, other groups of people who are interested in and benefit from becoming e-residents of Estonia - former Estonian residents, people of Estonian nationality, etc.

For giving access to the world's best e-services, the X-Road infrastructure is used, including the Estonian information gateway, the State Portal eesti.ee. The portal brings together many of the information systems that are in use in Estonia and provides its users with the opportunity to find the necessary information and gain access to various X-Road, register and information system services through one portal. On the opening page of the portal, a user can choose whether to use e-services, read materials on various topics, or search for contact data of agencies. After choosing e-services or topics, a choice can be made whether to search further in the area designed for citizens or for entrepreneurs and, depending on the user's choice, the list of services or topics intended for the role in question can be retrieved.

5.3 How to Get e-Residency of Estonia
An e-resident will be a physical person who has received the e-resident's digital identity (smart ID-card) from the Republic of Estonia. This will not entail full legal residency or citizenship or the right of entry to Estonia. Instead, e-residency gives secure access to Estonia's digital services and an opportunity to give digital signatures in an electronic environment. Such digital identification and signing is legally fully equal to face-to-face identification and handwritten signatures in the European Union.

The card is not a physical ID-card or even a travel document, because it has no photo on it, but it does have a microchip with security certificates. These enable the card to be used with a small piece of software installed and a reader attached via USB to a computer. It works with two-factor authentication: to get access to a service or to sign digitally, one needs to enter secure PINs which only the holder will know.

To apply for e-residency, it is necessary to visit a Police and Border Guard office in Estonia - to submit an application and provide biometrical data (facial image and fingerprints) for a background check. The decision will be made within 2 weeks and, if it is positive, the card will be issued to the applicant in person at the Police and Border Guard office. The one-time state fee for the card is 50 €; other fees will depend on service providers - public digital services will be offered mostly free of charge, just as for 'real' residents. Measures are being taken to add capacity to Estonian embassies to process e-residency applications and issue cards abroad by the end of 2015 - so that it would not be necessary to travel to Estonia.

6. ACKNOWLEDGMENTS
Our thanks to the DLM Forum and the local committee of the Lisbon conference for their interest in the subject and for including it in the program of the 7th Triennial Conference 2014 as the first keynote speech.

7. REFERENCES
[1] Identity Documents Act, Chapter 52 (entry into force from 01.12.2014, not translated at the moment of submitting this paper). Riigi Teataja. DOI= https://www.riigiteataja.ee/en/eli/ee/519092014001/consolide/current.
[2] Become e-Resident. E-estonia.com. Digital Society. DOI= http://e-estonia.com/e-residents/become-e-resident/
[3] ICT Spring Europe 2014. European ICT Awards. DOI= http://www.ictspring.com/european-ict-awards/.
[4] Implementation of the MoReq2 Model Requirements for the Management of Electronic Records in Estonia (Estonian "chapter zero" for MoReq2). Version 2.0. Ministry of Economic Affairs and Communications (2013), 21-28. DOI= https://www.mkm.ee/sites/default/files/estonian_et_-_chapter_0_english.pdf

Bringing legacy and physical content under Governance

Jean Mourain
Vice President, Global Strategy
RSD SA
[email protected]

ABSTRACT
An organization has decided to launch an Information Governance (IG) program, with its own weighting of three main performance indicators: "create value, mitigate risk, reduce cost of information". To be practical, the IG program committee (or steering committee) has set specific goals and milestones, generally in relation to business functions, that would be achievable in a certain timeframe, say 9 to 12 months. For instance, it will start by applying well-defined policies to newly created "financial" content, stored in a given repository, adding the proper metadata, within a certain file plan. This applies to electronic and physical content. So, going forward, step by step, all new content will be "governed", with all the expected benefits, actually monitored via a proper dashboard.

However, this leaves aside the huge amount of pre-existing content - so-called legacy content. Another element of the IG program will be dedicated to addressing that legacy, be it physical or electronic. This presentation will elaborate on the problems, solutions, techniques and methods to clean up what is of no interest, categorize, and ultimately place under governance what is of interest, against the "value, risk, cost" indicators selected by the organization.
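As a toy illustration of the triage the abstract outlines (scoring legacy content against an organization's own weighting of the "value, risk, cost" indicators, then cleaning up what is of no interest and placing the rest under governance), one might sketch the decision step as follows. All weights, scores, thresholds and item names here are invented for the example and are not taken from the presentation.

```python
# Toy legacy-content triage against weighted "value, risk, cost"
# indicators. The weights, 0-10 scores and the 2.0 threshold are
# invented examples, not figures from the presentation.

WEIGHTS = {"value": 0.5, "risk": 0.3, "cost": 0.2}  # organization-specific

def triage(item):
    """Return 'govern' or 'dispose' for one legacy item.

    High value and high risk argue for bringing the content under
    governance; high storage/handling cost argues for cleaning it up
    when nothing else justifies keeping it.
    """
    keep_score = (WEIGHTS["value"] * item["value"]
                  + WEIGHTS["risk"] * item["risk"]
                  - WEIGHTS["cost"] * item["cost"])
    return "govern" if keep_score >= 2.0 else "dispose"

legacy = [
    {"id": "box-0147", "value": 8, "risk": 6, "cost": 3},   # old contracts
    {"id": "share-tmp", "value": 1, "risk": 0, "cost": 7},  # scratch files
]
decisions = {item["id"]: triage(item) for item in legacy}
# decisions == {"box-0147": "govern", "share-tmp": "dispose"}
```

The point of the sketch is only that the "value, risk, cost" weighting is set once by the steering committee, after which the same rule can be applied uniformly to electronic and physical legacy content alike.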

(Final text not received in time)

Risk-based appraisal of records: some developments in Dutch appraisal practice

Charles Jeurgens Nationaal Archief/Leiden University Prins-Willem Alexanderhof 20, 2595 BE Den Haag, The Netherlands [email protected] / [email protected]

ABSTRACT
In this paper I discuss some recent ideas about, and developments in, risk management and risk assessment in the field of records and archives management in the Netherlands.

GENERAL TERMS
Management, Theory

KEYWORDS
Appraisal and selection; records management; risk assessment; risk management

INTRODUCTION
Risk, risk assessment and risk management are polymorphous, multidimensional and context-dependent categories. Risk has been defined as 'the effect of uncertainty on objectives'[1] and risk management can be seen as the activities undertaken to control and manage risks in order to achieve the aims that have been set. The growing complexity of society causes an increasing interest in, and preoccupation with, risk and risk management. Due to climate change, worldwide banking and financial crises and other catastrophes, risk management receives much attention and is predominantly directed toward reducing the effects of uncertainty and, should an incident take place, limiting its probable effects. This approach of risk reduction is only one side of the picture. There is a more expansive view of risk, called strategic risk taking. Risk management can be used not only to protect but also to generate value, based on the premise that taking risk has rewarding effects. After all, risk creates opportunities, which can easily be illustrated by looking at many of our daily activities. Taking risk is not only necessary to make extra profits but is even a prerequisite for survival. Compared to saving, investments in stocks are much riskier but generate higher returns. The hunting caveman was confronted with many dangers, but without taking risks he would not have ended up with food. Many innovations are the result of the desire to diminish risks, or of the opposite: to take them. Every risk has its reward, and it is this pairing of risk and reward that is at the core of a more encompassing risk definition and a more encompassing risk perspective.[2]

In this paper I will discuss some recent developments in risk management and risk assessment in the field of records and archives management in the Netherlands. I start with a short general introduction in which I focus on the growing role risk and risk management play in the archival community. After that, I discuss some of the later developments in the Netherlands in the field of risk-oriented records management and the appraisal and selection of records, and I finish by discussing the opportunities of risk-driven appraisal.

RISK AND RISK MANAGEMENT IN THE ARCHIVAL COMMUNITY
The enfant terrible of the archival community, David Bearman, has never been afraid of thinking outside the archival box.[3] Taking pride in not being an archivist, he has confronted the archival community incessantly since the 1980s with fresh thoughts about information and archives management. In spite of his autonomous and sometimes somewhat distant position, archivists have adopted many of his ideas over time. An important and recurring theme in his writings is records management based on risk assessment. Bearman's writings made archivists aware that risk and risk management are inextricably connected to, and part of, records management. Due to his persistent attention to this theme, risk management gradually became part of archival vocabulary and archival practice. This is best illustrated by the fact that today risk management has its own ISO guideline to help records managers evaluate risks related to records processes and records systems.[4]

When archivists use the term risk they usually associate it with the purpose of controlling potentially threatening and damaging effects on the quality of records due to bad management, and with the aim of creating a risk-free realm where the archival legacy is safe. This archival reflex is understandable, since it has always been the first responsibility of archivists to safeguard the archival legacy and to protect archives against the dangers of decay and loss, or, as Hilary Jenkinson already clearly stated in the 1920s: the archivist 'has to take all possible precautions for the safeguarding of his Archives and for their custody, which is the safeguarding of their essential qualities',[5] by which he meant the 'impartiality' and 'authenticity' of the archives. In our time archivists are still preoccupied with creating a safe haven and a secure base for archives in order to preserve the essential qualities of archives, which are closely connected to the key values archivists adhere to safeguard: integrity, authenticity, reliability and usability of records.

[1] ISO 31000:2009.
[2] Aswath Damodaran, Strategic Risk Taking. A framework for risk management (New Jersey 2007) 7-10.
[3] See the still worth reading article by Terry Cook, 'The Impact of David Bearman on Modern Archival Thinking: An Essay of Personal Reflection and Critique' in Archives and Museum Informatics 11 (1997): 15-37.
[4] ISO/TR 18128:2014(E) Information and documentation – Risk assessment for records processes and systems. A records system is defined as 'any business application which creates and stores records'.
[5] Hilary Jenkinson, A Manual of Archive Administration, including the problems of War Archives and Archive Making (London 1922) 15.

Page 7

This emphasis on safeguarding records and avoiding risks of damage and loss is clearly reflected in archival terminology. In the American archival terminology risk management is defined as 'the systematic control of losses or damages, including the analysis of threats, implementation of measures to minimize such risks and implementing recovery programs'.[6] In his now classic article 'Moments of Risk: identifying threats to electronic records', Bearman identified six moments of great vulnerability for the integrity and authenticity of records over their existence.[7] These six moments of high risk are the moments of transition at capture, maintenance, ingest, access, disposal and preservation. Better knowledge of potential risks and of the moments of risk leads to a more active, or at least a more explicit, policy of records managers and archivists to take (and accept) or to minimize risks. Viktoria Lemieux, a Canadian scholar who is an expert on risk issues in records management, in particular on how risks impact upon transparency, public accountability and human rights, considers records and information risks as chances 'that may pose a threat to the effective completion of business transactions and fulfillment of organizational objectives or opportunities'.[8] In 2010 she carried out research among seven leading journals in the field of archives and records management and found that between 1984 and 2010 seven different kinds of risks related to records and records management were discussed in these journals.[9] In the analyzed journals the focus is mainly on attempts to limit the risks of losing recorded information; more specifically, most attention was given to disasters and devastating human behavior that constitute a danger to the existence of records, to long-term preservation of digital records, and to long-term preservation of authenticity.[10]

All these approaches to risk management have one important thing in common: the emphasis is always on avoiding potential risks. This one-sided perspective on risk and risk management leaves out the rewarding perception. In this paper I want to use the more encompassing viewpoint of risk and risk management and connect it to developments in appraisal and selection in the Dutch records and archival community. A question that arises from this more encompassing perspective is, for instance, how much effort (time and money) archivists and records managers want or need to spend on managing different categories of information. Recently, explicit risk management has become one of the areas of interest in the appraisal and selection of records. The next step is to develop the experimental tools further, with the purpose of making a better and more explicit risk assessment. Before turning to that aspect I will first briefly sketch the general developments of appraisal and selection within the Dutch records-management and archival community and describe the context in which the need for risk assessment has become manifest. After that I will focus on the risk-analysis tools themselves and sketch some first experiences with this new approach.

MISSING THEORY
Although appraisal and selection of records is not explicitly associated with risk assessment and risk management, the process of appraisal and selection bears all the characteristics of it. As in many countries, government agencies in the Netherlands need to have a retention schedule in order to be able to legally destroy records. The Dutch Archives Decree 1995 describes in very general terms which interests and values must be taken into account when a government agency compiles a retention schedule: the business processes of the government agency, the relation of the government agency to other government agencies, the value of records as part of the cultural heritage, and the significance of the information kept in the records for government agencies, for persons who are seeking justice and evidence, and for historical research. This general description leaves much room for interpreting and operationalizing this 'taking into account'. The archives legislation does not prescribe a specific method for determining the life span of a record. Legislation does however prescribe who should at least be involved in the process of designing a retention schedule: the official who is responsible for the information management of the government agency, the archivist of the repository to which the records of the agency will be transferred, and an impartial expert who looks after the information interests of citizens in the process of assessment. How they come to their assessment of records is not prescribed.

Traditionally archivists have paid more attention to securing valuable records for perpetuity than to an all-encompassing involvement in appraisal and selection. The selection goals are still more or less one-dimensionally directed at identifying the records destined to be kept forever. The latest selection goal for records, adopted in 2010 by the Minister of Cultural Affairs and the Minister of the Interior, clearly shows this:

'The purpose of appraising, selecting and acquiring archives is to bring together and secure the sources that enable individuals, organisations and social groups or bodies to discover their histories and to reconstruct the past of state and society (and their interaction). To this end those archives or parts of archives that must be secured are:
a. representative of those items which have been recorded in society
b. representative of the activities of the members (people and organisations) of a society
c. considered by commentators as significant, exceptional or unique because these reflect the significant, exceptional and unique social developments, people and organisations of a particular period.'[11]

The selection goal may give direction on what to keep but is not helpful at all for the selection of records to be destroyed. This is, at the very least, a remarkable observation, especially when we look at it from a records-continuum perspective.
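To make concrete what a line in such a retention schedule contains, the following is a minimal sketch. It is a hypothetical data model, not an official Dutch format: the class and field names are mine, and only the five interests (taken from the Archives Decree 1995 as summarized above) and the choice between timed destruction and permanent transfer come from the text.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# The five interests the Dutch Archives Decree 1995 says must be taken
# into account when compiling a retention schedule (see the text above).
ARCHIVES_DECREE_INTERESTS = [
    "business processes of the agency",
    "relations to other government agencies",
    "value as part of the cultural heritage",
    "evidence for citizens seeking justice",
    "historical research",
]

@dataclass
class RetentionScheduleEntry:
    process: str                    # business process producing the records
    record_category: str
    retention_years: Optional[int]  # None means: keep permanently
    interests_considered: List[str] = field(
        default_factory=lambda: list(ARCHIVES_DECREE_INTERESTS))

    def disposition(self) -> str:
        """The legal fate of the records covered by this entry."""
        if self.retention_years is None:
            return "transfer to the archival repository (permanent retention)"
        return f"destroy {self.retention_years} years after closure"

entry = RetentionScheduleEntry("issuing licenses", "license files",
                               retention_years=10)
print(entry.disposition())  # -> destroy 10 years after closure
```

The point of the `retention_years=None` convention is exactly the asymmetry the paper criticizes: the schedule is explicit about what to keep forever, while every destruction decision hides behind a single number of years.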

For those who are responsible for appraisal and selection decisions, it is a nightmare to make wrong judgements that might result in the destruction of records which, as sometimes turns out only much later, should not have been destroyed. Sometimes appraisers make obvious mistakes in assessing records, for instance because the records are needed for the business processes for a longer period than the retention schedule stipulates. Sometimes records that actually exist are not included in the retention schedule. And of course it will be

[6] Richard Pearce-Moses, A Glossary of Archival and Records Terminology (Chicago 2005) 348.
[7] David Bearman, 'Moments of Risk: identifying threats to electronic records' in Archivaria 62 (2006), 15-46, p 25.
[8] Viktoria L. Lemieux, 'The records-risk nexus: exploring the relationship between records and risk' in Records Management Journal 20 (2010) 2, 199-216, p 201.
[9] The researched journals were: American Archivist; Archival Science; Archivaria; Information Management Journal (IMJ); Journal of Documentation; Journal of the American Society for Information Science and Technology (JASIST); Records Management Journal.
[10] Ibidem, 211.
[11] Commissie Waardering en Selectie, Gewaardeerd Verleden. Bouwstenen voor een nieuwe waarderingsmethodiek voor archieven (Den Haag 2007), p. 37-38.


possible that records are kept or destroyed contrary to the assessment made in the retention schedule. It is, however, not only these more or less clear mistakes that play a role in appraisal and selection. Changing societal conditions can also lead to new and different insights about the value of information for business processes. Although records may have been destroyed legally and completely in line with the societal standards of the time when the decision was made, new developments in society may make policy-makers regret the destruction of the records. It is not difficult to find examples of such changes in the evaluation of the same documents over time, and not only because of changing historical interests but also for business reasons. A good example is the interest in environmental issues in the late 20th century. Since the 1870s factories needed permission to start activities that might be harmful to health and the environment. For a long time these licenses, issued under the Nuisance Act, were legally destroyed several years after the permit expired. This policy changed after some environmental scandals shocked Dutch society in the 1980s. A very confronting wake-up call was the discovery of health-threatening quantities of poison in the soil of a newly built residential area in the municipality of Lekkerkerk in 1980. An expensive soil-sanitation program was the result, and the government started a national survey to list all potential spots of chemical pollution in the country. Suddenly the information in the (for the most part destroyed) 19th- and 20th-century licenses would have been very relevant for this purpose, because with the records still available it would have been possible to trace, for instance, the highly polluting white-lead factories or gas plants that had been in operation in the country since the late 19th century.[15] Exactly because of this newly manifested value for the administration, the retention schedule was changed: licenses issued under the Nuisance Act were no longer destroyed but kept permanently.[12] This example clearly shows the difficulties of making sustainable long-term appraisal decisions, and it also shows aspects of risk assessment.

The Dutch Council for Culture [Raad voor Cultuur], the most important advisory board for the Minister of Cultural Affairs, was until two years ago involved in the evaluation of every single retention schedule before a schedule could be decreed by a minister's resolution. Looking back at 45 years of assessing hundreds of retention lists, the Council stressed that 'selection has everything to do with risk analysis and risk management. The failure to recognize the risks of bad information in this field could have major implications for citizens and authorities'.[13] In its report the Council sketches eight examples to illustrate the importance of precise and careful appraisal and selection of records, because of the far-reaching consequences for government, society and citizens. In some of the examples the Council expresses fierce criticism of the inaccurate assessment of some categories of records because of failing risk awareness, for instance in the retention schedule that deals with the extraction of minerals, issued in 2005. In the draft version of this retention schedule, maps and designs of drilling machines were regarded as destructible, and measured values of earthquakes and subsidence were to be destroyed after 10 years. In the light of growing damage to houses and infrastructure caused by mining activities, the Council regarded this retention period as an example of ill-considered risk assessment.[14]

Based on this short analysis of appraisal practice in the Netherlands we may observe two things. In the first place there is a somewhat ambiguous attitude towards risk management. Although risk assessment and risk management are not clearly and explicitly addressed as part of the appraisal and selection procedure, implicitly they certainly are. In the second place we may discern that once risk assessment is introduced into the archival debate, there is often a one-dimensional, alarmist and reproaching undertone to it. Indeed, risk assessment might be seen as something so self-evident, arising from common sense, that it should not need specific attention or an explicit method. Reality, however, is different. What I argue is that it is a risk for the quality of information that risk management and risk assessment lack serious and methodical attention in appraisal and selection. There is an urgent need to be much more explicit about risk management and risk assessment in appraisal and selection as an integral part of records management.

SIGNS OF A TURNING TIDE?
It is indisputable that selection has to do with the highly valued principles of meeting the democratic rights of citizens who want a government that can be held accountable for its activities by showing the records, and with being able to reconstruct the past. But when we look at the world of paper records, selection also has to do with the very trivial issue of the space (which means money) needed to keep records. For a long time the issue of selection was primarily connected to the need to solve space problems for the administration. Periodical destruction of records was the answer, which however often appeared to be a largely spurious solution, because many government agencies were inclined to postpone the real selection of their records until there was hardly any administrative interest left in keeping them. Piles of paper waiting to be selected were the result, and the aim of eliminating the 'backlogs' in processing records became a recurring policy statement within the archival community. Many special programs were set up in attempts to speed up the processing, which in the end often turned out to be disappointing and still very time-consuming trajectories. The often-quoted Maynard Brichford once said that appraisal and selection 'is the area of the greatest professional challenge to the archivist. In an existential context, the archivist bears the responsibility for deciding which aspects of society and which specific activities should be documented in the records retained for future use'.[16] Probably the archive professional on the spot does not always feel this heavy burden and these big responsibilities, but the dilemmas he is confronted with in appraisal and selection can indeed easily paralyze the whole processing of records. David Bearman already in the 1980s has

[12] At the same time one could question this decision to keep the newly made licenses for this purpose. Environmental and health issues get so much attention in society that a whole range of new environmental laws and regulations have been issued, with new kinds of registrations of dangerous chemicals.
[13] Raad voor Cultuur, Selectie. Een kwestie van waardering (Den Haag 2013) 30.
[14] Ibidem.
[15] See for instance Gustaw Kalenski, 'Record Selection' in The American Archivist 39 (1976) 25-43, p 27-28 and James Gregory Bradsher, 'An Administrative History of the Disposal of Federal Records, 1789-1949', Provenance, Journal of the Society of Georgia Archivists 3 (1985) Issue 2. See http://digitalcommons.kennesaw.edu/provenance; Charles Jeurgens, 'De selectielijst en het historisch motief in de waardering en selectie van archieven' in: Put E., Vancoppenolle Ch. (Eds.), Archiefambacht tussen geschiedenisbedrijf en erfgoedwinkel. Een balans bij het afscheid van vijf rijksarchivarissen (Brussel 2013) 207-226.
[16] Maynard J. Brichford, Archives and Manuscripts: Appraisal and Accessioning (Chicago: Society of American Archivists Basic Manual Series, 1977), 1.


put forward some interesting, but at that time surely not undisputed, ideas about solving this growing burden of piling papers. He introduced the perspective of risk assessment into appraisal and selection by posing questions other than those archivists were used to in these matters. He asserted that '[i]nstead of asking what benefits would derive from retaining records, they [=archivists, CJ] should insist on an answer to the probability of incurring unacceptable risks as a consequence of disposing of records. This will very likely dramatically reduce the volume of records that are judged essential to retain. And it suggests an approach to solving the second dilemma of our current appraisal methods: their focus on records rather than the activity they document.'[17] With this reversed approach he introduced a new perspective, which immediately shows that risk also has its reward. The very simple question at the basis of this approach is: what will really go wrong if the records are not available anymore?

Some years ago the Dutch National Archives were involved in a program that aimed to speed up the processing of the large amounts of records of state institutions. The program dealt with records created between 1975 and 2005, and although it was never exactly calculated, at that time it was estimated that more than 800 km of shelves filled with paper were waiting to be selected. How could selection be sped up in such a way that government agencies could, within a reasonable time, meet their obligation to transfer records to the National Archives within 20 years? Without being able to go into detail here about this project and the methods that were developed to accelerate the selection process, in the context of this paper it is important to mention one of the elements that played a role in speeding up the process. What archivists from the National Archives usually did not do, but started to do as an experiment, was to discuss the relevance of the records for the business processes with the managers in the government agencies who were responsible for those processes. The very basic question to these business-process managers was what would go wrong if all the records that were waiting for selection were destroyed. Of course it was never a serious idea to destroy all these records, but the question appeared to be an interesting starting point for a serious conversation about the relevance of the records for them. These process managers made risk calculations, and generally speaking it was not so difficult for them to tell which records from which processes, produced 10 or 20 years ago, were still of vital relevance. A staggering observation was the discrepancy between the business-process managers on the one hand and the records managers on the other in assessing the relevance of the records for the business processes. The information-management processes were only to a certain extent a reliable reflection of the business processes, which makes the operationalization of the concept of archives as process-bound information rather problematic.

NEW DIRECTIONS
In a recently published report the Dutch National Audit Office criticized the complexity of implementing the many regulations that aim to bring government agencies in control of their information management. In its report the Audit Office paid special attention to the 'tenacious issue' of selection and reported that in the past decades the authorities have not succeeded in developing an effective method of appraisal and selection that provides a lasting contribution to the quality of information management. In particular, it blames the pattern of short-term official interest in solving only partial problems, in which singular actions are implemented to solve that partial problem in isolation.[18] The functionality of appraisal and selection has drifted further and further apart from the business processes. The categories that have been constructed for appraising and selecting official information do not always fit the actual information structures, resulting in complicated and time-consuming matching operations.[19] The observations of the National Audit Office can easily be associated with the earlier mentioned discrepancy between the structuring principles of business processes on the one hand and information-management processes on the other.

This discrepancy made us rethink the relationship between the business processes and the information processes. Because the bond between business and information practices is not always self-evident (partly because the business processes are not clearly defined, partly because records managers do not always know or understand the structures of business processes), the quality of records management is affected, creating uncertainty and a lack of clarity in appraisal and selection activities. This is in particular the case in policy-making activities, because they lack a clear, pre-defined structuring format. In the analysis of some information specialists at the Dutch Ministry of Defense, the real problem behind the often poor quality of selection is not so much the complexity of regulations as the lack of financial resources and of highly qualified employees to manage the records properly. Attempts to bring the quantity and the quality of the records-management staff to a level at which all records could be managed in accordance with the requirements were not very successful, because of a lack of interest from top management. The effect of this structural problem of failing quality of records management was that the ministry ran serious information risks without knowing what the risks were or where they could become manifest. Due to this trivial but very realistic problem, the Ministry of Defense started to experiment with risk-oriented records management. Instead of treating all records in the same way, records managers started to diversify the intensity of records management based on a risk assessment of the processes the Ministry carried out and was responsible for.[20] The same experiment started at the Ministry of Finance in a project with the National Archives.

The first thing that had to be done by records managers, in close cooperation with business managers, was to identify and list the processes carried out by the organization and to assess, for every single process, the risks of uncontrolled information loss (because information could not easily be found, was not complete, or might be destroyed illegally). Initially the Ministry of Defense distinguished three risk levels (high, middle and low), but nowadays only two levels are left. The category of high-risk processes is reserved for processes that risk casualties, major political damage and serious stagnation of the primary processes of the Ministry in case of uncontrolled information loss. Examples of such high-risk processes are military missions in, for instance, Afghanistan and processes carried out in the scope of national security, like

[17] David Bearman, Archival Methods, Archives & Museum Informatics Technical Report 3 (1989).
[18] Handelingen Tweede Kamer der Staten-Generaal [Parliamentary Papers] II, 2009-2010, 32 307, nos. 1-2: Algemene Rekenkamer, Informatiehuishouding van het Rijk (2010) 33.
[19] Ibidem.
[20] Ministerie van Defensie, Generieke Selectielijst voor de archiefbescheiden van het Ministerie van Defensie vanaf 1945 (Den Haag 2014); H.E.M.J. Kummeling, Documentaire Informatie. Studie DI-risico's bij defensieprocessen (Den Haag 2007); KennisLab, Eindrapport Expertteam Risicomanagement (Den Haag 2011).

intelligence services and explosive-disposal activities. A substantial part of the limited resources available for records management was allocated to improving information and records management for these high-risk processes. The result of this risk assessment of processes is a deliberate policy of better control of information management in the high-risk processes and less in the low-risk processes. Of course this generates new risks, but the rewarding element is better control of the most important processes.

In this still experimental risk perspective the archival function of appraisal has acquired a more holistic significance than in the traditional approach. Appraisal is more than assessing the value of records from a perspective of retention. The outcome of process assessment as described in this example will have its impact on the efficacy of the retention schedules. Selecting records from high-risk processes will be more accurate than selecting records from low-risk processes, simply because of the more intensive records management, which may for instance result in attaching more detailed metadata to documents of high-risk processes.

RETENTION SCHEDULES
The methods and procedures for compiling retention schedules in the Netherlands are currently being redesigned because of the need to appraise records much earlier in the digital workflow. Backlogs in the processing of digital records are a doom scenario that guarantees uncontrolled information loss. In the new approach, explicit risk assessment will be part of the process of preparing a retention schedule.[21] Risk analysis will be one of three tools available to make an all-encompassing evaluation of records and to determine whether and when records should be destroyed. The new methods and procedures aim to appraise at the moment of, or even better before the moment of, the creation of information.

The risk-analysis tool is developed for, and from the perspective of, those who are responsible for the business processes. A prerequisite for being able to compile a retention schedule is an extensive and up-to-date list of the business processes the organisation is responsible for. Even in the field of records management the maxim of W.E. Deming applies: '[i]f you can't describe what you are doing as a process, you don't know what you're doing'. Good knowledge and understanding of the business processes is the starting point for solid records management.

With a good picture of the business processes it will be easy to identify the managers who are responsible for these processes. In the newly developed method for designing retention schedules it will be crucial to involve the manager. He will be called to account for his responsibilities as a business manager. How long does he need the information kept in the records to pursue all the obligations he is held responsible for? What are the business, management, financial, political and legal risks if records are no longer available, or if records are not destroyed on time? The business manager in a governmental setting has obligations and responsibilities that go beyond the direct business processes. Within the scope of government, accountability (to Parliament and to the citizen) plays a major role. In the newly developed tool some suggestions are included to help the business manager in making his assessments. Does specific legislation on information (national or international) exist in the field of the activities the manager is responsible for? There is for instance legislation that requires mandatory destruction of specific police records after 5 years. Or are there specific circumstances why records should be kept for a longer period of time than usual? Recently the Ministry of Defence decided to keep the personal files about short-term psychosocial help for soldiers who were sent abroad on a military mission for 80 years instead of the usual 5 years. Risk calculation, based on some recent experiences of serious problems, led to this re-evaluation. Another question the business managers should answer is whether there can be reasons to keep some information permanently. One could think of records that contain information about infrastructural works like the dams and dikes in the Netherlands. That information can be of vital importance in the long run. In fact the most important aspect of the new appraisal method is that it will be an ongoing process of evaluation and re-evaluation. Until now a retention schedule has had a validity of at most 20 years. That is impossible in the dynamic information era we are in; a period of 20 years is an eternity. Appraisal will be a continuum.

CONCLUSIONS
Risk management, risk assessment and risk-based appraisal are important aspects of records management. We have only very recently started to give risk management the attention it deserves. In this paper I have focused on the risk-management aspects from the business perspective. There is a growing need for the archivist/records manager to intervene in the realm of information creation. He can no longer limit himself to being a passive, records-receiving professional. Instead, archivists and records managers need to develop methods and tools to play a role in the information and records continuum, without clear dividing lines between the different information interests. An all-encompassing risk perspective and risk assessment, not limited to the scope of appraisal and selection but based on records management closely connected to the real business processes, may be a valuable contribution to the quality of the information.

[21] This will also be the case in, for instance, the revised edition of ISO 15489. Expectations are that the new edition will be issued in 2015.
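The two-level process assessment described in the preceding sections can be sketched roughly as follows: classify each business process by the consequences of uncontrolled information loss, then match the intensity of records management to the outcome. The three consequence categories (casualties, major political damage, stagnation of primary processes) are taken from the paper; the function, the names and the 'intensive'/'basic' regimes are hypothetical illustrations, not the Ministry's actual tool.

```python
# Sketch of the two-level, risk-oriented process assessment discussed
# above. The consequence categories are those named in the paper; the
# names and regime labels are illustrative assumptions.

def assess_process(name, casualties=False, political_damage=False,
                   stagnation=False):
    """Classify a business process by the consequences of uncontrolled
    information loss and derive the records-management intensity."""
    high_risk = casualties or political_damage or stagnation
    return {
        "process": name,
        "risk": "high" if high_risk else "low",
        # High-risk processes get the more intensive regime, e.g. more
        # detailed metadata; low-risk processes get the basic regime.
        "regime": "intensive" if high_risk else "basic",
    }

for p in [assess_process("military mission abroad",
                         casualties=True, political_damage=True),
          assess_process("internal facility management")]:
    print(f"{p['process']}: {p['risk']} risk -> {p['regime']} regime")
```

The rewarding trade-off the paper describes is visible in the sketch: accepting weaker control ('basic') for low-risk processes is itself a risk, taken deliberately so that the limited resources go to the processes where information loss would hurt most.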

Page 11

Building a risk based records management governance for the City of Rotterdam

Bart Ballaux
City of Rotterdam
Postal box 70012
3000 KP Rotterdam, The Netherlands
+31-6-57999433
[email protected]

Jeroen van Oss
City Archives of Rotterdam
Hofdijk 651
3032 CG Rotterdam, The Netherlands
+31-10-2675527
[email protected]

ABSTRACT
This paper describes the transition process of records management in the city of Rotterdam. It anchors records management in the information architecture framework and links it to other information disciplines, in particular information security. A policy and maturity model has been developed to guide the development toward a more professional practice of records management. One of the cornerstones of the policy is a risk based approach to information and records1. Policy and the risk approach are aimed at making records management an integral part of business and process analysis on the one hand, and system analysis on the other hand. As a result, records management should reach a more mature level in the maturity model.
The paper provides an overview of how records management is organized in the city of Rotterdam. It will highlight how records management in Rotterdam has moved away from an application-driven approach to a business and process-driven approach; it elaborates the risk approach as a starting point for further action and a method to raise awareness of records management in the organization.

Categories and Subject Descriptors
H.1.0 [Information Systems]: Models and Principles - General

General Terms
Management, Standardization.

Keywords
Records management, Risk management, Quality management, Information governance.

1. INTRODUCTION
Like many government administrations, the city of Rotterdam (+600,000 inhabitants and an administration of approximately 11,000 civil servants) is in the process of adapting its structures and working methods to the needs of a rapidly changing society, and is combining this change with a reduction in costs. Keywords in this shifting role of the city administration are efficient, flexible, digital, reliable, and service-oriented.
During the last couple of years, the administration has been transformed from a decentralized structure with a rather high degree of autonomy for organizational units to a more centralized structure. The organizational structure of Rotterdam has been "reduced" from 17 autonomous central business units and 14 local authorities to 7 autonomous central business units and 14 city districts with advisory committees. Six primary business units are supported by a shared services unit that centralizes all support services.
These changes on an organizational level create both opportunities and challenges for records management professionals. Before the organizational changes, records management services were the responsibility of the autonomous business units. Some business units were supported by only one or two records managers; others had a team of 15 to 20 records managers. Records management professionals, however, mainly operated on the operational level, i.e. main duties included registering incoming and outgoing correspondence on the item level, managing case files with limited use for the administration, and supporting administrative staff with practical records management issues (classification, appraisal, etc.). Some central guidance and advice originated from one of the 17 central business units, the City Archives. A central records management policy (a legal obligation for Dutch city administrations), indicating responsibilities and rules in general terms, has been in place for decades, but has never been fully implemented by the business units.
Prior to 2010, various efforts were made to standardize and professionalize records management. In the years 2006-2010 a large scale attempt was made to implement a standardized and uniform digital records management application for the entire municipal organization. Due to internal and external factors, it was limited, though, to the implementation of a stand-alone document management application (without integration with other business applications). In addition, it was not governed by a central records management strategy and policy subscribed to by the business units. In the end, only two of 17 business units adopted the application, and in 2010 further implementation was canceled.
The main reasons for the failure were:
1. The implementation was application-driven, not business-driven.
2. The implementation was not supported by a broadly accepted records management policy.
3. The application was designed to support the records flow, not the business process.
4. As a result, the range of mandatory and optional metadata to be filled in manually on document level was rather high.

1 Information and records are used as synonyms in this paper.

5. Acceptance by users was lacking as the application was mainly considered as the "playground" of records management professionals. The application was considered a nuisance, rather than a tool supporting business.
6. The application was a stand-alone, without integration with business processes and applications.

A new approach was necessary. In order to be successful, it should not put an emphasis primarily on records management workflows, rules, regulations, and applications (as summarized in general in ISO 15489) [1], but on articulating the importance of good records management for the business. As a result, the call for appropriate records management would not only be propagated by records management professionals, but by the executive management level.
In 2010, the city archivist initiated a records management program. Five records management professionals were hired to initiate the change and to anchor records management in the business, especially on the executive level.
The program re-endorsed some general principles, but also introduced some new principles:
1. Records management is primarily the responsibility of the business. Records management professionals give advice to the business; however, decisions to implement are made by the business. Risks on non-compliance are the integral responsibility of the business units.
2. Records management, like information security, is a standard and integral part of each business process.
3. Records management supports the business; therefore it is preferably as invisible as possible for users.
4. The traditional distinction in stand-alone Electronic Document and Records Management Systems (EDRMS) between document management and records management is not based on records management logic, but on application logic. In a business approach to records management this distinction is not useful.
5. The required quality level of records management is determined through risk assessment. Risk assessment is executed at the process level, not at the system level.
6. Records management is just one perspective of information. Others include process (re)design, information security, information architecture, business intelligence. Collaboration and integration with these perspectives will help put records management on the agenda.
In 2014, the records management program will be canceled. As a consequence of the organizational changes, a central department of records management professionals has been created in the shared services unit.
The program has introduced a new framework for records management that comprises a governance structure, a risk based policy, a strategy and a road map, methods, and a new records management organization.

2. GOVERNANCE AND POLICY
As part of the organizational changes in Rotterdam, a new governance structure for all support services was introduced. The CIO, as the "business owner" and primary responsible for information and ICT, makes policies, collects demands from the primary business units, and negotiates with the shared services unit about service levels. The CIO has divided the information field into several disciplines or "information themes," of which records management is one. Others include information security, collaboration, business intelligence, open data, and workflow engineering. A roadmap has been described for all information themes. The records management roadmap consists of four domains:
1. Policy and integration of records management in the information architecture;
2. Integration of records management in IT systems;
3. Integration of records management in business process re-engineering;
4. Organizational culture and its relationship with awareness of the value of information.
The idea of the roadmap and the domains is that a higher level of maturity should be reached in each domain. For instance, records management functionalities can be implemented in software, but without any awareness and training (organizational culture), it will not have any effect.
In 2014, the most effort has gone into domain 1 (policy and information architecture). A new records management policy was introduced. Many chapters are dedicated to "traditional" records management activities, as summarized in ISO 15489 (e.g. classification, appraisal, metadata, etc.) [1]. However, two policy rules are not self-evident:
1. All IT projects are assessed by a records manager.
2. Various levels of quality are introduced in records management.
This last policy rule is elaborated in the next section.

3. RISK APPROACH
The basic principle of the Rotterdam records management policy is that not every business process requires the same quality level of records management. Some processes are not that important and (thus) imply limited risks. In addition, the importance of information and records (and thus of the choice of the records management regime), primarily determined through business needs and rules, is also diversified.
Traditionally, records management policies in The Netherlands do not differentiate between processes and records in levels of quality or importance (and risks). The retention periods, in a Dutch municipal setting mainly based on laws and regulations, only indicate how long information has some value and thus are only marginally linked to business needs and rules. The traditional approach results in a "one size fits all" solution: all records are considered as important.
The risk approach allows for differentiation in records management regimes, and thus meets the needs of the business to keep records management rules as simple and user-friendly as possible. If a process and its information have been identified as low risk from a records management perspective, then no or a very limited number of records management measures are implemented.
It is important to make three preliminary notes about the Rotterdam risk approach.
1. It is not a risk analysis in the traditional sense [2]: it does not identify specific records management risks in a business process, nor does it enumerate risks in records (management)

processes (as is done in ISO 18128 [3]). The risk analysis is a tool to make business executives aware of the importance of their process and its information. As such, it could also be called a process and information classification tool. However, using "risk assessment" in communication serves the purpose of provoking a discussion with the business about the importance of an appropriate level of records management measures.
2. The risk approach combines a calculation of the importance of the process and that of the records created in the process. Theoretically, the two are different variables and could be independently assessed. However, experience shows that in practice the valuation of the process and of the records created in the process follow a similar logic.
3. The risk approach has no ambition to create an absolutely objective result. It is possible that the same process in another city administration is valued differently, because circumstances are different.

The risk method uses four quality criteria, each divided into four quality levels, which are operationalized into three fields of control measures. It was inspired by the data classification method that is used in the field of information security. The required quality level is determined using a digital questionnaire: the "Records Management Risk Tool." The questionnaire is filled in during a workshop attended by records managers and representatives of the business. In 2014, a new version of the tool was extended with questions about information security.
The records management criteria originate from the ISO 15489 standard [1] and are based on the four characteristics of a record: authentic, reliable, integer, and usable. Reliability was not upheld as a criterion: it implies a check on the completeness of the content of a record as compared to reality. It was assumed that such a check is the sole responsibility of the business and should not be evaluated by a records management tool.
Usability, as a separate criterion, was too broad. Therefore, it was split into three criteria that could be operationalized in practice: retrievable, interpretable, and displayable.
For reasons of practicality, authenticity and integrity were merged into one criterion: it was assumed that the business would not significantly differentiate between the two.
The four records management criteria are:
 authentic and integer;
 retrievable;
 interpretable; and
 displayable (it can be presented).
These criteria can easily be matched with the quality criteria used in the field of information security:
 Integer
 Confidential
 Available
As seen in figure 1, some of the criteria are complementary, while others partly overlap.

Figure 1: Quality criteria of records management and information security
[Figure: the records management criteria (authentic, integer, retrievable, interpretable, displayable) and the information security criteria (integer, confidential, available) shown as partly overlapping sets.]

The information security definition of integer partly covers the ISO 15489 definitions of authenticity and integrity. Availability implies that a record can be retrieved and represented.
Every records management criterion is divided into four quality levels, related to business requirements, as described in table 1.

Table 1: Levels of records management quality

Authentic and integer
Level 0 - Not sure: the business process allows that there is no guarantee that the information is authentic and integer.
Level 1 - Protected: a basic level of guarantee for authenticity and integrity is required.
Level 2 - High: the business process allows little violation related to authenticity and integrity.
Level 3 - Absolute: conclusive evidence about author, moment of creation, content, and changes is necessary.

Retrievable
Level 0 - Not necessary: information may not be retrieved, without any consequence.
Level 1 - Necessary: information may incidentally not be retrieved.
Level 2 - Important: if necessary, information can be retrieved with special (incidental) effort.
Level 3 - Essential: information can be retrieved in a timely and efficient manner.

Interpretable
Level 0 - For those directly involved: persons directly involved are able to interpret and understand the information.

Level 1 - For a broader group in the organization: information can be interpreted and understood by persons not directly involved in the process, shortly after closure of the case.
Level 2 - For users outside the organization and through time: information can be interpreted and understood by users outside the organization and after closure of the case.
Level 3 - For users at a large distance in space and time: information can be understood and interpreted by persons and stakeholders who are at a great distance from the original business process and its information.

Displayable
Level 0 - Not necessary: information cannot be displayed, without any consequence, even for authorized persons.
Level 1 - Necessary: information may incidentally not be displayable, even for authorized persons.
Level 2 - Important: if necessary, information can be displayed with special (incidental) effort.
Level 3 - Essential: information can always be displayed by authorized persons.

For each criterion, indicators were formulated and translated into questions that were not too technical for persons not familiar with records management terminology. Indicators are, for instance:
1. The impact of financial, political, reputational or health risks caused by bad quality of information.
Rationale: the use of information of bad quality may have an impact on decisions at a financial, political, reputational or health level. This indicator affects all quality criteria.
2. Legal requirements related to the process (time available to finish the process).
Rationale: timely delivery of a product is a legal requirement. This indicator affects the ease of retrieving records.
3. Legal requirements related to the form and status of records.
Rationale: legislation, rules, and procedures may prescribe the formal characteristics of a record. This indicator affects the authenticity and integrity of records.
4. The extent to which partners, law enforcers, or accountants have to rely on information in a later phase.
Rationale: persons involved in other processes may need information on this process. This indicator affects all the quality criteria.
5. The number of times a process is executed.
Rationale: the more a process is executed, the more important the quality of information becomes. This indicator affects all the quality criteria.
6. Retention periods.
Rationale: the longer information has to be preserved, the more important it is to ensure it can be used in the future. This indicator affects the interpretability and displayability of records.
7. The quantity and complexity of the information.
Rationale: the more information in a case file, the more important it is to be able to browse through it. This indicator affects the ease of retrieving records.

Indicators and rationales can all be questioned. Separately, they tend to overemphasize one small aspect of the quality of information. Taken together, the indicators give a more balanced overview of the quality requirements of information in a process.
The outcome of the questionnaire is a risk level, information classification level or information quality level, varying from 0 to 3, for the combination of the four records management quality criteria. The quality criterion with the highest score determines the overall score.

4. RECORDS MANAGEMENT REGIMES
What are the consequences of these quality levels, in terms of records management measures? The records management risk approach distinguishes three fields of records management measures. Two are to be implemented in applications and one consists of procedures and is thus concerned with responsibilities and organizational culture. The three include, for each quality level:
 A list of metadata, based on the ISO 23081 standard [4] and a Dutch metadata guideline [5];
 A list of functional requirements for applications, based on the Dutch standard NEN 2082 [6] and ISO 16175 [7];
 A list of procedures, describing activities related to the management of records.
The highest classification (level 3) requires that the full set of metadata, functional requirements, and process measures be implemented. This also includes a preservation regime for permanent preservation. At the other end of the scale (level 0) no records management measures are required. In other words, "records management does not care what the business is doing with its information." Levels 1 and 2 require a subset of measures.
Classification outcomes are documented in a central registry. The ideal is to have an overview of all municipal processes with their classification level for records management and information security, as the basis for a plan, do, check, act cycle.

5. IMPLEMENTATION
The risk approach has been used for approximately 50 processes in Rotterdam. Most of the classification levels are high (2 or 3). This should not come as a surprise: most of the processes that are analyzed are chosen because they are considered important by the business. Processes on legal complaints, domestic violence, debt mediation, fraud research, recovery of money, etc. have a high risk level. However, the process of issuing a Rotterdam event year card, for instance, scores a risk level of 1.
The added value of the risk approach is that the business is aware of the value of its information, realizes that good management of information is necessary, and is more inclined to invest in appropriate measures.
Owners of business processes in risk levels 2 and 3, making use of business applications, are advised to link these business applications with a standard, integrated document and records management tool. In that way, the business is sure that functional and metadata requirements are met. If that is not possible, records management functionalities (and required metadata) are built into the business applications.
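The scoring and regime logic described in sections 3 and 4 - each of the four criteria rated 0-3 in the questionnaire, the highest criterion score fixing the overall classification, and the classification selecting one of the four regimes - can be sketched as follows. This is an illustrative sketch only: the criterion names follow the paper, but the function name and regime labels are hypothetical and not part of the actual "Records Management Risk Tool".

```python
# Illustrative sketch of the Rotterdam risk classification logic: each of
# the four records management criteria is scored 0-3, and the highest
# criterion score sets the overall level (hypothetical names throughout).

CRITERIA = ("authentic_and_integer", "retrievable", "interpretable", "displayable")

# Level 0 = no records management measures; level 3 = full set of
# metadata, functional requirements, and procedures (per section 4).
REGIMES = {
    0: "no measures",
    1: "subset of measures (basic)",
    2: "subset of measures (extended)",
    3: "full metadata, functional requirements, procedures, preservation",
}

def classify(scores: dict) -> tuple:
    """Return (overall level, regime) from per-criterion scores 0-3."""
    for name in CRITERIA:
        if not 0 <= scores[name] <= 3:
            raise ValueError(f"score for {name} must be 0-3")
    # The quality criterion with the highest score determines the overall score.
    overall = max(scores[name] for name in CRITERIA)
    return overall, REGIMES[overall]

# Example: a process whose records must be retrievable in a timely and
# efficient manner (level 3) is classified level 3 overall, even if the
# other criteria score low.
level, regime = classify({
    "authentic_and_integer": 1,
    "retrievable": 3,
    "interpretable": 2,
    "displayable": 1,
})
print(level, regime)  # 3 full metadata, functional requirements, procedures, preservation
```

The max-of-criteria rule mirrors the paper's statement that levels 1 and 2 require a subset of measures while level 3 triggers the full set, including a permanent preservation regime.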

6. CONCLUSIONS
This paper presented the new approach to records management in the city of Rotterdam. The transition from a traditional application-driven to a process-centered approach was necessary because the technological approach failed, and did not take into account organizational changes and cultures. To address this challenge, a roadmap for records management was developed. In contrast with the previous approach, a more integrated view of records management was taken, including technical, cultural, and policy elements. The new approach is based on a risk analysis. Inspired by the characteristics of a record, a risk methodology has been developed in order to assess the required quality of information. The risk classification is the starting point for further action. Four records management regimes have been identified, ranging from no risk to high risk. The specific city of Rotterdam interpretation of risk analysis is helping to raise awareness in the business about the value of information and it is one of the principles that will move records management to a higher level.

REFERENCES
[1] ISO 15489-1:2001. Information and documentation - Records management - Part 1: General. 2001.
[2] ISO 31000:2009. Risk management - Principles and guidelines. 2009.
[3] ISO/TR 18128:2014. Information and documentation - Risk assessment for records processes and systems. 2014.
[4] ISO 23081-1:2006. Information and documentation - Records management processes - Metadata for records - Part 1: Principles. 2006.
[5] Richtlijn Metadata Overheidsinformatie, version 2.5. 2009.
[6] NEN 2082:2008 nl. Eisen voor functionaliteit van informatie- en archiefmanagement in programmatuur. 2008.
[7] ISO 16175-1:2010. Information and documentation - Principles and functional requirements for records in electronic office environments - Part 1: Overview and statement of principles. 2010; ISO 16175-2:2011. Information and documentation - Principles and functional requirements for records in electronic office environments - Part 2: Guidelines and functional requirements for digital records management systems. 2011; and ISO 16175-3:2010. Information and documentation - Principles and functional requirements for records in electronic office environments - Part 3: Guidelines and functional requirements for records in business systems.
Preserving digital heritage: a network centric approach

Ana Rodrigues, Francisco Barbedo, Lucília Runa, Mário Sant'Ana
DGLAB*
Alameda da Universidade, 1649-010 Lisboa
[email protected] | [email protected] | [email protected] | [email protected]

ABSTRACT
The paper presents a rationale for the project "digital continuity" launched by DGLAB, explaining its fundaments, methodology and findings so far. It finishes by proposing future work and aims to be achieved.

Keywords
Digital heritage, cooperation network, digital preservation

1. INTRODUCTION
Preserving digital objects is no longer an exclusively technological challenge. Correlated with informatics development and use, social and organizational issues are mandatory in order to obtain complete and accurate preservation solutions.
Three orders of reasons contribute to this situation. The first one is that digital preservation is a pervasive problem that spreads to every organization and individual that produces digital data, professionally or individually.
The second one is that lack of preservation actions leads very quickly to obsolescence, which is a condition that can actually stop business continuity. Today's organizations are beginning to grasp this reality, as the digital data produced over the past years has accumulated to a proportion at which digital obsolescence is already being strongly perceived.
The third reason is that preserving digital objects is a costly activity that demands a lot of expertise and highly qualified people, but also a considerable investment in equipment and development, as well as a high fixed cost in order to keep digital repositories. This reality, presently highlighted by the financial crash and general economic depression, may lead to establishing partnerships between institutions that need to preserve digital material, sharing costs and knowledge. A powerful and dedicated infostructure, and the human capacitation to deal effectively with it, is a business that is better managed together. Preserving digital objects is a solidarity activity, not an egotistic one.
In this process the difference between cultural domains seems to become less relevant, as critical mass is best achieved by converging efforts from digital heritage holders, irrespective of the community of practice (CP) to which each one belongs.
Working together means sharing resources and managing organizational data collectively. This requires a new way of doing business, which in turn demands new social relationships, new management models and new financial sustainability solutions.
This problem concerns every digital object that must be preserved for operational or cultural reasons for more than 7 years. But as far as digital heritage is concerned the issue is permanent and requires particular attention, as heritage belongs to every citizen in a nation.
The Portuguese National Archives decided to organize a meeting in order to listen to different stakeholders regarding the problem of preserving digital heritage and to explore the possibility of organizing new, network centered, ways of preserving digital heritage.
The event took place in September 2013 and the agenda consisted of putting together different CP from the public and private sector in order to discuss the problems each experienced in preserving digital objects.
Some questions were raised concerning the possible convergence of similar problems, independently of the cultural domain to which digital material belongs. We also wanted to harvest people's perception of what digital heritage really is, meaning to know how different CP include digital objects in their cultural and heritage domain. We were particularly interested in digital surrogates of analogical material that have been created and managed through the massive digitization projects that were very popular in the past decade. Should digital surrogates be processed as originals, in the sense that they should be preserved forever?
Other questions were asked and also raised by the meeting attendants, such as the convergence between different CP when dealing with digital data, the possibility of shared solutions for storage and digital repositories, and issues related to cloud solutions.
The conclusions1 were drawn from the interventions of invited speakers together with the conclusions of 4 workshops held with the public, organized around 4 thematic issues: 1/ The inclusion of digital objects into the heritage set; 2/

1http://1seminariopreservacaopatrimoniodigital.dglab.gov.pt/ conclusoes/ Page 17 Curatorial responsibilities in digital world; 3/ Common develop a governance and sustainability model that enables technological platform; 4/ Building a national network for a smooth management and operation of a common network. preserving digital heritage. Acquiring financial resources to bid whatever is decided at the end of phase one will also be a subject that the working team, will tackle by 2015. 2. THE PROJECT Several people that work in different CP were invited to join the team, which was divided in an executive set of people The project, baptized as “digital continuity”, is an initiative committed to gather material, ensure the logistics and of National Archives and was launched as a response to the perform analysis and another set whose task is basically to conclusions from the 1st seminar that considered important provide data and information and also discuss and validate to continue the work started and to raise awareness on the analyzed data submitted to them. present situation in Portugal regarding preservation of digital heritage and to propose network centered solutions to that All the work developed so far has considered all the CP that issue. the project team members belong to: The project assumes that in digital environment, heritage • Archives; objects, no matter their cultural provenance, are basically • Libraries; information binary coded and machine readable. • Cultural Heritage; • Journalism; This fact turns digital information as digital heritage sharing • Television; common features that may enable their common • Radio; management. • Cinema; The differences between different CP objects become • Photography; indistinct, except for the use that specific communities of • Music; audience require. But even then recent organizational • Multimedia, entertainment. 
Experiences like Europeana show us that the needs of a remote audience do not follow the traditional division between the different cultural domains.

The basic organization of the project is to join together representatives of the different cultural domains and CP in order to discuss the possibility of constructing such a structure, looking for similarities and differences spread across multiple layers of practice and knowledge. The topics we want to explore are:

1/ the regulatory framework, comprehending law, terminology structures and concepts, and metadata standards and formats;

2/ authenticity and appraisal, considered under the different points of view of the represented CP;

3/ access requirements and Digital Rights Management (DRM) in the digital landscape;

4/ technical requirements, such as storage, dimension and prospective growth;

5/ architecture and logical model definition;

6/ business and sustainability models for the network.

The project is organized on a bottom-up methodology rooted in grounded theory, enabling new findings to lead to new lines of analysis. The approach is inductive and is inspired by international experiences such as the InterPARES project [2] and NDIIPP [3]. The project development plan will eventually lead to a set of conclusions that cannot, for the moment, be anticipated; the ongoing work will inform the actions to be taken in the future. It is possible that different levels of acceptance from the participants will be identified.

The project began in January 2014 and has two phases. The first, currently under way, aims to produce a body of knowledge, aligned with the different layers of research, that may inform the participants of the advantages and disadvantages of a common preservation infostructure. We also intend to …

2 http://www.interpares.org/
3 http://www.digitalpreservation.gov/index.php
4 http://www.europeana.eu/; http://www.digitalpreservation.gov/

3. WORK DEVELOPED SO FAR

3.1 Step one

The first step of the planned chronogram was to harvest the regulatory environment that influences the activity of the different CP represented in the project team. This phase is already finished and led to the general conclusion that there are more similarities than differences between the requirements and factors that influence the work of the CP. This observation corroborates work already developed in international instances, like Europeana or NDIIPP [4]. In order to preserve digital material together, all that is necessary is a common set of requirements and practices that enables, through easy interoperability, the development and operation of common structures.

The methodology followed was to harvest systematically the documents in the defined sets of observation. This was achieved with the help of the project team, which provided all the information regarding their specific cultural domains. The analysis then took place on a sample of all the documents identified. This sampling was justified because 1/ some of the documents are not really in use by the CP, at least in Portugal, and 2/ the identified documents are repetitive and bring no new data to the analysis. For example, there are several standards and terminologies on music or photography that partially or fully overlap; it would be useless to take all of them into account in the analysis. A comparison was then performed element by element in order to find similarities and divergences between them. All the results were intensively discussed with the project team.

3.1.1 Law and regulations

In the context of the legislative analysis, for each CP we proceeded to the recognition and identification of the regulatory statutes governing the respective activity - legal regimes, Acts, deontological codes - as well as of specific aspects with particular relevance to digital preservation, which were grouped around these two categories. Accordingly, we identified the regulatory and specific law, at both national and European level, the latter whenever possible [5].

From the inquiry conducted, the existence of 10 acts was found:

- with interest for all CP (the Law on Copyright and Related Rights, the Legal Deposit Act and the Data Protection Law);

- multidomain, applicable to journalistic activity - exercised through the Press, Radio and Television - and to Cultural Heritage;

- with explicit references to heritage preservation / digital heritage, including:

  o Clause 11 of the Law on Cultural Heritage, on the "duty of preservation, protection and enhancement of Cultural Heritage";

  o Chapter VII of the Television Act, paragraphs 1, 2 and 3 of article 92, on the "preservation of Television Heritage";

  o Chapter VII of the Radio Law, article 83, on "Heritage preservation of radio broadcasting - records of public interest" [6].

3.1.2 Terminology

Regarding terminology, the vocabulary structures of the CP were studied. It was observed that 2 types of vocabulary structures may be found. The first is dedicated to the activity itself: it contains concepts and vocabularies whose purpose is to help the workers who deal with that specific core business. For example, movie terminology corresponds to the vocabulary actually used in the cinematography industry and therefore contains terms and concepts dedicated to that particular business. The other type consists of vocabulary structures dedicated to the activity of describing or cataloguing the material produced by a specific kind of activity. Photography, for instance, shares a set of terms in common with other structures that deal, e.g., with the description of works of art.

Although there are of course a lot of different concepts and therefore terms, it is possible to find a core of common terms and concepts, indicating an approximation between the ways the different CP conceptualize aspects that are common to their activities and to the material they deal with. The presence of the same terms (representing concepts) in several vocabulary structures may give a clear indication of a shared perception among at least some of the different CP. The analysis of the 6 vocabulary structures [7] had as an outcome the identification of a set of terms that exist in at least two of the examples observed. The average number of common terms present in the observed vocabulary structures is 4,2. The results by CP and the most popular terms are depicted in the following figure and table.

Figure 1. Popular terms by CP

Table 1. More popular terms by CP

  term                                         # occurrences
  Access                                       5
  Authenticity                                 2
  Appraisal                                    4
  Custody                                      1
  Identification of digital heritage           1
  Digital heritage                             4
  Digital preservation                         3
  Certification and security of repositories   2
  Copyright                                    5
  Usability                                    2
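The shared-term analysis described above - identifying terms that appear in at least two vocabulary structures - can be sketched with a small tally. The vocabularies below are invented placeholders for illustration, not the project's actual data:

```python
from collections import Counter

def shared_terms(vocabularies):
    """Return the terms that occur in at least two of the given
    vocabulary structures, with the number of structures using each."""
    counts = Counter()
    for terms in vocabularies.values():
        counts.update(set(terms))  # count each vocabulary at most once per term
    return {term: n for term, n in counts.items() if n >= 2}

# Hypothetical per-CP vocabularies (stand-ins for the real structures)
vocabularies = {
    "archives":  {"access", "appraisal", "authenticity", "copyright"},
    "libraries": {"access", "copyright", "usability"},
    "museums":   {"access", "appraisal", "digital heritage"},
}

core = shared_terms(vocabularies)
# core maps each shared term to the number of vocabularies containing it,
# e.g. "access" appears in all three structures above
```

Counting each vocabulary at most once per term (via `set`) is what distinguishes "number of structures sharing a term" from raw term frequency, which is the measure the table above reports.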

5 Regarding the methodology followed, see: http://1seminariopreservacaopatrimoniodigital.dglab.gov.pt/projeto-continuidade-digital/documentos-de-projeto/.
6 In spite of the two last explicit references, there are no public laws to protect or to classify the television heritage, but only an organizational policy of collection management.
7 Other vocabulary structures were identified. Those can be checked in the 1st project report, available at: http://1seminariopreservacaopatrimoniodigital.dglab.gov.pt/wp-content/uploads/sites/19/2014/10/SinteseMaterialPassoUm1.0.xlsx.
8 Listed in the tab "Standards", Annex 1 of the 1st project report, available at: http://1seminariopreservacaopatrimoniodigital.dglab.gov.pt/wp-content/uploads/sites/19/2014/10/SinteseMaterialPassoUm1.0.xlsx.

3.1.3 Standards

The identified standards [8] correspond to:

- international rules, standards, guidelines and recommendations prepared by the International Council on Archives (ICA), the International Federation of Television Archives (IFTA), the International Council of Museums (ICOM), the International Federation of Library Associations and Institutions (IFLA), the Music Library Association (MLA), Online Audiovisual Catalogers (OLAC), the International Organization for Standardization (ISO), Europeana, the Archives Portal Europe Network of Excellence (APEx), etc.;

- national rules, prepared by national archives, libraries and museums (Portugal, Canada, United States of America, Australia).

Their distribution is as follows:

Figure 2. Standards distribution (archives 31%, clinical information 29%, museums 21%, television 10%, cinema 5%, libraries 3%, music 1%)

They were classified according to their pertinence to the main goal of the working team: to represent/describe cultural heritage objects. The purpose was to proceed to a comparative analysis of the most pertinent standards, to detect possible matches and to ascertain the possibility of mapping their elements to a common structure. Considering the clinical information and the specificity of the clinical records, several standards were identified, and case studies were conducted to ascertain their correspondence with the standards used by other CP.

From the identified standards, only 4 are shared by two CP:

Figure 3. Shared standards (shared by 2 CP: 11%; not shared: 89%)

The main conclusions of the standards analysis were as follows:

1/ the major part concern a unique CP; the exceptions are CIDOC-CRM and Dublin Core;

2/ the major part are intended for only one of two things: object description or object description encoding; the exceptions are Dublin Core, CIDOC-CRM, EBU-TECH 3293 and MPEG-7;

3/ the major part focus on the object, an exception being CIDOC-CRM, which is event-centred and object-oriented;

4/ the major part include contextual information about the described objects, although archives and museums are the CP that consider this kind of information with the greatest depth;

5/ there are other CP besides the archives which adopt a multilevel description, such as museums and libraries;

6/ the major part of the standards are categorial, which means they group the information in areas and, within these areas, in different elements. The exceptions are Dublin Core, CIDOC-CRM, EBU-TECH 3293 (based on Dublin Core) and MPEG-7, which are combinatorial: they use metadata, metadata schemes and description definition languages;

7/ all the standards show a concern about cross-domain and multidomain coordination. In the case of the archives this concern is reflected, e.g., in ISAAR (CPF): "(…) separate capture and maintenance (…) of contextual information is a vital component of archival description. The practice enables the linking of descriptions of records creators and contextual information to descriptions of records from the same creator(s) that may be held by more than one repository and to descriptions of other resources such as library and museum materials that relate to the entity in question. Such links improve records management practices and facilitate research." (ISAAR (CPF), 1.5, p. 7);

8/ the standards have a common goal: the sharing and retrieval of information;

9/ categorial rules are more classic and were the first to be produced, but they are quite equivalent. Considering their specificities:

  o some of them are related to the scope of the standards: e.g. ISAD(G) is a general archival standard - "(…) rules (…) do not give guidance on the description of special materials such as seals, sound recordings, or maps." (1.4, p. 7). The same principle does not apply to libraries: the ISBD includes specific rules for specific materials;

  o but there are other specificities which are not related to the characteristics of the described objects: e.g. elements accommodating information about the location of objects are considered only by the museum standards, although this kind of information is also relevant to libraries and archives;

  o there are also elements intended to accommodate equivalent information; however, considering their specific objects, there are content specificities: e.g. "Immediate source of acquisition or transfer" (archives), intended to "record the source from which the unit of description was acquired and the date and/or method of acquisition", and "Acquisition method" (museums), versus "Acquisition modality" (libraries), intended to accommodate information like the price of a resource;

  o on the contrary, there are elements specially intended for specific types of objects: e.g. the ones grouped in the "Edition area" or in the "Publication, production, distribution area" (libraries), the last one having correspondence in the standards used by CP like television;

10/ combinatorial rules, by their relational approach, offer a flexible description compatible with different CP;

11/ practical experiences of crossed description proved the possibility of describing an object of a specific domain using the rules of another domain.

3.1.4 Formats

The central purpose of this activity was to identify the formats mainly used among the different CP. All formats were considered, whether used with the objective of access or of preservation.

In order for them to be referenced by all CP, an open format list was drafted in an Excel file; each CP could add any formats they considered adequate. The list was organized by categories, following the conceptual framework of the website of the US Library of Congress, "The Digital Formats Web site" [9], and several ad hoc contributions of the digital continuity project team. The list contained 176 formats, structured in 6 of the 8 categories provided (still images, sound, text, generic, moving images and datasets; Geospatial and Web Files were not considered).

In this section (3.1.4), the Photography and Radio CP, not present on the project team, also answered the survey.

Results (some of the more salient aspects):

- A total of 54 different formats are used by all the surveyed CP [10].

- The categories with the highest number of formats are Sound (47) and Still Images (42); the categories with the fewest elements are Text (16) and Generic (13); in between are Moving Images (38) and Datasets (20).

- No format is used by all the CP in the study.

- Moving Images, with 23 hits, is the category with the highest number of formats in use; at the other end, Datasets, with 4 hits, is the category least mentioned.

Figure 4. Number of formats referred most often, by category (moving images 23, sound 13, still images 9, generic 8, text 5, datasets 4)

- Formats used by a greater number of CP: XML (identified in the Generic category), PDF and JPEG are the formats most used by the different CP. The MP3 and TIFF formats are referred by 5 CP, while PNG, PPT and ZIP are mentioned by 4 CP.

Figure 5. Formats used by a greater number of CP (JPEG 6, PDF 6, XML 6, MP3 5, TIFF 5, PNG 4, PPT 4, ZIP 4)

- Most used format by category: XML for generic, PDF for text and JPEG for still images are each used by 6 different CP. MP3 for sound is used by 5 CP. AVI, MPEG-4 Video Encodings and QuickTime for moving images are used by 3 CP. XLS for datasets is used by 2 different CP.

Figure 6. The most used formats by category (generic: XML, 6; text: PDF, 6; still images: JPEG, 6; sound: MP3, 5; moving images: AVI + MPEG-4 Video Encodings + QuickTime, 3; datasets: XLS, 2)

- Formats used by the CP, by category (summary):

  o Still Images has 9 formats: JPEG, TIFF, PNG, BigTIFF, DICOM, DNG, JPEG 2000 Encodings, JPEG 2000 File Formats and GIF;

  o Sound has 13 formats: MP3, WAVE, QuickTime, WAV, AIFF, ASF, BWF, FLAC, ID3, MP2, PCM, WM (Windows Media) and Music XML;

  o Text has 5 formats: PDF, DOC, DOCX, XML and TXT;

  o Generic includes XML, AUTOCAD, ASF, DGW, ISO_image and RIFF;

  o Moving Images has 23: AVI, MPEG-4 Video Encodings, QuickTime, Flash (SWF, FLA, FLV), WM (Windows Media), DivX, MPEG-2, AAF, AC-3, ASF, Cinepak, DPX, DTS, DV, H.26n ITU-T video encoding standards, Indeo, MPEG-1, MPEG-4 File Formats with Encoded Bitstreams, MXF, RealMedia, Uncompressed video encodings, Digital Betacam (AKA Digibeta or D-Beta) and HDCAM;

  o Datasets has 4 formats: XLS, CSV, ISO_8211 and XLS (Linux).

9 Cf. Sustainability of Digital Formats: Planning for Library of Congress Collections [Online]. [Accessed October 2014] at WWW.
10 In this count, formats marked in more than one category, such as ASF (marked as Sound, Generic and Moving Images), were considered only once.
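The format survey summarised above boils down to counting, for each format, how many distinct CP report using it. A minimal sketch, with made-up survey rows standing in for the project's Excel list:

```python
from collections import defaultdict

# Hypothetical survey rows: (CP, category, format) - illustrative only
survey = [
    ("archives", "text", "PDF"), ("libraries", "text", "PDF"),
    ("museums", "still images", "JPEG"), ("archives", "still images", "JPEG"),
    ("television", "moving images", "AVI"), ("archives", "generic", "XML"),
    ("libraries", "generic", "XML"), ("museums", "generic", "XML"),
]

def formats_by_cp_count(rows):
    """For each format, count the number of distinct CP reporting it."""
    users = defaultdict(set)
    for cp, _category, fmt in rows:
        users[fmt].add(cp)          # a set, so a CP is counted once per format
    return {fmt: len(cps) for fmt, cps in users.items()}

counts = formats_by_cp_count(survey)
# In this toy sample XML is reported by three CP, PDF and JPEG by two, AVI by one
```

Collecting distinct CP per format (rather than counting rows) mirrors footnote 10: a format mentioned several times, or in several categories, still counts once per CP.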

3.2 Step 2

The second step aimed to identify the perceptions of the different CP on two crucial aspects:

- the perception of the authenticity requirements that impend over digital objects from the different cultural domains;

- the methodology and criteria, if any, used by the CP in order to identify and classify digital objects as "worthy" of becoming part of the digital heritage.

Both aspects can have a dramatic impact on a future common repository. It is therefore necessary to be aware of the different realities concerning these issues.

The methodology used for this task was an enquiry which the working team was asked to answer. Although the small number of respondents could not give definitive or meaningful answers, the broadly representative basis of different CP might give an accurate perception of sensibilities and trends on these topics. The enquiry served the second purpose of testing, and allowing adjustments to, the questions that will be used in the construction of a second, larger survey that the project intends to launch to nearly 300 Portuguese cultural institutions, both public and private.

The results were processed with descriptive statistical measures: the mean, in order to measure the degree of agreement (central tendency) of the opinions expressed, and the standard deviation, as a dispersion measure, used to confirm the convergence depicted by the centrality measures. The SD depicts the level of "disagreement" with the general trend estimated by the mean.

Figure 7. Relevance evaluation of authenticity features (Identity 4,82; Content 4,82; Identification 4,55; Integrity 4,55; structure 4,45; Usability 4,36; Context 4,09)

The most relevant conclusions are as follows:

1/ usually people belong to more than one CP. In fact it is fairly common for an institution to have custody of objects from several cultural domains, processed according to the CP that usually has expertise on them. E.g. a museum may hold archival collections that will be processed in a way compliant with archival standards while, as far as museum objects are concerned, other specific standards will be adopted;

2/ globally, the results revealed a remarkable convergence on the 7 variables under scrutiny corresponding to authenticity. There were some exceptions, essentially due to some lack of practice and theoretical thought on the subject. This was particularly noticeable in the journalism CP, where there is a poor tradition on heritage-related issues;

3/ the divergences concerning some authenticity requirements of digital objects from different cultural domains were not meaningful, an average of 4,52 on a 1-to-5 scale being reported;

4/ a fair distance was observed between CP regarding the criteria for evaluating objects as deserving integration into the heritage set. Museums, archives and libraries adopt more or less the same number of criteria, while other CP, such as music, adopt fewer. No real conclusion can be inferred from this observation, because everyone, except journalism, actually has criteria for evaluating digital heritage;

5/ there seems to be some lack of formal criteria for evaluating objects as heritage-to-be. With the exception of the archives, which have a very regulated evaluation process, the most common situation falls within the area of collection management, which derives from each institution's policy and strategy. As such it may vary between organizations and even inside the same CP;

6/ several exclusion factors were mentioned, which is interesting to remark, as people consider different factors, usually absent in the traditional environment, that may decisively influence the classification of objects as heritage;

7/ costs regarding classification, access and storage were mentioned as possible hindrances to preserving digital objects as heritage.

4. CONCLUSIONS SO FAR AND FUTURE ACTIONS

The analysis developed along the first 2 steps leads to the conclusion that no big issues separate the different CP represented in the project team. Until now, no major issues could be found that might put insurmountable obstacles in the way of building a common network to share resources and preserve a common digital heritage. Financial and technical aspects must be clarified before any stakeholder can make a decision about its possible participation in such a network.

The next action, as foreseen in the approved chronogram, is the launch of a big survey that will hopefully bring data to validate or contradict the preliminary observations harvested through debate and enquiries inside the project team.

The development of a study regarding the business model of the prospective network, as well as the possible financial and sustainability models, will be the most difficult challenge of this project. We expect to gather the support of experts from different areas, such as economics, sociology and engineering, possibly connected to the academic world, to cooperate with the project team and tackle these particular issues in a knowledgeable way.

According to the project schedule, the next tasks aim to complete the steps of the first phase of the project, in principle until February 2015. Data will be gathered relating to:

- the physical environment: storage technology platforms (information size and growth estimates);

- the logical environment (software);

- the conceptual and logical architecture (network model, including the model of governance and sustainability).
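The descriptive statistics used for the step-2 enquiry - the mean for central tendency, the standard deviation for dispersion - amount to the following; the ratings vector is invented for illustration, and the paper does not specify whether the population or sample SD was used:

```python
import statistics

# Hypothetical 1-to-5 ratings from respondents for one authenticity feature
ratings = [5, 4, 5, 5, 4, 4, 5, 4, 5, 4, 5]

# Central tendency: the degree of agreement of the opinions expressed
mean = statistics.mean(ratings)

# Dispersion: the level of "disagreement" with the general trend
# (population SD here; statistics.stdev would give the sample variant)
sd = statistics.pstdev(ratings)
```

A mean near the top of the 1-to-5 scale combined with a small SD is what the analysis above reads as convergence among the CP.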

(*) Directorate General of Book, Archives and Libraries

The modernization, migration and archiving of a research register

Johanna Räisä, Mikkeli University of Applied Sciences, Patteristonkatu 2, P.O. Box 181, 50101 Mikkeli, [email protected]
Mirja Loponen, Mikkeli University of Applied Sciences, Patteristonkatu 2, P.O. Box 181, 50101 Mikkeli, [email protected]

ABSTRACT
In this paper, we describe the modernization, migration and archiving of a research register. This register contains demographic information about the people of the ceded Karelia region of Finland. The migration to a database started in the 80's, and now the database will be modernized. The objective is to carry out the modernization in a way that makes the data accessible via interfaces and devices. This allows more opportunities in the future.

Categories and Subject Descriptors
H.2.4 [Systems]: Relational databases
H.2.8 [Database Applications]: Scientific databases
H.3.5 [Online Information Services]: Data sharing and Web-based services

General Terms
Management, Performance and Standardization

Keywords
Migration, modernization, historical data, parish registers.

1. INTRODUCTION
The Karelia database foundation has been storing parish registers from the ceded Karelia of Finland since the 1980's. This data contains information for scientific research on Finnish family structure, migration, education and infant mortality. The parish registers have been preserved in the Karelia database for information governance and are maintained by the Karelia database foundation, which operates in connection with the provincial archive of Mikkeli. Now the database will be migrated to the archiving services of the Mikkeli University of Applied Sciences, and the data will be utilized in education and scientific research.

The original database is made with dBase, an old database server. This system has too many disadvantages for preserving the data for the future. The objective is to migrate the data to a cost-effective, open source environment. The migration will be challenging, and it will have to be designed in a way that makes information governance more effective.

The modernization will continue in the future and will open opportunities for more advanced usage. Researchers and projects will get easier access to the database, and users will not have to be in one place to find data in it. The data can be linked with other scientific registers for complementary data. In addition, the data can be updated in real time by the project operators.

2. THE HISTORY OF THE DATA
In Finland, there have been Lutheran parish registers since the 17th century. At that time, Finland was a part of the Swedish Empire, and the church of Sweden had defined a law which required parishes to keep registers. The registers had information about the people living in the parish [1]. The templates and instructions for the registers came from the Swedish church, and the priests used Swedish for writing them. This was a normal custom, since Swedish was the language of literate people. An interesting feature was that the registers were huge books and the priests made the records according to their own knowledge. This is why some of the information might be very unclear.

2.1 The parish registers
The parish registers are divided into three categories: confirmation registers, migration registers and history registers. The confirmation registers include information about people's Christian lives, for example how well a person knew the Bible and how many times the person had taken part in the Eucharist. The confirmation registers were written in two different books [1]. Parish members under 15 years old were in the children's register, and after confirmation they were transferred to the confirmation register [1]. This meant that the individual had become a full member of the church. The Karelia region also had confirmation registers of the members of the Greek Orthodox Church. This was because the Karelian people had lived close to Russia for centuries; both religions, Lutheran and Greek Orthodox, had influenced this area strongly.

The migration registers have information about how people moved from parish to parish. This information was filled in every time a new member came to the parish. There might have been more than one person migrating at the same time; the priests usually wrote the name of the informant and the number of migrants.

The information in the history registers is more statistical. These mainly had information about child births and baptisms, deaths and burials, and also banns and engagements. However, the information in the history registers can change from century to century; the priests did not consider infant mortality significant before the middle of the 19th century, for example.

The parish registers were a good way of having information about people. The Russian Empire used the registers for collecting taxes after incorporating Finland in the year 1809. Finland was an

autonomous part of the Russian Empire for over 100 years. This allowed the Finns to have their own government, and they were able to use their native language.

After a peaceful time as a part of the Russian Empire, events changed in the beginning of the 1900's. Finland saw a chance of becoming independent, mainly because of the revolution in the Russian Empire. Fortunately, the Finns got their independence in 1917.

Finland had been independent for only a few decades when new events started in Europe. These events sealed the destiny of the Karelian parish registers. In the year 1939 the Soviet Union declared the first war against Finland. This war lasted only a few months and peace was made; a second war started after a couple of months. Finland suffered two wars against the Soviet Union, and the second ended in 1944 [2]. Because of these wars, Karelian people had to move farther from the Soviet border [2]. This also affected the parish registers, and the priests decided that they had to be transported further inland for safekeeping. After the second war, the Karelian region was lost and incorporated into the Soviet Union. Most of the registers were transported to Mikkeli for safety, but some of them remained in the Soviet Union [2].

3. OPENING THE DATA
The modernization of the old data has been an ongoing process throughout the years. The Lutheran parish registers were first microfilmed in the year 1986. In general, the microfilming of various historical data started in Finland in the 1920's [3]. The main purpose of microfilming was to get the materials into a very small size, and the Karelian parish registers were filmed onto microfiches which were only readable with a magnifier. Microfilming enabled better access to the registers and minimized wear on the originals. When the technology evolved, these microfiches were scanned and the pictures were added to the digital archive, where they are accessible via the internet.

The idea of storing the Karelian parish registers in a database came from Raimo Viikki, the now retired director of the provincial archive of Mikkeli [4]. The first project started in 1985 and was a research project: it inspected how the database should be built and whether the Karelian registers were suitable for this kind of database migration. In 1988, a few programmers were hired to carry the project to the next level [4]. They made a test program and a manual for it, and the program was tested with one parish register, named Lavansaari. The results of the project were very successful, and due to these good results the Karelia database foundation was founded in 1990 [4].

3.1 Storing the data
First the foundation hired some staff members for storing the data in the database. The data were saved very faithfully from the original source, row by row as it was written. The work was very slow with few workers. Fortunately, the foundation got more workers when it made a contract with the Finnish public employment and business services in the year 1995 [4]. This made the saving work faster, and long-term unemployed people got work for a short period of time. The co-operation is still going on, and new operators are hired every year. The operators work from home or in the office of the foundation, 5 hours a day and 25 hours a week; normally, they can make a suitable schedule for themselves [1]. The co-operation project has helped both sides: the long-term unemployed get socially meaningful work, which prevents social exclusion, and the foundation gets employees for storing the data.

One operator can work in the foundation for one year. The project has had about 500 operators in these 15 years. They store about 500 000 - 600 000 rows of data in a year, and 90% of the work has been done by the operators [1]. Of course, the turnover of the operators and the teaching of new ones reduces the speed a little.

In the saving process the operators read the data from the microfiches of the digital archives of the national archive service of Finland [4]. They use an installed program for saving the data, and it is done parish by parish. All the register types have their own programs, including the Orthodox confirmation registers. The programs make database files from the saved data, and the files are saved in the root folder of the software. The operators then save the database files to pen drives and deliver them to the office of the foundation, or send an email with the files attached. The delivered data is saved into the final database by staff members. In some cases the operator might have some difficulty reading the registers, so a staff member helps to fill in the missing fields. After everything is clear, the data is migrated into the complete database.

At the moment over 9 million rows are saved, and it is estimated that in the end the database will have more than 11 million rows. It was also estimated that the work would have been finished by today, but the early 18th century confirmation registers are still left. The speed has been much reduced, since these confirmation registers are written in unclear handwriting, or the language itself is not understandable, as can be seen in the image below. The unsaved registers also include the Orthodox registers. The foundation has had difficulty finding suitable people for saving the Orthodox registers, since they are written in old Russian and are saved with the Latin alphabet, not with Cyrillic. It is quite possible that the database might stay incomplete because of these difficulties.

Image 1. A register page.

4. THE DATABASE
The original database is a dBase server, mostly kept on the hard drives of computers and on CD-ROMs. dBase was a cheap and efficient choice in the 80's, and it suited this kind of research register. It is basically a spreadsheet type of database, and there has not been any user interface. The backup of the database has been handled with floppy disks, and they are still in use. This has not been very agile for future development: the server causes a lot of work, and managing it is very time consuming.

First there was one database, but this changed during the years: the database had to be divided into two different versions. This solution was made when the database became accessible via the internet. Finnish law defines that personal data becomes public when over 100 years have passed from the birth; the data of deceased people, however, becomes public 50 years after the death [4]. The foundation decided to solve the privacy issue by dividing the database in two: the public data was accessible via the internet, and the private data via the provincial archive of Mikkeli.

4.1 Use of the database
The first access to the database was on the computers of the provincial archive of Mikkeli. This access was mostly for researchers and for genealogy. In the year 2008 the foundation released a web service called Katiha. Its data was migrated to a MySQL server from the public data of the original database. It also has statistical information from the registers, such as child births. Still, the complete database could only be seen in the provincial archive of Mikkeli, and this choice was more suitable for that time. To get access to the complete database, researchers can apply for it with a signed application. A staff member of the foundation accepts these applications, and when they are approved the researchers can go to the provincial archive of Mikkeli to browse the complete database. Otherwise, Katiha is a very convenient tool for browsing the public data and finding statistical information.
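The disclosure rule described in section 4 - personal data becomes public 100 years after birth, or 50 years after death - can be sketched as a simple predicate. Field names and the reference year are assumptions for illustration:

```python
from typing import Optional

def is_public(birth_year: int, death_year: Optional[int], current_year: int) -> bool:
    """Apply the disclosure rule described in the paper: records become
    public 100 years after birth, or 50 years after death."""
    if current_year - birth_year > 100:
        return True
    if death_year is not None and current_year - death_year > 50:
        return True
    return False

# Splitting rows into the public and private halves of the database
rows = [
    {"name": "A", "birth": 1880, "death": 1950},  # public: born > 100 years ago
    {"name": "B", "birth": 1930, "death": 1955},  # public: died > 50 years ago
    {"name": "C", "birth": 1930, "death": None},  # private on both counts
]
public = [r for r in rows if is_public(r["birth"], r["death"], 2014)]
private = [r for r in rows if not is_public(r["birth"], r["death"], 2014)]
```

A predicate like this would let the two databases described above be produced from one master copy rather than maintained separately.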

5. MODERNIZING PROCESS
The original database will be migrated to a new server managed by the Mikkeli University of Applied Sciences. In addition, it will be given a web service. The purpose is to keep the database accessible in the future. The foundation will be disbanded once the data has been stored, which may happen in a year or two, so some actions have to be taken before then. It is natural to migrate the data to the Mikkeli University of Applied Sciences: the original parish registers are still in Mikkeli, the university has had other co-operation projects with the provincial archive of Mikkeli, and it has good knowledge of migrating and archiving data, which it also uses in teaching.

5.1 Migration to a new server
To make the database cost-effective, it will be migrated completely to MariaDB, an open source SQL server developed by the MariaDB Foundation from MySQL. It is essentially a further developed version of Oracle's MySQL server [5]. The original database has over 2000 tables: each of the 70 parishes has its own set of tables. In Figure 1 we can see the structure and how the tables are linked together. This was done to make searches faster. In the new database, these tables will be consolidated into around 30 tables, and the column names will be translated into English.

Figure 1. The table structure of one parish.

In the modernization process the database will also gain other tables, such as temporary tables and metadata tables. These tables hold data that helps in accessing the content. The Parishes table, for example, holds the parish index and the parish name, and is connected to the register tables through the parish index. These metadata tables are also used for listing information in the user interface, so that not all the data has to be queried at once.

5.2 Using new and faster techniques
We are using an open source search platform called Apache Solr, which speeds up searching over a large amount of data. The platform is written in Java and runs as a standalone full-text search server [6]. It has the Lucene Java library at its core, and the data can be output as JSON, a lightweight data-interchange format that humans can read easily and machines can parse and generate. This allows Solr to be used with many programming languages. Solr will power the search and can be configured easily for any kind of data [6]. In addition, the data can be indexed into Solr straight from the SQL server using simple SQL scripts. With Solr indexes the data can be accessed quickly, after which the database can be queried for the full records.
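The consolidation of per-parish tables can be sketched as follows. The paper does not publish the real schema, so the table and column names used here (a `baptisms` register with `person_id`, `name` and `birth_year` columns) are purely illustrative; the point is only that each per-parish table folds into one shared table keyed by a parish index, mirroring the Parishes metadata table.

```python
# Illustrative sketch only: the Karelia database's actual table and column
# names are not published, so these identifiers are invented for the example.

def consolidation_sql(register, parishes):
    """Generate one INSERT ... SELECT per parish, folding the per-parish
    register tables into a single table keyed by a parish index."""
    statements = []
    for index, parish in sorted(parishes.items()):
        statements.append(
            f"INSERT INTO {register} (parish_index, person_id, name, birth_year)\n"
            f"  SELECT {index}, person_id, name, birth_year FROM {parish}_{register};"
        )
    return "\n".join(statements)

# Two hypothetical parishes, as they might appear in the Parishes metadata table.
print(consolidation_sql("baptisms", {1: "antrea", 2: "viipuri"}))
```

Run against MariaDB, scripts along these lines would reduce the roughly 2000 per-parish tables to a few dozen shared register tables while keeping the parish linkage that the metadata tables provide.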

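The account model the paper describes for the web service (administrators create accounts, operators manage data, approved researchers may see private data only on site at the archive, and everyone may search the public data) can be summarized in a small sketch. Only those rules come from the paper; the function and role names are illustrative assumptions, not the actual Java API.

```python
# Sketch of the service's access rules as described in the paper; the
# function name and role strings are invented for illustration.

def can_access(role, action, at_archive=False):
    """Decide whether a role may perform an action on the register data."""
    if action == "search_public":
        return True                       # everyone may browse public data
    if action == "manage_data":
        return role in ("operator", "administrator")
    if action == "create_users":
        return role == "administrator"
    if action == "read_private":
        # approved researchers only, and only inside the provincial archive
        return role == "researcher" and at_archive
    return False

print(can_access("researcher", "read_private", at_archive=True))
```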
In the Solr server there will be only indexed data, such as names, parishes and years of birth.

The new, modernized database will be used through a web service, and the service will connect to the MariaDB server over a JDBC connection. JDBC is a Java-based API which defines how a client can access a database. This API is used to get the full information from the database, and the JDBC connection is also used for saving data to the database. This write access is enabled only for the operators and administrators. The service will have user accounts, and the access rights are defined through them. Although there are accounts, everyone has access to the search, and it can be used for browsing the public data.

Now everything is handled in the same place. Data can be stored through the service, and researchers can find data from it. The administrators only create users for the service. An operator gets an account for managing the data. A researcher gets an account after the approval of an application; with the account he or she can have access to the private data. However, the researcher still has to go to the provincial archive of Mikkeli to access the private data. This is done to make sure the data does not get into the wrong hands.

This new database and web service are designed so that information governance will be easier. The data is migrated to a server which uses MariaDB, an open source SQL environment, and the web service is written in Java. These will give the register a chance to survive for future generations in a format that can be read and modified more easily. Once the database is modernized, there are possibilities to use it in other projects and to develop an interface for versatile access.

6. PROSPECTS
The work of the foundation will finish in a few years. This means that the data structure can be modified into a more lightweight form. In addition, there can be experiments with making the data anonymous, which would help to open the data to everyone. Because some of the registers remained on the other side of the border, this would also be a good basis for an international project which could make the database fully complete.

The parish registers are only one part of the information from ceded Karelia. There is more information in personal records and associations. Some of this information has been digitised, and there might be a chance to get access to it. Linking all this data would be important, but it would need new projects and the help of voluntary work. Some plans have already been made:

1. Integrating geographic information from different eras into the data.
2. Connecting the statistical information with the geographic information.
3. Integrating different kinds of data into the database, such as photographs, drawings and stories.
4. Connecting the data with information in social media.

A large part of the Finnish population moved to Australia, Canada and the United States during the wars against the Soviet Union. There could be a chance to integrate the data globally and to get linguists and scientists other than Finns interested in the data.

7. CONCLUSIONS
During the modernization process it was interesting to see the course of the register's life. At first, the research register was kept in huge books. Then these books were digitised onto microfiches. Finally, the research register was stored in a database, and now there are further development opportunities.

Today there are a great number of devices, but the most important thing is still the information. The 18th-century priests wrote the data down without knowing what would happen in the future, yet they linked the data to dates of birth and places of residence; they created metadata by accident. This raises a question worth considering: do we need metadata other than this for linking the data to other sources? There are the timestamps and places which can be used, or are there?

8. REFERENCES
[1] The Karelia Foundation. 2014. http://karjalatk.fi.
[2] Ripatti, Jarkko. Karjalan luterilaiset seurakunnat evakossa. Paino Livonia Print. 2014.
[3] Lybeck, Jari. Asiakirjahallinnon oppikirja. The National Archive of Finland. Helsinki. 2006.
[4] Vartiainen, Minna. 2013. Karjala-tietokantasäätiön Karjala-tietokanta: Tietokannan haasteet. In archiving seminar 2013 (Finland, May, 2013).
[5] MariaDB Foundation. 2014. https://mariadb.org/.
[6] The Apache Software Foundation. Apache Lucene. 2011-2012. http://lucene.apache.org/solr/.

Activities to facilitate the authentic interpretation of archived databases

Aleksandra Mrdavšič, Archives of the Republic of Slovenia, Zvezdarska 1, 1127 Ljubljana, +386 (0)1 24 14 256, [email protected]
Jože Škofljanec, Archives of the Republic of Slovenia, Zvezdarska 1, 1127 Ljubljana, +386 (0)1 24 14 248, [email protected]

ABSTRACT
Experience in archiving has shown that context information needs to be provided also for records in digital form. The paper presents the activities and tools used by public archival services during appraisal and when advising records creators on how to create SIPs so that the information needed for the authentic interpretation of records is included.

1. INTRODUCTION
A considerable part of the digital records that are of permanent and thus archival value are kept as databases. According to the Reference Model for an Open Archival Information System (OAIS), the key purpose is to preserve not only the content of the records but their usability as well. In this paper the authors present a method that has been tested in practice and by means of which the Archives of the Republic of Slovenia attempts to ensure the integrity and authenticity of ingested database records as well as to include the information needed to ensure their authentic interpretation.

Experience acquired during the ingest of records to the Archives of the RS, and the best practices of some European¹ public archives, have revealed that ingesting only the records from databases is not enough: we also need to ingest additional documentation. The purpose of such documentation is to ensure the authentic interpretation of the data. An example from Slovenia demonstrated that just one missing table with certain codes, which in the original environment was not included in the database, can have a profound effect on our understanding of all records within the submitted database.

To monitor this method of facilitating authentic interpretation, a special tool was designed in the form of a check list. The check list is used by authorized archival employees when discussing the transfer of records with records creators on behalf of the archive. It can also be used as a template for the preparation of a submission agreement.

2. PRE-INGEST ACTIVITIES
Records creators and the archives to which their records will be transferred both face the challenge of how to ensure the records' accessibility, usability, integrity, authenticity and durability. The former are in charge of this task for the time when the digital records are in their custody, and the latter take over once such records are transferred to them. Based on the experience acquired by public archive services when dealing with such challenges, a number of procedures have been developed to provide the integrity, authenticity and accessibility of records. However, methods for providing the usability of records have not been developed to the same degree, mostly because such records are not used very frequently. In compliance with the Slovenian archival act, records need to be entered in the public register of records and made available to the public two weeks after their transfer to the archives. Keeping this legal provision in mind, we completed our check list by adding some requirements to ensure the records' usability.

The requirements observed in the preparation of our check list were as follows:
- Providing control over the preparation of the SIP and its transfer to the archives;
- Providing sufficient documentation of the process to preserve authenticity;
- Providing comparability of the implemented procedures of creating an individual SIP and its transfer to the archives;
- Providing repeatability of the individual stages of the procedure;
- Preserving the needed level of the records' usability, and ensuring it in the shortest possible time.

As far as practical application is concerned, in most cases it turns out that records creators do not have all the additional documentation that the archive would like to ingest, so compromise is often necessary. An example is the documentation on the Graphical User Interface (GUI) which the records creator used for entering and changing data in the database as well as for printing out the data needed. Typically, records born digitally as databases (such as public registers) have a separate GUI designed for managing each such register. Usually, the instructions for the GUI are precise enough in describing the activities performed during the capture and export of data into and from the database. Among other things, they include screenshots which provide clear insight into the work with the interface.
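The "missing table with certain codes" example is easy to make concrete. In the sketch below, the codelist and its values are invented (they are not from the Slovenian case); records store only a numeric code, so without the accompanying codelist, including the periods in which each value was valid, the records cannot be interpreted authentically.

```python
from datetime import date

# Invented codelist: each value carries the period in which it was valid,
# as the check list requires ("this time period should also be indicated").
MARITAL_STATUS = {
    1: ("single", date(1950, 1, 1), None),
    2: ("married", date(1950, 1, 1), None),
    3: ("cohabiting", date(1990, 1, 1), None),  # value added to the list later
}

def decode(code, as_of):
    """Resolve a stored code to its label as of a given date."""
    label, valid_from, valid_to = MARITAL_STATUS[code]
    if as_of < valid_from or (valid_to is not None and as_of > valid_to):
        raise ValueError(f"code {code} was not valid on {as_of}")
    return label

# A record created in 1980 cannot legitimately carry code 3; if it does,
# either the codelist or the record needs explanation in the SIP documentation.
print(decode(2, date(1980, 6, 1)))
```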

¹ In recent years a closer look has been taken at e-archiving at the national archives of Austria, Denmark, Estonia, Germany, Hungary, the Netherlands, Norway, Sweden, Switzerland and the United Kingdom.

Assisted by the user manual for the GUI, a user will be able to understand how the database was managed by the records creator. When user manuals were not preserved, or were never prepared, they can be replaced by screenshots of some of the key stages of capture, accompanied by short explanations. These can be prepared by the records creator if he has a working version of the application used for capturing data into the database. When a working GUI is not available, it is recommended that a user write a short description of how the GUI was used, mainly for the capture and changing of data, and also of what the most frequent forms of queries and reports on screen and/or paper were. When the employees who worked with the GUI had different competencies or roles, it is sensible to include explanations of those competencies as well. If it is not possible to obtain any information about the working of the GUI, the sensible thing to do is to make a note that no such information was available to the archive. The same applies to the description of external sources in the automatic capture of certain data.

The check list prepared by the Archives of the Republic of Slovenia includes the following check points for facilitating authentic interpretation of the records' content:
- User documentation – user manuals and instructions for the end users of applications for entering, changing and outputting data;
- Use, or its description and screenshots of important functionalities (when no suitable user documentation is available) – it should include a description of the actual management and use of data in the database;
- Codelists – it should include all codelists and the values for each of them. When an individual value was in a codelist only for a certain period of time, this time period should also be indicated;
- A list of data sources – it should include information on the sources and the scope of data obtained from each source. Also included should be information on when each source was used and how often data was captured from it; such information is particularly important for the interconnection of databases, i.e. public registers;
- Forms, reports and queries (when no other suitable user documentation is available) – these are important for understanding how a user accessed data in the original environment. Also included here should be a description of the queries in the original application, either as screenshots or as examples of reports;
- Typical queries – apart from the actual description of a query (as mentioned above), it is also recommended that the records creator add a technical description of typical queries. These are the queries that administrators, personnel entering the data, or users regularly used in the course of their work and for which a standard output was formed (in the application) to display the results. Whenever possible, segments of code with the SQL commands from the original application should be enclosed;
- Security scheme in the original environment – it describes restricted access to data in the original environment. Also stated are the groups of users (for example according to their roles) with their access rights or permissions to change data.

In the check list, each of the check points consists of three elements: instruction, description and references. The latter are particularly important, since a reference to the proper part of the documentation can replace an entire description. When the data sources of official registers or their metadata structure are legally regulated, the records creator only has to refer to a single provision in a legal act. Records creators can also omit from their SIP any publicly available documentation, such as public regulations.

3. PROACTIVE ACTIVITIES
Drawing from our experience with the ingest of digital records, the Archives of the Republic of Slovenia has also had an effect on field-related regulations regarding the capture and storage of digital records by records creators. In keeping with our proactive activities, several requirements for database records and public registers have been included in our regulation titled Uniform Technological Requirements, which needs to be observed by all Slovenian public sector institutions in the course of their digital records management. These requirements comply with the check points in our check list. The list helps archivists to guide records creators when preparing their records for transfer and, even more importantly, it guides records creators themselves to be more attentive to providing a certain level of documentation already during production.

The Uniform Technological Requirements demand that for each database, especially those that are the basis for keeping public registers, records creators design a so-called public register dossier. Included in such dossiers is the key documentation on managing the individual register, i.e. managing the data in the individual database. The files preserve:
- The legal regulations (public and internal) that had an effect on the management of the register;
- A list of roles and the competencies for individual roles;
- A list of software and the periods of its use;
- A description of the data structure, which also includes the data model, the data sources (i.e. possible other public registers, application forms …) and codelists;
- A description of the technical terms used (semantics), which enables long-term use of the data from databases or public registers outside their original environment;
- A specification or description of changes to the data structure, including the data model and the software used;
- A description of any deletion or cancellation of a chosen data transfer, for example when the technical environment or data model is changed.

4. CONCLUSIONS
The key task of any archival institution is to make the records it keeps available to users, and to do so in a way that enables users to interpret the data authentically. For this purpose, it is necessary to have tools that encourage the standardization of procedures. Such tools need to provide enough room for flexibility in the implementation of procedures, which is necessary due to the diversity of the business operations of individual creators. At the same time, the tools must enable discussion of all important topics and the acquisition of available information, to ensure enough data about the context in which records were created and to facilitate the authentic interpretation of such records.

When advising and directing the business activities of our records creators, and where this is allowed by state legislation, proactive action is sensible, since it ensures the preservation of documentation that will accompany records when they are transferred to the archives and enable their authentic interpretation. The check list is independent of the technology in use at the time of transfer.
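As a minimal illustration of the check list's structure, each check point pairs an instruction with either a description or a reference, since the paper notes that a reference to regulated or publicly available documentation can replace a description entirely. The entry below is invented, not an actual item from the Slovenian check list.

```python
# Invented example of one check point with its three elements
# (instruction, description, references), as the check list prescribes.
checkpoint = {
    "instruction": "Enclose all codelists, with validity periods for each value.",
    "description": "",  # may stay empty when a reference suffices
    "references": ["Provision in a legal act regulating the register's codelists"],
}

def is_documented(cp):
    """A check point is satisfied by a description or by at least one reference."""
    return bool(cp["description"]) or bool(cp["references"])

print(is_documented(checkpoint))
```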

Information Culture: An Essential Concept for Next Generation Records Management

Gillian Oliver, School of Information Management, Victoria University of Wellington, New Zealand, [email protected]
Fiorella Foscarini, Department of Media Studies, University of Amsterdam, The Netherlands, [email protected]

ABSTRACT
This workshop will contribute to the Records Management in Transition stream by introducing the topic of information culture (that is, the values accorded to information, and attitudes towards it, specifically within organizational contexts). The objectives of the workshop are twofold. Firstly, it will demonstrate the utility of the concept of information culture as the basis for the development and promotion of next generation records management practice. Secondly, it will stimulate reflection on the content and scope of education for records management, and explore its appropriateness in today's highly flexible and dynamic working environment.

Traditional records management education largely focuses on the systems, processes and techniques required to achieve recordkeeping outcomes. Generally this approach largely ignores, or at best superficially acknowledges, the fundamental cultural issues encountered when attempting to implement these systems, processes and techniques in workplaces. The information culture perspective takes people, the employees of the organization, into account, and facilitates the understanding and diagnosis of the cultural dimensions of organizations as socially constructed entities.

The workshop will explore the components of a diagnostic model, the Information Culture Framework (ICF). The ICF is underpinned by three key ideas: recordkeeping informatics, soft systems methodology and rhetorical genre. The model distinguishes three levels of factors, which will be used as the basis to structure the workshop.

The first level, fundamental influences, includes consideration of the value accorded to records, information preferences and national or regional technological infrastructure, as well as language. Although the factors considered at this level are hard to change, identifying key features is essential in order to guide records management programme development. The second ICF level addresses the knowledge, skills and expertise of staff members relating to recordkeeping requirements. This encompasses understanding of information-related competencies, for example information and digital literacy, as well as traditional areas of focus such as awareness of specific records-related obligations. The final ICF level considers the characteristics that are unique to a particular organization and are probably the most amenable to change, namely corporate information technology governance and trust in existing organizational systems. This final layer highlights the need for practitioners to work collaboratively with cognate information professionals, and for reflective practice.

The workshop presenters will introduce the concepts and explain the factors at each level of the ICF. Throughout the workshop, they will invite participants to reflect on their own experiences with records management education and training, as well as to consider how effective current education and training programmes are in enabling students to engage with the factors identified. Discussions generated through the workshop will enable participants to investigate where development or change is necessary to equip practitioners with the knowledge and skills that will allow them to apply the concept of information culture in their workplaces.

Page 31 The role of Information Governance in an Enterprise Architecture Framework Richard Jeffrey-Cook, MBCS, CITP, FIRMS Head of Information and Records Management In-Form Consult Ltd, Cardinal Point Park Road, RICKMANSWORTH, WD3 1RE, UNITED KINGDOM +44 1483 894052 richard.jeffrey-cook@inform- consult.com

ABSTRACT
The Open Group Architecture Framework (TOGAF) is a leading method and set of supporting tools for developing an enterprise architecture. An element of enterprise architecture is information architecture, and a component of information architecture is information governance.

This paper examines where TOGAF makes reference to information governance and identifies the methods and supporting tools in TOGAF which relate to information governance. These include the definition and characteristics of governance as used by TOGAF, the Architectural Governance Framework employed by TOGAF, the idea of a Governance Repository, suggestions on an Organisation Structure, the use of Capability Maturity Models, and the concept of an Integrated Information Infrastructure Reference Model.

Examining the methods and supporting tools suggests that there is a disconnect between The Open Group authors and the professionals in the information and records management community on applying best practices and techniques for creating an information architecture and applying information governance. There is an argument that the community should create an information architecture framework compatible with TOGAF and promote information and records management best practice and techniques to the wider information technology community. A body such as the DLM Forum would be in a good position to achieve this.

Categories and Subject Descriptors
D.3.3 [Enterprise Architectures]: Enterprise Architecture Frameworks – Information architectures, The Open Group.

General Terms
Management, Documentation, Design.

Keywords
Information Governance, TOGAF.

1. INTRODUCTION
Effective information governance requires information architecture to be addressed. While enterprise architecture has been defined and methods for developing an enterprise architecture have been created, there is no agreement on how to define information architecture and how it should be addressed.

This paper examines TOGAF, a widely adopted framework for creating an enterprise architecture, and looks at where information governance fits in such a framework. From examining these components it may be possible to demonstrate how an approach to developing an information architecture might be created by incorporating information and records management techniques into TOGAF.

2. TOGAF
The Open Group Architecture Framework (TOGAF) is a framework - a detailed method and a set of supporting tools - for developing an enterprise architecture. It may be used freely (subject to conditions of use) by any organization wishing to develop an enterprise architecture for use within that organization.

TOGAF was developed by members of The Open Group, working within the Architecture Forum (www.opengroup.org/architecture). The original development of TOGAF Version 1 in 1995 was based on the Technical Architecture Framework for Information Management (TAFIM), developed by the US Department of Defense (DoD).

Starting from this sound foundation, the members of The Open Group Architecture Forum have developed successive versions of TOGAF and published each one on The Open Group public web site.

3. ARCHITECTURE FRAMEWORK
An architecture framework is a tool which can be used for developing a broad range of different architectures. It describes a method for designing an information system in terms of a set of building blocks, and for showing how the building blocks fit together. It contains a set of tools, provides a common vocabulary and includes a list of recommended standards and compliant products that can be used to implement the building blocks.

The primary reason for developing an enterprise architecture is to support the business by providing the fundamental technology and process structure for an IT strategy. This in turn makes IT a responsive asset for a successful modern business strategy.

An information architecture framework should describe a set of tools, a common vocabulary, recommended standards and techniques

which together form a set of building blocks for managing information.

4. INFORMATION GOVERNANCE
There is no universal definition of information governance, as these two examples demonstrate.

The Information Governance Initiative defines information governance as the activities and technologies that organizations employ to maximize the value of their information while minimizing associated risks and costs [1].

Gartner defines information governance as the specification of decision rights and an accountability framework to encourage desirable behavior in the valuation, creation, storage, use, archival and deletion of information. It includes the processes, roles, standards and metrics that ensure the effective and efficient use of information in enabling an organization to achieve its goals [2].

5. GOVERNANCE IN TOGAF
TOGAF focuses on "Architecture Governance". TOGAF identifies that there is a hierarchy of governance structures which, particularly in the larger enterprise, can include all of the following as distinct domains with their own disciplines and processes: corporate governance; technology governance; IT governance; and architecture governance. Within an organisation, governance may operate at different levels of scope, such as organizational and geographic.

TOGAF incorporates information governance into IT governance. Governance is essentially about ensuring that business is conducted properly. It is less about overt control and strict adherence to rules, and more about guidance and the effective and equitable usage of resources to ensure the sustainability of an organization's strategic objectives.

5.1 Characteristics of Governance
TOGAF defines the characteristics of governance (adapted from Naidoo 2002) as:

Discipline. All involved parties will have a commitment to adhere to procedures, processes, and authority structures established by the organization.

Transparency. All actions implemented and their decision support will be available for inspection by authorized organization and provider parties.

Independence. All processes, decision-making, and mechanisms used will be established so as to minimize or avoid potential conflicts of interest.

Accountability. Identifiable groups within the organization - e.g., governance boards who take actions or make decisions - are authorized and accountable for their actions.

Responsibility. Each contracted party is required to act responsibly to the organization and its stakeholders.

Fairness. All decisions taken, processes used, and their implementation will not be allowed to create unfair advantage to any one particular party.

Figure 1: TOGAF - Architecture Governance Framework - Conceptual Structure

5.2 TOGAF Recommendations
TOGAF recommends COBIT for IT Governance. COBIT also addresses information security and thus introduces the idea of controls on information.

TOGAF identifies that IT Governance and Architecture Governance should be a board-level responsibility. By implication, Information Governance should also be a board-level responsibility. As further justification for this approach, ISO 27001 also identifies that information security should be a board-level responsibility.

Phase G of the TOGAF Architecture Development Method (ADM) refers to implementation governance. This is generally considered an aspect of IT Governance and concerns itself with the realization of the architecture through change projects. Implementation governance extends to the realization of information management through change projects - as identified in Stage F of DIRKS [3], for example.

DIRKS itself is one of the few formal methods in information and records management. It appears to be falling out of favor with the information and records management community but, in common with other methods, this may be because it has been poorly applied rather than because of any inherent faults in the method itself. Developed in the 1990s, it is probably due for a review to see whether or how it can be incorporated into a modern information architecture framework.

6. ARCHITECTURAL GOVERNANCE FRAMEWORK
TOGAF introduces the idea of an Architectural Governance Framework. Conceptually, architecture governance is an approach, a series of processes, a cultural orientation, and a set of owned responsibilities that ensure the integrity and effectiveness of the organization's architectures.

The framework splits process, content, and context. This allows the introduction of new governance material (legal, regulatory, standards-based, or legislative) without unduly impacting the processes, which ensures that the framework is flexible. The processes are typically independent of the content and implement a proven best practice approach to active governance.

Information governance should adopt the same approach. All the elements of the Architectural Governance Framework are applicable to an Information Governance Framework. An important aspect is the presence of a repository to store all the governance-related information.

7. GOVERNANCE REPOSITORY
An information governance repository should hold the information needed to apply governance effectively. This may include items such as governance requirements; stakeholders; sources of requirements; classification; and retention and disposal schedules.

7.1 Governance Requirements
Documenting governance requirements is essential for demonstrating that the controls applied to information are necessary and effective. Requirements may be related to the storage of information (whether physical or electronic), the security of and access to information, and the retention and disposal of information.

7.2 Stakeholders
Governance requirements may be related to one or more stakeholders who have an interest in the information. Stakeholders might be specific individuals or organisations. For convenience they can be grouped together where they share a common interest in a class of information.

7.3 Sources of Requirements
Identifying the source of a requirement is important to verify that the requirement has been documented accurately. An example of a source might be a piece of legislation that requires an organization to retain a document for a specified period of time. A source might also be a standard or a procedure document. It could equally be the whim of the Chief Executive or a Director.

7.4 Classification
Classification is the framework on which access controls and disposal schedules can be applied. Classification itself serves five useful purposes:

1. It provides context for information. A single document is generally less useful than a set of related documents.
2. It can organize storage. For physical storage systems it is essential to have a classification structure to enable items to be stored in a consistent fashion and to be retrieved reliably. In electronic storage systems classification is not required to retrieve information, but it may still offer significant benefits in retrieving information effectively and efficiently by improving performance.
3. It supports browsing for information. Browsing (as opposed to searching) is necessary for effective information retrieval.
4. It allows access controls to be applied.
5. It allows disposal schedules to be applied.

There are many different approaches to classifying information, all of which have their strengths and weaknesses. Records managers typically prefer functional classification schemes because these often change less over long periods of time. Engineering and oil and gas companies may use an asset-based classification scheme to organize their information. Subject-based classification systems are useful for searching. One of the most common, but least effective, systems of classification is organization-based. Many shared file servers use folder structures based upon organizational structures. They are simple to implement but tend to degrade rapidly, as many organisations change their structures regularly.

7.5 Disposal Schedules
A governance repository should hold the disposal schedules that the organization applies to information. Many organizations have a schedule but fail to implement it effectively. This is due to the schedule being related to the type of information instead of being related to how that information is organized and stored.

8. ORGANISATION STRUCTURE
The Architecture Governance Framework also identifies an Organisation Structure. At the top of this structure is the Chief Information Officer (CIO) or Chief Technology Officer (CTO). The IGI advocates the creation of a Chief Information Governance Officer (CIGO) [4]. This stems from the finding that most CIOs are in fact only responsible for technology infrastructure, and not for the information itself. As the report recognizes, a title in itself is meaningless. Projects are underway to define the role of a CIGO and, if adopted, the CIGO must be placed within the Organisation Structure of the Architecture Governance Framework.

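The point made in sections 7 and 7.5 above, that disposal schedules work better when attached to the classification scheme rather than to "types" of information, can be sketched in code. This is a minimal, illustrative sketch: the names (`ClassNode`, `effective_retention`, the "Finance" scheme) are invented here and come from no cited standard; a retention rule set on a node of the scheme is inherited by everything filed beneath it.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import List, Optional

@dataclass
class ClassNode:
    """One node in a (here, functional) classification scheme."""
    name: str
    retention: Optional[timedelta] = None   # None means: inherit from parent
    parent: Optional["ClassNode"] = None
    children: List["ClassNode"] = field(default_factory=list)

    def child(self, name: str, retention: Optional[timedelta] = None) -> "ClassNode":
        node = ClassNode(name, retention, parent=self)
        self.children.append(node)
        return node

    def effective_retention(self) -> timedelta:
        # Walk up the scheme: a schedule attached to the classification,
        # not to a "type" of information, covers everything filed below it.
        node: Optional[ClassNode] = self
        while node is not None:
            if node.retention is not None:
                return node.retention
            node = node.parent
        raise LookupError(f"no disposal schedule covers '{self.name}'")

def disposal_date(node: ClassNode, closed: date) -> date:
    """Earliest disposal date for a file closed on the given date."""
    return closed + node.effective_retention()

# Retention is set once on the "Finance" function and inherited by the
# activities and transactions organized beneath it.
finance = ClassNode("Finance", retention=timedelta(days=7 * 365))
invoices = finance.child("Accounts Payable").child("Invoices")
due = disposal_date(invoices, date(2014, 1, 1))
```

Because the rule hangs off the scheme, reorganizing the schedule never requires re-labelling individual items, which is one way to avoid the implementation failure described above.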
must be placed within the Organisation Structure of the Architecture Governance Framework.

Beneath the CIO, TOGAF defines three distinct areas of stewardship: Develop, Implement and Deploy. It is in the "Develop" area that an Information Architect would sit. TOGAF distinguishes between Enterprise Architects and Domain Architects; the role of an Information Architect needs to cover both enterprise architecture and domain architecture. TOGAF recognizes that a large organization may need a number of specialist architecture roles, and it has recognized Foundation Architects, System Architects, Industry Architects and Organization Architects. Unfortunately, the role of an Information Architect has not been recognized. This may be due in part to the lack of a universally agreed definition for information architecture. Instead, TOGAF refers to a Data Architecture, describing the structure of an organization's logical and physical data assets and data management resources. This could be extended to become an Information Architecture, describing the structure of an organization's logical and physical information assets and information management resources. There may be significant advantages in not distinguishing between structured data and unstructured information when creating an architecture.

In the "Implement" area sits the Programme Management Office (PMO). Getting the PMO to recognize the significance of information architecture can make an important contribution to the success of an information governance framework.

In the "Deploy" area is placed Service Management. In many organisations the role of a local information manager (often referred to with terms such as "Information Champion") is recognized as necessary in order to apply controls successfully at a local level. This is usually combined with a central information management or records management function that takes responsibility for corporate-level matters. This type of structure (or an equivalent) must be reflected in a Governance framework. TOGAF recognizes that these three areas must be aligned if they are to be successful.

9. CAPABILITY MATURITY MODELS
TOGAF incorporates the concept of Architecture Maturity Models. Information Governance already has the Information Governance Maturity Model [5], developed by ARMA as part of the Generally Accepted Recordkeeping Principles (GARP). The Information Governance Maturity Model (Maturity Model) – which is based on the Principles, as well as the established body of standards, best practices, and legal/regulatory requirements that surround information governance – begins to paint a more complete picture of what effective information governance is. It defines the characteristics of information governance programs at differing levels of maturity, completeness, and effectiveness. For each of the eight principles, the Maturity Model describes characteristics that are typical for its five levels of maturity, from level 1 (sub-standard) to level 5 (transformational). Referencing the Maturity Model alone gives a high-level evaluation; a more in-depth analysis will likely be necessary to develop the most effective improvement strategy. An Information Architecture Framework should provide the methods and tools to develop this strategy.

The benefits of capability maturity models are well documented for software and systems engineering. Their application to enterprise architecture is a more recent development, stimulated by the increasing interest in enterprise architecture in recent years combined with the lack of maturity in this discipline. The same can also be said for information architecture.

10. INTEGRATED INFORMATION INFRASTRUCTURE REFERENCE MODEL AND BOUNDARYLESS INFORMATION FLOW
TOGAF defines boundaryless information flow as the requirement to get information to the right people at the right time in a secure, reliable manner, in order to support the operations that are core to the extended enterprise. Information and records management might describe this requirement as one of "accessibility". It is important to recognize that this does not imply that there should be no boundaries, but rather that they are "permeable". The problem that TOGAF is attempting to address with boundaryless information flow is in part caused by the organizational, instead of functional, classification of information.

Integrated information infrastructure has been introduced in TOGAF to provide integrated information, so that different and potentially conflicting pieces of information are not distributed throughout different systems, and integrated access to that information, so that staff can access all the information they need, and have a right to, through one convenient interface. The Integrated Information Infrastructure Reference Model (III-RM) takes an architectural approach to addressing this problem. It identifies five components: business applications; infrastructure applications; an application platform; the interfaces; and a "quality backplane". Notable by its absence is any reference to a classification scheme or any approach that looks at the information itself. This would appear to be a major weakness in TOGAF and demonstrates the need for TOGAF to incorporate information governance best practice.

11. HOW TO PROCEED?
In trying to create a framework for information governance and architecture, two alternative approaches are possible. The first would be to modify TOGAF to place greater emphasis on information governance and architecture and to adopt the techniques that information and records management professionals have developed. An alternative approach would be to set up an Information Architecture Framework as a stand-alone approach, separate from, but needing to interface with, an Enterprise Architecture Framework.

The first approach would need to be adopted by The Open Group members themselves. Over time it is possible, indeed perhaps likely, that TOGAF will recognize the increased importance of information architecture and adopt more techniques developed by information and records management professionals. The second approach would need an organization such as the DLM Forum to support a project to create an Information Architecture Framework.

For organisations that already use TOGAF, or that wish to develop an enterprise architecture, the first approach would be more useful. For organisations without an enterprise architecture and with no intention of developing one, an information manager could find an Information Architecture Framework extremely useful. The two approaches are not incompatible: an Information Architecture Framework could be created first and then, if it proves
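The eight-principle, five-level structure of the Maturity Model discussed in section 9 lends itself to a small scoring helper. The sketch below is illustrative only: the `assess` function and its report format are invented here and are not part of ARMA's model; the eight principle names are those of the Generally Accepted Recordkeeping Principles, and only the level names the text gives (1 and 5) are assumed.

```python
# The eight principle names are ARMA's Generally Accepted Recordkeeping
# Principles; everything else here (assess, the report dict) is invented
# for illustration.
PRINCIPLES = ("Accountability", "Transparency", "Integrity", "Protection",
              "Compliance", "Availability", "Retention", "Disposition")
LEVEL_NAMES = {1: "sub-standard", 5: "transformational"}  # per the Maturity Model

def assess(scores):
    """Summarize a per-principle assessment on the five maturity levels."""
    missing = set(PRINCIPLES) - set(scores)
    if missing or not all(1 <= v <= 5 for v in scores.values()):
        raise ValueError("every principle needs a level between 1 and 5")
    return {
        "overall": sum(scores.values()) / len(scores),
        # A high-level result only; the weakest principle is where a more
        # in-depth improvement strategy would start.
        "weakest": min(scores, key=scores.get),
    }

scores = {p: 3 for p in PRINCIPLES}
scores["Disposition"] = 1
report = assess(scores)
```

As the text notes, such a score is only a high-level evaluation; the point of an Information Architecture Framework would be to supply the methods behind the improvement strategy, not the arithmetic.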
useful to enterprise architectures, be incorporated into TOGAF at a later date.

12. DEVELOPING AN INFORMATION ARCHITECTURE FRAMEWORK
There are a number of different methods and tools that would contribute to an information architecture framework. A starting point would be the definition of an information governance repository, with suggestions on the different elements that should be present in such a repository. As an example, the information management strategy elements within such a repository might include a records management plan, an information security management system with an associated access control policy, an evidential weight strategy, an access to information strategy, an email management strategy and a capture and scanning policy, amongst others. Each of these elements could be defined based upon existing best practice within the information and records management community.

Classification is an example of an information management method that would benefit from being included in an information architecture framework. Outside the information and records management community, the techniques for creating and maintaining a classification scheme, and the benefits of having done so, are poorly understood. By promoting these techniques through a framework, and by ensuring that such a framework is compatible with TOGAF, greater understanding of information governance and information management could be developed in the wider information technology community, and this would be a good thing.

13. ACKNOWLEDGMENTS
My thanks to colleagues at In-Form Consult who have shared their ideas and thoughts on Information Architecture with me.

14. REFERENCES
[1] IGI Report 2014.
[2] http://blogs.gartner.com/debra_logan/2010/01/11/what-is-information-governance-and-why-is-it-so-hard
[3] Design and Implementation of Record Keeping Systems, National Archives of Australia, September 2001 (rev. July 2003).
[4] IGI Report 2014, page 28.
[5] http://www.arma.org/r2/generally-accepted-br-recordkeeping-principles/metrics
One consolidated view of information management references

Ricardo Vieira, IST / INESC-ID, Rua Alves Redol, 9, 1000-029 Lisbon, Portugal, [email protected]
Liliana Ragageles, FCSH – Universidade Nova de Lisboa, Avenida de Berna 26-C, 1069-061 Lisbon, Portugal, [email protected]
José Borbinha, IST / INESC-ID, Rua Alves Redol, 9, 1000-029 Lisbon, Portugal, [email protected]

ABSTRACT
The actual landscape of relevant business and technical references in information management comprises multiple views. For example, standards such as ISO 15489 or ISO 30300 represent the high-level concerns of information as a record of evidence of the acts of an organization, while MoReq2010 and ISO 16175 define, at a lower level, the requirements systems must consider for that. In somewhat complementary views, ISO 16363 defines a process to assess digital repositories and ISO 18128 describes a method for assessing risks to records processes and systems. It is therefore fundamental to recognize that the problem of information management can be seen from multiple perspectives, and consequently solutions might need to consider requirements from different areas of expertise or concerns. Besides the core concerns of information management, a correct analysis might also have to consider other specific views such as information systems, software engineering, risk management, etc. Therefore, information management practices should also consider the fundamental references from those areas of expertise, such as, for example, the ISO 27000 series of standards for information security, ISO/IEC TR 15504 for assessing the delivery capabilities of an organization, ISO/IEC 15288 for process life cycles and stages, ISO 9000 for quality, etc. This proliferation of references makes it hard for organizations to determine in a straightforward manner two fundamental business-related concerns: (1) guidance on best practices, meaning which references should be considered for each purpose, and (2) to what extent their actual processes and systems already comply with them. That has already led to the definition of national references (e.g. NOARK in Norway, e-Arq in Brazil, or SAHKE2 in Finland), adding value to the existing knowledge but consequently increasing the entropy. This paper reports a first attempt by the authors to consolidate this referential landscape, focusing mainly on the identification of the main relevant concerns and views.

Categories and Subject Descriptors
• Information systems ~ Information systems applications
• Information systems ~ Digital libraries and archives

General Terms
Management, Standardization.

Keywords
Information Management, Standards, References, Information Management Concerns, Recordkeeping, Digital Archives

1. INTRODUCTION
A reference document describes relevant details for consultation about a specific subject. In any business or activity, reference documents are of vital importance since they can provide a common understanding of the subject in focus. Additionally, adhering to practices described in reference documents may support safety, reliability and interoperability among products, services and systems. Standards are a particular type of reference document typically used by regulators and legislators to ensure best practices among businesses.

Before using a reference document it is necessary to understand its context and purpose. In other words, reference documents are defined for specific stakeholders with specific concerns in a specific context. Therefore, before adopting a reference document, it is necessary to understand how the context where the document is going to be used compares with the context for which it was intended. Context differences might result in different adoptions of the document or, in the worst-case scenario, might indicate that the reference document is not suitable for adoption.

When discussing the concern of information management it is possible to identify several reference documents. ISO Technical Committee (TC) 46 is responsible for publishing standards about information and documentation. The Committee is structured into five subcommittees (SC). SC4, SC8 and SC9 focus on standards for libraries and related organizations, respectively about technical interoperability, statistics and performance, and identification and description. SC10 publishes standards that describe and define requirements for document storage and conditions for preservation. Finally, subcommittee TC46/SC11 has so far published 17 ISO standards about archives/records management. From another area of expertise, TC 20/SC 13 develops data and communication standards for spaceflight and is responsible for several standards widely used in archival and data preservation, such as ISO 14721 [2], ISO 16363 [8], and ISO 20652 [10]. Additionally, references such as MoReq2010 [17] are also widely used for the definition of systems in information management.

This diversity of reference documents makes it hard for organizations to determine in a straightforward manner which references should be considered and for which purpose. In other words, organizations struggle to understand how the knowledge in reference documents should be applied to their specific context. This paper proposes to mitigate the problem by identifying and describing the different contexts that information management reference documents assume. It ends by identifying relations and gaps between
different contexts that might indicate relations between different reference documents.

2. SYSTEM, STAKEHOLDER, CONCERN, AND PURPOSE
To compare and relate different contexts we need to reach a common understanding of what influences and dictates the context. Using the definitions in [16], the context is what "determines the setting and circumstances of all influences upon a system", where system refers to "systems that are man-made and may be configured with one or more of the following: hardware, software, data, humans, process, procedures, facilities, materials and naturally occurring entities" [16].

Apart from the context, systems have stakeholders, i.e., parties with specific concerns in the system. Through their concerns, stakeholders have various purposes for a system. Therefore, the context of a system "is bounded and understood through the identification and analysis of the system's stakeholders and their concerns" [16]. Figure 1 illustrates the concepts defined and their relations.

Taking into consideration the definition of system described above, it is possible to conclude that a reference document describes relevant details for consultation about a specific system situated in a specific context. Stakeholders express interest in the systems described in the reference documents by expressing concerns and purposes upon the system. Because the context of a system is bound to its stakeholders and concerns, in the next sections the paper analyses the intended stakeholders and concerns of the reference documents.

Figure 1. Relations between the concepts of context, system, stakeholder, concern and purpose [16].

3. THE "RECORDS MANAGEMENT" CONTEXT
Records Management is typically referred to along with ISO 15489 [3, 4], entitled "Records management", from TC46/SC11. The standard, which is at the core of many reference documents about records management, defines records management as "the field of management responsible for the efficient and systematic control of the creation, receipt, maintenance, use and disposition of records, including processes for capturing and maintaining evidence of and information about business activities and transactions in the form of records" [3]. In other words, ISO 15489 describes relevant details about a records management system (not to be confused with a records system – an information system responsible for managing the record lifecycle).

The aforementioned ISO subcommittee (TC46/SC11) is responsible for most of the reference documents typically adopted when discussing a records management system. Additionally, MoReq2010 [17] has attracted increasing attention as a reference document that describes and defines a set of requirements for a records system. We analysed the aforementioned reference documents to identify the intended stakeholders and concerns. Table 1 describes the output of the analysis. We can conclude that:
- The ISO 15489 [3, 4] and ISO 30300 [14, 15] standards series discuss and describe the same concerns. The latter describes the concerns in more detail, but all the concerns of the former are discussed in the ISO 30300 series of standards. A more detailed analysis would be needed to understand (1) the extent of the overlap between the standards, and (2) whether the standards contradict each other when describing the same concerns.
- The same conclusion can be inferred for ISO 16175 [5, 6, 7] and MoReq2010 [17]. A more detailed analysis would be needed to evaluate the overlap between the references.

It is important to note that an association between a concern and a reference document was only identified when the concern is discussed in detail in the document. For example, the concern of having a risk assessment for records processes and systems is mentioned in ISO 15489 [3, 4] and the ISO 30300 series of standards [14, 15] but is only described in ISO 18128 [9].

4. THE "ARCHIVE" CONTEXT
An Open Archival Information System (OAIS) is defined by ISO 14721 [2] as "an Archive, consisting of an organization, which may be part of a larger organization, of people and systems, that has accepted the responsibility to preserve information and make it available for a Designated Community". The standard, developed by the Consultative Committee for Space Data Systems (CCSDS), is the core of several standards published under ISO subcommittee ISO/TC 20/SC 13 and is known for defining best practices for an Archive. Because most of the standards published by the subcommittee are specific to spaceflight, only a subset of them is referenced when discussing information management. Apart from ISO 14721 [2], which describes what should constitute an Archive, ISO 16363 [8] and ISO 20652 [10] are also well known. The former describes how to assess the trustworthiness of digital repositories; the latter describes the relationship between an information producer and an Archive. This paper analysed the aforementioned reference documents to identify the intended stakeholders and concerns. Table 2 describes the output of the analysis. Through the analysis it is possible to conclude that the concerns described in ISO 14721 [2] are also described in ISO 16363 [8]. However, the purposes of the two reference documents are complementary, since the latter is intended as a checklist to assess that all concerns described in ISO 14721 [2] were considered in a repository.


Table 1. Analysis of classes of stakeholders and concerns of the view "Records Management"

Organization Managers¹:
- Ensure that a records management policy is defined, documented and communicated [3, 4, 11, 13, 14]
- Ensure records management responsibilities and authorities are defined and assigned [3, 4, 14]
- Ensure records managers have the necessary skills and competences [3, 4, 14]
- Ensure that the records management policy is aligned with the organization's goals and context [14, 15]
- Ensure the proper allocation of resources to records management [4, 14, 15]
- Ensure the monitoring and review of the records management policy [14, 15]

Records Managers²:
- Ensure the correct implementation of a records management programme (records management strategy) aligned with the records management policy [3, 4, 14, 15]
- Ensure the monitoring and control of records management processes [3, 4, 14, 15]
- Ensure that a record system implementation methodology is defined and implemented [3, 4, 14, 15]
- Ensure the reliability, authenticity, usability and integrity of metadata associated with records [11, 12]

IT Managers³:
- Ensure the reliability, usability and integrity of record systems [3, 4, 5, 6, 7, 11, 12, 14, 15, 17]

Risk Managers:
- Ensure the assessment of risks for records processes and systems [9]

Auditors:
- Ensure the records management programme is compliant with records management requirements [3, 4, 14, 15]

Table 2. Analysis of classes of stakeholders and concerns of the view "Archive"

Producer:
- Ensure the information provided is preserved according to specific requirements [2, 10]

Management:
- Ensure that information is preserved according to the agreed requirements [2, 8]
- Ensure that an Archive policy is defined, documented and communicated [2, 8]
- Ensure the Archive responsibilities and authorities are defined and assigned [8]
- Ensure staff of the Archive have the necessary skills and competences [8]
- Ensure the proper allocation of resources to the Archive [2, 8]
- Ensure proper technology and infrastructure support [2, 8]
- Ensure the monitoring and review of the Archive policy [2, 8]
- Ensure the communication and transparency between stakeholders [2, 8]

Consumer:
- Ensure that the preserved information is understandable for the designated community [2, 8]
- Ensure that the preserved information is available to the designated community [2, 8]
- Ensure the access to the information requested [2]

¹ Also referenced as "top management", "senior management" and "executives".
² Also referenced as "records management professional".
³ Also referenced as "system administrators".

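Tables 1 and 2 are essentially many-to-many mappings from concerns to the reference documents that discuss them. A small fragment of Table 1, held as plain data, is enough to answer the two questions posed in the introduction: which references to consider for a purpose, and where the documents overlap. The sketch is illustrative; concern labels are abbreviated here and the helper functions are invented, not taken from any of the cited standards.

```python
# A fragment of Table 1 as data: concern -> reference documents that
# discuss it in detail (concern labels abbreviated, citations simplified).
TABLE1 = {
    "policy defined and communicated": {"ISO 15489", "ISO 23081", "ISO/TR 26122", "ISO 30300"},
    "responsibilities assigned": {"ISO 15489", "ISO 30300"},
    "risk assessment for records processes": {"ISO 18128"},
    "reliability and integrity of record systems": {"ISO 15489", "ISO 16175", "MoReq2010", "ISO 30300"},
}

def references_for(concern):
    """Question (1): which references should be considered for a purpose."""
    return sorted(TABLE1[concern])

def shared_concerns(doc_a, doc_b):
    """Question (2): where two reference documents overlap."""
    return {c for c, docs in TABLE1.items() if doc_a in docs and doc_b in docs}

# In this fragment, ISO 15489 and the ISO 30300 series discuss the same
# concerns, mirroring the paper's first conclusion above.
overlap = shared_concerns("ISO 15489", "ISO 30300")
```

Even at this toy scale, the overlap query reproduces the table's pattern: every ISO 15489 concern here is also an ISO 30300 concern, while the risk-assessment concern belongs to ISO 18128 alone.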
5. A CONSOLIDATED VIEW
This section analyses the relationships between the aforementioned reference documents. As stated before, a system and its context are bound to its stakeholders and concerns. In the previous sections we identified the stakeholders and concerns described in several reference documents in order to understand the purpose and goals of the systems they describe.

It is important to note that concerns represent the Ends of the system, i.e. concerns are desired results and visions of what the system can and should be to its stakeholders [1]. Ends are achieved through Means, i.e. "devices, capabilities, regimes, techniques, restrictions, agencies, instruments, or methods that may be called upon, activated, or enforced to achieve Ends" [1]. In fairness, the analysed reference documents do not only describe Ends (concerns) but also Means to achieve them. While in some references (e.g. ISO 15489 [3, 4], ISO 16363 [8], ISO 23081 [11, 12], and the ISO 30300 series of standards [14, 15]) Means and Ends are described together, others, such as ISO 16175 [5, 6, 7], ISO 26122 [13], or MoReq2010 [17], clearly focus on the Ends.

5.1 Relating Concerns
Judging merely by the identified stakeholder names (records management references do not provide definitions for their stakeholders, so comparison of definitions is not possible), it would be easy to conclude that the records management and Archive references have no related stakeholders. However, through the analysis of the concerns it is possible to identify several overlaps. In fact, if we compare the "Organization Managers" concerns in records management with the "Management" concerns in the Archive, we can see that both refer to (a) definition and communication of policies, (b) assignment of responsibilities and authorities, (c) training of skills and competences, (d) allocation of resources, and (e) monitoring and review of policies. The difference between the concerns is the object of focus: while records management focuses on records, the Archive focuses on information. However, if we ignore the terminology and focus on the definitions of the terms, we can also observe an overlap. A record is defined as "information created, received, and maintained as evidence and information by an organization or person, in pursuance of legal obligations or in the transaction of business" [ISO 15489]. Information in the Archive references is defined as "any type of knowledge that can be exchanged". In conclusion, Archive references focus on a very broad concept of information in which records can be included. In other words, Archives are concerned with the capture, preservation and access of records as of any other type of knowledge that can be exchanged. Therefore, in theory, the Means to achieve the Archive Ends can also be applied to achieve the Records Management Ends.

5.2 Addressing Concerns
If we ignore the object of focus of the aforementioned references, the processes inferred from the concerns are also common to several businesses or practices. This represents evidence that information management can be seen from different perspectives, and consequently its Means should consider requirements from different views.

Table 3 presents the views identified in the reference documents along with a non-exhaustive list of reference documents from those views.

Table 3. "Records Management" and "Archive" related views and reference documents

- Policy design and implementation. Reference documents: ISO 29383:2010 – Terminology Policies – Development and Implementation.
- Project Management. Inferred practices: assignment of responsibilities; resource training; allocation of resources. Reference documents: ISO 21500:2012 – Guidance on Project Management; Project Management Body of Knowledge (PMBOK) by the Project Management Institute; Projects IN Controlled Environments, version 2 (PRINCE2).
- Process Maturity. Inferred practices: goal-process alignment; process monitoring and compliance; process quality. Reference documents: Capability Maturity Model Integration (CMMI); ISO 9000 – Quality Management series of standards; ISO 15504 – Information Technology – Process Assessment; ISO 19011:2011 – Guidelines for auditing management systems.
- Risk Management. Inferred practices: risk assessment. Reference documents: ISO/Guide 73:2009 – Risk Management – Vocabulary; ISO 31000:2009 – Risk Management series of standards.
- IT Management. Inferred practices: ensure the reliability, usability and integrity of systems. Reference documents: all standards from the ISO subcommittee ISO/IEC JTC 1/SC 40 – IT Service Management and IT Governance; Information Technology Infrastructure Library (ITIL).
- Data Management. Inferred practices: ensure the reliability, authenticity, usability and integrity of metadata; data capture, preservation and access. Reference documents: Data Management Body of Knowledge (DMBOK) by DAMA International.

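Table 3 above can also be read as a lookup from inferred practices to the reference documents of neighbouring disciplines. An abridged, illustrative sketch (row data shortened, the function name is invented here):

```python
# Abridged rows of Table 3: related view -> (inferred practices, references).
TABLE3 = {
    "Project Management": (
        {"assignment of responsibilities", "resource training", "allocation of resources"},
        ["ISO 21500:2012", "PMBOK", "PRINCE2"],
    ),
    "Risk Management": (
        {"risk assessment"},
        ["ISO Guide 73:2009", "ISO 31000:2009"],
    ),
}

def references_for_practice(practice):
    """Which neighbouring discipline's references speak to an inferred practice."""
    return [ref for practices, refs in TABLE3.values()
            if practice in practices
            for ref in refs]
```

Such a lookup is one possible shape for the consolidated, queryable view of the referential landscape that the paper argues for.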

6. CONCLUSIONS AND FUTURE WORK
This paper analyses different reference documents through the identification of stakeholders and their concerns. The analysis supports an overall understanding of the references and their relations. Additionally, it allowed the identification of relations between reference documents that use the same terminology and are produced by the same stakeholders. However, when comparing similar reference documents produced by different stakeholders it was not possible to identify equivalent stakeholders. Further analysis allows us to conclude that, although the identified stakeholders are different, their concerns are similar. Additionally, it was possible to infer and identify several views related to the analysed body of knowledge.

Future work needs to include a more detailed analysis of (1) the terminology used and its definitions, in order to identify similar and equivalent terms, and (2) the extent and type of relation between the reference documents, i.e. identifying possible overlaps or contradictions in the extensive body of knowledge. Future work will be performed in the European project entitled "European Archival Records and Knowledge Preservation" (E-ARK⁴), where a knowledge centre for information governance is being developed. The service will consist of an online information system that will allow the consolidation of, and access to, reference documents in order to derive best practices. The system will be designed according to reference techniques in the areas of requirements engineering and business analysis and design, such as the Business Motivation Model (BMM) [1] and the ISO 42010 [16] referenced and used in this paper.

7. ACKNOWLEDGMENTS
This research was co-funded by FCT – Fundação para a Ciência e a Tecnologia, under project PEst-OE/EEI/LA0021/2013, and by the European Commission under the Competitiveness and Innovation Programme 2007-2013, E-ARK – Grant Agreement no. 620998 under the Policy Support Programme.

8. REFERENCES
[1] BMM. 2014. Business Motivation Model – version 1.2, Technical Reference. OMG, Massachusetts, USA.
[2] ISO 14721. 2012. Space data and information transfer systems – Open archival information system (OAIS) – Reference model. Standard. ISO, Geneva, Switzerland.
[3] ISO 15489-1. 2001. Information and documentation – Records Management – Part 1: Concepts and principles. Standard. ISO, Geneva, Switzerland.
[4] ISO 15489-2. 2001. Information and documentation – Records Management – Part 2: Guidelines. Standard. ISO, Geneva, Switzerland.
[5] ISO 16175-1. 2010. Information and documentation – Principles and functional requirements for records in electronic office environments – Part 1: Overview and statement of principles. Standard. ISO, Geneva, Switzerland.
[6] ISO 16175-2. 2011. Information and documentation – Principles and functional requirements for records in electronic office environments – Part 2: Guidelines and functional requirements for digital records management systems. Standard. ISO, Geneva, Switzerland.
[7] ISO 16175-3. 2010. Information and documentation – Principles and functional requirements for records in electronic office environments – Part 3: Guidelines and functional requirements for records in business systems. Standard. ISO, Geneva, Switzerland.
[8] ISO 16363. 2012. Space data and information transfer systems – Audit and certification of trustworthy digital repositories. Standard. ISO, Geneva, Switzerland.
[9] ISO 18128. 2014. Information and documentation – Risk assessment for records processes and systems. Standard. ISO, Geneva, Switzerland.
[10] ISO 20652. 2006. Space data and information transfer systems – Producer-archive interface – Methodology abstract standard. Standard. ISO, Geneva, Switzerland.
[11] ISO 23081-1. 2006. Information and documentation – Records management process – Metadata for records – Part 1: Principles. Standard. ISO, Geneva, Switzerland.
[12] ISO 23081-2. 2009. Information and documentation – Managing metadata for records – Part 2: Conceptual and implementation issues. Standard. ISO, Geneva, Switzerland.
[13] ISO/TR 26122. 2008. Information and documentation – Work process analysis for records. Standard. ISO, Geneva, Switzerland.
[14] ISO 30300. 2011. Information and documentation – Management system for records – Fundamentals and vocabulary. Standard. ISO, Geneva, Switzerland.
[15] ISO 30301. 2011. Information and documentation – Management system for records – Requirements. Standard. ISO, Geneva, Switzerland.
[16] ISO/IEC/IEEE 42010. 2011. Systems and software engineering – Architecture description. Standard. ISO, Geneva, Switzerland.
[17] MoReq2010. 2011. Modular Requirements for Records Systems – Volume 1: Core Services & Plug-in Modules, Version 1.1. DLM Forum Foundation, Northampton, England.

4 For more details please consult: http://www.eark-project.com/


Transforming Information Governance Using SharePoint 2010

Stephen Howard
NATS
4000 Parkway, Whiteley, Fareham, Hants, PO15 7FL, UK
+44 (0)1489 446925
[email protected]

ABSTRACT

NATS, an international air navigation service provider, deployed SharePoint 2010 to approximately 4500 users over 18 months. This paper provides an overview of the main aspects of the project, including the business case and change management model. It outlines the key role of an information governance framework, including a network of Local Information Managers and Information Points of Contact, underpinned by an Information, Records & Archives Management policy. The paper also details information architecture choices and configuration for records management. Finally, the results of a comprehensive lessons learned exercise are reviewed, including informal benchmarking against other SharePoint implementations. The project proved to be more complex than envisaged at the outset, and high initial user expectations have not always been met. In particular, records management functionality will be provided during a secondary implementation phase using third-party tools. Users remain engaged and are optimistic that the collaborative and data management capabilities of the new common platform will lead to innovation in business processes and deliver tangible benefits.

Categories and Subject Descriptors

Benchmarking RM and information governance; evolution from RM into global information governance; latest developments and best practices in RM.

Keywords

SharePoint 2010; information governance; enterprise content management.

1. INTRODUCTION

"In any organisation, once the beliefs and energies of a critical mass of people are engaged, conversion to a new idea will spread like an epidemic, bringing about fundamental change very quickly." [1]

NATS is the UK's leading provider of air traffic control services. Each year NATS handles 2.2 million flights and 220 million passengers in UK airspace. In addition to providing services to 15 UK airports, NATS works in more than 30 countries around the world. Established as a public/private partnership in 2001, NATS is one of the first Air Navigation Service Providers (ANSPs) to be privatized and one of the leading commercial ANSPs in Europe. The business information technology function within NATS (Information Solutions) outsources all technical services to multiple business partners, and is focused primarily on service development, customer service delivery, supplier coordination and change management. In June 2011, after extensive internal consultation, market evaluation and proof of concept trials, the NATS Chief Information Officer launched a long-term initiative called "Our Future Workspace". Two core elements of the programme were the virtualization of the standard IT desktop and the adoption of Microsoft's SharePoint Server 2010 at the heart of a strategic content and data management infrastructure to facilitate collaboration and mobility [2]. This paper focuses upon the core SharePoint 2010 collaborative team site (Team HUB) deployment at NATS from October 2012 to April 2014. It does not address data warehouse analysis using SharePoint for business intelligence, nor the development of NATS applications built upon the new platform. It also does not cover the related SharePoint project at NATS to launch a new intranet, training 450 contributors and transitioning over 15000 pages within 6 months using a governance model that was a useful foundation for the main Team HUB rollout.

2. SHAREPOINT BUSINESS CASE

The original SharePoint business case described how NATS retained a significant amount of content for operational, regulatory and legal purposes. This was stored and managed within the Livelink legacy content management system in a manner which did not fully support effective and efficient collaboration. There was no automated implementation of retention policies and, if no action was taken, it was argued, the content and cost of storage would become unmanageable. The savings arising from the decommissioning of Livelink would deliver a timely return on the project investment.

SharePoint Server 2010 was identified as the only integrated and cost-effective solution capable of meeting NATS' specific information management and wider capability requirements. The anticipated benefits included:

- Reduced operating costs, compared to implementing point solutions for records management and the intranet.

- Reduced legal, regulatory and security risks: improving the management of personal and non-personal data, applying retention policies and improving access controls.

- Improved productivity: an accurate and authentic "single version of the truth" could be found more easily when stored in shared areas; more effective information-based processes streamlined using automated workflows; rich business intelligence reporting.

- Alignment with the NATS business growth strategy, since the platform provided a single common platform for web content management, business application development and wider innovation.

The SharePoint project was originally approved in August 2011 at a total forecast capital and revenue cost of £2.8m, including internal labour costs, all supplier services, licensing and the establishment of a resilient on-premise SharePoint infrastructure delivering satisfactory performance across the NATS estate. The project deliverables were subsequently revised in light of emerging complexity and additional business requirements. Due to this extended scope, the total project budget was amended in November 2012 to £5.5m to cover additional implementation, development, deployment and internal labour costs.

3. INFORMATION GOVERNANCE

An Information Management Steering Group (IMSG), composed of sponsors from each business area, was established in the early phases of the SharePoint rollout with a monthly agenda of project updates and relevant information governance decision-making. In early 2013 a new Information, Records and Archives Management Policy was approved by the IMSG, which after a long period of consultation obtained Executive approval in February 2014. The new policy clarified NATS' commitments and articulated an information governance framework that would prove vital to the SharePoint project and to subsequent information management initiatives [see Annex 1 below].

The policy obliged business areas to appoint a Local Information Manager (or LIM) to ensure the appropriate management direction, processes and tools were in place to efficiently manage information arising from their functions. For larger business areas, LIMs would be supported by Information Points of Contact (IPOCs) from each sub-function. As SharePoint site collection owners, LIMs and IPOCs manage and maintain local Team HUB content and structure, set permissions and approve local changes to sites on a routine basis. They participate in the NATS information management community (e.g. the Electronic Document and Records Management Forum), share their knowledge with colleagues, answer local user enquiries in the first instance, promote best practice and aspire to continuous improvement in the management of their information. A total of 28 LIMs and 204 IPOCs across all business areas were appointed, trained and supported during the project.

4. PROJECT DELIVERY

The SharePoint project team was composed of 14 NATS and supplier staff, including SharePoint architects and subject matter experts, project managers, information management analysts and NATS change agents. The 18-month implementation plan divided NATS into 8 key tranches of 12 weeks' duration, with each tranche containing several business areas. The balancing of the effort required across each tranche was more an art than a science. The core business functions of the operational centres, engineering and programmes obtained significant extensions to their tranches at the project board.

The project team approached senior managers within each business area well in advance of the schedule and obtained agreement to follow the rollout plan to build, populate and launch their Team HUB [see Figure 1 below].

Figure 1

LIMs and IPOCs were the primary point of contact for the project team to collate business area requirements for Team HUB design, permissions, retention and content migration. Each business area was required to complete a pre-engagement audit questionnaire to better understand each function and review existing content management practices. LIMs and IPOCs were strongly encouraged to prepare for engagement with the project team by housekeeping their existing Livelink and shared drive folder structures, deleting any expired content, and identifying any documents that could not be held in SharePoint due to size or file type limitations. In particular, LIMs and IPOCs were tasked with rationalizing deep nested folders within Livelink that would not work well in a SharePoint environment.

LIMs and IPOCs were also directly involved in the delivery of communications about the Team HUB rollout and assisted in the scheduling of "What's in it for me?" and hands-on transition training for all business area employees. They also ensured their team's attendance at detailed site design workshops and coordinated user acceptance testing, migration verification and sign-off prior to go-live. The project team offered floorwalker support and drop-in surgeries to supplement SharePoint computer-based training and wiki guides. Issues captured by the LIMs and IPOCs were routed to the project team for resolution.

5. INFORMATION ARCHITECTURE PRINCIPLES

It is essential that information governance arrangements are robust enough to enforce a disciplined approach to key SharePoint architectural principles and design decisions. Examples from the NATS project included:

5.1 Naming conventions and site hierarchy

The names of web applications, managed paths, sites, libraries and lists should follow standard naming conventions. All names should be succinct to minimise the risk of the total file path exceeding the technical limit of 255 characters and exclude

spaces. Where not a Community or Project site (which have their own managed paths), all new Team HUB site collections should be created using a "Functions" managed path and named using a 3-letter acronym, agreed with Information Management and the relevant LIM, to describe that function. Document IDs should be set to replicate the relevant site collection's 3-letter acronym as a prefix, creating a unique identifier across the SharePoint farm.

5.2 Team HUB template

Site libraries should be structured according to the functional breakdown of activities relevant to each business area, with their name and number approximately equivalent to the top-level folders of legacy content on shared drives, following a thorough housekeeping exercise in liaison with the Information Management team. Folders should be replaced with metadata where practical. In line with the collaborative vision of the SharePoint project, permissions to access site collections and functional libraries should ordinarily be set to all NATS staff unless there is a compelling business reason to restrict access to a more limited group, i.e. due to the presence of protectively-marked content. Library check-in/out functionality should default to OFF to avoid issues with bulk upload, updating in datasheet view, co-authoring and storing drafts to local storage. Versioning should be set to permit major and minor versions. A default All Documents view should be enforced using columns selected from the default Content Type to standardize the user experience.

5.3 Content types and metadata

The standard NATS-Document content type must include the columns detailed in Table 1 below.

Table 1

Metadata                  Mandatory?   Description
Name                      √            File name (to follow naming conventions)
Title                                  Plain English description
NATS Subject                           User defined from local picklist created by LIM
NATS Owner                √            Default to creator
NATS Protective Marking   √            Default to library setting
Document ID               √            Auto filled

Child content types should be created for each major Office application and linked to a standard document template issued by NATS Internal Communications, e.g. NATS-Word, NATS-Excel, NATS-PowerPoint. The mandatory column NATS Protective Marking can be set to a default for each particular location based upon the normal sensitivity of the record series, and can be presented in the document header/footer itself as a smart tag label.

5.4 Records management configuration

The implementation of records management and retention controls was a core element of the NATS SharePoint business case. SharePoint 2010 provides basic records management functionality, but the project team developed an increasing awareness of its limited capabilities when scaled to meet the demands of a large and complex organisation [3, 4 and 5]. Critical issues included the lack of aggregation to rationalize disposition decisions, the weaknesses in the audit trail of disposition, and the lack of a unique document ID that could survive routing across site collections.

NATS and its SharePoint delivery partner explored ways of configuring an effective records management solution using out-of-the-box functionality and limited customisation, but the known issues could not be confidently resolved in a cost-effective manner and without placing future SharePoint upgrades at risk. NATS thus reviewed the marketplace for a specialist SharePoint records management plug-in written and supported by a reputable third-party supplier.

This was also an opportune moment for NATS to seek a method of effectively capturing and applying retention schedules to email records. The lack of integration between SharePoint 2010 and Outlook 2010 does not allow for an intuitive and effective way to manage email records. RecordPoint was the winning vendor, which also supplied the Email Manager Outlook plug-in by Colligo. NATS is still in the process of deploying these tools onto its production SharePoint and desktop environments and planning their full implementation.

6. PROJECT ACHIEVEMENTS

At the end of the deployment in April 2014, the Team HUB web application contained 1310 Team sites representing 36 business areas and attracted 1400 unique visitors each day. User training had been delivered to approximately 1400 employees across all office locations, and every NATS SharePoint user was granted ownership of an individual "My Site". Two computer-based training packages were available to all staff via the intranet, supported by 422 wiki pages of frequently asked questions. Over the duration of the deployment, 776 issues were captured and resolved by the project team, and 42 significant change requests were implemented by a Change Control Board to deliver business requirements outside of the original scope of the rollout. Over 1.7 TB of legacy content was migrated from Livelink to SharePoint.

The project ultimately delivered within the constraints of the revised budget, and in line with the scope and schedule agreed by the project board. For the first time in recent memory, all NATS business areas are using the same enterprise content management system. The organisation as a whole has learned a lot about SharePoint over the past 18 months, and most LIMs are engaged with the platform and have ideas about taking it forward. There are good examples of LIMs exploiting the potential of SharePoint to build sites that cut across organisational boundaries and enable more efficient collaboration between teams, e.g. Swanwick and Prestwick centres share libraries of operational instructions;


Business Development and NATS Services share libraries of commercial bids.

7. LESSONS LEARNED

NATS commissioned an independent auditor to conduct a mid-flight project review in March 2013. This was followed up in April 2014 by an experienced independent SharePoint consultant, who conducted an evaluation of the project and some informal benchmarking against 5 organisations who had implemented SharePoint [See Annex 2]. One month after the formal end of the project, over fifty LIMs, IPOCs and supplier representatives attended an afternoon workshop to review the lessons learned, and celebrated the conclusion of a long, complex but ultimately rewarding project.

7.1 Stakeholder engagement

Stakeholder engagement was complicated since ambitious but unrealistic expectations were set in the early phases of the project and not recalibrated when the scope was changed. Neither of the corporate-wide implementations benchmarked (organisations B and D) promised records management, workflow or extranets. Even C, which is proceeding with a very slow, phased and targeted implementation, has postponed records management and extranets until a later stage.

On the other hand, business areas were to some extent unprepared for engagement with the project team. There was widespread misunderstanding of the gravity and scale of the LIM and IPOC roles, and the majority of business areas struggled to meet the resource requirements for effective information governance during the transition to SharePoint. Not all of the LIMs had the time and/or the SharePoint knowledge to configure document libraries to adapt SharePoint to their own needs. Some business areas were fully occupied with existing projects or change initiatives and could not fully exploit the information management opportunities presented to them by Team HUBs.

7.2 Underestimation of effort

It was recognized at an early stage that the deployment of SharePoint is much more than the installation of the software. Team HUBs profoundly affected the way that business areas worked and represented a personal challenge to many staff members familiar with the legacy platform. Business areas also required far more assistance to carry out a detailed analysis of their existing content than was originally planned, and needed enhanced support to define and implement appropriate information management structures. An extended period of engagement and user acceptance testing within each business area and additional analyst support was needed to capture requirements, increasing the budget allocation.

Even with the additional budget, the project team was perceived by the business areas as lightly resourced and heavily reliant upon the LIM and IPOC network. Due to the ambitious and congested project schedule, engagement with the next tranche of business areas had frequently already started prior to the full resolution of issues from the previous tranche. Although the deployment did eventually establish a successful rhythm, at times there were bottlenecks as the tranche issues doubled up.

NATS also required additional training support (provided by Ipso Facto Ltd) for the LIMs and IPOCs, who reflected that earlier training opportunities would have provided a better insight into the full capabilities of the new platform, enabling more innovation at the design workshops and ultimately helping to build better solutions. NATS suffered from a lack of SharePoint experience in comparison to most of the other organisations benchmarked (C and E had learned from previous unsuccessful SharePoint 2007 projects). NATS also suffered from a lack of settled personnel, and key posts were vacant when the rollout started.

7.3 Information architecture

Early Team HUB workshops devoted much effort to the creation of a full catalogue of content types from each business area, with a view to specifying metadata for each content type and linking them to retention and lifecycle management policies. In practice, the 12-week engagement model did not give enough time for the business areas to agree on bespoke content types. The project board approved the simplification of the standard site template in February 2013, a few months into the rollout. The use of the managed metadata column "NATS Subject" was unclear, and by default the whole Subject term set was available for every site library. It was unrealistic to expect a central authority to manage the "NATS Subject" list to the level of granularity needed by the business, and from tranche 4 onwards NATS Subject was an optional column of controlled values populated by Local Information Managers as appropriate.

In some instances the project team delegated nodes of the Term Store to LIMs; however, SharePoint Managed Metadata was exposed as an immature feature of SharePoint 2010, to be used with caution and in agreement with each business area. NATS content type ambitions were thus eroded, creating some confusion and inconsistency. In comparison, organisations D and B had made their key information architecture choices before the start of the rollout and stuck to them all the way through.

Despite the project team's efforts to standardize templates and architectural design decisions, there is inconsistent use of folders, content types and metadata. It proved to be unrealistic to expect existing sites to be migrated to a new version of the site template, and there are variations in templates that may potentially complicate user support. A number of valid special cases were identified across the business, and bespoke site templates were created for projects and for engineering assets.

The recommendation to create broader, flatter site structures reflecting business functions was not always followed. "Function" was often interpreted to mean "Department", and hierarchical site collections were constructed based upon temporary organisational structures. It did not take long before the project team received requests to relocate sub-sites to new site collections due to organisational change – with the associated complication of losing identifiers specific to the parent site collection.

7.4 Migration

The decommissioning of Livelink was a central element of the SharePoint business case, and the project was overshadowed by

the need to migrate Livelink and file share content to business area Team HUBs. Time that could have been spent in designing the information structures of the new Team HUBs and innovating in ways of working was instead used in migration mapping.

The path of least resistance for some business areas unprepared for engagement with the project was to simply "lift and shift" Livelink folders into SharePoint. In some cases those folder structures were deeply nested and impractical in a SharePoint environment. Replicating folder structures in this manner failed to exploit the strength of SharePoint in managing content with metadata. Where content was important enough to be needed in Team HUBs, ideally it should have been imported with metadata values added to the fields required in the library.

Most of the other organisations surveyed took a simpler approach to migration. C did not migrate any content from its 2007 instance to SharePoint 2010. E will import content from its legacy electronic records management system into a SharePoint records centre. In contrast, NATS attempted the more difficult task of importing Livelink content area by area into each relevant Team HUB.

The migration task itself was lengthy and complicated, with a wide range of errors reported due to filenames, sizes or formats. The migration from Livelink frequently broke links to documents that existed in other documents and databases across NATS, and a bespoke Livelink search redirection service had to be built to provide a workaround. Even though a significant amount of content was migrated and deleted from the legacy system, Livelink retains some unique content and is currently still in operation in read-only mode, a clear example of where the project did not fully meet its objectives.

7.5 Security levels and access controls

Permissions remain the key administrative overhead for site collection owners, and a frequent source of support calls in light of their apparent fragility and complexity. The project team recommended the placing of service requests to edit Active Directory groups rather than directly editing SharePoint groups. It was only after the end of the project that a tool was provided to LIMs and IPOCs for updating Active Directory in real time. Further investigation of permissions management plug-ins to set security based on metadata values (e.g. protective marking) will be undertaken. Business areas have also requested enhanced reporting capabilities to easily investigate the permissions allocated to an individual.

7.6 Limitations and missed opportunities

The implementation of records and email management plug-ins will address some of the key functionality gaps in SharePoint. However, the configuration of the retention process in the records management plug-in will require the resources of IS, suppliers and the LIMs. In hindsight, greater efforts should have been made to implement RecordPoint earlier in the project, and NATS should now take a long-term view of the application of retention rules.

A range of product weaknesses complicated the SharePoint rollout. Some applications (e.g. AutoCAD, Acrobat) did not easily integrate with SharePoint. The software boundaries and limits were frequently encountered. Users were frustrated in performing simple tasks like moving documents across libraries. Offline working with SharePoint Workspace was disappointing, and optimizing access to Team HUBs from mobile devices was not fully considered. On the other hand, some business areas were frustrated that advanced SharePoint features were not fully enabled and available at an early stage to LIMs and IPOCs, e.g. SharePoint Designer for custom workflows, InfoPath data collection and Business Connectivity Services to external data sources.

Search performance was consistently disappointing due to a proliferation of document versions eroding confidence in the SharePoint search results. The project team was unable to prioritise change requests to the FAST search configuration and in this respect did not fully meet business expectations.

7.7 Platform stability and support

The concurrent delivery of the desktop virtualization project across all business areas complicated the Team HUB deployment in a variety of ways, and it is strongly recommended that organisations deploy SharePoint on a desktop environment with the latest versions of standard office applications. Supplier management was hindered by a complex contract, lengthy contract negotiations, and changes in personnel. Business areas perceived that there was very limited access to the technical resource who could fix user issues.

The early phases of the project established a platform that was not fully resilient. Major system outages occurred during the deployment that caused delays and undermined confidence in the platform and project. A comprehensive technical improvement plan restored trust in SharePoint; however, NATS continues to experience technical issues with third-party tools for backup, administrative reporting and document migration. The launch of extranets was significantly delayed due to technical issues.

8. BENEFITS REALISATION

NATS defined project benefits that were arguably too generic, making them less useful as a guide to decision making in the project and difficult to measure. In September 2014, 6 months after the closure of the project, LIMs and IPOCs were invited to complete a questionnaire to summarize their experience of the deployment and answer the key question "To what extent do you feel that the Team HUB rollout has improved the overall level of information management within your business area?" [See Figure 2 below].

Figure 2
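Survey results like those summarised in Figure 2 reduce to a simple aggregation. As a minimal sketch, with invented scores rather than the actual questionnaire data reported in the next section, the headline figures can be computed as:

```python
from statistics import mean, stdev

# Hypothetical 10-point improvement scores; the real survey had 29
# responses, and the raw data is not published in this paper.
responses = [2, 3, 3, 4, 4, 5, 5, 5, 6, 6, 7, 7, 8, 9]

print(f"n = {len(responses)}")
print(f"average improvement  = {mean(responses):.2f} / 10")
print(f"spread (sample s.d.) = {stdev(responses):.2f}")
```

The sample standard deviation is one way to quantify the "wide spread of responses" that the paper attributes to differing levels of exploitation of the platform.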


From 29 responses, the average measure along the scale of improvement was 5.24 on a 10-point scale. The wide spread of responses reflects the differing levels of exploitation of SharePoint's potential; e.g. business areas that have simply imported Livelink folder structures into SharePoint have seen little or no improvement. The low scores typically represent the earlier tranches, before the project team made key adjustments. The informal survey suggests that business areas have seen a positive impact in their governance of information and are reaping the rewards of engaging with the project team, beginning their SharePoint journey and learning en route. There is broad acceptance that SharePoint has been beneficial to NATS and has the potential for further innovation.

It is interesting to compare this response with the results of a survey of more than 600 members of the AIIM community in 2013 [6]. In particular, in response to the question "Thinking about the scope and development of your SharePoint ECM project, how would you describe progress?", the AIIM survey suggests that a majority of SharePoint deployments (61%) are stalled, struggling or failing and only 6% report an unqualified success [See Figure 3 below].

Figure 3

9. CONCLUSION

The end of a corporate rollout creates the risk of a loss of organisational focus and momentum. In attempting to transform information governance using SharePoint 2010, NATS arguably took on a project that was too ambitious, with insufficient knowledge of the product, in too short a time frame. At this stage it has an enterprise content management system with an architecture that needs further optimization. The priority over the next two years should be to support business areas to improve the quality and consistency of document libraries and, where appropriate, support the transition from documents to data through the innovative use of lists and external data connections.

LIMs and IPOCs are already talking about SharePoint 2013 and Office 365, and Information Solutions will be creating a SharePoint innovation team to help deliver tangible benefits to the business through targeted developments. The implementation project was just a small step on a long SharePoint journey for NATS. SharePoint has the potential to give further business benefit if the ideas and energy of the LIMs and IPOCs can be harnessed and supported. NATS has reached a tipping point in its use of SharePoint, with a critical mass of adoption.

10. ACKNOWLEDGMENTS

This project would not have succeeded without the remarkable efforts of the NATS LIMs and IPOCs and the SharePoint project team, including Hitachi Consulting Services and Information Solutions. The author wishes to thank PricewaterhouseCoopers LLP and Thinking Records Ltd. for auditing this project. Acknowledgements are also due to all of the sources of the articles curated via http://www.scoop.it/t/managing-records-in-sharepoint-2010, a constant reference during the project.

11. REFERENCES

[1] Kim, W. Chan and Mauborgne, Renée. 2003. Tipping Point Leadership. Harvard Business Review (Apr. 2003). DOI=http://hbr.org/2003/04/tipping-point-leadership//1
[2] Walker, G. 2013. "Duties of the CIO". Presentation given at the BCS event "CIO-DNA" on 4 November 2013 in London, UK. DOI=http://www.bcs.org/content/conWebDoc/51322
[3] The National Archives. 2011. Records Management in SharePoint 2010 – implications and issues. DOI=http://www.nationalarchives.gov.uk/documents/information-management/review-of-records-management-in-sharepoint-2010.pdf
[4] Miller, B. 2012. Managing Records in Microsoft SharePoint 2010. Overland Park, Kansas. ARMA International.
[5] State Records New South Wales. 2012. SharePoint 2010: recordkeeping considerations. DOI=http://futureproof.records.nsw.gov.au/category/sharepoint-and-recordkeeping/
[6] AIIM. 2013. SharePoint 2013 – clouding the issues. DOI=http://www-01.ibm.com/common/ssi/cgi-bin/ssialias?infotype=SA&subtype=WH&htmlfid=ZZL03061USEN

12. ANNEX 1

NATS information, records and archives management policy

1 Policy

1.1 Global leadership and innovation in air traffic solutions, airport performance, and air safety requires the effective management of NATS information, records and archives to the highest international standards. This must be underpinned by an appropriate infrastructure of organisational commitments, consisting of a chain of managerial accountability, and sufficient resources and expertise to implement a framework of controls and procedures.

2 Scope

2.1 This policy applies to the management of all NATS information assets, records and archives created or received by any NATS employee or contractor in the

Page 47

course of NATS' business or held by third parties (e.g. regulators, customers, partners, suppliers), including:
a) Physical and electronic documents, intranet and internet web pages, emails, blogs, instant messages and SMS text messages.
b) Data within structured data systems (such as SAP, the Business Intelligence (BI) Data Warehouse, databases and operational systems), since issues relating to records retention, categorisation and security will apply equally to such data sources and equivalent controls must be put in place.
3 Objectives
3.1 The implementation of this policy has the following objectives:
a) To empower NATS employees and support responsible, intelligent and effective decision-making based on timely and accurate information.
b) To enable the efficient delivery of NATS services and functions, facilitating organisational transformation.
c) To make decisions and performance transparent to stakeholders through effective records management and improved communications.
d) To reduce administrative costs by providing timely access to full and accurate records and streamline the handling of information, reducing time lost in retrieval and the duplication of work.
e) To minimise accommodation and information storage costs by ensuring the timely and secure disposal of expired physical and electronic records.
f) To provide organised and reliable records for the management of property, assets, finance, human resources, organisational performance, safety and risk.
g) To facilitate collaboration across NATS and partner organisations through the appropriate sharing of information.
h) To maintain and exploit NATS' institutional memory and knowledge over time, helping to establish lessons learned, maintaining competitive advantage and assisting research and innovation.
i) To comply with legal requirements for maintaining privacy, security, confidentiality, authenticity and integrity of information and support the NATS Information Security Policy.
j) To protect NATS and its employees against litigation.
k) To safeguard NATS' vital records and support business continuity.
l) To ensure the long-term preservation of NATS' archival records to facilitate future business activity, research, reuse and public access.
4 Definitions
Categorisation: the logical grouping of records that assists the management, retrieval and disposal of information. Categorisation is not to be confused with security classification, which assigns protective markings, security levels and access controls to information.
Disposition: the range of processes associated with implementing records retention, destruction or transfer decisions which are documented in retention schedules.
Retention schedules: detail the mandatory retention period for records and the appropriate disposition action at the end of each period. They demonstrate conformance to legislation, statute, regulation, common or local procedure, contract, specification or as required for the operation or maintenance of equipment or services.
Information: an asset recognised by its capacity or potential to provide (directly or indirectly) data or any knowledge, regardless of format or medium. Examples of information would include: documents, photographs, maps, diagrams, electronic files, databases, audio-visual data files, email, voice and facsimile transmissions and recordings.
Non-record: information that is not captured in NATS records management systems, including non-business documents, e.g. spam, junk mail, copies or extracts of documents distributed for convenience or reference, ephemera that does not set policy, establish guidelines or procedures, certify a transaction or become a receipt, and personal messages. Also information with limited or no retention value that is created or received, such as draft, duplicate or routine information that does not add value to the overall business activity.
Record: information created, received or maintained by NATS as evidence and an audit trail of business activity or transactions. Records include policy statements, standards and implementation plans, directives, decisions and approvals for a course of action, documents that initiate, authorise, change or complete business transactions, briefing papers, reports and background papers, agendas and minutes of meetings, correspondence, key project documentation and case files. A sub-set of records may be selected as archives for permanent preservation. Not all physical and electronic information created, received or maintained by NATS is a record (see the Non-record definition above).
5 Policy statements
NATS makes the following organisational commitments:
5.1 Information ownership
5.1.1 All information will be assigned a unique Owner who is responsible for its management unless they formally transfer that ownership to someone else (for example when changing roles).
5.1.2 Although the information Owner is responsible for the management of all the information that they own, all information created or received as part of any individual's role at NATS remains the intellectual property of NATS as a whole and not of any individual, group or business area.
5.1.3 NATS will ensure that all employees, contractors, suppliers and partners are aware of their information management responsibilities when they assume or change their roles and provide relevant training opportunities. In managing information, Owners shall comply with all relevant statutory, regulatory and


security requirements – including not to destroy information where there is a legal obligation or business need to retain it.
5.1.4 Information ownership shall be addressed as part of organisational changes to ensure that information (including records and archives) retains correct ownership. Upon termination of contracts with NATS, all manual and electronic records in the custody of an employee or contractor role must be transferred appropriately and any information management responsibilities (e.g. to update the NATS website or intranet) must be immediately re-assigned.
5.2 Information assets recognised as a valuable corporate resource
5.2.1 Information shall be created, used, shared, stored and disposed of in accordance with licence terms, law, statute, regulation, business requirements, generally accepted practice and relevant contracts or agreements.
5.2.2 Electronic and manual information assets will be stored in an efficient and effective manner throughout their lifecycle, from creation through to disposition.
5.2.3 NATS will create full and accurate records of its business activity in line with operational and legal requirements, supporting business processes. Records must be authentic, reliable, accessible, complete, comprehensive, compliant and secure.
5.2.4 NATS will design and implement records management systems reflecting NATS business policy and requirements and in general accordance with international records management standards and best practice.
5.2.5 Adequate search aids and tools shall be maintained to enable effective and efficient retrieval (or disposal) of information assets.
5.2.6 The costs of managing information to comply with this policy should be commensurate with the value of the information being managed. Consideration should also be given to the size of risk (for example in terms of potential legal liability) associated with an information asset.
5.2.7 Consideration should be given to any unique Intellectual Property Rights (IPR) belonging to NATS contained in any information asset. If this is the case, the information should also be treated in accordance with any relevant IPR policies.
5.3 Quality, reliability and timeliness of information
5.3.1 The quality of information held in NATS systems will be supported by documenting protocols for data collection, entry, verification and maintenance - noting any special arrangements for protection and retrieval.
5.3.2 Regular reviews of information assets should be held to reduce costs associated with information management, e.g. rework; redundancy; over-processing; over-production. Such reviews should be considered when there are significant changes that affect the nature of NATS' business or operations, e.g. reorganisations, new business activities or changes in legislation and regulation.
5.3.3 Records will be given a meaningful and consistent title according to file naming conventions, and organised and categorised coherently and consistently in accordance with corporate guidance issued by Information Management.
5.3.4 Appropriate intellectual and technical information relating to each record (e.g. Owner and Title metadata) will be collected in accordance with the corporate guidance issued by Information Management.
5.3.5 One version of the truth shall be maintained. Copies/duplicates of documents and other information should not ordinarily be stored. The master version of an information asset should be stored under the supervision of its owner in a shared location; all other users of this information should refer to the master version.
5.3.6 Unnecessary duplication should be avoided by integrating different manual and electronic information systems into a single records management system that is pragmatic and fit for purpose.
5.3.7 Employees responsible for updating the NATS website or intranet will do so in a timely fashion, and ensure that the content is appropriate, current, accurate and compliant with corporate guidance issued by Corporate Communications.
5.4 Secure management and sharing of information assets
5.4.1 NATS is one organisation and internal access to information will only be limited if required by the arms-length separation of activities, sensitivity as defined by the Protective Marking Scheme or a specific business requirement established by the Owner.
5.4.2 In accordance with the NATS Security Policy and related NATS Protective Marking Standard, all NATS employees and authorised contractors will:
• Ensure that all the information they own is assigned the correct level of protective marking, security and access controls to prevent unauthorised disclosure of NATS information
• Ensure that protectively-marked material is secure when circulated, stored and disposed of
• Ensure that only authorised employees within their business area use NATS information systems
• Ensure that information is not disclosed to unauthorised individuals
• Promptly report any breach or potential breach of security, e.g. lost keys, misplaced papers or electronic media, to their Line Manager and to Corporate Security
5.4.3 Information Owners, in liaison with Business Process Owners, will identify records that are essential for business continuity and ensure that back-up procedures are in place to promptly restore these vital records in the event of a disaster as defined in NATS resilience plans.

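The default-open access rule in section 5.4.1 (internal access limited only by arms-length separation, protective-marking sensitivity, or an owner-established restriction) can be sketched in a few lines. This is an illustrative sketch only; the class, field and marking names below are hypothetical and are not drawn from the NATS policy or the Protective Marking Scheme:

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class InfoAsset:
    """Hypothetical model of an information asset for the 5.4.1 rule."""
    protective_marking: str = "UNMARKED"   # sensitivity per a marking scheme
    arms_length_area: bool = False         # arms-length separation of activities
    owner_allowed: Set[str] = field(default_factory=set)  # owner-set restriction


def internal_access(asset: InfoAsset, employee: str, clearance: Set[str]) -> bool:
    """Default-open access, limited only by the three cases named in 5.4.1."""
    if asset.arms_length_area:
        # Arms-length areas: only explicitly listed employees may access.
        return employee in asset.owner_allowed
    if asset.protective_marking != "UNMARKED" and asset.protective_marking not in clearance:
        return False  # sensitivity limit from the marking scheme
    if asset.owner_allowed and employee not in asset.owner_allowed:
        return False  # specific business requirement set by the Owner
    return True       # "NATS is one organisation": open by default


print(internal_access(InfoAsset(), "anyone", set()))  # True
```

The point of the sketch is the ordering: access is granted unless one of the three enumerated limits applies, rather than denied unless explicitly granted.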

5.5 Retention and disposition of information assets
5.5.1 The retention and disposition of NATS manual and electronic records will be managed according to the NATS Records Retention Schedules issued by Information Management.
5.5.2 The maximum period for the retention of non-records is two years.
5.5.3 NATS' General Counsel shall be able to suspend information destruction and disposal should it be necessary, e.g. in the case of an internal or external investigation.
6 NATS roles & responsibilities
NATS General Counsel:
• Sponsors this Information, Records and Archives Management Policy (in partnership with the Chief Information Officer).
• Approves record retention periods and maintains the overall list of retention schedules for NATS records.
• Applies 'legal hold' to suspend the disposal of records and information if required.

Business Process Owners (BPOs):
• Ensure that information created, used or referred to by their process meets the policy commitments set out above by maintaining effective systems of information management control.

Business Support Managers (BSMs/BMs):
• Ensure that information managed within their business area meets the policy commitments set out above by maintaining effective systems of information management control.
• Nominate and support one or more employees to ensure the tasks of the Local Information Manager and Information Points of Contact are performed.
• Provide appropriate representation for their business area on the Information Management Steering Group, Electronic Document & Records Management (EDRM) Forum and Hub Publishing Forum.

Local Information Managers (LIMs), on behalf of their business area:
• Act as a strategic focal point for information, records and archives management and develop expertise and experience in this area via briefing and training.
• Participate in the Information Management Steering Group and assist in the development and implementation of NATS-wide policies.
• Ensure that appropriate local governance structures are in place and that all employees are aware of their information management responsibilities and performance objectives.
• Co-ordinate information management activities including development and review of local policies, standards and retention rules, adopting corporate information management standards where applicable.
• Maintain a network of Information Points of Contact to champion and implement information management standards.
• In conjunction with their HUB Publishing Forum representatives, ensure that their HUB content remains relevant and timely.
• In conjunction with their EDRM Forum representatives, ensure that Team HUBs are managed in line with corporate standards and best practice.
• Ensure that information management improvement activities are considered during business and budget planning.
• Ensure information management requirements are included in the contractual terms and conditions for new support or service contracts.
• Ensure the effectiveness of the Information Management strategy to enable new ways of working is regularly assessed and tracked against the objectives and desired benefits within the business area.

Information Points of Contact (IPOCs):
• Serve as a link between business area teams and the Local Information Manager network.
• Support and assist employees within their business area with a wide range of information and records management activities, e.g. capture and organisation of information and records; promotion of email best practice and disciplined use of shared electronic folders; liaison with Content Editors to manage intranet and internet content, where appropriate; preservation of information security in the office; and coordination of the team's use of NATS information services.

Information management team (within the Office of the Chief Information Officer):
• Implements the necessary standards, competency frameworks, procedures and systems to support this Policy.
• Provides professional advisory services across NATS on all information, records and archives management issues, promoting and co-ordinating best practices.
• Facilitates training for Local Information Managers and other stakeholders to implement NATS information, records and archives policy commitments within their business area.
• Designs, implements and regularly reviews policies, records retention schedules and corporate guidance on managing information.
• Monitors, designs and controls workflows to ensure efficient management, archiving and preservation of NATS records.
• Chairs the Information Management Steering Group, EDRM and Hub Publishing Forums.
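The retention and disposition commitments of section 5.5, together with the General Counsel's 'legal hold' (5.5.3 and section 6), amount to a simple disposition check that can be sketched as follows. This is a minimal illustration; the class and field names are hypothetical and not drawn from the NATS Records Retention Schedules:

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class RetentionRule:
    """Hypothetical retention-schedule entry (section 4: mandatory retention
    period plus the disposition action due at the end of that period)."""
    retention_years: int
    disposition: str  # e.g. "destroy" or "transfer_to_archives"


@dataclass
class Record:
    closed_on: date           # end of the record's active life
    rule: RetentionRule
    legal_hold: bool = False  # set by General Counsel (5.5.3)


def disposition_due(record: Record, today: date) -> Optional[str]:
    """Return the disposition action if it is due, else None.

    A legal hold suspends destruction and disposal regardless of the schedule.
    """
    if record.legal_hold:
        return None
    due = record.closed_on + timedelta(days=365 * record.rule.retention_years)
    return record.rule.disposition if today >= due else None


rule = RetentionRule(retention_years=2, disposition="destroy")
rec = Record(closed_on=date(2011, 1, 1), rule=rule)
print(disposition_due(rec, date(2014, 1, 1)))  # destroy
rec.legal_hold = True
print(disposition_due(rec, date(2014, 1, 1)))  # None
```

The design choice mirrors the policy: the schedule drives routine disposition, while the legal hold is an override checked before any schedule logic runs.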


Information Management Steering Group (IMSG):
• Comprises senior representatives from each functional area.
• Sets the strategic direction for information, records and archives management.
• Oversees and approves this Information, Records and Archives Management Policy, and its supporting processes and systems.
• Identifies, prioritises, co-ordinates and manages information management projects, and makes recommendations to relevant business, information and technology investment boards.
• Manages the risk relating to NATS information assets.
• Reports to the NATS Senior Management on a regular basis.
• Identifies training needs, defines training, and reviews delivery.

Electronic Document & Records Management (EDRM) Forum:
• Comprises Local Information Managers and/or Information Points of Contact.
• Discusses and makes recommendations to the IMSG concerning management of electronic documents and records.
• Promotes the delivery of records management training and advice to employees as appropriate.
• Consults with and communicates to their business areas on matters relating to records management.

Hub Publishing Forum:
• Comprises Hub Contributors writing for the intranet.
• Defines, develops and owns appropriate standards and processes for the management and exploitation of the intranet HUB Publishing tool.
• Identifies requirements and opportunities to continually improve the intranet, and enables the sharing of best practice.

Information Solutions team:
• Provides an appropriate and resilient information technology platform, including storage capacity and records management applications to support NATS's information management requirements.
• Ensures integration of relevant IT infrastructure into the NATS Security Policy and resilience plans.
• Provides appropriate technical support for all components of IT systems including capacity, application licenses, upgrades and updates.

Security/Cyber Security teams:
• Provide guidance and advice to management and employees in the area of information security.
• Assess the risks to NATS information and information resources and identify appropriate risk management approaches.
• Set information security standards and provide information security training and support to NATS.
• Coordinate the monitoring and periodic audit of compliance with information security policy, standards and procedures.

Business Intelligence Competency Centre (BICC) team:
• Provides guidance and advice to management and employees in the area of business intelligence.
• Provides an appropriate and resilient business intelligence (BI) platform, including storage capacity, service delivery infrastructure and BI applications to support NATS's information management and business intelligence requirements.
• Provides appropriate technical support for all components of the BI service.

Corporate Communications team:
• Disseminates information fit for publication to the public.
• Provides news media with information needed to ensure transparent, fair, complete and accurate reporting on NATS.
• Establishes a communications structure to inform employees about NATS developments.

Facilities Management:
• Manages the contract and services for offsite storage of physical records.

All employees and contractors:
• Keep accurate and complete records of their business activities, including all relevant correspondence and emails.
• Manage these in accordance with this Policy and any other relevant processes or business area arrangements.

7 Legislative framework
7.1 This policy supports compliance with all NATS' regulatory obligations in accordance with our licence and legal context:
• Companies Act 2006
• Data Protection Act 1998
• International Standard for Records Management BS ISO 15489


• International Standard for Quality Management Systems BS EN ISO 9001
• Limitation Act 1980
Note that this is not an exhaustive list.
7.2 Particular attention is drawn to the Data Protection Act 1998 (DPA), which makes provision for the regulation of the processing of information relating to living individuals. The aim of the Data Protection Act is to provide a balance between the rights of individuals and legitimate data processing operations. Information shall contain personal details (as defined by the DPA) only where it is strictly necessary. Information containing personal information as defined by the Data Protection Act shall not be shared with external organisations without prior agreement from the Data Protection Officer. See Chapter 6 of the Corporate Security Manual for more details.
7.3 NATS is not directly subject to the Freedom of Information Act (FOI) or Environmental Information Regulations requests, but other organisations with which NATS deals might be.

13. ANNEX 2
Five other organisations were consulted during the lessons learned exercise.
A. A public transport organization had an unsatisfactory SharePoint 2007 implementation. They created a site for every single department and invited them to use it during an overnight roll out to 29,000 people. The result was a huge number of unused, underused or abandoned sites.
B. A government body rolled out to 800 staff in 9 months, using only document libraries and no collaborative features. They set up around 150 content types and disallowed folders. Each area of the business had one document library and was told to select as many of the content types as were relevant for use in the business. They migrated 40% of content from their previous document system and deleted the rest.
C. One cultural organization is rolling out SharePoint (with the Automated Intelligence plug-in) in a targeted approach, to areas they think will benefit from SharePoint. They have rolled out to 100 users in one year and are defining specific content types/workflows for teams as needed as they proceed. They had a previous SharePoint 2007 implementation but did not migrate from it.
D. A provider of care services rolled out to 4,000 people in two years with a simplified roll out that involved the definition of only one generic corporate content type, and the provision of libraries for each piece of work. No folders were allowed.
E. A government organization plans to implement to 14,000 people in 18 months (with the Automated Intelligence plug-in). They have a legacy electronic records management system and plan to migrate all its content into the SharePoint Records Centre.

From Casanova to MoReq2010: Ages of Records
Bogdan-Florin Popovici
National Archives of Romania
Bd. Regina Elisabeta nr. 49, Bucureşti, sector 5
+4021 315 25 03
[email protected]

ABSTRACT
It is commonplace to talk about the lifecycle of records or about the ages of records. This is so common among records management and archival practitioners that the sources and the rationale for such classifications are often forgotten. A review of the origins of these approaches is relevant today, not as a historical endeavour as such, but to understand the "why" and "for what purpose". Starting with Casanova or Schellenberg, divisions in the life of records were made in order to facilitate their management, to reveal their ownership and to delineate organizational roles and responsibilities.
The answers matter today because the technology for creating and keeping records has changed, and the environment that once led to "cycles" and "ages" may have changed as well. In a digital environment it may be harder to differentiate between active and semi-active records; but is it still necessary to differentiate? And, moreover, can we speak of purely digital records, or is the rule rather hybrid archives?
This paper goes through some conceptual facets of records and archives management, revealing some of their impacts on specialized software.

Categories and Subject Descriptors
Management; Documentation
General Terms
Management, Documentation
Keywords
records management; archives management; recordkeeping; e-records

1. INTRODUCTION
In the history of recordkeeping, records managers and archivists have often relied on technical professions for the preservation of records. To make sure that the vellum or paper acting as a carrier was still able to preserve the information recorded on it, they called on conservators or restorers with their technical knowledge of the field. Today, for digital records, the situation would be similar, except that the "technical staff" now claim the recordkeeping positions too. Not seldom, some IT specialists today assert that there is no need for information professionals other than themselves, and that librarians, archivists and records managers are about to disappear as professions as soon as THE tool "replaces" them all1.
Far from aligning with such assumptions, the present paper analyses whether the classical recordkeeping theory of the life of records is still valid, and whether the associated recordkeeping responsibilities have any future at all.

2. RECORDS AND ARCHIVES: CYCLES AND AGES
Based on various experiences and reasons, a theory of the "ages of archives" emerged in continental Europe: an archive (seen as the whole of the records belonging to one creating agency, not only the historical records) is born, grows and dies. A very good description of this theory can be found in the book of the famous Italian archivist Casanova (1928). Casanova identifies "current", "repository" and "general" archives2. Archives, he claims, are at the beginning in frequent use, so they are kept in the office; this would be the current phase. After a period they are moved into a repository, close enough for possible use, but distant enough not to impede current activity. And after another number of years, records are moved into the general archives. The French archivist Yves Pérotin wrote a magazine article about "les trois âges des archives" (1962)3. He identified "current archives", "repository archives" and "archived archives", and the article might be considered the birth of the French archival classification into "archives courantes, intermédiaires et définitives"4.
The above system is largely used in Europe, with the notable exception of Germany, where records traditionally may have two "ages"5.
Based on the German model6, on the Anglo-Saxon tradition of managing records and on the efforts of the U.S. NARS in the 1940s7, Th. Schellenberg made a flat separation in the ages of archives. In his approach, records are:
"All books, papers, maps, photographs, or other documentary materials, regardless of physical form or characteristics, made or received by any public or private institution in pursuance of its legal obligations or in connection with the transaction of its proper business and preserved or appropriate for preservation by that institution or its legitimate successor as evidence of its functions, policies, decisions, procedures, operations, or other activities or because of the informational value of the data contained therein."
while archives are:
"Those records of any public or private institution which are adjudged worthy of permanent preservation for reference and research purposes and which have been deposited or have been selected for deposit in an archival institution."
Based on this separation, Schellenberg went deeper, identifying stages of the records life-cycle within the "records age":
"Records management is thus concerned with the whole life span of most records. It strives to limit their creation, and for this reason, one finds "birth control" advocates in the record management field as well as in the field of human genetics. It exercises a partial control over their current use. And it assists in determining which of them

should be consigned to the "hell" of the incinerator or the "heaven" of an archival institution, or, if perchance, they should first be held for a time in the "purgatory" or "limbo" of a record center."8
Since Schellenberg's presentation, the life-cycle theory has developed and evolved. A synthetic presentation of what the life-cycle model means today would be the following:
"In stage one, the record is created, presumably for a legitimate reason and according to certain standards. In the second stage, the record goes through an active period when it has maximum primary value and is used or referred to frequently by the creating office and others involved in decision making. During this time the record is stored on-site in the active or current files of the creating office. At the end of stage two the record may be reviewed and determined to have no further value, at which point it is destroyed, or the record can enter stage three, where it is relegated to a semi-active status, which means it still has value, but is not needed for day-to-day decision making. Because the record need not be consulted regularly, it is often stored in an off-site storage center. At the end of stage three, another review occurs, at which point a determination is made to destroy or send the record to stage four, which is reserved for inactive records with long-term, indefinite, archival value. This small percentage of records (normally estimated at approximately five per cent of the total documentation) is sent to an archival repository, where specific activities are undertaken to preserve and describe the records"9.
Instead of "life-cycles", some archivists have pleaded for a "continuum". One already classic article announcing a new theory came from Canada, from Jay Atherton10. A more articulate and profound elaboration of the continuum theory came from Australia, mainly from a research team at Monash University. Beyond crediting Atherton with the first use of the term and the idea of unification, the Australians' theory of the continuum had such a great impact that many tend to grant them the privilege of having founded this approach11. The main features of the theory were presented in two articles by Frank Upward12, in 1996-1997, but (1) the ferment of this theory came from Australian archival practice13 and (2) the articulation of the theory was the result of the efforts of the Records Continuum Research Group at Monash University14.
As it is known today, the records continuum is defined as "[T]he whole extent of a record's existence; refers to a consistent and coherent regime of management processes from the time of creation of records (and before creation, in the design of recordkeeping systems), through to the preservation and use of records as archives."15
The "ages" and "cycles" approaches were formulated in the full-paper era, when records were mainly paper based and alternative media (although also analogue) were not so relevant. The continuum model was developed later, at the dawn of the digital era. This chronology might lead to the assumption that the difference between the two is an issue of modernity and actuality. In fact, a deeper analysis proves that this assumption is not exact. In a study by InterPARES16, the two main approaches were named the Chain of Preservation Model (for the "ages" approach) and the Business-driven Recordkeeping Model (for the continuum). That is, in a broader framework, the first model focuses on preservation and marks distinctly the phases and (implicitly) the responsibilities built into these phases. Resulting from practical experience, each stage marked a change of preservation space for records, a change of responsibility and, in many situations, a change of finding aids. In the first age or stage of life, records are created in the registries or in offices. The responsibility for maintaining records is the task of the workers, and records are housed in the offices; varying from country to country, one may find record-level registration. In the semi-active stage (and intermediate archives), the records manager is in charge of the folders received from the creating offices. Often the "finding aids" are at series or folder level, and the records are preserved outside the regular "production space", in a central repository of the organisation. In the inactive stage, records become inactive and records managers and/or archivists may start to appraise the "stuff". While the cycle is then closed, the age theory goes further, defining the "definitive archives" that fall under the archivists' jurisdiction and are kept in the archival repositories.
From a different perspective, the records continuum pays the least attention to preservation, emphasizing the creation, aggregation and use of information and stressing the need to prepare the future management of records (in different respects) by harvesting meta-information from the first phase onwards. E. Shepherd and G. Yeo notice that the new interpretation of the life of records seen as a continuum is an abstraction of it, arguing that "[s]pecific practices will vary from one working context to another, but models based on the lifecycle concept or the entity life history can help to identify stages and actions within a records management programme, and thus provide a useful framework for planning and implementation."17

3. DIGITAL RECORDS: NEW APPROACHES OR ONLY ADJUSTED ONES?
But how do these models apply to the digital world of records today? Is there any relevance for a life cycle, or do we only have large systems that deal with everything? Can we indeed speak of the death of records cycles and of the responsibilities of the manager/archivist, melting everything under the information governance umbrella?
In my opinion, it is obvious that the bearings of the classic records and archival manager are about to change (if they have not already). Despite a hybrid environment today that might still accommodate old-fashioned professionals, the new technology for creating and managing records has triggered a change in the competencies and responsibilities of recordkeeping professionals. The records manager today does not care to move piles of folders from offices to a repository; all of them are in the IT system, and one can hardly tell the difference between active and semi-active unless the system gives notice. That is, the role of the records manager has shifted from acting in the course of the lifecycle of records to delivering input at the beginning (in designing/setting requirements for the system) and at the end of the lifecycle. And this apparently makes the lifecycle itself meaningless.
I said apparently because this linear course of action can be undermined by the lifecycle of technology. As MoReq2010 outlines and specifies, in the digital environment there is a lifecycle of technology, and one may wonder whether this is not the new cycle of records too. MoReq2010 is the only specification strongly emphasizing the need for the exportation of records out of a system, in order to avoid technological obsolescence. I believe that we have shifted from records-based cycles to technology-based cycles, and we should map the responsibilities of paper records managers onto those belonging to the new… e-records manager.

One cycle is over, but… will all the information be migrated? Yes, the regulations might say so, the standards might say so, but what will the budgets say? Will organizations pay to transfer non-operative information only because best practices, or even the needs of history, say so? Doubtful, considering what they did with paper records. This is why paper-based records were so often associated with dark, dusty cellars: because of the “great” concern and care of organizations for their records… Since, as we already know, paper records are more resilient than digital records in case of abandonment, I consider this new cycle one that the records manager should observe carefully, in order to attempt at least a documented setting-aside of those records if their full transfer into a new system proves impossible. A fully managed life of e-records would be magnificent, but so it would have been with paper records; unfortunately, it rarely was so.

Very nice standards come out of the laboratories of records and information thinkers, but I think we should wait for real-life situations, for instance ones where the disposition of out-of-system e-records is attempted, to see how this can really be done. How will records from 20 years ago be appraised, at file, folder or class level, since they are not migrated to new platforms?[18] Above all, how will it all be managed?

Another strong issue in these years is the advent of long-term digital preservation. Following Jeff Rothenberg’s famous saying,[19] such systems seem more and more necessary not only for historical archives, but also for what used to be considered semi-active or inactive records. In such circumstances, some have argued that the archivist’s mission is over, since the creators themselves assure long-term preservation. Why should we need archives when, mostly in a post-custodial paradigm, institutions and companies keep and maintain their own records?

A first remark: this approach is not new. Historically speaking, even if such a technological shift probably did not exist, there was a shift of knowledge or a shift of mandate. When chancelleries moved from Latin to national languages, within one or two generations all the past records became a bunch of useless papers, completely incomprehensible to ordinary people. Those records were no longer an information asset, as long as they were intellectually inaccessible. Also, when modern institutions appeared, the look was towards the future, not the past; the administrative rules changed, so the information in old records became useless from an organizational point of view as well. In most cases, this is how National Archives were founded: as state repositories for papers useless for current business that might, however, still be interesting from a cultural point of view. What assurance can there be that such a scenario is not repeatable? How can anybody know that a public organization will pay forever to maintain its “historical” records and information over the long term, and not decide at a given moment to move them elsewhere, if society still considers them relevant from a cultural point of view? In my opinion, not only might this happen, it is the most realistic scenario. And then the archivists (or their equivalents) will face exactly the same challenges as today: not one schema for the arrangement or description of records, but hundreds; not one documented process, but thousands; not several types of records, but millions. It would be very nice to hear about experiences of moving many huge databases to National Archives, with different structures, different data types and different triggers and behaviours. How will this be dealt with? And in this case, of course, the original metadata will not be enough; of course, the archivists will need to document more of their actions, the context, the provenance, etc. Maybe the actual archivists will be overrun by the task, but have no doubt that there will be a need for e-archivists in the future.

A final issue I would like to address is the need for a mixture of different types of information systems. I would like to stress that I do not believe a system that can combine all recordkeeping functionalities at a reasonable price can exist. Or, at least, not now. Or, at least, not affordably for everybody. Therefore, we are still sentenced to having production systems (ECM, EDRMS, CRM etc.) and preservation systems (like OAIS-compliant products). Also, as the myth of the paperless office is still waiting to come true, the rule seems more and more to be hybrid management systems, for paper and e-records. Since the new cycles of life determined by the state of technology last somewhere between 5 and 10 years, and the retention periods of records may exceed this period, it is clear that long-term preservation systems should have hybrid records management functionalities, since not all the records in such a system should be kept “forever”, and such a system should deal with retention periods and dispositions in due time. As far as I have studied the different systems on the market, the producers are only concerned with offering a “permanent” life for digital assets, even if one would only need, let’s say, 50 years…

4. CONCLUSION
In conclusion, the models for the life of records and for the responsibilities of recordkeeping professionals inherited from the paper world are not necessarily obsolete, but they (will) suffer a change in meaning, becoming more and more substantiated as a technological phase in the existence of records and not simply a management one. Due to limited technological cycles, some of the tasks belonging to the archivists-of-the-past (in terms of long-term preservation) might fall to the records/information-managers-of-the-present. At the same time, the tasks of the archivists/information-managers-of-the-past might keep being the collection of various sources of information assets, as today, adding new metadata and documenting provenance. Therefore, I truly believe that, with the mandatory IT skills, specialists in recordkeeping will have a job to do in the future.

5. REFERENCES
[1] For me, a relevant expression would be even the present publication guidelines, which seemed to be dedicated to technical papers only (see, for instance, the index based on the Computing Classification Index).
[2] E. Casanova, Archivistica, Siena, 2nd ed., 1928, pp. 21-23.
[3] Yves Pérotin, L’administration et les trois âges des archives, in « Seine et Paris », no. 20, octobre 1961 (excerpt), p. 4.
[4] See, in this respect, Manuel d’archivistique, Paris, 1970, p. 122.
[5] Luciana Duranti, Archives as a Place, in “Archives and Manuscripts”, vol. 24, no. 2, p. 249. The same in Hungary; see A szocialista országok jelenkori levéltári terminologiajának szotára, 1988.
[6] Luciana Duranti, Archives…, p. 249.
[7] Sue McKemmish, Yesterday, Today and Tomorrow: A Continuum of Responsibility, first published in Proceedings of the Records Management Association of Australia 14th National Convention, 15-17 Sept 1997, RMAA Perth, 1997. Online at: http://www.infotech.monash.edu.au/research/groups/rcrg/publications/recordscontinuum-smckp2.html
[8] All quotes are from Th. Schellenberg, Modern Archives, Chicago, 1956, pp. 37-38.
[9] Philip C. Bantin, Strategies for Managing Electronic Records: A New Archival Paradigm? An Affirmation of Our Archival Traditions?, p. 3, at http://www.indiana.edu/~libarch/ER/macpaper12.pdf. This general presentation may fit in most of the cases, but the way stages and other accompanying details are arranged may differ. See Ira Penn, Gail Pennix, Jim Coulson, Records Management Handbook, Cambridge, 1994, p. 13; E. Shepherd, G. Yeo, Managing Records. A Handbook of Principles, Facet Publishing, 2003, p. 8; James B. Rhoads, The Role of Archives and Records Management in National Information Systems: A RAMP Study, Paris, 1983, p. 2; Jay Atherton, From Life Cycle to Continuum: Some Thoughts on the Records Management-Archives Relationship, in “Archivaria” 21 (Winter 1985-86), p. 44.
[10] Atherton, op. cit., p. 48.
[11] See, for instance, Shepherd, Yeo, op. cit., p. 9.
[12] Frank Upward, Structuring the Records Continuum - Part One: Postcustodial Principles and Properties, in “Archives and Manuscripts”, 24 (2) 1996 (online at http://www.infotech.monash.edu.au/research/groups/rcrg/publications/recordscontinuum-fupp1.html); idem, Structuring the Records Continuum, Part Two: Structuration Theory and Recordkeeping, in “Archives and Manuscripts”, 25 (1) 1997 (online at http://www.infotech.monash.edu.au/research/groups/rcrg/publications/recordscontinuum-fupp2.html)
[13] McKemmish, loc. cit.
[14] http://john.curtin.edu.au/society/australia/
[15] Original Australian definition in AS 4390:1996, Part 1, 4.22.
[16] Luciana Duranti and Randy Preston (eds.), International Research on Permanent Authentic Records in Electronic Systems (InterPARES) 2: Experiential, Interactive and Dynamic Records, Padova, 2008.
[17] Shepherd, Yeo, op. cit., p. 10.
[18] This is in fact the main vulnerability of many standards and regulations in the area. No matter how important a piece of information may be as an asset, information is a perishable product; hence, its cost of maintenance and management should be justified. Over-regulation in the field of RM (which, whatever is said, is a supporting process) would only lead to a risk-based examination and, I am afraid, to an almost full rejection of best practices if they are too costly. And technologies for managing e-records over time are (yet) really costly…
[19] “Digital objects last forever—or five years, whichever comes first”. It was mathematically and humorously represented as min(∞, 5).

Introducing MoReq, 4th Edition, and what comes next

Jon Garde
Senior Product Manager at RSD
The Hollies, Breadcroft Lane
Maidenhead SL6 3NU
United Kingdom
[email protected]

ABSTRACT
This presentation is the announcement of the latest version of MoReq, the fourth edition of the specification since it was first published in 2001 and the successor to MoReq2010. The specification itself will be released by the DLM Forum to coincide with the Triennial Conference in Lisbon. The presentation will be accompanied by an introduction to the main themes of MoReq, an overview of the new modules, an explanation of the layout of the specification, as well as a demonstration of MoReq in use. The demonstration will include a new web-services-based interface to MoReq Compliant Records Systems.

Keywords
MoReq, Specification, Standard, ISO 15489, Electronic Records, Records Management

(Final text not received in time)

Search, Discovery and Harmonization of Diverse Digital Contents

Mikko Lampi
Mikkeli University of Applied Sciences
Patteristonkatu 3 D
50100 Mikkeli
+358504364161
[email protected]

Aki Lassila
Disec Ltd.
Sammonkatu 12
50101 Mikkeli
+358400869955
[email protected]

Timo Honkela
University of Helsinki, Department of Modern Languages
National Library of Finland, Centre for Digitization and Preservation
Nykykielten laitos, Kieliteknologia
PL 24, 00014 HELSINGIN YLIOPISTO
+358504480953
[email protected]

ABSTRACT
This paper provides an overview of the search, discovery and harmonization of diverse digital contents. Each concept is described in detail and illustrated with use cases and examples. The requirements and drivers are studied in order to make harmonization possible, and the process and technologies for harmonization are discussed. Information extraction and indexing are presented as the foundation for these concepts. Emphasis is also put on search and access, with strong use-case examples. Analysis and advanced discovery are reviewed from a scientific point of view in contrast with some use cases. Three use cases are used throughout the paper: the Open Source Archive, Capture and Finna projects. The paper represents views from each project, combining pragmatic experience and development with a scholarly approach.

Categories and Subject Descriptors
H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing, H.3.3 Information Search and Retrieval, H.3.4 Systems and Software, H.3.5 Online Information Services, H.3.7 Digital Libraries

General Terms
Management, Performance, Experimentation, Standardization, Languages, Theory.

Keywords
Information, Metadata, Digital Content, Extraction, Indexing, Harmonization, Search, Discovery, Access, Information Analysis, Digital Archive, Digital Repository

1. BACKGROUND
Information governance in a distributed environment is challenging due to the complexity and diversity of contents, data sources and standards (W3C 2009). However, harmonization enables discovery, access and analysis of the information in digital information systems. The availability of harmonized data therefore promotes the discovery of useful information, and of relations within the data that might otherwise remain undetected. In addition, knowledge discovery can also take place based on unstructured data. Unsupervised data and text mining techniques can be used to automatically find key phrases and taxonomies that can be used in metadata construction and harmonization. Such techniques provide support for interoperability in cases where a full harmonization of conceptual content is difficult or impossible for theoretical or practical reasons. The theory of conceptual spaces (Gärdenfors 2014) provides a rigorous framework for the description of data in cases where logical formalization may fall short. Furthermore, harmonization can also be achieved by utilizing ontologies, vocabularies and other linked data, transformations and mappings.

As is generally known, information can be indexed and further analyzed. Language and entity identification supports natural language processing and understanding of language-specific properties. Finnish language processing is used as a use case in this paper. Indexing provides fast access and additional ways, such as faceting and geospatial analysis, to discover, access and visualize the information. Harmonized metadata can be linked and exposed as public or private linked data. The use of native linked data technologies enables the efficient exploitation of information and open data (Bizer et al. 2009).

Searching has become more than just using a web search engine like Google or Bing. Searching is now associated with discovery platforms with full-text and natural language search possibilities, which also include features such as visualizations, facets and mashups. In addition, usability and user experience are very important factors in search and access. The platforms should support complete machine-readability and data interoperability. The trustworthiness of data sources is another important aspect.

To demonstrate the concepts and technologies, this paper draws on three projects: Open Source Archive, Capture and Finna. Open Source Archive (OSA) is a project executed by Mikkeli University of Applied Sciences (MAMK) and funded by the European Regional Development Fund. The project started in June 2012 and ends in December 2014. The primary objective of the OSA project is to find and develop open source tools and solutions for digital archives. Its key features include archival materials and lifecycle management, long-term preservation, ingest, search and access. OSA software is based on well-known open source software. Later in this paper, OSA is used to refer to the digital archive software unless stated otherwise. The OSA project is based on Capture, which was a data modeling and digital archive definition project by the Central Archives for Finnish Business Records (ELKA) and MAMK. It was executed during 2011-2012.
The primary deliverables from Capture were a concept of a harmonized metadata model and the specifications for a modern and flexible digital archive system.

The third case, Finna (www.finna.fi), was started in 2012 and is part of the Finnish National Digital Library program, which aims to “ensure that electronic materials of Finnish culture and science are managed with a high standard, are easily accessible and securely preserved well into the future” (National Digital Library, http://kdk.fi/en). Briefly, Finna is an online discovery service for all Finnish materials held by libraries, archives and museums. The items can be books, drawings, old advertisement brochures, scientific articles, etc. Finna’s long-term objective is to provide information from each and every Finnish memory organization’s content in a meaningful way. Finna relies heavily on indexing, the harmonization of metadata and the other issues discussed further in this paper.

The paper is organized as follows. The second section is about extracting and indexing information. Section three discusses metadata harmonization and some practical examples of it. In section four, search and access are reviewed via the use cases. Section five reviews discovery and analysis. The paper concludes with a discussion of the results and suggestions for future research and development.

2. EXTRACTING AND INDEXING DIGITAL CONTENTS
The first step in harmonizing digital contents is to extract the metadata and the file content in machine-readable form. Extraction requires that each format has a compatible parser. There are easily tens of formats for rich text documents, audio, moving image, pictures and other available and valuable digital contents. Each format requires a parser library which can extract its technical metadata, embedded descriptive metadata and the actual content. For archival usage, one must know the significant information for the specific format in order to preserve it correctly. Different tools provide different technical outputs, which need to be mapped and processed before forwarding the information for harmonization and indexing. After the initial extraction, the data is in usable form but by no means harmonized or normalized.

A widely used extraction solution is Apache Tika. It can be used to extract information from documents and to detect the language automatically. Tika will identify the file and automatically select a suitable parser if one is known. Automation can be achieved by integrating Tika or other data extraction solutions with indexing engines. Tika will be implemented in the OSA project and is widely used in other archival software developed at MAMK.

Indexing is necessary for efficient access to huge amounts of textual data, such as the extracted contents of rich text documents. Usually the index itself is a binary-format data store. It does not replace or make obsolete the original data but supports its usage. Indexing is required to enable feasible and efficient processing of time-consuming tasks such as full-text search and certain analysis processes. Analysis and data mining are described in more detail later in this paper.

The basic principle in indexing is similar across different technical solutions. Databases and other data stores can be indexed for faster read operations and information retrieval. Write operations become slightly slower, but the performance gain is manifold, because writes are usually done less often while reads are more or less continuous. Search engines use indexes to rapidly find relevant information based on the search terms and then return objects from the data store. While most of the operations could be completed without indexes, they would often be very inefficient. The performance difference is even more drastic if the data is read from a file system or disks instead of from memory.

Furthermore, full-text indexing is very useful for unstructured digital contents. It enables the full-text search which users are used to from search engines like Google. Other benefits of full-text indexing include statistical information based on the indexed terms and their respective hit rates. Full-text search is discussed in more detail in section four of this paper.

One of the most used indexing solutions is Apache Solr, which was also used in OSA and Finna. In addition to indexing, Solr provides search features and tools for simple analysis. It can be extended with various plugins, such as information extraction with Apache Tika.

Language processing is a critical part of full-text indexing. It provides the accurate and valid identification of terms and entities. Some languages, such as Finnish, have inflected forms and thus require the basic forms of words to be determined. This can be very problematic without vocabularies. There are also other entities, such as proper nouns, which need to be detected and indexed correctly. In some cases specific entities need to be removed to protect privacy or confidentiality. In the OSA project, Apache Solr was used in combination with the Voikko library for accurate Finnish-language indexing and queries. Due to the nature of the Finnish language, Voikko includes an extensive vocabulary in addition to the grammatical rules. Voikko is an open source project used in projects like LibreOffice. The integration of Voikko and Solr was developed as open source by the National Library of Finland as part of the Finna project (http://www.kdk.fi/en/public-interface/software-development).

In addition, indexed terms can be linked to ontologies or vocabularies for formal definitions and interoperability. For example, indexed Finnish place names could be linked with the national spatio-temporal ontology SAPO. The information would be more usable in a geospatial information system than unnormalized terms.

3. METADATA HARMONIZATION
Metadata harmonization is a process consisting of multiple steps, both technical and non-technical. The main drivers for harmonization are interoperability and feasibility (e.g. Nilsson 2010). While many of the entities described belong to the humanities, for instance, rather than to technology, the information systems require structured and machine-readable data. The results are new or better services for consumers and a better understanding of the materials.

Different fields and industries have specified their own metadata standards to support their contents and activities. For example, MARC21 (http://www.loc.gov/marc/) is widely used in libraries and LIDO (http://lido-schema.org) in museums. Most of the standards have in common that they support the specific metadata and objects well but are not intended for information systems management or information exchange. LIDO, for example, covers all kinds of museum objects, such as art, architecture, cultural history, the history of technology and natural history. LIDO enables the creation of normalized records for the museum context.
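The indexing principle described in section 2 (slightly slower writes in exchange for much faster term-based reads) can be illustrated with a toy inverted index. This is a plain in-memory Python sketch with invented sample documents, not how Solr actually stores or queries its index:

```python
# Toy inverted index: map each term to the set of document ids containing it.
# Building the index costs extra work at write time, but lookups then avoid
# scanning every document, which is the trade-off described in the text.

def build_index(docs):
    index = {}
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            index.setdefault(term, set()).add(doc_id)
    return index

def lookup(index, *terms):
    """Return ids of documents containing all given terms (AND query)."""
    sets = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*sets) if sets else set()

docs = {
    "d1": "harbour plans for the city of Mikkeli",
    "d2": "railway plans and drawings",
    "d3": "photographs of the Mikkeli harbour",
}
index = build_index(docs)
print(sorted(lookup(index, "mikkeli", "harbour")))  # documents matching both terms
```

A real engine such as Solr adds analyzers on top of this structure (for Finnish, e.g. the Voikko integration mentioned above), so that inflected word forms are reduced to a base form and hit the same index entry.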
These records can be further enhanced by providing ontology linking. Semantic records can then be shared with other systems and environments.

Metadata interoperability is one of the primary drivers for metadata harmonization. Interoperability requires that metadata records are machine-readable and compatible with each other. The Dublin Core Metadata Initiative defines metadata interoperability as the ability of two or more agents, such as information systems and software components, to exchange metadata so that the interpretation remains consistent with the original context and information (Nilsson 2011).

Interoperability means that normalized records conform to metadata models which can then be mapped to ontologies, vocabularies and other metadata models, which can be internal models or metadata standards. Both Finna and OSA have adopted mapping as the primary method for harmonizing metadata (see Finna 2014). The basic principle is to map various input formats into an internal umbrella model. Finna creates a machine-readable index for materials originating from Finnish archives, libraries and museums. OSA harmonizes data first into a master data model which is then used to generate the index. Finna and OSA serve different purposes, but the reasons for harmonization are the same. They both need to ingest diverse contents and provide access and management in a coherent manner. It is not feasible to implement different user interfaces, application logic and user experience for each kind of data.

During the Capture project, additional drivers for interoperability were identified. Firstly, it was confirmed that there is a need for digital archive and repository services, preferably hosted as SaaS. Such a service was built as part of the OSA project. This approach put lots of different files, metadata, standards and formats into one system which all the tenants share. It has a single core repository, Fedora Commons, which manages the content and the metadata. Fedora can manage all the files and metadata formats as separate streams, but this can end in complexity creep and a hard-to-manage environment. It is more efficient to harmonize as much as possible (Lampi & Alm 2014).

An umbrella metadata model, known as the Capture model, was designed to tackle the harmonization challenges. The Capture model was designed to be compatible with several national and international metadata models such as Dublin Core, SFS 5914, JHS 143 and SÄHKE2 (Alm 2013). It can be extended to support other standards and custom metadata definitions as needed. Because of the extent of the unified model, a smaller piece of it can be defined as a content model for various content types. Each content type is fully compatible with the main model. Metadata values can be links to ontologies and vocabularies. Contents described with the Capture model form a linked data network which can be private, public or a hybrid (Lampi & Alm 2014).

Furthermore, an important lesson learned is that a harmonized model cannot dictate too many restrictions. The umbrella model needs to support all kinds of needs and provide a coherent internal harmonization framework. Restrictions like cardinality and locale-based settings need to be applied in the interfaces pulling and pushing the data. In OSA, mappings and transformations are an integral part of the architecture. Because OSA is a multi-tenant environment, each organization has its own set of mappings which bind the data to user interfaces and APIs. Each mapping is also archived, so that the original meaning and the knowledge of how to read it are preserved. The mappings can be executed technically with any suitable transformation method, such as XSLT. This way harmonization is a lossless and two-way process.

The harmonization process should be automatic, which means the data models and interfaces have to be machine-readable. This is achieved by providing sufficient technical information for processing the data models, metadata and contents. The data itself has to be structured or otherwise machine-readable. Finally, there need to be APIs for data operations. The APIs can be public or private, and a public API can be used to deliver non-public content. More about open data and open API concepts can be read elsewhere.

Finally, harmonization is not all about technology. A very important factor is communication between all involved parties. Understanding the context and meaning of the materials is essential to preserving them unaltered during the process. For example, in the Finna project the harmonization work was from the beginning a mutual task with users and the involved organizations. The understanding of the needs and usage of metadata has grown during the project, and it is a continuous process.

4. SEARCH AND ACCESS
Search and access in this context is more than a textbox-based search engine like Google or Bing. It is a combination of a discovery portal, a browsing catalog, a recommendation and curation engine and a technical platform. Search is a method of finding interesting records and objects from possibly huge data sets but selected sources.

Traditional search engines provide some information based on user-entered search terms, from various and unknown sources, in some format depending on the source, with very limited metadata. The algorithms and indexes are good, but all else is just counting on luck. With digital archives, repositories and other kinds of collections, one cannot afford Google-like results: if it is not on the first page, it probably won’t be found; and if Google cannot find it, it doesn’t exist at all.

Now, let’s look at the differences in the search and access features of OSA and Finna. Search in both systems provides a highly configurable search page. It includes a familiar full-text search and, depending on the system, multiple additional search fields for boolean logic expressions, pre-fetched facets (e.g. temporal, spatial and content-type searches) and some visualizations for those. In OSA, it is possible to estimate the accuracy and count of the search results before rendering them. Of course, the full-text search can be used like a Google or Bing search.

Next, the search is performed against the sources. Finna and OSA are not web search engines; instead, they find contents in their indexes. Finna finds materials submitted by Finnish libraries, archives and museums. OSA is more complex, since it has public contents as well as restricted and confidential content per organization. By default, OSA searches materials based on the user information, such as organization, roles and access rights. If no user information is found, it will search only the public materials of a specific organization. OSA is not a portal like Finna: each user interface is for a single organization only. Currently there is no cross-organizational search, but it is technically possible to build. Put differently, Finna and OSA use reliable and selected sources. Full-text search can understand languages and identify words, synonyms and other entities. OSA will also search the contents of rich text documents such as PDFs.
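As a minimal sketch of the mapping principle from section 3 and the facets used in search: two records in hypothetical source schemas are mapped into a small Dublin Core-like internal model (keeping the original metadata as additional information, as both systems do), then faceted and filtered. The schemas, field names and records here are invented for illustration; OSA and Finna use real standards and Solr, not this in-memory code:

```python
# Illustrative sketch (hypothetical schemas and field names): harmonize two
# source formats into one internal, Dublin Core-like umbrella model, then
# compute facet counts and run a naive full-text filter over the mapped values.

MAPPINGS = {  # per-source mapping table: source field -> internal field
    "marc_like": {"245a": "title", "100a": "creator", "260c": "date", "655a": "type"},
    "lido_like": {"titleSet": "title", "actor": "creator",
                  "eventDate": "date", "objectWorkType": "type"},
}

def harmonize(record, schema):
    """Map a source record onto the internal model, preserving the original."""
    mapped = {MAPPINGS[schema][k]: v for k, v in record.items() if k in MAPPINGS[schema]}
    mapped["_original"] = record  # store original metadata as additional information
    return mapped

def facet_counts(records, field):
    """Count the values of one internal field across harmonized records."""
    counts = {}
    for r in records:
        v = r.get(field)
        if v:
            counts[v] = counts.get(v, 0) + 1
    return counts

def search(records, term):
    """Naive full-text match over the mapped (non-original) values."""
    t = term.lower()
    return [r for r in records
            if any(t in str(v).lower() for k, v in r.items() if k != "_original")]

library_rec = {"245a": "Mikkeli harbour plans", "100a": "Unknown",
               "260c": "1931", "655a": "drawing"}
museum_rec = {"titleSet": "Harbour crane", "actor": "Unknown",
              "eventDate": "1931", "objectWorkType": "photo"}

catalog = [harmonize(library_rec, "marc_like"), harmonize(museum_rec, "lido_like")]
print(facet_counts(catalog, "date"))                 # both records share the 1931 facet
print([r["title"] for r in search(catalog, "harbour")])
```

The point of the sketch is that once both records conform to the same internal model, one search box and one set of facets can serve both, while the archived `_original` record keeps the mapping lossless.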

The search results are returned with harmonized and standardized metadata, in an easy-to-read and understandable format. Rich metadata enables a configurable result page and additional methods of refining the results. Finna uses a template-based, modular interface with some customization options. It provides a selection of the results with small thumbnails and nutshell information on the search. The search can then be refined with facets or additional search terms. OSA has a completely configurable results view which can automatically adapt to the returned data. For example, if all the search results are pictures, a thumbnail view can be shown. Each organization can define the significant metadata which is displayed automatically. The search view can show different amounts of information based on whether the user is logged in or not, and depending on the roles and access rights. Harmonization makes it possible to use common search terms and facets to search and filter digital contents.

Both systems provide access to diverse metadata records and files in a coherent manner. They support storing the original metadata as additional information. Metadata records, previews and the like can be displayed for the search results. All the data available in the index can be used for searching and can be exposed as a facet. Facets are valuable before and after the search. Before the search, facets can provide suggestion and completion features and help to choose search terms that will return meaningful results. After the search, they can help to profile the results and filter the records. OSA provides download, preview and management options according to roles. Due to the origins of its materials, Finna can also display additional information on findings, e.g. whether they are available for lending, like books in the libraries.

The technical solution under the hood in both systems is Apache Solr. The front-end and search logic are built on top of that with different technologies. In OSA, the front-end is based on earlier development done by MAMK's digital archive projects and services. Finna uses the open source VuFind and a custom-made back-end for management. Both projects have put a lot of effort into usability. The development model in OSA and Finna is based on agile methodologies, and emphasis is put on listening to feedback from participants.

5. DISCOVERY AND ANALYSIS
In addition to relying on metadata, it is possible to extract useful information from the text collections themselves using text mining. Text mining can be used to help in the formal description of the content through automatic term extraction (Paukkeri et al. 2008) or taxonomy learning (Paukkeri et al. 2012). Complex morphologies can also be modeled using a data-driven approach that has been successfully implemented in the Morfessor method (Creutz & Lagus 2007). In conceptual modeling, a data-driven approach is also possible. In an early work, term-document matrices were analyzed using the self-organizing map algorithm to create maps of documents (Kaski et al. 1998). The similarity relations between the documents emerge based on the contents of the documents, without any predetermined categorizations. Since then, this kind of topic modeling has become very popular (see Steyvers & Griffiths 2007, Brauer et al. 2014). Not only can the relations of documents and their topics be analyzed in a data-driven manner; the relations and features of words can be analyzed using similar methodology (Honkela et al. 2010, Lindh-Knuutila & Honkela 2013). From the conceptual point of view, semantic modeling in these approaches takes place within the vector space model, which has a long history in information retrieval research (Salton et al. 1975). This idea has been systematically explored in the formulation of the theory of conceptual spaces (Gärdenfors 2014).

OSA demonstrates discovery and analysis by utilizing the object network created by Fedora Commons. Each entity archived or stored in OSA is a compound object consisting of multiple data streams. Fedora Commons uses a specific stream to store each object's relation information in RDF/XML format. Relation information is then indexed to a resource index, which is an RDF database. By default, Fedora Commons 3 ships with Mulgara, which can be queried e.g. with the SPARQL language. Objects in the RDF database form a linked data network. OSA supports relations of any kind between the objects, but currently only Dublin Core relations and a content model definition are being used. The relations network enables analysis of how entities are related and how distant the relation is. Another use case is the archival hierarchy catalog, which can be built automatically and dynamically from isPartOf relations.

Discovery was found useful in the Capture project when planning how the existing object network could enrich new objects during the ingest and description process. The basic concept is that an object gains partial or complete context from surrounding linked objects such as agents, places, events and actions. These contextual objects can be formalized via ontologies or vocabularies (Lampi & Alm 2014). This improves the description speed, and information duplication is minimized. Enrichment can take place during ingest or access, depending on the need. The principle is the same regardless of the timing. The process can be automatic or controlled by a user. It can add the information to the object's metadata or just modify the index, leaving the original object unaltered. These developments done in the OSA project are experimental and not in production use.

6. SUMMARY
Based on the experiences and lessons learned in three case projects, it can be said that harmonization is an integral part of search, discovery and access. Depending on the source materials, harmonization can require extraction and normalization before indexing can be done.

The current trend in repositories and archives is towards digitalization, which causes fast growth in the amount and diversity of digital contents. The experience and research done in memory organizations could help commercial companies. This is due to the fact that the challenges and drivers are more or less similar with every kind of content, regardless of the owner organization.

In addition, new tools related to big data, analysis and data mining could add value to existing data that is stored in the information systems of archives, museums and libraries. However, in order to utilize new technologies and methods, the data must be in good condition regarding usability. There are different aspects to content usability: machine-readability, context awareness and user experience, to name a few.

Furthermore, content analysis could be used in completely new applications such as data-based leadership and decision making. Statistical information about index usage could prove useful in developing services which consume the harmonized content.

This paper covered a lot of development and research done in multiple organizations and projects. As seen, many of the concepts and topics are merging, creating new features and adding value to existing applications. Projects like OSA and Finna are not completed when their initial projects come to an end. They require constant development and evaluation of the latest research and tools in the field. This is the kind of dialog that has been going on during the past few years, and it is also the right direction for future collaboration.

Still, there is plenty of room for future research and development. To identify a few topics: managing the information overload, and the automatic curation and preservation of important knowledge and experiences, as generations before us have done.

7. REFERENCES
[1] Alkula, R., & Honkela, T. (1992). Tekstin tallennus- ja hakumenetelmien kehittäminen suomen kielen tulkintaohjelmien avulla: FULLTEXT-projektin loppuraportti (Development of text storage and information retrieval methods with natural language processing components). Valtion teknillinen tutkimuskeskus, informaatiopalvelulaitos.
[2] Alm, O., & Strömberg, J. (2013). Summary of Final Report for Capture Project. http://www.elka.fi/useruploads/files/Summary.pdf
[3] Bizer, C., Heath, T., & Berners-Lee, T. (2009). Linked Data - The Story So Far. International Journal on Semantic Web and Information Systems, 5(3), 1-22.
[4] Brauer, R., Dymitrow, M., & Fridlund, M. (2014). The digital shaping of humanities research: The emergence of Topic Modeling within historical studies. In Enacting Futures: DASTS 2014 Conference (Danish Association for Science and Technology Studies), 12-13 June 2014, Roskilde University, Denmark.
[5] Creutz, M., & Lagus, K. (2007). Unsupervised models for morpheme segmentation and morphology learning. ACM Transactions on Speech and Language Processing (TSLP), 4(1), 3.
[6] Finna (2014). Mappings from Different Formats to Finna's Index. https://www.kiwi.fi/display/finna/Kenttien+mappaukset+eri+formaateista+Finnan+indeksiin
[7] Gärdenfors, P. (2014). The Geometry of Meaning: Semantics Based on Conceptual Spaces. MIT Press.
[8] Hormia-Poutanen, K., Kautonen, H., & Lassila, A. (2013). The Finnish National Digital Library: a national service is developed in collaboration with a network of libraries, archives and museums. Insights: the UKSG journal, 26(1), 60-65.
[9] Honkela, T., Könönen, V., Lindh-Knuutila, T., & Paukkeri, M. S. (2008). Simulating processes of concept formation and communication. Journal of Economic Methodology, 15(3), 245-259.
[10] Honkela, T., Hyvärinen, A., & Väyrynen, J. J. (2010). WordICA—emergence of linguistic representations for words by independent component analysis. Natural Language Engineering, 16(03), 277-308.
[11] Kaski, S., Honkela, T., Lagus, K., & Kohonen, T. (1998). WEBSOM—self-organizing maps of document collections. Neurocomputing, 21(1), 101-117.
[12] Kettunen, K., Kunttu, T., & Järvelin, K. (2005). To stem or lemmatize a highly inflectional language in a probabilistic IR environment? Journal of Documentation, 61(4), 476-496.
[13] Koskenniemi, K. (1984). A general computational model for word-form recognition and production. In Proceedings of the 10th International Conference on Computational Linguistics and 22nd Annual Meeting of the Association for Computational Linguistics (pp. 178-181). Association for Computational Linguistics.
[14] Lampi, M., & Palonen, O. (2013). Open Source for Policy, Costs and Sustainability. In Archiving Conference (Vol. 2013, No. 1, pp. 271-274). Society for Imaging Science and Technology.
[15] Lampi, M., & Alm, O. (2014). Flexible Data Model for Linked Objects in Digital Archives. In Archiving Conference Proceedings (Vol. 2014, pp. 174-178). Society for Imaging Science and Technology.
[16] Lindén, K., Silfverberg, M., & Pirinen, T. (2009). HFST tools for morphology—an efficient open-source package for construction of morphological analyzers. In State of the Art in Computational Morphology (pp. 28-47). Springer Berlin Heidelberg.
[17] Lindh-Knuutila, T., & Honkela, T. (2013). Exploratory Text Analysis: Data-Driven versus Human Semantic Similarity Judgments. In Adaptive and Natural Computing Algorithms (pp. 428-437). Springer Berlin Heidelberg.
[18] Nilsson, M. (2010). From Interoperability to Harmonization in Metadata Standardization: Designing an Evolvable Framework for Metadata Harmonization. Stockholm: KTH. http://kth.diva-portal.org/smash/get/diva2:369527/FULLTEXT02.pdf
[19] Nyberg, K., Raiko, T., Tiinanen, T., & Hyvönen, E. (2010). Document classification utilising ontologies and relations between documents. In Proceedings of the Eighth Workshop on Mining and Learning with Graphs (pp. 86-93). ACM.
[20] Paukkeri, M. S., Nieminen, I. T., Pöllä, M., & Honkela, T. (2008). A Language-Independent Approach to Keyphrase Extraction and Evaluation. In COLING (Posters) (pp. 83-86).
[21] Paukkeri, M. S., García-Plaza, A. P., Fresno, V., Unanue, R. M., & Honkela, T. (2012). Learning a taxonomy from a set of text documents. Applied Soft Computing, 12(3), 1138-1148.
[22] Salton, G., Wong, A., & Yang, C. S. (1975). A vector space model for automatic indexing. Communications of the ACM, 18(11), 613-620.
[23] Steyvers, M., & Griffiths, T. (2007). Probabilistic topic models. Handbook of Latent Semantic Analysis, 427(7), 424-440.
[24] W3C (2009). Improving Access to Government through Better Use of the Web. W3C Interest Group Note, 12 May 2009. http://www.w3.org/TR/egov-improving/

Database Preservation Toolkit: a flexible tool to normalize and give access to databases

José Carlos Ramalho, University of Minho, Braga, Portugal ([email protected])
Luis Faria, KEEP SOLUTIONS Lda, Braga, Portugal ([email protected])
Hélder Silva, KEEP SOLUTIONS Lda, Braga, Portugal ([email protected])
Miguel Coutada, University of Minho, Braga, Portugal ([email protected])

ABSTRACT
Digital preservation is emerging as an area of work and research that tries to provide answers that will ensure continued, long-term access to information stored digitally. IT platforms are constantly changing and evolving, and nothing can guarantee the continuity of access to digital artifacts in their absence.

This paper focuses on a specific family of digital objects, relational databases; they are the most frequent type of database used by organizations worldwide. Database Preservation Toolkit enables the preservation of relational databases, holding the structure and content of the database in a preservation format in order to provide access to the database information over a long-term period.

If on one hand there is a need to migrate databases to the newer ones that appear with technological evolution, on the other hand there is also the need to preserve the information they hold for a long time period, due to legal duties but also due to archival issues. That being said, that information must be available no matter the database management system the information came from.

In this area, solutions are still scarce. The main products for relational database preservation include CHRONOS and SIARD. The first one is, in most cases, unreachable due to the associated costs. The second one is not really a product but a preservation format.

The main idea behind this work was to explore the main features and limitations of the existing products in order to improve 'db-preservation-toolkit' (http://keeps.github.io/db-preservation-toolkit/), an extracted component from the RODA project (http://www.roda-community.org). Therefore, 'db-preservation-toolkit' was improved with respect to performance and with new features, in order to support more database management systems, address some missing features of the other products, support a new preservation format (SIARD) and provide an interface where it is possible to access and search the information of the archived databases.

Categories and Subject Descriptors
H.4 [Information Systems Applications]: Miscellaneous; D.2.8 [Software Engineering]: Metrics—complexity measures, performance measures

Keywords
Digital Preservation, Databases, Migration, Significant Properties, Digital Object

1. INTRODUCTION
In the current paradigm of the information society, more than one hundred exabytes of data are already used to support our information systems [6]. The evolution of the hardware and software industry causes progressively more of the intellectual and business information to be stored in computer platforms. The main issue lies exactly within these platforms. If in the past there was no need for mediators to understand analogical artifacts, today, in order to understand digital objects, we depend on those mediators (computer platforms). In the eventual absence of appropriate mediators, who can guarantee the preservation of the digital artifacts? In other words, who has the responsibility to support the continuity of access to digital data [1]? Despite the concrete responsibilities, and considering that there is no generic solution, several researchers and research projects aim to face this problem.

Although digital information can be exactly preserved in its original form by only copying (preserving) the bits, the problem appears when we notice the very fast evolution of

those platforms (hardware and software) where the bits can be transformed into something human intelligible [4]. Digital archives and digital libraries are complex structures; without the software and hardware which they depend on, the human being, or others, will certainly be unable to experience or understand them [3].

Our work addresses this issue of digital preservation and focuses on a specific class of digital objects: relational databases [4]. Relational databases are a very important piece in the global context of digital information, and therefore it is fundamental not to compromise their longevity (life cycle) and also their integrity, liability and authenticity [8]. These kinds of archives are especially important to organizations because they can justify their activities and characterize the organization itself. Current studies claim that 90% of the information produced on a daily basis is stored in a relational database.

Currently, in this project, we aim to support more database formats on ingestion, more database preservation formats as AIPs, and new ways to explore the archived databases. In the following section we will describe the project context and its roots. Next we analyze the relational databases class of objects; we should be able to completely characterize this type of digital object so that one may choose which issues are important, valid and necessary for preservation. The following section establishes the significant properties for relational database digital preservation. The significant properties are addressed, individually and globally, over different levels of abstraction. At the end we will draw some conclusions, specify the future work to be done, and also enumerate some questions that emerge from the research.

2. RODA: THE BEGINNING...
In mid 2006, the Portuguese National Archives (Directorate-General of the Portuguese Archives) launched a project called RODA (Repository of Authentic Digital Objects), aiming at identifying and bringing together all the necessary technology, human resources and political support to carry out long-term preservation of digital materials produced by the Portuguese public administration.

Part of the original goals of the RODA project was the development of a digital repository capable of ingesting, managing and providing access to the various types of digital objects produced by national public institutions. The development of such a repository was to be supported by open-source technologies and should, as much as possible, be based on existing standards such as the Open Archival Information System (OAIS) [2], METS [12], EAD [11] and PREMIS [7].

The OAIS model is composed of three top processes: ingest, administration and dissemination. In RODA we have specified the workflows for each of these processes. The ingest process takes care of new information added to the repository. This information is delivered by the producer as a Submission Information Package, or SIP. The SIP structure had to be formally specified so that third-party institutions were able to communicate with the repository. During ingest, SIPs are transformed into AIPs (Archival Information Packages). The dissemination process takes care of consumer requests by transforming AIPs into DIPs (Dissemination Information Packages), a subset of the preserved information more adequate for delivery to end-users. Currently RODA is capable of storing and giving access to the following types of digital objects: text documents, still images, relational databases, video, audio and emails.

Normalization plays an important role in RODA. It was not possible to archive every kind of text document or every kind of still image. Even with databases, normalization was necessary, as each Database Management System (DBMS) had its own data model. So we had to take measures towards format normalization. Every digital object being stored in RODA is subjected to a normalization process: text documents are normalized as PDF files; still images are converted to uncompressed TIFFs; relational databases are converted to DBML [8] (Database Markup Language).

The RODA project is divided into many different components and services, having Fedora Commons at the core of its framework. Fedora implements the common digital repository features, such as digital object and metadata storage and the ability to create relationships between objects. Fedora Commons also provides search capabilities by using the Lucene search engine under the hood. On top of that, we have developed the RODA Core Services, i.e. the basic RODA services, which can be accessed programmatically. Finally, the RODA Web User Interface allows the end user to easily browse, search, access and administrate stored information and metadata, and to execute ingest procedures, preservation and dissemination tasks.

In spite of all the efforts invested in the development of RODA, there was still no support for real active digital preservation. Once the materials got into the archival storage they remained untouched and, therefore, susceptible to technological obsolescence, especially at the format level.

At the same time, at the University of Minho, a project called CRiB (Conversion and Recommendation of Digital Object Formats) was being devised. This project aimed at assisting cultural heritage institutions, as well as normal users, in the implementation of migration-based preservation interventions. Among those services were format converters, quality-assessment tools, preservation planning, and automatic metadata production for retaining representations' authenticity.

The CRiB system was developed as a Service Oriented Architecture (SOA) and is capable of providing the following set of services:

• File format identification;
• Recommendation of optimal migration options, taking into consideration the individual preservation requirements of each client institution or user;
• Conversion of digital objects from their original formats to more up-to-date encodings;
• Quality-control assessment of the overall migration process: data loss, performance and format suitability for long-term preservation;

• Generation of preservation metadata in PREMIS format to adequately document the preservation intervention and retain the objects' authenticity.

After obtaining supplementary funding to continue the development of RODA, the team decided to use CRiB as its preservation planning and execution unit.

The RODA project follows a service-oriented architecture to facilitate parallel development and updates, and to allow heterogeneous technology and platform independence between its various components. The CRiB project is also service-oriented, to allow the implementation of services that are only possible in specific platforms and technologies. This paper provides a description of both projects and of the integration of CRiB as one of RODA's components, allowing the use of its features for normalization processes during ingest, metadata generation, preservation planning and format migrations, and even dissemination services.

In this paper we are going to focus on the digital preservation of databases. We will raise the relevant questions on this topic, and we are going to discuss the decisions we took in the past and the ongoing work.

3. PAST
Preserving digital data is a complex technological puzzle. Databases are one of the most complex digital object types to deal with. To simplify the problem we decided to address it by layers: data, structure and semantics. These layers match database significant properties and tell us what to preserve and how to measure the quality of the digital preservation strategy being followed. The data layer extracts data and migrates it to the preservation format. The structure layer does the same with the database structure. The semantics layer deals with all the remaining database features that should be preserved.

Our first approach was to deal with the first two layers, the preservation of the database data and structure, i.e., the preservation of the database logical model. We developed a RODA component that extracts the first two layers from their specific database management environment (DBMS). Its first version used the DBML [8] neutral format for the representation of both the data and the structure (schema) of the database.

This component was presented and demonstrated at the Open Planet workshop "Database Archiving", held at the Danish National Archives in 2012. During the workshop it became clear that more formats should be supported and that we should also change the preservation format. Although there is no standard for a database preservation format, SIARD [10] is being adopted in several European institutions and projects, and when compared to DBML it already supports part of the semantics layer and has some scalability properties. So we decided to adopt it as our preservation format as well. Back then we also decided to support other DBMS formats, like DB2, and other preservation formats, like AADL (used by Sweden, Norway and Finland), as input and output formats of our toolkit. This way our toolkit will become a real interoperability tool.

4. OAIS, SIPS AND DATABASES
RODA follows the Open Archival Information System Reference Model (OAIS) [2]. OAIS identifies the main functional components that should be present in an archival system capable of performing long-term preservation of digital materials. The proposed model is composed of four principal functional units: Ingest, Data management, Archival storage and Access; and two additional units called Preservation planning and Administration. Figure 1 depicts how these functional units interact with each other and with all the stakeholders of the repository (internal and external).

[Figure 1: RODA general architecture. The OAIS functional units (Ingest, Data Management, Archival Storage, Access, Preservation Planning, Administration) connect Producer and Consumer through SIP, AIP and DIP packages, annotated with the XML metadata formats used: EAD (descriptive), PREMIS (preservation), NISO Z39.87 (technical) and METS (structural).]

4.1 Ingest process
The ingest process is responsible for accommodating new materials into the repository and takes care of every task necessary to adequately describe, index and store those materials. For example, in this stage the repository may transform submitted representations to normalized formats adequate for long-term preservation, and request the user to add descriptive metadata to those objects to facilitate their future retrieval using the available search mechanisms. It is also common practice to store the original bit-streams of ingested materials together with the normalized version (just in case a more advanced preservation strategy comes along to rescue those old bits of information).

New entries come in packages called Submission Information Packages (SIP). When the ingest process terminates, SIPs are transformed into Archival Information Packages (AIP), i.e. the actual packages that will be kept in the repository. Associated with the AIP is the structural, technical and preservation metadata, as they are essential for carrying out preservation activities.

The SIP is the format used to transfer new content from the producer to the repository. It is composed of one or more digital representations and all of the associated metadata, packaged inside a METS envelope. The structure of a SIP supported by RODA is depicted in Figure 2. The RODA SIP is basically a compressed ZIP file containing a METS document, the set of files that compose the submitted representations, and a series of metadata records. Within the SIP there should be at least one record of descriptive metadata in EAD-Component format¹. However, one may also find preservation and technical metadata inside a submission

¹ An EAD record does not describe a single representation.

package, although this last set of metadata is not mandatory, as it is seldom created by producers. Nevertheless, it was felt important that RODA should support those additional SIP elements for special situations such as repository succession, i.e. when ingested items belong to another repository that is to be deactivated.

Before SIPs can be fully incorporated into the repository, they are submitted to a series of tests to assess their integrity, completeness and conformity to the ingest policy. If any of the validation steps fails, the SIP is rejected and a report is sent to the archivists group as well as to the producer. The producer may then fix the problem and resubmit a new version of the SIP.

5. DATABASE SIP
Database SIPs are very similar to other SIPs. The difference lies in the representation files. For the other formats we only had to choose one normalization format to use for the representation files: for images we chose TIFF, for text-based documents we chose PDF, and so on. But for databases there was no such format. Each DBMS supported its own format; even SQL has some different versions. So, we had to create and specify a new format.

A neutral format that is hardware and software (platform) independent is the key to achieving a standard format to use in the digital preservation of relational databases. This neutral format should meet all the requirements established by the designated community of interest.

[Figure 2: Submission Information Package structure. A compressed ZIP file envelope containing descriptive, preservation and technical metadata records and a representation composed of files File 1 to File n.]

¹ (continued) In fact, EAD is used to describe an entire collection of representations. Our SIP includes only a segment of EAD, sufficient to describe one representation, i.e. a <c> element and all its sub-elements. The team has called this subset of the EAD an EAD-Component.

Since the late 1990's, XML has been accepted as the neutral format for information representation and information interchange. This is due, mainly, to two factors. On one hand, XML documents are purely textual files, structured and independent of any hardware or software platform. On the other hand, it is widespread, and more and more public domain tools are available to help users transform XML documents.

XML was the obvious choice for the base format of our representation files. Both DBML and SIARD use XML as the base format. DBML and SIARD are the only XML-based database preservation formats. Although they are easy to process by both machines and humans, converting a database into DBML or SIARD is not easy, and it is not a task humans can do by hand. So, the next step was to create a tool capable of generating DBML from different DBMS.

We also keep an SQL version of the database information in the version supported by the original DBMS. This has to do with the preservation policy: we always keep the original object or, at least, the closest version of it (we do not know what the future may bring, and we cannot predict how the actual DBMS will evolve).

6. DATABASE SIP BUILDER
In RODA's context we soon realized that we could not just deliver a format and demand that producers send us the information packages accordingly. In projects like this one, it is important to have wide acceptance from the community of users.

We developed a tool to create these SIPs. This tool was integrated in RODA but, due to the growing interest, it has been emancipated as a tool that can be integrated with other systems and tools. Its architecture is presented in Figure 3.

[Figure 3: Database SIP builder architecture. DBMS-specific import modules (MySQL, Oracle 12, SQL Server, PostgreSQL, DB2, MS Access, ODBC, DBML, SIARD) feed a streaming data model, from which export modules (MySQL, SQL Server, PostgreSQL, DB2, MS Access, ODBC, DBML, SIARD, PhpMyAdmin) produce their outputs.]

We are addressing database significant properties by layers. Each layer raises different problems that have to be solved with appropriate solutions.

6.1 Data layer
Extracting data from a DBMS is not difficult; we just have to connect to the DBMS and issue an SQL statement like "SELECT * FROM". In DBML, all the data is dumped into a single XML file. We had the idea to segment the data, but SIARD already did that. That was one of the reasons that led us to support SIARD as the preservation format. DBML had to change to be able to take care of real databases, and most of the needed changes were already implemented in SIARD.
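As a rough sketch of this data layer, the following self-contained example issues a SELECT * over a table and serializes the rows as XML. It uses SQLite so it can run anywhere, and the element names are invented for illustration; they are not the actual DBML or SIARD schemas.

```python
# Schematic data-layer dump: read all rows of a table and write them
# into a neutral XML representation (illustrative element names only).
import sqlite3
import xml.etree.ElementTree as ET

def dump_table_to_xml(conn: sqlite3.Connection, table: str) -> ET.Element:
    # The table name comes from the database schema, not from user input.
    cur = conn.execute(f"SELECT * FROM {table}")
    columns = [d[0] for d in cur.description]
    root = ET.Element("table", name=table)
    for row in cur:
        row_el = ET.SubElement(root, "row")
        for col, value in zip(columns, row):
            cell = ET.SubElement(row_el, "cell", column=col)
            cell.text = "" if value is None else str(value)
    return root

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id INTEGER, name TEXT)")
conn.execute("INSERT INTO person VALUES (1, 'Ana'), (2, 'Rui')")
print(ET.tostring(dump_table_to_xml(conn, "person"), encoding="unicode"))
```

Segmenting the output, one document per table as SIARD does, rather than producing one monolithic file, is what keeps this approach workable for large databases.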

6.2 Structure layer
Each DBMS stores the structural information in its own specific way, and to overcome this situation we had to develop specific connectors, import modules, for each one. For each DBMS we created a connector that connects to the database and knows how to extract its structural information. If we need to support a new DBMS in the future, we just need to program a new import module for that DBMS. In the last version, we added support for DB2 by creating a new import module for this DBMS.
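The connector contract can be sketched as below. The class names and the single-method interface are invented for illustration, with SQLite standing in for one of the supported DBMSs; supporting a new DBMS then means adding one more subclass.

```python
# Sketch of the import-module idea: one connector per DBMS that knows
# how to extract structural information into a common description.
import sqlite3
from abc import ABC, abstractmethod

class ImportModule(ABC):
    """Contract that every DBMS-specific connector implements."""
    @abstractmethod
    def get_structure(self) -> dict:
        """Return {table_name: [column names]} for the source database."""

class SQLiteImportModule(ImportModule):
    def __init__(self, conn: sqlite3.Connection):
        self.conn = conn

    def get_structure(self) -> dict:
        tables = [r[0] for r in self.conn.execute(
            "SELECT name FROM sqlite_master WHERE type='table'")]
        return {
            t: [c[1] for c in self.conn.execute(f"PRAGMA table_info({t})")]
            for t in tables
        }

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id INTEGER, name TEXT)")
print(SQLiteImportModule(conn).get_structure())
```

The rest of the pipeline consumes only the common structure description, which is what isolates it from each DBMS's specifics.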

6.3 Semantics layer
This layer corresponds to the behavioral part of a database and is where the focus of the discussion in this area lies. We include in this layer views, stored procedures, rights, roles, user management, APIs, interfaces, and any other feature we may come across.

Currently there are only partial solutions for it. DBML does not support it; it deals only with the first two layers. SIARD enables SIP creators to store views, stored procedures and constraints, capturing a significant part of the database behavior. These behavioral components are captured in SQL99 and stored inside an XML envelope.

For some consumers we are still missing many things: the forms that the application uses to capture input from users, reports, etc. In most of these cases we try to capture that knowledge with application metadata and application images/screenshots.

7. DATABASE ACCESS
We can see dbtoolkit as a SIP builder, but also as a tool that enables several ways to access or deploy archived databases. It can deploy a DIP very similar to the original SIP, an SQL-based copy of the original database or, as figure 3 shows, any other format that has an export module contributed by a community programmer. In this way, we can also look at the tool as a database converter between different DBMSs.

Back in RODA, we needed a user interface that would let users explore the archived databases. We took phpMyAdmin, simplified it, and ended up with a tool that allows users to browse databases, access data and structural information, and execute some SQL queries. This access component works with the MySQL export module and uses a local MySQL DBMS to cache the database. The problem with this approach is that it does not scale to large databases or to a large number of databases.

Pursuing the scalability idea, we are launching a new project to create a faster viewer with simpler interfaces. The project is illustrated in figure 4. The idea is to dump and index the data in a search engine like Lucene and use that engine as the interface to access the data. This way we will not need an external DBMS or an external database cache, making the access functionality simpler and faster.

Figure 4: New Database Viewer (a fast viewer application exposing a web interface and REST API over a Lucene/Solr index)

8. FUTURE WORK
As future work we still have to improve some features and to run some tests.

We are also working on new small projects pursuing the idea of reverse engineering the relational model. Since we are free from the DBMS, why should we stick with the relational model? The relational model is optimized for transactions; if we have an archived, frozen database we will not be executing transactions. If we do not need the relational model, we can undo the database normalization, moving back towards the original conceptual database model.

In a PhD thesis we have been working on algorithms to migrate data from a relational model into an ontological model close to the database conceptual model [5]. In more recent work we created a SIARD-to-RDF converter and implemented a simple RDF navigator for databases [9].

9. ACKNOWLEDGMENTS
This work is supported by the European Commission under FP7 CIP PSP grant agreement number 620998 - E-ARK.

10. REFERENCES
[1] F. Berman. Surviving the data deluge. Communications of the ACM, 51(12), 2008.
[2] Consultative Committee for Space Data Systems. Reference Model for an Open Archival Information System (OAIS). National Aeronautics and Space Administration, 2002.
[3] M. Ferreira. Introdução à preservação digital - conceitos, estratégias e actuais consensos. Escola de Engenharia da Universidade do Minho, Guimarães, Portugal, 2006.
[4] R. Freitas. Preservação digital de bases de dados relacionais. Master's thesis, Escola de Engenharia, Universidade do Minho, Portugal, 2008.
[5] R. A. P. Freitas. Relational databases digital preservation. PhD thesis, Engineering School, University of Minho, Portugal, 2013.
[6] P. Manson. Digital preservation research: An evolving landscape. European Research Consortium for Informatics and Mathematics - NEWS, 2010.
[7] PREMIS Working Group, OCLC Online Computer Library Center & Research Libraries Group. Data dictionary for preservation metadata: final report of the PREMIS working group. Technical report, Dublin, Ohio, USA, 2005.
[8] J. Ramalho, M. Ferreira, L. Faria, and R. Castro. Relational database preservation through modelling. In Extreme Markup Languages 2007, Montréal, Québec, 2007.
[9] F. Rocha. Preservação de Bases de Dados com SIARD. Master's thesis, Engineering School, University of Minho, Portugal, 2014.
[10] Swiss Federal Archives - SFA. SIARD format description. http://www.bar.admin.ch/themen/00876/00878/, 2008.
[11] The Library of Congress. Official page of EAD, 2002 version. http://www.loc.gov/ead/, 2002.
[12] The Library of Congress. METS webpage. http://www.loc.gov/standards/mets, 2006.

Practical experiences and challenges preserving administrative databases

Mikko Eräkaski
National Archives of Finland
Rauhankatu 17, PO Box 258, FI-00170 Helsinki
+358 50 363 5769
[email protected]

ABSTRACT
Over recent years the National Archives of Finland has received databases and registries from various governmental bodies. The information contained in these registries and databases will be transferred to the National Archives' long-term preservation service in order to ensure the authenticity, integrity and usability of the information over time. This paper introduces how information can be separated from database structures and transferred to archives. The key aspect of preserving database information is comprehensive documentation of the extracted data. This is done by applying the national SÄHKE2 standard and the ADDML standard developed by the Norwegian National Archives.

Categories and Subject Descriptors
H.2 DATABASE MANAGEMENT: H.2.7 Database Administration; E.1 DATA STRUCTURES

General Terms
Documentation

Keywords
Long-term preservation, National Archives of Finland, databases, ADDML standard, legislation

1. INTRODUCTION
The domain of digital preservation is widening from relatively simple documents to more diverse materials such as complex databases, data warehouses, geographical data and research data. Archives need to adopt new kinds of methods and techniques in order to preserve databases and keep them accessible and trustworthy over time. When dealing with complex databases, it is crucial to cooperate with authorities and other archives.

The Nordic countries have world-leading scientists as well as expertise in using administrative personal data in research. This is primarily due to extensive administrative registries and a wide usage of personal identification numbers. Large national databases and registries provide unique source material to study macro-level effects and complex causal questions. Finnish administrative registries are widely used in research today, but in the future this information may be compromised if not properly preserved for the long term [1].

The National Archives of Finland is facing a major challenge when it comes to preserving databases. Hundreds of registries and databases have been maintained by the public administration during the last decades, but only a few of them are being maintained steadily for long-term preservation purposes. Our previous experiences have shown that preserving databases is not merely a technical challenge. Public sector organizations also generally underestimate the research value of their databases, which leads to a lack of awareness of preservation issues, appraisal principles, and the appraisal duty of the National Archives.

2. NATIONAL ARCHIVES PRACTICE
The strategy of the National Archives is to ensure that its norms and guidelines correspond to the international standards and requirements used in digital preservation. During the past years the National Archives has developed tools and methods based on experiences in other European archives and their database preservation strategies.

The main strategy of the National Archives is to preserve only data, not functionalities, data processing rules or algorithms. Data is extracted from a database system and separated from the database structures. It is stored in XML or CSV format without any software-dependent features or binary files. As part of this process all binary files must be extracted from the database and converted to a suitable format. The National Archives has not set strict rules for the form of the data files. Instead, the core requirements concern a description of the database and obligatory metadata elements. This description is needed at many levels in order to fully understand the extracted data and the context of its creation and use. The description of the data and its transfer to the National Archives are done using a standardized SIP structure and metadata. Additional documentation concerning the context, data origin, database management system (DBMS), data models, processing rules and usability guidelines is also preserved in PDF format. The documentation that needs to be included in the SIP has so far been evaluated on a case-by-case basis.

The SIP structure is defined by the national SÄHKE2 standard, an information model designed for electronic records management systems (ERMS) [2]. The National Archives has developed the SÄHKE2 SIP structure in order to transfer records from diverse ERMSs to its long-term preservation service in a unified structure. The SÄHKE2 structure is also applied to database and registry data, which ensures that all material transferred to the National Archives arrives in the same structure with similar metadata.

SÄHKE2 metadata is used to describe database and registry data at the collection level as well as the record level. SÄHKE metadata consists mainly of contextual and administrative metadata describing origin, function, information content and possible restrictions. The SÄHKE structure ensures the integrity and permanence of the SIP, which is validated automatically in the ingestion workflow.
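The "preserve only data, not functionality" strategy described above can be sketched in a few lines: dump each table of a live database to plain CSV and keep a simple description of its fields alongside it. The file layout, helper names and sample data are invented for illustration; this is not the National Archives' actual tooling, nor the SÄHKE2 or ADDML schema.

```python
# Minimal sketch: separate the data from the database structure by exporting
# each table to software-independent CSV plus a field description.
import csv, io, sqlite3

def dump_table_to_csv(conn, table):
    """Return (csv_text, field_description) for one table."""
    cur = conn.execute(f"SELECT * FROM {table}")
    fields = [d[0] for d in cur.description]
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(fields)            # header row documents column order
    writer.writerows(cur.fetchall())   # plain rows, no DBMS-specific features
    description = {"table": table, "fields": fields}
    return out.getvalue(), description

# Illustrative registry with two rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE register (person_id INTEGER, municipality TEXT)")
conn.execute("INSERT INTO register VALUES (1, 'Helsinki'), (2, 'Tampere')")
csv_text, desc = dump_table_to_csv(conn, "register")
```

In a real transfer, the field description would be expressed in ADDML and packaged with SÄHKE2 metadata rather than kept as a Python dictionary.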

ADDML, the Norwegian Archival Service's standard for technical metadata, is used to describe the data itself: tables, fields, variables, codes and their relationships. ADDML (Archival Data Description Markup Language) describes a collection of data files organized as flat files, i.e. the flat file structure needed when data is exchanged from one system to another. The ADDML standard is relatively flexible, which means it can be adjusted to local requirements and practices for describing different levels of content. This also allows each archive to define its own rules on how to apply the standard [3].

Figure 1. The SÄHKE2 standard and the ADDML standard are used to describe the context, structure and content of a database.

Database data is usually not self-documenting, so without sufficient metadata it can be completely unclear. ADDML describes the meaning of the data by providing a technical and structural data description in a standardized format. It can also describe datasets consisting of more than one table, because it can describe their relationships. The National Archives is cooperating with the Norwegian and Swedish National Archives in order to develop the ADDML standard further.

3. PRACTICAL CHALLENGES
Databases usually stay in operation for several years and are, in most cases, constantly updated, which causes some challenges for preservation. The first question is whether or not to archive a database while it is still in operation. It is possible to archive snapshots of the entire database at regular intervals, or to archive only inactive data that is no longer modified. The National Archives has exercised the latter method: most transferred databases have been older databases in which the data is no longer altered.

The second question concerns documentation. If a database has been in operation for a longer period it has usually been altered over time. Fields and codes may have changed, and older data may differ from newer data. Database documentation usually does not include information about such changes. In the future, organizations should have processes for keeping documentation up to date over the longer term. Older data is also often poorly described. In the case of older registries the documentation can be completely lost, available only in paper format, or simply not up to date. If some parts of the data are obscure, the information can be of no value.

3.1 Complex legislation
In Finland, legislation concerning personal data is complex. The objectives of the Personal Data Act are to safeguard the right to privacy and to promote the development of good processing practice. The Act also regulates the destruction and preservation of personal data: "If a personal data file is no longer necessary for the operations of the controller, it shall be destroyed, unless specific provision have been issued by an Act or by lower-level regulation on the continued storage of the data contained therein or the file is transferred to be archived…" [4] Public authorities often have problems understanding this complex legislation, and the Personal Data Act is read to imply that data must be destroyed. Public authorities are also often unaware of appraisal principles and of the appraisal duty of the National Archives, which has a determinative role in the appraisal of public records and data. As a result, data in many governmental registers and databases has not yet been appraised by the National Archives, because records creators have not sent appraisal proposals concerning their registries to the National Archives. Altogether this has led to situations where public authorities have destroyed register data that should have been preserved as part of the National Cultural Heritage.

4. REFERENCES
[1] Gissler, Mika & Haukka, Jari: Finnish health and social welfare registers in epidemiological research. Norsk Epidemiologi 14/2004.
[2] SÄHKE2 standard: http://www.arkisto.fi/se/saehke2-maeaeraeys (available only in Finnish)
[3] ADDML standard: http://www.arkivverket.no/arkivverket/Arkivbevaring/Elektronisk-arkivmateriale/Standarder/ADDML
[4] Personal Data Act (523/1999), English translation: http://www.finlex.fi/en/laki/kaannokset/1999/en19990523.pdf

LONG-TERM PRESERVATION OF DATABASES THE MEANINGFUL WAY

Janet Delve
University of Portsmouth, School of Creative Technologies
Eldon Building, Winston Churchill Avenue, Portsmouth, PO1 2DJ, UK
+44 2392 845524
[email protected]

Rainer Schmidt
AIT Austrian Institute of Technology GmbH
Donau-City-Straße 1, 1220 Vienna, Austria
+43(0) 50550-4273
[email protected]

Kuldar Aas
National Archives of Estonia
J. Liivi 4, Tartu, 50409, Estonia
+372 7387 543
[email protected]

ABSTRACT
Long-term preservation of databases has been discussed in some detail over recent years, for example as part of the PLANETS project, and we have seen the rise of standards like SIARD and ADDML to address this issue. However, these tools and standards are not particularly geared towards the reuse of preserved data, addressing as they do the use case of accessing a single database snapshot covering just one instance in time, and then allowing pre-defined or custom queries to be carried out on it. This paper will show how the EC-funded E-ARK project (http://www.eark-project.com/) is addressing wider use cases of database archiving and access. A gap analysis carried out in the early phases of the project has identified the fact that archives are not able to carry out Big Data querying / data mining across a variety of archived databases carrying related entities etc., as opposed to querying single databases as mentioned above.

Part of the E-ARK project approach is to address wider use cases by using a combination of state-of-the-art techniques taken from data warehousing, Online Analytical Processing (OLAP), data mining and semantic annotation. Overall this approach means that:

• during the pre-ingest or ingest workflow, denormalized representations will be created of the original relational database;
• the database content will be semantically enriched according to available centrally controlled vocabularies;
• the enriched representations will be stored next to the original database;
• when users are interested in a special topic which might be covered in multiple database snapshots, they are allowed to create semantic queries which identify appropriate OLAP cubes, and can use additional data mining techniques to combine and make sense of the data in these.

Whilst the work is still ongoing, the paper will shed some light on the details of this approach and present a conceptual technological solution.

Categories and Subject Descriptors
Ingesting and preserving databases and special records, Perspectives on past and present projects, Re-use of public information, Role of standards, Strategy and approach, The cloud, mobile, social, Big Data, Transforming archives through information technologies

General Terms
Database Archiving, Database Preservation, Online Analytical Processing (OLAP), Big Data, Data Warehousing, Online Transaction Processing (OLTP), SIARD, SIARD-DK, ADDML, normalization, denormalization

Keywords
Reuse, Database Preservation, Data Mining, Data Warehousing, OLAP, OLTP, E-ARK project, Digital Preservation (DP), Denormalization

1. INTRODUCTION
In the final panel discussion at the Goportis "Getting Ready for Digital Preservation" meeting in Hamburg on October 20th 2011, Seamus Ross asserted that the biggest challenge then facing the DP community was database archiving. In conversation at the 2014 iPRES conference in Melbourne, Professor Ross mentioned to Dr Delve that he saw databases as being one of the greatest inventions of the 20th century, being used as they are in so many different walks of life [Ross, personal communication]. Ross's views coincide with the approach to digital archiving taken by the three-year E-ARK project, which was set up in February 2014 to develop a pan-European digital archiving system catering for both records and databases. E-ARK included databases because they are such key digital "workhorses": driving applications large and small, mainframe and web-based, long-standing and recent; and because the data they contain and their robust functionality need to be retained for posterity. This paper begins by charting a brief history of database development, focusing on the relational database and delineating how it changed from being mainly transaction-oriented (via Online Transaction Processing - OLTP) to now also being analysis-driven (as seen by the advent of analytical databases, data warehousing, data cubes, multidimensional databases, dimensional modeling, Online Analytical Processing - OLAP, etc.). Following this overview comes a brief review of how some of the national archives in E-ARK currently carry out database archiving. We then discuss some of the innovations in E-ARK where we are using data warehousing / OLAP techniques as part of the OAIS process for archiving databases. We conclude by outlining further work.
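The denormalized representations mentioned above can be illustrated with a tiny relational snapshot: two normalized tables are joined into one flat table that can be indexed and queried without knowledge of the original data model. The schema, table and column names are invented for illustration; this is not E-ARK code.

```python
# Sketch: denormalize a normalized snapshot into a single flat table.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE owner    (owner_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE property (prop_id INTEGER PRIMARY KEY, address TEXT,
                           owner_id INTEGER REFERENCES owner(owner_id));
    INSERT INTO owner    VALUES (1, 'A. Tamm');
    INSERT INTO property VALUES (10, 'J. Liivi 4, Tartu', 1);

    -- One denormalized row per property: no joins needed at access time.
    CREATE TABLE property_flat AS
        SELECT p.prop_id, p.address, o.name AS owner_name
        FROM property p JOIN owner o ON o.owner_id = p.owner_id;
""")
rows = conn.execute("SELECT address, owner_name FROM property_flat").fetchall()
```

A user searching for the address now finds the owner directly in the flat representation, which is the property that makes such tables suitable for full-text indexing and OLAP-style access.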

Figure 1. Entity-Relational Model © Connolly and Begg
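The normalization rules behind an entity-relational model of this kind can be sketched in a few lines: a redundant flat record set is split into two entities linked by a foreign key, so that each attribute depends only on its own entity's key. The data and names are invented illustrations.

```python
# Sketch: normalizing redundant flat rows into two entities with a foreign key.
flat_rows = [
    ("10 High St", "Jones", "Jones & Co"),  # (address, owner, owner_company)
    ("12 High St", "Jones", "Jones & Co"),  # owner details repeated: redundancy
]

# Entity 1: owner, keyed by name; its attributes depend only on the owner key.
owners = {owner: {"company": company} for _, owner, company in flat_rows}

# Entity 2: property, keyed by address, holding a foreign key to its owner.
properties = {addr: {"owner_fk": owner} for addr, owner, _ in flat_rows}

# The duplicate owner details collapse into a single entry, and a query now
# re-joins the two entities across the foreign key.
owner_of_10 = owners[properties["10 High St"]["owner_fk"]]
```

Updating the owner's company now touches exactly one record, which is the transaction-processing benefit of normalization discussed in the text, at the cost of a join at query time.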

2. BACKGROUND TO DATABASE DEVELOPMENT

2.1 Database definitions
Connolly and Begg [2] define a database as a "shared collection of logically related data (and a description of this data), designed to meet the information needs of an organization". It is the fact that database data is logically related that gives databases their all-important structure. Date's contribution to databases is seminal, but his definition of a database system as "basically just a computerized record-keeping system" [3] belies the mathematical complexity that underlies these digital organizers. Connolly and Begg outline the development of the database through the early hierarchical databases, and they focus on the rigorous process of normalization which lies behind the relational database model. When computer science started as a discipline, there was a concerted effort to formalize it as a serious academic discipline, with the aim that it would have the same gravitas as other subjects, for example mathematics. It was in this spirit that the mathematician and Oxford graduate Edgar F. (Ted) Codd, then working for IBM, devised the relational system which was the basis for the relational database. The database he outlined in 1972 [1] was founded on actual symbolic relations, with a solid mathematical foundation. In this way, the data models, and eventually the tables and associated Structured Query Language (SQL) of the database, would be well-formed and reliable, producing data that can be used and analyzed with confidence in a wide variety of situations. There are also special mathematically-based relational languages a) to describe the formation of queries (relational algebra), which is particularly useful for query optimization, and b) to ensure queries are logically formed (relational calculus).

2.2 Database Normalization
Relational data modeling, in a nutshell, is centered on entities which are chosen due to their importance to the modeler. However, once chosen, there are strict rules called normal forms, which ensure that each entity has at least one unique key, and that every attribute describing that entity is related to the key, the whole key (keys can be multi-part) and nothing but the key (so it should not depend on a non-key attribute). This then results in a data model comprising many discrete entities that have to be connected by joins across the keys (called foreign keys). For large and complex models, this can result in an unwieldy tangle of connections which are costly to query, in terms of creating the joins and then ensuring that queries are performed in the best order. However, one reason normalization has been so popular is that it caters well for transaction processing in many domains such as banking, insurance, retail, etc.

2.3 OLTP
By isolating each entity, it is possible to minimize the disruption to the database when inserting, updating and deleting data, as these operations only affect the entity concerned. In this manner, data relating to one entity is not included in another entity, and there are no duplicate entries. In terms of database processing this is extremely efficient, and it is the main reason the relational database has remained popular for so long. However, the quid pro quo is that all the individual entities have to be linked via joins, and these can be processor-intense, especially if poorly constructed (which is unfortunately all too easy to do, and is why relational algebra exists to indicate the best sequence in which to join and query the database tables). These problems become particularly important when considering massive centralized databases, perhaps belonging to an international company, with many millions of transactions.

2.4 Centralized database problems
Two different problems arose with large centralized databases. First, local users wanted more control and did not want to wait for the central database to process transactions, so they would take a copy of just the material relating to their business area or location, put it into their own mini database, and use it for their own queries. The difficulty with this is that from one central database, a myriad of mini local databases would spring

up, each being customized in different ways. Trying to analyze the data from a global perspective then became unsatisfactory, as each part of the organization relied on its own mini database for current information, and these were often contradictory. The second problem was that, with large volumes of data in huge databases, users would frequently want to carry out analysis across time to measure past performance and to try to predict future action. Obtaining the relevant data from each normalized database turned out to be extremely time consuming, and broad analytical questions such as "how do house prices compare across five states / counties over the last five years?" proved just too time-consuming to answer.

2.5 Analytical databases
W.H. (Bill) Inmon provided the solution to both problems [5] by introducing the data warehouse, which is subject-oriented (as opposed to application-oriented) and takes in snapshots of databases over time. It also integrates this data with other relevant information, and crucially, the data in the warehouse is not updated, it is only ever added to, so the raison d'être for normalization (separating entities to minimize update disturbance) is no longer there. Data warehouses thus have a much more flexible architecture, and it is this feature that makes them useful when considering archiving databases, especially those with many complex joins. Data warehouses are often modeled on a star schema [6], which is created by taking the relational model and recasting it with a central fact table. For a retail example, the fact table would contain the items that are numerical, such as the number of products sold. This fact table is surrounded by dimension tables that describe the products. Most importantly, these dimension tables do not have to be normalized, so they can be much more intuitive and user-friendly (and archivist-friendly) and can contain rich detail, especially regarding time. If required, the dimension tables can be a mixture of normalized and unnormalized data, then known as a starflake schema. OLAP is a complementary technique to data warehousing, in which these dimensions are used to query the data, which is visualized as a cube (see below). For an example of a humanities data warehouse based on census data, see [4]. This architecture has now become mainstream: "ordinary" relational databases can now be set up to be analytical as opposed to transactional, and extra features appeared in the SQL99 version of the standard query language to cater for some of the new types of analytical queries.

This background discussion has thus served as an introduction to the problems of saving accurate relational database models with all their complicated joins, and has also introduced data warehousing techniques which can help to access data across several archived databases. However, before looking at the new techniques being used by E-ARK, we need to inspect current practices regarding database archiving.

3. CURRENT PRACTICE IN DATABASE ARCHIVING
Database archiving has been an active issue for more than four decades. However, the principles of the approaches developed back in the 1970s and 1980s have remained more or less the same and can be summarized as the following three-step process:

• take a temporal snapshot of the original database
• migrate the snapshot into open formats while changing as little as possible of the original data structures
• when access is required, reconstruct the database snapshot, based on the data in open formats, in a modern database management system (DBMS).

Of course, such an approach is highly practical and solves the database preservation issue for most interested bodies, including government and scientific archives. E-ARK especially benefits from the experience of consortium member the Danish National Archive (DNA), which archives all its data in the form of databases and uses its own version of SIARD: SIARD-DK. As far as the authenticity of the data is concerned, we can also state that the approach of keeping the data models in active use and in preservation as close to each other as possible is probably the best possible next to emulation, which in the case of database and system preservation is nowadays regarded as too expensive for practical purposes for most memory institutions.

The main problems of this approach are related to access and re-use. The preservation of near-to-original data models in different snapshots requires users to go through rather a lot of steps before getting to the actual data they need:

• locate the relevant database snapshot(s)
• load it into a database management system
• execute the relevant queries, which you might also need to construct yourself after consulting the specific data model

What users actually need does of course differ in great detail. We can, as an example, look at the three main user groups of public archives: citizens seeking specific information around their rights; government employees needing information for their work; scientists and researchers carrying out large-scale analysis of both historic and current data.

For the first two user groups the most usual need is to find information about a specific entity at a given point in time or over a time period. As an example, this might be about the details and ownership of a piece of property either in 1986 or between 1970 and 2000. The main difficulty in carrying out such queries according to the current logic in database preservation is that instead of using the most obvious search phrase in the archival catalogue, the address of the building, the user needs to look for "the database snapshot which includes details about properties in the 1980s" or, in the case of a longer time period, "all database snapshots from the 1970s to the 2000s which include details about properties". In the long term we also have to take into account that the scope of data gathered into single databases can change quite drastically, for example due to shifting data gathering and management mandates between different government institutions.

Taking all the above into account, we can state that the level of content and technical knowledge needed from users to carry out even the simplest queries on top of archived databases is just too high, especially when compared to the ease of use of current government service portals.

For researchers the most usual use case is to look into a specific topic, locate ALL data relevant to this topic and analyze it together. The problem for such use cases is that the number of database snapshots to go through is growing too large. As an example, when a researcher wants to carry out analysis on

Page 73

building ownership over three decades, and the archives operate a logic of archiving snapshots every five years, the need is to go through, learn to understand the data models of, and execute the relevant queries on six different database snapshots. And of course, when the researcher needs information from N different databases, then the number of snapshots to go through would probably be N times 6.

As a first summary we can therefore state that the main problem in reusing archived databases is that it relies too much on the same "original data model, temporal snapshots" logic, which is not sufficiently simple and useful for any of the relevant user groups of archives. Therefore the main aim of the discussion below is to explain what some of the most current technologies in data warehousing and Big Data are able to do in terms of generating user-friendly representations of archived data for a variety of use cases.

4. "BIG DATA" TECHNIQUES AS USED IN E-ARK
4.1 Scalable Data Analysis
The E-ARK project is developing a reference implementation for a scalable e-Archiving service. This platform will, besides scalable storage and repository services, also implement advanced search and data analysis strategies. The current prototype setup is based on a scalable architecture involving technologies like Apache Hadoop [http://hadoop.apache.org] for scalable storage and computation, the Apache PIG [http://pig.apache.org] data analytics platform, and the Lily [http://www.lilyproject.org] content repository. Being implemented atop data-intensive technologies, the e-Archiving service is capable of storing and efficiently processing large volumes of archived data on multiple computer nodes. Combined with content extraction and information retrieval tools (including for example Apache Tika and Solr), the platform is used to generate content-based and searchable data sets, enabling users to query information beyond the metadata level. While there is a range of challenges which need to be addressed regarding the processing of complex objects like images and documents at scale [8], a major and so far hardly addressed challenge is the development of search and analysis strategies across archived relational databases.

4.2 Database Representations
The currently developed E-ARK SIP, AIP and DIP specifications (Submission, Archival and Dissemination Information Packages from the OAIS model) provide built-in support for handling relational databases. This includes archiving databases at multiple layers, which can include the primary object, serialized and semantically enriched representations (e.g. based on XML schema), as well as representations that are prepared for later analysis steps. Likewise, E-ARK supports access to archived databases at different levels. This includes (a) access based on generic databases that can be loaded and accessed through a Relational Database Management System (RDBMS), (b) access based on aggregated and pre-processed data sets using OLAP-based methods such as denormalization, and (c) access to single records which result from queries executed across multiple archived databases.

4.3 Loading Archived Databases
Archival formats for preserving relational databases have been developed to archive relational data sets independently of the database management system which was used to create, store, and access the data. Tools like CHRONOS and SIARD are capable of exporting database content to disk using an open archival format, thereby preserving the original structure and functional elements at different levels [7]. Communication with the RDBMS is typically handled using standardized APIs (like JDBC) and drivers enabling an application to connect to SQL databases and other tabular data sources. Database archiving tools also enable users to load archived data back into a live database system and/or enable users to directly query exported data. The Database Preservation Toolkit [http://keeps.github.io/db-preservation-toolkit/], which is extended in the context of the E-ARK project, supports the conversion of live or backed-up databases into preservation formats, the conversion between database export formats, as well as loading preservation formats back into live systems.

4.4 Extracting and Aggregating Data
The serialization and de-serialization of single relational databases and the associated archival formats are essential for developing and implementing the archival workflow and information package specifications developed in E-ARK. A major interest in the context of the E-ARK project, however, is search and access of database records beyond single archived databases. Data warehouses, typically used in the business domain, provide concepts to integrate data from multiple sources into a single platform for the purpose of data analytics and reporting. Data warehousing makes use of tools enabling Extract-Transform-Load (ETL) processes to derive, extract, and aggregate data originating e.g. from RDBMSs or flat files. The load phase adds and updates the data sets within the data warehouse, which are modeled according to a well-defined structure. Data warehouses typically have a low transaction rate and aggregate historical data, which can be contrasted with the processing of transactional data sets. For online analytical processing (OLAP), data is typically organized along the abstraction of data cubes. These enable the user to analyze data and create reports along dimensions like time, location, and other units, required to generate for example Web analytics or sales statistics. As analytical queries are complex and resource-demanding, OLAP systems often need to organize the data sets in a read-efficient manner or pre-aggregate them in order to be able to generate timely results.

4.5 A NOSQL-based Approach
While there is ongoing work within the E-ARK project to make use of existing OLAP systems for performing analytical queries across archived databases, the work carried out in the context of the scalable e-Archiving service is focused on providing a generic method for searching archived databases which can be implemented on top of the Apache Hadoop software stack. The goal is to support the analysis of a large number of records based on a simple and non-relational model. The approach is supported by the design of E-ARK information packages enabling the preservation of databases at multiple layers, as mentioned before. In this respect, work is dealing with a strategy to generate denormalized versions of archived databases in order to generate flat and non-relational data representations, which can be loaded into a distributed NOSQL store like HBase.
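The denormalization step can be sketched in a few lines: foreign keys are resolved so that every record becomes a flat, self-contained document of the kind that can be bulk-loaded into a NoSQL store like HBase. The tables, columns and values below are invented for illustration and are not part of the E-ARK tooling.

```python
# Sketch: flatten normalized tables (persons -> towns -> countries) into
# self-contained flat records, as one might load into a NoSQL store.
# All table and column names here are illustrative only.

towns = {1: {"name": "Tartu", "country_id": 10},
         2: {"name": "Lisbon", "country_id": 20}}
countries = {10: {"name": "Estonia"}, 20: {"name": "Portugal"}}
persons = [{"id": 100, "name": "Alice", "town_id": 1},
           {"id": 101, "name": "Bob", "town_id": 2}]

def denormalize(person):
    """Resolve foreign keys so each record stands alone (no joins needed)."""
    town = towns[person["town_id"]]
    country = countries[town["country_id"]]
    return {"id": person["id"], "name": person["name"],
            "town": town["name"], "country": country["name"]}

flat = [denormalize(p) for p in persons]
# flat[0] -> {"id": 100, "name": "Alice", "town": "Tartu", "country": "Estonia"}
```

Once flattened this way, a record can be indexed and retrieved by a single key lookup or full-text query, without reconstructing the original relational schema.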

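The data-cube abstraction described in Section 4.4 can likewise be illustrated with a toy roll-up: a measure is pre-aggregated along the levels of a dimension hierarchy (here country/town), which is the kind of pre-aggregation OLAP systems rely on to answer analytical queries quickly. The data and dimension names are invented for illustration.

```python
# Sketch: OLAP-style roll-up of a measure along a country/town dimension
# hierarchy. A minimal stand-in for data-cube pre-aggregation; the rows
# and dimension levels are illustrative only.
from collections import defaultdict

rows = [("Estonia", "Tartu", 5), ("Estonia", "Tallinn", 7),
        ("Portugal", "Lisbon", 3)]

def roll_up(rows, level):
    """Aggregate the measure (last field) at one dimension level
    (0 = country, 1 = town)."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[level]] += row[-1]
    return dict(totals)

# roll_up(rows, 0) -> {"Estonia": 12, "Portugal": 3}
```

Materializing such roll-ups ahead of time is what lets an OLAP system answer "total per country" without scanning every base record at query time.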
Search and analysis will be supported based on full-text indexing as well as the Apache PIG data analytics platform. Current experiments aim at supporting OLAP-like data aggregation and preprocessing based on automatically detected dimensions. Here, dimensions, for example in the form of country/town/zip-code/street, are detected automatically by analyzing the relations within a database. This information is then exploited to physically organize the data along the dimensions, enabling efficient range queries and aggregation, and avoiding expensive scans over the entire data store.

5. CONCLUSIONS AND FURTHER WORK
It is central to E-ARK to archive both records and databases, and work to define the definitive E-ARK SIPs, AIPs and DIPs that fully cater for both data types continues, alongside the developments outlined above.

6. ACKNOWLEDGMENTS
This work is supported by the European Commission under FP7 CIP PSP grant agreement number 620998 - E-ARK.

7. REFERENCES
[1] Codd, E.F. 1972. "Further Normalization of the Data Base Relational Model" in Data Base Systems (Rustin, R., ed.). Prentice Hall.
[2] Connolly, T.M. and Begg, C.E. 2014. Database Systems: A Practical Approach to Design, Implementation and Management (6th edition). Addison-Wesley, Harlow, England.
[3] Date, C.J. 2003. An Introduction to Database Systems (7th edition). Addison-Wesley Longman, Reading, Massachusetts.
[4] Healey, R. and Delve, J. 2007. Integrating GIS and Data Warehousing in a Web Environment: A Case Study of the US 1880 Census. International Journal of Geographical Information Science (IJGIS), Volume 21, Issue 6. Taylor and Francis. 603-624.
[5] Inmon, W.H. 2005. Building the Data Warehouse. Wiley, Foster City, U.S.A.
[6] Kimball, R. and Ross, M. 2013. The Data Warehouse Toolkit (3rd edition). Wiley, Foster City, U.S.A.
[7] Lindley, A. 2013. "Database Preservation Evaluation Report - SIARD vs. CHRONOS." Proceedings of the 10th International Conference on Preservation of Digital Objects (iPRES), ed. José Borbinha, Michael Nelson, Steve Knight, 2-6 Sept 2013.
[8] Schmidt, R., Rella, M., Schlarb, S. 2014. "ToMaR -- A Data Generator for Large Volumes of Content." Cluster, Cloud and Grid Computing (CCGrid), 14th IEEE/ACM International Symposium, 26-29 May 2014. 937-942.

Data protection in the archives world – fundamental right or additional burden?

Dr. Jaroslaw LOTARSKI
Legal officer
European Commission
[email protected]

Job SUETERS
Document Management Officer
European Commission
[email protected]

ABSTRACT
Since the very existence of archiving institutions, they have stored and made available information which may be considered "personal data" in the current sense of this concept. The development of the modern IT infrastructure and the threats to the privacy of individuals associated with it generated, during the last quarter of the 20th century, the establishment of comprehensive data protection legislation at international, European, national and even subnational level. Even if the archiving of personal data was not originally considered a significant threat to privacy, the development of the data protection rules has had a profound impact on archiving processes. Firstly, the data protection rules contribute to limiting personal information from being archived, thus limiting the amount and quality of information being preserved for future generations. Secondly, when "personal data" are archived, their accessibility is likely to be limited, creating a burden for the archiving institutions and a restriction for the users of the archives, in particular researchers.

The existing data protection legislation attempts to balance the conflicting rights to privacy and data protection on the one hand and the preservation and accessibility of archives on the other hand.

CONTENT
At the European level the fundamental rights to privacy and data protection are recognised by Article 8 of the European Convention on Human Rights [1] and Articles 7 and 8 of the EU Charter of Fundamental Rights [2]. More specifically, the right to the protection of personal data is regulated by the Council of Europe Convention 108 of 1981 [3] and the EU Directive (EC) 95/46 of 1995 [4]. The national legislations on the protection of personal data should in principle be in line with these European instruments. As far as the processing of personal data by the institutions and other bodies of the European Union is concerned, Regulation (EC) 45/2001 [5] is applicable to them.

The first and potentially fundamental impact of these rules on archiving is the application of the data minimisation and data retention principles. In fact, personal data must be adequate, relevant and not excessive in relation to the purposes for which they are collected and/or further processed (Art. 6.1(c) of Directive (EC) 95/46) [4]. The correct application of this principle limits the amount of data collected and subsequently archived. More importantly, the data must be kept in a form which permits identification of data subjects for no longer than is necessary for the purposes for which the data were collected or for which they are further processed (Art. 6.1(e) of Directive (EC) 95/46) [4]. Even if exceptions for the historical and scientific use of data exist in the current legislation, the amount of personal data deposited in archives is likely to be limited. The risk of destruction of data which are used to justify some subjective rights is probably overstated; these data, like population registers or notarial acts, are being preserved under specific national legislation. However, it is likely that personal data which are not legally required to be preserved but which constitute a memory of a society at a particular moment of history will not be preserved in the archives. They will perhaps be preserved by private actors in the archives of the internet… The role of specific record keeping institutions as guardians of the collective memory is being redefined, and their role is challenged by other forms of preservation of records of human activity.

When personal information is present in an archive, the first issue that an archivist faces is whether data protection rules apply. Their application is limited to "personal data", defined as "information relating to an identified or identifiable natural person". The scope of this definition is wide: it concerns not only information related to private life but also to the professional activities of a person, and it covers all forms and supports of information (paper, electronic, sound, image, …). It has, at least in the large majority of jurisdictions, an important limitation: data protection rules apply only to living individuals and therefore concern only recent records. In the absence of regulation at the European level, various national regimes coexist for cases where it is impossible or excessively difficult to establish whether a person is living. For instance, a person might be considered deceased, and therefore all records concerning him/her as not falling under data protection rules, 100 years after birth.

In case the data protection rules do apply, the "data controller", which will in principle be the record keeping institution, has to inform the persons concerned about the processing of their personal data. Such information should notably include the identity of the controller, the purposes of the processing and, if applicable, the data processed, the recipients, etc. There is however no obligation to inform the person concerned if the provision of such information proves impossible or would involve a disproportionate effort, or if recording or disclosure is expressly laid down by law, in particular for processing for the purposes of historical or scientific research. But in these cases appropriate safeguards for the persons concerned should still be provided.

Another consequence of the application of the data protection rules is the right of the person concerned to access his or her data, request their rectification, object to the data processing or even request the data to be deleted. The recent judgement of the Court of Justice of the EU concerning Google (case C-131/12 of 13 May 2014) was precisely related to a right to be forgotten by internet search engines.

The obligations of confidentiality and security of data have consequences for the accessibility of the data. In principle only persons who are the subjects of the records may be granted access to them, except if access is permitted on the basis of specific legislation. Access by third parties should in principle be allowed

only for a legitimate purpose and be subject to an appropriate contract imposing an obligation of confidentiality. Constraints flowing from the data protection legislation may impose the conservation of personal data in anonymised and/or pseudonymised form, which normally will not be considered personal data. The medium on which data are preserved plays a very important role in striking the right balance between the legitimate public interest in research and the rights of individuals. There is obviously a fundamental difference between the conservation of data in paper files and on electronic storage which can be easily accessed and searched via the internet.

The new EU Data Protection Regulation, which will eventually become directly applicable in all the EEA Member States, is still in a preparatory phase. The data processing regime for archiving purposes is the subject of negotiations between the actors of the legislative process. The new regime will have to define the balance between different freedoms, individual rights and public and private interests; specify the "appropriate safeguards" for the persons concerned by the data processing by the archives; determine the scope of access to personal data; and define the space left to the national and European legislator in the implementation of data protection in the context of archiving.
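One common technique for the pseudonymised form mentioned above is keyed hashing, sketched below: direct identifiers are replaced by tokens that stay consistent across records (so they remain linkable for research) but cannot be reversed without the key. The field names and key handling here are purely illustrative; real safeguards would depend on the applicable legislation and on how the key is governed.

```python
# Sketch: keyed pseudonymisation (HMAC-SHA-256) of a personal-data field.
# The key would in practice be stored and governed separately from the
# archived data; everything below is illustrative only.
import hashlib
import hmac

SECRET_KEY = b"archive-internal-key"  # illustrative; never hard-coded in practice

def pseudonymise(value: str) -> str:
    """Deterministic pseudonym: the same input always yields the same token,
    but the original value cannot be recovered without the key."""
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]

record = {"name": "Jane Doe", "birth_year": 1950}
record["name"] = pseudonymise(record["name"])
# Two records about the same person now share a token instead of a name.
```

Because the token is deterministic, a researcher can still count or link records about the same individual without ever seeing the identifier itself.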

REFERENCES

[1] European Convention on Human Rights. http://www.echr.coe.int/Documents/Convention_ENG.pdf
[2] Charter of Fundamental Rights of the European Union. 2012/C 326/02. http://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:12012P/TXT
[3] Convention for the Protection of Individuals with regard to Automatic Processing of Personal Data. http://conventions.coe.int/Treaty/en/Treaties/html/108.htm
[4] Directive 95/46/EC of the European Parliament and of the Council of 24 October 1995 on the protection of individuals with regard to the processing of personal data and on the free movement of such data. http://eur-lex.europa.eu/legal-content/en/ALL/?uri=CELEX:31995L0046
[5] Regulation (EC) No 45/2001 of the European Parliament and of the Council of 18 December 2000 on the protection of individuals with regard to the processing of personal data by the Community institutions and bodies and on the free movement of such data. http://eur-lex.europa.eu/LexUriServ/LexUriServ.do?uri=OJ:L:2001:008:0001:0022:en:PDF

Reducing complexity of hybrid data at ingest

Tarvo Kärberg
Project Manager
J. Liivi 4, Tartu 50409, Estonia
+372 738 7585
[email protected]

ABSTRACT
The appraisal, acquisition and preservation of archival records is regulated by the Archives Act in Estonia. It states that the transferor shall bear the expenses of the transfer of archival records to the National Archives of Estonia (NAE), including expenses incurred during re-arrangement, descriptive work and transport of archival records according to the requirements. In practice, these expenses can vary very significantly, as agencies and persons performing their public duties have produced archival records in several ways, which means that acquisition may require quite a lot of (manual) work. As it may get too complicated and expensive to prepare and transfer data to the archives, the producers haven't been too keen to get used to digital preservation.

To overcome the complexity of preparing and transferring the archival records to the archives, the NAE has developed a pre-ingest tool – the Universal Archiving Module (UAM). It tries to fix the gap in the current situation by giving producers a set of functionalities which are gathered into one place, are relatively easy to use, can be partially automated and are approved by the archives. Those functionalities include building a classification schema (both manually and automatically from XML files), adding descriptive metadata (both for paper-based and digital records, manually or automatically), describing digital content (automatically identifying and characterizing the computer files) and validating the transfer (automated control against rules and requirements set by several laws and guidelines).

The presentation will introduce some practical examples of how the UAM can help solve complex situations which have occurred in recent practice. The focus is on hybrid data, as this is one of the crucial issues that archives need to tackle. The producers are used to maintaining paper-based and digital archival records in different ways, but when it comes to search and access to descriptive information, the users would like to get everything from one place. Therefore it is very important to behave proactively at the (pre-)ingest stage and pay very close attention to how the ingest process is being conducted.

Categories and Subject Descriptors
H.3.7 [Digital Libraries]

General Terms
Experimentation, Standardization.

Keywords
Digital preservation, knowledge, UAM, ingest, archive.

1. INTRODUCTION
Data providers in Estonia have an obligation to prepare (classify, describe etc.) and transport the archival records to the archive according to the requirements and guidelines set by the archives. Previous experiences have shown that agencies need help to overcome the difficulties that the archiving of records may bring along.
• One of the first obstacles that the agencies encounter is building a classification schema. It may be very time-consuming to perform this manually. The agencies would like to do it in a more automated way.
• The same goes for the descriptions – some descriptions are available in EDRM systems and it would be reasonable to reuse them when describing the archival items.
• Another obstacle may be collecting the content from several sources. The digital content is usually spread between several locations and bringing it into one place for archiving can be difficult.
• A further obstacle may be finding an easy way to validate your work, i.e. to check whether you have done everything right when preparing the data for archiving.
• The final obstacle may be the transfer. Agencies want to send their work to the national archives, but they want a smooth, controlled and secure solution for doing that.
To help agencies with those tasks the NAE provides them with a special tool – the Universal Archiving Module (UAM) – that can help agencies deal with all the previously mentioned obstacles.

2. CLASSIFYING AND DESCRIBING THE DATA
Agencies often have some descriptions available in digital form. Descriptions may be in EDRM or other information systems or even in some Excel files. If an agency would like to reuse these descriptions when preparing their records for archiving, then it should export them somehow from the source system. For example, producing an XML file from Excel is quite straightforward using the Save-As dialogue. The national archives do not declare any specific requirements for the export. Only one rule should be followed – the export should be in XML format. As the XML format can take various shapes, every system will need a mapping between its metadata elements and the UAM input in order to use the UAM import functionality. Using UAM makes the preparation process smoother as it provides functionality for creating a classification schema and adding descriptive metadata (both for paper-based and digital records) manually or automatically.
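Such a mapping between an agency's XML export and a pre-ingest description structure can be sketched as follows. Note that both the agency-side element names and the target field names below are invented for illustration; the actual UAM input schema is defined by the NAE.

```python
# Sketch: mapping a hypothetical agency XML export onto target description
# fields. Element and field names on both sides are illustrative only --
# the real UAM input format is specified by the National Archives of Estonia.
import xml.etree.ElementTree as ET

agency_xml = """<export>
  <item><title>Budget 2013</title><ref>A-1</ref></item>
  <item><title>Minutes</title><ref>A-2</ref></item>
</export>"""

# One mapping per source system: agency tag -> target metadata field.
FIELD_MAP = {"title": "Title", "ref": "ReferenceCode"}

def map_items(xml_text):
    """Parse the export and rename each element per the mapping table."""
    root = ET.fromstring(xml_text)
    return [{FIELD_MAP[child.tag]: child.text for child in item}
            for item in root.findall("item")]

# map_items(agency_xml)[0] -> {"Title": "Budget 2013", "ReferenceCode": "A-1"}
```

Because only the mapping table is system-specific, supporting a new EDRM export amounts to writing a new `FIELD_MAP` rather than new parsing code.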

Metadata fields and the hierarchy used in UAM are based on the ISAD(G) and ISAAR(CPF) standards and are therefore commonly suitable for any record type. More specific metadata elements can be used for lower hierarchy levels (item, computer files). The element set can be easily extended, as each level contains a type XML tag. When descriptions and items contain identifiers, all of them automatically find their right place in the archival classification hierarchy.

3. HARVESTING THE CONTENT
Agencies often have some valuable information encapsulated in computer files, and those computer files have not been put into the EDRM systems but are saved on some network drives. These computer files are usually organised in some way for better finding purposes, but they are not automatically related to the classification schema used in the agency. Therefore it is important to somehow create the relations between computer files and the appropriate records or files when preparing the data for archiving.

This can be done in two ways in UAM. One option is that the archivist selects computer file(s) or some catalogue in the operating system, and UAM imports the computer file(s) to the indicated place in UAM and automatically characterizes and migrates them if needed. This is useful when there is a relatively small amount of computer files to import, as it requires quite a lot of manual intervention.

The second way is to use XML files as the list of available computer files for import. The XML file can be created from the console of the operating system by printing out the listing of the contents of a directory. This is extremely useful when the computer files are organised by series or functions of the agency on the network drives.

It is important to note that all actions are logged, and it is possible to check whether a computer file is the result of a migration or if it is the original file. As no computer files are deleted during the migration process, it is possible to repeat the migration (into some other file format) later if needed.

4. CONTROLLING THE QUALITY
Archivists can make mistakes during the preparation process, or archival descriptions or the classification schema can contain some discords. Therefore, it is natural that the work should undergo some quality control. UAM provides a three-level validation procedure to the archivist. First, a manual input validation, which means that when data is entered into some metadata field manually it will be automatically validated. The existence of mandatory values is checked during the saving of the views in UAM. The second level is "forced validation", an additional validation (checking the correctness of the classification tree etc.) that is done when the "Validation" button is pressed. The final validation will be automatically performed right before the transfer.

Validation levels duplicate some rules, but they mainly complement each other. For example, some metadata is not mandatory when the archivist starts the describing process, and its absence will not be treated as a problem until the archival classification schema has been marked as approved by the national archives. There are also some rules which only indicate a mistake and are treated as warnings. This means that they do not compromise the transfer, but will be highlighted in the delivery agreement later.

5. TRANSFERRING THE DATA
The transfer process is very important for agencies as they hand their work over to the archives for validation and ingest. They want to perform this step preferably in an automated way.

There are two ways to deliver information packages to the archives using UAM. First, the classification schema, the descriptions and the computer files can be sent to the archives over the secured channel called 'X-Road' (a secured layer between e-services and databases in Estonia). When sending information packages over X-Road, the packages will be automatically split into smaller pieces for performance efficiency. The system checks that no pieces get lost during the transfer and informs the user about the result of the process.

The second option is to save the information packages to some data carrier and bring them to the archives. This is recommended only for those agencies that do not have a connection to the X-Road channel.
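The chunked transfer just described can be sketched as a split-and-verify cycle: the package is cut into pieces, each piece is fingerprinted, and the receiving side refuses to reassemble unless every piece arrives intact. The chunk size and hash choice below are illustrative and are not those of the actual UAM/X-Road implementation.

```python
# Sketch: splitting an information package into pieces and verifying that
# none were lost or corrupted in transit, in the spirit of the X-Road
# transfer described above. Chunk size and hashing are illustrative only.
import hashlib

CHUNK_SIZE = 4  # bytes; tiny on purpose, for demonstration

def split(package: bytes):
    """Cut the package into chunks and fingerprint each one."""
    chunks = [package[i:i + CHUNK_SIZE]
              for i in range(0, len(package), CHUNK_SIZE)]
    manifest = [hashlib.sha256(c).hexdigest() for c in chunks]
    return chunks, manifest

def reassemble(chunks, manifest):
    """Fail loudly if any piece is missing or altered, else rejoin."""
    if len(chunks) != len(manifest):
        raise ValueError("a piece went missing during transfer")
    for chunk, digest in zip(chunks, manifest):
        if hashlib.sha256(chunk).hexdigest() != digest:
            raise ValueError("a piece was corrupted during transfer")
    return b"".join(chunks)

chunks, manifest = split(b"archival package payload")
restored = reassemble(chunks, manifest)
```

Sending the manifest alongside (or ahead of) the pieces is what lets the receiving side report a definite "complete and intact" result to the user.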

6. ACKNOWLEDGMENTS
The author gratefully thanks the Digital Archives at the National Archives of Estonia, especially the Director of the Digital Archives, Mr Lauri Leht.

Interdisciplinary Approach for Hybrid Records Management in Belgian Federal Administrations: The HECTOR Research Project

Prof. Dr Marie Demoulin
Université de Montréal (Canada) and University of Namur (Belgium)
[email protected]

Sébastien Soyez
State Archives in Belgium
[email protected]

Prof. Dr Seth Van Hooland
Université Libre de Bruxelles (Belgium)
[email protected]

Prof. Dr Cécile de Terwangne
University of Namur (Belgium)
[email protected]

ABSTRACT
This paper introduces the context, objectives and methodology of the HECTOR research project (Hybrid Electronic Curation, Transformation and Organization of Records), an interdisciplinary project combining law and information sciences whose main purpose is to develop a model for the transformation, organization and curation of hybrid records (i.e., paper-based, digitized and digital-born) in Belgian federal administrations, in order to facilitate the transition to a trustful, secure and effective electronic government.

Categories and Subject Descriptors
E.1 [Data]: Data structure – Distributed data structures, Graph & networks, Records; E.5 [Data]: Files – Organization/Structure, Sorting/Searching; H.3.1 [Information storage and retrieval]: Content Analysis and Indexing – Indexing methods; K.5.2 [Legal aspects of computing]: Governmental Issues – Regulation; K.6.1 [Management of computing and information systems]: Project and People Management – Life cycle.

General Terms
Management, Documentation, Performance, Reliability, Security, Human Factors, Standardization, Theory, Legal Aspects.

Keywords
Hybrid Records Management, Appraisal & Disposal, Metadata schemes, Authenticity, Digital Evidence.

1. INTRODUCTION
Despite major technological advances and standardization efforts within the field of electronic records management, the reality of public and private administrations provides countless examples which debunk the myth of the paperless office. Every kind of organization is currently confronted with a hybrid environment of records: paper-based, digitized, and digital-born, some of which are (re)printed for various reasons. The lack of clear management policies for such heterogeneous records results in a lot of confusion, redundant or lost information, waste of valuable resources, and potential legal conflicts which hamper the efficiency of public services.

Most studies conducted in the field of records management concentrate either on the paper-based office or on its digital counterpart, but very few focus on the management of hybrid records in an era of digital transition. Even if great efforts have been made in terms of standardization, records managers have to deal with multiple standards and best practices that seem difficult to apply to a mix of paper-based and electronic records. Similarly, from a legal perspective, the first regulatory measures taken by lawmakers have been to set up a framework for digital signatures and to recognize the use of electronic communications for commercial and administrative purposes, leaving hybrid documents in a grey area where their legal value and authenticity remain a challenge. Such a situation conveys important risks for federal administrations, notably in terms of confidence in public services, potential legal conflicts, access to crucial information, and the waste of valuable resources. It is therefore of strategic importance to rationalize and organize residual paper-based and semi-electronic records management.

The goal of the HECTOR research project (Hybrid Electronic Curation, Transformation and Organization of Records) is to offer clear guidelines on how to streamline hybrid records management practices within the Belgian federal administration through a transverse and systematic approach. The expected result is a more coherent and effective hybrid document management strategy that will improve access to public information; enhance trust, transparency and security; minimize the use of paper from a sustainable development perspective; and finally, adapt, if needed, the current conditions for long-term preservation of the informational heritage of federal public services.

In order to provide an interdisciplinary approach, the HECTOR project's research team gathers experts from the fields of legal sciences (Research Centre Information, Law and Society of the University of Namur), information sciences (Information and Communication Science department of the Université Libre de Bruxelles), and archival sciences (Digital Preservation & Access Division of the State Archives in Belgium). The team is completed by cross-domain expertise in both archival sciences and the law from the University of Montreal (École de Bibliothéconomie et des Sciences de l'Information).

2. SCIENTIFIC CONTEXT
Throughout the 1980s and 1990s, large IT players implemented the idea behind the records continuum¹ within Electronic Document and Records Management Software (EDRMS). These software packages were heralded throughout more than two decades as a cost-effective solution for managing the growing production of electronic documents within large organizations. The reality of vendor lock-in (i.e. the enormous costs to the customer of switching to a different EDRMS), as well as the rapidly-changing technological landscape, made the EDRMS implementation and evolution process a lengthy and painful experience for most organizations.

Out of this situation arose the necessity to develop functional requirements, concentrating on what the functionalities of a records management system should be, rather than on how they should be implemented practically. Two important contributions in this direction were Designing and Implementing RecordKeeping Systems (DIRKS) on the one hand, developed by the National Archives of Australia and officially published as a standard in 1996, and the Model Requirements for the Management of Electronic Records (MoReq) on the other hand, originally published in 2001 by the European Commission, superseded by a second version in 2008 (MoReq2), and revised in 2012 (MoReq2010).

However, the monolithic and centralized approach of the EDRMS model proved very hard to implement in a context where documents and records are increasingly scattered amongst applications and media. Every large administration is currently confronted with paper-based originals that need to be conserved for legal reasons, digitized versions of paper documents, and digital-born documents. James Lappin acknowledges the failure of the top-down, centralized EDRMS model compared to more collaborative approaches and proposes to replace it with a new records repository model based on the concept of a central repository that is responsible for a unique and secured storage of content, which can be used by other applications through web services and which can connect with a filing plan and retention rules². The functionalities of a modular approach based on the repository model are illustrated in MoReq2010, which has been thoroughly reworked due to the failure of the EDRMS approach.

The HECTOR project wants to demonstrate how we can rationalize the creation and management of filing plans and retention schedules through the use of linked data principles. The combination of increasing budget cuts and growing electronic collections is currently forcing information providers (archives, libraries, public administrations) to rethink the ways in which they provide access to their resources. The traditional model of intellectually (and therefore manually) indexing documents has already been under pressure for a number of years. Both funding bodies and grant providers expect short-term results and encourage cultural heritage institutions to gain more value out of their existing metadata by linking them to external data sources. It is precisely in this context that the concepts of Linked and Open Data (LOD) have gained momentum. Recent initiatives such as OpenGLAM and LOD-LAM illustrate how these evolutions are being implemented in libraries and archives.

Linked data principles potentially hold tremendous value for the archives and records management community. HECTOR specifically wants to focus on how linked data principles can be used to more efficiently produce, manage, and distribute access to functional thesauri, business classifications, and filing plans. All too often, these tools are managed and distributed in formats such as PDF, Word, or Excel, making them impossible to manage through an automated process. With the help of concrete case studies, HECTOR will investigate the possibilities of using SKOS (Simple Knowledge Organization System) to facilitate the machine interoperability of business classifications used for records management purposes. This approach will not only allow for a more efficient distribution of access to classifications from a centralized authority (Belgian State Archives) to administrative bodies, but will also serve as a basis for automated content extraction and aggregation, with the help of, for example, Named-Entity Recognition and text mining tools.

Beyond the evolution of the medium, from paper to digital, it appears that information sciences and archival sciences have evolved in separate worlds, at least in Belgium, notably with regard to the normative and legal environments that treat them separately. This trend is gradually changing: from the records managers' side, the latest version of MoReq shows a relative integration of archival principles, and from the archivists' side, a professional standard developed by the International Council on Archives (ICA-Req) has become an ISO standard in the records management field. However, the management of hybrid documents, although not excluded by MoReq2010, needs to be clarified in the context of classification, description (metadata), appraisal, and disposal. The challenge of the disposal requirements within MoReq is indubitably connected to the issue of the appraisal process. There is a need to define clear criteria for the retention of records, including the question of their legal value.

3. METHODOLOGY
From a horizontal point of view, the research will be conducted with an interdisciplinary approach, closely combining information sciences and law. Within information sciences, interdisciplinarity will also be implemented through the integration of archival sciences at an earlier stage in the elaboration of hybrid records management strategies, instead of confining archivists to a depository and preservation role. This "integrated archival" approach is highly encouraged by Rousseau and Couture³ in order to anticipate the long-term preservation of records, for instance in appraisal/disposal process modelling, during which the primary and secondary value of records should be jointly taken into account.

¹ Defined as "a consistent and coherent regime of management processes from the time of the creation of records (and before creation, in the design of recordkeeping systems) through to the preservation and use of records as archives" (Australian norm AS 4390 on records management, 1996).
² Lappin, J. 2010. "What will be the next records management orthodoxy?" RMJ 20(3): 253.
³ Rousseau, J.-Y. and Couture, C., Les fondements de la discipline archivistique, Presses de l'Université du Québec, Montreal, p. 50.

federal administrations. As a starting point, HECTOR is based on an exploratory analysis of a selection of relevant and generalizable case studies within Belgian federal administrations representing users of hybrid documents.
The analysis of simple hybrid documents management will be based on digital and paper statements of offence at the Federal Police, the Local Police, the Federal Ministry of Employment and at the Courts, where different projects of digitization of paper reports are combined with digital-born records projects. Then, the analysis will be enlarged to include complex hybrid files, such as inspection files at the Federal Agency for the Safety of the Food Chain (FASFC), the Federal Agency for Nuclear Control (FANC) and the State Archives in Belgium, and human resources files at the Federal Ministry of Finances in relation to the current e-HR project handled by the Federal Ministry of ICT.
This research is based on a win-win cycle, where research results will feed back into the existing theoretical background and directly benefit the case study administrations involved in the project. Gradually, the scope of the study will integrate more potential users from other administrations.

4. EXPECTED RESULTS
To enlarge the scope of the results, the research team will elaborate two transverse models, one for hybrid documents management and the other for hybrid files management, which could be applied in other administrations and services. The models will more specifically focus on:
- Documents and files digitization processes that preserve the authenticity of the document (traceability and integrity) and ensure its effective use (quality, accessibility and content exploitation);
- Appraisal and disposal policy after digitization (including destruction policy), with criteria for the need for retention or destruction of the paper original and its digital copy, and the determination of the respective retention durations of the original paper and its digital copy (disposal schedules based on the administrative, legal, organizational, and patrimonial value of the document);
- Classification and description of the digitized documents/files (metadata schemes and filing plans), including elements of authenticity, traceability, retention, and accessibility;
- Policy on access to digitized information, with regard to the balance to be found between transparency (access to public sector information) and privacy/confidentiality;
- Policy for the joint management of paper and digital documents/files (including a hypothesis on where such a joint management is necessary).

5. REFERENCES
[1] Lappin, J. 2010. "What will be the next records management orthodoxy?" RMJ 20(3): 252–264.
[2] Rousseau, J.-Y. and Couture, C. 1994. Les fondements de la discipline archivistique. Presses de l'Université du Québec, Montreal.
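The linked-data idea behind HECTOR's filing plans and retention schedules can be sketched in a few lines. This is an illustrative, stdlib-only sketch: the paper does not publish a HECTOR vocabulary, so the namespace, property names and record-series identifiers below are all invented for the example.

```python
# Hypothetical namespace for the example (not a real HECTOR URI).
HECTOR = "http://example.org/hector/"

# A tiny in-memory triple store: (subject, predicate, object).
# Each record series carries its own retention rule, so client
# administrations could fetch rules from one central authority.
triples = [
    (HECTOR + "series/offence-reports", HECTOR + "label", "Statements of offence"),
    (HECTOR + "series/offence-reports", HECTOR + "retentionYears", 10),
    (HECTOR + "series/offence-reports", HECTOR + "disposal", "destroy"),
    (HECTOR + "series/hr-files", HECTOR + "label", "Human resources files"),
    (HECTOR + "series/hr-files", HECTOR + "retentionYears", 75),
    (HECTOR + "series/hr-files", HECTOR + "disposal", "transfer-to-archives"),
]

def describe(subject):
    """Collect every predicate/object pair published for one record series."""
    return {p.rsplit("/", 1)[-1]: o for s, p, o in triples if s == subject}

rule = describe(HECTOR + "series/hr-files")
print(rule["retentionYears"], rule["disposal"])  # prints: 75 transfer-to-archives
```

In a real deployment the triples would live in an RDF store exposed by the central authority, and the same identifiers could anchor Named-Entity Recognition output to the classification.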

Research projects as a driving force for open source development and a fast route to market

RODA, SCAPE and E-ARK - a case study

Hélder Silva, Miguel Ferreira, Luís Faria
KEEP SOLUTIONS, LDA
R. Rosalvo de Almeida, n.º 5
Braga, Portugal
{hsilva, mferreira, lfaria}@keep.pt

ABSTRACT
Research projects, especially in the computer science domain, have consistently provided outputs as open source products or as updates to long-standing open source projects. This occurs due to the shared openness of both research and open source, which enables re-use by the community, spawning new developments in both research and open source products. But when an open source project serves a community and a real-world problem, the impetuosity of research can clash with the inertia of real-world application. Nevertheless, research projects can bring much needed innovation to open source projects, and open source projects can bring the much needed route to market that research funders look for in the outputs of the research they fund, ensuring that the budget spent on research actually reaches the community and improves the world.
This paper presents an analysis of this dynamic with a case study about RODA, an open source repository for digital preservation used in memory institutions such as archives, and two research projects: SCAPE, focused on scalable digital preservation services, and E-ARK, focused on the standardization of information packages, integration with real-world applications, and database preservation.
The paper further tries to identify good practices for using existing open source projects in research and for assuring that research outputs are carried into the main versions of open source projects and find their way to the final user.

Categories and Subject Descriptors
H.3.7 [Information Systems]: Information Storage and Retrieval—Digital Libraries

Keywords
Preservation, Repository, Research, Open Source, Integration

1. INTRODUCTION
RODA1 is an open source digital repository specially designed for archives, with long-term preservation and authenticity as its primary objectives. Created in 2006 in a 2-year project led by the Portuguese National Archives in partnership with the University of Minho, it would later lead to the creation of the KEEP SOLUTIONS company2, which up to now continues to develop RODA, foster its open source community, and provide commercial services for maintenance, support and on-demand feature development.
SCAPE3 was a project co-funded by the European Commission under the Seventh Framework Programme. It ran from 2011 to 2014 and aimed to develop scalable services for the planning and execution of institutional preservation strategies on an open source platform that orchestrates semi-automated workflows for large-scale, heterogeneous collections of complex digital objects.
E-ARK4 is an ongoing project co-funded by the European Commission under the Competitiveness and Innovation Framework Programme. It will run from 2014 to 2017 and aims to develop a pan-European methodology for electronic document archiving, synthesising existing national and international best practices, that will keep records and databases authentic and usable over time.
This paper provides an overview of how the research projects, namely SCAPE and E-ARK, were included in the RODA project, how their results were incorporated into its main features, how the roadmap is aligned with future developments, and how the output of research reaches the end users.

2. RODA
RODA is a complete digital repository system that provides functionality for all of the main units that compose the OAIS reference model. RODA fully implements an Ingest workflow that validates SIPs and migrates digital objects to preservation-friendly formats, and provides Access by delivering different ways to search and navigate over available data as well as to visualise and download stored digital material. Data Management functionalities allow archivists

1 http://www.roda-community.org
2 http://www.keep.pt
3 http://www.scape-project.eu
4 http://eark-project.com

to create and modify descriptive metadata and to define rules for preservation actions, e.g. scheduling integrity checks on stored digital objects or initiating a migration process. Administration procedures allow the definition of access rights to data and of operational permissions for each user or group.
Before RODA is able to ingest Information Packages, which in OAIS are called Submission Information Packages (SIP), a formal or informal agreement between the Producer and the Repository must be made in order to specify the contents of the Information Packages (e.g. specifying required sets of information and in which standards they should be encoded) and any applicable timeframe. As RODA has its own SIP format (see Figure 1), anyone who wants to deposit into the repository has to produce RODA-compliant SIPs. To accomplish that, one can use the desktop tool (RODA-in), which allows the creation and upload of RODA SIPs, or directly use the RODA Web User Interface (RODA-WUI).

Figure 1: RODA SIP structure.

Both these approaches have limitations, as they don't scale and they use RODA's own niche SIP format. This may become a problem when massive creation of SIPs is required and existing systems do not produce RODA's SIP format. A way around it is to develop programs that integrate existing systems with RODA by producing SIPs in RODA's format, but there are too many systems, and not all institutions have the resources to develop their own ad-hoc integrations. If no mandates for a common way to build and share Information Packages are in place, be they recommendations or legal impositions, integration and sharing of information between systems or entities of the same country is hard, and becomes even harder in a broader international context.
After ingest, OAIS recommends that actions be put in place to ensure that the information available in the repository keeps accessible and understandable to its Consumers. These actions are defined and monitored by a Preservation Planning process, which can be as simple as file corruption detection using checksums or as complex as file format migration and quality assurance. From a technical point of view, RODA and its Preservation actions (a set of plug-ins which can be manually executed or scheduled for later execution) make it easy to perform either of these preservation tasks. But from a management point of view, a well-founded decision for selecting the optimal preservation task to ensure continuous access to the information available at the repository must be made.
Preservation Planning is defined as the task responsible for monitoring the environment of the OAIS and which provides recommendations and preservation plans to ensure that the information stored in the OAIS remains accessible to, understandable by, and sufficiently usable by the Designated Community over the Long Term, even if the original computing environment becomes obsolete [1]. This process can be done by the repository manager, periodically, in a manual or semi-automated fashion. But as the information in the repository grows and becomes more diverse, manually monitoring all the risks that might afflict file formats and planning the proper action to perform can become infeasible. Automation of some of the steps in preservation planning becomes, therefore, crucial to maintain a trustworthy repository and the authenticity of the curated digital objects.
These are just some of the constraints observed in the RODA implementation before the research projects, and they were an object of research in those same projects. In the next sections the research project results will be presented and the process of integration of these research results into the open source project will be described.

3. RESEARCH INITIATIVES & RESULTS
In the SCAPE project, RODA was used as a reference implementation by integrating with Scout5, the preservation monitoring tool, Plato6, the preservation planning tool, and Taverna7, a domain-independent workflow management system used to run preservation tasks. These integrations allow RODA to enact a preservation lifecycle that continuously monitors the existence of preservation risks on the content, devises a preservation plan to mitigate them, executes the plan transforming the content, and monitors back again to verify if the problems were solved.

3.1 Scout
Scout, a preservation monitoring tool, supports the scalable preservation planning process by implementing an automated service for collecting and analysing information on the preservation environment [6].
Scout works by configuring source adapters that obtain and normalize information from different sources in order to save that information in a knowledge base. Those sources, as illustrated in Figure 2, can be content (e.g. file-system usage), organization policies, format and tool registries, the Web (e.g. using Natural Language Processing to extract knowledge from websites), and even human knowledge. In conjunction with the information collected, Scout allows the

5 http://openplanets.github.io/scout
6 http://www.ifs.tuwien.ac.at/dp/plato/intro
7 http://www.taverna.org.uk

creation of queries, a mechanism to allow reasoning on the information gathered in order to detect changes. On top of the queries, triggers (watch conditions) can be created to periodically evaluate them, and when a condition is not met Scout allows, for example, an e-mail notification. Upon notification, additional actions like Preservation Planning can be initiated.

Figure 2: Scout sources diversity.

Scout is currently integrated with RODA, and to perform this integration changes had to be made in order to be able to configure Scout to monitor the several aspects of the repository. RODA already exposes APIs to allow integration with other systems, but those APIs do not always expose the information required for all purposes. In RODA's particular case, we wanted to monitor both the content (i.e. files) as well as repository events (e.g. ingest finished). To make this integration possible, the following features were needed:

• Report API8: OAI-PMH [7] interface that exposes repository event information like ingest started, ingest finished, etc.
• FITS plug-in: RODA plug-in responsible for doing characterization on every file stored in RODA using the FITS tool9.

The first one is directly integrable with Scout (using the source adaptor for the repository Report API), whereas the second one needs an extra tool: C3PO - Clever, Crafty, Content Profiling of Objects10, which analyses the technical properties of large sets of objects based on metadata generated by characterisation tools such as FITS and Apache Tika11 and provides aggregated information on those technical properties (e.g. file size, MIME type, compression scheme). The FITS plug-in output is fed into C3PO, which in its turn is configured in Scout to be a source of information (using the source adapter for C3PO).
By formalizing in Scout a set of conditions that must be met in order to state that no preservation risks exist, the mandatory responsibility of an OAIS-compliant repository to "Follow documented policies and procedures which ensure that the information is preserved against all reasonable contingencies" [1] is addressed.

3.2 Plato
Plato, a preservation planning tool, implements a well documented and validated preservation planning methodology and integrates registries and services for preservation action and characterisation [2, 10].
Plato provides a Web interface that allows preservation plans to be built interactively, guiding the planner through a well defined decision-making process:

1. Definition of high-level requirements and their break-down into measurable criteria, thus creating an objective tree;
2. Evaluation of potential preservation strategies by applying selected tools to a manageable sub-set of objects that should cover the essential characteristics of the collection being analysed;
3. Analysis of the results and decision on whether any of the strategies should be applied; if affirmative, an executable preservation plan can also be created.

At the very end of the process of creating a preservation plan, the output is the plan in the form of an XML file. To deploy the plan into RODA, a service named Plan Management API12 was developed, which allows the creation, retrieval, update and deletion of Preservation Plans in the repository. Alongside the CRUD13 operations, it also allows search, using SRU [14] as the search protocol and CQL [11] as the syntax for representing the queries. Also, for management purposes, this API allows the monitoring of Preservation Plans in the repository (i.e. whether they are active, whether they are being executed at a certain moment in time, whether they executed with success, etc.).
As soon as a plan is deployed into RODA, a unique identifier is associated with that particular plan. This way, when a plan is executed and changes are performed on an intellectual entity, RODA is able to relate the two. It does that by creating a PREMIS [9] event (per intellectual entity) that connects, for preservation purposes, a plan and the representation files of that intellectual entity. This way, when browsing the preservation timeline of a particular intellectual entity in RODA, if any preservation event was created due to preservation actions executed in the context of a preservation plan, this plan can be immediately consulted, describing why the action was executed and detailing all the decision-making process for selecting the exact action that was executed, including the tested alternatives, used samples, experiment results, final decision and execution details.

8 https://github.com/openplanets/scape-apis
9 http://projects.iq.harvard.edu/fits
10 https://github.com/peshkira/c3po
11 http://tika.apache.org
12 https://github.com/openplanets/scape-apis
13 CRUD stands for Create, Retrieve, Update and Delete
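To make the plan lifecycle described above concrete, here is a toy, in-memory sketch of the bookkeeping a plan management service performs: deployment with a unique identifier, and execution status tracking ("being executed", "success", "failure"). All names are hypothetical; this is a sketch of the idea, not the actual RODA Plan Management API.

```python
import uuid

class PlanRegistry:
    """Toy stand-in for a plan management service (hypothetical names)."""

    def __init__(self):
        # plan_id -> {"xml": str, "enabled": bool, "status": str}
        self.plans = {}

    def deploy(self, plan_xml):
        """Store a plan and hand back the unique identifier assigned to it."""
        plan_id = str(uuid.uuid4())
        self.plans[plan_id] = {"xml": plan_xml, "enabled": True, "status": "idle"}
        return plan_id

    def execute(self, plan_id, run_workflow):
        """Mark the plan as executing, run it, and record the outcome."""
        plan = self.plans[plan_id]
        plan["status"] = "being executed"
        try:
            run_workflow(plan["xml"])  # stands in for running the workflow engine
            plan["status"] = "success"
        except Exception:
            plan["status"] = "failure"
        return plan["status"]

registry = PlanRegistry()
pid = registry.deploy("<plan>migrate TIFF to JPEG2000</plan>")
print(registry.execute(pid, lambda xml: None))   # a run that works: success
print(registry.execute(pid, lambda xml: 1 / 0))  # a run that fails: failure
```

The stored identifier is what a PREMIS event would later reference, so that an entity's preservation timeline can point back to the exact plan that triggered an action.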

With RODA already capable of performing preservation tasks (through Preservation actions), adding an external tool with a proven ability to build preservation plans that formally describe requirements, analyse alternative solutions to mitigate preservation risks and allow well-founded decisions lets RODA support even more of the digital preservation processes defined in the OAIS and in ISO 16363 [5] as mandatory for digital preservation and repository trustworthiness.

3.3 Taverna
Taverna is a domain-independent workflow management system, i.e. a suite of tools used to design and execute scientific workflows [16]. It includes the Taverna Engine (used for enacting workflows), which powers both the Taverna Workbench (the desktop client application) and the Taverna Server (which executes remote workflows). Taverna is also available as a Command Line Tool for faster execution of workflows from a terminal without the overhead of a GUI.
Using the Taverna Workbench, one can interactively create workflows. A workflow is defined as "the automation of a business process, in whole or part, during which documents, information or tasks are passed from one participant to another for action, according to a set of procedural rules"14. In practice, and in Taverna's particular case, it can be seen as a way to describe, manage, and share complex scientific analyses. This can be achieved by combining several components called Services, either sequentially or in parallel, which can be of several types, such as:

• Web services (local or remote, in either REST or WSDL format);
• Local scripts (Bash scripts, R scripts);
• Beanshell (Java code snippets);
• Local services (pre-defined Beanshells for specific tasks such as file/XML/text manipulation, database connectivity through JDBC, etc.);
• Sub-workflows.

After finishing the workflow design, it can be run immediately in the Taverna Workbench as well as in the Taverna Server or the Taverna Command Line Tool.
As RODA originally does not provide functionalities for managing and running the Preservation Plans available in the repository, which in this case contain Taverna workflows, changes had to be made. Two new APIs were created:

• Data Connector: REST API for manipulating intellectual entities and associated representations in the repository;
• Plan Management: REST API to retrieve available preservation plans from the repository and to manage their state (enable/disable) and the status of a particular execution (being executed, execution successful or execution failed).

Having mechanisms to manipulate data in the repository as well as to manage preservation plans is not enough, as a mechanism is also needed to process the preservation plan and run it in the Taverna Suite. For this, a tool called the Plan Management Webapp was created. Besides managing the preservation plans available at the repository (creation, edition and deletion), it allows a preservation plan to be executed.
When executing a preservation plan, the Plan Management Webapp retrieves the entire plan from the repository (as initially only metadata is retrieved, for listing purposes), sets the execution status to "being executed", identifies the objects that must be changed and retrieves them from the repository. Then, it isolates the executable plan (i.e. the Taverna workflow), executes it in the Taverna Suite providing the appropriate objects as input, and collects the results. If everything goes as expected and the results need to be sent back to the repository, it does so by using the Data Connector API. Finally, to wrap up, it sets the plan execution status to either "success" or "failure".
On the one hand, having two new APIs makes it easier to integrate RODA with third-party tools/systems. On the other hand, having the possibility of running preservation tasks with Taverna workflows (which are tightly connected to a Preservation Plan) increases compliance with ISO 16363, as it better fulfils the Preservation Planning requirements.
A report on the compliance with ISO 16363 of the system presented above, named the SCAPE Preservation and Watch Suite or SCAPE Preservation Environment, which brings together RODA, Scout, Plato and Taverna, assessed solely from a software technology perspective (therefore ignoring organizational, financial or physical infrastructure requirements), shows that 69 of the requirements are fully supported, 2 are partially supported, 6 are not supported, and 31 are out of scope (ignored) [4]. Almost all of the supported requirements are supported solely by this software suite, and the rest can be supported by manual procedures, which is a vast improvement over previous versions.

4. FUTURE RESEARCH
RODA is used in a pilot of the E-ARK project, which will develop a total of six different pilots. This full-scale pilot will be conducted jointly with the Portuguese Agency for Public Services Reform (AMA)15 and the Instituto Superior Técnico16, as RODA will be the long-term archival solution and these two entities the data providers.
One of the goals is to support a pan-European SIP format, which will make it easier to create Information Packages to be transferred and ingested into archives in a way that is efficient, reliable and applicable across all European countries. Another goal is to enhance the RODA ingest process to be more flexible and customizable, thus making it easier to integrate with third-party systems without the need for human intervention, making the system more scalable.
The pilot will demonstrate that the pan-European SIP structure designed in E-ARK is adequate to support the content types currently supported by RODA (i.e. relational databases, text documents, video, audio and images) and provide a framework for automatic SIP creation by Document Management Systems.
This project will also focus on access to the content, especially complex content such as relational databases, finding scalable methods to provide access to archived databases and also providing methods to allow advanced analysis and reuse of database content by using, for example, data warehousing and OLAP technologies [3].

14 Quoted from http://www.taverna.org.uk/introduction/why-use-workflows
15 http://www.ama.pt
16 http://tecnico.ulisboa.pt

5. ROUTE TO MARKET
As any open source project, RODA has its source code freely available. Also, as is good practice in software development, the RODA source code is versioned. RODA uses Git [15] as its versioning system and publishes its source code on its main repository17 on GitHub.
When a research project starts, a fork [8] of the main source code repository is created, enabling a separate line of development to be carried out. Then, at the end of a particular project, all the developments made are analyzed in order to decide which ones should be integrated into the next official version. This analysis needs to be done as not all of the developments may be widely applicable, and they therefore may not be suitable for a broader audience or might have a severe impact on other features previously developed by the community.
All development lines, the main one and the alternative ones created for the research projects, are published on GitHub and available for the community to try and to develop upon. But only the main version is continuously maintained by the core developers and used as a base for new research projects. This ensures that the development work is focused and that the project doesn't become too fragmented to be maintained.
Regarding the research initiatives and results presented in section 3, which mainly relate to the SCAPE project, Scout, Plato and Taverna are services external to RODA that integrate with it using 3 APIs and one plug-in. These APIs were developed to be repository-system independent, and some of them are implemented for other repository systems18. The APIs themselves were developed to follow standards (like OAI-PMH, PREMIS, Dublin Core, METS), to be flexible (i.e. minimal mandatory information) and to have the least possible impact on the underlying data models. All of these characteristics allowed the APIs to be merged into the main source code and shipped in the next version of RODA. As for the plug-in, software logic that was deemed necessary to be added to RODA itself, the fact that it was implemented as a plug-in, i.e. a modular and contained software component that adds a specific feature, allows it to be easily merged and shipped with the next version.
The future research presented in section 4, which mainly relates to the E-ARK project, describes future developments that have a much deeper impact on the RODA data model and business logic. A change of the SIP format might introduce information restrictions or flexibility that can have an impact on how ingest is done and on what information can, or must, be kept in the system, introducing changes to the data model. This is an accepted risk to the capability to merge these changes into the main source code and bring the results of the project to the users. The risk is mitigated by aligning the objectives of the research project with the roadmap of the open source project itself, assuring that the deep changes are profitable for the whole community, and by testing the changes with real-world cases, keeping in close contact with the target community. The latter is favoured by the nature of the E-ARK project, which is funded by a competitiveness and innovation framework programme that shapes the project objectives not as "blue sky" research, but as the creation of a favourable ecosystem and market growth. This is materialized in the E-ARK project by the focus on pilots, which drive and test the developments on real-world cases, integrating the systems with reference institutions in the European context, ensuring the alignment with the community and testing in real-world scenarios.

17 https://github.com/keeps/roda
18 http://wiki.opf-labs.org/display/SP/Repository+APIs

6. CONCLUSIONS
RODA is a complete digital repository that delivers functionality for all the main units of the OAIS reference model. Even so, and as any software that wants to be successful, it needs to be open to further improvements. Those improvements can be triggered by its own community's needs as well as by changes in the Digital Preservation community, which need to be continuously monitored in order to keep RODA up to date with the best practices of this field. RODA being based on open source technologies and well-established standards such as METS [13], EAD [12] and PREMIS makes it easier to improve. This is shown by the improvements made in the SCAPE project as well as by the improvements that will be made in the E-ARK project. Another great advantage of being open source is the fact that these improvements are freely and immediately available.
But some planning and design is needed to ease the effort of merging and publishing the outputs of research in open source products. Research outputs are many times not production-ready, are domain-specific, and can have an impact on the platform that breaks existing functionality. Easy extensibility of the open source application, using e.g. plug-ins, is an important characteristic to enable a fast inclusion of new research outputs into main versions, especially if the features are domain-specific. Also, the use of APIs for integration that follow standards, are flexible and are designed for the least impact on the data model can be of paramount importance to enable publishing of the features to the end user.
If an impact on the platform is unavoidable, identifying the risk in early stages, accepting that it exists, and aligning the roadmap of the open source project with the research objectives is important. In these cases, ensuring that the developments follow the community's interests and keeping a close connection with the community is needed to ensure the developments fit the community's real-world cases.
In software development, in open source, and even more so in research, change is inevitable and even necessary, but planning, design and communication are very important to keep

Page 87 project in the right path and maintain community adoption. [15] L. Torvalds and J. Hamano. Git: Fast version control system. URL http://git-scm. com, 2010. [16] K. Wolstencroft, R. Haines, D. Fellows, A. Williams, 7. ACKNOWLEDGMENTS D. Withers, S. Owen, S. Soiland-Reyes, I. Dunlop, This work was partially supported by the E-ARK project. A. Nenadic, P. Fisher, J. Bhagat, K. Belhajjame, The E-ARK project is co-funded by the European Commis- F. Bacall, A. Hardisty, A. Nieva de la Hidalga, M. P. sion under CIP-ICT-PSP-2013-7 (Grant Agreement number Balcazar Vargas, S. Sufi, and C. Goble. The taverna 620998). workflow suite: designing and executing workflows of web services on the desktop, web or in the cloud. 8. REFERENCES Nucleic Acids Research, 41(W1):W557–W561, 2013. [1] Reference Model for an Open Archival Information System (OAIS). Technical report, Consultative Committee for Space Data Systems (CCSDS), 2002. [2] C. Becker, H. Kulovits, A. Rauber, and H. Hofman. Plato: A service oriented decision support system for preservation planning. In Proceedings of the 8th ACM IEEE Joint Conference on Digital Libraries (JCDL 2008), 2008. [3] S. Chaudhuri and U. Dayal. An overview of data warehousing and olap technology. SIGMOD Rec., 26(1):65–74, Mar. 1997. [4] M. Ferreira, L. Faria, M. Hahn, and K. Duretec. Report on compliance validation. Technical Report MS63, SCAPE project, 2014. [5] ISO. Space Data and Information Transfer Systems—Audit and Certification of Trustworthy Digital Repositories. ISO 16363:2012, International Organization for Standardization, Geneva, Switzerland, 2012. [6] M. Kraxner, M. Plangg, K. Duretec, C. Becker, and L. Faria. The SCAPE planning and watch suite: supporting the preservation lifecycle in repositories. In iPRES 2013 - 10th International Conference on Preservation of Digital Objects, 2013. [7] C. Lagoze and H. V. de Sompel. The open archives initiative: Building a low-barrier interoperability framework. 
Digital Libraries, Joint Conference on, 0:54–62, 2001. [8] J. Loeliger and M. McCullough. Version Control with Git: Powerful tools and techniques for collaborative software development. ” O’Reilly Media, Inc.”, 2012. [9] PREMIS Editorial Committee. Data Dictionary for Preservation Metadata: PREMIS version 2.0. Technical report, Mar. 2008. [10] S. Strodl, C. Becker, R. Neumayer, and A. Rauber. How to choose a digital preservation strategy: Evaluating a preservation planning procedure. In Proceedings of the 7th ACM IEEE Joint Conference on Digital Libraries (JCDL’07), pages 29–38, New York, NY, USA, June 18-23 2007. ACM Press. [11] The Library of Congress. Common Query Language. http://www.loc.gov/standards/sru/cql/ [Online; accessed 21-October-2014]. [12] The Library of Congress. Encoded Archival Description. http://www.loc.gov/ead/ [Online; accessed 21-October-2014]. [13] The Library of Congress. Metadata Encoding & Transmission Standard. http://www.loc.gov/mets/ [Online; accessed 21-October-2014]. [14] The Library of Congress. Search/retrieve via . http://www.loc.gov/standards/sru/ [Online; accessed 21-October-2014].

Mastering the fuzzy information in the "cloud era": Case Open Source Archive

Liisa Uosukainen
Mikkeli University of Applied Sciences
Patteristonkatu 2, P.O. Box 181, 50101 Mikkeli
[email protected]

Anssi Jääskeläinen
Mikkeli University of Applied Sciences
Patteristonkatu 2, P.O. Box 181, 50101 Mikkeli
[email protected]

ABSTRACT
It is inevitable that the business and industrial world will soon meet one of its biggest challenges so far: how to govern cross-platform, distributed information whose physical location is unknown? Until now the usual answer has been restrictions and rules, but this method does not work with modern users. This paper describes how we at the Mikkeli University of Applied Sciences have started to resolve this dilemma.

Categories and Subject Descriptors
H.1.2 [Models and principles]: User/Machine Systems - human factors, human information processing.
H.3.4 [Information storage and retrieval]: Systems and Software - Distributed systems.

General Terms
Management, Design, Human Factors

Keywords
Open source, digital archive, user orientation.

1. INTRODUCTION
The storage technology in use renews regularly: 5.25" and 3.5" diskettes, HDDs, SSDs and optical media represent the local storage technologies that are or have been used in the home environment. Currently the movement has been towards social media and clouds, so when something is created it is no longer stored locally. iPads can be linked with Dropbox, Android devices use Google Drive as a backup place, and so forth. From the user's point of view this behavior is very welcome, since it releases the user from doing things manually. Furthermore, the content in a cloud is accessible platform-independently, virtually anytime and anywhere.

While end users generally love this new way of sharing and distributing information, the IT management of a company does not. They have been working hard to keep the hardware and software structure homogeneous. It is only a matter of time before the cloud era takes possession of the business and industrial world as well. This has already begun, and IT management will meet one of its biggest conundrums ever: how to govern cross-platform, distributed, unstructured information whose physical location is unknown?

2. DISTRIBUTION, A FRIEND OR A FOE
Even though IT management is struggling with the introduced problem, from the authors' point of view this is the wrong way to start solving the dilemma. The problem has already taken place, since workers are using services, applications and clouds that are not maintained or supported by the IT management. Therefore the question that should be asked is: "Why do our workers use cloud drives or personal devices to manage their information?"

There can be numerous reasons for this behavior, including defective operation, slowness, missing support, usability issues, a fuzzy user interface (UI), an illogical workflow or just generally bad user experience [1]. Furthermore, if a user who is comfortable with single sign-on and modern web UIs is forced to use some awkward old system, there will be resistance and disobedience. For example, the IT management of our university does not support the usage of Google Drive, but everybody is still using it.

Utilization of personal devices or third-party services should not be seen as a threat; instead, it offers many possibilities. A bring-your-own-device mentality can, for example, reduce data administration costs, the number of commercial licenses, hardware costs and maintenance costs [6]. Furthermore, users are happy when they can use devices, services and applications that they are comfortable with.

In spite of the benefits, when something is changed radically, resistance to change will be an issue. Therefore the change should be made in light steps that give users time to adapt. When users feel that they have had a possibility to affect the change, they are more open towards it [4]. This needs to be kept in mind when information management policies are changed or new technological tools are introduced to the end users.

3. SUGGESTED SOLUTION, OPEN SOURCE ARCHIVE
The problem of divergent information is enormous: modern users prefer clouds and use them in spite of policies, while older generations like to keep things as they were. Our solution is to enhance the existing digital archive system so that it still relies on the rules of the older generation but modernizes the utilization and ideology with open source and novel search methods.

Although we speak for open source, we realize that it is not an answer to everything. It cannot, for example, change information security policies or established best practices. However, it offers a possibility to show that things can be done differently, e.g. at half the price and with higher customer satisfaction. Eventually, these observations can be the initiating force that leads to big policy changes at the company level.

The Open Source Archive (OSA) project, which is the basecamp for the solution, is an ERDF project that started in May 2012 and will run until the end of December 2014. The focus of the OSA project is to identify and develop solutions that provide value and a new kind of archival user experience for the current and future users of digitally archived data. The principal aim of the OSA project is to develop a user-oriented archive solution for preserving, managing, and providing access to digital content. The project is carried out by Mikkeli University of Applied Sciences with multiple partners such as archives, software vendors, service providers and educational institutes. Users inside the partner organizations have been the source for the end-user wishes and needs. Figure 1 presents the simplified structure of the OSA solution and the area of our development focus.

3.1 Preservation and data analysis
The key feature of the OSA archiving system is data lifecycle management. We intended to develop a solution that supports the core processes of archives and other memory institutions. Our focus was a customizable solution which ensures that each organization, or even an individual user, can configure the archive to meet their particular needs. The predecessor of the OSA archive system has, for example, been used successfully in creating a family archive [5], and OSA development partially relies on this success.

Metadata creation is another key feature of OSA. It is flexible and configurable for all defined types of digital objects. Metadata creation is automated as far as possible in order to prevent users from doing irrelevant things or making accidental mistakes. Naturally, a possibility to define metadata fields individually for different digital object types (collections, documents, images, audio recordings, moving images, etc.) is also present [3]. During the ingest process, the OSA system captures the metadata of objects and runs normalizing procedures. Furthermore, the usage of entities (agent, place, event, action) is supported in describing the archival objects. The entity-related metadata can be partly defined by using available glossaries or ontologies. The utilization of contextual entities in describing objects links digital objects to each other and therefore makes it easier to search for information in the archive.

Automation, to some extent, can be done in archives. For that purpose we used the workflow engine that has been developed as a bachelor's thesis in this project [2]. Some processes were modeled as micro services and combined into workflows. A set of micro services was created for the pre-ingest and ingest workflows, covering virus checking, technical metadata capturing and normalizing, checksum checking, and preview generation. Workflows can be triggered automatically when files are uploaded to a certain directory. The workflow engine is expandable; new data processing functionalities are fast to create and add to an organization's workflows. With such technology, the actual user involvement during the archiving process can be minimized, and thereby also the mistakes made by the user.

Figure 1. Simplified structure of the OSA solution

3.2 Search and access
The end users access the OSA system via a web client. Search and index features visible to end users can be configured to meet their particular needs. For example, the search form, the columns in the result layout and the facet fields are all fully configurable. With the utilization of linked objects and faceted search, the ingested data is more accessible and easier to understand and reuse. Finally, OSA uses role-based access control to ensure that data is accessible only to permitted users according to the given rights.

3.3 Architecture
The OSA system is based on a service-oriented archiving application written in Java. It provides user-friendly access to the implemented features via a web client. The application is a coupled set of software components used via a common API (Application Programming Interface) [4]. As much focus as possible was given to sustainable software development, and community-driven, open source components were selected. OSA is based on the Fedora Commons (Flexible Extensible Digital Object Repository Architecture) repository module. Fedora Commons provides a framework for modelling digital data and for building archiving services. The OSA application utilizes other open source technologies and tools, such as MariaDB as a relational database, Apache Solr as a search platform, MongoDB as a NoSQL store and OpenLDAP for authentication purposes. Results of the project will be released as open source at the end of the project.

Red Hat Enterprise Virtualization (RHEV) running on blade servers was chosen for building and managing the cloud IaaS (Infrastructure as a Service) environment. RHEV is based on the Kernel-based Virtual Machine (KVM) hypervisor and the oVirt open virtualization management platform [7]. Virtualization is utilized to ensure scalability and capacity for future development and services. Finally, all original files of ingested objects will be stored on tape drives for long-term storage.

4. CONCLUSIONS
The work on the OSA system is still under way, and this paper described the current development phase. The most important aspects of this paper are the highlighted distribution problem and the positive feedback received concerning the suggested OSA solution. We have utilized the information gained from the users to automate the ingest workflow as far as possible. We suggest that the end users should be brought into the process as soon as possible, regardless of the area of development. The end users commonly have a better understanding of what they want than a bunch of regulators or designers sitting in an ivory tower. Generally speaking, from the authors' point of view there are only two ways to manage the inevitable change to the 21st century: resist it to the bitter end, or go with the flow.

5. REFERENCES
[1] Cooper, A. The Inmates Are Running the Asylum. Sams Publishing, 2004, USA.
[2] Kurhinen, H., Lampi, M. 2014. Micro-services based distributable workflow for digital archives. In Proceedings of Archiving 2014 (Berlin, Germany, May 13-16, 2014).
[3] Lampi, M., Palonen, O. 2013. Open Source for Policy, Costs, and Sustainability. In Proceedings of Archiving 2013 (Washington DC, USA, May 13-16, 2013).
[4] Lowdermilk, T. User-Centered Design. O'Reilly, CA, 2013.

[5] Uotila, P. 2014. Using a professional digital archiving service for the construction of a family archive. In Proceedings of Archiving 2014 (Berlin, Germany, May 13-16, 2014).
[6] Zielinski, D. Bring Your Own Device. HRMagazine, Vol. 57, 2, 2012.
[7] Chen, K. Red Hat Enterprise Virtualization - White Paper. http://www.redhat.com/en/files/resources/en-rhev-idc-whitepaper.pdf
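Editor's illustration: the pre-ingest chain of micro services described in section 3.1 of the paper above (virus checking, technical metadata capture, checksum checking) can be sketched as a pipeline of small functions. This is a hedged sketch in Python; the function and variable names are invented for illustration and are not the OSA workflow engine's actual API.

```python
import hashlib
from pathlib import Path

# Each micro service is a small function taking a file path and a metadata
# dict and returning the (possibly extended) dict. Names are illustrative.

def virus_check(path: Path, meta: dict) -> dict:
    # Stub: a real service would call out to an external scanner.
    meta["virus_checked"] = True
    return meta

def capture_technical_metadata(path: Path, meta: dict) -> dict:
    stat = path.stat()
    meta.update({"filename": path.name, "size_bytes": stat.st_size})
    return meta

def checksum(path: Path, meta: dict) -> dict:
    # Stream the file in chunks so large objects do not exhaust memory.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    meta["sha256"] = h.hexdigest()
    return meta

PRE_INGEST = [virus_check, capture_technical_metadata, checksum]

def run_workflow(path: Path) -> dict:
    """Chain the micro services over one uploaded file."""
    meta: dict = {}
    for service in PRE_INGEST:
        meta = service(path, meta)
    return meta
```

Because each step is an independent function, new services (e.g. preview generation) can be appended to the chain without touching the others, which mirrors the expandability the authors describe.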

Integration of Records Management and Digital Archiving Systems: What Can We Do Today?

Pauline Sinclair, Alan Gairey, Robert Sharpe
Preservica Ltd
26 The Quadrant, Abingdon Science Park, Abingdon, OX14 3YS, UK
+44 1235 555511
[email protected], [email protected], [email protected]

ABSTRACT
Archival organizations can receive digital content from a wide variety of sources. However, the single most common is probably records management systems. The interface between records management and digital archiving systems remains one where there is considerable impedance, for a number of reasons, in particular:
- Records need to be appraised. Should this be done within the records management system or after export?
- The files in which records are manifested may become obsolete before the records have reached the end of their active lifetime. This implies some form of digital preservation is needed, which in turn implies a role for the functionality of digital archiving systems for active records, whether or not these records are of long-term archival merit.
- There are a number of technical mismatches between systems. In particular:
  - There is no single export format for records management systems, so it is not clear how to export them. This includes how to export logical structures (e.g., hierarchy of records), physical structures (e.g., arrangements of files that are needed for files to work coherently in a given technology) as well as metadata.
  - There is no single metadata schema for records management systems. Instead, structural, descriptive and technical metadata may be described in arbitrarily complex ways.
  - There is no universally agreed import format or metadata schema for digital preservation systems.
- The need for digital archiving systems to receive input from a number of heterogeneous systems.
Standardisation, through projects such as E-ARK, may eventually produce solutions solving all or a number of the above points. However, such programmes will last many years, and adoption of any outcomes will take many years more. In the meantime, records will still need to be received and processed by archival organisations. What is the best way of proceeding today? Can a few simple things be done in the short term to make things easier? This paper will look at each impedance point in turn and discuss what can be done today and what could be done in a year or two with limited effort.

Categories and Subject Descriptors
D.2.12 Interoperability

General Terms
Design, Experimentation, Theory.

Keywords
Digital preservation, records management, integration.

1. INTRODUCTION
Typically archives have many sources of content, but probably the most common is records management systems. This holds true in both the analogue and digital worlds. In the analogue world, the traditional model for integrating records management and archiving is that archiving is something that happens at the end of a record's life. Archiving is one disposal option, alongside destruction and transfer to another records management system or organisation.

While the traditional, end-of-life archiving model has the advantage of being simple and easy to implement, it has the disadvantage that in the digital world the files in which records are manifested may become obsolete before they are archived to a digital preservation system. This is only a problem for records with long retention periods, i.e. "long enough to be concerned with the impacts of changing technologies, including support for new media and data formats, or with a changing user community" [9].

2. DIGITAL PRESERVATION OF ACTIVE RECORDS
To ensure continued access to digital records with long retention periods, it is necessary to integrate digital preservation with electronic records management, but how? There are three possible models:

- Synchronised archiving, where all records are held in parallel in a digital preservation system (DPS) as well as the electronic records management system (ERMS).
- Syndicated archiving, where all content (records and archived material) is held in the ERMS and the DPS just holds preservation metadata and initiates and provides preservation services (characterisation, integrity checking, and preservation planning and action).
- Integrated archiving, where the DPS acts as the 'back end' for the ERMS, providing record storage in addition to preservation services, while the ERMS handles capture and access.

While all three address the technological obsolescence of records, the complexity of the solution decreases from first to last. In the synchronised archiving model, records management functionality (retention periods and legal hold in particular) has to be added to the DPS, and coordination between the two systems is required on record capture, security, access, legal hold and disposal. In the syndicated archiving model, the ERMS has to be modified to include some digital preservation functionality (i.e. handle multiple copies and representations) and to coordinate deletion with the DPS. Finally, in the integrated archiving model, the records management and digital preservation concerns are separated, thus necessitating fewer modifications to both systems. The ERMS captures records and hands them over to the DPS to store and preserve; when records need to be deleted, the ERMS initiates the process and the DPS implements it. Access to records is through the ERMS, with the DPS merely supplying an appropriate representation; the only direct access to the DPS is for preservation planning and action.

3. TECHNICAL MISMATCHES BETWEEN SYSTEMS
3.1 Export Format
There is no standard export format for ERMSs. Such a format should set out how to export logical structures (hierarchies of records (e.g. Series, File), known as aggregations, that provide context), physical structures (i.e. arrangements of electronic files that are needed for files to work coherently in a given technology), and metadata (providing context and technical information). The physical structure of a record, together with any associated technical metadata, is required to maintain access to the record, while the context is necessary to maintain the understandability of the record.

The International Council on Archives' principles and functional requirements for an ERMS (ICA-Req) [7], which is the basis for the ISO standard (ISO 16175) for Principles and Functional Requirements for Records in Electronic Office Environments, just recommends using open formats. However, it does provide two examples: the Australasian Digital Recordkeeping Initiative's Digital Record Export Standard [2] and UN/CEFACT's Record Exchange Standard Business Requirements Specification [11]. Both standards provide a very high-level definition of what a Submission Information Package (SIP) used to transfer records and their associated metadata should contain. However, neither specifies the structure of the records to be exported, and such a specification is necessary to support interoperability. In addition, neither standard mandates the use of a specified metadata schema, which is also necessary to support interoperability.

3.2 Export Metadata Schema
In 2002 The National Archives (TNA) in the UK published a metadata standard for ERMSs [10], with an XML schema to follow the final version of the standard. By the time the schema was issued, TNA had decided to move away from testing and certifying ERMSs, so the schema was never adopted widely. MoReq2 [4] and now MoReq2010 [5] have defined an XML schema for exporting metadata from a records management system in order to support interoperability. Unfortunately, neither standard has been adopted widely; although several software vendors have announced their intentions to become compliant with either MoReq2 or MoReq2010, only one has done so (Fabasoft Folio Governance), while another is using the XML schema (Automated Intelligence's Compliant SharePoint for Office 365). As yet, there is no single metadata schema for records management systems, although the MoReq2010 schema could fulfil this role if its use becomes widespread.

3.3 Import Format and Metadata Schema
Just as there is no standard export format and metadata schema for records management systems, there is no universally agreed import format or metadata schema for digital preservation systems. OAIS [9] defines the information that a SIP should contain, but does not constrain its structure. Hence there are as many different SIP implementations as there are digital preservation systems. Examples include BagIt [3] and the VERS Standard Electronic Record Format [8], as well as various proprietary SIP formats. Likewise, there are many metadata standards, with PREMIS (preservation), METS (structural), MODS, and Dublin Core (descriptive) being the most common.

3.4 Common Exchange Format
The need for a universally agreed exchange format is clear: not only would it allow interoperability between electronic records management systems, but it would also smooth transfers to digital archives, which often need to accept deposits from many different records management systems. Such a format would need to specify not only the metadata schema to be used, but also the logical and physical structures of records, in sufficient detail for any DPS or ERMS to import any package of records that conforms to the standard, regardless of which software application exported it. In addition, a transfer protocol is required to ensure a smooth and efficient transfer of material. While the need for an exchange format is clear in the end-of-life archiving model, it is also required for the integrated archiving model in order to ensure that records are preserved correctly. The physical structure of the record must be transferred, together with any associated technical metadata, in order to ensure that technology obsolescence can be addressed correctly, while the context (provided by descriptive metadata and the logical structure of the record) is required to ensure that the record (an information object) is interpreted correctly. The context could be held in the ERMS for as long as the ERMS continues to provide access to the contents of the ERMS and the DPS, but would need to be transferred to the DPS at the end of the ERMS's life.
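Editor's illustration: of the SIP formats mentioned in section 3.3, BagIt [3] has the simplest structure to show concretely. The sketch below writes a minimal BagIt-style package in Python; the helper name and payload are invented, and only the fixed file names (bagit.txt, the data/ payload directory, and a manifest-<algorithm>.txt payload manifest) follow the v0.97 layout.

```python
import hashlib
from pathlib import Path

def make_bag(bag_dir: Path, payload: dict) -> None:
    """Write a minimal BagIt 0.97-style bag: payload files under data/,
    a bag declaration, and a SHA-256 payload manifest.
    `payload` maps file names to bytes (illustrative helper, not a full
    implementation of the specification)."""
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True, exist_ok=True)

    manifest_lines = []
    for name, content in payload.items():
        (data_dir / name).write_bytes(content)
        digest = hashlib.sha256(content).hexdigest()
        # Manifest lines: checksum, whitespace, path relative to the bag root.
        manifest_lines.append(f"{digest}  data/{name}")

    # Bag declaration, required at the top level of every bag.
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n")
    (bag_dir / "manifest-sha256.txt").write_text(
        "\n".join(manifest_lines) + "\n")
```

A receiving DPS can then verify fixity before ingest by recomputing each digest and comparing it against the manifest, which is exactly the interoperability hook such package formats provide.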

The EC-funded E-ARK project [1] aims to develop a suitable exchange format and protocol; however, the project has several years yet to run, so this is not an immediate solution.

4. APPRAISAL
Records need to be appraised for their archival value, but should this be done within the records management system or after export? Traditionally, a records manager would assess which records need to be disposed of to the appropriate archive when setting the retention schedule for a record or aggregation, and then on arrival at the archive the records might be appraised a second time to determine whether they met the archive's acquisition policy and so should be accessioned.

Given the volume of electronic records being created in the modern, digital era, it is pragmatically not possible to appraise individual records or archival accessions at the item level. Therefore appraisal has to occur at a higher level, both in the ERMS and on accession to an archive. This has knock-on consequences in that more ephemera, such as personal emails mixed in with business email, is getting into archives, which in turn may affect whether a record series can be opened up (because it may contain sensitive, personal data). Unfortunately, there is no clear solution to this problem, either today or on the horizon.

5. EXAMPLE IMPLEMENTATION
Currently it is only possible to implement the end-of-life archiving model, and an example of this is described in section 5.1. However, this model only ensures the continued accessibility of records once transferred to the archive at the end of their active lifetime; it does not ensure the continued accessibility of long-lived records in the ERMS. Therefore we are working towards implementing the integrated archiving model, and a possibility for doing so is outlined in section 5.2.

5.1 Current Capability
A current example of the end-of-life archiving model is Preservica's ability to ingest document libraries exported automatically from a Microsoft SharePoint server. Document libraries can be exported from a SharePoint server manually via the Central Administration Console. However, SharePoint also allows retention policies to be set on items in a library. A policy can be set so that, following the expiry of the retention schedule on an item, an action is triggered automatically. (These retention policy features allow us to consider SharePoint as a candidate records management system.)

Preservica supplies a SharePoint solution file that can be installed and deployed via the Central Administration Console. This solution automates the export of a document library, and can be selected as the action to be triggered when a retention schedule expires. Preservica can be configured to watch the network location to which SharePoint exports document library packages. As soon as a package file appears, Preservica can process it for ingest. The Preservica ingest workflow preserves the folder structure of the document library, as well as all metadata associated with the folders and files within the library.

5.2 Future Capability
A likely candidate solution for the integrated archiving model is one built around the Content Management Interoperability Services (CMIS) protocol [6]. Many records management systems (such as FileNet, HP TRIM, Alfresco, EMC Documentum, and SharePoint) already implement this protocol to some extent. The Preservica DPS also implements part of the CMIS protocol. By extending the CMIS implementation in both the ERMS and the DPS, an integrated archiving solution can be achieved, where the DPS can be considered as a remote repository "hosted" within the ERMS. Content added to the ERMS is then "preserved at creation" by calling the relevant CMIS operations to store the content directly in the DPS.

Several challenges remain for this kind of integrated archiving solution, mainly the fact that the CMIS protocol is built around the synchronous handling of individual files, while a DPS typically deals with the asynchronous processing of SIPs (i.e. related groups of files). Nevertheless, the application of suitable constraints to govern this data-model mismatch, plus the ubiquity of CMIS within ERMSs, means that this remains the most suitable candidate solution for integrated archiving.

6. REFERENCES
[1] Aas, K., Bredenberg, K., Delve, J. 2014. Integrating Records Systems with Digital Archives: Current Status and Way Forward. In Proceedings of the International Congress on Archives (Girona, Spain, 13-15 October 2014). Retrieved 21 October 2014 from http://www.girona.cat/web/ica2014/ponents/textos/id131.pdf.
[2] Australasian Digital Recordkeeping Initiative. Digital Record Export Standard. ADRI-2007-01-v1.0. 31 July 2007. Retrieved 16 September 2014 from http://www.adri.gov.au/content/products/digital-record-export-standard.aspx.
[3] Boyko, A., Kunze, J., Littman, J., Madden, L., and Vargas, B. 2014. The BagIt File Packaging Format (V0.97). Network Working Group. 28 January 2014. Retrieved 21 October 2014 from https://tools.ietf.org/html/draft-kunze-bagit-10.
[4] European Communities. 2008. Model Requirements for the Management of Electronic Records. MoReq2 Specification. ISBN 978-92-79-09772-0. DOI 10.2792/11981. Retrieved 21 October 2014 from http://moreq2.eu/attachments/article/189/MoReq2_typeset_version.pdf. MoReq2 XML schema retrieved 21 October 2014 from http://moreq2.eu/moreq2.
[5] European Commission. 2011. MoReq2010: Modular Requirements for Records Systems. Version 1.1. December 2011. ISBN 978-92-79-18519-9. DOI 10.2792/2045. Retrieved 21 October 2014 from http://sysresearch.org/moreq/files/moreq2010_vol1_v1_1_en.pdf. MoReq2010 XML schema retrieved 21 October 2014 from http://sysresearch.org/moreq/index.php/specification.
[6] OASIS Committee Specification 01. 12 November 2012. Content Management Interoperability Services (CMIS) Version 1.1. Retrieved 21 October 2014 from http://docs.oasis-open.org/cmis/CMIS/v1.1/cs01/CMIS-v1.1-cs01.pdf.
[7] Principles and Functional Requirements for Records in Electronic Office Environments. 2008. Retrieved 16 September 2014 from http://www.adri.gov.au/content/products/electronic-office-environments.aspx.
[8] Public Record Office Victoria. 2003. VERS Standard Electronic Record Format. PROS 99/007 (Version 2) Specification 3. 31 July 2003. Retrieved 21 October 2014 from http://prov.vic.gov.au/wp-content/uploads/2012/01/VERS_Spec3.pdf.
[9] The Consultative Committee for Space Data Systems. 2012. Reference Model for an Open Archival Information System (OAIS). CCSDS 650.0-M-2. Magenta Book. Retrieved 28 August 2014 from http://public.ccsds.org/publications/archive/650x0m2.pdf.
[10] The UK National Archives. 2002. Requirements for Electronic Records Management Systems. 2: Metadata Standard. Retrieved 21 October 2014 from http://www.nationalarchives.gov.uk/documents/metadatafinal.pdf.
[11] UN/CEFACT's Record Exchange Standard Business Requirements Specification. 23 June 2008. Retrieved 16 September 2014 from http://www.adri.gov.au/resources/documents/DRES-BRS-20080623.pdf.

From retention schedules to functional schemes in the French Ministry of Defence

Hélène Guichard-Spica
Service historique de la défense
Château de Vincennes, avenue de Paris
94306 Vincennes cedex
0033141934397
[email protected]

Anne-Sophie Maure
Direction de la mémoire, du patrimoine et des archives
14 rue Saint-Dominique
75700 Paris SP
0033144421027
[email protected]

ABSTRACT
In this paper, we describe our project to design a records management framework based on the analysis of the IT architecture. The aim is to help every actor, in his own field, to manage his data and records properly and legally. This project is the keystone for setting up the long-term archiving plan for the information systems.

Categories and Subject Descriptors
E.1 [Data structures]; H.3 [Information storage and retrieval]

General Terms
Management, Documentation.

Keywords
Functional areas. Information governance policy. IT architecture. Long-term preservation. Records management framework. Retention schedules.

1. INTRODUCTION
The French Ministry of Defence (MoD) has been going through various mutations for the last 20 years. Its size (in 2011/2012, almost 300,000 employees and a budget over 40 billion euros) and its organization (three main entities: the Army's General Staff, including the Army, Air Force and Navy general staffs; the weaponry general department; and the Secretary General department) make its evolution a challenging task.

The main change is the "interarmisation", an inter-army policy started in 1962 and strongly pushed since 2005. The aim of this policy is to share the skills and means of the three armies (Army, Air Force and Navy) in a rational and efficient way. This policy implies drastic staff cutbacks: between 2008 and 2015, the goal is a decrease of 54,000 positions. Ministerial modernization action plans have been launched in order to drive this rationalization, reduce the distinctions between working practices and achieve a greater dematerialization of the business processes.

2. BALARD
The main example of this rationalization is the Balard project. All the departments are to be gathered in the same location, while today they are spread over a dozen places in Paris; 9,300 people will be working in the same buildings. The stakes of these operations are to rethink all the working processes, including records and archives management. The first step was the drafting of a ministerial instruction, in July 2011, defining the records management policy of the MoD. Then an archives action plan was launched in 2012 to re-organise the filing and collecting processes for all records, paper and electronic. It also has a "Balard" part to frame all the archives operations. Such a massive removal implies a lot of archives to appraise and transfer in a short time, especially as the archives storage will be drastically reduced (less than 4.3 linear metres per person).

Appraising these bulks of archives was considered the first priority. All the departments were asked to elaborate retention schedules in order to achieve that goal, avoid uncontrolled destructions and prepare the transfers to the archives repositories. In spite of the advice given and the practical information on the intranet, the retention schedules produced are of uneven quality. They have been made by individual departments and are not the efficient tool expected. They only work for serial records or for departments which do not evolve much and have a small perimeter of action. This method does not allow a pertinent management of the knowledge of the producing departments, nor an appropriate appraisal of the records. Moreover, even if it has helped some departments to appraise their bulks of records, it only takes the electronic records and the IT systems into account in a fragmented way. A new records policy had to be devised and applied to efficiently manage all the records produced and the IT systems.

3. FUNCTIONAL RATIONALISATION
3.1 IT systems archiving strategy
The necessary state governance led by the IT department within the MoD to rationalize its systems implied four main actions:
- a unified and controlled storage, based on the tools provided by the new common technical platform;
- the decommissioning / removal from service of obsolete or redundant IT systems;
- the setting up of a global governance to manage the IT system projects and their lifecycle;
- the implementation of IT urbanization, based on the functional zones designed by a 2007 instruction about the IT architecture.

A guide and a manual were written to explain the RM approach in those actions. As a result, the IT department wrote, in 2013, an instruction requiring each IT system to have an archiving approach at each step of its lifecycle: an RM expert must be designated at the beginning of the project, the preservation need must be analyzed, and a retention schedule and an archiving chart must be written.

3.2 Retention schedules by functional zones
After the semi-failure of the "traditional retention schedules" policy, the records managers realized they had to take into account the IT urbanization and its global and functional approach.

The methodology then elaborated follows the IT architecture. The retention schedules are now based on the functional zones defined by the IT department. They have to be cross-disciplinary to avoid redundancy; they take into account both paper and electronic records, and they clearly identify the document flows and the pilots of the actions or processes.

For each zone, the functional retention schedules present, by functional quarter and block, all the associated processes, their pilots, the typology of the records or data produced, and their retention times. At the same time, all the IT flows are analyzed and compared. The dialogue with the departments involved at each step of the conception is fundamental to the relevance of the project.

This type of functional retention schedule, framed by well-defined and validated processes, is simple to apply to restricted and not too complex areas. But this method can be harder to apply when the rationalization is in progress, and in environments dealing with many businesses, such as the human resources functional zone.

Nevertheless, this work is necessary to succeed in urbanization and information management. It allows the tracking of redundancy and duplication within a functional zone or between several zones. It also, of course, provides the right retention times. This framework gives a global vision of the blueprint of the zone, which allows a better management of the IT systems from an archives perspective.

4. TOWARDS GLOBAL INFORMATION GOVERNANCE
Elaborating the retention schedules for the real estate resources zone has led us to work on the different missions and sub-missions of all the departments involved in the processes. We soon realized that the theory and the definitions described in the overall IT architecture are fairly far from the reality and the real actions. It is necessary to clearly identify the creators of the data (producer services) and the submission services.

The retention schedules are elaborated to fulfill two main requirements: a vision based on the macro processes, and the need for very accurate and detailed information on the processes.

Moreover, the trouble is that the French administration is far from having reached the zero-paper target. So how can we manage mixed files composed of paper and electronic records? How can we deal with huge amounts of paper while also tracking all the information within IT systems?

In this context, the records manager's goal is to capture all the information flows, their contact areas, and their production or validation phases before even considering the record typologies or the specific data. To elaborate an IT retention schedule, the whole information flow (paper and/or electronic) must be taken into consideration.

Once the information flows are tracked, it is easy to spot the deficiencies, either functional or documentary, within the zone.

In order to identify all the data or information produced, it is necessary to audit all the IT systems of the studied zone. Many questions are raised at that point, such as: how to manage the data that are not stored in a single IT system? How to deal with paper records used to complete manually the inputs in the systems?

The audit then gives a complete and accurate mapping of the paper records flows and clearly shows the ruptures in the documentary flows.

Functional retention schedules, combined with an audit of the IT systems of the zone considered, are very efficient tools for RM and global information governance. They also allow a greater dialogue, not only with the records creators, as already existed with "traditional" retention schedules, but also with the IT department and all the services involved in the processes.
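As a rough illustration of the functional retention schedules described above (a hypothetical sketch only, not part of the authors' project; all zone, block and department names here are invented), one entry could record the zone, its block, the process and its pilot, the record typology, the medium and the retention time:

```python
from dataclasses import dataclass

# Hypothetical sketch of one functional retention schedule entry,
# following the fields described in section 3.2: zone, quarter/block,
# process, pilot, record typology, medium and retention time.
@dataclass
class RetentionRule:
    functional_zone: str      # e.g. "real estate resources"
    block: str                # functional quarter / block
    process: str              # associated business process
    pilot: str                # department piloting the process
    record_typology: str      # type of record or data produced
    medium: str               # "paper", "electronic" or "mixed"
    retention_years: int      # retention time before disposal/transfer
    final_disposition: str    # "destroy" or "transfer to repository"

rules = [
    RetentionRule("real estate resources", "acquisitions", "purchase of premises",
                  "infrastructure dept.", "deeds of sale", "paper", 30,
                  "transfer to repository"),
    RetentionRule("real estate resources", "maintenance", "routine works",
                  "infrastructure dept.", "work orders", "electronic", 5,
                  "destroy"),
]

def due_for_disposal(rules, age_years):
    """Return the rules whose retention time has elapsed for records of a given age."""
    return [r for r in rules if age_years >= r.retention_years]

print([r.record_typology for r in due_for_disposal(rules, 10)])  # → ['work orders']
```

Holding both paper and electronic records in the same structure is what lets such a schedule stay cross-disciplinary and cover the whole information flow of a zone, as the method requires.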

5. CONCLUSION
This work is only at its beginning, as it started just this year. Nevertheless, it is clear that these functional retention schedules allow us to set up a global information management by identifying the data that records managers must track, store and preserve. But they also show the records, paper or electronic, that we will not be able to preserve. They give us the ability to capture only the significant information and to preserve it in its context, with the relevant metadata needed to understand it in the future. Archiving therefore becomes the acceptance of a scheduled loss of information, formalized in an archiving contract.

Information Governance with MoReq

Jon Garde
Senior Product Manager at RSD
The Hollies, Breadcroft Lane
Maidenhead SL6 3NU
United Kingdom
[email protected]

ABSTRACT
Many people are asking what the difference is between records management and information governance. This presentation takes that question and applies it directly to the MoReq specification. It asks whether MoReq can be used within an information governance framework, rather than simply as a specification underlying an electronic records management platform. The presentation looks at the place of standards and specifications, such as MoReq, in information governance, both now and in the future. It also reviews how MoReq's core services and modular structure can be used not just as a part of, but rather as the foundation of, a corporate information governance framework.

Keywords
Information Governance, MoReq, Compliance, Records Management, Standard, Specification, Framework, Electronic Records

(Final text not received in time)

A Maturity Model for Information Governance

Diogo Proença
IST / INESC-ID
Lisbon, Portugal
[email protected]

Ricardo Vieira
IST / INESC-ID
Lisbon, Portugal
[email protected]

José Borbinha
IST / INESC-ID
Lisbon, Portugal
[email protected]

ABSTRACT
Information Governance (IG), as defined by Gartner, is the "specification of decision rights and an accountability framework to encourage desirable behavior in the valuation, creation, storage, use, archival and deletion of information. Includes the processes, roles, standards and metrics that ensure the effective and efficient use of information in enabling an organization to achieve its goals". Organizations that wish to comply with IG best practices can seek support in the existing best practices, standards and other relevant references, not only in the core domain but also in relevant peripheral domains. Despite the existence of these references, organizations are still unable, in many scenarios, to determine in a straightforward manner two fundamental business-related concerns: (1) to which extent do their current processes comply with such standards; and, if not, (2) which goals do they need to achieve in order to be compliant.
In this paper, we present how to create an IG maturity model based on existing reference documents. The process is based on existing maturity model development methods. These methods allow for a systematic approach to maturity model development, backed up by a well-known and proven scientific research method called Design Science Research. We will focus on ISO 16363 for the development of this maturity model.

Categories and Subject Descriptors
K.6.0 [Management of computing and information systems]: General – Economics

General Terms
Management, Measurement, Performance, Design.

Keywords
Information Governance, Maturity Model.

1. INTRODUCTION
A maturity model defines a pathway of improvement for organizational aspects and is classified by a maturity level. The maturity levels often range from zero to five, where zero consists of the lack of maturity and five consists of a fully mature and self-optimizing process. Maturity models can be used for assessing and/or achieving compliance, since they allow the measurement of a maturity level and, by identifying the gap between the current and pursued level, allow the planning of efforts, priorities and objectives in order to achieve the goals proposed.

The use of maturity models is widespread and accepted, both in industry and academia [2]. There are numerous maturity models, virtually one for each of the most trending topics in areas such as Information Technology or Management. Maturity models are widely used and accepted because of their simplicity and effectiveness. They depict the current maturity level of a specific aspect of the organization, for example IT, outsourcing or project management, in a meaningful way, so that stakeholders can clearly identify strengths and improvement points, and prioritize what they can do in order to reach higher maturity levels. By showing the outcomes that will result from that effort, they enable stakeholders to decide whether the outcomes justify the effort needed to go to higher levels, which results in better business and budget planning.

The remainder of this paper is structured as follows: Section 2 presents the related work that can influence the development of the maturity model; Section 3 presents the development strategy for the maturity model, as well as a first example of the maturity model based on ISO 16363 and on the levels from SEI CMMI [3]; Section 4 presents the conclusions of this paper. The maturity model presented here is being developed in the context of the E-ARK project (http://www.eark-project.com).

2. RELATED WORK
2.1 Maturity Model Development Method
There are various examples of maturity models developed for the information management and records management areas, as shown in Section 2.2. However, many of these maturity models have been developed in an ad hoc way, with no regard for detailed documentation of development, no comparison with other models, and even without following a process based on best practices from previous maturity model development efforts.

One example of such a method is presented in [1], which is backed by a Design Science Research (DSR) method [4], making it useful both for industry and academia. This method is founded on eight requirements (R1 – R8) [1]:
1. R1 – A Comparison with existing maturity models is presented and clearly argues for the need for a new model or the adaptation of an existing one;
2. R2 – Iterative Procedures are followed to ensure a feedback loop and refinement;
3. R3 – The principles, quality and effectiveness behind the design and development effort of a maturity model should pass through an iterative Evaluation step;
4. R4 – The design and development of maturity models should follow a Multi-methodological Procedure whose use must be well founded;
5. R5 – During the development of a maturity model there should be a clear Identification of Problem Relevance, so that the problem solution can be relevant to practitioners and researchers;
6. R6 – Problem Definition should include the application domain for the maturity model and also detail the intended benefits and constraints of application;

7. R7 – There should be a Targeted Presentation of Results regarding the users' needs and application constraints;
8. R8 – The design of a maturity model must include Scientific Documentation, which details the whole process design for each step of the process, as well as the methods applied, the people involved and the obtained results.

The well-argued claim of these authors is that these fundamental requirements should drive the development of every maturity model. Apart from evaluating well-known models according to these dimensions, the authors also delineate a set of steps to correctly develop a maturity model. It depicts which documentation should result from each step, and includes an iterative maturity model development method that proposes that each iteration of the maturity model should be implemented and validated before going on to a new iteration.

2.2 Maturity Models
This section presents the several maturity models from the Information Management, Records Management and Information Governance domains that can influence the development of the maturity model proposed in this paper. Each maturity model is presented starting with a small description of the model, followed by the aim of the model, its scope, attributes and levels. These attributes further detail the maturity model by decomposing certain aspects of the maturity model domain. Some of the attributes used are sections or principles, although other attributes, such as dimensions, are also used.

2.2.1 Asset Management Maturity Model
The Asset Management Maturity Model originated from an evaluation in the Netherlands to investigate how asset managers deal with long-term investment decisions [5]. This evaluation took into consideration organizations that control infrastructures, such as networks, roads and waterways, and focused on strategy, tools, environment and resources.
In detail:
 Aim: Understand how asset managers deal with long-term investment decisions and provide an improvement path for organizations to improve long-term investment decisions.
 Scope: Management, specifically a subset of management entitled asset management.
 Term used to name the attributes: Dimensions / Category.
 Attributes (4): Strategy; Tools; Environment; Resources.
 Levels: 1 (Initial); 2 (Repeatable); 3 (Defined); 4 (Managed) and 5 (Optimizing).

2.2.2 Records Management Maturity Model
This maturity model was created by JISC infoNet and stands as a self-assessment tool for higher education institutions in England and Wales [6]. It is based on a code of practice and its aim is to help with compliance with this code, although it is independent from the code, and the future plans are to continue development and enhancement independently from it.
In detail:
 Aim: Help higher education institutions to assess their current approach to records management in regard to recommendations issued by the United Kingdom government, and benchmark against other similar organizations.
 Scope: Management, specifically information management.
 Term used to name the attributes: Section.
 Attributes (9): Organizational arrangements to support records management; Records management policy; Keeping records to meet corporate requirements; Records systems; Storage and maintenance of records; Security & access; Disposal of records; Records created in the course of collaborative working or through out-sourcing; Monitoring and reporting on records management.
 Levels: Level 0 (Absent); Level 1 (Aware); Level 2 (Defined) and Level 3 (Embedded).

2.2.3 Digital Asset Management (DAM) Maturity Model
The DAM maturity model builds on the ECM3 maturity model [7]. This model was developed having in mind that the successful implementation of DAM in organizations goes beyond the use of technology. It requires a holistic approach which includes people, systems, information and processes. This maturity model provides a description of where an organization is and where it needs to be, so that it can perform gap analysis and comprehend what it needs to do to achieve the desired state of DAM implementation.
In detail:
 Aim: Improve the success rate of DAM projects in organizations by providing a way of assessing the current state of the implementation, as well as an improvement path for the enhancement of DAM.
 Scope: Management, more specifically Digital Asset Management.
 Term used to name the attributes: Categories / Dimensions.
 Attributes (4/15): People (Technical Expertise; Business Expertise; Alignment); Information (Asset; Metadata; Reuse; Findability; Use Cases); Systems (Prevalence; Security; Usability; Infrastructure); Processes (Workflow; Governance; Integration).
 Levels: Level 1 (Ad-Hoc); Level 2 (Incipient); Level 3 (Formative); Level 4 (Operational) and Level 5 (Optimal).

2.2.4 Enterprise Content Management (ECM) Maturity Model
In order to efficiently deploy ECM solutions, organizations need to plan and develop a comprehensive strategy. That strategy must encompass the human, information and systems aspects of ECM [8]. From a practical view, organizations cannot deal with all the ECM challenges at the same time. As such, organizations need to enhance their ECM implementation step by step, by following a roadmap for ECM improvement. This maturity model provides the tools to build this roadmap by providing the current state of the ECM implementation, as well as a roadmap to reach the required maturity level.
In detail:

 Aim: Build a roadmap for ECM improvement, in a step-by-step fashion, ranging from basic information collection and simple control to refined management and integration.
 Scope: Management, more specifically Enterprise Content Management.
 Term used to name the attributes: Categories / Dimensions.
 Attributes (3/13): Human (Business Expertise; IT; Process; Alignment); Information (Context/Metadata; Depth; Governance; Re-Use; Findability); Systems (Scope; Breadth; Security; Usability).
 Levels: Level 1 (Unmanaged); Level 2 (Incipient); Level 3 (Formative); Level 4 (Operational) and Level 5 (Pro-Active).

2.2.5 Information Governance Maturity Model
This maturity model builds on the generally accepted recordkeeping principles developed by ARMA [9]. The principles provide high-level guidelines of good practice for recordkeeping, although they do not go into the detail of implementing these principles and do not provide further details on policies, procedures, technologies and roles. The point of this maturity model is to address this gap by detailing what a successful implementation of information governance is at different levels of maturity.
In detail:
 Aim: Help organizations understand the standards, best practices and regulatory requirements that enclose information governance, so that they can understand what the successful information governance characteristics are at differing levels of maturity.
 Scope: Governance, more specifically a subset of governance entitled Information Governance.
 Term used to name the attributes: Principles.
 Attributes (8): Accountability; Transparency; Integrity; Protection; Compliance; Availability; Retention; Disposition.
 Levels: Level 1 (Sub-standard); Level 2 (In Development); Level 3 (Essential); Level 4 (Proactive) and Level 5 (Transformational).

3. DEVELOPMENT STRATEGY
This section focuses on the development strategy used for developing the maturity model for information governance. In order to develop it, we will use several references from various relevant domains, such as information management, records management, archival management, asset management and digital preservation. Some of these references include:
1. ISO 14721: Space data and information transfer systems – Open archival information system – Reference model;
2. ISO 16363: Space data and information transfer systems – Audit and certification of trustworthy digital repositories;
3. MoReq2010: Model Requirements for the Management of Electronic Records;
4. ISO 11442: Technical product documentation – Document management;
5. ISO 13008: Information and documentation – Digital records conversion and migration process;
6. ISO 15489: Information and documentation – Records management;
7. ISO 16175: Information and documentation – Principles and functional requirements for records in electronic office environments;
8. ISO 17068: Information and documentation – Trusted third party repository for digital records;
9. ISO 18128: Information and documentation – Risk assessment for records processes and systems;
10. ISO 23081: Information and documentation – Managing metadata for records;
11. ISO 30300: Information and documentation – Management systems for records – Fundamentals and vocabulary;
12. ISO 30301: Information and documentation – Management systems for records – Requirements;
13. ISO 38500: Corporate governance of information technology;
14. ISO 27001: Information security management.

For the purpose of the preliminary exercise in this paper, we will use the Trustworthy Repositories Audit & Certification (TRAC) as the main reference. Its purpose is to be an audit and certification process for the assessment of the trustworthiness of digital repositories, and its scope of application is the entire range of digital repositories. It is based on the OAIS model [12]. The final version of TRAC was published in 2011; it contains 108 criteria that are divided into three main sections: Organizational Infrastructure; Digital Object Management; and Infrastructure and Security Risk Management. A successor version of TRAC, a standard for Trusted Digital Repositories (TDR), was published in February 2012 as the ISO 16363:2012 standard [10].

The maturity model for information governance, depicted in Sections 3.1 to 3.3, consists of three dimensions:
 Management: "The term management refers to all the activities that are used to coordinate, direct, and control an organization. In this context, the term management does not refer to people. It refers to activities. ISO 9000 uses the term top management to refer to people." [11]
 Processes: "A process is a set of activities that are interrelated or that interact with one another. Processes use resources to transform inputs into outputs. Processes are interconnected because the output from one process becomes the input for another process. In effect, processes are "glued" together by means of such input-output relationships." [11]
 Infrastructure: "The term infrastructure refers to the entire system of facilities, equipment, and services that an organization needs in order to function. According to ISO 9001, Part 6.3, the term infrastructure includes buildings and workspaces (including related utilities), process equipment (both hardware and software), support services (such as transportation and communications), and information systems." [11]

These dimensions provide different viewpoints of information governance, which help to decompose the maturity model and enable easy understanding.

For each dimension we have a set of levels, from one to five, where level one shows the initial phase of maturity of a dimension and level five shows that the dimension is fully mature, self-aware and optimizing. These levels and their meaning were adapted from the levels defined for SEI CMMI [3].

To use this maturity model, an organization needs to position itself in the maturity matrix in each of the dimensions. This step is called self-assessment. The self-assessment consists of following a series of predetermined steps in which the organization answers a series of questionnaires that will result in a maturity level. This self-assessment method will also be developed in conjunction with the maturity model, so that they are fully aligned.

With the maturity levels for each of the dimensions that resulted from the self-assessment, the organization can identify the desired maturity level for each dimension and realize the work that needs to be done in order to reach that level. This results in a better understanding of the steps needed to reach the organization's goal, helps to better allocate budget for improving the maturity of information governance, and can even help to substantiate expenditure to top management.

3.1 Management
3.1.1 Level 1 (Initial)
Management is unpredictable; the business is weakly controlled and reactive. The required skills for staff are neither defined nor identified. There is no planned training of the staff.
3.1.2 Level 2 (Managed)
There is awareness of the need for effective management within the archive. However, there are no policies defined. The required skills are identified only for critical business areas. There is no training plan; however, training is provided when the necessity arises.
3.1.3 Level 3 (Defined)
The documentation, policies and procedures that allow for effective management are defined. There is documentation of skill requirements for all job positions within the organization. There is a formal training plan defined; however, it is not enforced.
3.1.4 Level 4 (Quantitatively Managed)
The organization monitors its organizational environment to determine when to execute its policies and procedures. Skill requirements are routinely assessed to guarantee that the required skills are present in the organization. There are procedures in place to guarantee that a skill is not lost when staff leave the archive. There is a policy for knowledge sharing of information within the organization that is described in the training plan. The training plan is also assessed routinely.
3.1.5 Level 5 (Optimizing)
Standards and best practices are applied. There is an effort for the organization to undergo assessment for certification against standards. The organization is seen as an example of effective management among its communities and there is continuous improvement of all management procedures. There is encouragement of continuous improvement of skills, based both on personal and organizational goals. Knowledge sharing is formally recognized in the organization. The organization's staff contributes to external best practice.

3.2 Processes
3.2.1 Level 1 (Initial)
Ingest, archival and dissemination of content are not done in a coherent way. Procedures are ad hoc and undefined; the archive may not even be prepared to ingest, archive and disseminate content.
3.2.2 Level 2 (Managed)
There is evidence of procedures being applied in an inconsistent manner and based on individual initiative. Due to the fact that the processes are not defined, most of the time the applied procedures cannot be repeated.
3.2.3 Level 3 (Defined)
The ingest, archival and dissemination processes are defined and in place. For ingest, it is defined which content the archive accepts and how to communicate with producers; the creation of the Archival Information Package is defined, as well as the Preservation Description Information necessary for ingesting the object into the archive. For archival, preservation planning procedures are defined and the preservation strategies are documented. For dissemination, the requirements that allow the designated community to discover and identify relevant materials are in place, and access policies are defined.
3.2.4 Level 4 (Quantitatively Managed)
The ingest, archival and dissemination processes are actively managed for their performance and adequacy. There are mechanisms to measure the satisfaction of the designated community. There are procedures in place that measure the efficiency of the ingest, archival and dissemination processes and identify bottlenecks in these processes.
3.2.5 Level 5 (Optimizing)
There is an information system that allows for process performance monitoring in a proactive way, so that the performance data can be systematically used to improve and optimize the processes.

3.3 Infrastructure
3.3.1 Level 1 (Initial)
The infrastructure is not managed effectively. Changes to the infrastructure are performed on a reactive basis, when there is a hardware/software malfunction or it becomes obsolete. There are no security procedures in place. The organization reacts to threats when they occur.
3.3.2 Level 2 (Managed)
There is evidence of procedures being applied to manage the infrastructure. There is awareness of the need to properly define the procedures that allow for effective management of the infrastructure that supports the critical areas of the business. There are security procedures in place; however, individuals perform these procedures in different ways and there are no common procedures defined.
3.3.3 Level 3 (Defined)
Infrastructure procedures are defined and in place. There are technology watches/monitoring, there are procedures to evaluate when changes to software and hardware are needed, there is software and hardware available for performing backups, and there are mechanisms to detect bit corruption and report it. Security procedures are defined and being applied in the organization. The security risk factors are analyzed, the controls for these risks are identified, and there are disaster preparedness and recovery plans.
3.3.4 Level 4 (Quantitatively Managed)
There are procedures in place that actively monitor the environment to detect when hardware and software technology changes are needed. The hardware and software that support the services are monitored so that the organization can provide appropriate services to the designated community. There are procedures in place to record and report data corruption that identify the steps needed to replace or repair corrupt data. The security risk factors are analyzed periodically and new controls are derived from these risk factors. There are procedures to measure the efficiency of these controls in treating the security risk factors identified. Disaster preparedness and recovery plans are tested and measured for their efficacy.
3.3.5 Level 5 (Optimizing)
There is an information system that monitors the technological environment, detects when changes to hardware and software are needed, and reacts by proposing plans to replace hardware and software. There is also a system that detects data corruption, identifies the necessary steps to repair the data, and acts without human intervention. To allow for continuous improvement, there are also mechanisms to act upon when the hardware and software available no longer meet the designated community requirements. There is an information system that manages the security and policy procedures and the disaster and recovery plans, which allows for continual improvement. There is a security officer who is a recognized expert in data security.

4. CONCLUSIONS
This paper presented the fundaments of a maturity model for information governance, as well as a state of the art on maturity models surrounding information governance found in the literature. Based on that state of the art and on other references from the archival domain, namely ISO 16363, we developed a maturity matrix consisting of three dimensions and five levels.

Further on, the goal is to analyze other references from different domains, such as records management as detailed before, which will enhance, detail and help develop the maturity model that will be developed in the scope of the E-ARK project. Moreover, there will also be a method to perform a self-assessment against this maturity model, resulting in a toolset consisting of both the maturity model and the self-assessment method. This toolset will help assess the state of information governance in organizations, as well as provide an improvement path that organizations can follow to enhance their information governance practice.

5. ACKNOWLEDGEMENTS
This research was co-funded by FCT – Fundação para a Ciência e a Tecnologia, and by the European Commission under the Competitiveness and Innovation Programme 2007-2013 (E-ARK – Grant Agreement no. 620998 under the Policy Support Programme). The authors are solely responsible for the content of this paper.

6. REFERENCES
[1] J. Becker, R. Knackstedt, J. Pöppelbuß. "Developing Maturity Models for IT Management – A Procedure Model and its Application". In Business & Information Systems Engineering, vol. 1, issue 3, pp. 212-222. 2009.
[2] S. Shang, S. Lin. "Understanding the effectiveness of Capability Maturity Model Integration by examining the knowledge management of software development process". In Total Quality Management & Business Excellence, vol. 20, issue 5. 2009.
[3] CMMI Product Team. CMMI for Services, version 1.3. Software Engineering Institute, Carnegie Mellon University, Tech. Rep. CMU/SEI-2010-TR-034. 2010.
[4] K. Peffers, T. Tuunanen, M. Rothenberger, S. Chatterjee. "A Design Science Research Methodology for Information Systems Research". In Journal of Management Information Systems, 2007.
[5] T. Lei, A. Ligtvoet, L. Volker, P. Herder. "Evaluating Asset Management Maturity in the Netherlands: A Compact Benchmark of Eight Different Asset Management Organizations". In Proceedings of the 6th World Congress of Engineering Asset Management, 2011.
[6] JISC infoNet. "Records Management Maturity Model". [Online]. Available: http://www.jiscinfonet.ac.uk/tools/maturity-model/
[7] Real Story Group, DAM Foundation. "The DAM Maturity Model". [Online]. Available: http://dammaturitymodel.org/
[8] A. Pelz-Sharpe, A. Durga, D. Smigiel, E. Hartmen, T. Byrne, J. Gingras. "ECM Maturity Model - Version 2.0". Wipro - Real Story Group - Hartman, 2010.
[9] ARMA International. "Generally Accepted Recordkeeping Principles - Information Governance Maturity Model". [Online]. Available: http://www.arma.org/principles
[10] ISO 16363:2012. Space data and information transfer systems – Audit and certification of trustworthy digital repositories. 2012.
[11] ISO 9001:2008. Quality management systems – Requirements. 2008.
[12] ISO 14721:2010. Space data and information transfer systems – Open archival information system – Reference model. 2010.

Tecnologia, under project PEst-OE/EEI/LA0021/2013 and by the European Commission under the Competitiveness and
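The staged logic running through the level descriptions in section 3.3, under which an organization sits at the highest level whose criteria, together with all lower levels' criteria, it fully meets, can be sketched as follows. This is an illustrative toy in the spirit of CMMI-style staged models, not part of the E-ARK toolset; all criterion names and scores are invented:

```python
# Toy self-assessment sketch for a staged maturity model (CMMI-style).
# Criterion names below are invented examples, not E-ARK material.

LEVELS = {
    2: ["backups_performed", "security_procedures_defined"],
    3: ["risk_controls_identified", "disaster_recovery_plan"],
    4: ["controls_measured", "recovery_plans_tested"],
    5: ["automated_monitoring", "continuous_improvement"],
}

def maturity_level(satisfied: set) -> int:
    """Highest level N such that every criterion of levels 2..N is met.
    Level 1 (Initial) is the floor when nothing else is satisfied."""
    level = 1
    for n in sorted(LEVELS):
        if all(c in satisfied for c in LEVELS[n]):
            level = n
        else:
            break
    return level

# An organization meeting all level 2 and 3 criteria, but not level 4:
print(maturity_level({"backups_performed", "security_procedures_defined",
                      "risk_controls_identified", "disaster_recovery_plan"}))
```

The design choice worth noting is the `break`: in a staged model a gap at one level caps the assessment there, even if higher-level criteria happen to be met.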

Page 104

Evidence-based Open Government: Solutions from Norway and Spain
James Lowry
Deputy Director
University College London
[email protected]

ABSTRACT
The Open Government Partnership (OGP) has given new impetus to the movement towards government accountability. The inclusion of a commitment on records management in the UK's OGP National Action Plan marks the growing recognition that accountability depends on the availability of authentic and verifiable records. Nevertheless, there is still a great deal of work to be done in advocating the value of records as evidence, so that planning for openness measures, such as open data initiatives, recognises the need to demonstrate the trustworthiness of the information that governments are releasing. Advances are being made in Norway and Spain in bringing together records management concepts and practices with openness initiatives to demonstrate trustworthiness. In these two cases we can see that frameworks of legislation for proactive and reactive disclosure, demonstrable continuums of care for records from creation to disposition, and the application of emerging technologies can offer assurances to governments and citizens that information is being ethically released or withheld, and managed to ensure its authenticity, as a basis for trust and openness.

(Final text not received in time)

Page 105

Can records management be automated?
James Lappin
Thinking Records blog: www.thinkingrecords.co.uk
[email protected]
@Jameslappin

ABSTRACT
This paper examines the reasons why sections of the recordkeeping world (and in particular the US National Archives and Records Administration) are looking to encourage automated approaches to records management. These approaches aim to reduce the burden of records management on end users. Automated approaches are in sharp contrast to the efforts of most records management programmes over the last 15 years to give individual end users the responsibility for the key records management tasks of selecting and filing records.

The paper outlines the different automated approaches on offer and concludes that whilst each of them has merit, none of them yet provides a fully scalable solution to records management in organisations. In particular there is no automated solution currently available to tackle the problem of the build-up of large scale e-mail aggregations in the form of e-mail accounts on servers and in e-mail archives.

As well as evaluating the automated approaches available from vendors in the content management space, this paper also looks at ways in which organisations can configure the way they manage e-mail with a view to having e-mails accumulate in manageable individual or team correspondence files rather than in unmanageable individual e-mail accounts.

The paper discusses three examples of records management systems that have worked well - two from the digital age and one from the paper age - and identifies two common denominators:
• the fact that they each involve some sort of intervention to control and filter the communication channels by which the business whose records they capture is conducted;
• the fact that they capture records that are referred to and relied upon by the people carrying out the work that the system records.

Keywords
Records management

1. INTRODUCTION
Records are multi-faceted. They have many potential users. We may think of a records continuum spreading out in each direction from the person(s) carrying out a piece of work, to their immediate colleagues and line management chain, and then, embracing people further removed in time and/or space: the successor(s) to the person(s) carrying out the work; auditors; legal and compliance colleagues. Depending on the nature of the work the continuum may stretch to external stakeholders such as customers, clients, regulators and citizens; and on further to archivists and the future generation of researchers and historians that they serve.

The purpose of records management is to design systems that capture records in a way that meets the requirements of all or most of the stakeholders on this continuum. The last time records managers were able to design systems at a corporate scale that met the needs of all or most stakeholders was in the paper age, before the coming of e-mail.

The failure of organisations in the post e-mail age to design records systems that meet the needs of all stakeholders has meant that some stakeholders have taken to advocating systems that meet solely their own particular needs, whilst neglecting the needs of other stakeholders on the records continuum. For example:
• The need of an organisation's legal counsel to respond more quickly to litigation requests might drive the implementation of e-discovery software and/or an e-mail archive.
• The need of the National Archives and Records Administration of America to capture for historians a record of the correspondence of senior federal officials drove them to rewrite their e-mail policy (1) and invite Federal Agencies to preserve and transfer the e-mail accounts of key staff.

These are examples of mono-faceted rather than multi-faceted approaches to recordkeeping:
• E-discovery systems have an incredibly powerful indexing and search capability. They enable a legal counsel to create a search string to pull back documents or e-mails created, sent or received by particular named individuals, within a particular time frame, and which include particular words or phrases. But the organisation cannot allow anyone outside of their legal/compliance team to use that search facility. This is because it searches dark data - in particular data in e-mail accounts. To allow all colleagues to search such data would lead to unethical breaches of privacy and confidentiality.
• NARA's capture of significant e-mail accounts may help historians in 75 years' time, but it won't help the immediate colleagues and successors of those post holders. These e-mail accounts will be inaccessible to them unless the organisation can find a way to reliably filter out private and confidential correspondence.

2. THE THREE AGES OF RECORDS MANAGEMENT
2.1 The changing position of the end-user in records management practice

Page 106

In the days before e-mail the best records management systems were built on the belief that the capture of records was too important to be left to end-users. This belief held that it was important that officers/officials did not have the choice of which communications/documents arising from their work were and were not captured as a record. One of the purposes of a records system is to hold those officers/officials to account for how they conduct their work (another is to enable those officers/officials to defend how they had conducted that work). If they can choose what goes onto the record then they can choose to leave off the record any communications/documents that could be detrimental to them.

These beliefs changed after the introduction of e-mail in the mid-1990s. Since the introduction of e-mail organisations have typically stated in their records management policies that it is the responsibility of each individual employee to capture and maintain a good record of their activities. This was often stated in moral terms - individuals were employed by the organisation to do a job and they therefore had a duty to leave behind a good record of that work.

Since 2007 an information governance view has emerged that individual knowledge workers are too inconsistent and poorly motivated to perform records management tasks well, and that organisations would be better off finding ways to automate records management.

To find out the reasons for these sudden reversals in ideas and belief we need to look at the practicalities of records management in these different ages.

2.2 The registry age
In the days before e-mail a gap in time and space existed between post arriving into an organisation and post arriving at the desk of the officer/official to whom it was intended. Organisations used this gap in time and space to interpose records clerks, organised in registries, to file documents and correspondence needed as records.

Post room staff would filter the morning correspondence: envelopes would be opened in the post room and:
• They would send business correspondence to the records clerks in the registries
• They would send post that was obviously personal or promotional in nature direct to the addressee.

The files created and maintained by the registries were relied on by all stakeholders. This meant that omissions in the file would be noticed by the only people in a position to notice them - the people carrying out that work.

2.3 The disruption of e-mail
The coming of e-mail created a rival communications channel to that provided by the postal system. Organisations had no time to plan how to deploy e-mail in a way that would enable them to transparently filter and file correspondence. The network effect meant that as soon as their customers and stakeholders adopted e-mail, then the organisation had to adopt it too. That meant deploying off-the-shelf commercial e-mail packages.

The introduction of e-mail had three major effects:
• it collapsed the gap in time and space between the sender and recipient of a document/communication
• it exponentially increased the volume of correspondence
• it gave individuals a new source of reference, their e-mail account, which meant they had less need to consult the official paper file, less occasion to notice any omissions on that file, and less reason to take action if there were gaps in the file.

2.4 The age of the electronic records management system
The introduction of e-mail meant that organisations lost control of their main communications channel. They tried to regain that control by requiring employees to move documents and correspondence needed as a record into an official file in an electronic records management system.

The organisation would provide employees with a generic definition of what constitutes a record, and expect them all to apply this definition to their own e-mail account. However in practice each individual interpreted that definition differently. The amount of correspondence saved into the system depended on the motivation, awareness and workload of each individual.

Correspondence and documentation built up outside the electronic records management system, and outside of the protection of records retention rules. In particular, correspondence built up in e-mail accounts. Organisations faced a double edged sword:
• if they deleted e-mail accounts promptly they wiped out their own memory, because they were not capturing sufficient e-mails in their electronic records management systems BUT
• if they let e-mail accounts build up they were amassing huge quantities of trivial, private, and personal e-mails alongside the e-mails needed as a record.

Furthermore the one-to-one nature of e-mail communication lent itself to unguarded and sometimes toxic comments that also built up in e-mail accounts. This contrasted with the paper days, when correspondents would moderate their communications in the knowledge that their communications could be read by many different eyes on their way to the addressee.

2.5 The age of automation
An early step in the move to automation was the decision of the US Securities and Exchange Commission (SEC) (2) to require organisations engaged in the trading of financial securities to capture all communications made and received by its traders. Barclay T. Blair (3) has said that this ruling 'singlehandedly created the e-mail archive industry'. In a situation such as the trading floor it would be ridiculous to expect a trader engaged in some kind of misdemeanor to voluntarily declare into a records system the e-mails they used to inform their collaborators of their insider information. The only way to ensure accountability was to capture everything, including trivial and personal correspondence.

Another milestone in the march of automation was the release of SharePoint 2007 in late 2006. The records management model in SharePoint had individual knowledge workers simply able to right click on a document and select the option 'send to record centre'. Administrators were expected to configure rules to enable the SharePoint records centre to organise the documents that were

Page 107

sent to it. Lying behind this model was the belief that knowledge workers had better things to do with their time than engage with the type of corporate records classification that had been the organising principle behind the electronic records management systems whose market share SharePoint eroded.

The most significant step in the march of automation was the passing of the Managing Government Records Directive (4) in the US, which mandated the US National Archives and Records Administration to explore ways of automating records management. In the context of records management NARA defined automation as any move to reduce the burden of records management on end users, by no longer requiring them to take a decision on every single document or e-mail they create or receive. This was the first time that the move to automation had come from within the recordkeeping professions themselves.

3. THE RANGE OF APPROACHES TO AUTOMATION
3.1 NARA's typology of approaches to automation
NARA released a report (5) in 2014 which listed the different ways in which records management might be automated. These approaches can be grouped into two different categories, depending on the way in which they reduce the burden on end-users.

The first category of approaches continues to apply records management disciplines at the document/e-mail message level, but uses a machine rather than humans to determine what needs to be captured as a record and where it needs to be filed/classified. These approaches work by either:
• Defining workflows which automatically capture records at different stages of a process OR
• Using auto-classification (through a machine learning tool or through the definition of rules) to select documents/e-mails needed as a record and to file them.

The second category abandons the attempt to manage records at the document/e-mail message level, and instead manages records at a higher level of aggregation. These approaches involve:
• Applying defensible disposition policies to existing groups of records (for example NARA's acceptance that Federal Agencies could preserve entire e-mail accounts rather than require individuals to select which e-mails met the definition of a record) OR
• Holding records classifications and retention rules in one application, and applying them to objects in the many different systems of the organisation (shared drives, SharePoint, Exchange etc.).

3.2 Vendor support for automation
All of these approaches are feasible, and supported by mainstream tools:
• Enterprise content management (ECM) vendors such as Oracle, Documentum and IBM have long provided sophisticated workflow definition tools with their products.
• ECM vendors such as Open Text and IBM provide auto-classification capabilities as part of their enterprise content management (ECM) suites.
• Content analytics tools and e-discovery tools (such as Nuix, HP Control Point and others) give administrators a dashboard by which they can define parameters for particular types of content (for example documents on a shared drive that are more than seven years old) and trigger a workflow to get the content within those parameters reviewed by the content owners, and then destroyed if the content owners authorise the disposal.
• In-place records management tools such as those offered by RSD and IBM enable an organisation to intervene in systems such as SharePoint and MS Exchange and link objects in those applications (libraries, content types or folders in SharePoint; folders in individual e-mail accounts) to the organisation's record classification and associated retention rules.

4. EVALUATING APPROACHES TO AUTOMATION
4.1 Workflow
Of the above approaches the workflow approach is the one that comes closest to the standard of reliability, comprehensiveness and usability achieved by the registry system approach in the paper age.

For example an insurance company might set up a workflow system for dealing with claims. They might ensure that communications related to claims are channelled into mailboxes specifically created for claims correspondence. They would configure workflows to allocate a claims number to each claim, and to ensure that any subsequent correspondence relating to that claim, and any recordings of voice conversations, are kept together on one claim file.

Note how the insurance company has wrested control over the communications channel between claimants and its staff away from individual e-mail accounts and into mailboxes governed by its workflow system.

The biggest difference between the workflow approach in the digital age and the registry approach in the paper age is that:
• Registry systems in the paper age could scale across all the different activities of an organisation (simply by maintaining an adequate ratio of records clerks to total numbers of staff) BUT
• The definition of workflows is too time intensive to enable an organisation to extend workflow control over the full range of its activities. Unless a work process is relatively predictable and often repeated, there will be insufficient return on investment to justify defining a set of workflows to control the process.

4.2 Auto-classification
There are two methods of auto-classification. The first is machine learning. Machine learning uses sophisticated statistical

Page 108

algorithms to identify patterns within a particular set of documents. It works as follows:
• The administrator gathers together a sufficiently large sample set of documents/e-mails that correspond to a particular category in a records classification
• The sample document set is fed to the machine learning tool
• The tool identifies the common patterns present in each document within the set, and is then able to go out and identify other documents/e-mails that correspond to that category.

The second kind of auto-classification is by the definition of rules. For example a rule might read: 'if a six digit project code appears in the subject line or text of the e-mail then move the record to the file that corresponds to that project code'.

The rule definition approach has:
• The advantage of transparency. It is easier for colleagues to trust an auto-classification tool if you can explain to them precisely the logic by which the tool will work. It is easier to explain the rules that have been written than it is to explain the complex mathematics behind the machine learning tools.
• The disadvantage that it is even more time consuming to define rules for auto-classification than it is to build document sets to define a machine learning tool.

Both forms of auto-classification share a common disadvantage:
• Records classifications tend to be very granular.
• The more granular a classification is, the more nodes it has at the bottom level.
• The more bottom level nodes it has, the more training sets have to be gathered for the machine learning tool, or the more rules have to be defined for the rules engine.

Another complication with records management is that we do not normally apply our classifications directly to documents. Instead we apply them indirectly, via containers/aggregations/folders/files that represent specific pieces of work. These pieces of work emerge as new projects emerge. Ideally we need an auto-classification tool to both:
• Identify which classification a document belongs to, for example whether it arises from an engineering project, a consultation, or from the management of a member of staff AND
• Recognise specifically which engineering project, which consultation and which member of staff it relates to.

In practice if you are going to try to apply auto-classification at a corporate scale, across a wide range of different activities, then you will end up using a 'big bucket' approach which will group records into mega-containers such as 'Environmental policy records' and 'Health and Safety records', rather than granular containers such as 'Wind turbine policy 2012 to 2015' or 'Asbestos records for the HQ building'.

The problem is that it is hard to apply accurate retention rules to big bucket containers. For example if you wanted to apply:
• an engineering retention rule that is triggered by the end of the life of a structure,
• a staff management retention rule that is triggered by the date the person left employment,
• a consultation retention rule that is triggered by the closure of the consultation,
then you have to group records into containers specific to one member of staff, one consultation, and one engineering project.

4.3 Defensible disposition
Defensible disposition is the least intrusive method of automation because in theory it involves no change to the way that content accumulates in an organisation. It simply gives you the tools to apply disposition rules to those accumulations.

The disadvantage of the approach is that some accumulations of content, most notably e-mail accounts, involve such a mixture of the trivial and the significant, the harmless and the toxic, and the personal and the business, that applying a retention rule to such an aggregation may involve unacceptable compromises.

4.4 In place records management
In place records management enables an organisation to maintain very sophisticated records classification and retention rules and apply them in different environments. It is most effective in organisations that are global in scope. Such organisations may need to apply different retention rules to records arising from the same function or activity in different jurisdictions. They are also likely to have a great many different content management systems.

In-place records management approaches intervene when particular events happen, for example when a new object such as a folder or a library or a site is created in SharePoint, or a new folder is created in an e-mail account. The intervention serves to link the object and its contents to the organisation's record classification and retention rules.

It is at its best when working with systems such as SharePoint and ECM systems that have a sophisticated API and a reasonable level of existing organisation. It is less effective with:
• E-mail accounts (if an individual does not use folders in their e-mail account then there may be a paucity of events to trigger the tool to intervene)
• Shared drives (which lack an API to enable the tool to intervene properly).

5. INTERVENING IN THE E-MAIL COMMUNICATIONS CHANNEL
One of the weaknesses of the records management situation in organisations is that content tends to build up in ever larger accumulations. Cassie Findlay pointed out in a recent lecture that this puts records at risk, because the accumulations are so large that eventually sweeping decisions have to be made that affect all content within the aggregation. The most glaring example of this comes with the e-mail accounts that organisations end up applying entirely arbitrary disposition rules to.

At the time of writing e-mail is still the main channel of communications into and out of most organisations. This situation will not persist for ever, but whilst it does persist it is important that we find a way to filter and control accumulations of e-mail.
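The example rule quoted in section 4.2 ('if a six digit project code appears in the subject line or text of the e-mail then move the record to the file that corresponds to that project code') can be sketched in a few lines. This is an illustrative toy, not any vendor's rules engine; the message fields and file plan mapping are invented:

```python
import re
from typing import Optional

# Sketch of the rule-based auto-classification example from section 4.2.
# A six digit code found in the subject or body selects the matching
# file plan entry; anything else falls through to manual handling.
PROJECT_CODE = re.compile(r"\b(\d{6})\b")

def classify(subject: str, body: str, file_plan: dict) -> Optional[str]:
    """Return the file reference for the first known project code found,
    or None if no rule matches."""
    for text in (subject, body):
        for code in PROJECT_CODE.findall(text):
            if code in file_plan:
                return file_plan[code]
    return None

file_plan = {"201401": "Engineering/Project 201401"}
print(classify("Re: project 201401 piling report", "", file_plan))
print(classify("Lunch on Friday?", "", file_plan))
```

This kind of rule is easy to explain to colleagues, which is exactly the transparency advantage the paper attributes to the rule-definition approach; the cost is that every project code and file plan entry has to be maintained by hand.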

Page 109

When we look back at records management history we can see that in the times in which we have been able to control the channels by which recorded information is communicated, we have been able to build reliable and scalable records systems.

We have seen that none of the proposed approaches to automation can yet deal satisfactorily with e-mail:
• NARA's Capstone approach to preserving some e-mail accounts permanently only helps historians in the distant future
• auto-classification only classifies into big buckets (when applied at corporate scale)
• defensible disposition approaches struggle to find a defensible retention period for e-mail accounts
• workflow can only stretch to a small number of processes
• in-place approaches struggle if individuals do not use folders in their e-mail accounts.

The continued failure of vendor offerings to help organisations manage their e-mail should not stop organisations trying to win back control of the way that e-mails accumulate. In this section we explore two different ways that a manageable correspondence file could be filtered from individual e-mail accounts.

5.1 Treating e-mail accounts as correspondence files
In theory an e-mail account is simply the electronic equivalent of the correspondence files that many individuals kept in the hard copy age. The two differences are that:
• In the hard copy age when an individual changed job within an organisation they left any correspondence files behind them for their successor. In the e-mail age most organisations allow an individual to keep that correspondence in their in-box even when they move to a completely different role.
• In the hard copy age private and personal correspondence did not find its way onto a correspondence file, but in the e-mail age private correspondence accumulates cheek by jowl with business correspondence in the same account.

We have seen the relative failure of attempts to get individuals to filter their e-mail accounts into filing structures within electronic records management systems, due partly to the high volume of e-mails such individuals create and receive. One response to this would be to pull back from the insistence on filing e-mails into filing structures, and instead create accumulations of e-mails that are non-toxic and which can be passed on to a post holder's successors.

5.2 Role based e-mail correspondence files
One relatively simple way of filtering e-mail would be to:
• Create a correspondence file for each role in the organisation
• Link each individual's e-mail account to the correspondence file for the role they occupy
• Intervene whenever an individual leaves a post. The purpose of the intervention should be to capture into the relevant correspondence file all the e-mail from the time period which that individual spent in that post, minus any e-mails the individual has flagged as private.
• Repeat the process when the new incumbent to the role leaves. The correspondence file would build up as different post holders occupied and then left the role.

The organisation would need to educate individuals that their e-mail will be passed on to their successor, and would need to give them a means of flagging e-mails that are private and should not be passed on to their successor.

This approach creates a partially multi-faceted record. It extends access to the accumulation of e-mails to an individual's successor in post and their line manager. It could be used for compliance purposes by legal counsel.

5.3 Team based e-mail correspondence files
The United Nations Food and Agriculture Organisation (FAO) went one further than the role based correspondence file (6). They intervened in the process whereby individuals send e-mails:
• When an individual presses 'send' in an FAO e-mail account they are faced with a pop-up.
• The pop-up asks the sender whether or not the e-mail they are about to send is a record.
• If they select that it is a record, then a copy of the e-mail is routed to a record repository.
• In the record repository the e-mail is stored in a correspondence file for the team with which that individual works.
• Each team correspondence file is configured to send a digest e-mail to each member of the team once a day, listing all the e-mails tagged as 'record' by their team mates the previous day.

FAO found that some teams significantly reduced the number of e-mails that they copied to each other, because they knew that by saving an e-mail as a record all their colleagues would become aware of its existence the following day via the digest e-mail.

What is interesting about the FAO approach is that they have paid as much attention to ensuring that the records are actually used and read as they have to ensuring that records are captured. In effect their record system was providing a current awareness tool for colleagues, who could see what their colleagues were working on without the intrusion of being copied into multiple e-mails.

6. CONCLUSION
The current array of automated approaches presents us with a dilemma:
• The approach that is the most effective (the workflow approach) is not scalable across all of an organisation's activities, because of the time and resources necessary to analyse processes in order to build the workflows.
• The approach that is the most scalable (auto-classification by machine learning) achieves that scalability at the expense of a loss of granularity that would see records grouped into 'buckets' that are simply too large to enable us to apply useful retention and access rules.
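The FAO capture-and-digest loop described in section 5.3 can be sketched as follows. This is a toy model of the behaviour described in the case study, not FAO's actual implementation; all names and data structures are invented:

```python
from collections import defaultdict

# Toy model of the team correspondence file described in section 5.3:
# e-mails flagged as records at send time are copied to a team file,
# and a daily digest lists what team mates captured.
# Illustrative only; this is not FAO's system.

team_files = defaultdict(list)  # team name -> captured record e-mails

def send(sender: str, team: str, subject: str, is_record: bool) -> None:
    """The pop-up asks the sender whether the e-mail is a record;
    if so, a copy is routed to the team's correspondence file."""
    if is_record:
        team_files[team].append({"from": sender, "subject": subject})

def daily_digest(team: str) -> list:
    """Digest circulated to every team member, listing captured records."""
    return [f"{m['from']}: {m['subject']}" for m in team_files[team]]

send("alba", "forestry", "Concept note v2", is_record=True)
send("ben", "forestry", "Lunch?", is_record=False)
print(daily_digest("forestry"))
```

The point the sketch makes concrete is that capture and awareness are one loop: the same flag that routes the copy to the team file also feeds the digest, which is what gives colleagues a reason to keep flagging records.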

Page 110

In-place records management tools and content analytic tools are pragmatic approaches to the messy situation that organisations find themselves in, with content scattered over many different repositories. However, neither of these two tool sets has yet provided a solution to the problem of the build-up of large and unmanageable e-mail aggregations.

The best records management examples we have looked at in this paper were the following:

• The registry systems operated in the paper age

• Line-of-business workflow/case-file systems such as the insurance claims systems

• The FAO's use of team-based e-mail correspondence files

These three approaches each had two things in common:

• They each routinely intervened in the communications channel by which individual colleagues sent and received written/recorded communications

• They each captured records that were read and relied upon by the officers/officials carrying out the work that the records arose from

The point about intervening in the communications channel is important. Without that intervention the record system is 'outside of the loop', an after-thought, sitting to one side of the way business is conducted rather than engrained in the way that business is conducted.

Any system will develop imperfections. Sustainable systems have built-in provision for spotting imperfections and correcting them. Records systems must be referred to by end users in order to be sustainable - this is because the colleagues carrying out a piece of work are the only people in the organisation who are in a position to notice gaps in the record and to do something about those gaps. The challenge for automated approaches is this - how do you reduce the burden and responsibility of records management on end-users, whilst still retaining an active role for the end-user in the records system, as record users, and as whistle-blowers on gaps and imperfections?

7. REFERENCES
[1] NARA's Capstone Bulletin. http://blogs.archives.gov/records-express/2013/08/29/capstone-bulletin-issued/
[2] SEC Rule 17a-4. http://en.wikipedia.org/wiki/SEC_Rule_17a-4
[3] Barclay T Blair's quote is in http://barclaytblair.com/2013/06/21/response-to-naras-capstone-email-bulletin/
[4] Managing Government Records Directive. http://www.archives.gov/records-mgmt/prmd.html
[5] NARA report into automation. http://blogs.archives.gov/records-express/files/2014/03/Automated-Electronic-Records-Management-Report-and-Plan_3.6.14_finaldraft.pdf
[6] Case study of the FAO's records system implementation. http://thinkingrecords.co.uk/2013/07/13/faos-approach-to-making-e-mail-manageable-and-shareable/

Page 111

MoReq and E-ARK

Jon Garde
Senior Product Manager at RSD
The Hollies, Breadcroft Lane, Maidenhead SL6 3NU, United Kingdom
[email protected]

ABSTRACT
E-ARK is using as its basis the OAIS model for archiving. This presentation looks at how the records kept in MoReq Compliant Records Systems (MCRS) can be transferred to archives under this model and describes the progress made already (and planned in the future) to integrate and demonstrate MCRS solutions interoperating with E-ARK deliverables, as one of the DLM Forum's contributions to the E-ARK project.

Keywords
MoReq, E-ARK, Archiving, OAIS, Transfer

(Final text not received in time)

Page 112

Is big data governing future memories?

Alessia Ghezzi, Estefanía Aguilar-Moreno, Ângela Guimarães Pereira

All authors: European Commission, Joint Research Centre (JRC), Via Enrico Fermi 2749, Ispra (VA) 21027, Italy
Tel.: +39 0332789244 / +39 0332789632 / +39 0332785340

[email protected] · estefania.aguilar-[email protected] · [email protected]

ABSTRACT
In this paper, we will set the basis for a reflection about the meanings of democratisation of digital memories, looking at how, in the big data era, preservation is currently being moved from the hands of traditional institutions of memory to distributed others.

Categories and Subject Descriptors
K.4 COMPUTERS AND SOCIETY – K.4.1 Public Policy Issues [Ethics]; K.4.2 Social Issues; K.4.m Miscellaneous

General Terms
Management, Documentation, Reliability, Experimentation, Security, Human Factors, Standardization, Theory, Verification.

Keywords
Big Data, Governance, Digital memory, Information management, Democratisation, Citizen Participation.

Knowledge production and its governance are intertwined with memory practices in their various forms, and this therefore calls for a reflection on ethical dimensions. The development and use of Information and Communication Technologies and the hyper-connectivity momentum have led to massive content creation, different forms of knowledge, and also to humongous amounts of data that have become known as 'big data'.

Whereas institutions of memory were dealing with the immateriality of contents, trying to find a solution for managing digital memories appropriately, big data is pushing towards a new meaning for memory making and makers. It is estimated that in 2007 only about 7 per cent of the data produced was analogue [4], the rest being digital, and the phenomenon is progressively and rapidly increasing. In this paper we would like to raise awareness about how collective digital memories could be affected by big data's governance, in terms of the quality of the information and of new societal actors, namely algorithms.

The main characteristics of big data are Volume (which needs advanced architecture to be managed), Velocity (data generated continuously) and Variety (data created in different formats) [2]. As data became big – as well in the definition, not too technical but catchy – a set of technologies and methodologies with great promises and equally great pitfalls [3] is developing.

When characterising big data some authors also include a fourth V, which stands for Veracity, referring to ensuring the integrity of data – in terms of formats and structure – that allows their correct management, but not considering their accuracy or exactness. An impressive amount of data is generated daily from very different sources, as well as used and interpreted by actors skilled enough to handle them. Because the cost of storage has fallen so much, it is easier to justify keeping data rather than discarding them. Therefore, the technical capability of preserving huge amounts of data - instead of applying appropriate appraisal procedures - triggers a lower level of attention to the quality of contents, hinders a critical approach to the governance of future memories, and diminishes the importance of the institutions that were traditionally in charge of managing them: institutions of memory, such as archives.

Whereas some particular qualities, like provenance, authenticity and accuracy, are fundamental to institutional records, these sorts of characteristics seem not to be demanded of big data. In fact, big data predictions rely on a huge amount of inexact data [4], whereas memories, as pieces of evidence, are based on integral and accurate information. The same thing happens to the perceived reliability of the institution (institutions of memory) and the process (memory practices) of preserving records. The truthfulness of records relies on the overarching idea that they are under the control of a trusted (legitimate) authority, which ensures the integrity of the system, its accuracy and reliability. But what are the qualities that make big data trustful? And is there any kind of "trusted authority" behind the data that we can rely on?

Big corporations and some governments have the capacity to store massive amounts of data for future use and processing, posing threats to democracy and fundamental rights, such as privacy. This is especially important if we take into account that "everything about our lives is in the process of becoming data" [8]. From a concrete institutional perspective, given the exponential rise of contents, preserving everything is an unaffordable task, and appraisal becomes more necessary than ever. So we wonder: who is in charge of deciding what data is to be kept or not, and under which criteria? Who is in charge of big

Page 113

data? Who are the actors using and interpreting big data? Under which processes?

Can we consider big data as records? Records are not data: a record is "an account officially written and preserved as evidence or testimony" [7], and this difference should be emphasised in order to avoid considering big data as a main source for memories. Not only institutions of memory but also other institutions dealing with data - like national statistical institutes - have developed a reflection on some critical points in big data management [1]. Can big data interact with, overlap or substitute official data provided by institutional sources? Just relying on big data, are we able to represent the whole spectrum of society, not just the part that is interacting digitally? Are small or underrepresented voices considered if technologies lead us to deal with trends instead of with stories?

We can also envision the risk in the processes for dealing with big data. Corporations, foundations and other new actors have different interests in the process of generating and preserving data. Such processes are based on algorithms created and performed by big technological companies, with different interests that determine and are determining how and what information is being preserved for the memory to be. Consequently, it is necessary to clarify what everybody's part in the digital landscape is, in order to ensure a consistent and accountable government of memories.

In an algorithmically illiterate society, in which just a handful of actors has the knowledge to deal with such amounts of data, these datasets will become black boxes, neither accountable nor traceable. "To prevent this, big data will require monitoring and transparency, which in turn will require new types of expertise and institutions" [3]. So, how should institutions of memory face big data?

In a society collapsed by information, the technology and actors behind big data are offering the added value of concise access to information; yet they also have the capacity to control information, as well as society, in the long term. "The data collected now will have unforeseen uses (and value) in the future" [6]. Following the same pattern, power has been distributed among these actors in pursuit of a desirable democratisation. In fact, democratised memories carry positive things such as opportunities for validity and enhanced access to information, better quality assurance and accountability, public engagement, and extended peer review and co-production. Nevertheless, some other unclear issues regarding new memory actors' goals remain unsolved: there has been a "redistribution of information power from the powerless to the powerful" [5]. Given the fact that big data directly influences decision making, social actors are empowered to collect, analyse and interpret data that may inform decisions - not knowing why but only what, with the risk of losing centuries of practices of social understanding [4]. Algorithmisation and data fetishism could lead citizens to take trends, summarisation and data visualisation produced by powerful actors as facts, instead of as observations organised to respond to specific questions. If big data remains in the hands of a number of companies or governments, it may prevent people from digging into the information, limiting their access under a prism of infographics and data visualisations, instead of traceable knowledge.

Whereas the open debate about the bias and obscurantism of archival procedures triggers attention to a new concept of open and participatory institutions of memory, fewer people seem to be paying attention to the normalisation by big data and technologies of society, namely through the choice of what becomes visible and invisible, sharable and unsharable. Hence, questions of an ethical nature arise: who governs these data - data owners, data brokers? Do we have enough constitutional rights against intentional or unintentional disclosure, dual use and abuse of all the data collected by corporations or public institutions? Could data be disclosed to third interested parties without appropriate consent? For corporations it is manifestly expressed: for commercial use and profit; but could they also be used by public institutions? Are public institutions just interested in demonstrating their accountability and in historical purposes, or could this situation change in the future?

We have to assume that the big data era has just started and is still under development. We do not know yet the scale or the speed of big data's progression, but we have many questions already. No clear rules or principles have been developed for dealing ethically with big data, but four normative ethical values have been formulated: privacy, confidentiality, transparency and identity [6]. Nevertheless, if society wants to govern big data, "who collects, shares and uses data must be made more transparent and accountable" [6]. We argue that the ethics of memories governance has not been appropriately dealt with in this big data context. Our contribution here is to start a proper societal debate about the future of our memory (and knowledge creation and preservation) and how this is being (co-)constructed in the era of big data.

The opinions of the authors in this paper cannot in any circumstance be taken as an official position of the European Commission.

REFERENCES
[1] Donvito, D. 2013. L'opportunità Big data: sfide e prospettive per la statistica ufficiale. ISTAT News, 7 (Feb. 2013), 12. Available at: http://www.istat.it/it/files/2013/02/Big_Data.pdf
[2] IBM. 2013. The four V's of big data. Available at: http://www.ibmbigdatahub.com/infographic/four-vs-big-data
[3] Lohr, S. 2012. How Big Data Became So Big. NYTimes (Aug. 11, 2012). Available at: http://www.nytimes.com/2012/08/12/business/how-big-data-became-so-big-unboxed.html?_r=0
[4] Mayer-Schonberger, V., Cukier, K. 2013. Big Data. John Murray, London.
[5] Pariser, E. 2012. The Filter Bubble: What the Internet Is Hiding from You. Penguin Books, London.
[6] Richards, N.M.; King, J.H. 2014. Big Data Ethics. Wake Forest Law Review (May 2014).
[7] The American Heritage Dictionary of the English Language, Fourth Edition. 2000. Houghton Mifflin Company.
[8] Watson, S.M. 2013. You are your data: and you should demand your right to use it. Future Tense, November 2013. Available at: http://www.slate.com/articles/technology/future_tense/2013/11/quantified_self_self_tracking_data_we_need_a_right_to_use_it.html

Page 114

Access and Preservation in the cloud: Lessons from operating Preservica Cloud Edition

Kevin O’Farrelly, Alan Gairey, James Carr, Maïté Braud, Robert Sharpe, Ann Keen
Preservica, 26 The Quadrant, Abingdon Science Park, Abingdon, UK
+44-1235-555511
Kevin.O’[email protected] · [email protected] · [email protected] · [email protected] · [email protected] · [email protected]

ABSTRACT
The archival community has recently been offered a series of cloud solutions providing various forms of digital preservation. However, Preservica is unique in providing not just bit-level preservation but the full gamut of digital preservation services that, until recently, were available only to organizations using a system installed on-site following on from a complex, and potentially risky, software development project. This "new paradigm" [1] thus offers a zero-capital-cost "pay as you go" model to perform not just bit-level preservation but also "active preservation" [2]. This short paper will describe the practical difficulties of providing and operating such a comprehensive service in the cloud.

A cloud system's advantage is to reduce the need for capital costs (since hardware and software are rented, not bought up front) and system maintenance (since this is provided by the system's provider). To reduce costs further, a system can share multiple organizations' content on a single operational instance. However, this instance must maintain each such tenant organization's isolation (i.e. one organization's content must not be exposed to any others). In addition, each tenancy must be able to control its own processes without being able to compromise those of other tenants. This leads to the need for some degree of tenancy administration (without placing on each tenant a large burden of administration that is best handled at the system level).

The need to move bulk content across the internet as part of ingest cannot be avoided, but the remaining ingest functionality can be performed either prior to upload (through a downloadable client-side tool) or server-side (through comprehensive workflows). Some ingest streams (e.g., web crawling) can in fact be considerably eased by using the cloud, since an organization's local internet bandwidth is no longer relevant.

Other OAIS functional entities (preservation planning, data management, administration and storage) can all be performed without the need to move content across the internet. Access can be provided in a variety of forms, including those suitable for archivists and those suitable for the general public. It is also possible to render content server-side to minimize the need for download.

Importantly, it is also possible to export an organization's entire content, thereby providing a suitable "end of life" route to move to a different digital preservation system.

General Terms
Infrastructure, communities, strategic environment, preservation strategies and workflows, digital preservation marketplace, case studies and best practice.

Keywords
OAIS, Bit-level Preservation, Logical Preservation, Active Preservation, Cloud

1. INTRODUCTION
There has been a recent trend towards deploying and utilizing software systems in the cloud. In particular, digital archiving and preservation solutions are now available in the cloud. Cloud-based software systems (and digital archiving and preservation solutions in particular) have some distinct advantages and disadvantages over local deployment. This short paper compares and contrasts the experiences of developing solutions both on an organization's site and via a shared tenancy system in the cloud.

Note that in this paper, the term 'the cloud' is used to refer to public cloud instances, where services are made available over a publicly available network. While private clouds (i.e. cloud infrastructure operated solely for one organization) are similar to

Page 115

public clouds, many of the issues (legal, hardware provision and elasticity in particular) are different.

2. METHODOLOGY
In order to be able to discuss the general issues that can occur with cloud systems and how it is possible to address them, it is necessary to have experience. This paper relies on Tessella's experience of developing and running both on-site and cloud-based preservation systems (Preservica). Hence, issues are discussed in general first and then (where appropriate) the Preservica solution to these issues is outlined.

Tessella's on-site preservation system (using the SDB software, recently rebranded as Preservica Enterprise) has been developed over about a decade and is deployed on-site by a number of leading archives and other memory institutions around the world. This allows bespoke functionality to be added to the system's core functionality in order to deliver a system that meets the specific, true needs of the organization.

The cloud-based Preservica service was launched in June 2012 and utilizes the same core software. It is deployed within Amazon Web Services cloud offerings.

3. CHOOSING THE CLOUD
There are a number of features that are important in determining whether or not to use the cloud for a digital preservation system.

3.1 Legal constraints
The use of a cloud solution means that content is stored away from an organization's own site. This may (or may not) be an issue depending on the nature of the content stored, the mandate of the organization, and the legislative and regulatory framework in which they operate. The complex topic of intellectual property rights is covered in more detail elsewhere [3].

The single biggest concern seems to be jurisdiction, with, for example, US institutions reluctant to let their content leave the United States and most European institutions reluctant to let their content leave the European Union. To get around this issue Preservica currently (March 2014) is deployed in two separate instances: one on the East Coast of the United States and the other in Dublin in Ireland.

Of course other organizations will have other constraints (e.g., defence contractors are unlikely to be willing to allow their information to be stored in a public cloud) that may prevent them from using the cloud.

3.2 Hardware & Elastic Computing
One of the advantages of cloud systems is that it is not necessary for an organization to purchase or maintain its own hardware. This removes the need for a capital budget and to have to make (often quite technical) purchasing decisions. It also removes the need to decide when it is necessary to perform a hardware upgrade (and to pay the capital cost associated with such an upgrade).

Cloud services are usually elastic. This means it is possible to add additional hardware to expand computing capability. In the case of Preservica the core software works by passing the 'heavy loading' tasks to an array of job servers via a queuing system. This means that both on-site and cloud-based systems are known to scale very well. Of course such scaling comes at a cost, whether it is via purchased, on-site hardware or rented, virtual servers in the cloud. One of the advantages of the cloud is that it is possible to rent servers for just the time that they are needed, meaning that, for example, it is possible to use the servers needed to process a backlog or a temporary ingest surge and then stop paying for them after that point.

In the case of buying a cloud-based service, each user is sharing processing resources with other users. Thus, it is the responsibility of the provider to ensure that sufficient resources are available to cope with steady loads and to deal reasonably with peak demands. Typically this will be monitored via a service level agreement (SLA) determining not just availability but also reliability, whilst also specifying any limitations on, say, processing load that the tenants cannot exceed without sufficient prior agreement (to allow the service provider time to provision for it) and, potentially, payment.

3.3 Tenancies and Tenancy Isolation
Typically a cloud-based, software-as-a-service offering relies on economies of scale, as hardware and administration costs are shared across all clients of the service. However, this means that clients of this service also share the same infrastructure, raising the potential for security breaches.

Hence, each organization utilizing the Preservica service becomes a 'tenant' within a selected instance. It is vital that these tenants remain isolated from each other and are not able to see each other's contents, or to tell what workflows etc. are run by each other. Preservica has undergone extensive design reviews and a rigorous testing program to ensure tenant isolation.

3.4 Exit Strategy
Another very important aspect to consider in choosing a cloud system is how organizations will be able to move between providers. This is important since the cloud is still young and thus can be expected to evolve quickly. In order to be able to gain advantages from these changes, it is important that organizations don't become locked into arrangements that are very difficult to break, either for contractual or technical reasons.

Preservica guards against this by allowing a full export of content with related metadata in a published AIP. This export process can be configured to allow alternative metadata schemas to be used and/or alternative packaging approaches. This allows great flexibility in how to export, and thus in the ability to import into a successor system.

3.5 Capital vs. Revenue Costs
Of course, a lot of decisions need to balance costs with the ideal functionality. Typically, the cost of owning a full OAIS system in the cloud is much lower than the cost of owning and operating a similar system on site. As well as operational costs, there are two big overheads in setting up an on-site system: equipment capital costs and software capital costs. However, in certain circumstances it is possible for the economics to change in favor of an on-site system, even considering these overheads.

The most obvious of these overheads is the capital cost of hardware, especially storage systems. Generally the cost of renting cloud-based hardware is lower than the cost of buying and running an equivalent system on site. However, at high storage volumes the economics of an organization running its own system begin to be comparable to, or even cheaper than, those of using a
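The 'heavy loading' architecture outlined in section 3.2 — tasks pushed onto a queue and consumed by an elastic array of job servers — can be sketched as follows. This is a minimal illustrative model under our own assumptions (queue type, worker count, task names), not Preservica's actual implementation:

```python
import queue
import threading

# Hypothetical job queue: ingest/migration tasks are queued centrally and
# pulled by however many job servers are currently rented.
jobs = queue.Queue()
results = []
lock = threading.Lock()

def job_server(server_id):
    """One elastic worker: keeps pulling 'heavy loading' tasks until the queue drains."""
    while True:
        try:
            task = jobs.get_nowait()
        except queue.Empty:
            # Queue drained: this rented server could now be released,
            # so the organization stops paying for it.
            return
        with lock:
            results.append((server_id, task))
        jobs.task_done()

# A backlog of 20 migration tasks arrives...
for task in [f"migrate-file-{i}" for i in range(20)]:
    jobs.put(task)

# ...so scale out to three job servers, then let them wind down.
servers = [threading.Thread(target=job_server, args=(i,)) for i in range(3)]
for s in servers:
    s.start()
for s in servers:
    s.join()
```

The elastic saving comes from the last step: the worker pool is sized to the backlog and torn down when the queue drains, which is the pay-per-use pattern the paper describes for rented virtual servers.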

Page 116

cloud-provided one. When taken together with the simplified exit strategy, this could lead to a decision to use an on-site solution.

Another potential overhead for an on-site solution is the capital cost needed to procure, develop and configure the system in the first place. Although a cloud system removes the need to pay these costs, by its very nature such a system must be generic. An on-site system, in contrast, can be built to meet an organization's exact needs (ideally based off an existing, flexible starting system). For example, many of Tessella's customers have built systems to completely automate the process of ingesting very high volumes of material, using ingest workflows configured to work with the peculiarities of each source (e.g., to interpret the output of a digitization stream correctly and then ingest it). This can reduce the effort needed for ingest significantly and can produce a very high payback over the use of a more generic system that requires a large amount of intelligent user input in order to interpret the sources for each ingest of new material.

Hence, the decision on whether or not to use the cloud is often a balance between one-off capital costs and on-going revenue costs.

4. STORAGE
Many people associate the cloud with storage. Indeed, a basic requirement of a digital preservation system is to offer bit-level preservation. Cloud-based digital preservation systems allow organizations to make use of the economies of scale offered by storing content using infrastructure beyond the means of most individual organizations. It also means that the operating and administration costs are similarly reduced.

In the case of Preservica, the S3 storage services offered by Amazon Web Services are used by default. These services create multiple copies in geographically separated places and perform their own integrity checking. This allows Amazon to claim 99.999999999% durability, which compares favourably to almost any in-house storage arrangement. However, organizations with a mandate to retain content in perpetuity are, naturally, wary of such claims (not least because, even if it is accepted that the technical risk is extremely low, there is a probability of the system ceasing to exist for other reasons). Indeed some cloud-based storage services have gone bankrupt and thus no longer exist.

To get around this issue, most cloud-based offerings allow organizations to choose to store copies in alternative storage systems. In Preservica's case this can include the ability to hold a local copy using a 'copy home' storage mechanism (using ftp to write content back to hardware controlled by the host organization).

No system can offer a 100% guarantee. Hence, while it is tempting to continue to add more storage options, the ultimate goal will remain unachievable. Some providers do offer an insurance-backed guarantee. However, even here, it must be remembered that, as with other insurance, while a claim might lead to monetary compensation, this will not recover what has been lost, and it will still be necessary for an assessment of the value of what has been lost to be made prior to any claim being paid.

Ultimately, therefore, the appropriate storage policy is a compromise between costs and risks. Preservica allows this balance to be controlled differently based on appropriate criteria. Hence, a storage policy module allows organizations to choose different strategies for different content files (e.g., for digitization streams it might be appropriate to store the high-resolution master images in a cheaper storage system with low access capabilities, such as Amazon's Glacier offering, while storing low-resolution access copies in a highly available storage system such as Amazon S3).

Preservica has methods to allow content to be moved: to allow for changes of policy, because of a change in the perception of risk, to cope with a triggered risk (e.g., failure of a provider), or to optimize costs after a change in pricing. In the latter case it is important to weigh any costs of moving content (e.g., in bandwidth charges) against any potential savings.

5. ACCESS
Another important feature of most cloud solutions and digital preservation systems is access to content. The capabilities of systems vary here, but Preservica has two distinct offerings.

The first is an archivist's user interface. This provides search and browse capabilities and offers a detailed view of the metadata of each entity (collections, records, files, and embedded objects within files) in the system. This includes the ability to view the audit trail and provenance of each entity. For records with multiple representations (e.g., those that have been migrated from one set of technologies to another) it is possible to compare the significant properties of each representation.

The second user interface is intended to be used by the general public to get live access to the parts of the collection they are allowed to see. This user interface deliberately displays only a subset of the available information about each entity (e.g., it excludes the audit trail) and only the representations intended for public consumption.

In addition, both user interfaces are capable of providing server-side rendering to allow users to view content without needing to download it to their device. This is important in a cloud-based environment since downloads come at a cost and, depending on an individual's internet connection, can be slow. It also allows complex technologies to be rendered (e.g., Preservica will render WARC files using the Wayback Machine, which would otherwise require a complex server setup once the individual has downloaded such a set of files).

This approach of having two distinct user interfaces, and therefore two very different user experiences, is an example of the separation of concerns that is a feature of the cloud-based approach. It allows very different user communities to be supported from one system. The on-site approach to this issue has typically been to have separate systems (often from different suppliers), but this is harder in the cloud since the integration is much less efficient if systems are not co-located.

6. OTHER OAIS FUNCTIONAL ENTITIES
While most cloud-based systems just offer bit-level preservation and provide some form of ingest and access, these are only some of the functional entities in OAIS and are thus insufficient to meet its demands. Preservica provides a full OAIS solution in addition to the Storage and Access described above. This has come about owing to the increasing maturity of the functionality of the core product. This ability to bring functionality that was previously confined to on-site systems - with a large bespoke element and significant capital costs - into the cloud has been described as a "new paradigm" [1].
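The storage-policy balance described in section 4 can be illustrated with a toy policy function. The tier names and rules below are our own assumptions for the sketch, not Preservica's actual storage policy module:

```python
# Hypothetical storage-policy rules in the spirit of section 4: pick storage
# tiers per content file, trading access speed against cost and risk.
POLICY = {
    # (content role, needs frequent access) -> list of storage tiers to write to
    ("master", False): ["glacier-like-cold", "local-copy-home"],  # cheap, plus an extra local copy
    ("access", True):  ["hot-object-store"],                      # highly available
}

def storage_tiers(role, frequent_access):
    """Return the storage systems a file of this kind should be written to."""
    return POLICY.get((role, frequent_access), ["hot-object-store"])

# A digitization stream: the high-resolution master goes to cold storage
# (with a 'copy home' replica), the low-resolution access copy to hot storage.
print(storage_tiers("master", False))  # ['glacier-like-cold', 'local-copy-home']
print(storage_tiers("access", True))   # ['hot-object-store']
```

A real policy module would also have to account for the migration costs the paper mentions (e.g., bandwidth charges when a pricing change makes moving content attractive), which a lookup table like this deliberately leaves out.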

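The client-created SIPs that this paper mentions as an ingest route can be imagined, in their simplest form, as a ZIP of content files plus a checksum manifest that ingest quality controls can verify. The sketch below shows that general shape only; the folder layout and manifest name are assumptions, not Preservica's actual SIP specification:

```python
# Minimal SIP-like package: content files plus a SHA-256 manifest so that
# fixity can be verified on ingest. Layout and manifest name are illustrative.
import hashlib
import zipfile

def build_sip(zip_target, files):
    """files: mapping of archive-relative name -> bytes content."""
    manifest = []
    with zipfile.ZipFile(zip_target, "w") as zf:
        for name, data in sorted(files.items()):
            zf.writestr("content/" + name, data)
            manifest.append(f"{hashlib.sha256(data).hexdigest()}  content/{name}")
        # One checksum line per file, in a BagIt-like "digest  path" style
        zf.writestr("manifest-sha256.txt", "\n".join(manifest) + "\n")
```

A downloadable packaging tool of the kind described in section 6.1 would wrap roughly this logic in a user interface and add descriptive metadata.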
6.1 Ingest
A variety of ingest routes are available, including the ability to upload client-created SIPs (which can be created from ad hoc content via a downloadable tool), to create SIPs server-side from uploaded ZIP files, and purely server-side routes (e.g., web harvesting). All ingests pass through rigorous quality controls.

6.2 Data Management
Data management is highly flexible, allowing users to describe the information using a schema of their choice and yet still search, view and edit the information [4]. In addition, it is possible to integrate with some external cataloguing systems.

6.3 Preservation Planning
This includes "Active Preservation" [2]: the ability to perform both technical and conceptual characterization, to determine which material is at risk (either during ingest or at a later date), to determine the most appropriate preservation plan, and then to perform validated format migration at scale. This is controlled via a technical registry [5].

6.4 Administration
If a cloud service is used, it is not necessary for an organization to maintain its own technical administrative staff. This is especially valuable to smaller organizations, since such tasks are often hard to resource. Even larger organizations find it hard to recruit, manage, train (and ultimately retain) technical staff such as database administrators. Sometimes such administration is outsourced to a parent organization (e.g., a regional archive might rely on the central IT provision of the region's government). In these cases it can be hard for the needs of the smaller client organization to be heard and understood by the administrators. Hence, for small and medium-sized organizations at least, there is a distinct advantage in buying a cloud-based service where the administration is performed by skilled and trained administrators who understand the needs of the system.

However, organizations still want (and need) to have some element of control. Hence, Preservica again separates the concerns and distinguishes system-level administration from tenant-level administration. System-level administration involves managing availability, performing database backups, adding new patches and functionality, and so on; this is the responsibility of the service provider (Tessella in the case of Preservica Cloud). Tenant-level administration (i.e., configuring functionality for an organization, determining which local metadata schemas to use, etc.) needs to be controlled by the tenant, and Preservica provides intuitive browser-based user interfaces to do so. This means that each organization can have control without the burden of complex system administration.

7. CONCLUSION
This paper has presented some of the advantages and issues of running digital preservation services in the cloud. It shows that it is possible for this approach to offer a much-reduced entry barrier to organizations performing digital preservation, without the need to compromise on demanding a full OAIS solution (i.e., both logical and bit-level preservation).

A number of technical challenges have been overcome in the development of a cloud-based digital preservation service. They include:
- Enabling a carefully considered exit strategy.
- Allowing multiple storage options driven by an automatable storage policy.
- Allowing different access functionality for different classes of user, especially avoiding the need for download where possible.
- Providing full OAIS functionality on top of storage and access (i.e., not just bit-level preservation).
- Separating system-level administration (carried out by the supplier) from tenant-level administration (carried out by the tenant organization).

8. REFERENCES
[1] Adrian Brown. 2013. Practical Digital Preservation. Facet Publishing, London, UK.
[2] Sharpe, R. and Brown, A. 2009. Active Preservation. In Proceedings of the 13th European Conference on Research and Advanced Technology for Digital Libraries (Corfu, Greece), Lecture Notes in Computer Science, 465-468.
[3] Andrew Charlesworth. 2012. Intellectual Property Rights for Digital Preservation. DPC Technology Watch Report 12-02.
[4] Alan Gairey, Kevin O'Farrelly and Robert Sharpe. 2012. Towards Seamless Integration of Digital Archives with Source Systems. In Proceedings of the International Congress on Archiving (Brisbane, Australia, 20-24 August 2012).
[5] Maïté Braud, James Carr, Kevin Leroux, Joe Rogers and Robert Sharpe. 2014. Linked Data Registry: A New Approach to Technical Registries. Submitted to iPres 2014.

Some problems of professional processing of online social networks archives

Miroslav Novak, Ph.D.
Regional Archives Maribor
Glavni trg 7
SI-2000 Maribor
Tel.: + (386 2) 22 85 013
[email protected]

Tatjana Hajtnik, M.Sc.
Archives of the Republic of Slovenia
Zvezdarska 1
SI-1000 Ljubljana
Tel.: + (386 1) 24 14 200
[email protected]

ABSTRACT
Online social networks generate content with potential archival value that should be treated as archival material. Archivists must therefore pay attention to this kind of content at creators who have already been appraised as creators of traditional (physical) archives. Many archival professional questions arise in the evaluation, acquisition, preservation and use of such records. They are tied not so much to the technological solutions for creating, preserving or acquiring such records as to their complete and authentic form of content with immediate archival value, including its authentic and reliable content and the authentic content of authority records. For these reasons, such content must be verified either during data capture at the creating agencies or during acquisition by archival institutions. Practical acquisitions, however, reveal the complexity of the professional archival treatment of these types of archives.

Categories and Subject Descriptors
H.3.7 [Information storage and retrieval]: Digital libraries – collection, dissemination, standards, system issues, user issues.

General Terms
Management, Documentation, Reliability, Security, Theory, Legal Aspects.

Keywords
Web 2.0, social network (the Internet), evaluation, methods of operation, archival professional principles, long-term digital preservation, use of materials, original decoration, original presentation, modifications to the material.

1. INTRODUCTION
The development of modern online social networks has been intense in recent years, with a vast direct impact on changes in wider formal or informal social communities, such as the "Arab Spring" [1].1 The impact extends to the formation of public opinion and participation in public affairs, to distance learning, to the exchange of scientific and cultural information, to financial transactions, to expressing the views of individuals within certain groups2, or simply to informing about individual events or other entities3. For this reason, the content of online social networks is becoming the subject of professional archival research and discussion, both in terms of appraisal and in the search for long-term storage methods, including an understanding of its integrity, usability and public faith. Let us mention only a few archival professional questions that appear in this context:
- Can records created in online social networks actually be defined as archives? If so, in which segments of social networks do these records truly have archival value?
- How should the basic principles of the archival profession, particularly the principle of provenance and the principle of original order, be implemented in these environments, and how should they be interpreted in specific cases?
- How should the appraisal of records on social networks be carried out with relatively simple technological support?
- How should records with archival value be contextualized with related records that are not defined as archives, thereby reducing the level of the resulting information and communication noise?
- How should the proactive role of the competent archives over these records be implemented, in terms of the quantity, scope and method of administration of individual online social networks?
In identifying archival professional problems and their solutions [6], we will focus only on some basic methodological issues of implementing professional archival principles and on the appraisal of such records.

2. ONLINE SOCIAL NETWORKS IN BRIEF
Online social networks are applications, Web services, platforms or websites that build social relationships between people by means of modern technological solutions [7]. These now support different areas of human activity, including entertainment. From the perspective of archival theory, online social networks are interesting mainly because of their dual nature. On the one hand, they represent the media of modern communication of a multitude

1 Let us mention the Facebook Page »Franc Kangler should resign as a mayor of Maribor (Franc Kangler naj odstopi kot župan Maribora)«, with a total of 39,316 likes [2].
2 Let us mention the Facebook Pages »Barack Obama«, with a total of 42,618,068 likes [3], and »Tina Maze«, with a total of 337,544 likes [4].
3 As an example, let us mention the US National Archives Facebook Page, with 85,125 likes [5].

of individuals on their own behalf or on behalf of a legal person; on the other hand, they are a modern way of expression and information exchange and, ultimately, a form of two-way reciprocal social relations between individuals, or between individuals and various groups. Online networks are communication channels for the transfer of data on entities in real life, and thereby the environment in which data get their meaning and existence. At this level, archivists must follow archival principles known, for example, from communication based on analogue technologies.4

The second, more complex nature of online social networks can be defined as the mapping of real life into its virtual version. Here we face a systemic problem of distinguishing between conceptions and contents such as "fiction", "virtual art", "virtual reality",5 etc. Such archival professional problems are known from the past in the real world, for example in the relationship between a real event or person from the past and its poetical, musical or artistic interpretation.

The third, most complex nature of social networks can be defined at the level of the archival professional processing of authority contents, especially of the fundamental entities of social networks. In the real world, it is possible to identify a natural or legal person by name, title, place of birth, establishment, dissolution or death and other identifiers, which are verified through a variety of legal, judicial and other procedures. In the sphere of online social networks, there are no comparable formal procedures that would ensure the credibility of captured authority content. Individuals can be members of different social networks, where they usually do not have (or have only in exceptional cases) a reliable system of identification across the various environments. Even where such an identifier exists, it can be a problem in the process of forming the authority record and establishing its public faith. The content can be fundamentally false, because the pseudo-identity of a person without connection to a real person was used. The captured authority contents can also be outdated, intentionally or unintentionally malformed, or otherwise misleading.

It is expected that the problem will become even more complex as the number of migrations from one network to another increases and data on persons are stored long-term in one or more source networks without being properly verified and made ready for long-term preservation. From this arises an archival methodological problem of capturing the authority content of "current" real entities, which is not always properly implemented in their "current" virtual entities.

At the same time, we can already find effective methodological solutions for linking the "past" real and the "current" virtual environment in the presentation of archival material at the data level.6 A consequence is the professional recognition that every entity from reality must be unambiguously identified in the virtual environment, following a procedure that includes the validation of the captured data. Individual entities can thus be defined in different ways: for example, with a unique geolocation for spatial entities, the creation of unique virtual profiles for natural and legal persons, the assignment of unique identifiers to authority content, the identification of the number and value of e.g. bitcoins for performed work or payment, etc.

During their "virtual" life, individuals and other entities are confronted with situations similar to those in real life, e.g. fluctuations in the intensity of communication, increases or decreases in the number of connected persons or other entities, changes in the volume of content and in the speed of information exchange, as well as deviations such as cybercrime.7 At the same time, users face problems with implementing substantive corrections, with legal limitations relating to the protection of personal and other data, with data migration from one comparable network to another or onto local media and, last but not least, with the closing and termination of profiles. Users also face problems related to documenting contextual information and data structures. From the perspective of archival theory and practice, issues arise such as how changes are implemented and how they affect the public faith of the considered content within time and space.

3. ARCHIVAL CONTENTS IN ONLINE SOCIAL NETWORKS
It is not easy to answer the fundamental question whether the content of social networks has archival value or just temporally and spatially limited information value. There are many reasons: let us mention only the outstanding dynamics and scope of such systems, the relatively low formalized level of their legal protection, the lack of the necessary historical distance, problems relating to the lack of professional criteria and procedures for their appraisal, etc. Solutions to these problems should be sought in the relations between the physical form of archives and their derivatives on online social networks. The basic focus of archival professional activities should be given to the content; analysing its forms of appearance, however, will obviously need to be treated as a corrective factor of the archival professional solution.

In designing the general archival professional point of view on the content of online social networks, we have to distinguish the contents that were "born" in these environments from the contents

4 For example, 294 letters of Franc Miklosich, sent to his brother Moris, preserved by the Regional Archives Maribor [8]. Another example: 248 preserved letters of the soldier Žiga Janko from Motvarjevci, written from the front between the years 1914-1918, also preserved by the Regional Archives Maribor [9].
5 The discussed concepts often refer to identical content, but for the purposes of this paper they are defined as follows: "fiction" includes all online activities such as online games, their results and other mechanisms of expression of individuals (photomontages, fictional events, entities, etc.), even if based on events or facts from real life. The term "virtual art" covers all results of artistic creation or re-creation by individuals or groups, whether or not based on events or facts from the real physical or virtual life of persons, corporate bodies or other related entities. The term "virtual reality" covers all entities that are dealt with in a virtual environment and have their foundation in real persons, corporate bodies or other real or virtual entities that have directly expressed public faith.
6 For example, the presentation of historical maps of the Habsburg Empire and the Austro-Hungarian Empire combined with modern technologies of geographic information systems [10].
7 In this context, let us mention only the protection of the integrity of individuals against abuses such as unauthorized access to data, online child safety, etc.
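The requirement that every entity from reality be unambiguously identified in the virtual environment can be supported, at the data level, by deterministic name-based identifiers derived from validated authority data. The sketch below uses RFC 4122 version-5 UUIDs; the namespace string is a hypothetical example, not an identifier scheme actually used by the archives discussed in this paper:

```python
# Sketch: deterministic identifiers for real-world entities represented in
# a virtual environment, using name-based (version-5) UUIDs. The namespace
# below is a hypothetical example; an institution would register its own.
import uuid

ARCHIVE_NS = uuid.uuid5(uuid.NAMESPACE_DNS, "example-archive.org")

def entity_id(entity_type, canonical_name):
    """The same (type, validated canonical name) pair always yields the
    same identifier, so one real entity maps to one virtual identity."""
    return uuid.uuid5(ARCHIVE_NS, f"{entity_type}:{canonical_name}")
```

Because the identifier is a pure function of the validated authority data, the same person captured from two different networks resolves to a single identity, which is the validation-then-identification procedure described above.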

that are merely derivatives of content from other environments [11].

Already archived contents on other media represent an example of derivatives, with online social networks serving only for their dissemination, promotion, etc. [12] Popular activities of this kind in Slovenia are, for example, the "Document of the Month" prepared by the Archives of the Republic of Slovenia [13] and by the Historical Archives Ljubljana [14]. Online social networks are used not only for the presentation of archival material but also to inform the professional archival public about the various professional activities of archival institutions [15] or professional associations, for the direct exchange of expert opinions between practitioners, or for notification in critical situations.8 Even in these cases, we can define the contents of online social networks as derivatives of archival or non-archival contents from other technological environments.

Unlike derivatives, these environments also contain "digital born" contents. These are formed in the context of the functioning of social networks, where they are also used and where they remain. Archivists must first appraise such contents as archival material [17]. In this context, we have to limit ourselves to those archival contents created by the creators of archives.

In the context of understanding and forming a general archival professional point of view, a survey conducted by the National Archives and Records Administration (NARA) among governmental institutions regarding the use of Web 2.0 is interesting. It shows that information on online social networks does not represent an official and credible source for the investigated state institutions.9 This finding results from the observation that such information is duplicated; the question remains, however, which is the official source of information.10 With online Web 2.0 tools, users simultaneously have the possibility of changing and adapting the appearance of the interface used for reviewing the information; they can improve processes and functionalities and add metadata and other functions. By adding new contextual contents, they change the original context of the information (the information's "look and feel" is not static) and consequently change its value. The investigated institutions warn, at the same time, that the disappearance of information from online social networks would be interpreted in the eyes of the public as avoidance of responsibility or a desire to conceal. Even when the information no longer has business value for the institutions, they believe it is difficult to find real reasons for deleting information from social networks. In this context, we should also mention a document adopted by UNESCO in 2003 [20] that points to a number of factors that threaten the existence of digital records: not only the obsolescence of hardware and software, but also uncertainties about resources, responsibility and methods for their maintenance and preservation, and the lack of appropriate legislation.

In the continuation of the research, we will limit ourselves to the legitimacy of information and data structures provided in the context of online social networks, from the archival and information science point of view.

3.1 Methods of archival content appraisal on online social networks
It is well known that professional issues related to the appraisal of archival content are, as a rule, complex. The answers often depend on well-established professional archival traditions, on established levels of national consciousness and the associated consciousness of cultural property protection, on the financial and technical capacities of archival institutions, etc. In exceptional social conditions, answers regarding the appraisal of archival content also depend on political, military or other repressive mechanisms. In the information technology era, technological solutions represent, among others, an important factor in the (non-)appraisal of relevant content, including archival content.

The evaluation of a creator of physical archival material is carried out in accordance with a law or through the procedure of promulgating an institution as a creator of archives. Where the status of a creator derives from the law, the criteria under which certain legal persons have the status of a creator of public archives must be defined. The same applies to the evaluation of creators on the basis of promulgation. Behind the evaluation criteria and the lists generated on this basis, we can see a multitude of methods by which the criteria or individual evaluation entities can be defined [21].

The same holds for the appraisal of the physical archival material of individual creators. Archivists can help themselves with provisional lists of document categories that always have the character of archival material. When archives in physical form are appraised, such an entity remains unchanged in form, content, scope, etc., regardless of whether we look at it from the perspective of the creator or of the end user. Moreover, in terms of content and logic, the links in the material remain unchanged, "frozen" at a given time and usually also in space.

The evaluation of potential creators of archival material on online social networks can, at this stage of archival professional development, also proceed from the evaluation of real creators, regardless of whether they are defined on the basis of the law or of promulgation procedures.

At the level of the appraisal of the archival content of individual creators within online social networks, procedures and decisions become complicated from the archival methodological point of view. Such contents are often mutually related in terms of content, but sometimes only in terms of context. One can observe that some of these connections are very weak and exist only for a short time, while others are strong and defined as fixed links. Weak contextual links are, e.g., advertisements and other parallel communications that occur and thereby dynamically change the whole presentation of this type of archival content in time and space.

8 Cf. the information regarding the fire in the Archives of Bosnia and Herzegovina in February 2014, submitted via its Facebook page [16].
9 The survey was conducted in 2010 and included the following state institutions: Department of State, Environmental Protection Agency, Joint Staff, NASA, United States Army, United States Geological Survey [18].
10 Institutions have indicated that many online tools (internal and external) have in-built disclaimers making clear that the information has no official character. At the same time, they redirect the user to their official website; for example, the Facebook and YouTube presences of the Department of State, where it is stated: "If you are looking for the official source of information about the Department of Justice, please visit justice.gov." [19]

Another problem arises when we export contents of archival value from the original environment to the local environment. As a rule, only "appraised" contents with solid connections in terms of content are the subject of export, while their weak versions are, as a rule, not exported. From a methodological point of view, the question arises whether archivists should evaluate only the content and its strong contextual links, which is the view of the creator of the content, or whether the evaluation should also take into account the sight and feeling ("look and feel") of a user. A third possibility is for archivists to preserve two forms (appearances) of the same content. This raises a multitude of other issues, to mention only:
- How to proceed if the same content occurs and is represented through two or more different user interfaces?
- How to deal, in the long term, with contents with strong in-built contextual connections where the target content no longer exists (no information) or has been replaced by completely different content (disinformation)?
We have to look for the answers to these and similar questions in the current archival doctrine, according to which archives are preserved in the original, arranged, and generally as unique specimens. We will have to complete this paradigm and develop methods and ways of appraising such archival content, which are not limited to archival content from online social networks only, but extend also to the content of databases and of static or dynamic documents that contain substantial contentual and logical connections.

3.2 Tools for managing archival contents on online social networks
Various services (free of charge and paid) that can be found on the web today offer very good and useful functions for managing content in online social networks, but they are far from ideal. Many of them experiment with the formats, functionalities and capabilities they offer, and may change at any time. They usually arise from commercial activities and are as such particularly problematic, as in the example of the company Backupify.11

Services that enable quite easy management of information on social networks are ContextMiner and SocialSafe. ContextMiner is a framework for the collection, analysis and presentation of contextual information along with the data [23]. It is based on the idea that, when describing an archival object, contextual information helps to substantiate the object or allows its better preservation. It provides tools for collecting data, metadata and contextual information from the web with automated searches. Currently, ContextMiner supports the automated investigation of blogs, YouTube, Flickr, Twitter and the open web. It also collects connection information for YouTube videos from the web. It works by selecting a source for collecting data and contextual information and by entering search strings (queries) or URLs.

With the help of the SocialSafe service [24], we can get a copy of all records that have been posted (even if only written and never published) on Facebook, Twitter, Instagram and other networks.

Technological solutions that enable the preservation of records transmitted to the public through profiles (e.g. on Twitter or Facebook) are also available to individuals. Through a function in the settings of their profile, the user can send a request for data export to the server, and the information is sent to a pre-determined e-mail address in the form of a ZIP file. Thus, for example, archival content is located in the tweets.csv file, which can be accessed through the interface defined in the file index.html; the other files in the folder are necessary for the presentation of the content in the local environment [25].

Websites can also be saved (stored) using "save as": the entire content of the website is stored on disk as an .mht file. It is also possible to use solutions such as capturing content in a static, common JPG format with the help of functions for recording screen content [26]; such content can be captured in a dynamic form such as AVI as well.

However, these solutions are "manual" processes, which quickly become impractical. Websites and social networks are complex sets of pages with links to other websites and contents. It would be difficult to use these "manual" methods if we wanted to keep all changes to the records of social networks: that would mean using them several times a day (how many times?) or even for multiple users (e.g. for all employees within an organization). Nor must we forget the problems related to the traceability of records and their modifications, the problem of proving authenticity, and formats that become obsolete over time so that the records become illegible, etc.

There are also many archival and library initiatives that have begun to engage intensively with the management of records in online social networks [27]. The first example of capturing and storing tweets is the Library of Congress in Washington, to which the company Twitter donated its entire archive from 2006 to 2010 [28]. Later, however, they reached an agreement and continue to preserve the tweets. In early 2013, the Library had about 170 billion tweets in custody, and their quantity is increasing daily: the volume of preserved tweets rose from 140 million a day in February 2011 to almost half a billion tweets per day in October 2012. Any concern about the modest documentary value of tweets is superfluous, since the Library of Congress wrote in its statement that the preservation "will enable the users in the future more comprehensive insight into today's cultural norms and trends". Although the archives were not yet available to researchers, by early 2013 about 400 inquiries from researchers around the world had already been recorded.

Issues related to the long-term preservation of contents from social networks are closely related to the problems and solutions of capturing and storing web sites. There are several known projects in this area carried out by the American Library Association, the Australian National Library and the Library of Congress [29]. The Internet Archive keeps over 430 billion web pages [30] in special archival formats (arc, warc), which can be used only with a special interface, the Wayback Machine, developed by the Internet Archive. It enables browsing of the web archives and the display of captured web sites in a normal web browser. In Slovenia, the National and University Library uses a similar solution [31]. Access to preserved web contents is commonly free, but the problem lies in the fact that they are captured only from time to time, the time of capturing is unknown and unpredictable, and not all the contents of these websites are as accessible as they were in their original form.

11 The company Backupify announced in December 2012 that it would no longer provide backup copies of the data of the social network LinkedIn [22].
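The WARC container format mentioned above is simple enough that its record headers can be inventoried with a few lines of code. The following is a deliberately minimal sketch (standard library only, uncompressed records, no payload handling); production tools such as the Wayback Machine use far more robust parsers:

```python
# Minimal WARC record-header walker: yields the header fields of each
# record and skips payloads using their declared Content-Length.
# Illustrative only; real WARC tooling also handles gzip members,
# revisit records, malformed lengths, etc.
def iter_warc_headers(stream):
    """stream: a binary file-like object containing uncompressed WARC data."""
    while True:
        line = stream.readline()
        if not line:
            return
        if not line.startswith(b"WARC/"):
            continue  # skip blank separator lines between records
        headers = {}
        while True:
            raw = stream.readline()
            if raw in (b"\r\n", b"\n", b""):
                break
            key, _, value = raw.decode("utf-8", "replace").partition(":")
            headers[key.strip()] = value.strip()
        # Skip the record payload using its declared length
        stream.read(int(headers.get("Content-Length", "0")))
        yield headers
```

Such an inventory (record types, target URIs, capture dates) is one way to answer the paper's complaint that capture times are "unknown and unpredictable", at least for archives one holds locally.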

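The tweets.csv file delivered in a Twitter account export of the kind described above can be read for appraisal with standard CSV tooling. The column names in this sketch are illustrative; Twitter's export schema has varied over time, so the header row of a real export should be inspected first:

```python
# Sketch: load a tweets.csv-style export into simple records for appraisal.
# Column names are illustrative; real exports have used varying schemas.
import csv

def load_tweets(fp):
    """fp: an open text file (or file-like object) containing a CSV
    with a header row naming at least tweet_id, timestamp and text."""
    return [(row["tweet_id"], row["timestamp"], row["text"])
            for row in csv.DictReader(fp)]
```

From such records, an archivist can, for example, establish the date range of a transferred account before describing it in the archival information system.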
4. AN EXAMPLE OF PROCESSING OF ARCHIVAL CONTENTS IN ONLINE SOCIAL NETWORKS
Already during the implementation of the project Maribor 2012 - European Capital of Culture, the Regional Archives Maribor established that the creator is a public legal person, which is obliged in accordance with the law to transfer archives to the competent archival institution. At the end of the project, issues related to the storage and subsequent use of its content, including online social networks, became topical. A basic analysis of the contents showed that, in the present case, the project's Facebook page duplicated the content of the project web pages, while the Twitter account and e-mail provided additional information on the events.

From an archival-theoretical point of view, the contents of the treated creator's online social networks were evaluated as archival material, which must be transferred to the Archives. In this context, challenges appeared in implementing the acquisition in accordance with existing archival professional standards, including problems associated with the so-called anonymous administration of such systems [17]. In what follows we discuss only the content of the social networks Twitter and Instagram, which were established in the framework of the project Maribor 2012 - European Capital of Culture.

4.1 Transfer and processing of contents from social network TWITTER
In the social network Twitter, contents related to the project "Maribor 2012 - European Capital of Culture" were presented in two accounts: @Maribor2012 (Evropska prestolnica kulture Maribor 2012. Zavrtimo skupaj)¹² and @LifeTouch_2012 (tweets submitted in connection with the fourth programme topic of the European Capital of Culture Maribor 2012, entitled Life on the Touch)¹³. In the process of transfer, the contents were packed into two independent Submission Information Packages (hereinafter SIP), which were prepared separately. Only the solutions allowed by the online network Twitter itself were used; in the present case, advanced automated support solutions, developed for example by the competent archival institution, were not implemented.

The downloaded content was described in the archival information system of the Regional Archives Maribor (hereinafter PAM) in accordance with the standards, and Dissemination Information Packages (hereinafter DIP) were thereby created. The only difference between the SIP and the DIP is that PAM added a user interface in Slovenian and a constant connection to the description of the material. It was published on an appropriately secured and accessible web site within a domain under the archive's own administration.

PAM has not yet implemented any interventions on the content of the DIP. Analysis at the DIP data level does not indicate potential problems, for example at the level of coding tables or the functioning of links. Big professional challenges have, however, been observed at the level of the functioning of references to external resources located in reused domains over which the competent archive has no professional competence. In practice this leads to misinformation, so it will be necessary to implement interventions to correct the references to external resources. That raises some archival professional questions. Let us expose only the questions relating to implementing changes to archival content in such a way that the corrected references would function properly while we would not need to interfere with the content. There are also questions such as: how should the presentations of the content in the DIP be created, especially if the content is already changing dynamically at the creator? It appears that in this case it would be necessary, on the one hand, to take the dynamics of the single presentation of the content into account and, on the other, to carry out one or more snapshots at pre-determined times.

It should be noted that PAM did not specifically commence the development of the Archival Information Package (hereinafter AIP) but temporarily left the original data structure and organization of files; only information about the username and password of each account was added.

4.2 Transfer and processing of contents from network INSTAGRAM
Instagram is an online service for sharing photos, videos and related services, enabling users to capture photos and video clips, apply the appropriate filters to them and then share them through various social networks such as Facebook, Twitter, Tumblr and Flickr [34]. Within the framework of the project "Maribor 2012 - European Capital of Culture", among other things, the Instagram photo contest CATCH YOUR INSPIRATION (UJEMI SVOJ NAVDIH) was organized. The contest received nearly 1,000 photos in the context of three themes: #EPKUTRINKI, #EPKLJUDJE and #EPKSHARK. The photos and accompanying contents of the contest are stored within the transferred online content of the entire project. At the same time, the photographs were treated as a collection of photos in a special series, divided into three sub-series corresponding to the three themes. This solution was chosen because the content of the website was designed on the model of an open logical information loop, but from an archival professional point of view it was necessary to close the loops at least logically. That was carried out through the formation of a static DIP, represented by an archival description at the level of the sub-series [35], including presentations of the photos submitted to the contest. To each photo in the DIP they added a package containing information such as:
 Reference code, which defines the unique place of the content of the photograph within the archival information system,
 Title of the series, which defines the basic contextual environment of the photograph,
 Code of the photograph, which is unique and derives from the original title of the photograph.

¹² On this account, 1,365 tweets were created. They were organized into 61 files and 12 folders, with a total size of 2.92 MB (3,066,094 B). The page was followed by 1,251 people (followers), while the page itself followed 346 other pages (following). [32]
¹³ On this account, 1,872 tweets were created. They were organized into 46 files and 12 folders, with a total size of 2.84 MB (2,982,965 B). The page was followed by 309 people (followers), while the page itself followed 341 other pages (following). [33]
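The three description elements above can be carried as one small structured record per photo. The following sketch is illustrative only; the class name, field names and validation rule are our assumptions, not PAM's actual schema, and the item-level codes in the example are hypothetical extensions of the sub-series code cited in the references:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PhotoDescription:
    """Descriptive metadata attached to one photo in the DIP."""
    reference_code: str   # unique place within the archival information system
    series_title: str     # basic contextual environment of the photograph
    photo_code: str       # unique code derived from the photo's original title

    def __post_init__(self):
        # A DIP entry is unusable without its unique reference code.
        if not self.reference_code:
            raise ValueError("reference_code is required")

# Example, built on the real sub-series code SI_PAM/1948/006/001;
# the item suffix and photo code are hypothetical.
photo = PhotoDescription(
    reference_code='SI_PAM/1948/006/001-042',
    series_title='Fotografije zbirke "#epkljudje"',
    photo_code="epkljudje-042",
)
```

Keeping the record immutable (`frozen=True`) mirrors the archival requirement that a description, once fixed in the DIP, is not silently altered.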

In the context of the creation of the DIP, they were very mindful of the management of the sequences of individual entities in the sub-series.

In the case of preserving photos from the online social network Instagram, the Maribor archivists faced archival professional issues similar to those in the preservation of content from Twitter, especially with regard to the design of the AIP and DIP. Even in this case, the structures of the files in the AIP were not specifically tackled; they were temporarily left in their original data structure and organization, and only the information on the user name and password for access to the Instagram network was added.

5. CONCLUSIONS
Professional issues related to the preservation of the online social network contents of evaluated creators of archives can be dealt with on at least three levels: archival, informational and technological. On the archival professional level, in particular, contexts must be checked and the content must be evaluated in accordance with professional archival principles. On the informational level, entities such as information and misinformation, and their completeness, correctness, clarity, integrity, credibility, etc., should be checked. At the technological level, we deal mainly with the correct functioning of the content at the technical level, the modalities of its presentation in a variety of environments, and the modalities of implementing its involvement in both the original and the archival environment.

The key challenges in the management of online social network contents are their large quantity, rapid variability, mutual interpenetration, multiplication, rapid technological development and the lack of clear rules and standards for their management. Due to the large volume and constant changes of information on social networks, from the technological as well as the archival professional point of view it will probably never be possible to preserve these contents fully. Therefore, it is necessary to develop methods for their evaluation. In doing so, it is possible to rely to some extent on known solutions implemented for archives in physical form.

First archival experiences in this field show that the complex problem of long-term preservation of such content should be approached with all professional responsibility. The ability to comprehend this kind of archival professional problem reflects the level of theoretical and practical development of modern archival services.

6. REFERENCES
[1] How the Arab Spring Was Helped By Social Media. n.d. PolicyMic. Retrieved May 29, 2014, from http://www.policymic.com/articles/10642/twitter-revolution-how-the-arab-spring-was-helped-by-social-media.
[2] Franc Kangler naj odstopi kot župan Maribora. n.d. Total Page Likes 39.316. Retrieved May 29, 2014, from https://sl-si.facebook.com/kangler.naj.odstopi.
[3] BarackObama. n.d. Total Page Likes 42.618.068. Retrieved September 17, 2014, from https://www.facebook.com/barackobama/likes.
[4] TinaMaze. n.d. Total Page Likes 337.544. Retrieved September 17, 2014, from https://www.facebook.com/tinamaze.
[5] USNationalArchives. n.d. Total Page Likes 85.125. Retrieved September 17, 2014, from https://www.facebook.com/usnationalarchives.
[6] Semlič Rajh, Z., Šabotić, I. and Šauperl, A. n.d. Znanstveno-raziskovalno delo v arhivistiki: značilnosti uporabe dveh raziskovalnih metod. Retrieved May 29, 2014, from http://www.pokarh-mb.si/uploaded/datoteke/Radenci/Radenci2013/11_Semlic_Sabotic_Sauperl_2013.pdf.
[7] Spletno socialno omrežje. n.d. Wikipedia. Retrieved May 15, 2014, from http://sl.wikipedia.org/wiki/Spletno_socialno_omre%C5%BEje.
[8] SI_PAM/1933/002/004 Korespondenca Franza Miklosicha z bratom Morizem, 1876-1910 (Podserija). n.d. SIRAnet. Retrieved May 29, 2014, from http://www.siranet.si/detail.aspx?ID=912611.
[9] Retrieved May 29, 2014, from http://www.pokarh-mb.si/si/aktualno/125/hrepenenje-izza-okopov-pisma-s-fronte-1914-1918.html.
[10] Historical Maps of the Habsburg Empire. n.d. Mapire. Retrieved May 27, 2014, from http://mapire.eu/en/.
[11] Mednarodni arhivski svet. 2005. Elektronski dokumenti: Priročnik za arhiviste. ICA Študije 16. Retrieved May 27, 2014, from http://www.arhiv.gov.si/fileadmin/arhiv.gov.si/pageuploads/zakonodaja/ELEKTRONSKI_DOKUMENTI_STUDIJA_16.pdf.
[12] Gostenčnik, N. 2014. Uporaba spleta 2.0 v Pokrajinskem arhivu Maribor. Retrieved May 27, 2014, from http://daz.hr/zad-dan/uporaba-aplikacij-web-2-0-v-pokrajinskem-arhivu-maribor/.
[13] Arhivalije meseca. n.d. In Republika Slovenija, Ministrstvo za kulturo, Arhiv RS. Retrieved May 27, 2014, from http://www.arhiv.gov.si/si/delovna_podrocja/razstavna_dejavnost/arhivalije_meseca/.
[14] Arhivalija meseca. n.d. In Zgodovinski arhiv Ljubljana. Retrieved May 27, 2014, from http://www.zal-lj.si/index.php/component/content/category/43-arhivalija_meseca.
[15] Kemper, J. 2012. Archives Open – Offene Archive? Ein Praxisbericht. In Proceedings of Tehnični in vsebinski problemi klasičnega in elektronskega arhiviranja: arhivistika in informatika: zbornik mednarodne konference (Radenci, Slovenija, March 28-30, 2012), 445-450.
[16] Historijski arhiv Sarajevo. 2014. Facebook. Retrieved May 27, 2014, from https://www.facebook.com/HistorijskiArhivSarajevo.
[17] Novak, M. and Šövegeš Lipovšek, G. 2014. Virtualizacija - način ohranjanja digitalnih arhivskih vsebin. In Tehnični in vsebinski problemi klasičnega in elektronskega arhiviranja, Arhivi v globalni informacijski družbi: zbornik mednarodne konference (Radenci, Slovenija, April 2-4, 2014), 437-545.
[18] A report on federal web 2.0 use and record value. 2010. National Archives and Records Administration. Retrieved November 23, 2014, from http://www.archives.gov/records-mgmt/resources/web2.0-use.pdf.
[19] The United States Department of Justice. n.d. Facebook. Retrieved June 4, 2014, from http://www.facebook.com/DOJ#!/DOJ?v=wall; TheJusticeDepartment. n.d. YouTube. Retrieved June 4, 2014, from http://www.YouTube.com/TheJusticeDepartment.
[20] Charter on the preservation of the digital heritage: adopted at the 32nd session of the General Conference of UNESCO, 17 October 2003. 2003. Paris: UNESCO.
[21] Novak, M. 2011. Metode stručnog i istraživačkog rada u savremenoj arhivskoj teoriji i praksi. Arhivska praksa, 14, 213-227.
[22] May, R. 2012. The Cloud to Cloud Backup Blog: Why We Are Discontinuing LinkedIn and Zoho Docs Backup. Backupify. Retrieved June 3, 2014, from http://blog.backupify.com/2012/12/10/why-we-are-discontinuing-linkedin-and-zoho-docs-backup.
[23] What is ContextMiner? n.d. ContextMiner. Retrieved May 30, 2014, from http://www.contextminer.org/about.php.
[24] Save all your social networks to your computer and enjoy your story in one safe place. n.d. SocialSafe. Retrieved June 1, 2014, from http://www.socialsafe.net.
[25] Downloading your Twitter archive. n.d. Twitter Help Center. Retrieved June 6, 2014, from https://support.twitter.com/articles/20170160-downloading-your-twitter-archive#.
[26] How to Take a Screenshot in Microsoft Windows. n.d. wikiHow. Retrieved May 28, 2014, from http://www.wikihow.com/Take-a-Screenshot-in-Microsoft-Windows.
[27] A National Archives of the Future. 2011. AOTUS, National Archives. Retrieved April 9, 2014, from http://blogs.archives.gov/aotus/?p=2322; Levien, R. E. 2011. Confronting the future: strategic visions for the 21st century public library. Washington, D.C.: ALA Office for Information Technology Policy.
[28] Erin, A. 2013. Update on the Twitter Archive at the Library of Congress. Library of Congress Blog. Retrieved June 1, 2014, from http://blogs.loc.gov/loc/2013/01/update-on-the-twitter-archive-at-the-library-of-congress/.
[29] About the Internet Archive. n.d. Internet Archive. Retrieved April 23, 2014, from https://archive.org/about.
[30] Internet Archive: Wayback Machine. n.d. Internet Archive. Retrieved September 17, 2014, from https://archive.org/web/.
[31] Spletni arhiv Narodne in univerzitetne knjižnice. n.d. Spletni arhiv NUK. Retrieved June 1, 2014, from http://arhiv.nuk.uni-lj.si/.
[32] SI_PAM/1948/007/001 @Maribor2012 Twitter sporočila (čivki), ki so bila posredovana v zvezi z izvedbo projekta: Evropska prestolnica kulture Maribor 2012. Zavrtimo skupaj., 2010.03-2013.04 (Podserija). n.d. SIRAnet. Retrieved June 9, 2014, from http://www.siranet.si/detail.aspx?ID=1122570.
[33] SI_PAM/1948/007/002 @LifeTouch_2012 Twitter sporočila (čivki), ki so bila posredovana v zvezi z: Življenje na dotik, 4. programski sklop Evropske prestolnice kulture Maribor 2012., 2011.06-2013.02 (Podserija). n.d. SIRAnet. Retrieved June 9, 2014, from http://www.siranet.si/detail.aspx?ID=1122819.
[34] Instagram. n.d. Wikipedia. Retrieved June 9, 2014, from http://en.wikipedia.org/wiki/Instagram.
[35] SI_PAM/1948/006/001 Fotografije zbirke "#epkljudje": Veliki Instagram natečaj Evropske prestolnice kulture, 2012-2012 (Podserija). n.d. SIRAnet. Retrieved June 10, 2014, from http://www.siranet.si/detail.aspx?ID=1123113; SI_PAM/1948/006/002 Fotografije zbirke "#epkshark": Veliki Instagram natečaj Evropske prestolnice kulture, 2012-2012 (Podserija). n.d. SIRAnet. Retrieved June 10, 2014, from http://www.siranet.si/detail.aspx?ID=1123146; SI_PAM/1948/006/003 Fotografije zbirke "#epkutrinki": Veliki Instagram natečaj Evropske prestolnice kulture, 2012-2012 (Podserija). n.d. SIRAnet. Retrieved June 10, 2014, from http://www.siranet.si/detail.aspx?ID=1123180.

Recordkeeping Informatics: Building the Discipline Base

Gillian Oliver, Victoria University, Wellington, New Zealand, 64 (0) 4 463 7437, [email protected]
Frank Upward, Monash University, Melbourne, Australia, 61 (0) 3 9534 2405, [email protected]
Barbara Reed, Recordkeeping Innovation, Sydney, Australia, 61 (0) 2 9369 2343, [email protected]

Joanne Evans, Monash University, Melbourne, Australia, 61 (0) 3 9903 2177, [email protected]

ABSTRACT
In this paper we build on work we have been doing on recordkeeping informatics (RKI), our term for an approach to records management that focuses on the processes that produce records rather than on their management as end products. For an account of RKI see our article: Upward, F., Reed, B., Oliver, G. and Evans, J. 2013. Recordkeeping informatics: re-figuring a discipline in crisis with a single minded approach. Records Management Journal, 23(1), 37-50.

In that article we identified the need for a new, more disciplined base for records management, and in this paper we discuss five critical components for this base. Those components involve:
 developing new professional groupings and new occupations,
 teaching and training in new forms of skill and knowledge,
 finding ways of focusing attention on the ethics of corporate governance,
 outlining new recordkeeping competencies and functions,
 and showing how these components can be clarified and used within the work of project teams proto-typing modularized business applications.

Keywords
Agile computing
Corporate Governance
Recordkeeping Informatics

1. INTRODUCTION
Previously, we have introduced the concept of recordkeeping informatics as a new disciplinary base for records management [10]. We argued that more efficient, effective, reliable and sustainable records management comes from understanding and appreciating the informatics of recording methods: knowing not just about managing records, but also about managing the recordkeeping systems and processes in which they are created, captured, managed and consumed. We built our argument for recordkeeping informatics around two building blocks (continuum thinking and metadata) and three facets of process analysis (business processes, organisational culture and access), urging those responsible for the management of current records to accept the challenges of the digital environment and, in so doing, to actively engage in re-shaping and broadening their philosophy and approach.

We tied the nature of the challenge to Anthony Giddens's sociological explanation of allocative and authoritative resource management, applying it to information resource management. We were not concerned with the allocative aspects, the productive power and vitality of our emerging communication and information technologies. Producing more and more recorded information, moving it around quickly and globally and using powerful search engines is already part of many daily lives. We wanted to open up discussions about how to re-introduce protocols into that emergence, giving more order across what is often seen to be a wild frontier. That ordering, within Giddens's interpretation of authoritative resource management, will need to be directed at our life chances, the way we associate with each other, and how well we manage things over spacetime. In all three areas there are many challenges within the application of technologies in modern societies.

In order to progress these ideas, and to move beyond traditional records management practices which are failing to achieve recordkeeping objectives in today's networked society, it is necessary to build a new disciplinary base. Without this radical overhaul we will simply continue to make cosmetic and ultimately ineffective adjustments to our traditional ways of doing things. In this paper we identify five critical components essential for this new base, and discuss each in turn. The components are:

 New professional groupings and new occupations
 People with a new range of skills and knowledge
 Ethical organisations
 Re-invention of recordkeeping competencies and functions
 Roles within project teams

2. NEW PROFESSIONAL GROUPINGS & OCCUPATIONS
Recordkeeping informatics is concerned with managing information objects in cyberspace. The occupation of records management emerged in the mid twentieth century as a result of the need to manage increasing quantities of paper records [4]. We question whether existing professional infrastructures are up to the task of re-positioning and upskilling to the extent required, given the fundamentally different environment we are now operating in.

The existence of particular occupations reflects societal values and needs. Cigar makers and furriers would have been regarded as necessary and respected members of society in the nineteenth and first part of the twentieth centuries, yet now the most benign view would probably be that these occupations are little more than anachronistic novelties. Similarly, a shortage of blacksmiths and saddlers would once have posed a huge risk to people's ability to trade and to make a living; these two occupations were once essential for the smooth functioning of society. Andrew Abbott, in his theory of professions, argues that professions (or occupations) arise as a result of system disturbance, and the ones that prevail are those that succeed in claiming the territory, establishing ownership of a particular set of problems [1]. Several decades ago it was noted that library and information science was engaged in a competition for jurisdiction with other information professions, due to ICT developments as well as the increasing strategic importance of information [11]. Since that time, the complexity and scale of the ICT environment have increased exponentially, and the power of information exceeds anything that could have been imagined previously. Reflecting on the reactions to WikiLeaks and the NSA disclosures by Edward Snowden is enough to demonstrate that here is territory where ownership and control are perceived as crucial to the well-being of nation-states. But are records managers engaged in this competition for jurisdiction? Are they even part of the debate?

A cursory review of the publications for records practitioners produced by their professional associations suggests that rebranding is attempting to minimize the significance of records, and that a claim is being made to information management instead. For instance, in the United States, ARMA International publishes the Information Management Journal, and surprisingly claims that this publication is "the only professional journal specifically for professionals who manage information as part of their job description" [2]. The Information and Records Management Society of Great Britain publishes the IRMS Bulletin. The Records and Information Professionals Association of Australasia publishes the cryptically named IQ Magazine, previously known as Informaa Quarterly. This terminology usage seems to claim jurisdiction over a very broad information management domain, and suggests that in the English-speaking world our professional associations are losing sight of what should be our core concern: the management of information as an authoritative resource. Claiming jurisdiction over the broad (and extremely contested) domain of information without a strong recordkeeping element implies that we have little to contribute. If our professional associations are highlighting this path as the way to go, it suggests that we need new professional leadership, and recordkeeping informatics can make a targeted claim for jurisdiction in a vital area where genuine competition is slight.

Recordkeeping used to be characterized by a single mind, but we divided the role into records management and archival administration and now have to grapple with multiple personalities beyond those two groups. Recordkeeping has become hidden within other professional activities. All professions face the challenge of regaining control of our expanding ICT capabilities, but without the innate understandings of evidence that they used to have in the paper world. We have gone from a situation where recordkeeping was performed adequately in many areas to one where, if it is being carried out at all, it is likely to be done badly.

The new occupations that are needed are based on those of the past, but these have often, in the developed English-speaking world, been a casualty of the dismantling of bureaucracies. Registrars, for instance, were the key figures that managed the reception, handling and dispatch of in-coming and out-going business. These days we need recordkeeping metadata registrars, experts in metadata schemas, who can identify and implement elements in those schemas to turn information into evidence. Records analysts also need to be re-invented, to identify when and where records should drop out as end products of our processes, while at the same time making records and evidence part of the effectiveness of all operational activities when required.

The number of roles required to look after records and archives as analogue or digital products (things to be shelved, or imaged) will decrease. The number of roles required to manage evidence in cyberspace, though, will increase. One way of thinking about these positions collectively is as stewardship roles for authoritative resource management.

Furthermore, we have to broaden our thinking, to help move us past a narrow competition for jurisdiction with cognate professions, which ultimately diverts our energy and capacity to innovate. Just as the much older profession of medicine has developed different roles to provide generalist and specialist perspectives, so must we. Thus we should be seeking to position recordkeeping in conjunction with those other cognate professions, focusing on and being confident in the particular contribution we can bring.

3. PEOPLE WITH A NEW RANGE OF SKILLS AND KNOWLEDGE
These stewards for authoritative information resource management will still need to work to logical models and their conceptual underpinnings. This knowledge enables the identification of patterns for the development, selection and management of recordkeeping applications – the high-level master plan. Within this understanding, there is a need for new analytical skills relating to organisational cultures, business process analysis and archival access. These analytical skills will enable us to develop the detailed floor plans necessary for implementation – the topography of recordkeeping.

It is not enough to simply have mastery of a specialized skillset. Those specialized skills need to be combined with a wide range of additional, reflexive skills. Reflexivity requires people who are:
 Capable of analyzing shifting legislative mandates

 ICT literate, maintaining an awareness of emerging trends and their consequences
 Equipped with diagnostic skills
 Able to engage with people
 Capable of succinct and meaningful explanations of recordkeeping, the 'elevator pitch' tailored to whoever is in that elevator
 Aware of who holds the power in their workplace and its strategic direction, with an understanding of financial models and resource allocation
 Effective in their networking with others who play key roles (whether they are aware of it or not) in authoritative resource management
 Enthusiastic but purposeful participants with digital technologies and social media
 Able to apply graphic and other technical skills in communication
 Active listeners

These are important skills that have to be taken into account when developing educational programmes, in conjunction with our professional skills and knowledge.

4. ETHICAL ORGANISATIONS
The world we live in is characterized by the spectres of climate change, the spread of terror, corruption, declining confidence in our governments, poverty and inequalities in wealth distribution, famine and economic collapse, at a time when growth in general wealth, humane food production and improved economic stability should be within our means. However, attempts to address these societal problems are hampered by those with vested interests in maintaining the status quo, and by an absence of evidence bases to support or refute arguments.

These societal grand challenges may seem remote from day-to-day records management. On the contrary, they provide the standpoint from which to consider the crucial need to operate within ethical organisations that possess the rare understanding that the imperative in today's world is to seek out win/win outcomes.

We are certain that the examples of poor administration that we regularly come across in Australia and New Zealand are likely to be systemic, and can seldom be attributed to human error alone. The cause is likely to be an absence of recordkeeping knowledge and skills. In New Zealand, for instance, the inability of the Earthquake Commission (EQC) to provide citizens with critical information relating to their property and assets subsequent to the 2010-2011 Canterbury earthquakes resulted in an extensive investigation by the country's Chief Ombudsman and Privacy Commissioner. The report of this investigation places strong emphasis on the vital need for timely, accurate and comprehensive information in the context of disaster recovery, and is very critical of EQC's failures in this regard. Failure to respond appropriately in accordance with legislative requirements is attributed to the EQC being reactive rather than proactive, and to a seemingly complete absence of understanding of people's needs for information [7]. It seems that the EQC simply did not have the infrastructure or the foresight required to fulfill its records-related obligations.

The old bureaucratic models promoted strong relationships between recordkeeping and administration, but the rise of new public management caused the recordkeeping baby to be thrown out with the bathwater of excessive regulation. The connection between effective recordkeeping infrastructures and administrative efficiency may seem obvious to us, but without practitioners who can actually deliver more adequate recordkeeping our voices will not be listened to. Our attempts at intervention are likely to be seen as needlessly putting red tape in the way of action, as impediments to achieving workplace goals. As a recent report from an Australian Royal Commission into a major administrative debacle noted [6], it is often considered that maintaining records is too burdensome, but it is precisely agencies that think this way that are most likely to lack a sound corporate governance framework. In one way we are more sympathetic than the Royal Commissioner to the fear that recordkeeping is burdensome. In a digital recordkeeping environment the burden can indeed be great unless we find more agile approaches to computerization, such as the extensive use of proto-typing methods and of modular and tailorable applications that can be plugged into and unplugged from our information infrastructures without losing contact with the records created while they were active. Agile computing approaches are in their design infancy in organisations, except for hospital records, but are one of the main paths we envisage recordkeeping informatics taking.

Recordkeeping used to be a gold standard for any form of governance and administration, but on its own it is morally indifferent. It can serve any form of government and is particularly effective in supporting terror-based or totalitarian regimes (see, for instance, [3], [12]). Therefore it is essential to cultivate awareness of the ethical dimension of being a professional recordkeeper; we need to work out many different ways of legitimating access to information directly, through judicial processes, through whistle blowing within ethical parameters, or through trusted third parties such as ombudsmen or archivists. We have already referred to the need for ethical organisations to work towards win/win solutions. No organisation is an island – all organisational entities are part of complex networks, which means complex interdependencies. Prosperous survival means that all nodes in the network do well – thus all involved benefit.

5. RE-INVENTION OF RECORDKEEPING COMPETENCIES AND FUNCTIONS
The challenges of digital recordkeeping call for a re-think of what it means to be a records manager. Former service models that focused on managing the thing, rather than the process, have collapsed. The expanding complexities of ICTs have quite simply passed us by – the traditional emphasis on authoritative resource management, which was relatively easy to understand in paper-based ecologies, has slipped away.

We need new standards for professional recordkeeping competencies and new ways of expressing recordkeeping functions. Those functions and competencies in digital recordkeeping ecologies will include:
 Understanding the storage of information objects, taking into account fixity in time and place and also fluidity as they alter during re-use.
 Understanding storage as a strategic issue; outsourced service models change the cost dimensions to ongoing operational charges, which have to be proactively justified rather than hidden in seemingly one-off capital expenditures.
 Pluralised access will need to be as automatically controlled by recordkeeping metadata as much as

possible. The metadata will have to be applied during the formation of self-authenticating records during business processes, and those records need, just as automatically, to take their place within corporate stores (archives).

Taken together, our three facets of analysis and two building blocks form a kind of kaleidoscope, layering and interweaving professional competencies and functionality to produce a coherent whole. Those competencies and the functionality underpinning recordkeeping informatics are not fixed and immutable and need re-statement in our new digital world.

We suggest that a starting point for identifying the new competencies, to the level of granularity required, will be to re-examine the many existing explorations of archives and records management competencies in different jurisdictions that have been defined over the last few decades, and to map those against the recordkeeping informatics facets and building blocks. This is, however, just one approach that can be taken to initiating meaningful review – and there is a danger inherent in restricting that thinking because of over-reliance on existing work. In particular, we need to make a radical departure from a focus on objects as opposed to process. For example, a 'traditional' records management inventory is very different from a workflow, architecture or service-oriented analysis. Compiling an inventory should now be considered as something that may need to be done in exceptional circumstances only – as a forensic rather than a clinical recordkeeping task. Clinical tasks will involve understanding and being able to analyse the business processes and workflows that provide the way to identify points where decisions are made, and therefore where records are generated and re-used. If the relationships between recordkeeping informatics and business processes are fully understood and the analysis is maintained, organisations will be diagnosing and implementing interventions, rather than conducting post mortems.

6. ROLES WITHIN PROJECT TEAMS
One feature that will contribute to the development and diffusion of recordkeeping informatics relates to project-based activities. Complexity, and the related difficulty of imposing fixed rules, is one reason why project team methods of working have been evolving over the last few decades (no single mind can cope). State Records NSW, for example, claims that the most distinctive feature of its approach to preserving digital recordkeeping systems is its flexibility, noting that "rather than delivering a tightly integrated end-to-end system with fixed rules for archiving digital objects, we've developed a project based methodology that we believe can be applied to the migration of any governmental digital recordkeeping system to the digital archives". To support large systems architectures, from a teaching and training perspective we need to develop exercises and case studies that focus on a service orientation in an agile networked environment.

To better enable integration and interoperability, system developers have been looking to modular or component-based architectures, where complex systems are assembled from well-defined and standardized components. This vision is being further extended with the idea that these functional units would ultimately be dynamically assembled to carry out business processes. A service-oriented architecture is seen as having the potential to deliver major productivity and capability improvements and, in so doing, to transform the way in which business is done and the ways in which information technology is constructed, deployed and managed [8].

The project approach should be directed at the 'so big, so small' problem in applying ICTs. As technologies converge, the whole seems to be getting so much bigger, but in its parts the need for granular control becomes greater. Accordingly, we need projects at big levels, such as whole-of-government or organisation-wide, that aim to maximise the use of agile forms of computing without losing authority in the process. We also need many small 'wild-card' projects addressing in more open fashion any of the myriad business applications for which our ICTs can be used. In our efforts to bridge between the authoritative and the creative aspects of information management we need always to be looking for patterns that emerge out of our projects. Elsewhere we have likened these patterns to Mandelbrot fractals [10]. The identification of such patterns enables the development of more and more modules that can be tailored within an organisation.

7. CONCLUSION
Building the discipline base for recordkeeping informatics is still in its very initial stages. It is a massive undertaking, requiring engagement and input from others, and it cannot be considered the exclusive ground of a few people collaborating at a certain time, in a certain place. We need to critically review our existing disciplinary infrastructure and apparatus (our professional associations, our competencies, our educational materials and foci) with rigour and robustness, in order to distinguish between the essential (the babies) and the no longer needed (the bathwater). Understanding the topology in relation to the topography of our professional concerns, and prioritizing consideration of process rather than thing, will assist in developing new roles and occupations. Establishing the value of these roles can only be advanced when we are confident of the nature of our contribution and our capabilities, as well as secure in our understanding of our relationships with colleagues in cognate occupations.
this open approach to digital archiving, we have favoured the use The key to the evolution of records management lies with of small, simple and flexible tools that we can compose together educators, those responsible for teaching and training the to achieve the goals of different migration projects.” [6].We go practitioners of today and tomorrow. This paper should be read as even further than this, advocating a recordkeeping informatics a challenge, we hope it stimulates the development of appropriate role in all projects across an organization as a feasible and case studies and exercises that are essential for the new appropriate way of demonstrating value as well as embedding disciplinary base. We welcome an ongoing conversation. recordkeeping Project approaches can involve large enterprise architectures or, in our preferred approach, more agile and usually web-based 8. REFERENCES approaches built around applications. Which ones records [1] Abbott, A. 1988. The system of professions: An essay on the managers get involved with depend on their workplaces, but web- division of expert labor. University of Chicago Press, based agility clearly has a lot more long-term relevance going for Chicago. it, which is probably a good thing, as records managers have for [2] ARMA International. 2014. All about ARMA International the most part been such insignificant figures in the evolution of http://www.arma.org/r2/who-we-are

[3] Ash, T.J. 1997. The file: a personal history. Vintage Books, New York.
[4] Duranti, L. 1989. The odyssey of records managers. Rec. Management Q., 23(3), 3-11.
[5] Giddens, A. 1984. The Constitution of Society: Outline of the Theory of Structuration. Cambridge, p. 258.
[6] Hanger, I. (Q.C.). 2014. The report of the Royal Commission into the Home Insulation Scheme, p. 318. www.homeinsulationroyalcommission.gov.au/.../ReportoftheRoyalCommission
[7] Information fault lines: A joint report of the Chief Ombudsman and the Privacy Commissioner into the Earthquake Commission's handling of information requests in Canterbury. 2013. www.ombudsman.parliament.nz/system/paperclip/document_files/document_files/833/original/information_fault_lines_accessing_eqc_information_in_canterbury.pdf?1406602182
[8] Reed, B. 2008. Service-oriented architectures and recordkeeping. Rec. Management J., 18(1), 7-20.
[9] State Records NSW. Digital archiving at State Records NSW. http://futureproof.records.nsw.gov.au/how-we-do-digital-archiving-at-state-records-nsw
[10] Upward, F., Reed, B., Oliver, G. and Evans, J. 2013. Recordkeeping informatics: re-figuring a discipline in crisis with a single minded approach. Rec. Management J., 23(1), 37-50.
[11] Van House, N. and Sutton, S.A. 1996. The panda syndrome: An ecology of LIS education. JELIS, 37(2), 52-62.
[12] Verdery, K. 2014. Secrets and truths: ethnography in the archive of Romania's secret police. Central European University Press, Budapest.

Law and Records Management in Archival Studies: New Skills for Digital Preservation

Prof. Dr Marie Demoulin
Université de Montréal (Canada)
École de Bibliothéconomie et des Sciences de l'Information
[email protected]

Sébastien Soyez
State Archives in Belgium and University of Namur (Belgium)
[email protected]

ABSTRACT
The information society has deeply transformed the archivist's environment. New forms of documents and records are emerging that require new skills and new roles. In order to develop a global, long-term strategy for managing and preserving digital information, the next generation of archivists will have to understand technological issues and be involved in technological decisions in close collaboration with IT experts. In addition, they should be able to take an active part in electronic records management projects, participate in the change management process and provide advice from a medium- and long-term preservation point of view. Finally, archivists should have a better understanding of legal issues. In this respect, archival studies programs should be adapted to allow the next generation of archivists to participate in the chain of trust between the electronic record producer and its user. To illustrate this new trend in education, this paper will present DocSafe, a certificate program in digital information management offered jointly by the University of Namur, the University of Liège and the State Archives in Belgium since 2013.

Categories and Subject Descriptors
K.3.2 [Computers and education]: Computer and Information Science Education

General Terms
Management, Documentation, Security, Human Factors, Standardization, Theory, Legal Aspects.

Keywords
Training and Education, Archival Studies, Interdisciplinarity, Law, Records Management

1. INTRODUCTION
Over the last 30 years, work methods and practices in information management and archiving have changed profoundly. New media and new forms of records have appeared that call for an adaptation of management and preservation practices. Work processes have become progressively more computerized, with perpetually changing solutions following ever more quickly evolving technologies. This phenomenon is amplifying and constantly becoming more complex, leading necessarily to changes in the archival profession, involving new skills, new tasks, and new roles.

Information governance projects are proliferating everywhere, from the implementation of a new document management solution or an information-sharing platform, to centralized databases, email management policies, and mass digitization strategies. These complex projects necessitate collaboration with experts from different fields, from IT consultants, computer analysts, and developers, to lawyers, financial officers, and sales managers… but also archivists. All of these professionals combine their skills in pursuit of a common goal: professional information management.

Unfortunately, archivists are as yet too often absent from this kind of project. There are many possible reasons for this. The first is, without question, tied to a misconception of the role of the archivist as limited to the management of historical records. The second reason might come from archivists themselves, who, intimidated by the complex, technological nature of such projects, prefer not to get involved, forgetting how useful their contribution could be with regard to information management.

In fact, when archivists are involved from the beginning, it is clear that their know-how is an asset in helping to achieve projects that are viable for the long term, taking into account the life cycle of electronic documents and assuring their authenticity, integrity, usability and sustainability. Moreover, archivists know better than anyone the context in which an institution's documents are produced and received, the documents that are generated through business processes, and how those documents circulate between administrative divisions. Archivists orchestrate the identification and classification of information and establish detailed information management tools that track its creation, preservation, and destruction. This considerable expertise, essential for modeling the information processes and information flows that serve as a framework for solution development, also contributes to the implementation of the solution during the testing phase and to supporting users through change management.

However, in order to fully contribute to the management and preservation of digital information, it seems more and more essential to equip archivists with additional skills, still generally missing from archival studies programs. Indeed, the first generations of archivists have specialized knowledge at their disposal with regard to the processing of traditional archives¹. On

1 By "traditional archives," we mean archives stored on a traditional medium (principally paper) that can be viewed without the help of a mediation device. These archives also include iconographic, cartographic, and photographic archives. Petit, R., Van Overstaeten, D., Coppens, H. and Nazet, J., Terminologie archivistique en usage aux Archives de l'Etat en Belgique, Archives de l'Etat, 1994, p. 24.

the other hand, in the era of information on digital media, archivists must increasingly concern themselves with documents from the moment of their creation, and consequently must be able to understand information throughout its life cycle, whatever its media or form. The archivist's engagement in dynamic, integrated document management² calls for dialogue and for developing closer ties with other disciplines.

This paper will first present a brief panorama of traditional archival studies programs and identify recent evolutions in the field (Section 2)³. Then, we will consider the role of management, law, and information and communications technology education in the future of archival studies programs (Section 3). Finally, in order to illustrate these new tendencies, we will present DocSafe, a Belgian initiative offering a degree program in digital information management (Section 4).

2. TRADITIONAL ARCHIVAL STUDIES PROGRAMS AND THEIR NECESSARY EVOLUTION
Since their origin, traditional archival studies programs have trained archivists in the preservation of historical documents on analogue media. Since archivists tend to work with documents produced within specific temporal and geographic limits, it is useful to be able to analyze the context of a document's creation. Consequently, archival studies programs have divided up their offerings by historical period: medievalist archivists and modern archivists (together called old-regime archivists), or contemporary archivists. Historically, traditional archival studies programs channel students toward public or private institutions dedicated to the preservation of documentary heritage. They can therefore be called 'heritage archivists'.

Since the end of the 20th century, traditional archival studies programs have begun to evolve, but these changes have not appeared in the same way everywhere. In Anglo-Saxon countries, the change is more noticeable, particularly because archiving and records management functions are already well integrated into professional practice. In French-speaking countries, awareness of the need for change has been slower, even if positive developments are underway.

The structure of traditional archival studies programs can vary from one country to another, sometimes as a specialization within a Master's degree in History, other times through a Master in Archival Studies, still others through a specialized certificate program. In general, the material taught can be grouped into four large subject categories⁴:

- Subjects related to the study and management of archives as such, namely archival theory and practice, as well as more specific courses such as diplomatics and paleography;
- Subjects considered to be historical that allow the archivist to better understand the context of creation of records, notably institutional, social, economic, or literary history, legal history, art history, and archaeology;
- Subjects stemming from information sciences that help the archivist to master archival media and information content; and
- Subjects coming from other fields: law, philology, languages.

With the advent of information and communications technologies, these traditional archival studies programs are starting to adapt, integrating progressively more technical courses coming out of information sciences or even other sciences. However, at the current moment, this tendency is not yet a generalized one and seems to focus on a merely technical approach to information management.

Inversely, we note that some university-level communications and information sciences programs are becoming progressively more interested in records management and archival studies. But these developments vary from country to country, indeed within countries, sometimes in a paradoxical manner. In fact, one might wonder whether records managers and heritage archivists need separate training programs. In spite of this, their professional functions are similar, since "both [are] responsible for the survival and use of archives. However, in some organizations and countries there is the record keeper who is responsible for the survival from creation of the record through to the archive stage, whereas the archivist tends to be responsible for the record at the point at which it becomes an archive. Both will have the same skills set and knowledge to ensure the physical survival and intellectual integrity of the archive."⁵ Can the needs in the field provide responses to these questions? Of course, the tasks of

2 Integrated Document Management (or IDM) is the management of all documents, whether technology-based or analog, through a single process, throughout their entire life cycle, i.e. from the moment of creation or reception through their preservation or destruction. It involves the implementation of a sustainable management system, all components of which are integrated, in order to give authorized persons access to all relevant information contained in a document that documents the activities of an organization. This definition is inspired by the "Politique de Gestion Intégrée des Documents administratifs de Bibliothèque et Archives Nationales du Québec," 2009 and 2013 versions, unpublished.

3 This section is based on a preliminary study carried out on the principal training programs currently offered in French-speaking parts of Europe and in Quebec, using data available on the websites of various educational institutions for the 2014-2015 academic year. The authors would like to thank Yvan Barreau and Fiona Aranguren Celorrio for their invaluable assistance in collecting this data. For a more in-depth analysis, see Demoulin, M. and Soyez, S., "L'interdisciplinarité dans la formation archivistique : un atout pour l'archiviste de demain," forthcoming in the next Proceedings of the Journées des Archives de l'Université Catholique de Louvain.

4 This division into four categories is inspired by a study on archival science programs carried out by Carol Couture and Marcel Lajeunesse. Couture C. and Lajeunesse M., L'archivistique à l'ère numérique. Les éléments fondamentaux de la discipline, Montreal, Presses de l'Université du Québec, 2014, p. 192.

5 International Council on Archives (ICA). Discover archives and our profession - Archives and record keepers. http://www.ica.org/125/about-records-archives-and-the-profession/discover-archives-and-our-profession.html.

heritage archivists have remained roughly the same, although they are increasingly called upon to participate in digitization projects and to use computer-based tools throughout the archival processing phase. However, despite their ongoing evolution, the traditional archival studies programs currently offered do not always respond sufficiently to the needs of information managers working with active records. The adaptation of degree programs to digital environments and to the new needs arising from records management has been uneven, particularly in French-speaking parts of Europe.

3. NEW CHALLENGES, NEW SKILLS
In order to face these new challenges, whether technical, legal, or managerial, archivists should acquire new skills that supplement their current competences. They should be able to understand technology-related issues and choices in order to develop a digital information management and preservation strategy in close collaboration with IT experts. Moreover, archivists should also be able to position themselves as key stakeholders in digital document management projects, in order to accompany change and direct practices from the perspective of short-, medium-, and long-term archiving. Finally, engaged from the current phase of the lifecycle of a document onward, they should increasingly take into account legal aspects related to transparency, accessibility, and the preservation of authenticity and integrity, in addition to the reproduction and dissemination of information. The archivist is also called upon to play an essential role in the chain of custody between the producer of a document and its future users, whether immediate or subsequent.

It goes without saying that the goal here is not to turn the archivist into a lawyer, a computer scientist, or a manager. Archivists should remain experts in their field, and we are not calling into question the absolute necessity that they be educated in the fundamental principles of archival studies. The goal here is rather to anchor these fundamentals at the heart of a multidisciplinary reality.

The aim of this increased interdisciplinarity is to allow archivists to develop new skills corresponding to new professional needs such as:

- Understanding the legal, technical and organizational issues at play in information management, particularly in digital environments, throughout the entire life cycle, regardless of medium. In other words, understanding the technical and socio-cultural contexts in which information management is situated in the digital society.
- Understanding the key role archivists should play as information professionals in managing digital information, and giving them the tools that will allow them to take on such a role within their organizations.
- Being able to enter into dialogue with experts in other disciplines, with a view to finding common, lasting solutions that take into consideration concerns related to the management of digital information.
- Proactively involving themselves in digital information management projects within their organizations, and in developing digital information management tools and policies.

Diversifying their skill-sets also allows archivists to respond to new needs throughout the entire life cycle of the document. Archivists must no longer be confined only to the static phase, but should also be involved throughout the entire current phase of a document. In order to do this, they should acquire new knowledge and new skills: technical skills, legal skills, and project and change management skills, all related to the management and preservation of digital information. We believe that the key to understanding and meeting current challenges lies in interdisciplinary education and a tight articulation of these three categories of competences.

3.1 Technical Skills
The archivist is driven to acquire new technical skills from the fields of computer science and information sciences. As information media and formats have evolved considerably in the last few decades, it is essential that archivists learn to understand their technical nature and to demystify their complexity. The goal is very much to give them the tools and the keys to better assess the issues and constraints related to existing technologies. From information sciences, the archivist should be able to understand the context of creation of these new records, and consequently analyze and evaluate the information systems that produce them in order to implement true digital diplomatics⁶. A proper understanding of the functionalities of information systems and the modeling of workflows and documentary flows seems indispensable to the archivists of tomorrow, allowing them to identify and describe the documents to be archived.

3.2 Legal Skills
Aside from the major legislation on archives, archivists should be given the tools to apply rules regarding the protection of personal information, intellectual property rights (copyright and associated rights, database protection, licensing, and open-source software) and access to information in the public sector. These legal provisions are directly applicable to their daily practice, for managing the security, confidentiality, reproduction, dissemination and accessibility of documents, as well as their destruction, whether in-house or in relation to the public or a contractor. We should remember that these rules have all been adapted to take into account the digital environment. Additionally, in order to appropriately manage the context of creation and preservation of documents throughout their life cycle, archivists need to know elements of evidentiary law, including in the digital era (general principles and conditions of judicial recognition of electronic documents and signatures). They need to be made aware of the existence of general and specific rules for the validity and conservation of documents.

3.3 Project Management and Change Management Skills
Archivists, whether working in archiving or in records management, need to be capable of managing and participating in projects that digitize work processes and heritage documents, and, from there, should be able to anticipate and support the resulting organizational change. Indeed, change management is crucial to the success of projects involving the integration of new

6 On digital diplomatics, see the work of InterPARES, research under the direction of Luciana Duranti (www.interpares.org).

information and communication technologies. For example, it is not enough to implement document management tools in an organization and incorporate them into an integrated document management solution. They must also be accepted and applied by the creators and users of documents, which calls for a certain amount of change-management skill. Efficiently identifying the needs of stakeholders, weighing the risks involved, and setting measurable, quantifiable objectives are skills that are essential to include in archival studies programs.

4. TOWARD NEW TRAINING PROGRAMS: THE CASE OF THE DOCSAFE CERTIFICATE IN BELGIUM
In order to respond to new professional needs and the evolution of document management practices, a new training program, DocSafe, was created in Belgium in September 2013⁷. DocSafe is a joint certificate in Digital Information Management offered by three institutions: the University of Namur⁸, the University of Liège⁹, and the State Archives in Belgium¹⁰.

4.1 Target Audience and Program Objectives
DocSafe is geared toward information and/or knowledge managers, records managers, archivists in charge of electronic documents, and librarians, but also IT managers, lawyers, project managers and quality managers. However, the vast majority of students who enrolled in September 2013 or September 2014 are archivists or records managers¹¹.

In accordance with its target audience, the program has attracted not only young graduates looking to specialize or reorient themselves professionally, but also active professionals seeking complementary training that will allow them to participate in a project in their workplace or open up new horizons. The classes are organized so as to be compatible with professional activities (see Section 4.2). In order to make room for discussion and interaction, the number of participants is deliberately limited¹².

At the end of the program, students are capable of developing, implementing, and following up on a digital information management and preservation project. They understand the legal, technical, and organizational issues at play in such a project and are able to make strategic choices that critically evaluate the needs, risks and constraints associated with their project. They can propose appropriate solutions and dialogue with all stakeholders involved in the management and preservation of digital records, particularly technicians, information managers (records managers and archivists), lawyers, decision-makers, and users of both records and systems, in addition to service providers.

4.2 Program Structure
The program is structured over one academic year, from September to June (170 hours — 20 ECTS¹³), with a schedule adapted to professional activity. Once a month, participants are invited to take part in a 2.5-day training session, namely a Thursday and a Friday (during office hours) and a Saturday morning. Thus, for professionally active participants, the employer is invited to release the employee on work time for two consecutive days a month, and the employee is invited to devote half a day of her free time to the program. The goal of this compromise is to promote engagement in the training on the part of both the employer and the employee. Furthermore, this format allows for balance between the program, private life, and professional life for the participant, as well as promoting good gender balance, particularly for those participants who have familial responsibilities¹⁴.

The program is organized into five modules that follow a typical project management trajectory. Each module is cross-disciplinary and tackles the legal, technical and organizational dimensions of digital information management and preservation.

Given the variety of participant profiles, the goal of the first module is to give each participant the fundamental principles, no matter her background, in order to guarantee a shared level of understanding. This first module therefore presents the basics of project management and records management, the technical issues at play in digital information management and preservation, as well as an introduction to legal analysis and the regulation of the information society. The second module is focused on analyzing the existing situation (mapping workflows, dataflows, and actors; typology of documents and constraints analysis) and the legal valuation of the project (legal and probative value of the information, privacy concerns and protection of personal data, digital heritage and legislation on archives, access to and reuse of public sector information). The third module looks at the conception of a solution (needs and risk analyses, functional architecture). The fourth module looks at strategic and operational aspects, with an emphasis on change management and the human aspects of the project, selection and implementation of the solution, and security strategies, as well as legal protections related to the solution (copyright and copyleft, contractual aspects, liability, and confidentiality). Finally, the last module focuses on the evaluation and evolution of the project (follow-up and sustainability).

Furthermore, over the course of the training session, three one-day seminars are organized, bringing participants together around a cross-disciplinary question. These seminars are a chance to review the material and to gain perspective, using all facets of information management to analyze a question in its entirety. The three subjects of study are email management, cloud computing, and document digitization projects, which currently represent the

7 www.docsafe.info.
8 Research Centre for Information, Law and Society (CRIDS).
9 Research and Intervention Centre for Organizational Innovation Processes (HEC-LENTIC).
10 Digital Preservation and Access unit (DIGI-P@T).
11 70% of participants in both the 2013-2014 and the 2014-2015 sessions are information professionals (archivists or records managers).
12 22 were enrolled in the 2013-2014 session and 14 in 2014-2015, the number having been reduced to leave more room for interaction between participants during classes. These numbers are quite positive, as it is a very specialized training within the French Community of Belgium, which includes Brussels and Wallonia.
13 European Credit Transfer System.
14 Note that 54% of participants were women for the 2013-2014 session, and 78% for 2014-2015.

three most common preoccupations of an organization. However, these themes will probably evolve over the years as a result of changes in practice and technologies. Subjects such as open data and big data are already being considered.

4.3 Learning and Evaluation Methods
The program seeks out instructors from universities, business, and the public sector, and is based on a dynamic and interactive pedagogy suited to a classroom of professionals. The objective is to give them the tools and skills needed to manage digital information efficiently from a medium- and long-term perspective.

Over the course of the modules, participants are led to use and apply various analytical tools, to share their perspectives and to exchange points of view through scenarios, group exercises, and case studies, relying on a solid theoretical base.

The evaluation of participants' competences balances theoretical knowledge and applied skills. In order to assure the progressive assimilation of theoretical knowledge during the program, there are regular quizzes throughout the year. At the end of the program, a two-and-a-half-day residential seminar takes place that allows students to immerse themselves in a cross-disciplinary case study requiring the integration of the different perspectives and principles taught in the various modules. The participants, divided into small interdisciplinary groups, are thus placed in a situation close to reality, requiring analysis and synthesis, pragmatism and creativity, team management and time management. At the end of the seminar, they must deliver a brief written summary giving their diagnostic and proposing a strategic plan, which they must present before a board of directors made up of instructors from different disciplines.

Finally, each participant is invited to draft individually a digital information management and preservation project tied to their professional context. This analysis has proven to be particularly rewarding for the student and directly useful for her possible employer, applying the skills acquired throughout the program right away and putting them to use for the organization. Participants have expressed great satisfaction at the end of the program, emphasizing that it provided them not only with theory and methodological tools, but also with the self-assurance needed to sit at the table and act as full stakeholders in a digital information management project within their organization.

5. CONCLUSION
New skills are proving themselves necessary today for the acquisition of a concrete and useful skill-set in digital information management and preservation. But such skills are proving equally indispensable in the development of interpersonal skills for the archivist at a time when archivists are searching for a new identity.

Modernizing archival studies programs means showing that information management is a career for the future, helping to inspire new interest in the field and to promote professional reorientation, while responding to the evolution of professional needs and the job market. It also means promoting, supporting, and encouraging information professionals to actively engage in information management within their organization, with real added value.

Far from heralding the decline of the archival profession, the information society represents an opportunity. More than ever, information is power. As an information professional, the archivist has a major role to play in this arena, subject to the acquisition of new skills. This evolution transpires through the revision of archival studies programs to better take into account professional needs. It means not only developing more modern, more attractive programs with the means to train a new generation of archivists, but also complementing the training of professionals looking to reorient their careers through continuing education tailored to their needs.

6. REFERENCES
[1] Couture, C. and Lajeunesse, M. 2014. L'archivistique à l'ère numérique. Les éléments fondamentaux de la discipline. Presses de l'Université du Québec, Montreal.
[2] Demoulin, M. and Soyez, S. 2015. "L'interdisciplinarité dans la formation archivistique : un atout pour l'archiviste de demain." Forthcoming in the next Proceedings of the Journées des Archives de l'Université Catholique de Louvain.
[3] DocSafe. Certificate in Digital Information Management. www.docsafe.info.
[4] International Council on Archives (ICA). Discover archives and our profession - Archives and record keepers. http://www.ica.org/125/about-records-archives-and-the-profession/discover-archives-and-our-profession.html.
[5] InterPARES project. www.interpares.org.
[6] Petit, R., Van Overstaeten, D., Coppens, H. and Nazet, J. 1994. Terminologie archivistique en usage aux Archives de l'Etat en Belgique. Archives Générales du Royaume, Brussels.

Author Index

Aki Lassila 58
Alan Gairey 92, 145
Aleksandra Mrdavšič 28
Alessia Ghezzi 113
Ana Rodrigues 17
Ângela Guimarães Pereira 113
Ann Keen 115
Anne-Sophie Maure 96
Anssi Jääskeläinen 89
Barbara Reed 126
Bart Ballaux 12
Bogdan-Florin Popovici 53
Cécile de Terwangne 80
Charles Jeurgens 7
Diogo Proença 100
Estefania Aguilar Moreno 113
Fiorella Foscarini 31
Francisco Barbedo 17
François Chazalon 6
Frank Upward 126
Gillian Oliver 31, 126
Helder Silva 63, 83
Hélène Guichard-Spica 96
James Carr 115
James Lappin 106
James Lowry 105
Janek Rozov 1
Janet Delve 71
Jaroslaw Lotarski 76
Jean Mourain 6
Jeroen van Oss 12
Joanne Evans 126
Job Sueters 76
Johanna Räisä 24
Jon Garde 57, 99
Jose Borbinha 37, 100
José Carlos Ramalho 63
Jože Škofljanec 28
Kevin O'Farrelly 115
Kuldar Aas 71
Liisa Uosukainen 89
Liliana Ragageles 37
Liivi Karpištšenko 1
Lucília Runa 17
Luis Faria 63, 83
Marie Demoulin 80, 131
Mário Sant'Ana 17
Maïté Braud 115
Miguel Coutada 63
Miguel Ferreira 83
Mikko Eräkaski 69
Mikko Lampi 58
Mirja Loponen 24
Miroslav Novak 119
Pauline Sinclair 92
Rainer Schmidt 71
Ricardo Vieira 37, 100
Richard Jeffrey-Cook 32
Robert Sharpe 92, 115
Sébastien Soyez 80, 131
Seth Van Hooland 80
Stephen Howard 42
Taavi Kotka 1
Tarvo Kärberg 78
Tatjana Hajtnik 119
Timo Honkela 58