in OSS-DL

Dr. Sunita Barve, National Centre for Radio Astrophysics, Pune, INDIA, Email: [email protected]
Dr. Devika Madalli, Documentation Research Training Centre, Bangalore 560059, INDIA, Email: [email protected]
Prof. ARD Prasad, Documentation Research Training Centre, Bangalore 560059, INDIA, Email: [email protected]

TELDAP 2011, Taiwan Conference, 18th March 2011

Outline of the Talk

• Introduction
• Digital Preservation
• Open Source Software (OSS-DL)
• Installation of OSS-DL on a Test Bed Environment
• Evaluation Criteria for Digital Preservation
• Findings
• Conclusion

Introduction

• Today, information is produced in digital form, and vast amounts of digital content are made available to users.

• Digital information is growing at a rapid rate.

• Digital information is heterogeneous in form, and digital data is complex.

• The hardware and software with which digital information is created change continuously.

• This makes the preservation of digital documents an important challenge.

Digital Preservation

“An activity within archiving in which specific items of data are maintained over time so that they can still be accessed and understood through changes in technology.”

• Digital preservation demands keeping digital information
– accessible
– viewable
– usable for the future

• All over the world, several institutes have taken steps to archive their digital content using either commercial or open source software.

• A range of open source software is available for building digital libraries, institutional repositories, digital archives and digital repositories.

Open Source Software

Open source software is available free of charge under open source licence terms and conditions, with the source code available to the end user for further development, redistribution and customization.

• http://sourceforge.net

(As of February 2009, the SourceForge repository hosts more than 230,000 projects and has more than 2 million registered users).

• http://www.oss4lib.org

Introduction

• We searched for open source software available for building digital libraries.

• The software were shortlisted on the basis of their functionality, especially for managing digital collections.

• The shortlist was finalised after the latest version of each software had been installed successfully.

• The main objective of the study was to find out which digital preservation features these software support.

• The digital preservation support of each software was verified against a set of defined evaluation criteria.

• The evaluation criteria are based on earlier studies reported in the literature.

• Initially, nine OSS-DL packages were selected for the present study:

– CDS-Invenio (Switzerland)
– DoKS (Belgium)
– DSpace (USA)
– EPrints (UK)
– Fedora (USA)
– Greenstone (New Zealand)
– MyCoRe (Germany)
– OPUS (Germany)
– SciX (Slovenia)

Test-Bed Setup

• A test-bed environment was created, and the major OSS-DL software (CDS-Invenio, DSpace, EPrints, FEDORA, Greenstone and MyCoRe) were installed on a Pentium IV 3.00 GHz processor with 1 GB RAM and 1 TB disk space, running Debian GNU/Linux 5.0.6 (lenny).

• The software were shortlisted from the ROAR site (http://www.roar.org).

• ROAR maintains a list of open access repositories and of the tools used to create them.

• A small collection covering every document type (text, audio, video, etc.) was added to each installed software in order to study the digital preservation features it supports.

Evaluation Criteria

• Digital Preservation Strategy

– bit-level preservation
– format migration
– emulation

• Metadata Support

• Preserve each file’s original identity, such as its name, size and creation date

• Data integrity check

• Persistent Identification Number
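The file-identity and data-integrity criteria above can be illustrated with a short sketch. It is not taken from any of the surveyed software; the function names and the record layout are assumptions made for illustration, and the modification time stands in for the creation date because the latter is not portably available.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path


def file_checksum(path, algorithm="sha256", chunk_size=65536):
    """Return the hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def describe_file(path):
    """Record a file's identity details at ingest time (hypothetical layout)."""
    path = Path(path)
    stat = path.stat()
    return {
        "original_name": path.name,
        "size_bytes": stat.st_size,
        # Assumption: modification time is used in place of the created date.
        "modified_utc": datetime.fromtimestamp(
            stat.st_mtime, tz=timezone.utc
        ).isoformat(),
        "md5": file_checksum(path, "md5"),
        "sha256": file_checksum(path, "sha256"),
    }


def is_intact(path, record):
    """Integrity check: compare current size and SHA-256 with the stored record."""
    path = Path(path)
    return (
        path.stat().st_size == record["size_bytes"]
        and file_checksum(path, "sha256") == record["sha256"]
    )
```

A repository could store the result of `describe_file` at ingest and call `is_intact` periodically; a `False` result would signal that the stored file no longer matches what was deposited.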

• Metadata Preservation

o provenance - documenting the history of the object.
o authenticity - validating that the digital object is in fact what it should be, and has not been altered.
o preservation activity - documenting the actions taken to preserve the digital object.
o technical environment - describing the technical requirements, such as hardware and software, needed to render and use the digital objects.
o rights management - recording any binding intellectual property rights that may limit the repository’s ability to preserve and disseminate the digital object over time.

• How does the software manage compound objects?

o Multiple file formats of the same object are linked together
o Documents having multiple pages are linked together

• Does the software keep audit logs for all documents added to the repository, recording who, when, what, how and where?

• In what format are the logs? Are they easily accessible?
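As a rough illustration of the audit-log criteria above, the sketch below writes one plain-text JSON record per action, with fields for who, when, what, how and where. The function names and field layout are hypothetical and do not reflect the log format of any surveyed software.

```python
import json
from datetime import datetime, timezone


def audit_entry(who, what, how, where, when=None):
    """Build one audit record answering who, when, what, how and where."""
    when = when or datetime.now(timezone.utc)
    return {
        "who": who,
        "when": when.isoformat(),
        "what": what,
        "how": how,
        "where": where,
    }


def append_entry(log_path, entry):
    """Append the record as one JSON line so the log stays plain text."""
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(entry, sort_keys=True) + "\n")


def read_entries(log_path):
    """Read the log back; each non-empty line is an independent JSON record."""
    with open(log_path, encoding="utf-8") as log:
        return [json.loads(line) for line in log if line.strip()]
```

An append-only text format of this kind is one way to keep logs easily accessible, since the file can be read with ordinary text tools as well as parsed programmatically.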

• If the repository ingests digital content with unclear ownership or rights, does the software have policies addressing those rights?

• Does the system have any mechanism for determining when objects in the digital archive should migrate to new hardware and software?

• Can the software handle a variety of file formats, and does it also support file format versioning?

• Does the system support automatic format recognition? For unknown formats, does the system send a message to the submitter requesting additional information?

• Where in the repository are the actual files and the metadata stored?

Metadata Support

• Dublin Core (for all digital documents)
• Metadata Encoding and Transmission Standard (METS) (complex digital objects)
• Preservation Metadata: Implementation Strategies (PREMIS)
• MARC 21
• MARCXML
• Metadata Object Description Schema (MODS)
• Online Information Exchange (ONIX) (e-books)
• Encoded Archival Description (EAD) (manuscripts and archival content)
• Text Encoding Initiative (TEI)
• Learning Object Metadata (LOM) (e-learning objects)
• Visual Resources Association (VRA) Core (paintings and sculptures)
• MPEG Multimedia Metadata (audio and video)
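To give a sense of what the simplest of these standards looks like in practice, the following sketch builds a small Dublin Core record with Python's standard XML library. The namespace URI is the Dublin Core element set namespace; the `metadata` wrapper element and all field values are hypothetical sample data, not output from any of the surveyed software.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def dublin_core_record(fields):
    """Build an XML record from (element, value) pairs.

    The 'metadata' root element is an assumption made for this sketch.
    """
    root = ET.Element("metadata")
    for element, value in fields:
        child = ET.SubElement(root, f"{{{DC_NS}}}{element}")
        child.text = value
    return root


# Hypothetical sample values, for illustration only.
record = dublin_core_record([
    ("title", "Sample report"),
    ("creator", "Example Author"),
    ("date", "2011-03-18"),
    ("format", "application/pdf"),
])
xml_text = ET.tostring(record, encoding="unicode")
```

Richer standards such as METS or PREMIS wrap or extend this kind of descriptive record with structural and preservation information.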

Metadata standards supported by each software:

• Invenio: DC, MARC 21
• DSpace: DC
• EPrints: DC
• FEDORA: DC, METS, MPEG-21, DIDL, IEEE LOM, MARC, FOXML, ATOM
• Greenstone: New Zealand Government Locator Service Metadata Standard (NZGLS), RFC 1807 Metadata Element Set, Dublin Core, Development Library Subset Example Metadata, Greenstone Metadata Set, Australian Government Locator Service Metadata
• MyCoRe: DC Element Set, METS, MARC

EPrints can export data in a variety of metadata formats.

Persistent Identification

• For stable long-term management of digital collections, persistent identifiers are required.

• CDS-Invenio: Its own identification number
• DSpace: Handle.net
• EPrints: Generates a URI for every document and also allows other persistent identification numbers to be added for the document.
• FEDORA: URI, as well as other identifier schemes found in PRONOM and the Global Digital Format Registry (GDFR).
• Greenstone: No persistent identification
• MyCoRe: Uniform Resource Name

Checksum Support

• CDS-Invenio, DSpace, EPrints and MyCoRe support MD5 checksums.

• FEDORA supports a variety of checksums:
– MD5
– SHA-1
– SHA-256
– SHA-384
– SHA-512

• Greenstone: No support for any checksum.

Document Versioning Support

• CDS-Invenio, DSpace, FEDORA, MyCoRe and EPrints support adding different versions of documents.

• Greenstone: No support for adding different versions of documents.

Automatic Format Recognition

• CDS-Invenio: Accepts documents in all desired formats. The system administrator can limit the formats of submitted documents, which allows the repository to define a policy under which it accepts only those formats of digital objects that it can manage from a technical point of view.

• DSpace: Aims to support as many file formats as possible, but many proprietary file types are not identified on upload and are treated as “other” formats.

• EPrints: A fixed list of file formats is identified as soon as a file is uploaded. All remaining files are marked as “other” when uploaded into the repository.

• FEDORA: All MIME-type file formats are supported.

• Greenstone: When a digital document is uploaded, Greenstone (gsdl) tries to identify the file format and the plugin required to open the file, but not all file formats are recognized.

• MyCoRe: Accepts uploads in any file format.

Audit Logs

• CDS-Invenio: The software maintains search, indexing and Apache logs.

• DSpace: Detailed logging is supported, with DSpace's own log as well as the Tomcat log.

• EPrints: The software does not have a log area of its own. Since it runs on the Apache web server, it records some information in the Apache access and error logs.

• FEDORA: The software keeps detailed logs in its own area, including client and server logs, and Tomcat logs are also maintained.

• Greenstone: No logs are maintained.

• MyCoRe: No detailed logs are supported.

Details of the Files

• CDS-Invenio: Replaces the file's original name with its own name.

• EPrints, Greenstone: Preserve the file's own identity when it is uploaded into the software.

• DSpace and FEDORA: Change the file name to fit their own internal structure.

• MyCoRe: Changes file names using the date, time and an internal number.

Actual File Storage

• CDS-Invenio: Actual files are stored in “data” directory. Metadata is stored in “”.

• DSpace : Actual files are stored in “assetstore” folder and metadata is stored in “postgres”.

• Eprints: The metadata is stored in “mysql” and actual files are stored in “disk0” directory of Eprints.

• FEDORA: Actual files are stored in “data/datastream” folder and metadata is stored in “mysql”.

• Greenstone: All files and metadata are stored in the “import” folder of the installation. Metadata is stored as a metadata.xml file in GSDL.

• MyCoRe: Metadata and actual files are stored in the “data” folder of docportal. MyCoRe uses the “hsql” database.

Conclusion

• There is much yet to be known and studied when it comes to the preservation of digital information.

• OSS-DL has yet to come out with proper digital preservation support.

• FEDORA supports, to a large extent, more of the features that are essential from a digital preservation point of view, but it lacks a user-friendly interface.

• DSpace and EPrints are now being used heavily all over the world to build digital repositories and institutional repositories.

• DSpace accounts for the larger number of repositories.

• In India, many institutes have taken steps to build digital archives using DSpace.

• For successful digital preservation it is necessary to use

– Open Source Software

– Open Standards

– Open Formats

Documents in proprietary formats need to be converted into open formats before they are uploaded into a digital archive for future storage, retrieval and preservation.
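The idea of converting proprietary formats to open formats before upload can be sketched as a simple pre-ingest check. The format lists below are illustrative assumptions, deliberately incomplete, and based only on file extensions; none of the surveyed software is claimed to work this way, and a real check would inspect file contents (for example to confirm PDF/A conformance).

```python
from pathlib import Path

# Assumption: a small sample of extensions treated as open formats.
OPEN_FORMATS = {".odt", ".ods", ".odp", ".txt", ".xml", ".png"}

# Assumption: legacy Microsoft Office formats mapped to ODF targets.
CONVERSION_TARGETS = {".doc": ".odt", ".xls": ".ods", ".ppt": ".odp"}


def ingest_decision(filename):
    """Return (action, target_extension) for a file offered for upload.

    'accept'  - already in a listed open format
    'convert' - should be converted to the returned open format first
    'review'  - unknown here; needs a manual or content-based check
    """
    suffix = Path(filename).suffix.lower()
    if suffix in OPEN_FORMATS:
        return ("accept", suffix)
    if suffix in CONVERSION_TARGETS:
        return ("convert", CONVERSION_TARGETS[suffix])
    # A plain .pdf lands here: the extension alone cannot show PDF/A conformance.
    return ("review", None)
```

For example, `ingest_decision("Thesis.DOC")` returns `("convert", ".odt")`, while a file with an unlisted extension is routed to review rather than silently accepted.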

We plan to study how tools such as DROID (Digital Record Object Identification), JHOVE, Xena and LOCKSS support digital preservation.

Xena, a tool developed by the National Archives of Australia, can be used to convert Microsoft Office documents to OpenDocument Format (ODF). Similarly, PDF/A is used from a preservation point of view.

Existing OSS-DL do not have any measures to restrict uploads to documents that are in open standard formats. These software need to identify the format of a document, and apply such restrictions, before it is uploaded into the repository.

Thank You!

Thanks to the TELDAP Organizers for Supporting me to attend this wonderful conference!