ISSN 1747-1524

DCC | Digital Curation Reference Manual

Instalment on “Automated Metadata Generation” http://www.dcc.ac.uk/resources/curation-reference-manual/completed-chapters/automated-metadata-extraction/

Milena Dobreva, Yunhyong Kim and Seamus Ross

November 2013 Version 1.0


Legal Notices

The Digital Curation Reference Manual is licensed under a Creative Commons Attribution - Non-Commercial - Share-Alike 2.0 License.

© in the collective work - (which in the context of these notices shall mean one or more of the University of Edinburgh, the University of Glasgow, and the University of Bath and the staff and agents of these parties involved in the work of the Digital Curation Centre), 2005.

© in the individual instalments – the author of the instalment or their employer where relevant (as indicated in catalogue entry below).

The Digital Curation Centre confirms that the owners of copyright in the individual instalments have given permission for their work to be licensed under the Creative Commons license.

Catalogue Entry

Title: DCC Digital Curation Reference Manual instalment on “Automated Metadata Generation”
Creator: Dobreva, M., Kim, Y., and Ross, S. (authors)
Subject: Information Technology; Science; Technology--Philosophy; Computer Science; Digital Records; Science and the Humanities
Description: This chapter will discuss the role of automated metadata generation in curating and understanding often complex datasets.
Publisher: HATII, University of Glasgow; University of Edinburgh; UKOLN, University of Bath
Contributor: Joy Davidson (editor)
Date: 10 November 2013 (creation)
Type: Text
Format: Adobe Portable Document Format v.1.3
Resource Identifier: ISSN 1747-1524
Language: English
Rights: © HATII, University of Glasgow

Citation Guidelines
Dobreva, M., Kim, Y., and Ross, S. (2013), "Automated Metadata Generation", DCC Digital Curation Manual, J. Davidson, S. Ross, M. Day (eds), Retrieved , from http://www.dcc.ac.uk/resources/curation-reference-manual/completed-chapters/automated-metadata-extraction


About the DCC

The JISC-funded Digital Curation Centre (DCC) provides a focus on research into digital curation expertise and best practice for the storage, management and preservation of digital information to enable its use and reuse over time. The project represents a collaboration between the University of Edinburgh, the University of Glasgow through HATII, UKOLN at the University of Bath, and the Council of the Central Laboratory of the Research Councils (CCLRC). The DCC relies heavily on active participation and feedback from all stakeholder communities. For more information, please visit www.dcc.ac.uk. The DCC is not itself a data repository, nor does it attempt to impose the policies and practices of one branch of scholarship upon another. Rather, based on insight from a vibrant research programme that addresses wider issues of data curation and long-term preservation, it will develop and offer programmes of outreach and practical services to assist those who face digital curation challenges. It also seeks to complement and contribute towards the efforts of related organisations, rather than duplicate services.

Digital Curation Reference Manual Editors (2010-2011)
• Joy Davidson, Associate Director, Digital Curation Centre (UK)
• Kevin Ashley, Director, Digital Curation Centre (UK)

Digital Curation Reference Manual Editors (2005-2010)
• Seamus Ross, Director, HATII, University of Glasgow (UK)
• Michael Day, Research Officer, UKOLN, University of Bath (UK)

Digital Curation Reference Manual copy editor
• Florance Kennedy, Administrator, Digital Curation Centre

Peer review board members have included
• Neil Beagrie, JISC/British Library Partnership Manager (UK)
• Georg Büechler, Digital Preservation Specialist, Coordination Agency for the Long-term Preservation of Digital Files (Switzerland)
• Filip Boudrez, Researcher DAVID, City Archives of Antwerp (Belgium)
• Andrew Charlesworth, Senior Research Fellow in IT and Law, University of Bristol (UK)
• Robin L. Dale, Program Manager, RLG Member Programs and Initiatives, Research Libraries Group (USA)
• Wendy Duff, Associate Professor, Faculty of Information Studies, (Canada)
• Peter Dukes, Strategy and Liaison Manager, Infections & Immunity Section, Research Management Group, Medical Research Council (UK)
• Terry Eastwood, Professor, School of Library, Archival and Information Studies, University of British Columbia (Canada)
• Julie Esanu, Program Officer, U.S. National Committee for CODATA, National Academy of Sciences (USA)
• Paul Fiander, Head of BBC Information and Archives, BBC (UK)
• Luigi Fusco, Senior Advisor for Earth Observation Department, European Space Agency (Italy)
• Norman Gray, Researcher, Department of Physics and Astronomy, University of Glasgow
• Hans Hofman, Director, Erpanet; Senior Advisor, Nationaal Archief van Nederland (Netherlands)
• Falk Huettmann, PhD, Associate Professor, University of Alaska
• Max Kaiser, Coordinator of Research and Development, Austrian National Library (Austria)
• Gareth Knight, Preservation Officer, CeRch, Kings College
• Carl Lagoze, Senior Research Associate, Cornell University (USA)
• Nancy McGovern, Associate Director, IRIS Research Department, Cornell University (USA)
• Jen Mitcham, Curatorial Officer, Archaeology Data Service, University of York
• Mary Molinaro, Director, Preservation and Digital Programs, University of Kentucky Libraries
• Reagan Moore, Associate Director, Data-Intensive Computing, San Diego Supercomputer Center (USA)
• Sheila Morrissey, Senior Research Developer, ITHAKA
• Alan Murdock, Head of Records Management Centre, European Investment Bank (Luxembourg)
• Julian Richards, Director, Archaeology Data Service, University of York (UK)
• Donald Sawyer, Interim Head, National Space Science Data Center, NASA/GSFC (USA)
• Jean-Pierre Teil, Head of Constance Program, Archives nationales de France (France)
• Mark Thorley, NERC Data Management Coordinator, Natural Environment Research Council (UK)
• Helen Tibbo, Professor, School of Information and Library Science, University of North Carolina (USA)
• Malcolm Todd, Head of Standards, Digital Records Management, The National Archives (UK)
• Andrew Wilson, Senior Data Policy Advisor, Australian National Data Service
• Erica Yang, STFC Rutherford Appleton Laboratory

Preface

The Digital Curation Centre (DCC) develops and shares expertise in digital curation and makes accessible best practices in the creation, management, and preservation of digital information to enable its use and reuse over time. Among its key objectives is the development and maintenance of a world-class digital curation manual. The DCC Digital Curation Reference Manual (formerly the Digital Curation Manual) is a community-driven resource—from the selection of topics for inclusion through to peer review. The Manual is accessible from the DCC web site (http://www.dcc.ac.uk/resources/curation-reference-manual).

Digital Curation Reference Manual instalments provide detailed and practical information aimed at digital curation practitioners. They are designed to assist data creators, curators and reusers to better understand and address the challenges they face and to fulfil the roles they play in creating, managing, and preserving digital information over time. Each instalment will place the topic on which it is focused in the context of digital curation by providing an introduction to the subject, case studies, and guidelines for best practice(s). To ensure that this manual reflects new developments, discoveries, and emerging practices, authors will have a chance to update their contributions annually.

To ensure that the manual is of the highest quality, the DCC has assembled a peer review panel including a wide range of international experts in the field of digital curation to review each of its instalments and to identify newer areas that should be covered. The list of current and previous members of the peer review board is provided at the beginning of this document.

The DCC actively seeks suggestions for new topics and suggestions or feedback on completed instalments. Both may be sent to the editors of the DCC Digital Curation Reference Manual at [email protected].

Joy Davidson and Kevin Ashley Digital Curation Centre

18 April 2011

Biographies of the authors

Seamus Ross is Dean and Professor, Faculty of Information, University of Toronto. Formerly, he was Professor of Humanities Informatics and Digital Curation and Founding Director of HATII (Humanities Advanced Technology and Information Institute, http://www.hatii.arts.gla.ac.uk) (1997-2009) at the University of Glasgow. He served as Associate Director of the Digital Curation Centre (2004-9) in the UK (http://www.dcc.ac.uk), and was Principal Director of ERPANET (http://www.erpanet.org) and DigitalPreservationEurope (DPE, http://www.digitalpreservationeurope.eu and http://www.youtube.com/user/wepreserve), and a co-principal investigator of such projects as the DELOS Digital Libraries Network of Excellence (http://www.dpc.delos.info/) and Planets (http://www.planets-project.eu/). He recommends Digital Preservation and Nuclear Disaster: An Animation, http://www.youtube.com/watch?v=pbBa6Oam7-w and "Digital Archaeology" (1999), http://eprints.erpanet.org/47/01/rosgowrt.pdf

Dr. Milena Dobreva is a Senior Lecturer in Library, Information and Archive Sciences at the Faculty of Media and Knowledge Sciences at the University of Malta. She was the principal investigator of EC-, JISC- and UNESCO-funded projects in the areas of user experience, digitisation and digital preservation, and is a regular project evaluator for the EC. From 1990 to 2007 she worked at the Bulgarian Academy of Sciences, where she earned her PhD in Informatics and served as the founding head of the first Digitisation Centre in Bulgaria. She was also chair of the Bulgarian national committee of the Memory of the World programme of UNESCO. From 2007 to 2011 she worked for the University of Glasgow and the University of Strathclyde. Milena was awarded an honorary medal for contribution to the development of the relationship between Bulgaria and UNESCO (2006) and an Academic Award for young researchers (Bulgarian Academy of Sciences, 1998).

Yunhyong Kim is the lead researcher and Co-Principal Investigator on the BlogForever project, funded by the European Commission under the Framework Programme 7 (FP7) ICT Programme (ICT No. 269963). She has a PhD in Mathematics from the University of Cambridge and an MSc in Speech and Language Processing from Linguistics and English at the University of Edinburgh. Yunhyong's research focus is machine learning methods that support information management, use, and preservation, as well as enhancing knowledge discovery. She has developed methods for automated document genre classification and semantic metadata extraction in the context of digital preservation and the ingest and appraisal of material in a digital repository environment.

Introduction and scope

The vital role of metadata in managing digital objects has been stressed in a number of previous studies, e.g. by the National Science Foundation (NSF) National Science Initiative 1 and the DELOS workgroup (Hedstrom et al. 2003). This has been further emphasised in instalments of the Curation Reference Manual contributed by Michael Day (2005), Marlene van Ballegooie & Wendy Duff (2006), and Priscilla Caplan (2006), who have examined metadata for different purposes (e.g. preservation and archival) and the issues surrounding the classes of metadata that ease the efficient management of digital objects for immediate and long-term uses. Here, we will not repeat these discussions of the necessity of metadata, nor advance discussions surrounding what constitutes adequate metadata classes for different types of objects. We shall instead bring to the forefront the benefits of automated metadata generation methods that could replace or complement manual assignment to make metadata generation easier.

Digital object management models, such as OAIS 2 and ISO 15489 3, and such digital library, archive and repository implementations as DSpace 4, LANL 5, e-Depot 6 and Portico 7, all require that objects be accompanied by metadata if they are to be managed efficiently and adequately. But none of these models posits a solution to the creation of that metadata, nor proposes how metadata generation could benefit from automation. Manual assignment of metadata is expensive (DELOS/NSF Working Groups, 2003; Hedstrom et al., 2003; PREservation Metadata: Implementation Strategy Working Group (PREMIS)), and metadata generation faces the threat of the metadata bottleneck 8, a metaphor which illustrates that the human effort needed to create metadata cannot cope with the pace of creation of new digital resources. A number of researchers, including those involved in the release of the DigitalPreservationEurope research roadmap (DPE) 9 and an earlier study by Ross and Hedstrom (2005), have recognised the need for automation in various preservation-related activities.

The need to use automated methods to support the generation of metadata has also been noted by the Library of Congress Action Plan 10. The benefits of automated metadata generation have been observed in many research initiatives: the recent Joint Information Systems Committee (JISC) funded projects “Automatic Metadata Generation: use case identification and tools/services prioritisation” 11 and “Metadata Generation for Resource Discovery” 12 produced a survey of use case scenarios and available tools for metadata extraction (Duncan and Douglas, 2009), and the Automatic Metadata Generation Applications (AMeGA) project 13 aimed to recommend functionalities for generating metadata within the library and bibliographic communities. However, there is no established, explicit best-practice workflow model that can be implemented to meet the needs of any given digital object collection and metadata scheme.

1 http://www.cs.cornell.edu/lagoze/papers/Arms-et-al-LibraryHITECH.pdf
2 http://public.ccsds.org/publications/archive/650x0b1.pdf
3 http://www.datacapture.co.uk/information/ISO-15489.htm
4 http://www.dspace.org/
5 http://www.lanl.gov
6 http://www.kb.nl/dnp/e-depot/operational/background/index-en.html
7 http://www.portico.org
8 The term metadata bottleneck was coined by E. Liddy (2002).
9 http://www.digitalpreservationeurope.eu/publications/dpe_research_roadmap_D72.pdf
10 http://lcweb.loc.gov/catdir/bibcontrol/actionplan.pdf
11 http://www.jisc.ac.uk/whatwedo/programmes/inf11/resdis/automaticmetadata.aspx
12 http://www.jisc.ac.uk/whatwedo/programmes/resourcediscovery/autometgen.aspx
13 http://www.loc.gov/catdir/bibcontrol/lc_amega_final_report.pdf

Even confining our attention to digital text documents, we have previously observed (Kim and Ross 2006; Kim and Ross 2007) that research in the automated generation of metadata has been somewhat fragmented. So far, automated methods of generating metadata for digital text documents have been developed to support the generation of very specific metadata (e.g. technical metadata, bibliographic metadata) for a limited number of document formats (e.g. HTML, PostScript, JPEG) and/or types (e.g. email, news, scientific article). Efforts to integrate these results into a general framework that could generate, upon request, any metadata for any given text document format and type are severely lacking. We describe some of the current tools and propose a model for how these can be incorporated as resources in an automated management framework for prompting the generation of any metadata for any type of text document, thereby creating a “universal metadata generation tool”. Although the framework proposed here is tailored to meet the needs of text documents, the principal workflow is independent of this particular implementation.

At first the construction of a universal metadata generation tool may seem unrealistic, given the variety of file formats and document types. In practice, however, once the type of the document is known, it is fairly easy to select the tools, if they exist, to generate the metadata from the object and its context effectively. If the tools do not exist, the same framework can initiate the creation of a tool or the manual generation of the required metadata. We suggest a workflow model for the management of automated metadata generation, optimising the selection of tools for given document types and metadata once these have been identified.

Here, we propose document genre as a good first-level entry point for determining document type. However, the general framework model being proposed is not reliant on this choice. The model operates on the basis of choosing the metadata extraction tool that has the best track record for the detected document type with respect to extracting the requested metadata. The approach we describe is agnostic in terms of metadata schema or elements, i.e. it establishes a metadata generation framework that can be adapted to generate any single metadata element upon request. We propose an automated metadata generation management workflow model that could:

- reduce the cost of metadata assignment to digital objects;
- enable us to keep up with the fast pace of information production;
- support the maintenance of consistency (also observed by Greenberg et al. 2005): to keep pace using manual generation, an increasing number of metadata collectors have to be involved, propagating inconsistency in the quality of generated metadata, as different collectors have different levels of experience, backgrounds, working environments, and physical and emotional states;
- help to identify classes of metadata that can be generated by fully automated methods, classes of metadata for which new tools or manual generation are required, and a workflow that would ensure quality control.
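As an illustration of the tool-selection principle described above, the sketch below keeps a registry of extraction tools keyed by document type and metadata element and returns the one with the best recorded track record. This is a minimal sketch under our own assumptions: the ToolRecord class, the registry and the scores are all hypothetical, and no such public registry currently exists.

```python
# Hypothetical registry of metadata extraction tools, keyed by
# (document type, metadata element); selection picks the best track record.
from dataclasses import dataclass
from typing import Optional

@dataclass
class ToolRecord:
    name: str
    f1_score: float  # benchmark performance for this (type, element) pair

REGISTRY: dict[tuple[str, str], list[ToolRecord]] = {
    ("research_paper", "title"): [
        ToolRecord("layout-title-extractor", 0.92),
        ToolRecord("generic-dc-extractor", 0.78),
    ],
}

def select_tool(doc_type: str, element: str) -> Optional[ToolRecord]:
    """Return the best-performing registered tool, or None, which would
    trigger tool acquisition or manual metadata generation."""
    candidates = REGISTRY.get((doc_type, element), [])
    return max(candidates, key=lambda t: t.f1_score, default=None)
```

Returning None rather than raising an error matters here: in the workflow described later, the absence of a suitable tool is itself an informative outcome that routes the object to a queue or to manual processing.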


A central feature of the proposed model is that it foresees quality control to ensure that only digital objects supplied with metadata of the desired quality are ingested into the digital repository. The framework we present operates on the premise that the generation of quality-controlled metadata is a continuing process that begins at the time of object creation and ingest into a repository, and continues throughout the object's lifecycle as circumstances change (e.g. tools become available and migration takes place). Following an analysis of specific preservation-quality metadata needs and of ingest workflows, we pursue the question of which tools might increase both the capacity and quality of document ingest into digital repositories with regard to automated metadata extraction. We will demonstrate how our framework may improve the ability of archives, repositories and other digital object management services to handle not only the extraction, creation, and assignment of metadata for digital objects in their collections but also the quality control of the assigned metadata.

In the next sections, we will illustrate how previous research has been fragmented, focusing only on selected situations and scenarios. This will be followed by a discussion of how previous work in automation can be combined to create a digital curation workflow for automated metadata generation. A detailed description of the model will then be presented, along with a discussion of suggested future developments and a concluding discussion.


Background and developments to date

The need for automation in various preservation-related activities has been recognised and raised in various policy documents in the past decade, including the DigitalPreservationEurope research roadmap (DPE), which echoed the conclusions of an earlier study by Ross and Hedstrom (2005) by highlighting the insufficient level of automation implemented in practice. The results of the AMeGA project (Greenberg et al. 2005), however, demonstrated that experimental research in automated metadata generation, metadata generation applications, and content creation software are mostly developed independently, with little effort to benefit from each other's work.

The disconnected nature of development is also apparent across the experimental research landscape in automated generation. This has been observed in our previous work (Kim and Ross 2006; Kim and Ross 2007). As we have mentioned, ERPANET's (Electronic Resources Preservation Access Network) Packaged Object Ingest Project (ERPANET: POIP) 14 identified several automatic extraction tools for technical metadata, e.g. The National Archives (UK) Digital Record Object Identification (DROID) and the National Library of New Zealand Metadata Extraction Tool, and substantial work has been published on extracting descriptive metadata within specific domains, e.g. UKOLN DC-dot 15, Giuffrida, Shek, & Yang (2000), and Thoma (2001). Other work in related areas of information extraction (e.g. Arens & Blaesius (2003), Bekkerman, McCallum, & Huang (2004), Breuel (2003), Ke, Bowerman, & Oakes (2006), Sebastiani (2002), Shafait, Keysers, & Breuel (2006), Shao & Futrelle (2005), Witte, Krestel, & Bergler (2005)) is also observable. Nonetheless, these initiatives are fragmented and rarely brought together in an integrated framework.

Greenberg et al. (2004) present a survey of metadata experts' opinions on the functionalities required for automatic metadata generation applications, conducted as part of the AMeGA project. Participants anticipated greater accuracy with automatic techniques for technical metadata (e.g. ID, language, and format metadata) compared to metadata requiring intellectual discretion (e.g. subject and description metadata). Support for implementing automatic techniques paralleled the anticipated accuracy results. Metadata experts are in favour of using automatic techniques, although they are generally not in favour of eliminating human evaluation or production for the more intellectually demanding metadata. The results are incorporated into Version 1.0 of the Recommended Functionalities for automatic metadata generation applications.

Greenberg also explored the capabilities of two Dublin Core automatic metadata generation applications, Klarity 16 and UKOLN DC-dot 17. The top-level web page for each resource, from a sample of 29 resources obtained from the National Institute of Environmental Health Sciences (NIEHS), was submitted to both generators. The results indicate that extraction processing algorithms can contribute to useful automatic metadata generation. They also indicate that harvesting metadata from META tags created by humans can have a positive impact on automatic metadata generation. The conclusion of the study is that integrating automated generation methods will be the best approach to creating optimal metadata, and that more research is needed to identify which method gives the best results in terms of the amount and quality of metadata.

14 http://archive.ifla.org/IV/ifla74/papers/084-Rusbridge_Ross-en.pdf
15 http://www.ukoln.ac.uk/metadata/dcdot/
16 This tool appears no longer to be supported since the company producing it was acquired.
17 http://www.ukoln.ac.uk/metadata/dcdot/
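The harvesting of author-supplied META tags highlighted by this study can be illustrated with a few lines of standard-library Python. This is a minimal sketch, not DC-dot itself.

```python
# Harvest <meta name="..." content="..."> tags from an HTML page;
# these author-supplied tags are one input to automatic metadata generation.
from html.parser import HTMLParser

class MetaTagHarvester(HTMLParser):
    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attrs = dict(attrs)
            name, content = attrs.get("name"), attrs.get("content")
            if name and content:
                self.metadata[name.lower()] = content

harvester = MetaTagHarvester()
harvester.feed('<meta name="DC.title" content="Automated Metadata Generation">')
print(harvester.metadata)  # {'dc.title': 'Automated Metadata Generation'}
```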

A small-scale study comparing retrieval effectiveness (recall and precision) for queries run against professionally and automatically generated metadata records was performed by Irvin (2003). The metadata records represented web pages from the National Institute of Environmental Health Sciences. The results of 10 queries were analysed in terms of recall and precision, and suggest that professionally generated metadata records are not significantly better in terms of information retrieval effectiveness than automatically generated ones.

Adding more formats and document types to a single tool would most likely degrade the levels of accuracy these tools currently achieve. Even research groups applying similar methods to various file formats tend to implement the same method in different tools, and the results for the various file formats vary.

None of these solutions addresses the issue of extracting contextual information, because they do not check anything external to the document; more research is needed in this direction. The issue of quality is also not addressed sufficiently: beyond the recall and precision measures, other important parameters such as completeness, sufficiency and trustworthiness are neglected.

Most researchers working on metadata extraction choose a specific format and type of content for their experiments. Ongoing research shows that it is relatively easy to build a tool when the structure and class of content of the document are already known. However, implementing this work in practice is difficult, because digital repositories ingest objects which come in many different formats and represent different types of documents. This motivates the development of a flexible tool which would choose the best algorithm according to the file format and the type of document.

While the popular formats are formally well defined and their recognition seems a trivial task, document types are more challenging. Is it realistic to build a tool which would automatically recognise the type of a document? We examine this question in the next section, Automated Genre Classification as a First Step in Metadata Extraction.

What hints could manual metadata creation give us? Human operators follow three basic steps: 1) visual scanning of the document, 2) mental analysis to identify the metadata types and their values, and 3) entry of the recognised/generated metadata in the proper form. To be able to do such work, operators need proper training and sufficient knowledge about the metadata structure, the computer standards used, and the quality requirements: what type of metadata should be entered and how detailed they should be. Manual entry of metadata, especially in the specialised annotation of texts (e.g. annotation of mediæval manuscripts and archival documents, and linguistic annotation within text), does not guarantee correct and complete metadata entry, because the quality of work depends on the experience and level of involvement of the operator, and on such factors as the operator's emotional state and alertness.

The basic challenge in automated metadata extraction is to find ways to execute the second step, the analysis that results in the identification of the metadata types and the values which can be directly extracted or derived from the document. This general task also has the following variant: checking existing metadata against the source and filling in missing or incomplete metadata. Such work is of special value for digital curation centres. Current research aims to create both more metadata and metadata of better quality.
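This variant of the task, checking an existing record against the source and filling in the gaps, can be expressed as a short sketch. All names here are hypothetical, and the extractor is passed in as a stand-in for whatever tool applies to the document in hand.

```python
# Compare an existing metadata record against values derived from the source
# and fill in missing or empty required elements; existing values are kept.
def complete_record(existing: dict, source_path: str,
                    required: list[str], extract_from_source) -> dict:
    """Return a record with missing required elements filled in from the
    document itself; values already present are never overwritten."""
    derived = extract_from_source(source_path)  # hypothetical extraction tool
    completed = dict(existing)
    for element in required:
        if not completed.get(element):
            completed[element] = derived.get(element)
    return completed
```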

The methods previously suggested for automated metadata generation fall into three categories: rule-based approaches, neural-network-based approaches and statistical-modelling-based approaches. Researchers usually apply one of these methods, but they are not mutually exclusive and can also be applied in combination. Ongoing research on metadata extraction is usually targeted at specific types of documents in the same file format.

One of the most popular elements for which extraction methods are being developed is the document title. For example, Giuffrida et al. (2000) developed a rule-based system for metadata extraction from research papers in PostScript. The authors used general layout rules similar to the following: “titles are usually located on the upper portions of the first pages and they are usually in the largest font sizes”. It is difficult to say what proportion of research papers is available in PostScript format nowadays; however, this is quite a typical example of work targeted at the recognition and extraction of a specific text from a chosen format.
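The layout rule quoted above translates directly into code. The following sketch applies it to PDF (rather than PostScript) using the pdfminer.six library: it simply returns the text block set in the largest type on the first page. It is a heuristic illustration, not a robust extractor.

```python
# Guess a document title by the classic layout rule: the text block in the
# largest font size on the first page. Requires the pdfminer.six package.
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTChar

def guess_title(pdf_path: str) -> str:
    first_page = next(extract_pages(pdf_path))
    best_size, best_text = 0.0, ""
    for element in first_page:
        if not isinstance(element, LTTextContainer):
            continue
        sizes = [ch.size for line in element for ch in line
                 if isinstance(ch, LTChar)]
        if sizes and max(sizes) > best_size:
            best_size, best_text = max(sizes), element.get_text().strip()
    return best_text  # the text block set in the largest type
```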

A similar task, extracting the theme of a news article from the headline metadata of news webpages, is approached with the support vector machine (SVM) method by Debnath and Giles (2005). The researchers use the fact that news metadata generally include elements such as DateLine, ByLine and HeadLine. The authors found that HeadLine information is useful for guessing the theme of a news article, and the paper demonstrates that it is especially helpful in locating explanatory sentences related to major events, such as significant changes in stock prices in financial news articles.
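For illustration, the sketch below trains a linear SVM to assign a theme from headline text using scikit-learn. It is not Debnath and Giles's system, and the tiny training set is invented.

```python
# Classify the theme of a news article from its HeadLine text with an SVM.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

headlines = ["Shares plunge as markets react", "Team clinches league title",
             "Central bank raises interest rates", "Striker signs record deal"]
themes = ["finance", "sport", "finance", "sport"]

model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(headlines, themes)
print(model.predict(["Stock prices rally after rate cut"]))  # ['finance']
```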

Titles are also central to the research of Hu et al. (2005a), who present the automatic extraction of titles from the bodies of documents encoded in HTML. The motivation for this work is that although the title fields of HTML documents should be correctly filled in, in reality this is often not done. The authors suggest that, in the case of such incomplete information, the title should be constructed using the body of the HTML document. A supervised machine learning approach was used by these researchers, based on format information (font size and weight, and position) as additional features in the process of title extraction. It is reported that the proposed method significantly outperforms the baseline method of using the lines in the largest font size as the title.

Title extraction was developed further in subsequent publications by the same team, expanding the task from webpages to Word and PowerPoint documents (Hu et al., 2005b; 2006). The authors again apply a machine learning approach to title extraction from general documents belonging to a number of specific genres, including presentations, book chapters, technical papers, brochures, reports, and letters. In their approach, titles in sample documents (for Word and PowerPoint respectively) are annotated and taken as training data, machine learning models are trained, and these are finally used to perform title extraction. The method is distinctive in that it mainly utilises formatting information, such as font size, as features in the models for different types of documents. The reported results show that the use of formatting information can lead to quite accurate title extraction from general documents. Another noteworthy result is that models can be trained in one domain and then applied to another: the team started with webpages and then continued with Word and PowerPoint documents.

Other research teams have aimed at extracting whole sets of metadata elements. For example, Yilmazel et al. (2004) developed the MetaExtract system, which assigns Dublin Core and GEM (Gateway to Educational Materials) metadata to educational materials using a mixture of rule-based natural language processing technologies and statistics. MetaExtract has three distinct extraction modules: an eQuery module (a rule-based system using shallow parsing rules and multiple levels of NLP tagging), an HTML-based Extraction module (which operates by comparing the text to a list of clue words developed previously) and a Keyword Generator module (which operates by computing the standard TF-IDF 18 metric on each document). The quality of the extracted metadata was evaluated through a web-based survey, which showed a significant difference between manual and automated extraction of Title and Keyword (the manual quality was higher). The quality of the remaining elements (Description, Grade, Duration, Essential Resources, Pedagogy-Teaching Method, and Pedagogy-Group) was shown not to be significantly different.
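A keyword generator in the spirit of MetaExtract's third module can be sketched with scikit-learn: rank each document's terms by TF-IDF weight and keep the top k as Keyword candidates. The corpus and parameters are illustrative, not MetaExtract's own.

```python
# Rank each document's terms by TF-IDF weight and keep the top k as keywords.
from sklearn.feature_extraction.text import TfidfVectorizer

def top_keywords(corpus: list[str], k: int = 5) -> list[list[str]]:
    vectorizer = TfidfVectorizer(stop_words="english")
    weights = vectorizer.fit_transform(corpus)   # documents x terms matrix
    terms = vectorizer.get_feature_names_out()
    keywords = []
    for row in weights.toarray():
        ranked = row.argsort()[::-1][:k]         # highest weights first
        keywords.append([terms[i] for i in ranked if row[i] > 0])
    return keywords
```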

Another specific subject domain was approached by Mao et al. (2004), who conducted automated metadata extraction from medical research papers using rules based on formatting information. Their work is concerned with the preservation of scanned and online medical journal articles at the US National Library of Medicine (NLM), where a system has been developed to generate descriptive metadata (title, author, affiliation, and abstract) from scanned medical journals. The system consists of the following modules: ZoneMatch, which generates geometric and contextual features from a set of issues of each journal, and ZoneCzar, a rule-based labelling module.

18 term frequency–inverse document frequency

Yet another example of research aimed at extracting a set of metadata elements from research publications is presented by Day et al. (2005), who used a hierarchical template-based method. The authors implemented a hierarchical knowledge representation approach in a tool called INFOMAP, which automatically extracts metadata. The experimental results show that, by using INFOMAP, author, title, journal, volume, number (issue), year, and page information can be extracted from different kinds of reference styles with a high degree of precision.

A patented method for the automatic extraction of metadata using a neural network also exists (US Patent 6044375, 2000). The patent is intended for use in data archiving systems, and the method is adaptable to non-standard documents where metadata locations are unknown. The claim is that the method contributes to higher rates of accuracy and reliability in extracting metadata; we have not been able to identify practical implementations of this patent. The ingredients for this method include a computer-readable text document, an authority list consisting of common uses of a set of words, and a neural network trained to extract metadata from groupings of data called compounds. In the first step, the words within the document are compared against the authority list, and information derived from the blocks of the document is grouped together into compounds. In the next step, the compounds are processed through the neural network to generate metadata guesses. Finally, the metadata are derived from the metadata guesses by selecting those document, compound, and word guesses with the largest confidence factors. There is also research, e.g. Liu et al. (2006), that addresses the automatic identification, extraction, and search of tables and their contents in PDF documents.
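The patent's final selection step, keeping the guess with the largest confidence factor per element, reduces to a few lines. The sketch below reflects our reading of the patent text rather than its actual implementation; the guesses would come from the trained neural network.

```python
# Resolve competing metadata guesses by keeping, for each element, the
# candidate value with the largest confidence factor.
def resolve_guesses(guesses: list[tuple[str, str, float]]) -> dict[str, str]:
    """guesses: (element, value, confidence) triples; returns the best value
    per element."""
    best: dict[str, tuple[str, float]] = {}
    for element, value, confidence in guesses:
        if element not in best or confidence > best[element][1]:
            best[element] = (value, confidence)
    return {element: value for element, (value, confidence) in best.items()}
```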

We have summarized some key features of the current research endeavours we described above in Table 2.


Table 2. Comparison of results of metadata extraction research

| Extracted elements | Source documents | Source | Approach | Results |
|---|---|---|---|---|
| document titles | research papers in PostScript format | Giuffrida et al. 2000 | rule-based (layout) | – |
| generic application | not specified | US Patent 6044375 (2000) | neural networks | – |
| assigns Dublin Core header tags: Title, Author, Affiliation, Address, Note, Email, Date, Abstract, Phone, Keyword, Web, Degree, Publication number | header part of research papers | Han et al. 2003 | ML, support vector machines (SVM) | 92.9% accuracy; precision for various elements between 0.795 and 0.969, recall between 0.622 and 0.991 |
| assigns Dublin Core + GEM (Gateway to Educational Materials) tags: Title, Keyword, Description, Grade, Duration, Essential Resources, Pedagogy-Teaching Method, and Pedagogy-Group | web pages containing educational materials | Yilmazel et al. 2004 | rule-based (NLP) | web-based survey (users evaluate the metadata quality) |
| Title, Author, Affiliation, and Abstract | scanned medical journal articles at NLM (US National Library of Medicine) | Mao et al. 2004a | rule-based (formatting) | title with 92% accuracy, author(s) with 87% accuracy, affiliation(s) with 75% accuracy, author-affiliations with 71% accuracy and table of contents with 76% accuracy |
| headline titles | financial articles | Debnath, Giles 2005 | SVM | – |
| Author, Title, Journal, Volume, Number (issue), Year, and Page information (reference data) | scholarly publications | Day et al. 2005 | hierarchical template-based reference metadata extraction method | 0.9239 accuracy |
| acknowledgements | research papers | Council et al. 2005 | combination of ML (SVM) and regular expressions | precision 0.7845, recall 0.8955 |
| document titles | HTML files where the <title> tag is not used properly | Hu et al. 2005a | ML | 20.9%–32.6% improvement over baseline method (using lines in largest font size) |
| document titles | Word, PowerPoint | Hu et al. 2005b, Hu et al. 2006 | ML | 0.810 precision and 0.875 recall for Word documents; 0.837 precision and 0.845 recall for PowerPoint documents |
| tables | PDF | Liu et al. 2006 | – | – |

How the topic applies to Digital Curation

Metadata are at the core of managing digital objects. To be able to access digital information in a human-readable form, we need to understand the technical environment (e.g. software and hardware) that will make this possible; this is recorded as the technical metadata of the object. Even when we are able to render the object correctly, we cannot interpret the content correctly without further context about the content at the time of creation (e.g. what a column of a spreadsheet signified at the time the observed data were recorded). Metadata, therefore, are central to digital preservation and curation.

Automated metadata generation is an immediate concern in battling the cost of maintaining high-quality, consistent metadata for objects that are introduced into repositories, libraries and archives. The amount of information created daily is now beyond the scope of manual maintenance. Data creators need to use automated metadata generation tools to help them identify and assign adequate metadata to the digital objects they create. Data curators must ensure that correct metadata have been assigned to objects in anticipation of curation activities that might take place with respect to objects sharing a metadata element (e.g. migration of all PDF files to XML), and must be able to assign new metadata easily to digital objects based on changing policies and needs within the collection.
Data re-users may utilise the metadata to interpret retrieved data and to create new metadata to facilitate their own reuse of digital objects. All groups of users need to be able to understand the tools they are using, because this has a great influence on the quality of the metadata being generated. Key practical issues of metadata generation are presented in Table 1.

Table 1. Metadata extraction at a glance

| Question | Answer |
|---|---|
| What is being extracted? | Most commonly, fragments of text which could be used to fill in a metadata element value: e.g. title, date, author. Across different digital repositories and applications, different metadata schemes and levels of detail of the metadata record can be found. Documents of one genre and the same file format (for example, research papers in PDF 19) are typically accompanied by a similar metadata set. |
| How is it being extracted? | There are various methods which aim to locate text fragments. The most popular methods are based on analyses of document layout and text patterns. Some of these methods are presented in the text below. |
| How is performance measured? | We adopt recall, accuracy, and precision to evaluate performance 20, as well as consistency (e.g. robustness across a range of datasets) and completeness (e.g. coverage of the range of requested metadata). |
| When is it extracted? | Metadata are extracted at many different points in an object's life cycle, e.g. 1. at the time of ingest, 2. during digital repository and metadata quality assessment processes, 3. during curation processes such as migration, and 4. at the point of object dissemination and reuse. |
| Why is it extracted? | Metadata are generated 1. in anticipation of other automated digital object management processes (e.g. migration for a single file format), 2. to enable assessment of risks to the collection based on risks to objects characterised by selected metadata, and 3. to increase accessibility to information (e.g. to render it in human-readable form). Automation of generation reduces the cost of manual extraction, formalises the quality control of existing metadata and encourages consistency (see previous section). |

19 Adobe Portable Document Format, http://www.adobe.com/products/acrobat/adobepdf.html
20 Measures used in the evaluation of automated classification and retrieval performances.

Topic in action

In the section “Background and developments to date” we showed that, in the most popular existing workflows, the digital objects presented for ingest arrive at the repository either accompanied by their metadata or have their metadata added after ingest. In both scenarios, mechanisms to support metadata quality control are lacking, and this poses risks to the long-term management of the digital objects themselves. Metadata quality impacts on discovery and retrieval, data and preservation management, and how future users can access the objects. The metadata extraction workflow described here is designed to be a pre-ingest process that includes quality control before the object is submitted to a repository. It is designed to ensure that digital objects ingested into a repository pass a metadata quality threshold; this threshold is defined at repository level.
</p><p>To improve both quality and automation in digital repositories, we introduce automated metadata extraction into our workflow based on the assumption that the intelligent choice of an appropriate metadata extraction tool can be made according to the digital object format, genre and metadata quality requirements. The eventual deployment of a service based on this model depends upon the creation of a public repository of metadata extraction tools as well as the tools themselves. </p><p>The input into the workflow is generally a digital object of unidentified genre and format. This is received by the Digital Repository Content Manager (represented by the icon ), which is a process implemented by a software agent or a human user, or even by a combination of both, at various stages of the task. It initiates and guides the ingest process of digital objects into the repository and includes several transformations and decision-making points. The workflow implements the following core processes: </p><p>– Digital object preparation , including digital object format detection and, if necessary, conversion to PDF. – Automated genre classification , involving analysis of the structure of the object and assignment of a genre. – Automated metadata extraction , featuring use of a distributed repository of metadata extraction tools for documents of various genres. – Quality control. A process where metadata are validated. </p><p>Table 1 summarises the input, output and repositories used in the four core activities. Dobreva, Kim, Ross: Automated Metadata Generation Page 17 </p><p>Table 1. Data flows in the automated metadata extraction workflow activities Repositories Process Data input Data output needed Digital object Digital Digital object Repository preparation object + Digital object in of PDF PDF converters Automated Digital Digital object Genre Class- object + + Digital object in ification Digital PDF object in + Genre PDF Automated Digital Digital object Repository Metadata object + + Digital object in of automated Extraction Digital PDF metadata object in + Genre extraction tools PDF + + Metadata Queue of digital Genre or objects of a Quality Quality Ingest of digital object Digital Control requirements and metadata repository preset by or </p><p>The idea that we adopted in outlining the general architecture of the workflow was to encapsulate the separate processes described as independent managers . Thus, on the highest level (Figure 1) we present the data flow, the managing components and the repositories needed. In subsequent figures (Figures 2-8) we present in detail the five managers; this approach is in line with service- oriented modelling. One of its advantages is that it provides the flexibility to build a distributed system, collecting under one umbrella components developed, and even implemented, at different institutions. </p><p>The output of the workflow depends on the outcome of the quality control with respect to the extracted metadata. In general, this outcome would be the document enriched with PDF representation, genre identification and metadata, ready for ingest into the repository. If metadata cannot be generated or do not meet the quality requirements, the process may be repeated. If the reason for the lack of metadata is the lack of availability of an appropriate metadata extraction tool, the digital object will be placed in a queue until the appropriate tool can be acquired (the workflow envisages communication with a public registry of metadata extraction tools). 
</p><p>As mentioned above, Figure 1 presents the framework model on the highest level as a combination of processes and data flows. The 3D boxes present the five managers, which are detailed further and numbered for easy reference. To facilitate the location of the correct diagram, we have placed the respective boxes in the upper right-hand corner of the detailed Figures 2 to 8. Processes that are not presented in more detail appear in light green rectangles. Decision points are represented by lozenge shapes. </p><p>Page 18 Digital Curation Manual </p><p>The central managing process is handled by the Digital Repository Content Manager. We have used red dotted arrows to represent its intervention. The regular flow of activities is indicated by black arrows. The digital objects and other data generated by the various processes are presented as blue rhombuses. The repositories used at various stages are also represented in Figure 1, as white document stacks with small icons. In real life, we could consider the support of one intelligent repository optimised to enable the easy location of a specific type of application, but to make things more explicit here we assume that these repositories are different (and perhaps even hosted at different institutions). We have used a green line to indicate the return of an object for which a metadata extraction tool had not previously been available but has now become available for use by the Automated Metadata Extraction Manager. </p><p>The workflow starts with submission of a digital object in an unidentified format for metadata extraction. Operation on the object is initiated by the Digital Repository Content Manager. For the sake of simplicity, we only consider here the situation where one object is processed at a time; in reality, it is more likely that multiple digital objects will be processed simultaneously. However, our principal aim is to present the logic of the process. Our assumption is that the Digital Repository Content Manager will place multiple objects in a queue when they arise, and they will be processed consecutively. </p><p>Fig. 1. Ingest framework. </p><p>When a digital object is presented for metadata extraction, the first step is to determine its file format type. If it is PDF, then the object is submitted directly to the Genre Classification Manager. Otherwise, it will be submitted to the Dobreva, Kim, Ross: Automated Metadata Generation Page 19 </p><p>PDF Conversion Manager for analysis and representation of the object in PDF format. Conversion to PDF 21 is intended to make all documents conform to one format for processing by the Genre Classification Manager, which we have optimised to work with PDF representations. The object is also preserved in the format in which it has been submitted. </p><p>The PDF Conversion Manager (see Figure 2) The first task of the PDF Conversion Manager is to identify the technical format (e.g. RTF, PS, JPEG) of the object. This is carried out within a Format Recognition Component. The format influences the tools that will be needed to render the object, and/or access information from the object. In this step, the system may categorise the objects into groups of documents, images, audio, composite or other files, as well as making a decision on a specific format. If the object can be identified to be in a document format, then a check will be performed to determine the specific format. </p><p>Figure 2. The PDF Conversion Manager. 
</p><p>If the format is not known or a tool for the particular format is not available or does not exist, the Digital Repository Content Manager will decide how to proceed. A possible scenario might be to publish a public request for the necessary tool; the failure to recognise a format might be a typical case of an ‘outlier’ – a digital object which for some reason does not conform to the rest of the collection. Such situations require decisions to be made on a case-by-case basis. If a converter exists, it produces a PDF version of the digital object, which is checked for quality and sent to the Genre Classification Manager (Figure 3). </p><p>21 We take conversion to PDF as an approach for preserving documents. This is not applicable for the case of curating for scientific data. Page 20 Digital Curation Manual </p><p>Genre Classification Manager (see Figure 3) Within this sub-system, the digital object is analysed and labelled with the genre to which it belongs. This step is intended to cluster objects into classes characterised by homogeneous structure and is expected to facilitate the location of further information within the object. The process utilises one or more of five types of object features (based on image analysis, syntactic analysis, stylistic analysis, semantic structure analysis and domain knowledge analysis) described by Kim and Ross (2006). In determining the genre of a digital object, classifiers are built on a discriminate use of these five feature types, as not all features are necessarily expected to be present in the object, and as the feature type most suitable for detecting documents of one genre is not necessarily the best for detecting documents of another genre [Kim and Ross (2008)]. </p><p>The Genre Classification Manager receives a PDF file. The process starts with an analysis performed by the Compound Object Handler to determine whether this is a simple or compound document. In the case of compound objects, it would create a queue consisting of the object followed by its sub-components. For example, in the case of books, journals or websites it is recommended to extract metadata not only on the higher-level genre but also on the constituent smaller identifiable pieces. For compound objects, the metadata extracted from the components will be integrated to form a composite metadata set at the end of the process. </p><p>Figure 3. Genre Classification Manager. The digital object is processed by a Submission Engine. Its role is to decide which classifiers to apply to a particular digital object. The model incorporates five classifiers: involving visual layout, language model (e.g. N-gram model of words), stylo-metrics (e.g. frequency of definite articles) and semantics of the text (e.g. number of subjective noun phrases), and domain knowledge (e.g. document source or format) [5, 6]. Each of the classifiers applied will return a Dobreva, Kim, Ross: Automated Metadata Generation Page 21 label value. If the classifier had not been used or could not extract any features from the object, it would return a null value, which is also informative in the further analysis. </p><p>Figure 4. Genre Labeller. </p><p>The acquired values are submitted to a Genre Labeller; see Figure 4. Its decision-making tool uses an estimated probability distribution of features in relation to classes in a selected training data set to predict the genre class or classes of a document from a predefined schema. 
If agreement on the genre cannot be achieved, this tool communicates with the Digital Repository Content Manager, which would typically resubmit the object for a new iteration of the genre-labelling exercise. The Quality Manager again takes the lead before the result (an agreed genre label) is submitted to the next component, the Metadata Extraction Manager. The output of the tool is the digital object tagged with its genre label.

Metadata Extraction Manager (see Figure 5)

The Metadata Extraction Manager deploys the information gathered about the digital object and knowledge of its genre class to select the most appropriate metadata extraction tool from the Repository of Metadata Extraction Tools. Ross, Kim and Dobreva (2007) have examined at least eleven research initiatives targeted at metadata extraction for documents belonging to specific genres. Some of these have been developed into tools, such as the plug-in for the CiteSeer Digital Library 22 which retrieves acknowledgements from research papers, INFOMAP which locates bibliographic information within scholarly publications, MetaExtract which generates Dublin Core and GEM (Gateway to Educational Materials) metadata from educational materials, and UKOLN DC-dot which creates shallow Dublin Core metadata from webpages. In selecting the metadata extractor, threshold settings for metadata depth and quality as defined by the Digital Repository Content Manager are taken into account.

22 http://citeseer.ist.psu.edu/index

Figure 5. Metadata Extraction Manager.

A request for tools consists of a set of values [g, f, r, q] constructed to represent Genre (g) and Format (f) described above, Quality (q) described below, and Rights (r), where (r) is intended to convey the Digital Repository Content Manager's preference with respect to product licence type (e.g. free or commercial) when selecting tools from the Repository of Metadata Extraction Tools. The Request Dispatcher (see Figure 5) then selects tools matching the values in the request. The most suitable metadata extractor is selected by submitting the retrieved tools to the Results Optimiser, which chooses the metadata extraction tool that has demonstrated the greatest success on earlier occasions. If tools for a particular genre, for either the PDF or the format in which the digital object was submitted, are not available in the Repository of Metadata Extraction Tools, a check would be carried out to see what formats could be processed. The Metadata Extraction Manager could, as a result, initiate a process to generate a version of the digital object in a format that could be processed by an available metadata extraction tool. After the digital object is submitted to the chosen metadata extractor, quality control is applied to the extracted information before ingest (see Figure 6). Should an appropriate tool not be available in the Repository of Metadata Extraction Tools, the Manager of Metadata Extraction Tools handles the exception (see Figure 8 below).

Figure 6. Metadata Extraction Tool.

Quality Control Manager (see Figure 7)

Figure 7. Quality Control Manager.

The Quality Control Manager checks the results from the various managers against a predefined and repository-weighted set of quality parameters, including precision and recall, consistency, sufficiency, and trustworthiness, compared against the quality threshold value (q).
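A hedged sketch of this check follows: per-parameter scores are combined with repository-defined weights and compared against the threshold q. The parameter names follow the text; the weights and scores are invented for illustration.

```python
# Repository-weighted quality check against the threshold value q.
WEIGHTS = {"precision": 0.3, "recall": 0.3, "consistency": 0.2,
           "sufficiency": 0.1, "trustworthiness": 0.1}

def passes_quality(scores: dict[str, float], q: float) -> bool:
    """scores holds per-parameter values in [0, 1]; q is the repository threshold."""
    weighted = sum(WEIGHTS[p] * scores.get(p, 0.0) for p in WEIGHTS)
    return weighted >= q

print(passes_quality({"precision": 0.95, "recall": 0.9, "consistency": 0.8,
                      "sufficiency": 0.7, "trustworthiness": 0.9}, q=0.8))  # True
```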
Quality parameters will be of indicative value only if metadata extraction tools have been tested against a transparent benchmark dataset before their inclusion in the Repository of Automated Metadata Extraction Tools. This tool is fine-tuned by the Digital Repository Content Manager, and is the basic instrument for ensuring the desired quality level. If the quality control leads to a positive result, the object is ingested into the repository or a set of repositories, some of which may be distributed. Quality assurance processes would also be implemented for the results of the PDF Conversion and Genre Classification Managers.

However, if the quality control finds that the metadata do not pass the defined quality threshold, the digital object returns to the Genre Classification Manager with a request for its genre classification to be re-evaluated. As it would be pointless to return the digital object for genre re-assignment repeatedly, after a certain number of failed attempts (although we have not yet identified the optimal number) the object will be sent to the Digital Repository Content Manager for further manual inspection.

Manager of Metadata Extraction Tools (see Figure 8)

This manager handles the digital objects for which a tool for metadata extraction has not been incorporated into the Repository of Automated Metadata Extraction Tools or is not available. First, the manager invokes the Request Initiator, which starts an external search for an existing tool and, if that fails to produce results, announces to the community a request for the open-source development of such a tool. Second, it adds information about the specific digital object, and about the digital repository in which it is held, to a registry of digital objects for which metadata extraction tools are unavailable.

Figure 8. Manager of Metadata Extraction Tools.

There are two scenarios in which a digital object can be removed from the registry. The first depends upon manual extraction of metadata. The second is to return the digital object to the Manager of Metadata Extraction Tools when the necessary metadata extraction tool becomes available. The maintenance of the registry is carried out in communication with the Manager of Metadata Extraction Tools. Statistical data on the numbers of digital objects from specific genres awaiting processing may be produced periodically, and requests for the development of the necessary tools periodically re-issued. The second component of this process is the Request Follow-up. This component periodically checks what tools have been submitted to the Repository of Automated Metadata Extraction Tools, and returns to the Metadata Extraction Manager details of tools that can be applied in the workflow.

Next steps

Park and Lu argued recently that "Most experimental research on automatic metadata generation claims promising results; however, feasibility and scalability have therefore not been sufficiently addressed in a realistic metadata environment" (Park and Lu, 2009, p. 225).

With multiple tools available and the need to offer solutions for different types of objects, there seem to be several typical scenarios:

Automated metadata generation in homogeneous archives
– Archives storing objects of the same format and genre should select the available tool for metadata extraction which provides the best possible quality in terms of precision of extraction.
– How can this be improved?
Next steps

Park and Lu (2009, p. 225) recently argued that "Most experimental research on automatic metadata generation claims promising results; however, feasibility and scalability have not been sufficiently addressed in a realistic metadata environment."

With multiple tools available and the need to offer solutions for different types of objects, several typical scenarios emerge:

Automated metadata generation in homogeneous archives. Archives storing objects of the same format and genre should select the available metadata extraction tool which provides the best possible quality in terms of precision of extraction. How can this be improved? Data creators and curators should have access to a regularly updated technology watch in which tests of existing and newly appearing tools are made available.

Automated metadata generation in heterogeneous archives. These archives need distributed solutions, following a framework similar to the one described in the section "Topic in action". How can this be achieved? By providing a distributed tool which incorporates a management component that identifies missing types of tools, publishes requests for the development of the necessary tools, and incorporates them in the general framework once created.

Automated metadata generation "on the fly" (just in time). Such metadata generator tools might be quite specific, extracting a very limited number of elements; they would probably have to be specially developed and included in the larger archival system architecture.

Further work is needed to:
- create an initial extensive set of metadata extraction tools for various genres. Current research favours only a few of the roughly 70 genres most used in practice (research papers, web pages and presentations are the traditional test examples);
- improve the quality and precision of metadata extraction tools. Reported values are most typically in the range of 90-98%, yet to raise trust in automated tools the results need to improve further;
- widen the number of elements extracted from various texts. There are many examples of extracting titles, authors' names and abstracts, but this is not sufficient to create even the simplest description;
- fine-tune metadata extraction to the various metadata schemes used in digital repositories;
- implement in practice sets of quality parameters for PDF converters, genre classifiers and metadata extractors;
- work on the extraction of context metadata, an area which remains quite underdeveloped.

Here we would like to emphasise again the issue of quality. In the area of metadata extraction it can be considered from two points of view:
- as the measure of success in the process of metadata extraction (measured through precision, recall and accuracy);
- as a measure for the preliminary estimation of existing metadata, in order to decide whether additional or improved metadata are needed (the most relevant quality parameters for this purpose are completeness, accuracy, consistency, provenance and confirmation of user expectations).

There is a definite need for further research on the second case. In addition, there is a need to raise the awareness of the digital repositories community of the issues involved in evaluating the metadata quality of resources. The fact that even large multinational repositories rely on the work processes of their content providers and do not plan to undertake quality evaluation is symptomatic of over-trust in content providers. This contrasts with the issue of trustworthiness of repositories, which is of special value in the digital archives area.
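The second viewpoint, estimating the quality of existing metadata to decide whether it needs enrichment, lends itself to simple automated checks. The sketch below scores a record for completeness and applies one consistency check; the field names, weights and threshold are purely illustrative assumptions.

```python
REQUIRED_FIELDS = ["title", "creator", "date", "format", "identifier"]

def completeness(record: dict) -> float:
    """Fraction of required fields that are present and non-empty."""
    filled = sum(1 for f in REQUIRED_FIELDS if record.get(f))
    return filled / len(REQUIRED_FIELDS)

def consistent(record: dict) -> bool:
    """One illustrative consistency check: a date field, if present,
    should begin with a four-digit year."""
    date = record.get("date", "")
    return date == "" or (len(date) >= 4 and date[:4].isdigit())

def needs_enrichment(record: dict, threshold: float = 0.8) -> bool:
    """Flag records whose estimated quality falls below the threshold;
    such records are candidates for additional or improved metadata."""
    return completeness(record) < threshold or not consistent(record)

# Example: missing format/identifier fields and a malformed date.
record = {"title": "Invest to Save", "creator": "Hedstrom, M.", "date": "n.d."}
print(needs_enrichment(record))  # True
```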
Given the current high cost of managing digital collections, it is also necessary to find ways of adding the value of preservation metadata to digital library operations through its effective use in other relevant functionalities. The DELOS Digital Library Reference Model helps to achieve a better understanding of the preservation-related components and processes in the digital world and to model the processes for each separate case according to the needs of the specific collection.

In 2003 the National Science Foundation (NSF) Digital Library Initiative and the DELOS project 23 released a research agenda which stressed the role of metadata in managing digital objects over time (see Hedstrom, Ross et al., 2003, and Ross and Hedstrom, 2005). Amongst the key areas in which development is shaping the landscape of current preservation activities, the authors identified emerging research domains in preservation strategies, re-engineering of preservation processes, and automation in preservation systems and technology. Enhanced automation in preservation systems and technology is connected with the rising volume of work, and with issues of efficiency, effectiveness and quality.

23 http://www.delos.info

Suggesting general solutions in such a diverse environment is very difficult. On the one hand, there is a general framework for digital libraries and repositories, with similarities of processes and life cycles. On the other hand, the practicalities of each distinct collection in terms of content type, file formats, size and services offered result in many differences of detail which must be taken care of.

Future developments

Here we would like to draw attention to unresolved issues and gaps which need to be addressed in future research and development, especially in relation to metadata:

Automation of metadata extraction: towards recognising the type of the object. Different types of repositories impose various ingest workflows. The metadata bottleneck, in a situation of a growing number of digital repositories, implies the use of automated metadata extraction methods. Additional research is needed to suggest a suitable proportion of human-generated and machine-extracted metadata in the various cases. The rich variety of digital repositories requires intelligent ingest procedures, fine-tuned to each particular repository type.

Metadata sufficiency. Metadata sufficiency has two aspects: completeness and quality. Researchers have found the quality of manually created metadata to rely heavily on the combination of two factors: institutional processes and personal behaviour. Motivation, the difficulty of working with the application, the difficulty of understanding the scope of the project, and the subsequent use of metadata in information retrieval are among the basic factors that Crystal and Greenberg (2005) identify as influencing quality.

Quite a common drawback is that repositories ingesting digital objects from different sources rely on a metadata encoding scheme and consider this a sufficient 'quality guarantee'. Quality control is not envisaged as a necessary activity, on the presumption that it is the responsibility of content providers. However, the level of quality offered is a key constituent of a repository's trustworthiness.

In the current effort to build a European Digital Library (EDL) 24, quality is approached in this traditional way, as a responsibility of the content providers.
Quality evaluation of ingested objects is not foreseen, although the idea is that the EDL will ingest materials from various types of memory institutions: libraries, museums and archives (see Dekkers et al., 2007).

24 http://www.edlproject.eu/

Efficiency issues: information redundancy. Although unexpected in a setting of deficient metadata quantity and quality, new problems of information redundancy in metadata collections are emerging, as Foulonneau showed in 2007. Information redundancy arises when the same digital object is supplied with metadata in different places (replicated effort where human resources are already insufficient), and when vast numbers of objects supplied with similar metadata are ingested into a digital repository, making them hard to distinguish.

The research presented here is motivated by the context just described; it cannot solve all pending issues, but it will contribute to better metadata quality and to enhanced automation in the generation of preservation-quality metadata at the time of ingest into a digital repository.

Conclusions

The research presented here touches only one small piece of the larger picture: automated metadata extraction. Although this is just a small segment, it brings together substantial knowledge from the areas of digital repositories, metadata, information extraction, genre classification, and digital object management.

This work has tried to look at this complex area from different perspectives and to find the most flexible approach, one adaptable to the many repository types and to the distributed e-world.

We have learned from our research that:

• there are many ongoing research projects on automated metadata extraction; they suffer from attacking overly specialised research areas, but also give better results precisely because they are narrowly specialised;

• the typologies of digital repositories involve several different classification factors, which potentially result in hundreds of varieties of digital collections in terms of content, coverage, users, objectives, quality and functionality;

• whatever the variety of a digital repository, it needs to maintain reliability and quality;

• the workflows for populating repositories are concerned with trustworthiness and time constraints; the evaluation of quality seems to be neglected, and even flagship European initiatives place their trust in the providers.

Here we have described the need for the creation of more sophisticated automated metadata generation workflows based on distributed resources, which could form a component of a trustworthy, quality-assured ingest model for digital repositories. The framework delivers a range of benefits:

It supplies a valuable element in automation methodology. Automation promotes efficiency and facilitates qualitative consistency across different repositories.

It identifies areas and tools in need of development in order to take automation forward: the workflow framework we present breaks down the processes, making it possible to see those that need to be refined, those that require the creation of new tools and those where we can borrow existing tools, and making apparent the points at which interoperability between components would need to be addressed.
It enables stakeholders to compare tools, practices, process costs and quality across repositories, which leads to better identification of best practices, selection of quality measurements and management of resources.

It provides a context for collaboration and integration across different institutions and research and development areas: the distributed resource framework brings together repositories, registries, extraction, conversion and enrichment tools, and quality control and assurance, under one collaborative management.

It encourages concentrated effort: the distributed architecture allows focused research in distinct component areas, thereby enabling better performance through the development and integration of tools tailored to work well in specified environments, in contrast to a general tool that does not work particularly well in any environment.

It encourages the development of resources which may be independently useful, including repositories of conversion tools, repositories of extraction tools, and format registries.

An understanding of how metadata generation tools work will help data curation practitioners to take better-informed decisions about the practical workflows within their institutions, depending on the size of their archives, the state of their metadata, and their practical needs when assigning metadata to digital objects.

References

Arens, A., & Blaesius, K. H. (2003). Domain oriented information extraction from the Internet. SPIE Document Recognition and Retrieval, Vol. 5010, p. 286.

Bekaert, J., & Van de Sompel, H. (2005). A Standards-based Solution for the Accurate Transfer of Digital Assets. D-Lib Magazine, 11(6). http://www.dlib.org/dlib/june05/bekaert/06bekaert.html

Bekkerman, R., McCallum, A., & Huang, G. (2004). Automatic categorization of email into folders: Benchmark experiments on Enron and SRI corpora (Tech. Report IR-418). University of Massachusetts: Center for Intelligent Information Retrieval.

Breuel, T. M. (2003). An algorithm for finding maximal whitespace rectangles at arbitrary orientations for document layout analysis. 7th International Conference on Document Analysis and Recognition, pp. 66–70.

Caplan, P. (2006). Preservation Metadata. In: Ross, S. and Day, M., eds., DCC Digital Curation Manual. http://www.dcc.ac.uk/sites/default/files/documents/resource/curation-manual/chapters/preservation-metadata/preservation-metadata.pdf

Councill, I., Giles, C., Han, H., & Manavoglu, E. (2005). Automatic Acknowledgement Indexing: Expanding the Semantics of Contribution in the CiteSeer Digital Library. Proc. of the 3rd Int. Conf. on Knowledge Capture, Banff, Alberta, Canada, pp. 19–26. ISBN 1-59593-163-5.

Crystal, A., & Greenberg, J. (2005). Usability of a Metadata Creation Application for Resource Authors. Library & Information Science Research, 27(2), 177–189.

Day, M. (2005). Metadata. In: Ross, S. and Day, M., eds., DCC Curation Reference Manual. Digital Curation Centre. ISSN 1747-1524. http://www.dcc.ac.uk/sites/default/files/documents/resource/curation-manual/chapters/metadata/metadata.pdf

Day, M., Tsai, R., Sung, C., Hsieh, C., Lee, C., Wu, C., Wu, K., Ong, C., & Hsu, W. (2007). Reference Metadata Extraction Using a Hierarchical Knowledge Representation Framework. Decision Support Systems, 43, 152–167.
Debnath, S., & Giles, C. (2005). A Learning Based Model for Headline Extraction of News Articles to Find Explanatory Sentences for Events. Proc. of the 3rd Int. Conf. on Knowledge Capture, Banff, Alberta, Canada, pp. 189–190. ISBN 1-59593-163-5.

Dekkers, M., Gradmann, S., Meghini, C., Aloia, N., & Concordia, C. (2007). EDLnet D2.2, Initial Semantic and Technical Interoperability Requirements, v. 1.0. DigitalPreservationEurope: DPE Research Roadmap, DPE-D7.2. http://www.digitalpreservationeurope.eu/publications/reports/dpe_research_roadmap_D72.pdf

Duff, W., & van Ballegooie, M. (2006). Archival Metadata. DCC Curation Reference Manual. http://www.dcc.ac.uk/sites/default/files/documents/resource/curation-manual/chapters/archival-metadata/archival-metadata.pdf

Duncan, C., & Douglas, P. (2009). Automatic Metadata Generation: Use Cases and Tools/Priorities. Guidance on different automated metadata generation approaches for service providers in HE. August 2009, 14 pp.

Giuffrida, G., Shek, E., & Yang, J. (2000). Knowledge-Based Metadata Extraction from PostScript Files. In Proceedings of the Fifth ACM Conference on Digital Libraries (DL'00), San Antonio, TX, June 2000. ACM Press, pp. 77–84.

Greenberg, J. (2004). Metadata Extraction and Harvesting: A Comparison of Two Automatic Metadata Generation Applications. Journal of Internet Cataloging, 6(4), 59–82.

Greenberg, J., Spurgin, K., & Crystal, A. (2005). Final Report for the AMeGA (Automatic Metadata Generation Applications) Project. http://www.loc.gov/catdir/bibcontrol/lc_amega_final_report.pdf

Hedstrom, M., Ross, S., Ashley, K., Christensen-Dalsgaard, B., Duff, W., Gladney, H., Huc, C., Kenney, A. R., Moore, R., & Neuhold, E. (2003). Invest to Save: Report and Recommendations of the NSF-DELOS Working Group on Digital Archiving and Preservation. http://delos-noe.iei.pi.cnr.it/activities/internationalforum/Joint-WGs/digitalarchiving/Digitalarchiving.pdf

Hu, Y., Xin, G., Song, R., Hu, G., Shi, S., Cao, Y., & Li, H. (2005). Title Extraction from Bodies of HTML Documents and its Application to Web Page Retrieval. Proc. 28th Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, Salvador, Brazil, pp. 250–257. ISBN 1-59593-034-5.

Hu, Y., Li, H., Cao, Y., Meyerzon, D., & Zheng, Q. (2005). Automatic Extraction of Titles from General Documents using Machine Learning. Proceedings of the 5th ACM/IEEE-CS Joint Conference on Digital Libraries, Denver, CO, USA, pp. 145–154. ISBN 1-58113-876-8.

Hu, Y., Li, H., Cao, Y., Teng, L., Meyerzon, D., & Zheng, Q. (2006). Automatic Extraction of Titles from General Documents using Machine Learning. Information Processing and Management, 42, 1276–1293.

Ke, S. W., Bowerman, C., & Oakes, M. (2006). PERC: A personal email classifier. In Proceedings of the 28th European Conference on Information Retrieval (ECIR 2006), pp. 460–463.

Kim, Y., & Ross, S. (2006). Genre classification in automated ingest and appraisal metadata. In J. Gonzalo (Ed.), Proceedings of the European Conference on Advanced Technology and Research in Digital Libraries (ECDL), Lecture Notes in Computer Science, Vol. 4172, pp. 63–74. Berlin: Springer Verlag.

Kim, Y., & Ross, S. (2007). "The Naming of Cats": Automated Genre Classification. The International Journal of Digital Curation, 2(1). http://www.ijdc.net/index.php/ijdc/article/view/24
Kim, Y., & Ross, S. (2008). Examining Variations of Prominent Features in Genre Classification. In Proceedings of the 41st Hawaii International Conference on System Sciences. IEEE Computer Society Press. ISSN 1530-1605.

Lavoie, B., & Gartner, R. (2005). Preservation Metadata. A Joint Report of OCLC, Oxford Library Services, and the Digital Preservation Coalition (DPC), published electronically as a DPC Technology Watch Report (No. 05-01). http://www.dpconline.org/docs/reports/dpctw05-01.pdf

Lee, D. (2007). Practical maintenance of evolving metadata for digital preservation: algorithmic solution and system support. International Journal on Digital Libraries, 6, 313–326.

Liddy, E. D. (2002). A Breadth of NLP Applications. ELSENEWS of the European Network in Human Language Technologies, Winter.

Liu, Y., Mitra, P., Giles, C., & Bai, K. (2006). Automatic Extraction of Table Metadata from Digital Documents. Proc. of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries, pp. 339–340. ISBN 1-59593-354-9.

Mao, S., Kim, J., & Thoma, G. (2004). A Dynamic Feature Generation System for Automated Metadata Extraction in Preservation of Digital Materials. Proc. of the First Int. Workshop on Document Image Analysis for Libraries, Palo Alto, CA, pp. 225–232.

Park, J., & Lu, C. (2009). Application of semi-automatic metadata generation in libraries: Types, tools and techniques. Library & Information Science Research, 31, 225–231.

Ross, S., & Hedstrom, M. (2005). Preservation Research and Sustainable Digital Libraries. International Journal on Digital Libraries (Springer), pp. 317–324. http://eprints.erpanet.org/95/01/ross_hedstrom_Int_J_Digit_Libr_2005.pdf

Ross, S., Kim, Y., & Dobreva, M. (2007). Preliminary framework for designing prototype tools for assisting with preservation quality metadata extraction for ingest into digital repository. Pisa: DELOS NoE, December 2007. ISBN 2-912335-39-6.

Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34, 1–47.

Shafait, F., Keysers, D., & Breuel, T. M. (2006). Performance comparison of six algorithms for page segmentation. 7th IAPR Workshop on Document Analysis Systems (DAS), pp. 368–379.

Shao, M., & Futrelle, R. (2005). Graphics Recognition in PDF documents. 6th IAPR International Workshop on Graphics Recognition (GREC2005), pp. 218–227.

Thoma, G. (2001). Automating the production of bibliographic records. R&D report of the Communications Engineering Branch, Lister Hill National Center for Biomedical Communications, National Library of Medicine.

Witte, R., Krestel, R., & Bergler, S. (2005). ERSS 2005: Coreference-based summarization reloaded. In Proceedings of the DUC 2005 Document Understanding Workshop, Vancouver, B.C., Canada.

Yilmazel, O., Finneran, C., & Liddy, E. (2004). MetaExtract: an NLP System to Automatically Assign Metadata. Proc. of the 4th ACM/IEEE-CS Joint Conference on Digital Libraries, Tucson, AZ, USA, pp. 241–242. ISBN 1-58113-832-6.

Terminology

Precision is the fraction of retrieved documents that are relevant. That is:
Precision = {number of relevant items retrieved} / {number of items retrieved}

Recall is the fraction of relevant documents that are retrieved. That is:
Recall = {number of relevant items retrieved} / {number of relevant items in the entire collection}

Source: Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008. http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-in-information-retrieval-1.html
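As a small, self-contained worked example of these two measures (not tied to any particular extraction tool):

```python
def precision_recall(retrieved: set, relevant: set):
    """Compute precision and recall from sets of item identifiers."""
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall

# Example: an extractor retrieves 4 candidate titles, 3 of which are correct,
# while the collection contains 5 correct titles in total.
retrieved = {"t1", "t2", "t3", "t4"}
relevant = {"t1", "t2", "t3", "t5", "t6"}
print(precision_recall(retrieved, relevant))  # (0.75, 0.6)
```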
Related Curation Manual chapters and other DCC products
• Metadata
• Preservation metadata

An annotated list of key external resources

1. The Automatic Metadata Generation: use case identification and tools/services prioritisation project 25 suggested a set of use cases for the automatic generation and use of the following types of metadata: subject metadata, name metadata, geospatial metadata, factual metadata, bibliographic metadata, usage metadata and file format metadata. It also addresses the integration of automatic metadata services and the automatic language translation of metadata, and provides information on tools for the automated extraction of metadata.

2. The Automatic Metadata Generation for Resource Discovery project 26 conducted a survey of the automatic metadata generation tools available.

3. The Automatic Metadata Generation Application (AMeGA) project 27 aimed to identify and recommend functionalities for applications supporting automatic metadata generation in the library/bibliographic control community.

4. Barton, Currier and Hey (2003). Building Quality Assurance into Metadata Creation: An Analysis based on learning object and e-Prints Communities of Practice 28.

25 Automatic Metadata Generation: use case identification and tools/services prioritisation: http://www.jisc.ac.uk/whatwedo/programmes/inf11/resdis/automaticmetadata.aspx
26 Automatic Metadata Generation for Resource Discovery project: http://www.jisc.ac.uk/whatwedo/programmes/resourcediscovery/autometgen.aspx
27 Automatic Metadata Generation Application (AMeGA) project: http://ils.unc.edu/mrc/amega/
28 Barton, J., Currier, S. and Hey, J. M. N. (2003). Building Quality Assurance into Metadata Creation: An Analysis based on learning object and e-Prints Communities of Practice. Presented at the DC2003 Conference: http://www.siderean.com/dc2003/201_paper60.pdf
".jpg" : ".webp"); imgEle.src = imgsrc; var $imgLoad = $('<div class="pf" id="pf' + endPage + '"><img src="/loading.gif"></div>'); $('.article-imgview').append($imgLoad); imgEle.addEventListener('load', function () { $imgLoad.find('img').attr('src', imgsrc); pfLoading = false }); if (endPage < 7) { adcall('pf' + endPage); } } }, { passive: true }); </script> <script> var sc_project = 11552861; var sc_invisible = 1; var sc_security = "b956b151"; </script> <script src="https://www.statcounter.com/counter/counter.js" async></script> </html><script data-cfasync="false" src="/cdn-cgi/scripts/5c5dd728/cloudflare-static/email-decode.min.js"></script>