Automatic Subject Indexing of Library Records with Wikipedia Concepts
Total Page:16
File Type:pdf, Size:1020Kb
View metadata, citation and similar papers at core.ac.uk brought to you by CORE Accepted for Publication provided by University of Limerick Institutional Repository By the Journal of Information Science: http://jis.sagepub.co.uk Article Journal of Information Science 1–11 Towards Linking Libraries and © The Author(s) 2013 Reprints and permissions: sagepub.co.uk/journalsPermissions.nav Wikipedia: Automatic Subject DOI: 10.1177/0165551510000000 Indexing of Library Records with jis.sagepub.com Wikipedia Concepts Journal of Information Science 1–11 © The Author(s) 2013 Reprints and permission: sagepub.co.uk/journalsPermissions.nav DOI: 10.1177/1550059413486272 Arash Joorabchi jis.sagepub.com Department of Electronic and Computer Engineering, University of Limerick, Ireland Abdulhussain E. Mahdi Department of Electronic and Computer Engineering, University of Limerick, Ireland Abstract In this article, we first argue the importance and timely need of linking libraries and Wikipedia for improving the quality of their services to information consumers, as such linkage will enrich the quality of Wikipedia articles and at the same time increase the visibility of library resources which are currently overlooked to a large degree. We then describe the development of an automatic system for subject indexing of library metadata records with Wikipedia concepts as an important step towards Library-Wikipedia integration. The proposed system is based on first identifying all Wikipedia concepts occurring in the metadata elements of library records. This is then followed by training and deploying generic machine learning algorithms to automatically select those concepts which most accurately reflect the core subjects of the library materials whose records are being indexed. We have assessed the performance of the developed system using standard information retrieval measures of precision, recall, and F-score on a dataset consisting of 100 library metadata records manually indexed with a total of 469 Wikipedia concepts. The evaluation results show that the developed system is capable of achieving an averaged F-score as high as 0.92. Keywords Text mining; metadata generation; subject metadata; library metadata; bibliographic records; Wikipedia 1. Introduction Over the last decade, libraries have experienced a steady decline in the number of users of their online catalogues and websites, which in turn translates into a substantial decrease in the number of information consumers such as students and scholars who use library resources for their information needs. Reportedly, very few information searches (~1%) start from library websites, while the great majority of the rest of information seeking activities (84%) start from search engines such as Google (62%) [1]. On the other hand, Google-Wikipedia is becoming a prevalent information-seeking paradigm in which the information seeker submits an informational query, i.e., query on a particular topic, subject, or concept, to Google and then follows one of the search results redirecting to a relevant article on Wikipedia. Reportedly, Wikipedia appears on page one of the Google search results for 60% of informational queries and in 66% of such cases it appears in top-visibility positions (1-3) of the results page, where majority of clicks occur [2]. Wikipedia is the world’s largest web-based free-content encyclopedia project. The English Wikipedia alone currently contains more than four million articles [3]. Wikipedia articles are written, edited, and kept up-to-date and accurate (to a large degree) by a vast community of volunteer contributors, editors, and administrators who are collectively called Wikipedians. An investigation conducted by Nature in 2005 [4] suggested that Wikipedia comes close to Encyclopaedia Britannica in terms of the accuracy of its science entries, which was later disputed by the Britannica [5]. However, regardless of occasional controversies around the accuracy of its articles, Wikipedia is serving a significant role in Corresponding author: Abdulhussain E. Mahdi, Department of Electronic and Computer Engineering, University of Limerick, Limerick, Republic of Ireland. Email: [email protected] Accepted for Publication By the Journal of Information Science: http://jis.sagepub.co.uk Joorabchi et al 2 fulfilling public information needs. For example, results of a nationwide survey conducted in the U.S. in 2007 showed that Wikipedia attracted six times more traffic than the next closest website in the ‘educational and reference’ category and preceded websites such as Google Scholar and Google Books with a large margin [6]. In the context described above, linking Wikipedia articles to the records of relevant library materials will give information seekers the option to readily acquire lists of library resources which provide them with more in-depth knowledge on their subject of interest. In this paradigm each Wikipedia article will be linked to the records of relevant materials in a global union catalogue of libraries around the world, WorldCat.org, which in turn provides bibliographic information on the materials of interest and directs information seekers to their local libraries, where they can access those materials. Introduction of this new Wikipedia-Library information seeking paradigm will consequently improve the visibility of library resources which are currently overlooked to a large extend by those information consumers with lower information literacy skills. Looking at it from another perspective, in this paradigm Wikipedia plays the role of a new controlled vocabulary for subject indexing of library materials as an alternative (and/or compliment) to the traditional controlled vocabularies currently used in libraries, such as the Library of Congress Subject Headings (LCSH). As a controlled vocabulary, Wikipedia offers a number of unique advantages over traditional controlled vocabularies: (1) Extensive coverage and comprehensiveness: the English Wikipedia currently contains over 4 million articles covering subjects in all aspects of human knowledge and growing. Whereas, the latest version of LCSH (33rd edition), which is the de facto standard controlled vocabulary used in libraries around the world, contains approximately a total of 337,000 subject headings and references [7]. (2) Up-to-dateness: due to the crowd-sourced nature of Wikipedia and its large pool of editors, Wikipedia articles are generally well-maintained and kept quite up-to-date. For example, a recent study examining the potential of combining Twitter and Wikipedia data for event detection shows that in case of major events Wikipedia lags Twitter only by about three hours [8]. (3) Rich description: Wikipedia articles provide rich descriptive content for the represented concepts. Whereas, traditional library controlled vocabularies offer very little or no description for their subject headings. For example, the LCSH authority record for the subject heading ‘Metadata’1 offers only two pieces of information about this subject: (a) the subject may also be referred to by ‘data about data’ and ‘meta-data’; (b) the subject is related to two more specific subjects ‘Dublin Core’ and ‘Preservation metadata’. In contrast, the Wikipedia article for the concept of ‘metadata’2 provides a rich description of the concept including definition, variations, and applications complemented with links and references to relevant materials and related concepts. Therefore, extending the metadata records of library materials with Wikipedia concepts gives the users of online library catalogues the option to find descriptive information about unfamiliar indexing subjects that they may encounter as they browse and explore the collection. (4) Semantic richness: like the LC subject headings, each Wikipedia article has a descriptor which is the preferred and most commonly used term for the represented concept and it is also assigned a set of non-descriptors which are the less common synonyms and alternative lexical forms for the concept. Also, similar to the concept of ‘Related Terms’ in library controlled vocabularies, in Wikipedia, related articles are connected via hyperlinks. Furthermore, each Wikipedia article is classified according to the Wikipedia’s own community-built classification scheme into one or more broader categories which resembles the concept of ‘Broader Terms/Narrower Terms’ used in the library controlled vocabularies. Finally, similar to the library controlled vocabularies, Wikipedia addresses the problem of word-sense ambiguity by allowing an ambiguous term to correspond to multiple concepts each representing a different sense of the term, e.g., Java (programming language), Java (town), Java (band), etc. (5) Multilingual: as of September 2013 Wikipedia exists in more than 285 languages. Wikipedia has more than one million articles in each of the 8 most populated languages and more than one hundred thousand articles in each of the 38 less populated languages [9]. This high level of multilingualism in Wikipedia would allow the design and development of effective multilingual information retrieval systems for libraries once their metadata records are indexed with Wikipedia concepts. According to a report published by OCLC (Online Computer Library Centre) in 2009 [10], the majority of end users of online library catalogues surveyed considered adding more subject information to the metadata records of library materials to be the most helpful enhancement in relation to improving the discovery-related data quality