Enabling the Discovery of Digital Cultural Heritage Objects Through Wikipedia
Mark M Hall, Paul D Clough
Information School, Sheffield University, Sheffield, UK
Oier Lopez de Lacalle (1, 2)
(1) IKERBASQUE, Basque Foundation for Science, Bilbao, Spain
(2) School of Informatics, University of Edinburgh, Edinburgh, UK
Aitor Soroa, Eneko Agirre
IXA NLP Group, University of the Basque Country, Donostia, Spain
Abstract
Over the past years large digital cultural heritage collections have become increasingly available. While these provide adequate search functionality for the expert user, this may not offer the best support for non-expert or novice users. In this paper we propose a novel mechanism for introducing new users to the items in a collection by allowing them to browse Wikipedia articles, which are augmented with items from the cultural heritage collection. Using Europeana as a case-study we demonstrate the effectiveness of our approach for encouraging users to spend longer exploring items in Europeana compared with the existing search provision.
1 Introduction
Large amounts of digital cultural heritage (CH) information have become available over the past years, especially with the rise of large-scale aggregators such as Europeana [1], the European aggregator for museums, archives, libraries, and galleries. These large collections present two challenges to the new user. The first is discovering the collection in the first place. The second is then discovering what items are present in the collection. In current systems, support for item discovery is mainly through the standard search paradigm (Sutcliffe and Ennis, 1998), which is well suited for CH professionals who are highly familiar with the collections and subject areas and have specific search goals. However, for new users who do not have a good understanding of what is in the collections or what search keywords to use, and who have vague search goals, this method of access is unsatisfactory, as this quote from Borgman (2009) exemplifies:
“So what use are the digital libraries, if all they do is put digitally unusable information on the web?”
Alternative item discovery methodologies are required to introduce new users to digital CH collections (Geser, 2004; Steemson, 2004). Exploratory search models (Marchionini, 2006; Pirolli, 2009) that enable switching between collection overviews (Hornbæk and Hertzum, 2011) and detailed exploration within the collection are frequently suggested as more appropriate.
We propose a novel mechanism that enables users to discover an unknown, aggregated collection by browsing a second, known collection. Our method lets the user browse through Wikipedia and automatically augments the page(s) the user is viewing with items drawn from the CH collection, in our case Europeana. The items are chosen to match the page’s content and enable the user to acquire an overview of what information is available for a given topic. The goal is to introduce new users to the digital collection, so that they can then successfully use the existing search systems.
[1] http://www.europeana.eu
2 Background
Controlled vocabularies are often seen as a promising discovery methodology (Baca, 2003). However, in the case of aggregated collections such as Europeana, items from different providers are frequently aligned to different vocabularies, requiring an integration of the two vocabularies in order to present a unified structure. Isaac et al. (2007) describe the use of automated methods for aligning vocabularies; however, this is not always possible. A proposed alternative is to synthesise a new vocabulary to cover all aggregated data; however, Chaudhry and Jiun (2005) highlight the complexities involved in then linking the individual items to the new vocabulary.
To overcome this, automatic clustering and visualisations based directly on the meta-data have been proposed, such as 2d semantic maps (Andrews et al., 2001), automatically generated tree structures (Chen et al., 2002), multi-dimensional scaling (Fortuna et al., 2005; Newton et al., 2009), self-organising maps (Lin, 1992), and dynamic taxonomies (Papadakos et al., 2009). However, none of these have achieved sufficient success to find widespread use as exploration interfaces.
Faceted search systems (van Ossenbruggen et al., 2007; Schmitz and Black, 2008) have arisen as a flexible alternative for surfacing what meta-data is available in a collection. Unlike the methods listed above, faceted search does not require complex pre-processing and the values to display for a facet can be calculated on the fly. However, aggregated collections frequently have large numbers of potential facets and values for these facets, making it hard to surface a sufficiently large fraction to support resource discovery.
Time-lines such as those proposed by Luo et al. (2012) do not suffer from these issues, but are only of limited value if the user’s interest cannot be focused through time. A user interested in examples of pottery across the ages or restricted to a certain geographic area is not supported by a time-line-based interface.
The alternative we propose is to use a second collection that the user is familiar with and that acts as a proxy to the unfamiliar collection. Villa et al. (2010) describe a similar approach where Flickr is used as the proxy collection, enabling users to search an image collection that has no textual meta-data.
In our proposed approach items from the unfamiliar collection are surfaced via their thumbnail images. Similar approaches for automatically retrieving images for text have been tried by Zhu et al. (2007) and Borman et al. (2005). Zhu et al. (2007) report success rates that approach the quality of manually selected images; however, their approach requires complex pre-processing, which the dynamic nature of discovery prohibits.
Wikipedia was chosen as the discovery interface as it is known to have good content coverage and frequently appears at the top of search results (Schweitzer, 2008) for many topics, its use has been studied (Lim, 2009; Lucassen and Schraagen, 2010), and it is frequently used as an information source for knowledge modelling (Suchanek et al., 2008; Milne and Witten, 2008), information extraction (Weld et al., 2009; Ni et al., 2009), and similarity calculation (Gabrilovich and Markovitch, 2007).
3 Discovering Europeana through Wikipedia
As stated above, our method lets users browse Wikipedia and at the same time exposes them to items taken from Europeana, enabling them to discover items that exist in Europeana.
The Wikipedia article is augmented with Europeana items at two levels. The article as a whole is augmented with up to 20 items that have been linked to the article in a pre-processing step, and at the same time each paragraph in the article is augmented with one item relating to that paragraph.
Figure 1: Architectural structure of the Wikiana system
Our system (Wikiana, figure 1) sits between the user and the data-providers (Wikipedia, Europeana, and the pre-computed article augmentation links). When the user requests an article from Wikiana, the system fetches the matching article from Wikipedia and in a first step strips everything except the article’s main content. It then queries the augmentation database for Europeana items that have been linked to the article and selects the top 20 items from the results, as detailed below.
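The request flow described in this section (article-level items first, then one item per paragraph) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the Wikiana implementation: the function names `augment_article` and `extract_keywords`, the three injected callables, the `score` field on linked items, and the returned dictionary layout are all hypothetical. The keyword picker is a simple stand-in, since the text defers the actual keyword selection to a later section.

```python
import random
import re
from typing import Callable, List, Sequence

MAX_ARTICLE_ITEMS = 20  # the text states up to 20 article-level items


def extract_keywords(paragraph: str, limit: int = 3) -> List[str]:
    """Stand-in keyword picker (assumption): the longest distinct words."""
    words = {w.lower() for w in re.findall(r"[A-Za-z]{4,}", paragraph)}
    return sorted(words, key=lambda w: (-len(w), w))[:limit]


def augment_article(
    title: str,
    fetch_paragraphs: Callable[[str], Sequence[str]],
    linked_items: Callable[[str], Sequence[dict]],
    search_items: Callable[[List[str]], Sequence[dict]],
    rng: random.Random,
) -> dict:
    """Combine article-level and paragraph-level augmentation.

    All three callables are hypothetical placeholders:
      fetch_paragraphs -- returns the article's main content as paragraphs
      linked_items     -- returns pre-computed items linked to the article,
                          each assumed to carry a numeric "score"
      search_items     -- returns collection items matching the keywords
    """
    paragraphs = fetch_paragraphs(title)

    # Article level: keep the highest-scoring pre-linked items.
    ranked = sorted(linked_items(title), key=lambda i: i["score"], reverse=True)
    article_items = ranked[:MAX_ARTICLE_ITEMS]

    # Paragraph level: one randomly chosen search result per paragraph.
    augmented = []
    for text in paragraphs:
        results = list(search_items(extract_keywords(text)))
        item = rng.choice(results) if results else None
        augmented.append({"text": text, "item": item})

    return {"title": title, "article_items": article_items, "paragraphs": augmented}
```

Passing the data sources in as callables keeps the sketch independent of any real endpoint, and a caching layer of the kind the text mentions could be added by wrapping each callable.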
It then processes each paragraph and uses keywords drawn from the paragraphs (details below) to query Europeana’s OpenSearch API for items. A random item is selected from the result-set and a link to its thumbnail image is inserted into the paragraph. The augmented article is then sent to the user’s browser, which in turn requests the thumbnail images from Europeana’s servers (fig. 2).
Figure 2: Screenshot of the augmented article “Mediterranean Sea” with the pre-processed article-level augmentation at the top and the first two paragraphs augmented with items as returned by the Europeana API.
The system makes heavy use of caching to speed up the process and also to reduce the amount of load on the backend systems.
3.1 Article augmentation
To create the article-level augmentations we first create a Wikipedia “dictionary”, which maps strings to Wikipedia articles.
<record>
  <dc:identifier>http://www.kirkleesimage...</dc:identifier>
  <dc:title>Roman Coins found in 1820..., Lindley</dc:title>
  <dc:source>Kirklees Image Archive OAI Feed</dc:source>
  <dc:language>EN-GB</dc:language>
  <dc:subject>Kirklees</dc:subject>
  <dc:type>Image</dc:type>
</record>
Figure 3: Example of an ESE record, some fields have been omitted for clarity.
[...] Europeana that was processed followed the Europeana Semantic Elements (ESE) specifications [4]. Figure 3 shows an example of an ESE record describing a photograph of a Roman coin belonging to the Kirklees Image Archive. We scan each ESE record and try to match the “dc:title” field with the dictionary entries. In the example in figure 3, the item will be mapped to the Wikipedia article Roman currency because the string “roman coins” appears in the title.
As a result, we create a many-to-many mapping between Wikipedia articles and Europeana items. The Wikiana application displays at most 20 images per article, thus the Europeana items need to be ranked. The goal is to rank interesting items higher, with “interestingness” defined as how unusual the items are in the collection. This metric is an adaptation of the standard inverse-document-frequency (IDF) formula used widely in Information Retrieval and is adapted to identify items that have meta-data field-values that are infrequent in the collection. As in the original IDF, we diminish the weight of values that occur very frequently in the collection, the non-interesting items, and increase the weight of values that occur rarely, the