On Providing Semantic Alignment and Unified Access to Music Library
Total Page:16
File Type:pdf, Size:1020Kb
Int J Digit Libr DOI 10.1007/s00799-017-0223-9 On providing semantic alignment and unified access to music library metadata David M. Weigl1 · David Lewis2 · Tim Crawford2 · Ian Knopke3 · Kevin R. Page1 Received: 1 March 2016 / Revised: 27 June 2017 / Accepted: 30 June 2017 © The Author(s) 2017. This article is an open access publication Abstract A variety of digital data sources—including insti- providing aligned access to catalogue metadata and digitised tutional and formal digital libraries, crowd-sourced commu- score images from the British Library and other sources, and nity resources, and data feeds provided by media organisa- broadcast data from the BBC Radio 3 Early Music Show. tions such as the BBC—expose information of musicological interest, describing works, composers, performers, and wider Keywords Linked Data · Metadata · Semantic alignment · historical and cultural contexts. Aggregated access across Contextual matching · Musicology · Early music such datasets is desirable as these sources provide comple- mentary information on shared real-world entities. Where datasets do not share identifiers, an alignment process is 1 Introduction required, but this process is fraught with ambiguity and difficult to automate, whereas manual alignment may be The reconciliation of corpora providing access to historical time-consuming and error-prone. We address this problem catalogue data in a digital libraries context is made diffi- through the application of a Linked Data model and frame- cult by a range of challenges from ambiguities concerning work to assist domain experts in this process. Candidate the names of individuals to disputed or erroneous attribu- alignment suggestions are generated automatically based on tion (e.g. [1]). Relevant sources include digital libraries of textual and on contextual similarity. The latter is determined institutions such as the British Library, formal digital library according to user-configurable weighted graph traversals. resources provided by organisations such as the OCLC1 Match decisions confirming or disputing the candidate sug- (e.g. VIAF2), data feeds provided by commercial and media gestions are obtained in conjunction with user insight and industry institutions such as the BBC (the UK’s national expertise. These decisions are integrated into the knowledge public service broadcaster), and community resources such base, enabling further iterative alignment, and simplifying as MusicBrainz.3 Such datasets provide complementary the creation of unified viewing interfaces. Provenance of the information concerning the same historical entities (e.g. com- musicologist’s judgement is captured and published, support- posers, works), but the corresponding records may not share ing scholarly discourse and counter-proposals. We present identifiers across the datasets. This greatly complicates con- our implementation and evaluation of this framework, con- venient access to the entirety of the information available on ducting a user study with eight musicologists. We further a given entity. demonstrate the value of our approach through a case study Automated approaches employing heuristics to identify similarly named individuals or similarly titled works are insufficient to resolve ambiguous cases such as a name B David M. Weigl [email protected] being shared by multiple individuals or an individual being known by multiple names. Further complications arise from 1 Oxford e-Research Centre, University of Oxford, Oxford, UK 2 Department of Computing, Goldsmiths, University of 1 http://www.oclc.org. London, London, UK 2 http://viaf.org. 3 British Broadcasting Corporation, London, UK 3 http://musicbrainz.org. 123 D. M. Weigl et al. potential differences in language between datasets (e.g. via related work in the matching of disparate datasets. Section 4 anglicisation of Latin names), non-standardised spellings (a presents the design of the data model and framework under- prominent issue with historical catalogue data), and simple lying our approach towards addressing this problem. Section errors. The manual resolution of these concerns by domain 5 discusses the implementation of a Semantic Alignment and specialists is tedious and error-prone, as the number of poten- Linking Tool (SALT)5 that builds on this model. Section 6 tial alignment candidates grows exponentially with the size reports on a case study applying SALT to link academic and of the datasets in question. media industry datasets focusing on our early music scenario. In this paper, we address this issue by combining the Section 7 presents a user evaluation of the SALT data model expert guidance of the domain specialist with the efficiency and user interface, employing eight musicologist participants and tractability of an automated solution, combining sur- with expertise in early music. Section 8 presents the imple- face similarity and contextual semantics to generate match mentation of a unified viewing interface6 providing access to candidates for confirmation or disputation by the musicolo- datasets linked by a domain specialist using SALT. Finally, gist. By providing computational assistance, we ensure that Sect. 9 concludes this discussion and presents our plans for the user’s attention is required only for those aspects of the future work. alignment task providing or demanding insight. By track- ing provenance information regarding the user’s alignment 2 Musicological motivation and case study in early activities, we explicitly capture their judgement on contested music attributions and similar topics of scholarly dispute. We adopt a Linked Data approach [5] employing the Exploratory investigations have been conducted into the Resource Description Framework (RDF) [16], a standard information needs and information seeking behaviours of model for online data exchange. Linked Data extends the (potential) users of music information systems [24,42], and linking structure of the World Wide Web by employing URIs to a lesser degree, of musicologists [2,19,28]. However, to specify directed relationships between data instances. research in the field of Music Information Retrieval has pre- These data instances may themselves be encoded by URIs dominantly focused on system-centric concerns [42], and the or represented by literal values. A set of two such instances, needs and behaviours of musicologists in particular remain linked by such a relationship, is referred to as a triple.4 Col- relatively underexplored [19]. lections of triples may be stored in flat files on a server, In the course of our work, we have observed that musi- accessed via HTTP, or housed within specialised RDF cologists gather the information they require about music databases, known as triplestores. By employing this Linked and its contexts from a variety of sources. An academic may Data approach, the meanings of the relationships between the find literature through tools including Répertoire Interna- data are made explicit, allowing them to be understood by tional de Littérature Musicale (RILM)7 and JSTOR,8 through both humans and machines. Information represented in this general-purpose resources such as Wikipedia or Google way may be linked to external datasets and can in turn be Scholar, or using topic-specific encyclopaedias, for exam- linked to from external datasets, embedding the information ple The New Grove Dictionary of Music and Musicians within a wider web of knowledge and making it discoverable (New Grove) [36], and Musik in Geschichte und Gegenwart and reusable in other contexts [7]. [8], and through library catalogues. Resources about notated A key strength of the RDF model compared to the tradi- music are more diverse, including Répertoire International tional relational model is that it is robust to underspecification des Sources Musicales (RISM) [34] and more specialised cat- and mutability in the data schema, and thus facilitates the alogues such as Brown’s Instrumental Music Printed Before merging of disparate datasets that may each employ radically 1600: A Biography [10], and even more varied still are the different schemas or ontologies. Since it involves publishing sources of information on performances, recordings, and alignment outcomes as interlinked RDF triples, our solution broadcasts. facilitates the creation of combined views of the unified data Historically, the task of coordinated study of these areas within the now-reconciled corpora, while supporting reuse of of musical understanding has necessarily been a manual one. the musicologist’s decisions to drive further scholarly activ- Many of the resources are or have been published in book or ity. serial form, and the reader can carefully assemble a narra- The remainder of this article is organised as follows: Sect. tive from the separate parts. Online versions of these, along 2 describes the musicological motivations that have led to with the new digital-only materials, provide opportunities the development of the work presented here and introduces a specific motivating case in early music. Section 3 details 5 http://github.com/oerc-music/salt. 6 http://github.com/oerc-music/slobr-ui. 7 4 In the parlance of Linked Data, a subject is related by a predicate to http://rilm.org. an object. 8 http://jstor.org. 123 On providing semantic alignment and unified access to music library metadata for faster, more comprehensive study, along with larger-scale opments in musical performance and composition both in apprehension of a topic, but they also carry greater risks of Britain and abroad”. It plays almost