Report

Development of a flexible tool for the automatic comparison of bibliographic records. Application to sample collections

BOREL, Alain, KRAUSE, Jan Brice

Abstract

With the multiplication of digital bibliographic catalogues (open repositories, library and bookseller catalogues), information specialists face the challenge of mass-processing huge amounts of metadata for various purposes. Among the many possible applications, determining the similarity between records is an important issue. Such a similarity can be of interest from a bibliographic point of view (do the records describe the same document? The answer is useful for deduplication and for collection overlap studies) as well as from a thematic point of view (suggesting documents to the user, managing content within the framework of a library policy, classifying documents automatically, and so on). To fulfil these various needs, we propose a flexible, open-source, multiplatform software tool that supports the implementation of multiple strategies for record comparison. In a second step, we study the relevance and performance of several algorithms applied to a selection of collections (size, origin, document types...).

Reference

BOREL, Alain, KRAUSE, Jan Brice. Development of a flexible tool for the automatic comparison of bibliographic records. Application to sample collections. 2009, 107p.

Available at: http://archive-ouverte.unige.ch/unige:23174

Disclaimer: layout of this document may differ from the published version.

Université de Genève
HEG – Département Information Documentaire
CESID - Diplôme universitaire de formation continue en information documentaire

Research project (Travail de recherche)

Development of a flexible tool for the automatic comparison of bibliographic records.
Application to sample collections

Développement d'un logiciel flexible pour la comparaison de notices bibliographiques et application à différentes collections

Alain BOREL
Jan KRAUSE

Supervision: Prof. Jacques SAVOY

2009

Table of Contents

I Résumé
II Abstract
III Introduction
IV Technical section
    IV.1 MARCXML
    IV.2 Text similarity and information retrieval models
        IV.2.1 The Boolean model
            Wildcards and fuzzy searches
        IV.2.2 The vector model
            Introducing the vector model
            Weighting
            Vector space scoring and similarity
        IV.2.3 The probabilistic models
        IV.2.4 Global similarity computation
    IV.3 Analysis of similarity output
        IV.3.1 Sorting
        IV.3.2 Clustering
    IV.4 The Python programming language
V The MarcXimiL similarity framework
    V.1 Framework licensing and availability
    V.2 Framework portability
    V.3 Framework installation
        V.3.1 Basic installation
        V.3.2 Additional tools
    V.4 Framework structure
        V.4.1 An introduction to this application's core structure
        V.4.2 Within the file system
    V.5 Framework execution phases
        V.5.1 Loading records
        V.5.2 Parsing fields
        V.5.3 Pre-processing of fields
        V.5.4 The caching subsystem
        V.5.5 Managing comparisons
        V.5.6 Computing the global similarity
            Maxsim
            Means
            Breakout means
            Boundaries
            Boundaries_max
            Ubiquist
            Abstract_fallback
        V.5.7 Comparison functions
            RAW family
            WC family
            INITIALS family
            SHINGLES family
        V.5.8 Writing output
    V.6 Program setup
        V.6.1 Selection of collections
        V.6.2 Structure of records & similarity
        V.6.3 Caching options
        V.6.4 Other options
    V.7 Additional tools
        V.7.1 oai.py
        V.7.2 batch.py
        V.7.3 sort.py