Report Reference

Total Page:16

File Type:pdf, Size:1020Kb

Report Reference Report Development of a flexible tool for the automatic comparison of bibliographic records. Application to sample collections BOREL, Alain, KRAUSE, Jan Brice Abstract Due to the multiplication of digital bibliographic catalogues (open repositories, library and bookseller catalogues), information specialists are facing the challenge of mass-processing huge amounts of metadata for various purposes. Among the many possible applications, determining the similarity between records is an important issue. Such a similarity can be interesting from a bibliographic point of view (i.e., do the records describe the same document, the answer to which can be useful for deduplication or for collection overlap studies) as well as from a thematic point of view (suggestion of documents to the user, as well as content management within the framework of a library policy, automatic classification of documents, and so on). In order to fulfil such various needs, we propose a flexible, open-source, multiplatform software tool supporting the implementation of multiple strategies for record comparisons. In a second step, we study the relevance and performance of several algorithms applied to a selection of collections (size, origin, document types...). Reference BOREL, Alain, KRAUSE, Jan Brice. Development of a flexible tool for the automatic comparison of bibliographic records. Application to sample collections. 2009, 107p. Available at: http://archive-ouverte.unige.ch/unige:23174 Disclaimer: layout of this document may differ from the published version. 1 / 1 Université de Genève HEG – Département Information Documentaire CESID - Diplôme universitaire de formation continue en information documentaire Travail de recherche Development of a flexible tool for the automatic comparison of bibliographic records. Application to sample collections Développement d'un logiciel flexible pour la comparaison de notices bibliographiques et application à différentes collections Alain BOREL Jan KRAUSE Supervision: Prof. Jaques SAVOY 2009 2 Table of Contents I Résumé...............................................................................................................................................7 II Abstract.............................................................................................................................................9 III Introduction...................................................................................................................................11 IV Technical section...........................................................................................................................14 IV.1 MARCXML...........................................................................................................................14 IV.2 Text similarity and information retrieval models...................................................................14 IV.2.1 The Boolean model.........................................................................................................15 Wildcards and fuzzy searches..............................................................................................16 IV.2.2 The vector model............................................................................................................17 Introducing the vector model...............................................................................................17 Weighting.............................................................................................................................18 Vector space scoring and similarity......................................................................................18 IV.2.3 The probabilistic models................................................................................................20 IV.2.4 Global similarity computation........................................................................................22 IV.3 Analysis of similarity output..................................................................................................22 IV.3.1 Sorting............................................................................................................................22 IV.3.2 Clustering.......................................................................................................................23 IV.4 The Python programming language.......................................................................................23 V The MarcXimiL similarity framework...........................................................................................27 V.1 Framework licencing and availability.....................................................................................27 V.2 Framework portability.............................................................................................................27 V.3 Framework installation............................................................................................................27 V.3.1 Basic installation..............................................................................................................27 V.3.2 Additional tools................................................................................................................27 V.4 Framework structure................................................................................................................28 V.4.1 An introduction to this application's core structure.........................................................28 V.4.2 Whithin the file system....................................................................................................29 V.5 Framework execution phases..................................................................................................30 V.5.1 Loading records...............................................................................................................30 V.5.2 Parsing fields...................................................................................................................30 V.5.3 Pre-processing of fields...................................................................................................31 V.5.4 The caching subsystem....................................................................................................31 V.5.5 Managing comparisons....................................................................................................32 V.5.6 Computing the global similarity......................................................................................33 Maxsim.................................................................................................................................34 Means ..................................................................................................................................35 Breakout means ...................................................................................................................35 Boundaries............................................................................................................................35 Boundaries_max...................................................................................................................36 Ubiquist................................................................................................................................36 Abstract_fallback..................................................................................................................37 V.5.7 Comparison functions......................................................................................................37 RAW family..........................................................................................................................38 WC family............................................................................................................................39 INITIALS family..................................................................................................................40 SHINGLES family...............................................................................................................40 V.5.8 Writing output..................................................................................................................40 V.6 Program setup..........................................................................................................................41 3 V.6.1 Selection of collections....................................................................................................41 V.6.2 Structure of records & similarity.....................................................................................42 V.6.3 Caching options...............................................................................................................44 V.6.4 Other options...................................................................................................................44 V.7 Additionnal tools.....................................................................................................................44 V.7.1 oai.py...............................................................................................................................45 V.7.2 batch.py............................................................................................................................45 V.7.3 sort.py..............................................................................................................................46
Recommended publications
  • Research Tools F = Free; $ = for Purchase
    Chris Church www.christophermchurch.com Research Tools F = free; $ = for purchase Biblioscape - $ http://www.biblioscape.com/ Pay software to help researchers collect and manage bibliographic data, take notes while doing research, and generate citations and bibliographies for publication. Bibus - F http://sourceforge.net/projects/bibus-biblio/ Bibus is an open-source bibliographic database. It uses a MySQL or SQLite database to store references. It can directly insert references in OpenOffice.org and MS Word and generate the bibliographic index. Dropbox – F, $ www.dropbox.com A free service that lets you sync and backup your data. Great for working across multiple computers and for making sure your materials are backed up. 2gb free; more space requires monthly subscription. Endnote - F www.endnote.com Software tool for publishing and managing bibliographies. Evernote – F, $ www.evernote.com Online organizer that allows you to take notes and capture webpages, images, articles, etc. It syncs to an online account is available across numerous computers and smartphones. Similar to an online, free version of EndNote. Price depends on space needed, and pricing goes from free on up. Foxit PDF Reader– F, $ www.foxit.com A powerful alternative to Adobe Acrobat Reader. It has the ability to markup text, make annotations, highlight, etc. Also uses far less computer memory to run. Google Picasa picasa.google.com Great image viewer for pictures taken in the archives. Growly Notes - F http://growlybird.com/GrowlyBird/Notes.html Essentially a free, Macintosh version of Microsoft OneNote. Mendeley - F http://www.mendeley.com/ A free reference manager and academic social network that can help you organize your research, collaborate with others online, and discover the latest research.
    [Show full text]
  • Tools for Creating Bibliographic Databases for Use with Bibtex D.V.L.K.D.P
    The PracTEX Journal, 2007, No. 3 Article revision 2007/08/14 Tools for creating bibliographic databases for use with BibTEX D.V.L.K.D.P. Venugopal Email [email protected] Address Senior Personal Assistant Vice-Chancellor’s Office Banaras Hindu University Varanasi 221 005, India Abstract By using BibTEX we can easily change the style of Bibliography/References according to the style of the journal. But creating bibliographic databases for use with BibTEX is very cumbersome. This article describes the various software tools available for creating bibliographic databases easily, particu- larly for the Windows platform. 1 Introduction One of the major advantages of using LATEX is ease of inclusion of bibliographies in conjunction with BibTEX. Various bibliographic style files (bst) are available for different journals and publishing houses. Using these bst files the format and citation style can be changed easily[1]. This facility is not available in commercial wordprocessors or desktop publishing systems. Creating BibTEX databases is also discussed in LATEX Tutorials[2] and by Parthasarathy[3]. Bibliographic databases for BibTEX can be created using ordinary editors but it is tedious work involving a lot of typing. But once created, the same data can be used many times and by many users. A search on the Internet for BibTEX database managers gives a long list of free and commercial software1. To name a few: bibtex mode and Ebib (for Emacs), Pybliographer (using Python for Linux), gBib (for GNOME, Linux), Barracuda (for Linux), KBibTeX (for KDE, Linux), Sixpack (multi platform), JBibtexManager, Javabib and JabRef (multi platform using Java), BibDB (for DOS and Windows), 1.
    [Show full text]
  • Manage Your Information
    9/24/2008 Useful Tools for Study & Research Yiping Lin Department of Physics National Tsing Hua University Y. Lin Where to Look for Information Are these sources reliable? Y.Lin 1 9/24/2008 Google in Books & Papers Search the full text of books Search scholarly papers Y.Lin Databases of Research Papers Published or Preprint Published Cited References Y.Lin 2 9/24/2008 How Do You Organize These Papers? Y.Lin In File Folders? In A File Cabinet? Feel Like Sinking in Sea of Papers? Help! Y.Lin 3 9/24/2008 When Start to Writing Your Report Or Prepare Your Presentation Or Write Your Thesis ... … … … Y.Lin A Convenient Tool - Zotero A free, easy-to-use Firefox extension to help you collect, manage, and cite your research sources. It lives right where you do your work — in the web browser itself. Y.Lin 4 9/24/2008 Zotero’s Features Automatic capture of citation information from web pages Storage of PDFs, files, images, links, and whole web pages Flexible notetaking with autosave Fast, as-you-type search through your materials Playlist-like library organization, including saved searches and tags Runs right in your web browser Formatted citation export (style list to grow rapidly) Free and open source Integration with Microsoft Word and OpenOffice Y.Lin Annotation of Web Pages Y.Lin 5 9/24/2008 Collection History – Timeline Y.Lin Can generate the list of Export for Report bibliographies and switch the style In Word, OpenOffice Write In Google Docs Y.Lin 6 9/24/2008 A Helper to Manage References Software Handling the Bibliography Entries
    [Show full text]
  • References Bibliography In
    LYX: For the Impatient - 2 Adel M. Abdel-Azim, MDSC, PhD Professor and Chairman of Oral Pathology Department, Faculty of Dentistry, Ain Shams University, Egypt 6/5/2013 Preface Hopefully you’ll find something useful in this little book – Adel This document was created by LYX and is fully hyperlinkable that is to say you can click on the any item on the table of contents or on the list of figures to reach immediately to your target. Also, you can click on the figure number seen within the text to reach to the figure. Thanks are due to LYX for its ability to make documents fully hyperlinkable and hypreclickable. Note: This is a part of my book “LYX For the Impatient” which is still under preparation. Hope that I can finish it as soon as possible. 4 Contents 1 References (Bibliography) in LYX 7 1.1 Citation Versus Reference List ................. 7 1.2 Citation Styles ......................... 7 1.3 BibTEX ............................. 8 1.4 Inserting References in LYX . 12 1.5 Modifying References in LYX . 16 1.6 Citation Positioning ...................... 16 1.7 Style of the numbers in the bibliography . 17 1.8 Journal Styles .......................... 17 List of Figures 1.1 Citation and reference list .................... 8 1.2 JabRef Web Search Pane ..................... 12 1.3 Bibtex Key Generation ...................... 13 6 List of Figures 1 References (Bibliography) in LYX 1.1 Citation Versus Reference List F Citation or “reference in text” is the expression written in the text de- noting or referring the author, authors, published or unpublished source. F Reference list also known as bibliographical list is the list found usually at the end of your document in which you list all the authors or sources mentioned in your document.
    [Show full text]
  • An Insider's Insight Into Literature Searches
    An Insider’s Insight into Literature Searches Searching the literature can take various forms, ranging from a quick scan of recent publications to a formal, systematic interrogation of all available data sources to establish the scientific consensus on a specific topic. In these days of online journal databases, the relative ease of conducting a search means that they often start informally with no thought-out search strategy or defined goal. A long list of articles can be generated almost instantaneously, but what did you miss and how long will it take to review the data? How easily can the search strategy be repeated and adapted to obtain a more complete and refined set of references? We offer some insights from the Niche medical writing team who have been conducting literature searches for their clients since 1998. Copyright © 2016 Niche Science & Technology Ltd, UK 1 Before you start Prepare to succeed • Establish a formal plan for your literature • Your searches will create outputs in the form of search if you propose to do anything more than lists of publications. Decide what information conduct a cursory review of the literature you need to record about each reference in order to help determine its relevance, how you • Know what you want to achieve so you avoid will store the information and how you will endless futile or repetitive searching. Set ‘score’ the overall efficacy of a search strategy yourself an objective and identify an endpoint that qualifies whether or not you have achieved • Know something about your subject area your goal before you decide on the operational parameters of your search; consider the • Ensure you have access to an appropriate coverage history in the literature, controversies, search engine as different databases will specialist journals, sub categories, etc.
    [Show full text]
  • Análisis Comparativo De Los Gestores Bibliográficos Sociales Zotero, Docear Y Mendeley: Características Y Prestaciones
    Análisis comparativo de los gestores bibliográficos sociales Zotero, Docear y Mendeley: características y prestaciones. Comparative analysis of social bibliographic management software Zotero , Docear and Mendeley: fea- tures and benefits. Montserrat López Carreño Universidad de [email protected] Resumen Abstract Se realiza una aproximación al origen y evolu- A comparative analysis between the free biblio- ción de la gestión bibliográfica personal, esta- graphic reference management software. These bleciendo una cronología de la aparición de los bibliographic reference managers offer to the gestores bibliográficos más populares, resaltan- users the ability to retrieve, store, edit and dis- do sus características y evidenciando su utilidad seminate bibliographic information. An approach en el ámbito académico-científico. Para ello, hay is made to the origin and evolution of these sys- que tratar de analizar conceptos y procesos di- tems, establishing a chronology since the ap- rectamente relacionados con los gestores bi- pearance of the most popular bibliographic bliográficos personales, productos objeto de management software, highlighting their fea- estudio, tales como fuentes de información bi- tures and usefulness in academic research. We bliográfica y la normalización bibliográfica y, have analyzed concepts and processes directly fundamentalmente del ámbito científico, donde related to personal bibliographic management son productos esenciales en la formalización de applications under study, such as the sources of la su producción para su posterior difusión. Para bibliographic information and the set of biblio- ello se realiza un análisis comparativo entre los graphic standards employed by them, essential gestores de referencia bibliográficos gratuitos aspects of the scientific field, where these prod- Zotero, Docear y Mendeley, habiendo consegui- ucts have a great importance in the formalization do identificar algunas diferencias significativas of the scientific production for later broadcast .
    [Show full text]
  • Tcpware for Openvms User's Guide
    TCPware® for OpenVMS User's Guide Part Number: N-6004-60-NN-A January 2014 This manual describes how to use the network services provided by the TCPware for OpenVMS product. Revision/Update: This is a revised manual. Operating System/Version: VAX/VMS V5.5-2 or later, OpenVMS VAX V6.0 or later, OpenVMS Alpha V6.1 or later, or OpenVMS I64 V8.2 or later Software Version: 6.0 Process Software Framingham, Massachusetts USA i The material in this document is for informational purposes only and is subject to change without notice. It should not be construed as a commitment by Process Software. Process Software assumes no responsibility for any errors that may appear in this document. Use, duplication, or disclosure by the U.S. Government is subject to restrictions as set forth in subparagraph (c)(1)(ii) of the Rights in Technical Data and Computer Software clause at DFARS 252.227-7013. The following third-party software may be included with your product and will be subject to the software license agreement. Network Time Protocol (NTP). Copyright © 1992 by David L. Mills. The University of Delaware makes no representations about the suitability of this software for any purpose. Point-to-Point Protocol. Copyright © 1989 by Carnegie-Mellon University. All rights reserved. The name of the University may not be used to endorse or promote products derived from this software without specific prior written permission. Redistribution and use in source and binary forms are permitted provided that the above copyright notice and this paragraph are duplicated in all such forms and that any documentation, advertising materials, and other materials related to such distribution and use acknowledge that the software was developed by Carnegie Mellon University.
    [Show full text]
  • Managing Bibliographies with LATEX Lapo F
    36 TUGboat, Volume 30 (2009), No. 1 Multiple citations can be added by separating Bibliographies with a comma the bibliographic keys inside the same \cite command; for example \cite{Goossens1995,Kopka1995} Managing bibliographies with LATEX Lapo F. Mori gives Abstract (Goossens et al., 1995; Kopka and Daly, 1995) The bibliography is a fundamental part of most scien- Bibliographic entries that are not cited in the tific publications. This article presents and analyzes text can be added to the bibliography with the the main tools that LATEX offers to create, manage, \nocite{key} command. The \nocite{*} com- and customize both the references in the text and mand adds all entries to the bibliography. the list of references at the end of the document. 2.2 Automatic creation with BibTEX A 1 Introduction BibTEX is a separate program from LTEX that allows creating a bibliography from an external database Bibliographic references are an important, sometimes (.bib file). These databases can be conveniently fundamental, part of academic documents. In the shared by different LATEX documents. BibTEX, which past, preparation of a bibliography was difficult and will be described in the following paragraphs, has tedious mainly because the entries were numbered many advantages over the thebibliography envi- and ordered by hand. LATEX, which was developed ronment; in particular, automatic formatting and with this kind of document in mind, provides many ordering of the bibliographic entries. tools to automatically manage the bibliography and make the authors’ work easier. How to create a 2.2.1 How BibTEX works A bibliography with LTEX is described in section2, BibTEX requires: starting from the basics and arriving at advanced 1.
    [Show full text]
  • Les LGRB : Quels Formats ?
    Les LGRB : quels formats ? Urfist Lyon - 17 novembre 2014 - Thomas Colombéra 1 • Définitions • Rapport de tendances • L’existence de standards • Le contexte intégratif et collaboratif 2 un format, des formats, … de quoi parle-t-on ? “un agencement structuré des données numériques sur un support lors de leur production, leur affichage, leur stockage sur ce support, leur compression, impression ou diffusion.” (Arlette Boulogne, 2004) 3 Formats : de stockage, d’export / d’import de liste Utilisateur Logiciel Logiciel Base de données 4 Formats les plus utilisés, les plus disponibles… rapport de tendances source de la comparaison : http://en.wikipedia.org/wiki/Comparison_of_reference_management_software 5 Les formats d’export 6 BiBTeX RIS Endnote/Refer/BibIX Medline MODS XML EndNoteXML COinS DocBook OpenDocument unAPI SQL database PDF RTF CSV SRW XML via SRU TEI SQLite Delicious CSL formatted HTML BiBTeXML PostScript HTML XML Bookends Ovid TSV Reference Manager XML Wikipedia Citation Templates 7 0 7,5 15 22,5 30 Les formats d’import 8 BiBTeX RIS Endnote/Refer/BibIX PubMed ISI Medline SciFinder Ovid CSA Copac MODS XML MARC Jstor Signets de navigateur SilverPlatter PDF BiBTeXML Biblioscape Biomail Inspec MSBib PDF with XMP annot. REPEC (NEP) Sixpack Endnote XML Reference Miner Endnote ENW CHM eBook RefWorks RISX DublinCore SUTRS COinS RDF unAPI 0 7,5 15 22,5 30 Les formats de fichier de liste 10 HTML RTF Texte brut PDF RSS DOC LaTeX unAPI XML ODT Atom Oo-CSV PostScript Markdown TEI Presse-Papier 0 7,5 15 22,5 30 Les LGRB, du plus au moins inclusif
    [Show full text]
  • Bibliographic Related Software and Standards Information
    Bibliographic Related Software and Standards Information Last Modified A printer friendly PDF version of this page is available 2006-January-9 biblio-sw.pdf (120 Kb) Table of Contents Open Standards Open Standards Information Citproc Bibliophile MODS Z39.50 and SRW Other Links to Bibliographic Software Information Overview of Personal Bibliographic Software Java B3 JabRef Perl, Python Pbib RISImport.py MS Windows Scholars Aid Nota Bene EndNote ProCite Reference Manager Biblioscape Citit! Bibliographix Windows, Linux, Other Bibutils Bibus LaTeX / BibTeX Linux Bookcase BibTeXML BibML BibX ISDN Search YAZ Zoom RefDB Sixpack bp 1 gBib Pybliographer Kaspaliste The Jurabib package refbase MAC OS X BibDesk Open Standards Information Check this web site on Open Standards and software for bibliographies and cataloging. This site provides a quick overview of the landscape of open-source bibliographic software; both where is has been, but more importantly, where it may yet go, and may be better than this page. http://wwwsearch.sourceforge.net/bib/openbib.html A good source on open standards in regards to XML, is the OASIS site http://xml.coverpages.org, and of course www.w3.org - home of the internet. CitProc The Openoffice Bibliographic project is proposing to use Bibliographic citation and table generation via XSLT style-sheets using a new process called CiteProc. CiteProc style-sheets provide, for the first time, the opportunity for the creation and distribution of opensource bibliographic style definitions that are not specific to a particular word-processor or bibliographic package. Also see BiblioX for technical discusion of this approach. We now have working examples. Bibliophile Bibliophile is an initiative to align the development of bibliographic databases for the web.
    [Show full text]
  • Toward a Resource-Based Model of Open Source Software Development Communities
    Florida State University Libraries Electronic Theses, Treatises and Dissertations The Graduate School 2007 Is Bigger Always Better?: Toward a Resource-Based Model of Open Source Software Development Communities. Glen Sagers Follow this and additional works at the FSU Digital Library. For more information, please contact [email protected] THE FLORIDA STATE UNIVERSITY COLLEGE OF BUSINESS IS BIGGER ALWAYS BETTER? TOWARD A RESOURCE-BASED MODEL OF OPEN SOURCE SOFTWARE DEVELOPMENT COMMUNITIES. By GLEN SAGERS A Dissertation submitted to the Department of Management Information Systems in partial fulfillment of the requirements for the degree of Doctor Of Philosophy Degree Awarded: Spring Semester, 2007 The members of the Committee approve the Dissertation of Glen Sagers defended on March 12th, 2007. David Paradice _____________________________ Professor Directing Dissertation G. Stacy Sirmans _____________________________ Outside Committee Member Molly Wasko _____________________________ Committee Member Katherine Chudoba _____________________________ Committee Member Approved: _______________________________________________________ Caryn Beck-Dudley, Dean, College of Business The Office of Graduate Studies has verified and approved the above named committee members. ii ACKNOWLEDGEMENTS I would like to thank my dissertation co-chairs, Drs. Molly Wasko and David Paradice for their patience, support and guidance in this process, and for their aid in clarifying my thinking on this study. I would like to thank my dissertation committee, Drs. G. Stacy Sirmans and Kathy Chudoba, for their advice for making this research more focused. I appreciate the advice and commiseration of my fellow doctoral students. I would like to thank Dr. Terry Dennis and the faculty at Illinois State University for believing in me during the final stages of this dissertation.
    [Show full text]
  • Panorama Comparatif De Logiciels De Gestion De Références
    PRÉSENTATION DE LOGICIELS DE GESTION DE RÉFÉRENCES BIBLIOGRAPHIQUES I/ A quoi servent les logiciels de gestion de références bibliographiques ? CONTEXTE D’UTILISATION ET DEFINITION A QUOI SERVENT LES LOGICIELS DE GESTION DE RÉFÉRENCES BIBLIOGRAPHIQUES ? Actuellement avec l’avènement d’Internet, on assiste à une multiplication de bases de données en ligne, de catalogues, de revues électroniques, de moteurs de recherche et de sites web. Avec l’essor de la production scientifique, on assiste à une véritable explosion de références bibliographiques disponibles sur différentes bases de données en ligne. Grâce aux moteurs de recherche comme Google Scholar, on peut trouver parfois des articles scientifiques en pdf disponibles sur internet. A QUOI SERVENT LES LOGICIELS DE GESTION DE RÉFÉRENCES BIBLIOGRAPHIQUES ? C’est dans ce contexte que les logiciels de gestion de références bibliographiques sont indispensables pour le chercheur et pour tout travail de recherche. A la base de toute recherche, il y a un travail de stockage et de gestion de références bibliographiques et un travail de citation de ces références dans la Bibliographie de la Thèse ou du Mémoire. A QUOI SERVENT LES LOGICIELS DE GESTION DE RÉFÉRENCES BIBLIOGRAPHIQUES ? Les logiciels de gestion de références bibliographiques éditent une bibliographie conforme aux normes de référencement bibliographique. Ces logiciels proposent également des styles conformes aux normes de présentation des revues scientifiques. Les normes : Norme ISO 690 (Norme AFNOR NF Z 44-005 de 1987 pour les publications imprimées, livres et publications en série, leurs parties composantes et les brevets.) Norme ISO 690-2 (Norme AFNOR Z 44-005-2 de 1997 pour les documents électroniques.) Les styles : Les logiciels de gestion de références bibliographiques proposent différents styles : Vancouver est un style qui est souvent utilisé dans les thèses en Médecine.
    [Show full text]