Finding Scientific Articles in a Large Digital Archive: BioStor and the Biodiversity Heritage Library

Roderic D M Page∗1

1 Institute of Biodiversity, Animal Health and Comparative Medicine, College of Medical, Veterinary and Life Sciences, Graham Kerr Building, University of Glasgow, Glasgow G12 8QQ, UK

Email: Roderic D M Page∗- [email protected]; ∗Corresponding author

Nature Precedings : hdl:10101/npre.2010.4928.1 Posted 21 Sep 2010

Abstract

Background: The Biodiversity Heritage Library (BHL) is a large digital archive of legacy biological literature, comprising over 31 million pages scanned from books, monographs, and journals. During the digitisation process basic metadata about the scanned items is recorded, but not article-level metadata. Given that the article is the standard unit of citation, this makes it difficult to locate cited literature in BHL. Adding the ability to easily find articles in BHL would greatly enhance the value of the archive.

Results: A service was developed to locate articles in BHL based on matching article metadata to BHL metadata using approximate string matching, regular expressions, and string alignment. This article finding service is exposed as a standard OpenURL resolver on the BioStor web site http://biostor.org/openurl/. This resolver can be used on the web, or called by bibliographic tools that support OpenURL.

Conclusions: BioStor provides tools for extracting, annotating, and visualising articles from the Biodiversity Heritage Library. BioStor is available from http://biostor.org/.

Background

In July 2010 Lambert et al. [1] published a paper in Nature that described an extinct sperm whale possessing the biggest bite of any tetrapod known. They named this formidable predator Leviathan melvillei, the genus name Leviathan being derived from the Hebrew 'Livyatan', the species name honouring Herman Melville (author of Moby Dick [2]). As appropriate as this name was, it quickly ran foul of the rules of zoological nomenclature because Leviathan had already been used 169 years earlier for an extinct species of mammoth [3]. Although the name Leviathan Koch [3] had lapsed into obscurity (as a synonym of Mammut Blumenbach), its existence meant the newly discovered whale had to be renamed, which it duly was a month after the original publication [4].

The fate of Lambert et al.'s Leviathan illustrates a significant challenge facing researchers finding and naming new species: the discoverability of existing names. In the absence of a global register of all taxonomic names that have ever been published, a researcher about to publish a new name may struggle to establish that it hasn't already been used. Zoological nomenclature dates from 1758, botanical nomenclature from 1753, hence a comprehensive list of taxonomic names must survey some 250 years of literature [5], much of which is obscure and may not exist in digital form. Digitising this legacy literature is the goal of the Biodiversity Heritage Library (BHL) [6], a consortium of natural history museum libraries, botanic libraries, and research institutions. The bulk of this digitisation is carried out by the Internet Archive [7], which scans books (broadly defined to include bound issues of journals), creating a set of electronic files for each scanned item. These files include images of individual pages, and text extracted from those pages using Optical Character Recognition (OCR). BHL takes these files (together with the output from the scanning projects of individual BHL members), indexes them by bibliographic metadata and taxonomic names, and makes the content available on its web site [6] (both as web pages and web services). Although the bulk of BHL's scanning activities focus on pre-1923 content that is out of copyright, it also holds a considerable amount of post-1923 content contributed by its member institutions, notably publications by various natural history museums.

Finding articles in BHL

The BHL archive comprises "items" corresponding to physical objects which are scanned. Items are grouped together into "titles". A single volume book corresponds to a single title and item, whereas a multi-volume work, such as a journal, will comprise several items grouped under the same title (Fig. 1). Noticeably absent from the BHL model is the standard unit of scientific citation, the article. The inability to easily find articles in BHL is a substantial obstacle to integrating this legacy literature into mainstream scientific publishing. The goal of BioStor is to provide tools to locate and extract articles and associated metadata from the BHL archive. To locate articles BioStor makes use of bibliographic databases, assembled by authors, journal editors, and the taxonomic community. The assumption is that metadata for most of the articles of interest will exist in one or other of these databases, hence the task becomes one of matching these bibliographic references to the scanned content in BHL.

For most modern articles the triple of journal name, volume, and starting page is sufficient to uniquely identify an article [8], and tools such as CrossRef's OpenURL resolver [9] can take this triple and discover whether a Digital Object Identifier (DOI) [10] exists for that article. Publishers make use of this tool to map the literature cited in a manuscript to the corresponding DOI. In an ideal world the BHL model of (title, item, page) (Fig. 1) would map exactly to (journal, volume, page), such that an individual journal would correspond to a title in BHL, and each volume of that journal would be a separate item. Given that BHL stores page numbers for each scanned page, locating articles would then be trivial and linking to BHL content could be readily integrated into existing publication processes, as well as bibliographic management tools that make use of CrossRef's services to augment user-provided metadata (e.g., Mendeley [11]).

Unfortunately, the actual mapping between articles and BHL content is often rather more complicated. Large articles (e.g., monographs) may be treated as separate "titles" (effectively as if they were books), rather than parts of the same title. A contributing library may have bound several volumes of a journal together, such that a single "item" may comprise multiple volumes. Volume numbers themselves may not be unique within a journal. The Annals and Magazine of Natural History (ISSN 0374-5481), published from 1828 until 1967 (being succeeded by the Journal of Natural History, ISSN 0022-2933), is divided into 13 "series", each series numbering its volumes from one onwards. Hence, "volume 1" of Annals and Magazine of Natural History may refer to any one of 13 volumes published over a span of 138 years [12]. Journals also differ in whether pagination is unique within a volume, or within parts of a volume. For example, in the journal Arkiv för Zoologi (ISSN 0004-2110) each article starts on page 1, so that the triple (Arkiv för Zoologi, 13, 1) may refer to [13], [14], or any of 23 other articles in volume 13 of that journal.

Discovering articles also assumes that the pagination in BHL is complete and correct, and that one side of a sheet of paper corresponds to a "page". BHL records the page number of regular pages, but not of pages that are classified as special in some way, such as title pages or tables of contents. For example, page 1 of Lynch et al. [15] is recorded in BHL as being the title page, without any number, which will frustrate efforts to find this article by starting page alone.

While the triple (journal, volume, starting page) is usually sufficient to locate the start of an article, we want to recover all the pages in the article, hence we need both the starting and ending pages. Ideally we could then extract the corresponding set of page images from BHL and join them together to form an article. However, it is not uncommon for older articles to have discontinuous physical pagination, for example by having plates inserted between pages in the text. In some publications, such as Isis von Oken, the text on a page forms two columns, each with its own page number (Fig. ??), hence one physical page need not equate to a bibliographic page.

Metadata matters

Given that locating articles in an archive of legacy literature such as BHL is a non-trivial task, it is worth considering why such an undertaking is worthwhile, beyond integrating BHL with existing citation practices. Indeed, one could argue that, given that the OCR text for BHL content has been indexed by taxonomic name, the need for indexing by article has been greatly reduced; the user could simply search by taxonomic name and find the content they require. This would be sufficient for many users, especially if we were confident that BHL had cor- [...] discover why the relative usage frequency of these two names changed in the previous century.

Construction and content

A local copy of the core BHL tables (Fig. 4) was created in MySQL using the data dump provided by BHL (http://www.biodiversitylibrary.org/data/data.zip). Page images and OCR text for individual pages are retrieved as needed using the BHL API and cached locally (together with a thumbnail of the page image).

Locating an article

BioStor provides an OpenURL [22] resolver service to locate articles in BHL. At a minimum the resolver requires the journal name, volume, and starting page of the article being searched for. It may also make use of series and date, if these are provided. This service first checks whether the article already exists in the BioStor database.
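
As an illustration of the minimum input the resolver requires, the following Python sketch assembles a query URL from a journal name, volume, and starting page, with series and date as optional extras. The resolver address is the one given in the abstract. The query keys (genre, title, volume, spage, series, date) follow common OpenURL usage and are assumptions here, not a documented description of the BioStor interface. The function name build_openurl is hypothetical. The sketch only builds the URL and sends no request.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Resolver address as stated in the abstract.
BIOSTOR_OPENURL = "http://biostor.org/openurl/"


def build_openurl(journal, volume, spage, series=None, date=None):
    """Build a resolver URL for one citation.

    The key names used below are assumed (common OpenURL usage);
    they are not taken from BioStor documentation.
    """
    if not journal or volume is None or spage is None:
        raise ValueError("journal, volume and starting page are required")
    params = {
        "genre": "article",
        "title": journal,
        "volume": str(volume),
        "spage": str(spage),
    }
    # Series and date are optional hints that can help disambiguate
    # journals whose volume numbers repeat across series.
    if series is not None:
        params["series"] = str(series)
    if date is not None:
        params["date"] = str(date)
    # urlencode percent-encodes non-ASCII characters as UTF-8.
    return BIOSTOR_OPENURL + "?" + urlencode(params)


if __name__ == "__main__":
    # The ambiguous triple discussed in the text.
    url = build_openurl("Arkiv för Zoologi", 13, 1)
    print(url)
    # Decode the query string again to show what would be sent.
    print(parse_qs(urlsplit(url).query))
```

Because a triple such as (Arkiv för Zoologi, 13, 1) can match many articles, a caller built on a sketch like this should expect zero, one, or several candidates in the response, not a single guaranteed hit.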
