
MASARYKOVA UNIVERZITA
FAKULTA INFORMATIKY

Citation Crawling

BACHELOR THESIS

Lukáš Lalinský

Brno, 2009

Declaration

Hereby I declare that this paper is my original authorial work, which I have worked out on my own. All sources, references and literature used or excerpted during the elaboration of this work are properly cited and listed with complete reference to the due source.

Advisor: doc. RNDr. Petr Sojka, Ph.D.

Acknowledgement

I’d like to thank my advisor, doc. RNDr. Petr Sojka, Ph.D., for his continued support and valuable comments on the text of this thesis. I’d also like to thank Mgr. Martin Šárfy for his help with my technical questions regarding the DML-CZ infrastructure.

Abstract

The aim of this bachelor thesis is to design and implement a new citation look-up service for the DML-CZ project. This requires getting familiar with existing standards in the area of working with bibliographic data, as well as technical capabilities of existing bibliographic databases and search en- gines. Based on this knowledge we can design an application for reference look-up that yields better results than the currently available solution used in DML-CZ. The main part of the thesis focuses on the description of this application.

Keywords

citation, reference, bibliography, search, article, journal, digital library, cross-linking, DML-CZ

Contents

1 Cross-Linking Standards
1.1 Digital Object Identifier
1.2 OpenURL
2 Bibliographic File Formats
2.1 BibTeX
2.2 AMSRefs
2.3 EndNote
2.4 ZMATH XML
2.5 ZMATH ASCII
3 Academic Databases and Search Engines
3.1 Zentralblatt MATH
3.2 Mathematical Reviews and MathSciNet
3.2.1 MRef
3.2.2 MathSciNet Search
3.3 CrossRef
3.4 Scopus
3.5 Google Scholar
4 CiteCrawl Implementation
4.1 Overview
4.2 Input Normalization
4.3 Searching
4.3.1 Similarity Comparison
4.3.2 Zentralblatt MATH
4.3.3 MathSciNet
4.4 XML Web Service
4.5 Caching
4.6 Possible Further Improvements
5 Conclusion
A Contents of the Attached CD

Introduction

New research is always based on previous knowledge. It's the natural way to move forward, to discover new ideas, to solve more complex problems. This is especially true for academic research. Knowledge is documented in the form of papers, journal articles or other types of publications, and these publications typically refer to other publications, which served as sources of knowledge during the research. This allows the reader to get all the previous knowledge that the author had. This becomes even more important with the arrival of digital libraries. Instead of searching in classical libraries, which have limited resources, the referenced publications can be just a few clicks away.

First attempts to build digital libraries happened in the late 1960s [11]. They faced various technical issues, such as the high cost of computers or the lack of computer networks. Technical development moved further in the next thirty years, and since the early 1990s there have been no fundamental barriers to building digital libraries. Electronic storage has become cheaper than paper, high-speed computer networks became available, and personal computer displays became more pleasant to use. Digital libraries can offer the user various advantages over traditional libraries, but a particularly important one is linking to other digital libraries or other digital resources.

The Czech Digital Mathematics Library (DML-CZ) project aims to digitize, preserve and make easily accessible a large part of the mathematical literature published in the Czech lands since the 19th century [5]. As of November 2009, it provides 275 000 pages of scientific text by almost 10 000 authors [20]. Many articles and references are linked to two major mathematical publication databases, Zentralblatt MATH and MathSciNet. This provides valuable information to the reader, as they can read abstracts, reviews and in some cases full texts of the referenced publications, without having to look them up via external services.
The main goal of this thesis is to build a new service for looking up references from DML-CZ in external databases and explore new ways of cross-linking publications in digital libraries. DML-CZ already has a tool

for searching in Zentralblatt MATH and MathSciNet, but this tool is rather simplistic and the ratio of matched publications can be improved.

Everything in this thesis is related to the DML-CZ project in some way, but there are two kinds of text. The first three chapters are more theoretical and describe existing standards, file formats or projects that are useful for implementing the look-up service or for the DML-CZ project in general. In the fourth chapter I describe the implementation of the project that was created along with this text. It explains the decisions that were made, how the application works, and the problems that came up during the development. The source code of the project can be found on the attached CD.

Chapter 1

Cross-Linking Standards

The need for cross-linking between digital libraries and other resources is not new. According to some definitions, it’s even one of the primary goals for building digital libraries. [21] This chapter describes a few established standards that aim at solving parts of this problem.

1.1 Digital Object Identifier

Digital Object Identifier1 (DOI) is a unique identifier that publishers assign to articles or book chapters to provide a permanent link to the publication [4]. The main idea is that the DOI link stays stable, even if the publisher's web site moves or changes. The identifiers are maintained by multiple registration agencies, each of which specializes in a specific area of publications. A DOI registration agency contains a database of DOIs with their corresponding metadata and URLs to publisher web pages with additional information, such as the full bibliographic citation, the abstract and full-text access to the publication.

DOI names follow a syntax standardized as ANSI/NISO Z39.84-2000 [16]. A DOI consists of two components, referred to as the prefix and the suffix, separated by a forward slash. The prefix represents information about the registration agency. It itself consists of two components separated by a period: the Directory Code (currently always 10) and the Registrant's Code, which is assigned to DOI registration agencies. The suffix is a unique identifier assigned by the registration agency; there are no requirements for its format. For example, in the DOI 10.1000/182, 10.1000 is the prefix and 182 is the suffix.

A centralized service at http://dx.doi.org/ is used to resolve any DOI to its original URL. The DOI mentioned in the example above refers to the PDF version of The DOI® Handbook. In order to link to the PDF, we can use the URL http://dx.doi.org/10.1000/182.
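The prefix/suffix structure and the central resolver can be sketched in a few lines of Python; the helper names are illustrative, not part of any existing tool:

```python
def split_doi(doi: str) -> tuple[str, str]:
    """Split a DOI into its prefix and suffix at the first forward slash."""
    prefix, _, suffix = doi.partition("/")
    return prefix, suffix

def doi_resolver_url(doi: str) -> str:
    """Build a resolver link via the central dx.doi.org service."""
    return "http://dx.doi.org/" + doi
```

For the example DOI, `split_doi("10.1000/182")` yields the prefix `10.1000` (Directory Code `10`, Registrant's Code `1000`) and the suffix `182`.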

1. http://www.doi.org/


The DOI system is still new, but it’s already used by a large number of publishers and it’s gaining popularity. I expect that at least for new publica- tions, it will become the most important linking standard, if it isn’t already.

1.2 OpenURL

OpenURL is a standard way of linking to a web site, primarily designed for libraries. It allows automatic construction of URLs that can provide more information about a publication. It can be seen as a standardized cross-site search interface. The URL can contain basic metadata fields, such as the title or the author name, but also very specific fields, like the ISBN, which directly identify the publication. It is mainly used by libraries, so most documentation is specific to books or other publications, but OpenURL in general can be used for anything.

Every OpenURL-enabled web site has a special URL, called the "endpoint", which accepts OpenURL requests. This is used as the base for the constructed link. The request itself is defined by HTTP GET parameters, which differ based on the version of OpenURL used. There are two major versions, which are not compatible with each other.
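As a rough illustration, an OpenURL 1.0 (Z39.88-2004) link can be assembled by prefixing metadata keys with `rft.` and appending them to the endpoint; the endpoint URL below is a made-up example:

```python
from urllib.parse import urlencode

def build_openurl(endpoint: str, metadata: dict) -> str:
    """Sketch of constructing an OpenURL 1.0 link: metadata fields are
    passed as 'rft.'-prefixed HTTP GET parameters."""
    params = {"url_ver": "Z39.88-2004"}
    for key, value in metadata.items():
        params["rft." + key] = value
    return endpoint + "?" + urlencode(params)

# Hypothetical endpoint; a real library resolver would go here.
link = build_openurl("http://resolver.example.org/openurl",
                     {"aulast": "Clifford", "date": "1948"})
```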

Chapter 2

Bibliographic File Formats

Most of the file formats mentioned in this chapter were designed to serve the purpose of a small local citation database used while writing. With the arrival of bibliographic servers and search engines, it became important to have a way to easily import data from these services, and the formats started to be used for exchanging data as well. This chapter doesn't attempt to list all existing bibliographic formats, only a few that are commonly used as data exchange formats.

2.1 BibTeX

BIBTEX is a software package and a file format for managing bibliographies, usually used in combination with the LATEX document preparation system [12]. It was designed by Oren Patashnik and Leslie Lamport in 1985 [14]. The BIBTEX file format is one of the most commonly used formats for bibliographic information exchange. All bibliographic databases mentioned in the following chapter support export to BIBTEX.

A file in the BIBTEX format consists of multiple entries. An entry represents a publication and consists of the type of publication, a citation key, which is used in LATEX documents for referring to this particular entry, and a list of tags, such as author or title, describing the publication. BIBTEX defines a standard set of tags, but unknown tags are ignored, which makes it easy to extend the format with additional information [13].

An example of bibliographic data in the BIBTEX format, including non-standard Mathematical Reviews specific fields:

@article{MR0042380,
  author={Green, J. A.},
  title={On the structure of semigroups},
  journal={Ann. of Math. (2)},
  volume={54},
  year={1951},
  pages={163--172},
  issn={0003-486X},
  mrclass={09.1X},
  mrnumber={MR0042380 (13,100d)},
  mrreviewer={J. Riguet},
}

2.2 AMSRefs

AMSRefs1 is another LATEX package for working with bibliographic information, developed by the American Mathematical Society [9]. Unlike BibTeX, AMSRefs is controlled completely through LATEX. It uses a file format that is a valid LATEX document and entries can be printed directly. The inner structure is similar to the BibTeX file format; the main difference is that the information is wrapped in a LATEX command:

\bib{MR0042380}{article}{
  author={Green, J. A.},
  title={On the structure of semigroups},
  journal={Ann. of Math. (2)},
  volume={54},
  date={1951},
  pages={163--172},
  issn={0003-486X},
  review={\MR{0042380 (13,100d)}},
}

2.3 EndNote

EndNote2 is a commercial software package for working with bibliographies, developed by Thomson Reuters [6]. It uses multiple proprietary file formats. One of them is a simple plain-text format, which is normally used for importing data from external sources into EndNote. Data in this format is provided by most bibliographic databases with EndNote support. It is a line-based format, where each field starts with a percent mark followed by a single character. This character denotes which field the line represents. The next character is always a space and the rest of the line is the value of the field. If the value is too long, it can continue on the next line. In that case the line doesn't start with a percent mark, but with a space. Here is an example of the file format:

1. http://www.ams.org/tex/amsrefs.html 2. http://www.endnote.com/


%0 Journal Article
%A Green, J. A.
%T On the structure of semigroups
%J Ann. of Math. (2)
%V 54
%D 1951
%P 163--172
%@ 0003-486X
%L MR0042380 (13,100d)

Unlike the two previous formats, which are mostly self-documenting, EndNote doesn’t use full field names. The most commonly used field codes are:

%0 Publication type (e.g. Book, Journal Article).
%A Author(s).
%T Title.
%J Journal.
%V Volume.
%D Publication date/year.
%P Page numbers.
%@ ISBN3/ISSN4.
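A reader for this tagged format follows directly from the rules above: a `%` plus tag character plus space starts a field, and a leading space continues the previous one. This is a minimal sketch, not EndNote's own parser:

```python
def parse_endnote(text: str) -> dict:
    """Parse one record in EndNote's tagged plain-text format.

    Each field line starts with '%', a single tag character and a space;
    continuation lines start with a space instead of '%'.
    """
    fields = {}
    last_tag = None
    for line in text.splitlines():
        if line.startswith("%") and len(line) >= 3 and line[2] == " ":
            last_tag = line[1]
            # A tag may repeat (e.g. several %A authors), so keep a list.
            fields.setdefault(last_tag, []).append(line[3:])
        elif line.startswith(" ") and last_tag is not None:
            # Long values continue on lines starting with a space.
            fields[last_tag][-1] += " " + line.strip()
    return fields
```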

2.4 ZMATH XML

The Zentralblatt MATH database can return search results in various formats; one of them is based on XML5. The XML document contains the search query and a list of matching records from the ZMATH database. Here is an example of a returned XML document:

<zbml>
  <query>ti:Semigroups containing minimal ideals</query>
  <answers from="1" to="1" total="1">
    <rec>
      <an>Zbl 0038.01103</an>
      <au>Clifford, A.H.</au>
      <ti>Semigroups containing minimal ideals.</ti>
      <la>EN</la>
      <so>Am. J. Math. 70, 521-526 (1948).</so>
      <is>ISSN 0002-9327; ISSN 1080-6377</is>
      <py>1948</py>
      <dt>J</dt>
      <ut>Group theory</ut>
    </rec>
  </answers>
</zbml>

3. International Standard Book Number
4. International Standard Serial Number
5. Extensible Markup Language

The document element is always zbml, which has two child elements. The query element contains information about the performed search query and the answers element contains a list of matching records in the ZMATH database. There are a few attributes on the answers element. When the to attribute matches the total attribute, it means that all matching records are contained in the XML document. Otherwise, from refers to the first returned record, to to the last returned record and total to the total number of matching results. Each returned record is represented by the rec element and its children, which can contain the following information [8]:

an Identifier in the Zentralblatt MATH database.
au Author(s).
ti Title.
la Language, represented as ISO 639-1 alpha-26.
py Publication year.
so Source data, includes journal or serial title, volume and issue numbers, page numbers, publisher and publication year.
ut Uncontrolled terms and keywords.
is ISBN/ISSN.
dt Document type. The character J represents a journal article, B a book and A a book article.

Fields contain LATEX-formatted text, which is mainly used for mathemat- ical symbols and accented characters. Multiple values in fields such as au or is are separated by a semicolon.
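Reading such a document is straightforward with Python's standard XML parser. The sketch below assumes the element structure described in this section (zbml, answers with from/to/total attributes, rec with children named by field code); the sample document and function name are illustrative:

```python
import xml.etree.ElementTree as ET

SAMPLE = """\
<zbml>
  <query>ti:Semigroups containing minimal ideals</query>
  <answers from="1" to="1" total="1">
    <rec>
      <an>Zbl 0038.01103</an>
      <au>Clifford, A.H.</au>
      <ti>Semigroups containing minimal ideals.</ti>
      <py>1948</py>
      <dt>J</dt>
    </rec>
  </answers>
</zbml>
"""

def parse_zbml(document: str):
    """Return (complete, records) for one ZMATH XML result document."""
    root = ET.fromstring(document)
    answers = root.find("answers")
    # When the "to" attribute equals "total", all matches are included.
    complete = answers.get("to") == answers.get("total")
    records = [{child.tag: child.text for child in rec}
               for rec in answers.findall("rec")]
    return complete, records
```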

2.5 ZMATH ASCII

ZMATH can also return results in a plain-text format. It is derived from the richer XML format, but doesn’t contain information about the search query and numbers of matching records. Multiple records are separated by one blank line and each record has the following format:

6. http://www.loc.gov/standards/iso639-2/php/code_list.php

an: Zbl 0038.01103
au: Clifford, A.H.
ti: Semigroups containing minimal ideals.
la: EN
so: Am. J. Math. 70, 521-526 (1948).
py: 1948
dt: J
ut: Group theory

Each line starts with a two-character code followed by a colon and one space. The rest of the line contains the field value. If the next line starts with four spaces, it means that the value on that line is a continuation of the previous field, not a new one. The field codes and their meanings are the same as in the case of ZMATH XML, which is described in the previous section.
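The line-based rules above translate into a short parser; a minimal sketch, not ZMATH's own code:

```python
def parse_zmath_ascii(record: str) -> dict:
    """Parse one ZMATH ASCII record: 'xx: value' lines, with
    continuation lines indented by four spaces."""
    fields = {}
    last = None
    for line in record.splitlines():
        if line.startswith("    ") and last is not None:
            # Four leading spaces mean the previous field continues.
            fields[last] += " " + line.strip()
        elif len(line) > 4 and line[2:4] == ": ":
            last = line[:2]
            fields[last] = line[4:]
    return fields
```

Multiple records separated by blank lines can then be handled by splitting the input on `"\n\n"` first.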

Chapter 3

Academic Databases and Search Engines

This chapter focuses on selected academic databases that either focus on mathematics or are general services with good coverage of mathematical articles. There is a short overview of each project, but the largest part of the text concentrates on their usefulness with regard to the DML-CZ project: whether they provide valuable information, what kind of programmable services they offer and how it is possible to work with these services.

3.1 Zentralblatt MATH

Zentralblatt MATH1 (ZMATH) is a large abstracting and reviewing service for mathematical articles. It is available online and also in printed form. The service was founded in 1931 by Otto Neugebauer. Currently it is maintained by FIZ Karlsruhe, the European Mathematical Society and the Heidelberger Akademie der Wissenschaften [10]. In 2003, 200 000 entries from an earlier service, Jahrbuch über die Fortschritte der Mathematik (JFM), were added to Zentralblatt MATH. As of September 2009, the Zentralblatt MATH database contains about 2.8 million entries from more than 3500 journals and 1100 serials [1].

The online version of the ZMATH database provides an extensive search engine. It uses a custom query language, allowing search within specific fields, phrase search and boolean operators for constructing complex queries. There are multiple frontends on the website for easier access to specific search types, but they all use the mentioned query language internally. The website allows search results to be returned in multiple formats: HTML, PDF, DVI and PostScript for displaying purposes and ZMATH XML, ZMATH ASCII and BibTeX for further machine processing. The query language supports the following field codes:

bi Basic index, includes some of the fields below, such as au or ti, but also fields that are not searchable on their own, such as abstracts or citations.

1. http://www.zentralblatt-math.org/zmath/

au Author(s)/Editor(s).
ti Title.
so Source.
py Publication year.
cc Classification code.
rv Reviewer.
dt Document type.
an Accession number.

3.2 Mathematical Reviews and MathSciNet

Mathematical Reviews2 is a journal and online database of articles in mathematics and theoretical computer science, published by the American Mathematical Society (AMS) [7]. It was founded in 1940 by Otto Neugebauer, after Zentralblatt MATH was taken over during the war and he moved to the United States. There are no exact statistics on the number of articles in the database, but according to the "About MathSciNet" page, it contains over 2 million entries. In November 2009, Mathematical Reviews started presenting DOI links for more than a million electronic publications in the database.

3.2.1 MRef

MRef is a tool for looking up references in the MathSciNet database. Based on unstructured reference text, it searches the MathSciNet database and, if there is a unique match, it returns a standard reference including a link to the MathSciNet database. It can provide output in HTML, TEX, BIBTEX and AMSRefs formats.

While this is a simple tool without too many options, it can provide good results in most cases without much effort on the client side. The main advantage of using MRef, as opposed to a more generic search engine, is that MRef is aware of the internals of the MathSciNet search engine and can preprocess the input data in a way that increases the chances of finding a valid match. MRef is the default option for looking up MR links in the web service described in this thesis. If MRef fails, the service tries to use the regular search engine, but as mentioned above, this is not necessary in most cases.

2. http://www.ams.org/mr-database


3.2.2 MathSciNet Search

The MathSciNet database also provides a traditional search engine. It allows users to search for data in specific fields of the database, restricting the results by publication year and type. The search query can contain up to 4 fields connected by standard boolean operators. Fields that can be searched include:

AUCN Author.
ICN Author/Related.
TI Title.
RT Review text.
JOUR Journal.
IC Institution code.
SE Series.
CC MSC primary/secondary.
PC MSC primary.
MR MR number.
RVCN Reviewer.
ALLF Any field.
REFF References.

Search results can be exported to the BibTeX, AMSRefs and EndNote file formats.

3.3 CrossRef

CrossRef3 is a DOI® registration agency for scholarly and professional publications. It was established in 2000 as an independent membership association, with the goal of making cross-linking of online scholarly literature efficient and reliable using the DOI system [3]. In addition to resolving DOIs, CrossRef provides various services for looking up DOIs based on existing metadata [2], which makes it similar to the other bibliographic databases mentioned in this chapter. In addition to a standard search engine, where

3. http://www.crossref.org/

the user can search in various database fields, CrossRef also offers a service similar to MRef by Mathematical Reviews. It allows the user to type in the reference line as it is written in the bibliography section and let the search engine come up with the best match.

In theory, the DOI system with a good search engine could replace the need for DML-CZ to go through other bibliographic databases in order to provide additional information about cited publications. If all articles and citations in the DML-CZ database had DOIs assigned, they could be used to link directly to the publisher's website or other websites using the DOI system. Unfortunately, that's not easily possible in practice. Even though CrossRef aims to be a distributed system, it relies on publishers to assign DOIs to their publications. A large part of the DML-CZ database consists of historical publications, for which there are no publishers to register DOIs anymore. Using centralized databases that collect metadata about publications independently of publishers is thus a better option. For a digital library consisting mainly of new publications, however, it would be an interesting alternative.

3.4 Scopus

Scopus4 is a commercial abstract and citation database, collecting articles and publications from multiple fields, not only mathematics. It claims to be the largest bibliographic database. Coverage of mathematical publications, especially older ones, is not as good as in the case of Zentralblatt MATH or Mathematical Reviews, though. It is owned by Elsevier and all Scopus services are available only to subscribers. Articles in the database have DOI information where available, so it's easy to link from the Scopus database to the original publisher's web site. For linking from external web sites to Scopus, it supports the OpenURL standard. The web site offers an extensive search engine, which makes use of the very rich metadata of entries in the Scopus database. It also integrates Scirus, a science-specific web search engine by Elsevier. The web site allows export to the BibTeX, RIS and RefWorks file formats.

4. http://info.scopus.com/


3.5 Google Scholar

While not a traditional bibliographic database, Google Scholar5 is a useful resource for finding academic publications. Most of Google Scholar's search results link to the original publisher's web site, which allows users to see more details about the returned publications, in many cases also the full-text version or at least information on how to access it. Many publications appear in multiple places on the internet. Instead of including duplicates in search results, Google Scholar tries to group multiple publication locations and provide only a link to the most important one by default, but it is possible to list all locations of a publication, too. Google Scholar also contains a citation database. In 2007, Google Scholar started a private program for the digitization of old journal articles [17]. For citation purposes, it can export search results to the BibTeX, RefWorks and RefMan file formats.

Unfortunately, as of December 2009, the Google AJAX6 Search API7 doesn't support Google Scholar. This makes it unsuitable for automatic searches and publication referencing. Due to the large number of indexed publications and the fact that it searches in multiple sources, it is still useful to the user to just provide Google Scholar search links.

5. http://scholar.google.com/ 6. Asynchronous JavaScript and XML 7. Application Programming Interface

Chapter 4

CiteCrawl Implementation

The main goal of this thesis was to develop an application for the DML-CZ project to search for cited publications on large bibliographic servers. A working title for the application is CiteCrawl, but it's possible that this name will be changed in the future. This chapter describes the application in detail, the approaches it uses for searching in various bibliographic databases and how the code is structured.

4.1 Overview

CiteCrawl has two operational modes. It can be used on demand as an HTTP1-based web service or it can be used in batch mode. These are implemented as two different frontends for the same backend classes, which means that from the functional point of view they are completely equivalent. Both the backend and the frontend functionality are implemented in the Python2 programming language.

The main use case for CiteCrawl is looking up publications in external bibliographic databases. All of the databases considered during the development assign unique identifiers to publications, so the output is a list of identifiers together with information about the database they point to. This output is looked up based on scanned citation metadata from the DML-CZ database. For example, CiteCrawl gets a request with information that the cited publication's author is "W. J. Gibbs", the title is "Tensors in Electrical Machine Theory" and the title in the citation entry is followed by the text "Chapman and Hall, London, 1952". Based on this information, it will run a few external searches and return that the publication has identifier MR52986 in MathSciNet and Zbl 0049.26704 in Zentralblatt MATH. Having these identifiers, the DML-CZ website can provide links to MathSciNet and Zentralblatt MATH on article pages.
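The worked example above can be written out as data. The exact field names of the request and the shape of the result are illustrative assumptions, not CiteCrawl's actual wire format:

```python
# Citation metadata as received from the DML-CZ database
# (field names are hypothetical).
citation = {
    "author": "W. J. Gibbs",
    "title": "Tensors in Electrical Machine Theory",
    "source": "Chapman and Hall, London, 1952",
}

# One identifier per external database in which the publication
# was found (result shape is hypothetical).
result = [
    {"database": "MathSciNet", "id": "MR52986"},
    {"database": "Zentralblatt MATH", "id": "Zbl 0049.26704"},
]
```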

1. Hypertext Transfer Protocol (RFC 2616) 2. http://www.python.org/


In the on-demand mode, this happens when the user enters a page on the DML-CZ website. JavaScript code on the page starts an AJAX request against the CiteCrawl web service and when it receives the response, it dynamically creates links to the databases in which the publication was found. Doing searches in external databases is an expensive operation, so in order to return results as quickly as possible it is necessary to cache identifiers from previous searches. This cache will be pre-populated, so it is expected that most requests can be handled directly from the cache without doing external searches. Only requests for new entries from the database will require an external search. In addition to using the web service, it is possible to make CiteCrawl run against the local DML-CZ database and periodically do searches for entries that are not found in the cache. This is used to pre-populate the cache, so that on-demand requests take as little time as possible.
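The cache-first flow can be sketched as follows. The function name, the cache key and the `searchers` mapping are illustrative assumptions, not CiteCrawl's actual interfaces:

```python
def look_up(citation: dict, cache: dict, searchers: dict) -> list:
    """Serve a request from the cache when possible; otherwise run the
    external searches and remember the answer for later requests."""
    key = (citation.get("author"), citation.get("title"))
    if key in cache:
        return cache[key]          # answered without external searches
    results = []
    for database, search in searchers.items():
        identifier = search(citation)
        if identifier is not None:
            results.append((database, identifier))
    cache[key] = results           # pre-populates future look-ups
    return results
```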

4.2 Input Normalization

The metadata in the DML-CZ database comes from scanned images, which are converted to textual form using OCR software. This process is not error-free. CiteCrawl therefore can't assume the input it gets is always correct and has to be able to handle errors. In practice this means that before doing any searches, it checks the text for commonly occurring errors and performs some normalization procedures.

A significant part of the DML-CZ database cites Russian publications and presumably the OCR3 software was configured to recognize Cyrillic text correctly. The problem is that some Cyrillic letters are visually very similar, or even identical, to Latin letters, but their Unicode code points differ. For example, the Cyrillic small letter "а" is visually the same as the Latin small letter "a" in most fonts, but because they have different code points, it's not easily possible to search for a word that contains all Latin letters except one Cyrillic "а" and expect search engines to return results as if it were the Latin "a". Therefore it's necessary to look for these kinds of errors before searching in external databases. The implementation uses a regular expression that matches single Cyrillic letters surrounded by Latin letters and replaces them with their Latin equivalents. This is done for all Cyrillic/Latin letter pairs that have the same visual representation.

The databases CiteCrawl searches in are mostly English-centered and this affects how they represent foreign names and titles. Czech author

3. Optical Character Recognition

17 4. CITECRAWL IMPLEMENTATION names and publication titles are usually entered without diacritic marks. While ignoring diacritics as part of query pre-processing is a common tasks that modern search engines do, some of the used services were overly sen- sitive about this. During experimenting with the MathSciNet search engine, I’ve found that it returns matches that contain diacritics even for queries without diacritics, but not the other way around. This made it clear that removing diacritics before searching will help with returning more relevant results. Diacritic marks are removed by a function that converts input to the NFKD Unicode normal form, which decomposes letters with diacritic marks into two characters, and then removes all combining characters. The result is text without diacritics. There are more sophisticated algorithms for doing this, but for the purposes of CiteCrawl this is sufficient. Another issue is with author names. Listing authors with their first names abbreviated is the standard way of writing citations. For example, the name "Alfred Hoblitzelle Clifford" will be normally written as "A.H. Clifford". On the other hand, bibliographic databases usually use author’s full names and this sometimes causes problems. Every time CiteCrawl has to include the author name in a search query, it tries to remove first name abbreviations and leave only last names. In the example above, the name would be reduced just to "Clifford". This increases the chance of not missing a valid match. There are also other normalization rules, but there are applied only where necessary, not in general on all text for all search engines. These will be described in the following section, when explaining searching in particular bibliographic databases.

4.3 Searching

CiteCrawl uses an optimistic approach to searching in bibliographic databases. At first, it tries to provide as much of the original information as possible and hopes that the external search engine will return the most relevant match. Ideally there would be only one match with the searched publication. Due to errors in the input data, limitations of the search engine, or other reasons, this doesn't work in all cases. As a fallback, CiteCrawl constructs search queries with less information, in the hope that this excludes the erroneous metadata and the search engine will return the desired publication among other results.

Depending on how little of the original information was left in the search query, this method might result in a large number of unrelated publications being returned. It is important

to compare the results with the original metadata, to determine how similar they are. Results with similarity below a defined threshold are ignored. If there are any results left, the algorithm will pick the result with the highest similarity.

4.3.1 Similarity Comparison

The basic idea of the function that compares publication metadata is to compare individual metadata fields separately and then calculate a weighted sum of the fields' similarities. That means that different fields don't have the same impact on the total similarity. For example, the publication title is considered much more important than the source information by the algorithm. Many articles are published multiple times and this specific example allows matching the same article published in a different journal, if this is the only entry found in the database. On the other hand, if there are multiple entries for the article in the database, it will match the right one, because the source information still has some impact on the result.

Before it's possible to run a string similarity algorithm on two metadata fields, it's necessary to convert them to an equal form. LATEX is the standard syntax for writing mathematical text and it is to be expected that it is used as much as possible. This includes using special commands for writing accented characters. Unfortunately, both MathSciNet and Zentralblatt MATH use this syntax even in formats that have native support for Unicode, like XML. This makes it complicated to compare strings from different sources, because one might contain Unicode code points and the other one might contain LATEX commands. CiteCrawl therefore contains a function that understands a small subset of the LATEX syntax and can convert LATEX-encoded accented characters to their Unicode equivalents. The function uses a regular expression to match a few specific commands followed by a single letter. When it finds a match, it takes the letter, appends the appropriate Unicode diacritical mark and uses NFKC Unicode normalization to combine the two characters into one. After this, it calculates a generalized Levenshtein edit distance between the two strings.
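The accent-conversion step can be sketched as follows. The table covers only a few accent commands as an illustration; the subset CiteCrawl actually handles may differ:

```python
import re
import unicodedata

# A small subset of LaTeX accent commands mapped to Unicode
# combining marks (illustrative, not exhaustive).
ACCENTS = {"'": "\u0301", "`": "\u0300", '"': "\u0308", "^": "\u0302",
           "v": "\u030c", "c": "\u0327"}

def latex_accents_to_unicode(text: str) -> str:
    """Convert LaTeX-encoded accented letters, e.g. \\v{S}, to Unicode."""
    pattern = re.compile(r"\\(['`\"^vc])\{?([a-zA-Z])\}?")
    def repl(match):
        combined = match.group(2) + ACCENTS[match.group(1)]
        # NFKC merges the letter and combining mark into one code point.
        return unicodedata.normalize("NFKC", combined)
    return pattern.sub(repl, text)
```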
This algorithm counts the number of "steps" that are necessary to transform one string into the other. The standard Levenshtein edit distance allows insertion, deletion and substitution as the operations that can be performed in each step. [15] The generalized algorithm adds transposition of two adjacent characters to the set of operations; this variant is sometimes called the Damerau-Levenshtein distance. Similarity in the range [0, 1] is calculated from the edit distance using the following formula:


similarity = 1 − distance / max(length_a, length_b)

I tried a few different algorithms for calculating the similarity during development, but the generalized Levenshtein distance produced the best results. The first version used Dice's coefficient, which is based on the number of bigrams found in both strings. [19] Another alternative was sequence matching, which is normally used for calculating string differences, but can also be used to get a number representing the similarity of two strings; the similarity is then based on the length of the longest common subsequence. [18]

The total similarity is calculated as a weighted sum of the individual field similarities. Fields that are not present in either of the metadata records are not included in the sum. Assuming we have n fields, the sum is calculated as follows:

total = ( Σ_{i=1}^{n} weight_i · similarity_i ) / ( Σ_{i=1}^{n} weight_i )

Weights are defined differently for each field. The current implementation uses these values:

Field    Weight
title    2.2
author   1.7
year     1.0
source   0.1
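The distance, the similarity formula and the weighted total above can be put together in a short sketch. The Damerau-Levenshtein implementation below is a standard restricted (adjacent-transposition) version, not CiteCrawl's actual code; the weights are the ones from the table.

```python
def damerau_levenshtein(a, b):
    """Edit distance with insertion, deletion, substitution and
    transposition of two adjacent characters."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

def similarity(a, b):
    """similarity = 1 - distance / max(length_a, length_b)"""
    if not a and not b:
        return 1.0
    return 1.0 - damerau_levenshtein(a, b) / max(len(a), len(b))

WEIGHTS = {"title": 2.2, "author": 1.7, "year": 1.0, "source": 0.1}

def total_similarity(field_scores):
    """Weighted total over the fields present in both records;
    field_scores maps field name -> per-field similarity."""
    weight_sum = sum(WEIGHTS[f] for f in field_scores)
    if weight_sum == 0:
        return 0.0
    return sum(WEIGHTS[f] * s for f, s in field_scores.items()) / weight_sum

print(similarity("recieve", "receive"))
print(total_similarity({"title": 1.0, "author": 0.9, "year": 1.0}))
```

Missing fields simply drop out of both sums, so absent metadata neither helps nor hurts the score.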

4.3.2 Zentralblatt MATH

Zentralblatt MATH offers only a standard search engine (see section 3.1 for details about its abilities). Subscribers have access to a web service that accepts the search query in HTTP GET parameters and returns results in a machine-readable form. CiteCrawl uses the ZMATH XML format, described in section 2.4, because Python has native support for parsing XML documents, which makes it the easiest format to write a parser for. Although the web service offers paging, CiteCrawl always uses only the results from the first page. The number of results per page can be specified in the search query.


The first query it tries includes the publication title, year and the author's name, for example "Semigroups containing minimal ideals au:(Clifford) py:(1948)". It issues an HTTP GET request with this query and some additional static parameters and receives an XML document in the response. This document is parsed and transformed into a list of metadata objects. Each of these objects is compared to the requested metadata, and if there is a match above a certain similarity threshold, currently set to 0.68, this match is returned as the result.

If the first query fails to find a valid match, CiteCrawl tries to search only by the publication title. For short titles this might yield too many results, but usually the title is unique enough not to return too many irrelevant matches. The same filtering process is applied to the results as in the case of the first query.

As a last resort, if the previous query also fails, it tries to find publications by the author name and the publication year. This only works for authors with a few publications. For authors with many publications, it would be necessary to iterate over multiple result pages, and the first page would most likely not contain the right publication anyway.
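The three-stage fallback can be sketched as follows. The query syntax mirrors the example above ("au:" and "py:" prefixes); the `search` and `best_match` parameters are placeholders for the HTTP request and the similarity filtering, so this is a structural sketch rather than CiteCrawl's actual module.

```python
def build_queries(title, author, year):
    """Queries in decreasing order of specificity, as described above."""
    return [
        "%s au:(%s) py:(%s)" % (title, author, year),  # title + author + year
        title,                                         # title only
        "au:(%s) py:(%s)" % (author, year),            # author + year, last resort
    ]

def zmath_lookup(title, author, year, search, best_match):
    """Try each query in turn; stop at the first acceptable match."""
    for query in build_queries(title, author, year):
        results = search(query)       # HTTP GET, first result page only
        match = best_match(results)   # similarity filter, threshold 0.68
        if match is not None:
            return match
    return None

print(build_queries("Semigroups containing minimal ideals", "Clifford", "1948")[0])
```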

4.3.3 MathSciNet

Mathematical Reviews provides a small service for looking up references in the MathSciNet database, called MRef. This tool is described in section 3.2.1 and works surprisingly well, so CiteCrawl uses it as its first attempt. It joins all fields with semicolons and, without much normalization or pre-processing, passes the resulting string to MRef via an HTTP GET request. MRef returns an HTML page with either one match or nothing. It can also return the BibTeX or AMSRefs formats, but these are wrapped in HTML as well, so parsing HTML is unavoidable. Python doesn't offer any standard solution for real-world HTML parsing, so CiteCrawl uses a third-party package called Beautiful Soup4. If MRef returns a match, it is accepted without metadata comparison and CiteCrawl doesn't proceed with regular searches.

Otherwise it tries to use the MathSciNet search engine, which goes through a similar process to the Zentralblatt MATH case. The first query includes the title, year and author name. Only the title and the author name are included in the actual search query; the publication year is processed by MathSciNet as a separate field with different semantics.
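Building the MRef query string could look like the sketch below. The field order is an assumption on my part; the text only says that the fields are joined with semicolons and that MRef accepts a free-form reference string.

```python
def build_mref_reference(title, authors, year, source):
    """Join all available fields into one semicolon-separated
    reference string, skipping empty fields."""
    parts = ["; ".join(authors), title, source, str(year)]
    return "; ".join(p for p in parts if p)

print(build_mref_reference("Semigroups containing minimal ideals",
                           ["A. H. Clifford"], 1948,
                           "Amer. J. Math. 70"))
```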

4. http://www.crummy.com/software/BeautifulSoup/


This makes it possible to search for year ranges, for example, but CiteCrawl doesn't use this functionality.

Extracting the search results is a little more complex. The search engine that MathSciNet provides is targeted at human users and doesn't offer a fully machine-readable format. It can output the results in formats that are importable into various bibliographic tools, but the user is expected to copy and paste the data from the returned HTML page. The MathSciNet module therefore first parses the HTML page and locates and extracts the main plain-text section from it. Once it has the text, it can use a standard parser to get the metadata. The EndNote file format is used, because it is one of the simpler formats to parse and it provides all the details necessary for CiteCrawl.
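A minimal parser for the EndNote tagged format might look like this. The sketch assumes the common tagged layout where each line carries a "%X value" field (%A author, %T title, %J journal, %D year); repeated tags such as %A accumulate into lists.

```python
def parse_endnote(text):
    """Parse EndNote tagged records into a dict of tag -> list of values."""
    record = {}
    for line in text.splitlines():
        if line.startswith("%") and len(line) >= 2:
            tag, _, value = line.partition(" ")
            record.setdefault(tag, []).append(value.strip())
    return record

sample = """%A Clifford, A. H.
%T Semigroups containing minimal ideals
%J American Journal of Mathematics
%D 1948"""

print(parse_endnote(sample)["%T"])
```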

4.4 XML Web Service

Above the federated search functionality sits an HTTP-based frontend. It is a simple web service that accepts queries via HTTP GET requests and responds with an XML document. This web service was meant to be called via AJAX from the DML-CZ web site. Queries are passed in the query string part of the URL, encoded in the format described by RFC 2396⁵, for example http://example.com/search?field1=value1&field2=value2. The accepted fields are:

Field    Description
title    Publication title.
author   Author name.
year     Publication year.
source   Source information (journal, publisher).

Each field can be used multiple times if it is necessary to pass multiple values. Realistically, this is only useful for the author name, because one publication can have multiple authors.

After the web service receives these values, it starts the search process. First it looks in the cache to find out whether it contains a matching entry. This cache is explained in detail in the next section. Results from different sources are stored separately, so it is possible that the identifier from one source is cached while the identifier from another is not. The service collects the identifiers it finds in the cache and uses the search modules to look up the missing ones. When this process is done and some results were found, they are stored in the cache

5. Uniform Resource Identifiers (URI): Generic Syntax

for future use. The next step is putting the information together and serializing the resulting XML document, for example:
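The example document itself did not survive extraction. The following is a plausible reconstruction built with the standard library from the element names the text describes (source, number, text, url); the root and per-result element names and all values are invented for illustration.

```python
import xml.etree.ElementTree as ET

def serialize_results(results):
    """Serialize a list of result dicts into the response XML.
    Element names follow the description in the text; the wrapper
    names are assumptions."""
    root = ET.Element("results")
    for r in results:
        result = ET.SubElement(root, "result")
        for tag in ("source", "number", "text", "url"):
            ET.SubElement(result, tag).text = r[tag]
    return ET.tostring(root, encoding="unicode")

xml_doc = serialize_results([{
    "source": "zbl",
    "number": "0000.00000",          # invented identifier
    "text": "Zbl 0000.00000",
    "url": "http://www.zentralblatt-math.org/zmath/en/",
}])
print(xml_doc)
```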

It is a simple XML format with a list of results, which includes the source database code (source), the raw external identifier (number), the formatted external identifier (text) and the URL of the external web site (url) for each result.

From a more technical point of view, the web service is implemented as a WSGI6 application. WSGI is the standard way to interact with web servers from Python. It is an abstract standard that doesn't directly talk to a web server, but only describes the Python API. There are multiple adapters that allow hosting WSGI applications on commonly used web servers, such as Apache7. With adapters for standards like CGI or FastCGI, it is possible to run CiteCrawl on virtually any web server.
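A minimal WSGI entry point for such a service could look like the sketch below. The `lookup_and_serialize` stub stands in for the whole cache-then-search-then-XML pipeline described above; only the WSGI plumbing and the handling of repeatable query fields follow the text.

```python
from urllib.parse import parse_qs

def lookup_and_serialize(fields):
    """Stub for the real search + serialization step (assumption)."""
    titles = fields.get("title", [])
    return "<results><!-- %d title value(s) --></results>" % len(titles)

def application(environ, start_response):
    # parse_qs keeps repeated fields (e.g. several "author" values) as lists
    fields = parse_qs(environ.get("QUERY_STRING", ""))
    data = lookup_and_serialize(fields).encode("utf-8")
    start_response("200 OK", [("Content-Type", "application/xml"),
                              ("Content-Length", str(len(data)))])
    return [data]

# Exercise the application without a web server:
print(b"".join(application({"QUERY_STRING": "title=Semigroups&author=Clifford"},
                           lambda status, headers: None)).decode("utf-8"))
```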

4.5 Caching

The application uses two layers of caching. On the lower level, it caches all HTTP requests to external servers. This is implemented by generating a SHA-18 hash of the requested URL and storing the response body in a file named after that hash. Before making an HTTP request, the application checks whether the file exists in the cache directory; if it does, it reads the data from this file and returns it to the calling function. Otherwise it performs an actual HTTP request and, before returning the data, creates a new file for future requests to the same URL. This caching mechanism is primarily useful during development, because the process of fine-tuning the similarity comparison settings or the search query building functions requires repeatedly running tests over the whole data set.

6. Web Server Gateway Interface - http://wsgi.org/wsgi/
7. http://httpd.apache.org/
8. US Secure Hash Algorithm 1 (RFC 3174)

There is also a database-based caching mechanism on the application level. It uses a local SQLite9 database, chosen mainly for its simple setup and maintenance. The database has two tables with the following structure:
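The schema listing did not survive extraction. The following is a plausible reconstruction: the table and column names come from the description in the text, while the exact column types and the linking column are assumptions.

```python
import sqlite3

# Plausible reconstruction of the two-table cache schema (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE query (
        id INTEGER PRIMARY KEY,
        hash TEXT UNIQUE NOT NULL          -- SHA-1 of the query metadata
    );
    CREATE TABLE result (
        id INTEGER PRIMARY KEY,
        query_id INTEGER NOT NULL REFERENCES query (id),
        source TEXT NOT NULL,              -- external database code
        number TEXT,                       -- identifier; NULL = search failed
        modified DATE NOT NULL,            -- for expiring old rows
        pinned BOOLEAN NOT NULL DEFAULT 0  -- manually corrected, never expire
    );
""")
```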

The query table represents a cached request. It is identified by a SHA-1 hash of the query metadata, which is stored in the hash column. Each row in this table has one or more linked rows in the result table; these represent the identifiers found in external databases. The source column says which database the identifier comes from and the number column is the actual identifier. A row is inserted into this table even for unsuccessful searches; in that case a NULL is stored in the number column. The reason for this is that the application should not search over and over again for publications that cannot be found. The NULL value tells it that it tried the search in the past and it failed. However, publications that were previously not found might eventually appear in the database, so the application cannot assume that if a publication was not found once, it will never be found. This is where the modified column is used. It stores the date when the row was last modified, which is necessary for deleting rows that are older than a certain number of days. The pinned column tells the application that even though a row is old, it should not be deleted or overwritten. This is used for results that were manually corrected.
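The lower-level file cache described at the start of this section can be sketched as follows. The function and directory names are illustrative; the `fetch` parameter stands in for the real HTTP request, so the demo runs without network access.

```python
import hashlib
import os
import tempfile

CACHE_DIR = tempfile.mkdtemp()  # illustrative; the real app uses a fixed directory

def cached_fetch(url, fetch):
    """Return the cached response body for url, or call fetch(url)
    and store the result in a file named by the SHA-1 of the URL."""
    name = hashlib.sha1(url.encode("utf-8")).hexdigest()
    path = os.path.join(CACHE_DIR, name)
    if os.path.exists(path):
        with open(path, "rb") as f:
            return f.read()
    data = fetch(url)
    with open(path, "wb") as f:
        f.write(data)
    return data

print(cached_fetch("http://example.com/a", lambda u: b"response"))
print(cached_fetch("http://example.com/a", lambda u: b"IGNORED"))  # served from cache
```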

4.6 Possible Further Improvements

The system I've implemented works well for Zentralblatt MATH and Mathematical Reviews, but DML-CZ already had some links to these databases before. A very useful improvement would be to add support for DOIs. As these are distributed identifiers, they can be available from multiple sources. Near the end of the work on this thesis, Mathematical Reviews started providing DOI links for many publications. A good start would be reading

9. http://www.sqlite.org/

the DOI information from MathSciNet's export. This would require an API change that allows search modules to return different types of identifiers.

The primary source for scientific DOIs is CrossRef, and it would be interesting to investigate how good the DOI coverage in Mathematical Reviews is, i.e. whether there are many publications that have DOIs assigned but that Mathematical Reviews doesn't list them for. If there are many such publications, it might be useful to add CrossRef as a new data source. Google Scholar has DOI support too, but there is no API to automate searches. If Google extends the Google AJAX Search API to include Google Scholar, or if a new API for Google Scholar appears in the future, it would be a good source not only of DOIs, but also of other kinds of links.

Chapter 5

Conclusion

I decided to work on a bachelor thesis in this area because I enjoy search technologies and had some previous experience with them, which helped me a lot in the development of the CiteCrawl application. While I was already familiar with searching in textual databases, I had to learn in detail about a few specific search engines, how they work and what their limitations are. This was valuable for getting a more complete picture of the different search needs of different applications.

Working with bibliographic data was completely new to me, though. I familiarized myself with the basics and then expanded my knowledge into areas related to digital libraries and bibliographic databases. This covers the most commonly used exchange file formats and linking standards. I had to learn about the organization of bibliographic databases, their usual structure and how they interact with external applications.

The most important part of this thesis is the CiteCrawl application. I believe I've managed to improve the coverage of citation links from DML-CZ to both Zentralblatt MATH and Mathematical Reviews. In the case of Zentralblatt MATH the coverage almost doubled, which I consider a success. In addition to finding more results, CiteCrawl avoids false matches by post-filtering the results from the external services, which reduces the number of incorrect links. These improvements come at a price, though: an increased average time to look up a reference. The extensive caching helps, but it could still become a problem for the usefulness of the web service. Fortunately, if it turns out to be a problem, it is always possible to go back to the original batch processing approach, which CiteCrawl also supports.

Bibliography

[1] About Zentralblatt MATH. http://www.zentralblatt-math.org/zmath/en/about/. Online, accessed September 15, 2010.

[2] crossref.org :: affiliate benefits. http://www.crossref.org/04intermediaries/index.html. Online, accessed November 3, 2009.

[3] crossref.org :: library benefits. http://www.crossref.org/03libraries/index.html. Online, accessed November 3, 2009.

[4] The digital object identifier system. http://www.doi.org/. Online, accessed January 1, 2010.

[5] DML-CZ - Frequently Asked Questions (FAQ). http://dml.cz/FAQ. Online, accessed November 23, 2009.

[6] EndNote - About Thomson Reuters EndNote. http://www.endnote.com/enabout.asp. Online, accessed October 2, 2009.

[7] MathSciNet - Mathematical Reviews on the Web. http://www.ams.org/mathscinet/help/about.html.

[8] Search in Zentralblatt MATH Database. http://www.zentralblatt-math.org/zmath/en/help/search/. Online, accessed September 16, 2009.

[9] The amsrefs package. http://www.ams.org/tex/amsrefs.html. Online, accessed October 1, 2009.

[10] Zentralblatt MATH - ZMATH Online Database. http://www.zentralblatt-math.org/zmath/en/. Online, accessed September 15, 2009.

[11] W.Y. Arms. Digital libraries. The MIT Press, 2001.

[12] A. Feder. BibTeX. http://www.bibtex.org/. Online, accessed September 10, 2009.

[13] A. Feder. BibTeX format description. http://www.bibtex.org/Format/. Online, accessed September 10, 2009.

[14] S.R. Kruk and B. McDaniel. Semantic Digital Libraries. Springer Pub- lishing Company, Incorporated, 2009.

[15] V.I. Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. In Soviet Physics-Doklady, volume 10, 1966.

[16] N. Paskin. The DOI® Handbook. International DOI Foundation, Inc., 2006.

[17] B. Quint. Changes at Google Scholar: A Conversation With Anurag Acharya. http://newsbreaks.infotoday.com/nbReader.asp?ArticleId=37309. Online, accessed December 29, 2009.

[18] J.W. Ratcliff and D. Metzener. Pattern matching: The gestalt approach. Dr. Dobb’s Journal, 7:46, 1988.

[19] C.J. Van Rijsbergen. Information Retrieval. Butterworth-Heinemann, 1979.

[20] J. Rákosník. Česká digitální matematická knihovna. Technical report, Matematický ústav AV ČR, Praha, 2009.

[21] D. Stern. Digital libraries: philosophies, technical design considera- tions, and example scenarios. Routledge, 1999.

Appendix A

Contents of the Attached CD

• Source code of the CiteCrawl application.

• Electronic version of this thesis (PDF document and LaTeX source files).
