
MASARYKOVA UNIVERZITA
FAKULTA INFORMATIKY

Citation Crawling

BACHELOR THESIS

Lukáš Lalinský

Brno, 2009

Declaration

Hereby I declare that this paper is my original authorial work, which I have worked out on my own. All sources, references and literature used or excerpted during the elaboration of this work are properly cited and listed with complete reference to the due source.

Advisor: doc. RNDr. Petr Sojka, Ph.D.

Acknowledgement

I’d like to thank my advisor, doc. RNDr. Petr Sojka, Ph.D., for his continued support and valuable comments on the text of this thesis. I’d also like to thank Mgr. Martin Šárfy for his help with my technical questions regarding the DML-CZ infrastructure.

Abstract

The aim of this bachelor thesis is to design and implement a new citation look-up service for the DML-CZ project. This requires getting familiar with existing standards in the area of working with bibliographic data, as well as technical capabilities of existing bibliographic databases and search en- gines. Based on this knowledge we can design an application for reference look-up that yields better results than the currently available solution used in DML-CZ. The main part of the thesis focuses on the description of this application.

Keywords

citation, reference, bibliography, search, article, journal, digital library, cross-linking, DML-CZ

Contents

1 Cross-Linking Standards
1.1 Digital Object Identifier
1.2 OpenURL
2 Bibliographic File Formats
2.1 BibTeX
2.2 AMSRefs
2.3 EndNote
2.4 ZMATH XML
2.5 ZMATH ASCII
3 Academic Databases and Search Engines
3.1 Zentralblatt MATH
3.2 Mathematical Reviews and MathSciNet
3.2.1 MRef
3.2.2 MathSciNet Search
3.3 CrossRef
3.4 Scopus
3.5 Google Scholar
4 CiteCrawl Implementation
4.1 Overview
4.2 Input Normalization
4.3 Searching
4.3.1 Similarity Comparison
4.3.2 Zentralblatt MATH
4.3.3 MathSciNet
4.4 XML Web Service
4.5 Caching
4.6 Possible Further Improvements
5 Conclusion
A Contents of the Attached CD

Introduction

New research is always based on previous knowledge. It's the natural way to move forward, to discover new ideas, to solve more complex problems. This is especially true for academic research. Knowledge is documented in the form of papers, journal articles or other types of publications, and these publications typically refer to other publications, which served as sources of knowledge during the research. This allows the reader to get all the previous knowledge that the author had. This becomes even more important with the arrival of digital libraries. Instead of searching in classical libraries, which have limited resources, the referenced publications can be just a few clicks away.

First attempts to build digital libraries happened in the late 1960s [11]. They faced various technical issues, such as the high cost of computers or the lack of computer networks. Technical development moved further in the next thirty years, and since the early 1990s there have been no fundamental barriers to building digital libraries. Electronic storage has become cheaper than paper, high-speed computer networks became available, and personal computer displays became more pleasant to use. Digital libraries can offer the user various advantages over traditional libraries, but a particularly important one is linking to other digital libraries or other digital resources.

The Czech Digital Mathematics Library (DML-CZ) project aims to digitize, preserve and make easily accessible a large part of the mathematical literature published in the Czech lands since the 19th century [5]. As of November 2009, it provides 275 000 pages of scientific text by almost 10 000 authors [20]. Many articles and references are linked to two major mathematical publication databases, Zentralblatt MATH and MathSciNet. This provides valuable information to the reader, as they can read abstracts, reviews and in some cases full texts of the referenced publications, without having to look them up via external services.
The main goal of this thesis is to build a new service for looking up references from DML-CZ in external databases and explore new ways of cross-linking publications in digital libraries. DML-CZ already has a tool

for searching in Zentralblatt MATH and MathSciNet, but this tool is rather simplistic and the ratio of matched publications can be improved.

Everything in this thesis is related to the DML-CZ project in some way, but there are two kinds of text. The first three chapters are more theoretical and describe existing standards, file formats or projects that are useful for implementing the look-up service or for the DML-CZ project in general. In the fourth chapter I describe the implementation of the project that was created along with this text. It explains the decisions that were made, how the application works, and the problems that came up during the development. The source code of the project can be found on the attached CD.

Chapter 1

Cross-Linking Standards

The need for cross-linking between digital libraries and other resources is not new. According to some definitions, it’s even one of the primary goals for building digital libraries. [21] This chapter describes a few established standards that aim at solving parts of this problem.

1.1 Digital Object Identifier

Digital Object Identifier1 (DOI) is a unique identifier that publishers assign to articles or book chapters to provide a permanent link to the publication [4]. The main idea is that the DOI link stays stable, even if the publisher's web site moves or changes. The identifiers are maintained by multiple registration agencies, each of which specializes in a specific area of publications. A DOI registration agency contains a database of DOIs with their corresponding metadata and URLs to publisher web pages with additional information, such as the full bibliographic citation, the abstract and full-text access to the publication.

DOI names follow a syntax standardized as ANSI/NISO Z39.84-2000 [16]. A DOI consists of two components, referred to as the prefix and the suffix, separated by a forward slash. The prefix represents information about the registration agency. It itself consists of two components separated by a period: the Directory Code (currently always 10) and the Registrant's Code, which is assigned to DOI registration agencies. The suffix is a unique identifier assigned by the registration agency; there are no requirements for its format. For example, in the DOI 10.1000/182, 10.1000 is the prefix and 182 is the suffix.

A centralized service at http://dx.doi.org/ is used to resolve any DOI to its original URL. The DOI mentioned in the example above refers to the PDF version of The DOI® Handbook. In order to link to the PDF, we can use the URL http://dx.doi.org/10.1000/182.
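The prefix/suffix structure and the central resolver can be sketched in a few lines of Python; the helper names are illustrative, not part of any existing tool:

```python
def split_doi(doi: str) -> tuple[str, str]:
    """Split a DOI into its prefix and suffix at the first forward slash."""
    prefix, _, suffix = doi.partition("/")
    return prefix, suffix

def doi_resolver_url(doi: str) -> str:
    """Build a resolver link via the central dx.doi.org service."""
    return "http://dx.doi.org/" + doi
```

For the example DOI, `split_doi("10.1000/182")` yields the prefix `10.1000` (Directory Code `10`, Registrant's Code `1000`) and the suffix `182`.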

1. http://www.doi.org/


The DOI system is still new, but it’s already used by a large number of publishers and it’s gaining popularity. I expect that at least for new publica- tions, it will become the most important linking standard, if it isn’t already.

1.2 OpenURL

OpenURL is a standard way of linking to a web site, primarily designed for libraries. It allows automatic construction of URLs that can provide more information about a publication. It can be seen as a standardized cross-site search interface. The URL can contain basic metadata fields, such as the title or the author name, but also very specific fields, like the ISBN, which directly identify the publication. It is mainly used by libraries, so most documentation is specific to books or other publications, but OpenURL in general can be used for anything.

Every OpenURL-enabled web site has a special URL, called the "endpoint", which accepts OpenURL requests. This is used as the base for the constructed link. The request itself is defined by HTTP GET parameters, which differ based on the version of OpenURL used. There are two major versions, which are not compatible with each other.
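As a rough illustration, an OpenURL 1.0 (Z39.88-2004) link can be assembled by prefixing metadata keys with `rft.` and appending them to the endpoint; the endpoint URL below is a made-up example:

```python
from urllib.parse import urlencode

def build_openurl(endpoint: str, metadata: dict) -> str:
    """Sketch of constructing an OpenURL 1.0 link: metadata fields are
    passed as 'rft.'-prefixed HTTP GET parameters."""
    params = {"url_ver": "Z39.88-2004"}
    for key, value in metadata.items():
        params["rft." + key] = value
    return endpoint + "?" + urlencode(params)

# Hypothetical endpoint; a real library resolver would go here.
link = build_openurl("http://resolver.example.org/openurl",
                     {"aulast": "Clifford", "date": "1948"})
```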

Chapter 2

Bibliographic File Formats

Most of the file formats mentioned in this chapter were designed to serve the purpose of a small local citation database used while writing. With the arrival of bibliographic servers and search engines, it became important to have a way to easily import data from these services, and the formats started to be used for exchanging data as well. This chapter doesn't attempt to list all existing bibliographic formats, only a few that are commonly used as data exchange formats.

2.1 BibTeX

BIBTEX is a software package and a file format for managing bibliographies, usually used in combination with the LATEX document preparation system [12]. It was designed by Oren Patashnik and Leslie Lamport in 1985 [14]. The BIBTEX file format is one of the most commonly used formats for bibliographic information exchange. All bibliographic databases mentioned in the following chapter support export to BIBTEX.

A file in the BIBTEX format consists of multiple entries. An entry represents a publication and consists of the type of publication, a citation key, which is used in LATEX documents for referring to this particular entry, and a list of tags, such as author or title, describing the publication. BIBTEX defines a standard set of tags, but unknown tags are ignored, which makes it easy to extend the format with additional information [13].

An example of bibliographic data in the BIBTEX format, including non-standard Mathematical Reviews specific fields:

@article{MR0042380,
  author={Green, J. A.},
  title={On the structure of semigroups},
  journal={Ann. of Math. (2)},
  volume={54},
  year={1951},
  pages={163--172},
  issn={0003-486X},
  mrclass={09.1X},
  mrnumber={MR0042380 (13,100d)},
  mrreviewer={J. Riguet},
}

2.2 AMSRefs

AMSRefs1 is another LATEX package for working with bibliographic information, developed by the American Mathematical Society [9]. Unlike BibTeX, AMSRefs is controlled completely through LATEX. It uses a file format that is a valid LATEX document and entries can be printed directly. The inner structure is similar to the BibTeX file format; the main difference is that the information is wrapped in a LATEX command:

\bib{MR0042380}{article}{
  author={Green, J. A.},
  title={On the structure of semigroups},
  journal={Ann. of Math. (2)},
  volume={54},
  date={1951},
  pages={163--172},
  issn={0003-486X},
  review={\MR{0042380 (13,100d)}},
}

2.3 EndNote

EndNote2 is a commercial software package for working with bibliographies, developed by Thomson Reuters [6]. It uses multiple proprietary file formats. One of them is a simple plain-text format, which is normally used for importing data from external sources into EndNote. Data in this format is provided by most bibliographic databases with EndNote support. It is a line-based format, where each field starts with a percent mark followed by a single character. This character denotes which field the line represents. The next character is always a space and the rest of the line is the value of the field. If the value is too long, it can continue on the next line. In that case the line doesn't start with a percent mark, but with a space. Here is an example of the file format:

1. http://www.ams.org/tex/amsrefs.html 2. http://www.endnote.com/


%0 Journal Article
%A Green, J. A.
%T On the structure of semigroups
%J Ann. of Math. (2)
%V 54
%D 1951
%P 163--172
%@ 0003-486X
%L MR0042380 (13,100d)

Unlike the two previous formats, which are mostly self-documenting, EndNote doesn’t use full field names. The most commonly used field codes are:

%0 Publication type (e.g. Book, Journal Article).
%A Author(s).
%T Title.
%J Journal.
%V Volume.
%D Publication date/year.
%P Page numbers.
%@ ISBN3/ISSN4.
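A reader for this tagged format follows directly from the rules above: a `%` plus tag character plus space starts a field, and a leading space continues the previous one. This is a minimal sketch, not EndNote's own parser:

```python
def parse_endnote(text: str) -> dict:
    """Parse one record in EndNote's tagged plain-text format.

    Each field line starts with '%', a single tag character and a space;
    continuation lines start with a space instead of '%'.
    """
    fields = {}
    last_tag = None
    for line in text.splitlines():
        if line.startswith("%") and len(line) >= 3 and line[2] == " ":
            last_tag = line[1]
            # A tag may repeat (e.g. several %A authors), so keep a list.
            fields.setdefault(last_tag, []).append(line[3:])
        elif line.startswith(" ") and last_tag is not None:
            # Long values continue on lines starting with a space.
            fields[last_tag][-1] += " " + line.strip()
    return fields
```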

2.4 ZMATH XML

The Zentralblatt MATH database can return search results in various formats; one of them is based on XML5. The XML document contains the search query and a list of matching records from the ZMATH database. Here is an example of a returned XML document:

<zbml>
  <query>ti:Semigroups containing minimal ideals</query>
  <answers from="1" to="1" total="1">
    <rec>
      <an>Zbl 0038.01103</an>
      <au>Clifford, A.H.</au>
      <ti>Semigroups containing minimal ideals.</ti>
      <la>EN</la>
      <so>Am. J. Math. 70, 521-526 (1948).</so>
      <is>ISSN 0002-9327; ISSN 1080-6377</is>
      <py>1948</py>
      <dt>J</dt>
      <ut>Group theory</ut>
    </rec>
  </answers>
</zbml>

3. International Standard Book Number
4. International Standard Serial Number
5. Extensible Markup Language

The document element is always zbml, which has two child elements. The query element contains information about the performed search query and the answers element contains a list of matching records in the ZMATH database. There are a few attributes on the answers element. When the to attribute matches the total attribute, it means that all matching records are contained in the XML document. Otherwise, from refers to the first returned record, to to the last returned record and total to the total number of matching results. Each returned record is represented by the rec element and its children, which can contain the following information [8]:

an Identifier in the Zentralblatt MATH database.
au Author(s).
ti Title.
la Language, represented as ISO 639-1 alpha-26.
py Publication year.
so Source data, includes journal or serial title, volume and issue numbers, page numbers, publisher and publication year.
ut Uncontrolled terms and keywords.
is ISBN/ISSN.
dt Document type. The character J represents a journal article, B a book and A a book article.

Fields contain LATEX-formatted text, which is mainly used for mathemat- ical symbols and accented characters. Multiple values in fields such as au or is are separated by a semicolon.
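Reading such a document is straightforward with Python's standard XML parser. The sketch below assumes the element structure described in this section (zbml, answers with from/to/total attributes, rec with children named by field code); the sample document and function name are illustrative:

```python
import xml.etree.ElementTree as ET

SAMPLE = """\
<zbml>
  <query>ti:Semigroups containing minimal ideals</query>
  <answers from="1" to="1" total="1">
    <rec>
      <an>Zbl 0038.01103</an>
      <au>Clifford, A.H.</au>
      <ti>Semigroups containing minimal ideals.</ti>
      <py>1948</py>
      <dt>J</dt>
    </rec>
  </answers>
</zbml>
"""

def parse_zbml(document: str):
    """Return (complete, records) for one ZMATH XML result document."""
    root = ET.fromstring(document)
    answers = root.find("answers")
    # When the "to" attribute equals "total", all matches are included.
    complete = answers.get("to") == answers.get("total")
    records = [{child.tag: child.text for child in rec}
               for rec in answers.findall("rec")]
    return complete, records
```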

2.5 ZMATH ASCII

ZMATH can also return results in a plain-text format. It is derived from the richer XML format, but doesn’t contain information about the search query and numbers of matching records. Multiple records are separated by one blank line and each record has the following format:

6. http://www.loc.gov/standards/iso639-2/php/code_list.php

an: Zbl 0038.01103
au: Clifford, A.H.
ti: Semigroups containing minimal ideals.
la: EN
so: Am. J. Math. 70, 521-526 (1948).
py: 1948
dt: J
ut: Group theory

Each line starts with a two-character code followed by a colon and one space. The rest of the line contains the field value. If the next line starts with four spaces, it means that the value on that line is a continuation of the previous field, not a new one. The field codes and their meanings are the same as in the case of ZMATH XML, which is described in the previous section.
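The line-based rules above translate into a short parser; a minimal sketch, not ZMATH's own code:

```python
def parse_zmath_ascii(record: str) -> dict:
    """Parse one ZMATH ASCII record: 'xx: value' lines, with
    continuation lines indented by four spaces."""
    fields = {}
    last = None
    for line in record.splitlines():
        if line.startswith("    ") and last is not None:
            # Four leading spaces mean the previous field continues.
            fields[last] += " " + line.strip()
        elif len(line) > 4 and line[2:4] == ": ":
            last = line[:2]
            fields[last] = line[4:]
    return fields
```

Multiple records separated by blank lines can then be handled by splitting the input on `"\n\n"` first.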

Chapter 3

Academic Databases and Search Engines

This chapter focuses on selected academic databases that either focus on mathematics or are general services with good coverage of mathematical articles. There is a short overview of each project, but the largest part of the text concentrates on their usefulness with regard to the DML-CZ project: whether they provide valuable information, what kind of programmable services they offer and how it is possible to work with these services.

3.1 Zentralblatt MATH

Zentralblatt MATH1 (ZMATH) is a large abstracting and reviewing service for mathematical articles. It is available online and also in printed form. The service was founded in 1931 by Otto Neugebauer. Currently it is maintained by FIZ Karlsruhe, the European Mathematical Society and the Heidelberger Akademie der Wissenschaften [10]. In 2003, 200 000 entries from an earlier service, Jahrbuch über die Fortschritte der Mathematik (JFM), were added to Zentralblatt MATH. As of September 2009, the Zentralblatt MATH database contains about 2.8 million entries from more than 3500 journals and 1100 serials [1].

The online version of the ZMATH database provides an extensive search engine. It uses a custom query language, allowing search within specific fields, phrase search and boolean operators for constructing complex queries. There are multiple frontends on the website for easier access to specific search types, but they all use the mentioned query language internally. The website allows search results to be returned in multiple formats: HTML, PDF, DVI and PostScript for displaying purposes and ZMATH XML, ZMATH ASCII and BibTeX for further machine processing. The query language supports the following field codes:

bi Basic index, includes some of the fields below, such as au or ti, but also fields that are not searchable on their own, such as abstracts or citations.

1. http://www.zentralblatt-math.org/zmath/

au Author(s)/Editor(s).
ti Title.
so Source.
py Publication year.
cc Classification code.
rv Reviewer.
dt Document type.
an Accession number.

3.2 Mathematical Reviews and MathSciNet

Mathematical Reviews2 is a journal and online database of articles in mathematics and theoretical computer science, published by the American Mathematical Society (AMS) [7]. It was founded in 1940 by Otto Neugebauer, after Zentralblatt MATH was taken over during the war and he moved to the United States. There are no exact statistics on the number of articles in the database, but according to the "About MathSciNet" page, it contains over 2 million entries. In November 2009, Mathematical Reviews started presenting DOI links for more than a million electronic publications in the database.

3.2.1 MRef

MRef is a tool for looking up references in the MathSciNet database. Based on unstructured reference text, it searches the MathSciNet database and, if there is a unique match, it returns a standard reference including a link to the MathSciNet database. It can provide output in HTML, TEX, BIBTEX and AMSRefs formats.

While this is a simple tool without too many options, it can provide good results in most cases without much effort on the client side. The main advantage of using MRef, as opposed to a more generic search engine, is that MRef is aware of the internals of the MathSciNet search engine and can preprocess the input data in a way that increases the chances of finding a valid match. MRef is the default option for looking up MR links in the web service described in this thesis. If MRef fails, the service tries to use the regular search engine, but as mentioned above, this is not necessary in most cases.

2. http://www.ams.org/mr-database


3.2.2 MathSciNet Search

The MathSciNet database also provides a traditional search engine. It allows users to search for data in specific fields of the database, restricting the results by publication year and type. The search query can contain up to 4 fields connected by standard boolean operators. Fields that can be searched include:

AUCN Author.
ICN Author/Related.
TI Title.
RT Review text.
JOUR Journal.
IC Institution code.
SE Series.
CC MSC primary/secondary.
PC MSC primary.
MR MR number.
RVCN Reviewer.
ALLF Any field.
REFF References.

Search results can be exported to the BibTeX, AMSRefs and EndNote file formats.

3.3 CrossRef

CrossRef3 is a DOI® registration agency for scholarly and professional publications. It was established in 2000 as an independent membership association, with the goal of making cross-linking of online scholarly literature efficient and reliable using the DOI system [3]. In addition to resolving DOIs, CrossRef provides various services for looking up DOIs based on existing metadata [2], which makes it similar to the other bibliographic databases mentioned in this chapter. In addition to a standard search engine, where

3. http://www.crossref.org/

the user can search in various database fields, CrossRef also offers a service similar to MRef by Mathematical Reviews. It allows the user to type in the reference line as it is written in the bibliography section and let the search engine come up with the best match.

In theory, the DOI system with a good search engine could replace the need for DML-CZ to go through other bibliographic databases in order to provide additional information about cited publications. If all articles and citations in the DML-CZ database had DOIs assigned, they could be used to link directly to the publisher's website or other websites using the DOI system. Unfortunately, that's not easily possible in practice. Even though CrossRef aims to be a distributed system, it relies on publishers to assign DOIs to their publications. A large part of the DML-CZ database consists of historical publications, for which there are no publishers to register DOIs anymore. Using centralized databases that collect metadata about publications independently of publishers is thus a better option. For a digital library consisting mainly of new publications, however, it would be an interesting alternative.

3.4 Scopus

Scopus4 is a commercial abstract and citation database, collecting articles and publications from multiple fields, not only mathematics. It claims to be the largest bibliographic database. Coverage of mathematical publications, especially older ones, is not as good as in the case of Zentralblatt MATH or Mathematical Reviews, though. It is owned by Elsevier and all Scopus services are available only to subscribers. Articles in the database have DOI information where available, so it's easy to link from the Scopus database to the original publisher's web site. For linking from external web sites to Scopus, it supports the OpenURL standard. The web site offers an extensive search engine, which makes use of the very rich metadata of entries in the Scopus database. It also integrates Scirus, a science-specific web search engine by Elsevier. The web site allows export to the BibTeX, RIS and RefWorks file formats.

4. http://info.scopus.com/


3.5 Google Scholar

While not a traditional bibliographic database, Google Scholar5 is a useful resource for finding academic publications. Most of Google Scholar's search results link to the original publisher's web site, which allows users to see more details about the returned publications, in many cases also the full-text version or at least information on how to access it. Many publications appear in multiple places on the internet. Instead of including duplicates in search results, Google Scholar tries to group multiple publication locations and provide only a link to the most important one by default, but it is possible to list all locations of a publication, too. Google Scholar also contains a citation database. In 2007, Google Scholar started a private program for the digitization of old journal articles [17]. For citation purposes, it can export search results to the BibTeX, RefWorks and RefMan file formats.

Unfortunately, as of December 2009, the Google AJAX6 Search API7 doesn't support Google Scholar. This makes it unsuitable for automatic searches and publication referencing. Due to the large number of indexed publications and the fact that it searches in multiple sources, it is still useful to the user to just provide Google Scholar search links.

5. http://scholar.google.com/ 6. Asynchronous JavaScript and XML 7. Application Programming Interface

Chapter 4

CiteCrawl Implementation

The main goal of this thesis was to develop an application for the DML-CZ project to search for cited publications on large bibliographic servers. A working title for the application is CiteCrawl, but it's possible that this name will be changed in the future. This chapter describes the application in detail, the approaches it uses for searching in various bibliographic databases and how the code is structured.

4.1 Overview

CiteCrawl has two operational modes. It can be used on demand as an HTTP1-based web service or it can be used in batch mode. These are implemented as two different frontends for the same backend classes, which means that from the functional point of view they are completely equivalent. Both the backend and the frontend functionality are implemented in the Python2 programming language.

The main use case for CiteCrawl is looking up publications in external bibliographic databases. All of the databases considered during the development assign unique identifiers to publications, so the output is a list of identifiers together with information about the database they point to. This output is looked up based on scanned citation metadata from the DML-CZ database. For example, CiteCrawl gets a request with information that the cited publication's author is "W. J. Gibbs", the title is "Tensors in Electrical Machine Theory" and the title in the citation entry is followed by the text "Chapman and Hall, London, 1952". Based on this information, it will run a few external searches and return that the publication has identifier MR52986 in MathSciNet and Zbl 0049.26704 in Zentralblatt MATH. Having these identifiers, the DML-CZ website can provide links to MathSciNet and Zentralblatt MATH on article pages.
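The worked example above can be written out as data. The exact field names of the request and the shape of the result are illustrative assumptions, not CiteCrawl's actual wire format:

```python
# Citation metadata as received from the DML-CZ database
# (field names are hypothetical).
citation = {
    "author": "W. J. Gibbs",
    "title": "Tensors in Electrical Machine Theory",
    "source": "Chapman and Hall, London, 1952",
}

# One identifier per external database in which the publication
# was found (result shape is hypothetical).
result = [
    {"database": "MathSciNet", "id": "MR52986"},
    {"database": "Zentralblatt MATH", "id": "Zbl 0049.26704"},
]
```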

1. Hypertext Transfer Protocol (RFC 2616) 2. http://www.python.org/


In the on-demand mode, this happens when the user enters a page on the DML-CZ website. JavaScript code on the page starts an AJAX request against the CiteCrawl web service and when it receives the response, it dynamically creates links to the databases in which the publication was found. Doing searches in external databases is an expensive operation, so in order to return results as quickly as possible it is necessary to cache identifiers from previous searches. This cache will be pre-populated, so it is expected that most requests can be handled directly from the cache without doing external searches. Only requests for new entries from the database will require an external search. In addition to using the web service, it is possible to make CiteCrawl run against the local DML-CZ database and periodically do searches for entries that are not found in the cache. This is used to pre-populate the cache, so that on-demand requests take as little time as possible.
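The cache-first flow can be sketched as follows. The function name, the cache key and the `searchers` mapping are illustrative assumptions, not CiteCrawl's actual interfaces:

```python
def look_up(citation: dict, cache: dict, searchers: dict) -> list:
    """Serve a request from the cache when possible; otherwise run the
    external searches and remember the answer for later requests."""
    key = (citation.get("author"), citation.get("title"))
    if key in cache:
        return cache[key]          # answered without external searches
    results = []
    for database, search in searchers.items():
        identifier = search(citation)
        if identifier is not None:
            results.append((database, identifier))
    cache[key] = results           # pre-populates future look-ups
    return results
```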

4.2 Input Normalization

The metadata in the DML-CZ database comes from scanned images, which are converted to textual form using OCR software. This process is not error-free. CiteCrawl therefore can't assume the input it gets is always correct and has to be able to handle errors. In practice this means that before doing any searches, it checks the text for commonly occurring errors and performs some normalization procedures.

A significant part of the DML-CZ database cites Russian publications and presumably the OCR3 software was configured to recognize Cyrillic text correctly. The problem is that some Cyrillic letters are visually very similar, or even identical, to Latin letters, but their Unicode code points differ. For example, the Cyrillic small letter "а" is visually the same as the Latin small letter "a" in most fonts, but because they have different code points, it's not easily possible to search for a word that contains all Latin letters except one Cyrillic "а" and expect search engines to return results as if it were the Latin "a". Therefore it's necessary to look for these kinds of errors before searching in external databases. The implementation uses a regular expression that matches single Cyrillic letters surrounded by Latin letters and replaces them with their Latin equivalents. This is done for all Cyrillic/Latin letter pairs that have the same visual representation.

The databases CiteCrawl searches in are mostly English-centered and this affects how they represent foreign names and titles. Czech author

3. Optical Character Recognition

17 4. CITECRAWL IMPLEMENTATION names and publication titles are usually entered without diacritic marks. While ignoring diacritics as part of query pre-processing is a common tasks that modern search engines do, some of the used services were overly sen- sitive about this. During experimenting with the MathSciNet search engine, I’ve found that it returns matches that contain diacritics even for queries without diacritics, but not the other way around. This made it clear that removing diacritics before searching will help with returning more relevant results. Diacritic marks are removed by a function that converts input to the NFKD Unicode normal form, which decomposes letters with diacritic marks into two characters, and then removes all combining characters. The result is text without diacritics. There are more sophisticated algorithms for doing this, but for the purposes of CiteCrawl this is sufficient. Another issue is with author names. Listing authors with their first names abbreviated is the standard way of writing citations. For example, the name "Alfred Hoblitzelle Clifford" will be normally written as "A.H. Clifford". On the other hand, bibliographic databases usually use author’s full names and this sometimes causes problems. Every time CiteCrawl has to include the author name in a search query, it tries to remove first name abbreviations and leave only last names. In the example above, the name would be reduced just to "Clifford". This increases the chance of not missing a valid match. There are also other normalization rules, but there are applied only where necessary, not in general on all text for all search engines. These will be described in the following section, when explaining searching in particular bibliographic databases.

4.3 Searching

CiteCrawl uses an optimistic approach to searching in bibliographic databases. At first, it tries to provide as much of the original information as possible and hopes that the external search engine will return the most relevant match. Ideally there would be only one match with the searched publication. Due to errors in the input data, limitations of the search engine, or other reasons, this doesn't work in all cases. As a fallback, CiteCrawl constructs search queries with less information, in the hope that this excludes the erroneous metadata and the search engine will return the desired publication among other results.

Depending on how little of the original information was left in the search query, this method might result in a large number of unrelated publications being returned. It is important

to compare the results with the original metadata, to determine how similar they are. Results with similarity below a defined threshold are ignored. If there are any results left, the algorithm will pick the result with the highest similarity.

4.3.1 Similarity Comparison

The basic idea of the function that compares publication metadata is to compare individual metadata fields separately and then calculate a weighted sum of the fields' similarities. That means that different fields don't have the same impact on the total similarity. For example, the publication title is considered much more important than the source information by the algorithm. Many articles are published multiple times and this specific example allows matching the same article published in a different journal, if this is the only entry found in the database. On the other hand, if there are multiple entries for the article in the database, it will match the right one, because the source information still has some impact on the result.

Before it's possible to run a string similarity algorithm on two metadata fields, it's necessary to convert them to an equal form. LATEX is the standard syntax for writing mathematical text and it is to be expected that it is used as much as possible. This includes using special commands for writing accented characters. Unfortunately, both MathSciNet and Zentralblatt MATH use this syntax even in formats that have native support for Unicode, like XML. This makes it complicated to compare strings from different sources, because one might contain Unicode code points and the other one might contain LATEX commands. CiteCrawl therefore contains a function that understands a small subset of the LATEX syntax and can convert LATEX-encoded accented characters to their Unicode equivalents. The function uses a regular expression to match a few specific commands followed by a single letter. When it finds a match, it takes the letter, appends the appropriate Unicode diacritical mark and uses NFKC Unicode normalization to combine the two characters into one. After this, it calculates a generalized Levenshtein edit distance between the two strings.
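The accent-conversion step can be sketched as follows. The table covers only a few accent commands as an illustration; the subset CiteCrawl actually handles may differ:

```python
import re
import unicodedata

# A small subset of LaTeX accent commands mapped to Unicode
# combining marks (illustrative, not exhaustive).
ACCENTS = {"'": "\u0301", "`": "\u0300", '"': "\u0308", "^": "\u0302",
           "v": "\u030c", "c": "\u0327"}

def latex_accents_to_unicode(text: str) -> str:
    """Convert LaTeX-encoded accented letters, e.g. \\v{S}, to Unicode."""
    pattern = re.compile(r"\\(['`\"^vc])\{?([a-zA-Z])\}?")
    def repl(match):
        combined = match.group(2) + ACCENTS[match.group(1)]
        # NFKC merges the letter and combining mark into one code point.
        return unicodedata.normalize("NFKC", combined)
    return pattern.sub(repl, text)
```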
This algorithm counts the number of "steps" that are necessary to transform one string into the other. The standard Levenshtein edit distance allows insertion, deletion and substitution as the operations that can be performed in each step. [15] The generalized algorithm adds transposition of two adjacent characters to the set of operations; this variant is sometimes called the Damerau-Levenshtein distance. Similarity in the range [0, 1] is calculated from the edit distance using the following formula:


similarity = 1 − distance / max(length_a, length_b)

I tried a few different algorithms for calculating the similarity during development, but the generalized Levenshtein distance produced the best results. The first version used Dice's coefficient, which is based on the number of bigrams found in both strings. [19] Another alternative was sequence matching, which is normally used for calculating string differences, but can also be used to get a number representing the similarity of two strings; the similarity is then based on the length of the longest common subsequence. [18]

The total similarity is calculated as a weighted sum of the individual field similarities. Fields that are not present in either of the metadata records are not included in the sum. Assuming we have n fields, the sum is calculated as follows:

total = ( Σ_{i=1}^{n} weight_i · similarity_i ) / ( Σ_{i=1}^{n} weight_i )

Weights are defined differently for each field. The current implementation uses these values:

Field    Weight
title    2.2
author   1.7
year     1.0
source   0.1
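The distance, the similarity formula and the weighted total above can be put together in a short sketch. The Damerau-Levenshtein implementation below is a standard restricted (adjacent-transposition) version, not CiteCrawl's actual code; the weights are the ones from the table.

```python
def damerau_levenshtein(a, b):
    """Edit distance with insertion, deletion, substitution and
    transposition of two adjacent characters."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

def similarity(a, b):
    """similarity = 1 - distance / max(length_a, length_b)"""
    if not a and not b:
        return 1.0
    return 1.0 - damerau_levenshtein(a, b) / max(len(a), len(b))

WEIGHTS = {"title": 2.2, "author": 1.7, "year": 1.0, "source": 0.1}

def total_similarity(field_scores):
    """Weighted total over the fields present in both records;
    field_scores maps field name -> per-field similarity."""
    weight_sum = sum(WEIGHTS[f] for f in field_scores)
    if weight_sum == 0:
        return 0.0
    return sum(WEIGHTS[f] * s for f, s in field_scores.items()) / weight_sum

print(similarity("recieve", "receive"))
print(total_similarity({"title": 1.0, "author": 0.9, "year": 1.0}))
```

Missing fields simply drop out of both sums, so absent metadata neither helps nor hurts the score.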

4.3.2 Zentralblatt MATH

Zentralblatt MATH offers only a standard search engine (see section 3.1 for details about its abilities). Subscribers have access to a web service that accepts the search query in HTTP GET parameters and returns results in a machine-readable form. CiteCrawl uses the ZMATH XML format, described in section 2.4, because Python has native support for parsing XML documents, which makes it the easiest format to write a parser for. Although the web service offers paging, CiteCrawl always uses only the results from the first page. The number of results per page can be specified in the search query.


The first query it tries includes the publication title, year and the author's name, for example "Semigroups containing minimal ideals au:(Clifford) py:(1948)". It issues an HTTP GET request with this query and some additional static parameters and receives an XML document in the response. This document is parsed and transformed into a list of metadata objects. Each of these objects is compared to the requested metadata, and if there is a match above a certain similarity threshold, currently set to 0.68, this match is returned as the result.

If the first query fails to find a valid match, CiteCrawl tries to search only by the publication title. For short titles this might yield too many results, but usually the title is unique enough not to return too many irrelevant matches. The same filtering process is applied to the results as in the case of the first query.

As a last resort, if the previous query also fails, it tries to find publications by the author name and the publication year. This only works for authors with a few publications. For authors with many publications, it would be necessary to iterate over multiple result pages, and the first page would most likely not contain the right publication anyway.
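The three-stage fallback can be sketched as follows. The query syntax mirrors the example above ("au:" and "py:" prefixes); the `search` and `best_match` parameters are placeholders for the HTTP request and the similarity filtering, so this is a structural sketch rather than CiteCrawl's actual module.

```python
def build_queries(title, author, year):
    """Queries in decreasing order of specificity, as described above."""
    return [
        "%s au:(%s) py:(%s)" % (title, author, year),  # title + author + year
        title,                                         # title only
        "au:(%s) py:(%s)" % (author, year),            # author + year, last resort
    ]

def zmath_lookup(title, author, year, search, best_match):
    """Try each query in turn; stop at the first acceptable match."""
    for query in build_queries(title, author, year):
        results = search(query)       # HTTP GET, first result page only
        match = best_match(results)   # similarity filter, threshold 0.68
        if match is not None:
            return match
    return None

print(build_queries("Semigroups containing minimal ideals", "Clifford", "1948")[0])
```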

4.3.3 MathSciNet

Mathematical Reviews provides a small service for looking up references in the MathSciNet database, called MRef. This tool is described in section 3.2.1 and works surprisingly well, so CiteCrawl uses it as its first attempt. It joins all fields with semicolons and, without much normalization or pre-processing, passes the resulting string to MRef via an HTTP GET request. MRef returns an HTML page with either one match or nothing. It can also return the BibTeX or AMSRefs formats, but these are wrapped in HTML as well, so parsing HTML is unavoidable. Python doesn't offer any standard solution for real-world HTML parsing, so CiteCrawl uses a third-party package called Beautiful Soup4. If MRef returns a match, it is accepted without metadata comparison and CiteCrawl doesn't proceed with regular searches.

Otherwise it tries to use the MathSciNet search engine, which goes through a similar process to the Zentralblatt MATH case. The first query includes the title, year and author name. Only the title and the author name are included in the actual search query; the publication year is processed by MathSciNet as a separate field with different semantics.
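Building the MRef query string could look like the sketch below. The field order is an assumption on my part; the text only says that the fields are joined with semicolons and that MRef accepts a free-form reference string.

```python
def build_mref_reference(title, authors, year, source):
    """Join all available fields into one semicolon-separated
    reference string, skipping empty fields."""
    parts = ["; ".join(authors), title, source, str(year)]
    return "; ".join(p for p in parts if p)

print(build_mref_reference("Semigroups containing minimal ideals",
                           ["A. H. Clifford"], 1948,
                           "Amer. J. Math. 70"))
```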

4. http://www.crummy.com/software/BeautifulSoup/


This makes it possible to search for year ranges, for example, but CiteCrawl doesn't use this functionality.

Extracting the search results is a little more complex. The search engine that MathSciNet provides is targeted at human users and doesn't offer a fully machine-readable format. It can output the results in formats that are importable into various bibliographic tools, but the user is expected to copy and paste the data from the returned HTML page. The MathSciNet module therefore first parses the HTML page and locates and extracts the main plain-text section from it. Once it has the text, it can use a standard parser to get the metadata. The EndNote file format is used, because it is one of the simpler formats to parse and it provides all the details necessary for CiteCrawl.
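A minimal parser for the EndNote tagged format might look like this. The sketch assumes the common tagged layout where each line carries a "%X value" field (%A author, %T title, %J journal, %D year); repeated tags such as %A accumulate into lists.

```python
def parse_endnote(text):
    """Parse EndNote tagged records into a dict of tag -> list of values."""
    record = {}
    for line in text.splitlines():
        if line.startswith("%") and len(line) >= 2:
            tag, _, value = line.partition(" ")
            record.setdefault(tag, []).append(value.strip())
    return record

sample = """%A Clifford, A. H.
%T Semigroups containing minimal ideals
%J American Journal of Mathematics
%D 1948"""

print(parse_endnote(sample)["%T"])
```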

4.4 XML Web Service

Above the federated search functionality sits an HTTP-based frontend. It is a simple web service that accepts queries via HTTP GET requests and responds with an XML document. This web service was meant to be called via AJAX from the DML-CZ web site. Queries are passed in the query string part of the URL, encoded in the format described by RFC 2396⁵, for example http://example.com/search?field1=value1&field2=value2. The accepted fields are:

Field    Description
title    Publication title.
author   Author name.
year     Publication year.
source   Source information (journal, publisher).

Each field can be used multiple times if it is necessary to pass multiple values. Realistically, this is only useful for the author name, because one publication can have multiple authors.

After the web service receives these values, it starts the search process. First it looks in the cache to find out whether it contains a matching entry. This cache is explained in detail in the next section. Results from different sources are stored separately, so it is possible that the identifier from one source is cached while the identifier from another is not. The service collects the identifiers it finds in the cache and uses the search modules to look up the missing ones. When this process is done and some results were found, they are stored in the cache

5. Uniform Resource Identifiers (URI): Generic Syntax

for future use. The next step is putting the information together and serializing the resulting XML document, for example:
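The example document itself did not survive extraction. The following is a plausible reconstruction built with the standard library from the element names the text describes (source, number, text, url); the root and per-result element names and all values are invented for illustration.

```python
import xml.etree.ElementTree as ET

def serialize_results(results):
    """Serialize a list of result dicts into the response XML.
    Element names follow the description in the text; the wrapper
    names are assumptions."""
    root = ET.Element("results")
    for r in results:
        result = ET.SubElement(root, "result")
        for tag in ("source", "number", "text", "url"):
            ET.SubElement(result, tag).text = r[tag]
    return ET.tostring(root, encoding="unicode")

xml_doc = serialize_results([{
    "source": "zbl",
    "number": "0000.00000",          # invented identifier
    "text": "Zbl 0000.00000",
    "url": "http://www.zentralblatt-math.org/zmath/en/",
}])
print(xml_doc)
```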

It is a simple XML format with a list of results, which includes the source database code (source), the raw external identifier (number), the formatted external identifier (text) and the URL of the external web site (url) for each result.

From a more technical point of view, the web service is implemented as a WSGI6 application. WSGI is the standard way to interact with web servers from Python. It is an abstract standard that doesn't directly talk to a web server, but only describes the Python API. There are multiple adapters that allow hosting WSGI applications on commonly used web servers, such as Apache7. With adapters for standards like CGI or FastCGI, it is possible to run CiteCrawl on virtually any web server.
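A minimal WSGI entry point for such a service could look like the sketch below. The `lookup_and_serialize` stub stands in for the whole cache-then-search-then-XML pipeline described above; only the WSGI plumbing and the handling of repeatable query fields follow the text.

```python
from urllib.parse import parse_qs

def lookup_and_serialize(fields):
    """Stub for the real search + serialization step (assumption)."""
    titles = fields.get("title", [])
    return "<results><!-- %d title value(s) --></results>" % len(titles)

def application(environ, start_response):
    # parse_qs keeps repeated fields (e.g. several "author" values) as lists
    fields = parse_qs(environ.get("QUERY_STRING", ""))
    data = lookup_and_serialize(fields).encode("utf-8")
    start_response("200 OK", [("Content-Type", "application/xml"),
                              ("Content-Length", str(len(data)))])
    return [data]

# Exercise the application without a web server:
print(b"".join(application({"QUERY_STRING": "title=Semigroups&author=Clifford"},
                           lambda status, headers: None)).decode("utf-8"))
```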

4.5 Caching

The application uses two layers of caching. On the lower level, it caches all HTTP requests to external servers. This is implemented by generating a SHA-18 hash of the requested URL and storing the response body in a file named after that hash. Before making an HTTP request, the application checks whether the file exists in the cache directory; if it does, it reads the data from this file and returns it to the calling function. Otherwise it performs an actual HTTP request and, before returning the data, creates a new file for future requests to the same URL. This caching mechanism is primarily useful during development, because the process of fine-tuning the similarity comparison settings or the search query building functions requires repeatedly running tests over the whole data set.

6. Web Server Gateway Interface - http://wsgi.org/wsgi/
7. http://httpd.apache.org/
8. US Secure Hash Algorithm 1 (RFC 3174)

There is also a database-based caching mechanism on the application level. It uses a local SQLite9 database, chosen mainly for its simple setup and maintenance. The database has two tables with the following structure:
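The schema listing did not survive extraction. The following is a plausible reconstruction: the table and column names come from the description in the text, while the exact column types and the linking column are assumptions.

```python
import sqlite3

# Plausible reconstruction of the two-table cache schema (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE query (
        id INTEGER PRIMARY KEY,
        hash TEXT UNIQUE NOT NULL          -- SHA-1 of the query metadata
    );
    CREATE TABLE result (
        id INTEGER PRIMARY KEY,
        query_id INTEGER NOT NULL REFERENCES query (id),
        source TEXT NOT NULL,              -- external database code
        number TEXT,                       -- identifier; NULL = search failed
        modified DATE NOT NULL,            -- for expiring old rows
        pinned BOOLEAN NOT NULL DEFAULT 0  -- manually corrected, never expire
    );
""")
```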

The query table represents a cached request. It is identified by a SHA-1 hash of the query metadata, which is stored in the hash column. Each row in this table has one or more linked rows in the result table; these represent the identifiers found in external databases. The source column says which database the identifier comes from and the number column is the actual identifier. A row is inserted into this table even for unsuccessful searches; in that case a NULL is stored in the number column. The reason for this is that the application should not search over and over again for publications that cannot be found. The NULL value tells it that it tried the search in the past and it failed. However, publications that were previously not found might eventually appear in the database, so the application cannot assume that if a publication was not found once, it will never be found. This is where the modified column is used. It stores the date when the row was last modified, which is necessary for deleting rows that are older than a certain number of days. The pinned column tells the application that even though a row is old, it should not be deleted or overwritten. This is used for results that were manually corrected.
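The lower-level file cache described at the start of this section can be sketched as follows. The function and directory names are illustrative; the `fetch` parameter stands in for the real HTTP request, so the demo runs without network access.

```python
import hashlib
import os
import tempfile

CACHE_DIR = tempfile.mkdtemp()  # illustrative; the real app uses a fixed directory

def cached_fetch(url, fetch):
    """Return the cached response body for url, or call fetch(url)
    and store the result in a file named by the SHA-1 of the URL."""
    name = hashlib.sha1(url.encode("utf-8")).hexdigest()
    path = os.path.join(CACHE_DIR, name)
    if os.path.exists(path):
        with open(path, "rb") as f:
            return f.read()
    data = fetch(url)
    with open(path, "wb") as f:
        f.write(data)
    return data

print(cached_fetch("http://example.com/a", lambda u: b"response"))
print(cached_fetch("http://example.com/a", lambda u: b"IGNORED"))  # served from cache
```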

4.6 Possible Further Improvements

The system I've implemented works well for Zentralblatt MATH and Mathematical Reviews, but DML-CZ already had some links to these databases before. A very useful improvement would be to add support for DOIs. As these are distributed identifiers, they can be available from multiple sources. Near the end of the work on this thesis, Mathematical Reviews started providing DOI links for many publications. A good start would be reading

9. http://www.sqlite.org/

the DOI information from MathSciNet's export. This would require an API change that allows search modules to return different types of identifiers.

The primary source for scientific DOIs is CrossRef, and it would be interesting to investigate how good the DOI coverage in Mathematical Reviews is, i.e. whether there are many publications that have DOIs assigned but that Mathematical Reviews doesn't list them for. If there are many such publications, it might be useful to add CrossRef as a new data source. Google Scholar has DOI support too, but there is no API to automate searches. If Google extends the Google AJAX Search API to include Google Scholar, or if a new API for Google Scholar appears in the future, it would be a good source not only of DOIs, but also of other kinds of links.

Chapter 5

Conclusion

I decided to work on a bachelor thesis in this area because I enjoy search technologies and had some previous experience with them, which helped me a lot in the development of the CiteCrawl application. While I was already familiar with searching in textual databases, I had to learn in detail about a few specific search engines, how they work and what their limitations are. This was valuable for getting a more complete picture of the different search needs of different applications.

Working with bibliographic data was completely new to me, though. I familiarized myself with the basics and then expanded my knowledge into areas related to digital libraries and bibliographic databases. This covers the most commonly used exchange file formats and linking standards. I had to learn about the organization of bibliographic databases, their usual structure and how they interact with external applications.

The most important part of this thesis is the CiteCrawl application. I believe I've managed to improve the coverage of citation links from DML-CZ to both Zentralblatt MATH and Mathematical Reviews. In the case of Zentralblatt MATH the coverage almost doubled, which I consider a success. In addition to finding more results, CiteCrawl avoids false matches by post-filtering the results from the external services, which reduces the number of incorrect links. These improvements come at a price, though: an increased average time to look up a reference. The extensive caching helps, but it could still become a problem for the usefulness of the web service. Fortunately, if it turns out to be a problem, it is always possible to go back to the original batch processing approach, which CiteCrawl also supports.

Bibliography

[1] About Zentralblatt MATH. http://www.zentralblatt-math.org/zmath/en/about/. Online, accessed September 15, 2010.

[2] crossref.org :: affiliate benefits. http://www.crossref.org/04intermediaries/index.html. Online, accessed November 3, 2009.

[3] crossref.org :: library benefits. http://www.crossref.org/03libraries/index.html. Online, accessed November 3, 2009.

[4] The digital object identifier system. http://www.doi.org/. Online, accessed January 1, 2010.

[5] DML-CZ - Frequently Asked Questions (FAQ). http://dml.cz/FAQ. Online, accessed November 23, 2009.

[6] EndNote - About Thomson Reuters EndNote. http://www.endnote.com/enabout.asp. Online, accessed October 2, 2009.

[7] MathSciNet - Mathematical Reviews on the Web. http://www.ams.org/mathscinet/help/about.html.

[8] Search in Zentralblatt MATH Database. http://www.zentralblatt-math.org/zmath/en/help/search/. Online, accessed September 16, 2009.

[9] The amsrefs package. http://www.ams.org/tex/amsrefs.html. Online, accessed October 1, 2009.

[10] Zentralblatt MATH - ZMATH Online Database. http://www.zentralblatt-math.org/zmath/en/. Online, accessed September 15, 2009.

[11] W.Y. Arms. Digital libraries. The MIT Press, 2001.

[12] A. Feder. BibTeX. http://www.bibtex.org/. Online, accessed September 10, 2009.

[13] A. Feder. BibTeX format description. http://www.bibtex.org/Format/. Online, accessed September 10, 2009.

[14] S.R. Kruk and B. McDaniel. Semantic Digital Libraries. Springer Pub- lishing Company, Incorporated, 2009.

[15] V.I. Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. In Soviet Physics-Doklady, volume 10, 1966.

[16] N. Paskin. The DOI® Handbook. International DOI Foundation, Inc., 2006.

[17] B. Quint. Changes at Google Scholar: A Conversation With Anurag Acharya. http://newsbreaks.infotoday.com/nbReader.asp?ArticleId=37309. Online, accessed December 29, 2009.

[18] J.W. Ratcliff and D. Metzener. Pattern matching: The gestalt approach. Dr. Dobb’s Journal, 7:46, 1988.

[19] C.J. Van Rijsbergen. Information Retrieval. Butterworth-Heinemann, 1979.

[20] J. Rákosník. Česká digitální matematická knihovna. Technical report, Matematický ústav AV ČR, Praha, 2009.

[21] D. Stern. Digital libraries: philosophies, technical design considera- tions, and example scenarios. Routledge, 1999.

Appendix A

Contents of the Attached CD

• Source code of the CiteCrawl application.

• Electronic version of this thesis (PDF document and LaTeX source files).
