Maths Information Retrieval for Digital Libraries
FI MU
Michal Růžička
Ph.D. Thesis Proposal
Brno, 15th January 2013
Advisor: doc. RNDr. Petr Sojka, Ph.D.

Contents

1 Introduction
  1.1 Maths Digital Libraries
  1.2 Maths Information Retrieval
  1.3 Outline of the Thesis Proposal
2 State of the Art
  2.1 Languages for Description of Mathematical Notations
    2.1.1 TeX/LaTeX
    2.1.2 MathML
    2.1.3 OpenMath
    2.1.4 OMDoc
  2.2 Sources of MathML in Digital Libraries
    2.2.1 Tralics
    2.2.2 LaTeXML
    2.2.3 InftyReader and MaxTract
    2.2.4 Other Sources
  2.3 Digital Mathematical Libraries and Search Engines
    2.3.1 NIST DLMF
    2.3.2 MathDex
    2.3.3 EgoMath
    2.3.4 LaTeX Search / LaTeX SpeedSearch
    2.3.5 ActiveMath
    2.3.6 MathWebSearch
    2.3.7 EuDML
  2.4 MIaS and WebMIaS
    2.4.1 The MIR Happening and the NTCIR Math Task Competition
3 Thesis Goals
  3.1 Objectives and Expected Results
    3.1.1 MathML Normalization
    3.1.2 Classification of Identifiers
    3.1.3 Context Driven Search
    3.1.4 Involvement of Computer Algebra Systems
    3.1.5 Image Search Experiment
  3.2 Expected Outputs
  3.3 Schedule
4 Results of Study
  4.1 On Topic of the Ph.D. Thesis
  4.2 Outcomes
  4.3 Academic Work
5 Summary/Souhrn
  5.1 Summary
  5.2 Souhrn (summary in Czech)
References
A Examples of Languages for Description of Mathematical Notations
  A.1 TeX
  A.2 Presentation MathML
  A.3 Content MathML
  A.4 OpenMath
B Examples of MathML Code from Different Sources
  B.1 "Hand Made" Formula
  B.2 Tralics
  B.3 LaTeXML
  B.4 InftyReader
  B.5 MaxTract
  B.6 MATLAB
  B.7 Wolfram Alpha
C Publications
  Automated Processing of TeX-Typeset Articles for a Digital Library
  Data Enhancements in a Digital Mathematical Library
  Metadata Editing and Validation for a Digital Mathematics Library
  Building Corpora of Technical Texts: Approaches and Tools
  Interface and Collection for Mathematical Retrieval: WebMIaS and MREC
  Redakční systém odborného časopisu s podporou exportu do digitální knihovny
  Normalization of Digital Mathematics Library Content

1 Introduction

1.1 Maths Digital Libraries

Libraries of scientific literature have long been an integral part of academic and educational institutions.
As with most areas of our professional lives, libraries have undergone a great deal of change with the advent of computers and, later, with the common availability of internet connections. Physical card catalogues have been transformed into digital databases. The documents themselves are sorted into categories, enriched with various metadata and available in digital form, in an unlimited number of copies, to anyone anywhere with access to the Internet.

Among other specialized digital libraries, mathematics digital libraries have emerged and are now on the rise. The Czech Republic has its Czech Digital Mathematics Library (DML-CZ)¹ project. The development of the library began in 2005 and was finished in 2009. The library consists of the relevant mathematical literature published during the history of the Czech lands. [DML13]

There are a number of digital mathematics library projects around the world, supported by academic institutions, businesses or national governments. For example, in France there are the Centre de diffusion de revues académiques mathématiques (CEDRAM)² and Numérisation de documents anciens mathématiques (NUMDAM)³, and in Germany the Göttinger Digitalisierungszentrum (GDZ)⁴ in Göttingen, the Electronic Research Archive for Mathematics (ERAM)⁵ and The Electronic Library of Mathematics (ELibM)⁶. There are also the Journal STORage (JSTOR)⁷, Project Euclid⁸, the Russian Digital Mathematics Library (RusDML)⁹, the Polish Digital Mathematical Library (DML-PL)¹⁰ in Poland, the Biblioteca Digital Española de Matemáticas (DML-E)¹¹ in Spain, the Japanese Digital Mathematics Library (DML-JP)¹² in Japan, and the Riviste Elettroniche Italiane di Matematica (REIM)¹³ and Biblioteca Digitale Italiana di Matematica (bdim)¹⁴ in Italy.

The vision of the World Digital Mathematics Library is to see the separate projects gradually join together.
For example, the European Digital Mathematics Library (EuDML)¹⁵ project began in Europe in 2010. Its aim is to provide a single point of access to European mathematical literature. It is scheduled for completion on 31 January 2013.

¹ http://www.dml.cz/
² http://www.cedram.org/
³ http://www.numdam.org/
⁴ http://gdz.sub.uni-goettingen.de/
⁵ http://www.emis.de/projects/JFM/
⁶ http://siba-sinmemis.unile.it/ELibM.html
⁷ http://www.jstor.org/
⁸ http://projecteuclid.org/
⁹ http://www.rusdml.de/
¹⁰ http://pldml.icm.edu.pl/
¹¹ http://dmle.cindoc.csic.es/
¹² http://sparc1.math.sci.hokudai.ac.jp/dmljp/
¹³ http://siba2.unile.it/sinm/reim/
¹⁴ http://www.bdim.eu/
¹⁵ http://eudml.org/

1.2 Maths Information Retrieval

The growing amount of data stored in digital libraries makes it increasingly difficult for the reader to find relevant content. Users are accustomed to looking for answers to their questions through search engines. On today's Internet, a very simple single-field search interface is the de facto standard, largely thanks to popular search services such as Google.

However, such a simple search based on text keywords is neither appropriate nor sufficient for mathematical content. Mathematical expressions with the same meaning can be written in many ways by the author and consequently encoded in many ways in the computer system.

Moreover, authors of mathematical papers usually prepare their documents for print. They therefore routinely use tools that encode the appearance of the formulae rather than their meaning. Although methods for semantic annotation exist, most authors derive no direct benefit from semantically annotating their papers, and so they do not do it. There is no reason to believe that this will change in the future, which makes it our responsibility to process documents as they actually are.
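To make the point about multiple encodings concrete, the following is a minimal sketch and not the method of any system discussed in this proposal. It assumes two hypothetical Presentation MathML strings for the same sum, and a toy helper, canonical_sum, that only handles a flat sum inside a single mrow element. A plain string comparison treats the two encodings as different; sorting the operands of the commutative operator makes them compare equal.

```python
import xml.etree.ElementTree as ET

# Two hypothetical Presentation MathML encodings of the same sum.
A_PLUS_B = "<math><mrow><mi>a</mi><mo>+</mo><mi>b</mi></mrow></math>"
B_PLUS_A = "<math><mrow><mi>b</mi><mo>+</mo><mi>a</mi></mrow></math>"


def canonical_sum(mathml: str) -> str:
    """Return a canonical string for a flat sum such as a+b+c.

    Toy assumption: the formula is a single <mrow> whose children
    alternate between operands and the operator <mo>+</mo>.
    """
    mrow = ET.fromstring(mathml).find("mrow")
    operands = [
        ET.tostring(child, encoding="unicode")
        for child in mrow
        if not (child.tag == "mo" and child.text == "+")
    ]
    # Addition is commutative, so operand order carries no meaning.
    return "<mo>+</mo>".join(sorted(operands))


def same_formula(left: str, right: str) -> bool:
    """Compare two encodings after canonicalising operand order."""
    return canonical_sum(left) == canonical_sum(right)
```

Real documents need far more than this: nested expressions, other operators, namespaces and differing markup for the same symbol are all ignored here.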
Thus, it is not easy to design and implement a maths-aware search engine and integrate it into a digital mathematics library. [MY03, p. 10] summarises maths search issues as follows:

1. Recognition and proper processing of mathematical symbols in mathematical content and queries.
2. Capturing and indexing mathematical structures.
3. Providing a math-appropriate query language and user interface that enable users to express their information needs, which often involve math symbols and structures.
4. Developing and integrating techniques for taking into account mathematical synonyms and equivalences, at least some of the more common ones such as commutativity- and associativity-based equivalences.

In the rest of this thesis proposal I introduce the aim of my Ph.D. studies: research on, and implementation of, methods of maths information retrieval in the context of digital mathematics libraries. More specifically, I want to maximize the usability and usefulness of the full-text search engine in digital mathematics libraries.

1.3 Outline of the Thesis Proposal

The structure of this work is as follows: Section 2 summarises existing literature and related systems, and Section 3 discusses the objectives of my thesis. The results are introduced in Section 4 and the concluding remarks are presented in Section 5. Supplementary materials are attached in the Appendices.

2 State of the Art

Maths information retrieval in digital libraries is closely related to the technologies that are used in digital libraries as well as to the habits of authors of mathematical documents. This section gives a brief overview of the tools used by authors of mathematical documents and the technologies used for encoding mathematical content in digital mathematical libraries. As these formats are not the same, selected conversion tools