
Maths Information Retrieval for Digital Libraries

Michal Růžička

Ph.D. Thesis Proposal Brno, 15th January 2013

Advisor: doc. RNDr. Petr Sojka, Ph.D.

Contents

1 Introduction
  1.1 Maths Digital Libraries
  1.2 Maths Information Retrieval
  1.3 Outline of the Thesis Proposal

2 State of the Art
  2.1 Languages for Description of Mathematical Notations
    2.1.1 TeX/LaTeX
    2.1.2 MathML
    2.1.3 OpenMath
    2.1.4 OMDoc
  2.2 Sources of MathML in Digital Libraries
    2.2.1 Tralics
    2.2.2 LaTeXML
    2.2.3 InftyReader and MaxTract
    2.2.4 Other Sources
  2.3 Digital Mathematical Libraries and Search Engines
    2.3.1 NIST DLMF
    2.3.2 MathDex
    2.3.3 EgoMath
    2.3.4 LaTeX Search / LaTeX SpeedSearch
    2.3.5 ActiveMath
    2.3.6 MathWebSearch
    2.3.7 EuDML
  2.4 MIaS and WebMIaS
    2.4.1 The MIR Happening and the NTCIR Math Task Competition

3 Thesis Goals
  3.1 Objectives and Expected Results
    3.1.1 MathML Normalization
    3.1.2 Classification of Identifiers
    3.1.3 Context Driven Search
    3.1.4 Involvement of Computer Algebra Systems
    3.1.5 Image Search Experiment
  3.2 Expected Outputs
  3.3 Schedule

4 Results of Study
  4.1 On Topic of the Ph.D. Thesis
  4.2 Outcomes
  4.3 Academic Work

5 Summary/Souhrn
  5.1 Summary
  5.2 Souhrn

References

A Examples of Languages for Description of Mathematical Notations
  A.1 TeX
  A.2 Presentation MathML
  A.3 Content MathML
  A.4 OpenMath

B Examples of MathML Code from Different Sources
  B.1 “Hand Made” Formula
  B.2 Tralics
  B.3 LaTeXML
  B.4 InftyReader
  B.5 MaxTract
  B.6 MATLAB
  B.7 Wolfram Alpha

C Publications
  Automated Processing of TeX-Typeset Articles for a Digital Library
  Data Enhancements in a Digital Mathematical Library
  Metadata Editing and Validation for a Digital Mathematics Library
  Building Corpora of Technical Texts: Approaches and Tools
  Interface and Collection for Mathematical Retrieval: WebMIaS and MREC
  Redakční systém odborného časopisu s podporou exportu do digitální knihovny
  Normalization of Digital Mathematics Library Content

1 Introduction

1.1 Maths Digital Libraries

Libraries of scientific literature have long been an integral part of academic and educational institutions. As with most areas of our professional lives, libraries have undergone a great deal of change with the advent of computers and, later, with the common availability of internet connections. Physical card catalogs have been transformed into digital databases, and the documents themselves are sorted into categories, enriched with various metadata and available in digital form in an unlimited number of copies to anyone anywhere with ready access to the Internet.

Among other specialized digital libraries, mathematics digital libraries have emerged and are now on the rise. The Czech Republic has the Czech Digital Mathematics Library (DML-CZ)¹. The development of the library began in 2005 and was finished in 2009. The library consists of the relevant mathematical literature published throughout the history of the Czech lands. [DML13]

There are a number of digital mathematics library projects around the world, supported by academic institutions, business companies or national governments. For example, in France there are the Centre de diffusion de revues académiques mathématiques (CEDRAM)² and Numérisation de documents anciens mathématiques (NUMDAM)³, and in Germany the Göttinger Digitalisierungszentrum (GDZ)⁴, the Electronic Research Archive for Mathematics (ERAM)⁵ and The Electronic Library of Mathematics (ELibM)⁶. There are also the Journal STORage (JSTOR)⁷, Project Euclid⁸, the Russian Digital Mathematics Library (RusDML)⁹, the Polish Digital Mathematical Library (DML-PL)¹⁰ in Poland, the Biblioteca Digital Española de Matemáticas (DML-E)¹¹ in Spain, the Japanese Digital Mathematics Library (DML-JP)¹² in Japan, and the Riviste Elettroniche Italiane di Matematica (REIM)¹³ and Biblioteca Digitale Italiana di Matematica (bdim)¹⁴ in Italy.

The vision of the World Digital Mathematics Library is to see the separate projects gradually join together. For example, the project of the European Digital Mathematics Library (EuDML)¹⁵ began in Europe in 2010. Its aim is to provide a single point of access to European mathematical literature. It is scheduled for completion on 31 January 2013.

¹ http://www.dml.cz/ ² http://www.cedram.org/ ³ http://www.numdam.org/ ⁴ http://gdz.sub.uni-goettingen.de/ ⁵ http://www.emis.de/projects/JFM/ ⁶ http://siba-sinmemis.unile.it/ELibM.html ⁷ http://www.jstor.org/ ⁸ http://projecteuclid.org/ ⁹ http://www.rusdml.de/ ¹⁰ http://pldml.icm.edu.pl/ ¹¹ http://dmle.cindoc.csic.es/ ¹² http://sparc1.math.sci.hokudai.ac.jp/dmljp/ ¹³ http://siba2.unile.it/sinm/reim/ ¹⁴ http://www.bdim.eu/ ¹⁵ http://eudml.org/


1.2 Maths Information Retrieval

The increasing amount of data stored in digital libraries makes it increasingly difficult for readers to find relevant content. Users are accustomed to looking for answers to their questions through search engines. On the current Internet, a very simple one-field search interface is the de facto standard, especially thanks to well-known search services like Google.

However, such a simple search based on text keywords is neither appropriate nor sufficient for mathematical content. Mathematical expressions with the same meaning can be written in many ways by the author and consequently encoded in many ways in the computer system. Moreover, the authors of mathematical papers usually prepare their documents for print. The tools they routinely use therefore encode the appearance of the formulae and not their meaning. Even though methods for semantic annotation exist, ordinary authors derive no direct additional value from semantically annotating their papers, and therefore they do not do it. There is no reason to believe that this will change in the future, which makes it our responsibility to process the documents as they really are.

Thus, it is not easy to design and implement a mathematics-aware search engine and integrate it into a digital mathematics library. [MY03, p. 10] summarises maths search issues as follows:

1. Recognition and proper processing of mathematical symbols in mathematical content and queries.

2. Capturing and indexing mathematical structures.

3. Providing a math-appropriate query language and user interface that enable users to express their information needs, which often involve math symbols and structures.

4. Developing and integrating techniques for taking into account mathematical synonyms and equivalences – at least some of the more common ones such as commutativity and associativity based equivalences.

In the rest of this thesis proposal I introduce the aim of my Ph.D. studies – research on and implementation of methods of maths information retrieval in the context of digital mathematics libraries. More specifically, I want to maximize the usability and usefulness of the full text search engine in digital mathematics libraries.

1.3 Outline of the Thesis Proposal

The structure of this work is as follows: Section 2 summarises existing literature and related systems, and Section 3 discusses the objectives of my thesis. The results are introduced in Section 4 and the concluding remarks are presented in Section 5. Supplementary materials are attached in the Appendices.


2 State of the Art

Maths information retrieval in digital libraries is closely related to the technologies that are used in digital libraries as well as to the habits of authors of mathematical documents. This section gives a brief overview of the tools that are used by authors of mathematical documents and of the technologies used for encoding mathematical content in digital mathematical libraries. As these formats are not the same, selected conversion tools in use are described. The rest of the section is dedicated to the description of existing projects of digital mathematical libraries and maths search engines.

2.1 Languages for Description of Mathematical Notations

The languages that authors of scientific documents use for encoding mathematical notations are usually not the same as the formats used for encoding mathematical notations on the Web or in digital libraries. Digital libraries are built to store large collections of electronic documents and to provide end users with efficient, automated, read-only access to and navigation through the collection. In contrast, authors of scientific papers usually prepare their documents for print and use tools optimised for the creation of mathematical content by hand.

2.1.1 TeX/LaTeX

A very popular tool for typesetting scientific, especially mathematical, documents is TeX or one of its incarnations such as LaTeX [Knu86; Lam86]. Thus, most of the content of digital mathematical libraries was originally typeset using this tool, and the only “source” version of the maths in stored documents is encoded in this language.

The advantages of LaTeX include its programmability, speed, license, compatibility across different platforms, and wide support among users and publishers. One of the key reasons for its popularity is that it is a human-readable and human-writable plain text notation for mathematical statements (see Appendix A.1). The notation uses a very limited set of characters: an old-fashioned plain text editor using ASCII encoding is sufficient. The notation can be written quickly by a human, yet it is still easily readable and powerful enough to encode even the most complex mathematical statements.

TeX mathematical syntax is so widespread and popular that it is often used as the basis for custom mathematical encodings in other systems, e.g. the MediaWiki system¹⁶ used by Wikipedia¹⁷ or the ASCIIMath¹⁸ markup language [Jip06].

The output of the TeX system is famous for the quality of its mathematical typesetting. However, the TeX maths encoding was designed only for the description of mathematical statements for print output. As such, TeX-encoded mathematical formulae do not provide any information about their meaning. The encoding only says how to typeset all elements on the page. Although there have been attempts to extend the LaTeX syntax with semantics (see Section 2.3.1), they are not in common use.

¹⁶ http://www.mediawiki.org/ ¹⁷ https://en.wikipedia.org/wiki/Help:Displaying_a_formula ¹⁸ http://www1.chapman.edu/~jipsen/mathml/asciimath.html

2.1.2 MathML

On the web and as a universal format for data interchange, XML markup is very popular and widely used. The XML markup for encoding mathematical notation is MathML (currently in version 3.0) [Aus+10]. MathML consists of Presentation markup for encoding the “appearance” of mathematical formulae and Content markup for encoding their semantics [Aus+10]:

[…] MathML is an XML application for describing mathematical notation and capturing both its structure and content. The goal of MathML is to enable mathematics to be served, received, and processed on the World Wide Web, just as HTML has enabled this functionality for text. […] MathML can be used to encode both mathematical notation and mathematical content. About thirty-eight of the MathML tags describe abstract notational structures, while another about one hundred and seventy provide a way of unambiguously specifying the intended meaning of an expression. […] While MathML is human-readable, authors will typically use equation editors, conversion programs, and other specialized software tools to generate MathML. Several versions of such MathML tools exist, both freely available software and commercial products, and more are under development. […]

Examples of the Presentation and Content MathML encoding of a simple formula can be seen in Appendices A.2 and A.3. The majority of current web browsers have built-in support for MathML rendering; for the others, plugins or tools such as MathJax¹⁹ [Cer12] can be used. Thus, MathML is often used as an internal format for the encoding of maths in digital libraries: formulae encoded in this way can serve directly as a commonly understood, machine-processable exchange format and can be delivered to the end user on the web. Moreover, Presentation or even Content MathML can be generated from the original TeX encoding of mathematical formulae with reasonable effort (see Section 2.2).
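The difference between the two kinds of markup can be illustrated with a minimal Python sketch. It is an illustration only, not taken from the appendices: the formula x + y and the helper `tags` are chosen here for the example, and the MathML namespace declaration is omitted for brevity.

```python
import xml.etree.ElementTree as ET

# Presentation MathML: describes the layout (identifier, operator, identifier).
PRESENTATION = "<math><mrow><mi>x</mi><mo>+</mo><mi>y</mi></mrow></math>"

# Content MathML: describes the meaning (apply the plus operator to x and y).
CONTENT = "<math><apply><plus/><ci>x</ci><ci>y</ci></apply></math>"


def tags(mathml: str) -> list[str]:
    """Return the element names of a MathML fragment in document order."""
    return [element.tag for element in ET.fromstring(mathml).iter()]


print(tags(PRESENTATION))  # ['math', 'mrow', 'mi', 'mo', 'mi']
print(tags(CONTENT))       # ['math', 'apply', 'plus', 'ci', 'ci']
```

The Presentation form records only that three symbols stand next to each other, whereas the Content form states explicitly which operator is applied to which arguments.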

¹⁹ http://www.mathjax.org/


2.1.3 OpenMath

OpenMath²⁰ is another important standard [Bus+04] for encoding mathematical statements. Content MathML can be converted to OpenMath [Car+01]. In contrast to MathML, OpenMath is primarily targeted at semantic descriptions of mathematical statements. The meaning of symbols in OpenMath is defined using so-called Content Dictionaries. The abstract description of OpenMath objects is independent of their encoding; the objects can be encoded in two forms [Bus+04, p. iii]:

The first, in XML, is designed primarily for use on the web, in documents, and for applications which want to mix OpenMath as a content representation with MathML as a presentation format. The second, a binary format, is designed for applications which wish to exchange very large objects, or a lot of data as efficiently as possible.

An example of the OpenMath encoding of a simple formula is shown in Appendix A.4. In comparison with MathML, OpenMath is much less widespread in the wild. As far as I know, there is no widespread end user tool for the preparation of publications with OpenMath encoding of mathematical content, nor mathematical software using the OpenMath format for the direct interchange of mathematical data.

2.1.4 OMDoc

In this section OMDoc²¹ should be mentioned even though its purpose is much wider than that of MathML or OpenMath mentioned above. OMDoc is an open markup language for mathematical documents and models mathematical expressions on three levels [Koh06, pp. 28–29]:

• theory level – Development graph of theories.

• statement level – Axioms, definitions, theorems, proofs, examples and so on.

• object level – Objects as logical formulae in Content MathML / OpenMath.

Thus, OMDoc describes mathematical knowledge on a higher level than MathML or OpenMath. It encodes not only individual formulae but also the mathematical relationships between them. Like OpenMath, the OMDoc format is not widely used among authors of scientific documents, and thus it is not possible to benefit from its advantages immediately.

²⁰ http://openmath.activemath.org/ ²¹ http://www.omdoc.org/


2.2 Sources of MathML in Digital Libraries

Although MathML is often used by digital mathematical libraries and mathematical search engines as a common standard for data interchange, the code itself can be produced by a variety of different tools. Unfortunately, due to the diversity of implementations, differences in input formats and the variability of encoding inherent to Presentation MathML, the MathML code produced by various tools differs even for trivial formulae, as can be seen in Appendix B. These differences consequently make it harder to search for equal and similar expressions.
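The effect can be reproduced with a small Python sketch. The two encodings of a + b below are constructed for illustration and are not the output of any particular tool; `shape` and `normalize` are hypothetical helpers, and the only normalization attempted is the removal of `<mrow>` wrappers that carry no grouping information.

```python
import xml.etree.ElementTree as ET

# Two hypothetical Presentation MathML encodings of the same formula a + b.
TOOL_A = "<math><mi>a</mi><mo>+</mo><mi>b</mi></math>"
TOOL_B = "<math><mrow><mrow><mi>a</mi></mrow><mo>+</mo><mi>b</mi></mrow></math>"


def shape(element):
    """Describe an element tree as a nested (tag, text, children) tuple."""
    text = (element.text or "").strip()
    return (element.tag, text, tuple(shape(child) for child in element))


def normalize(element):
    """Unwrap <mrow> elements that add no grouping information.

    Assumption of this sketch: an <mrow> is redundant if it has exactly
    one child, or if it is the only child of the <math> root.
    """
    for child in element:
        normalize(child)
    children = []
    for child in element:
        only_child_of_math = element.tag == "math" and len(element) == 1
        if child.tag == "mrow" and (len(child) == 1 or only_child_of_math):
            children.extend(list(child))
        else:
            children.append(child)
    element[:] = children
    return element


a, b = ET.fromstring(TOOL_A), ET.fromstring(TOOL_B)
print(shape(a) == shape(b))                        # False
print(shape(normalize(a)) == shape(normalize(b)))  # True
```

Before normalization a structural comparison fails although both fragments render identically; after it the two trees coincide. Real data shows many more kinds of variation than this single rule covers.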

2.2.1 Tralics

If the basic source of mathematical expressions is TeX or one of its variants, one of the possible tools for conversion from TeX to MathML is Tralics²² – a LaTeX to XML translator [Gri10]. This tool provides only Presentation MathML on its output (see Appendix B.2). However, one of its great strengths is its reimplementation of a full TeX macro interpreter, which makes it easier to use in digital libraries, where papers are provided by authors who make extensive use of their own complex macros. [Bou08] Good experience with the Tralics tool in the CEDRAM project [Bou08] has also influenced other projects. Currently, Tralics is also used in the DML-CZ²³ [Růž08] and EuDML²⁴ [Soj+12] projects.

2.2.2 LaTeXML

An alternative approach to Tralics is used by the LaTeXML²⁵ tool. LaTeXML is currently under development and is used extensively in the arXMLiv project²⁶ [Sta+10]. In contrast to Tralics, LaTeXML is a re-implementation in Perl of a TeX engine for a large fragment of the TeX language, and the compilation of TeX inputs is done using the semantics of the original macros. The semantics of the original commands are consequently used to produce not only Presentation MathML but also Content MathML (see Appendix B.3) on the output. However, LaTeXML-specific bindings for the various LaTeX style files have to be implemented to increase the coverage of the translation process. [Sta+10]

²² http://www-sop.inria.fr/marelle/tralics/ ²³ http://www.dml.cz/ ²⁴ http://eudml.org/ ²⁵ http://dlmf.nist.gov/LaTeXML/ ²⁶ http://arxmliv.kwarc.info/

2.2.3 InftyReader and MaxTract

To cover both new and classic publications, current digital mathematical libraries have to cope not only with new digitally typeset papers but also with archival digital documents and even with old publications that were available only on paper and were subsequently digitized. In fact, digitized archival documents, not the new ones, form the majority of the content of current digital mathematical libraries.

To allow a full text search of digitized documents, they are processed using optical character recognition (OCR) software. However, the majority of OCR tools are not able to recognize and correctly convert mathematical expressions, which are a vital part of mathematical documents. The only OCR tool I am aware of that is able to recognize mathematical formulae and save them in an appropriate format is InftyReader²⁷ [Suz+03] (see Appendix B.4). For this reason, InftyReader is incorporated into the EuDML workflow. [Soj+12]

“Retro-born-digital” documents stand between the oldest documents, available on paper only, and the newest ones, prepared with the needs of a digital library in mind. Retro-born-digital documents were typeset using a computer, usually using TeX in the case of mathematical papers. However, the digital library does not have access to the source code but only to the resulting PDF (usually prepared for printing). The OCR approach is possible in this case, but with digitally born documents more can be done. The MaxTract²⁸ tool [BSS12] was developed and deployed in the EuDML [Soj+12] to transform retro-born-digital full texts into formats suitable for further processing in the digital library. Among others, Presentation MathML is generated from the formulae found (see Appendix B.5).

2.2.4 Other Sources

TeX is certainly not the only source of MathML in the world. Especially when formulating their queries, users can use a variety of other tools. One can prepare MathML by hand (see Appendix B.1), use MathML-encoded results of previous computations in mathematical software such as MATLAB²⁹ (see Appendix B.6) or use results found using Wolfram Alpha³⁰ (see Appendix B.7). We have to keep all these and other possibilities in mind when preparing a maths search system.

²⁷ http://www.inftyreader.org/ ²⁸ http://cs.bham.ac.uk/research/groupings/reasoning/sdag/maxtract.php ²⁹ http://www.mathworks.com/products/matlab/ ³⁰ https://www.wolframalpha.com/

2.3 Digital Mathematical Libraries and Search Engines

2.3.1 NIST DLMF

In 2002, [MY03] reported on the existence of the NIST Digital Library of Mathematical Functions (DLMF) [NIS12], with the ability to search through its contents not only for text but also for mathematical expressions.

The main purpose of the DLMF is to concentrate standardized definitions of important mathematical formulae, graphs and tables in a single place for referencing, and to make them available in printed and electronic format. Providing both printed and electronic versions, together with the new search capabilities, placed special requirements on the input data and on the search engine itself. Due to the mathematical nature of the DLMF, it was decided that a traditional full text search is not sufficient and that the search engine implementation needs to understand the structure and meaning of mathematical formulae.

To obtain semantically structured mathematical formulae directly from their authors – surely the best source of that information – a special LaTeX document class was implemented that provides macros for modestly semantic annotations of typeset formulae. [MY03]

For the search, a special query language was created. The language, based on mathematical patterns, was designed to be flexible. It was explicitly specified that TeX syntax should not be the only encoding possible for mathematical queries. The matching process of the search engine has to be aware of the mathematical properties of formulae such as commutativity, associativity etc. It was also stated [MY03, p. 2] that a deep understanding of the content should make it possible to generate “virtual documents”, i.e. to generate a list of mathematical objects with certain parameters.
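What awareness of commutativity and associativity can mean in practice is sketched below in Python. This is not the DLMF implementation: the expression trees are plain nested tuples, and the function `canonical` and the set of operators are assumptions made for the illustration.

```python
# Operators treated as both commutative and associative in this sketch.
COMMUTATIVE_ASSOCIATIVE = {"+", "*"}


def canonical(expr):
    """Return a canonical form of an expression tree.

    An expression is either a string (a symbol) or a tuple
    (operator, operand, operand, ...). Nested applications of the same
    associative operator are flattened and the operands of commutative
    operators are sorted, so equivalent expressions become identical.
    """
    if isinstance(expr, str):
        return expr
    operator, *operands = expr
    operands = [canonical(operand) for operand in operands]
    if operator in COMMUTATIVE_ASSOCIATIVE:
        flat = []
        for operand in operands:
            if isinstance(operand, tuple) and operand[0] == operator:
                flat.extend(operand[1:])
            else:
                flat.append(operand)
        operands = sorted(flat, key=repr)
    return (operator, *operands)


# a + (b + c) and (c + b) + a have the same canonical form.
print(canonical(("+", "a", ("+", "b", "c"))))  # ('+', 'a', 'b', 'c')
print(canonical(("+", ("+", "c", "b"), "a")))  # ('+', 'a', 'b', 'c')
# Subtraction is not commutative, so a - b and b - a stay different.
print(canonical(("-", "a", "b")) == canonical(("-", "b", "a")))  # False
```

If both indexed formulae and queries are brought into such a form before comparison, the order in which an author happened to write the operands no longer affects matching.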

2.3.2 MathDex

One of the first mathematically aware full text search engines was the MathDex³¹ system. The home page of this project is no longer accessible and the project itself seems to be inactive, but the theoretical principles and techniques it used were based on [MY03] and summarized in [MM07]. [WME09] provides a useful summary of the key elements of the MathDex engine:

MathDex is the oldest mathematical aware full text search engine based on Apache Lucene search engine. It went public at the beginning of 2007. The key features are: 1) support for semantically poor documents, 2) accepting different types of mathematical encoding, 3) allows searching on both mathematical notation and text, 4) attempts to match user text search expectations rather than strictly following the query.

[MM07] highlights the problems with data that is available on the current web. Mathematics in the documents is often hard to identify, extract and decode, as a variety of different formats is in use. For this very reason, some kind of normalization of mathematical input data seems to be vital for the search engine functionality [MM07, p. 349]:

[W]hen we began testing it, we were somewhat surprised to discover that artifacts of encoding, conversion, authoring tools and author coding choices completely dominated, rendering the [search] algorithm virtually useless.

³¹ http://www.mathdex.com/


As a solution, the authors suggest the use of a multi-pass normalization algorithm that attempts to produce a canonical MathML representation for equivalent mathematical notations. Correcting poor or ambiguous MathML was found to be the most technically challenging part. It consists of correcting common errors in MathML code that are visually acceptable and thus often written by authors or produced by conversion tools, such as f(z) = w encoded with (z) = w grouped in a single element. To cope with the high variability in XML nesting from different sources, heuristic techniques of enriching MathML were used. [MM07, pp. 351–352] mentions pairing fence operators and refining the MathML structure by adding grouping elements. At the end of the processing, all mathematical “synonyms” were replaced by the chosen canonical representative. However, canonicalization of mathematical synonyms was done at a rather elementary level in the MathDex system, and the authors recommend improving this step of their algorithm in the future.

[MM07, p. 353] draws attention to the importance of handling variable names. For the MathDex system, the authors defined equivalence classes of variable names based on the tradition of their use across different areas of mathematics, and they used them to build search indexes allowing searches not only for the exact match but for other variable names from the same class as well. It was suggested that this simple system be reworked in the future into a more robust one.

As the next step of the normalization algorithm, mathematical normalization was identified, i.e. finding mathematically similar expressions by using numerical and symbolic methods. This could be supported through the use of specialized software tools such as computer algebra systems (Maple, MATLAB, Mathematica etc.). However, this was beyond the scope of the MathDex project and only a few concessions towards mathematical equivalence were made.
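The idea of equivalence classes of variable names can be sketched in Python as follows. The concrete classes used by MathDex are not given here, so the classes, the token representation and the helper names below are purely hypothetical.

```python
# Hypothetical equivalence classes of variable names; the real MathDex
# classes are not reproduced in this text.
VARIABLE_CLASSES = {
    "FUNCTION": "fgh",
    "UNKNOWN": "xyz",
    "INDEX": "ijk",
}
CLASS_OF = {
    name: class_name
    for class_name, names in VARIABLE_CLASSES.items()
    for name in names
}


def generalize(tokens):
    """Replace every known variable name by the name of its class."""
    return [CLASS_OF.get(token, token) for token in tokens]


def index_keys(tokens):
    """Return both the exact and the generalized key of a formula."""
    return {" ".join(tokens), " ".join(generalize(tokens))}


document_formula = list("f(x)")  # ['f', '(', 'x', ')']
query_formula = list("g(y)")     # ['g', '(', 'y', ')']

# The exact keys differ, but the generalized key is shared.
print(index_keys(document_formula) & index_keys(query_formula))
# {'FUNCTION ( UNKNOWN )'}
```

Indexing both keys keeps exact matches distinguishable from matches that agree only up to the variable class, so the former can be ranked higher.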

2.3.3 EgoMath

The EgoMath search engine³² was developed at Charles University in Prague with a special focus on indexing and searching through digital mathematical content. The authors summarise the methods used in [MG08; MG11]. Mathematical data is stored in MathML encoding in both Presentation and Content form. The MathML is generated by LaTeXML (see Section 2.2.2). Although the system is built as a general purpose tool for web content, the evaluation was done on small sets of Connexions (CNX)³³ and arXiv³⁴ data [MG08]; the current version uses mathematical formulae in TeX notation from Wikipedia.org³⁵ articles as its primary data set for indexing and searching [MG11]. This makes its use somewhat limited, and it is not clear how successful it would be as a tool on data from different sources.

³² http://egomath.projekty.ms.mff.cuni.cz/ ³³ http://cnx.org/ ³⁴ http://arxiv.org/ ³⁵ https://www.wikipedia.org/


At the time of writing³⁶ the search interface at http://egomath.projekty.ms.mff.cuni.cz/ was unable to answer queries and returned the error message: “Backend instance exception: HTTP error: HTTP request failed”. Thus, the status and future of the project are not clear.

2.3.4 LaTeX Search / LaTeX SpeedSearch

LaTeX Search³⁷ is a LaTeX code search engine provided by Springer³⁸. In contrast to the other maths search engines described in this section, there is little information available about the methods used in its implementation. I am not aware of any publication on LaTeX Search. The most comprehensive information available is in the “About” section of the homepage of the system [Spr12].

The system searches over LaTeX sources of “content available from Springer’s corpus of literature” [Spr12] without any further specification.³⁹ According to both [Spr12] and the tests of the system I conducted, the system seems to depend solely on the availability of the original LaTeX source code of the publications. A search is based on string comparison of the LaTeX source code of the mathematical expressions in the corpus with the user’s query. The query itself cannot be enriched with text keywords: a LaTeX-encoded formula is the only supported format.

The system can work either in exact match mode or in similarity search mode, which is supported by unspecified similarity algorithms. The algorithm probably works only on an identifier substitution basis. The number of substitutions made is indicated in the results list. No advanced mathematically aware processing (commutativity of operators, for example) seems to be in use.

For the exact match mode of operation, a special version of the web application called LaTeX SpeedSearch⁴⁰ is available. The name of the mode reflects the significantly faster response of the exact match system in comparison with the similarity search system, which has noticeable delays.
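Since the similarity algorithm is unspecified, the following Python sketch only illustrates what matching on an identifier substitution basis could look like; it is an assumption and not a description of the Springer system. The tokenizer and the function `substitutions` are hypothetical.

```python
import re

# Naive LaTeX tokenizer: control sequences, single letters, numbers and
# any other non-blank character.
TOKEN = re.compile(r"\\[A-Za-z]+|[A-Za-z]|\d+|\S")


def substitutions(query, candidate):
    """Count the identifier renamings that turn `query` into `candidate`.

    Returns None when the two formulae differ in more than a consistent,
    one-to-one renaming of single-letter identifiers.
    """
    query_tokens = TOKEN.findall(query)
    candidate_tokens = TOKEN.findall(candidate)
    if len(query_tokens) != len(candidate_tokens):
        return None
    mapping = {}
    for q, c in zip(query_tokens, candidate_tokens):
        if len(q) == 1 and q.isalpha() and len(c) == 1 and c.isalpha():
            if mapping.setdefault(q, c) != c:
                return None  # one identifier renamed in two different ways
        elif q != c:
            return None  # structure, operator or number differs
    if len(set(mapping.values())) != len(mapping):
        return None  # two identifiers collapsed into one
    return sum(1 for q, c in mapping.items() if q != c)


print(substitutions("x^2+y^2", "a^2+b^2"))            # 2
print(substitutions(r"\frac{a}{b}", r"\frac{a}{b}"))  # 0
print(substitutions("x^2+y^2", "a^2+a^2"))            # None
```

Such a comparison knows nothing about mathematics: x + y and y + x are related only by renaming both identifiers, not by commutativity.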

2.3.5 ActiveMath

The ActiveMath⁴¹ project focuses on research and development in the field of mathematics e-learning. As part of this project, a mathematical search engine was developed [LM06]. The tool follows [You05] for mathematical content items and formulae.

However, the system was designed and implemented for indexing and searching through semantically annotated mathematical expressions in OMDoc/OpenMath encoding (see Sections 2.1.3 and 2.1.4) specifically prepared for the needs of this e-learning platform. This mathematical representation is not in common use in real world mathematical documents and scientific papers, which significantly limits the area of operation of this tool.

³⁶ 2013-01-11 ³⁷ http://latexsearch.com/ ³⁸ https://www.springer.com/ ³⁹ During the search it is stated that more than 1 million equations are searched. The homepage itself currently (2013-01-12) says “7,578,254 LaTeX code snippets”. ⁴⁰ http://latexsearch.com/litehome.do ⁴¹ http://www.activemath.org/

2.3.6 MathWebSearch

MathWebSearch⁴² is a content-oriented search engine for mathematical formulae. The history of the system and the methods it uses for indexing and searching are summarized in [KMP12; Koh+08; KS06].

Similarly to the ActiveMath system (see Section 2.3.5), MathWebSearch builds upon semantic data and thus faces similar problems with the lack of semantically annotated data. To deal with this issue, the MathWebSearch project uses the LaTeXML tool (see Section 2.2.2) to transform mathematical expressions from LaTeX to MathML with not only Presentation but also Content markup. Currently, the system operates on a database of papers from the arXMLiv project⁴³ [Sta+10].

At first, MathWebSearch was able to search only for mathematical formulae. In contrast to other systems, MathWebSearch does not use text tokens to represent formulae in the input documents but builds upon substitution trees. Text search capabilities were added later, in parallel to the formula search, over a separate index. [Koh+08] This modification is reported to noticeably improve the relevance of the search results.

2.3.7 EuDML

The European Digital Mathematics Library project⁴⁴ (EuDML) [Bor+11] is

[…] a CIP project of 15 partners to build the European Digital Mathematics Library […] the European Digital Mathematics Library strives to make the significant corpus of mathematics scholarship published in Europe available online, in the form of an authoritative and enduring digital collection, developed and curated by a network of institutions. […] ([EuD13])

One of the content providers to the EuDML project is the Czech Digital Mathematics Library (DML-CZ)⁴⁵ [Bar+08; Rák11; Kre08], in whose development and maintenance I participate. As part of the EuDML project, a mechanism for searching for maths expressions in the contents of the digital library was planned [Syl+10]. The MIaS system (see Section 2.4) was selected and then integrated into the development version of the EuDML system.

⁴² https://trac.mathweb.org/MWS/ ⁴³ http://arxmliv.kwarc.info/ ⁴⁴ http://eudml.org/ ⁴⁵ http://www.dml.cz/


2.4 MIaS and WebMIaS

In 2010, Martin Líška designed and implemented the Math Indexer and Searcher (MIaS). The aim of the system is to index mathematical documents and to enable searching for mathematical formulae throughout the database of these documents. To allow user-friendly interaction between the researcher and the system, it includes a web user interface called WebMIaS as another component. The system was designed after evaluating existing systems [Líš10] and it remains under continuous development⁴⁶ [SL11a; SL11b]. MIaS was also integrated as the engine for maths formula searches in the EuDML project (see Section 2.3.7).

In contrast to some other systems (see Section 2.3), (Web)MIaS is being developed specifically for indexing and searching real world scientific papers without any special semantic markup provided by their authors. Currently, Presentation MathML is used for encoding mathematical expressions, but the use of Content MathML is under development. To conduct tests of the system, it was necessary to build the MREC dataset [Lí+11; SLR11] based on data of the arXMLiv project⁴⁷ [Sta+10].

(Web)MIaS is of special interest to my thesis as it forms the base development platform for our maths information retrieval research team at FI MU⁴⁸, of which I am a member. To address the weaknesses of Presentation MathML for similarity search (see Section 2.2), normalization methods are under development [For+12; Růž+12]. Further research in this area will be a part of my thesis. The (Web)MIaS platform can also be used for the development and testing of the other aims of my research (see Section 3).

2.4.1 The MIR Happening and the NTCIR Math Task Competition

An important task of research in the field of information retrieval is the proper evaluation of proposed systems. The evaluation of mathematical information retrieval systems, however, is still in its early stages. Nevertheless, some attempts to evaluate the MIaS system were made at the Mathematics Information Retrieval (MIR) Happening⁴⁹ that took place in Bremen, Germany in 2012. This happening was part of the MIR Workshop⁵⁰ and the Conferences on Intelligent Computer Mathematics (CICM) 2012⁵¹. A summary of the results and an analysis of the state of the art of MIR evaluation can be found in the recent publication [Líš13]. Further evaluation is in progress within the NTCIR Math Task competition [NTC12].

⁴⁶ https://mir.fi.muni.cz/mias/ ⁴⁷ http://arxmliv.kwarc.info/ ⁴⁸ https://mir.fi.muni.cz/ ⁴⁹ http://www.cicm-conference.org/2012/cicm.php?event=mir&menu=happening ⁵⁰ http://www.cicm-conference.org/2012/cicm.php?event=mir&menu=general ⁵¹ http://www.cicm-conference.org/2012/


3 Thesis Goals

In the previous sections I provided samples of digital libraries for scientific mathematical literature. From the user’s point of view, digital libraries have several important functions, including the

• long term preservation of their content in appropriate digital form,

• persistent identifiers of digital objects usable for citations,

• guarantee of the level of quality that a user can expect based on the library policy for selecting the sources of its content,

• categorization and organization of digital objects,

• reader access to library content in a way that maximises the usability and usefulness of the content.

In my thesis I will focus on the last of these points with specific reference to digital mathematical libraries.

3.1 Objectives and Expected Results

The usefulness of both classic and digital libraries depends on the ability of a library to provide users with all of its content that they could possibly be interested in. This is definitely no easy task.

Classic libraries developed complex cataloging systems for the description and categorisation of documents. Books were physically organized on shelves according to their topics and other criteria. Besides wandering among the shelves, users willing to find something in the library could consult card catalogs.

With the advent of computers, card catalogs were transformed into digital databases. It is much easier, faster and more comfortable to search through a library catalog in digital form than to browse through drawers full of small paper cards. However, the digital catalog still contains just a limited amount of metadata about the documents, which themselves can only be seen once found on the library shelves.

Documents from current digital libraries are available to anyone anywhere with ready access to the Internet. Their digital form has more advantages over their physical counterparts than just remote availability. In addition to the wide range of categorization possibilities based on document metadata, the full text itself is accessible in digital form. With the presence of full text, a very powerful new tool is available – the full text search.

Full text search engines are widely used and mature for text documents. Millions of people find documents relevant to their interests in the depths of the world wide web using famous full text web search engines such as Google, Bing or Yahoo! Search. However, specialized digital libraries place specific demands on the search engine.

Much of the most important content in mathematical documents is mathematical formulae. It is natural that users of mathematical digital libraries want to use a full text search engine not only for the text that surrounds formulae. Importantly, math formulae also have great potential to fulfil a discriminative function in search. General text search engines are not suitable for this purpose and all of the existing mathematical search engines described in Section 2.3 have their limitations.

3.1.1 MathML Normalization

Section 2.1 describes various languages for the description of maths. The one most used in real systems and documents is MathML. Its main problems are that Presentation MathML lacks the semantics necessary for the proper handling of formula equivalence, which a mathematical search engine needs, and that its coding is highly variable (see Section 2.2). Although the authors themselves are undoubtedly the best source of semantic information, it does not seem possible to persuade them to add semantic annotations to their everyday documents. As mentioned in Section 2.3.1, a LaTeX document class for semantic annotations was developed as part of the DLMF effort. However, in practice there is almost no use of this or any similar tool. The primary intention of authors is to present formulae visually. Even the slightest semantic annotation means extra work for them. They derive no direct additional value from it and so it is omitted.

Given the above, it is important to be able to unify different codings of semantically equivalent Presentation MathML or to “understand formulae” – convert them to Content MathML or OpenMath – in real world mathematical documents. As early as [MY03], normalization of different codings of equivalent formulae was identified as necessary. However, the DLMF approach of a sorted parse tree normal form [MY03, p. 15] seems to be too simple to unify equivalent formulae with more than a trivial difference in their coding.

As described in Section 2.4, our MIR research team has developed a Presentation MathML normalization tool [Růž+12] for our maths search engine. As part of my thesis, this normalization tool will be improved. We will try to handle real world Presentation MathML better or, preferably, to convert Presentation MathML to Content MathML automatically, which should be feasible as the promising results of [Ngh+12] indicate.
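To make the idea of unifying different Presentation MathML codings concrete, the following is a minimal sketch of a normalizer. It is not the normalization tool of [Růž+12]; the three rules, the set `IGNORED_ATTRIBUTES` and all function names are assumptions chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# Attributes treated as irrelevant for matching (an assumption of this sketch).
IGNORED_ATTRIBUTES = {"id", "class", "style", "xref"}


def local_name(tag):
    """Strip an XML namespace prefix such as '{http://www.w3.org/1998/Math/MathML}'."""
    return tag.rsplit("}", 1)[-1]


def normalize(elem):
    """Return a normalized copy of a Presentation MathML element tree."""
    children = [normalize(child) for child in elem]
    name = local_name(elem.tag)
    # Rule 1: an <mrow> wrapping a single child adds no structure, so unwrap it.
    if name == "mrow" and len(children) == 1:
        return children[0]
    # Rule 2: drop attributes that do not affect the formula itself.
    kept = {k: v for k, v in elem.attrib.items() if k not in IGNORED_ATTRIBUTES}
    out = ET.Element(name, kept)
    # Rule 3: trim whitespace around token content such as <mi> x </mi>.
    out.text = (elem.text or "").strip() or None
    out.extend(children)
    return out


def canonical_string(mathml):
    """Serialize the normalized tree; codings unified by the rules compare equal."""
    return ET.tostring(normalize(ET.fromstring(mathml)), encoding="unicode")


hand_made = "<math><mrow><mi>x</mi><mo>+</mo><mn>1</mn></mrow></math>"
converted = '<math><mrow><mrow><mi id="a"> x </mi></mrow><mo>+</mo><mn>1</mn></mrow></math>'
print(canonical_string(hand_made) == canonical_string(converted))  # prints True
```

The two sample inputs stand for a hand-written formula and the output of a hypothetical converter. A string comparison of the canonical forms is only one possible use; an indexer could equally store the normalized tree.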
It should be noted that even Content MathML may need some kind of normalization of its coding, depending on its source. For example, Listing 1 shows two constructions for the “∈” operation, both of which are valid Content MathML.


Listing 1: Two possible codings of β ∈ F×

Although the former format is definitely better, the latter form is present in the real world data⁵² used for the NTCIR Math Task competition [NTC12]. As we want, and have, to process real world data, we need tools capable of coping with even this kind of issue and unifying different codings for indexing by the search engine. As part of my thesis, the necessity of normalizing Content MathML will be explored and evaluated, and the normalization will be implemented in the normalization tool.
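As an illustration of what Content MathML normalization could look like, the sketch below rewrites empty pragmatic operator elements such as `<in/>` into the equivalent `<csymbol>` form. The two codings used here are assumptions chosen for illustration and are not taken from Listing 1; the table `STRICT_SYMBOLS` is a deliberately small, hypothetical subset.

```python
import xml.etree.ElementTree as ET

# Pragmatic element name -> (OpenMath content dictionary, symbol name).
# Illustrative subset only; a real tool would need the full MathML 3 mapping.
STRICT_SYMBOLS = {
    "in": ("set1", "in"),
    "plus": ("arith1", "plus"),
    "eq": ("relation1", "eq"),
}


def to_strict(elem):
    """Rewrite empty pragmatic operator elements as <csymbol>, in place."""
    for child in elem:
        to_strict(child)
    if elem.tag in STRICT_SYMBOLS and len(elem) == 0:
        cd, name = STRICT_SYMBOLS[elem.tag]
        elem.tag = "csymbol"
        elem.attrib.clear()
        elem.set("cd", cd)
        elem.text = name
    return elem


def unify(content_mathml):
    """Parse, rewrite and serialize a Content MathML fragment."""
    return ET.tostring(to_strict(ET.fromstring(content_mathml)), encoding="unicode")


pragmatic = "<apply><in/><ci>b</ci><ci>F</ci></apply>"
strict = '<apply><csymbol cd="set1">in</csymbol><ci>b</ci><ci>F</ci></apply>'
print(unify(pragmatic) == unify(strict))  # prints True
```

Mapping everything to one form before indexing means the search engine only ever sees a single coding of each operator. Which direction the rewrite should take is a design choice that this sketch does not settle.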

3.1.2 Classification of Identifiers

In a field as vast as the whole of mathematics, I believe it is not possible to prepare universal rules for grouping identifiers for investigating the equivalence of formulae as mentioned in Section 2.3.2. The very basic ones, e.g. i, j, k for use in sub- and superscripts, may be an exception. In my thesis I intend to develop a system that will be able to classify the types of identifiers in the documents over which the search will be conducted, i.e. it will be able to mark particular identifiers as a variable name, function name, and so on. This classification will be based on the topic and research area of each document and the notation common to that field. Part of my thesis research will involve developing an exact approach. The resulting annotated formulae should then be the input of the indexing engine.

⁵² http://arxmliv.kwarc.info/files/0712/0712.3704/0712.3704.

During a search, the search engine itself should not be restricted in substituting identifier names in the general case where no further information is available. However, the system should be able to accept user instructions on formula structure. Thus, the user will be able to use placeholders for variables, functions and so on to draw up the skeleton of the desired formulae.

This possibility will challenge the search system to properly rate and sort the resulting list and to balance exact match restrictions against the presentation of all the similar constructions found that could be useful to the user. A thorough investigation will be conducted in this area as it will strongly affect the relevance of the results.

To fully exploit the advantages of the classified identifiers in the document database, I will prepare, evaluate and implement techniques to automatically classify the types of identifiers in the user query and use the data to both

• modify the scores of the results found by the search engine, and

• give feedback to the user suggesting possible modifications to the query.

This is expected to be a more challenging [Soj12] and complex task than MathML normalization. The majority of user queries will be just short snippets with very limited context information usable for query analysis. I believe these shortcomings can be overcome by proper investigation of the users and their interaction with the search system as described in the next section.
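The use of placeholders to draw up the skeleton of a formula can be sketched as a small matching routine. The representation of formulae as nested tuples, the `?name` placeholder syntax and the function `match` are all assumptions made for this illustration; they do not describe an existing query language of the search system.

```python
def match(pattern, formula, bindings=None):
    """Return placeholder bindings if the formula fits the skeleton, else None.

    Formulae are nested tuples such as ("plus", "a", "b"); a string starting
    with "?" is a placeholder that matches any subterm.
    """
    bindings = dict(bindings or {})
    if isinstance(pattern, str) and pattern.startswith("?"):
        # The same placeholder must stand for the same subterm everywhere.
        if pattern in bindings and bindings[pattern] != formula:
            return None
        bindings[pattern] = formula
        return bindings
    if isinstance(pattern, tuple) and isinstance(formula, tuple):
        if len(pattern) != len(formula):
            return None
        for sub_pattern, sub_formula in zip(pattern, formula):
            bindings = match(sub_pattern, sub_formula, bindings)
            if bindings is None:
                return None
        return bindings
    # Fixed symbols (operators, numbers, named functions) must match exactly.
    return bindings if pattern == formula else None


skeleton = ("plus", ("power", "?x", "2"), ("power", "?y", "2"))
print(match(skeleton, ("plus", ("power", "a", "2"), ("power", "b", "2"))))
print(match(("plus", "?x", "?x"), ("plus", "a", "b")))
```

The first call binds `?x` to `a` and `?y` to `b`; the second returns `None` because one placeholder would have to stand for two different identifiers. A classifier of identifier types could restrict further what each placeholder is allowed to bind to.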

3.1.3 Context Driven Search

Using context during a search seems to me to be the most promising way of improving maths search engine results. At present, I suggest implementing and evaluating the following techniques:

1. The user will be able to specify areas of expected results directly by choosing from a predefined list of possible topics.

2. The system will store information about user interests (documents used in the past) and directly use information from the user’s account profile (such as areas of interests).

3. The users will be able to provide samples of documents from the area they are interested in.


The topic list for Point 1 will be built dynamically based on the metadata of the content of the index over which the search is conducted. I will also test the possibility of using automated tools such as Gensim⁵³ [RS10; RS08; SR07]. Point 3 seems to be the most promising and will be an important part of my thesis. I believe context is an important piece of information and I will use it to significantly reduce the ambiguity in short user queries.

Knowledge of context can be used to infer the scientific area of the user query. This can in turn be used to estimate which elements of a formula should be fixed, e.g. the usual notation of well known functions often used in the area, and which, conversely, should be substitutable by an arbitrary element or by elements from a given set where various notations are in common use.

This context will also be used to improve the ranking of results, which seems to be of great importance as mentioned in Section 2.4.1. A context aware search system can favour a certain group of documents from the index during a search and raise their ranking in the sorted result set presented to the user.

To keep the search system as simple and user friendly as possible, we should not expect users to be willing to provide sample documents by entering URLs or even uploading them to the search engine application. Just as authors preparing scientific articles target print output and ignore semantic annotations as extra work with no benefit for print, we cannot expect our users to do significantly more than they are used to doing in the very simple user interfaces of famous search services like Google. However, we can expect the users to do research, i.e. we can expect them to start with a fairly general query, inspect the results and reuse the collected information to refine the next query until sufficient results are found.

In my thesis I will focus on unifying this workflow into one system that will allow the collection of this data and, especially, its use to build a context that can reduce the ambiguities of maths searching as described above.
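One way context could raise the ranking of a group of documents is sketched below. It assumes the context is simply the text of documents the user inspected earlier in the session and uses a plain bag-of-words cosine similarity; the function names, the `weight` parameter and the scoring formula are hypothetical and stand in for whatever model (for example one built with Gensim) would actually be evaluated.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity of two bag-of-words Counters."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def rerank(results, context_docs, weight=0.5):
    """Boost results that resemble the context documents.

    results: list of (doc_id, base_score, text) from the search engine.
    context_docs: texts the user inspected earlier (assumed session data).
    """
    context = Counter()
    for text in context_docs:
        context.update(text.lower().split())
    rescored = [
        (doc_id, score * (1 + weight * cosine(Counter(text.lower().split()), context)))
        for doc_id, score, text in results
    ]
    return sorted(rescored, key=lambda item: item[1], reverse=True)


results = [
    ("B", 1.0, "heat equation on a plate"),
    ("A", 1.0, "group theory of finite fields"),
]
print(rerank(results, ["finite fields and galois theory"]))
```

With equal base scores, document A moves ahead of B because it shares vocabulary with the context. The multiplicative boost keeps the original engine score dominant when `weight` is small, which is one possible way to balance the two signals.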

3.1.4 Involvement of Computer Algebra Systems

Several authors [MY03; MM07] claim that advanced maths equality search will need the involvement of computer algebra systems. As far as I know, however, this approach is still not used in practice. Thus, as the next step towards better mathematical similarity search, I will test the potential of cooperation with different computer algebra systems and their integration with our maths search system. This will allow us to identify both simple and complex formulae as equivalent using numeric and symbolic methods.
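The numeric side of such an equivalence test can be sketched without any computer algebra system: two expressions are evaluated at random points and compared. This is only a stand-in written in plain Python, with the function name, sampling interval and tolerance all chosen as assumptions; a numeric test of this kind can refute equivalence but cannot prove it, which is where symbolic methods of a real system would come in.

```python
import math
import random


def numerically_equal(f, g, variables, trials=20, tol=1e-9, seed=0):
    """Compare two expressions at random sample points.

    f, g: callables taking the named variables as keyword arguments.
    Returns False as soon as one sample point separates the expressions.
    """
    rng = random.Random(seed)
    for _ in range(trials):
        point = {name: rng.uniform(0.5, 2.0) for name in variables}
        if not math.isclose(f(**point), g(**point), rel_tol=tol, abs_tol=tol):
            return False
    return True


expanded = lambda x, y: x * x + 2 * x * y + y * y
factored = lambda x, y: (x + y) ** 2
sum_of_squares = lambda x, y: x * x + y * y

print(numerically_equal(expanded, factored, ["x", "y"]))        # prints True
print(numerically_equal(expanded, sum_of_squares, ["x", "y"]))  # prints False
```

The sampling interval avoids zero so that expressions differing only by a product term are told apart. Expressions with restricted domains would need a more careful choice of sample points.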

3.1.5 Image Search Experiment

A large part of the content of current digital mathematical libraries consists of old documents that were digitised from physical printed publications. The basic data obtained by this process are images of pages that have to be postprocessed into a format suitable for the digital library. The postprocessing usually involves optical character recognition (OCR) software, e.g. InftyReader (see Section 2.2.3), that analyses the image of the page and transforms the objects found in it into a digital representation of texts and mathematical formulae.

⁵³ https://mir.fi.muni.cz/gensim/

This processing led us to the idea of searching for similar mathematical formulae based on the similarity of images of the printed formulae. This could prove invaluable for old digitised publications, where it would then not be necessary to properly decode the meaning of a formula from its image in order to search for similar formulae.

As part of my thesis I will conduct experiments with image searching for images of mathematical formulae. We are currently negotiating cooperation with the MUFIN⁵⁴ Image Search⁵⁵ project [Nov+07; NBZ09] team. If the experiments indicate the feasibility of such image searches, the next step will be to implement them in a real maths search system.
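To give a rough idea of what comparing formula images without decoding them could mean, the sketch below uses a simple average hash and the Hamming distance. This technique is swapped in purely for illustration and is not the similarity model used by MUFIN; the representation of an image as a small grid of grayscale values already scaled to a common size is an assumption, as are both function names.

```python
def average_hash(pixels):
    """Return one bit per pixel: 1 if the pixel is brighter than the image mean.

    pixels: 2D list of grayscale values, assumed pre-scaled to a fixed size.
    """
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return tuple(1 if value > mean else 0 for value in flat)


def hamming(a, b):
    """Number of positions where two equally long hashes differ."""
    return sum(x != y for x, y in zip(a, b))


printed = [[0, 255], [255, 0]]
rescanned = [[10, 240], [250, 5]]   # same shape with scanning noise
different = [[255, 0], [0, 255]]

print(hamming(average_hash(printed), average_hash(rescanned)))  # prints 0
print(hamming(average_hash(printed), average_hash(different)))  # prints 4
```

A small distance suggests visually similar formulae. Whether any such coarse descriptor is discriminative enough for printed mathematics is exactly what the planned experiments would have to show.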

3.2 Expected Outputs

• Research on

– the necessity of Presentation and Content MathML normalization for a maths search system,
– content driven search for maths,
– the feasibility of image search for mathematical formulae

and their documentation in the text of my Ph.D. thesis.

• Practical implementation of the above in appropriate software tools (e.g. an Apache Solr⁵⁶ plugin) and their integration with a maths search system (e.g. (Web)MIaS, see Section 2.4).

• Evaluation of the implementations on a significantly large collection of real data such as MREC (see Section 2.4) or DML-CZ/EuDML contents (see Section 2.3.7).

• Publications. Possible conferences:

– CICM (Conference on Intelligent Computer Mathematics) http://trac.mathweb.org/CICM
– DocEng (The ACM Symposium on Document Engineering) http://www.documentengineering.org/
– TPDL (International Conference on Theory and Practice of Digital Libraries) http://www.tpdl.eu/
– ECIR (European Conference on Information Retrieval) http://irsg.bcs.org/ecir.php
– JCDL (ACM/IEEE-CS Joint Conference on Digital Libraries) http://www.jcdl.org/
– SIGIR (ACM Special Interest Group on Information Retrieval) http://sigir.org/
– CIKM (ACM Conference on Information and Knowledge Management) http://www.cikmconference.org/

⁵⁴ http://mufin.fi.muni.cz/ ⁵⁵ http://mufin.fi.muni.cz/imgsearch/ ⁵⁶ https://lucene.apache.org/solr/

3.3 Schedule

Spring 2013

• Research on MathML normalization tool.

• Research on content driven search of maths.

• Research on image search for mathematical formulae.

• Defence of the thesis proposal and doctoral exam.

Autumn 2013

• Research on content driven search of maths.

• Research on image searching for mathematical formulae and implementation if possible.

• Implementation and integration of improvements of MathML normalization tool.

Spring 2014

• Research and implementation of content driven search in a maths search system such as (Web)MIaS.

• Evaluation of the implementations.

• Writing the Ph.D. thesis.

Autumn 2014

• Writing the Ph.D. thesis.

• Finishing and submitting the thesis.

4 Results of Study

4.1 On Topic of the Ph.D. Thesis

In 2007 my involvement in the Czech Digital Mathematics Library (DML-CZ)⁵⁷ project began. I proposed and implemented a system for processing retro-born-digital documents, i.e. documents that are available in a digital format but not in a form suitable for direct use in the digital mathematical library. Processing focussed on the extraction of metadata and full texts of retro-born-digital documents. I then designed and implemented systems for editorial offices of mathematical journals able to generate DML-CZ metadata as a by-product of the preparation of the printed version of the journal [Růž08]. Since 2010 I have been working on the development of DML-CZ [RS10], its internal Metadata Editor [Mih+10; BKS08] software and other systems.

In February 2010 the European Digital Mathematics Library (EuDML)⁵⁸ project started, in which DML-CZ participates as a content provider (see Section 2.3.7). Our team at the Faculty of Informatics cooperates in EuDML development, especially on Work Package 7 – Metadata enhancer toolset implementation. One of the tools provided by our team is the MIaS maths search system and related technologies (see Section 2.4) that have been integrated into the EuDML project.

Currently, I am a member of the maths information retrieval research team at FI MU.⁵⁹

4.2 Outcomes

• LISKA, Martin and David FORMANEK and Michal RUZICKA and Petr SOJKA. MathML Canonicalizer (MathML Canonicalizer). (software) 2012.

• RUZICKA, Michal. Automated Processing of TeX-Typeset Articles for a Digital Library. In DML 2008: Towards Digital Mathematics Library. First edition. Brno: Masaryk University, 2008. p. 167–176. ISBN 978-80-210-4658-0.

• RUZICKA, Michal and Petr SOJKA. Data Enhancements in a Digital Mathematical Library. In DML 2010 Towards a Digital Mathematics Library. First edition. Brno: Masaryk University Press, 2010. p. 69–76, 8 pp. ISBN 978-80-210-5242-0. I am the author of the vast majority of the paper.

• FILEJ, Miha and Michal RUZICKA and Martin SARFY and Petr SOJKA. Metadata Editing and Validation for a Digital Mathematics Library. In DML 2010 Towards a Digital Mathematics Library. First edition. Brno: Masaryk University Press, 2010. p. 57–62, 6 pp. ISBN 978-80-210-5242-0.

⁵⁷ http://www.dml.cz/ ⁵⁸ http://eudml.org/ ⁵⁹ https://mir.fi.muni.cz/


My contribution is the first two sections and support for the integration of Miha’s software with the Metadata Editor.

• SOJKA, Petr and Martin LISKA and Michal RUZICKA. Building Corpora of Technical Texts : Approaches and Tools. In Aleš Horák, Pavel Rychlý. Fifth Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN 2011. First edition. Brno: Tribun EU, 2011. p. 71–82, 11 pp. ISBN 978-80-263-0077-9. For this paper I prepared the MathML normalization and canonicalization sections and partially the MREC description.

• LISKA, Martin and Petr SOJKA and Michal RUZICKA and Peter MRAVEC. Web Interface and Collection for Mathematical Retrieval : WebMIaS and MREC. In Petr Sojka, Thierry Bouche. DML 2011: Towards a Digital Mathematics Library. Brno: Masaryk University, 2011. p. 77–84, 8 pp. ISBN 978-80-210-5542-1. I contributed to the paper with a description of the normalization of MathML.

• RUZICKA, Michal and Petr SOJKA. Redakční systém odborného časopisu s podporou exportu do digitální knihovny. Zpravodaj CSTUG, Brno: CSTUG, 2011, 1, p. 4–20, 17 pp. ISSN 1211–6661. 2011. doi:10.5300/2011-1/4. I am the main author of this publication.

• FORMANEK, David and Martin LISKA and Michal RUZICKA and Petr SOJKA. Normalization of Digital Mathematics Library Content. CEUR Workshop Proceedings, Aachen, 921, October, p. 91–103, 13 pp. ISSN 1613–0073. 2012. I cooperated in the formulation of the ideas of the article and contributed to the MathML Sources section.

4.3 Academic Work

• I have lectured PB029 – Electronic Document Preparation since 2006.

• I have been the supervisor of the Bachelor’s thesis of Antonín Crha on the topic Administrative interface of the Czechoslovak TeX Users Group since autumn 2011.

• I have been a consultant for the Bachelor’s thesis of Miroslav Hrdina on the topic Recognition of Mathematical Texts since spring 2011 and for the Bachelor’s thesis of Jan Hrubeš on the topic Implementation of PDF/A-1a format in PHP library and its deployment in the Filing Service since autumn 2012.


5 Summary/Souhrn

5.1 Summary

For a long time, libraries of scientific literature have been an integral part of academic and educational institutions. With the advent of computers and their rapid development, along with the common availability of fast internet access in recent decades, many documents have moved partially or completely into cyberspace. Digital libraries are in common use today and have gradually assumed the functions of classic libraries. Mathematics is no exception. There are several national digital mathematics libraries, and the European Digital Mathematics Library project is almost finished. Working mathematicians are looking forward to the emergence of the World Digital Mathematics Library.

The usefulness of digital libraries depends on their ability to provide users with all of their content that they could possibly be interested in. Sophisticated technologies and techniques have been developed for “textual” digital libraries over time. Unfortunately, these techniques are not fully suitable for the specific needs of mathematical content in digital mathematics libraries.

In my opinion, the most important missing feature is a robust implementation of mathematically aware full text search able to cope with mathematical publications as they are routinely produced by their authors, i.e. without semantic annotations.

In this Ph.D. thesis proposal I described several existing systems and the techniques and formats they use, and showed their limitations. These systems are not yet mature, but their adoption by recognised publishers such as Springer and by large digital mathematics libraries like EuDML confirms the need for them. Another indication of the growing importance of mathematics information retrieval (MIR) is the recent competitions among MIR systems.

As a member of the maths information retrieval research team at FI MU, I aim to use our Math Indexer and Searcher (MIaS) system and its underlying data set to implement and evaluate techniques for improving a mathematically aware search system; these techniques will be developed as part of my Ph.D. studies.

The first steps were taken towards the improvement of MathML similarity search by the canonicalization of Presentation MathML; the next will be taken towards Content MathML normalization. Special attention will be paid to opportunities to exploit context during the search in order to improve the relevance of results and give feedback to users suggesting possible modifications to their query. Several techniques are proposed for this.

Further improvement can be achieved through the involvement of computer algebra systems in maths equality search. Optical character recognition processing of old publications led us to the idea of experimenting with searches for similar mathematical formulae based on the similarity of images of the printed formulae.

Evaluation of the implementations will be done on a significantly large collection of real data. The results will be documented in the text of my thesis and published in peer reviewed international forums.

5.2 Souhrn

Scientific libraries have long been an integral part of academic and educational institutions. With the advent of computers, their rapid development and the general availability of fast internet access in recent decades, many documents have been moved partially or completely to the Internet. Digital libraries are commonly used today and are gradually taking over the functions of classic libraries. Mathematics is no exception. Several national digital mathematics libraries exist, the European Digital Mathematics Library project is almost finished, and mathematicians are looking forward to the birth of the World Digital Mathematics Library.

The usefulness of digital libraries depends on their ability to provide users with all of the content they might be interested in. Over time, sophisticated techniques and technologies have been developed for “textual” digital libraries. Unfortunately, these techniques are not fully applicable to the specific needs of the mathematical content of digital mathematics libraries.

In my opinion, the most important missing feature is a robust implementation of a mathematics-aware full text search engine able to cope with mathematical publications as they are routinely prepared by their authors, i.e. without semantic annotations.

In this thesis proposal I have described several existing systems, the technologies and formats they use, and their limitations. These systems are not yet mature, but the very fact that they are implemented by recognised publishers such as Springer and by large digital libraries such as EuDML shows that they are needed. Another sign of the growing importance of mathematical information retrieval is the recent competitions among these systems.

As a member of the team researching this area at FI MU, I want to make use of our Math Indexer and Searcher (MIaS) system and the associated data set. They will be used to implement and evaluate techniques and improvements for a mathematics-aware full text search engine that will be developed as part of my Ph.D. studies.

The first steps towards improving MathML similarity search were taken through the canonicalization of Presentation MathML. The next steps will lead to the normalization of Content MathML. Special attention has been paid to the possibility of using the search context to improve the relevance of results and to give users feedback on possible further modifications of the query. Several techniques have been proposed to achieve these goals.

Further improvement in similarity search for mathematical expressions can be achieved by involving specialised mathematical software. Optical character recognition used for processing old publications led us to the idea of experimenting with searching for similar formulae based on the similarity of images of their printed form.

The resulting implementation will be verified on a significantly large collection of real data. The results will be documented in the text of my dissertation and published in peer-reviewed international forums.


References

[Aus+10] Ron Ausbrooks et al. Mathematical Markup Language (MathML). Ed. by David Carlisle, Patrick Ion, and Robert Miner. Version 3.0. W3C Recommendation 21 October 2010. World Wide Web Consortium (W3C). 2010-10-21. URL: http://www.w3.org/TR/2010/REC-MathML3-20101021/ (visited on 2013-01-06).

[Bar+08] Miroslav Bartošek, Martin Lhoták, Jiří Rákosník, Petr Sojka, and Martin Sárfy. “DML-CZ: The Objectives and the First Steps”. In: CMDE 2006: Communicating Mathematics in the Digital Era. Ed. by Jonathan Borwein, Eugénio M. Rocha, and José Francisco Rodrigues. MA, USA: A. K. Peters, 2008, pp. 69–79.

[BKS08] Miroslav Bartošek, Petr Kovář, and Martin Sárfy. “DML-CZ Metadata Editor: Content Creation System for Digital Libraries”. In: Towards a Digital Mathematics Library. Ed. by Petr Sojka. Birmingham, UK: Masaryk University, 2008-07, pp. 139–151. ISBN: 978-80-210-4658-0. URL: http://dml.cz/dmlcz/702537 (visited on 2013-01-09).

[Bor+11] José Borbinha, Thierry Bouche, Aleksander Nowiński, and Petr Sojka. “Project EuDML—A First Year Demonstration”. In: Intelligent Computer Mathematics. Proceedings of 18th Symposium, Calculemus 2011, and 10th International Conference, MKM 2011. Ed. by James H. Davenport, William M. Farmer, Josef Urban, and Florian Rabe. Vol. 6824. Lecture Notes in Artificial Intelligence, LNAI. Bertinoro, Italy: Springer-Verlag, 2011-07, pp. 281–284. DOI: 10.1007/978-3-642-22673-1_21.

[Bou08] Thierry Bouche. “CEDRICS: When CEDRAM Meets Tralics”. In: Towards a Digital Mathematics Library. Ed. by Petr Sojka. Birmingham, UK: Masaryk University, 2008-07, pp. 153–165. ISBN: 978-80-210-4658-0. URL: http://dml.cz/dmlcz/702544 (visited on 2013-01-09).

[BSS12] Josef Baker, Alan P. Sexton, and Volker Sorge. “MaxTract: Converting PDF to LaTeX, MathML and Text”. In: Conferences on Intelligent Computer Mathematics (CICM 2012). Ed. by Johan Jeuring et al. Vol. 7362. Lecture Notes in Computer Science. Springer Berlin / Heidelberg, 2012, pp. 422–426. ISBN: 978-3-642-31373-8. DOI: 10.1007/978-3-642-31374-5_29.

[Bus+04] S. Buswell et al. The OpenMath Standard. Technical report. Version 2.0. The OpenMath Society, 2004-06. URL: http://openmath.activemath.org/standard/om20-2004-06-30/omstd20.pdf (visited on 2013-01-06).


[Car+01] D. Carlisle, J. Davenport, M. Dewar, N. Hur, and W. Naylor. Conversion between MathML and OpenMath. Bath/NAG. 2001. URL: http://openmath.activemath.org/documents/om-mml.pdf (visited on 2013-01-06).

[Cer12] D. Cervone. “MathJax: A Platform for Mathematics on the Web”. In: Notices of the AMS 59.2 (2012), pp. 312–316. URL: http://www.ams.org/journals/notices/201202/rtx120200312p.pdf (visited on 2013-01-06).

[DML13] Czech Digital Mathematics Library. Czech Digital Mathematics Library – project. 2013-03-31. URL: http://projekt.dml.cz/ (visited on 2013-01-14).

[EuD13] The European Digital Mathematics Library. EuDML. About Us. 2013-01-13. URL: https://project.eudml.org/about-us (visited on 2013-01-13).

[For+12] David Formánek, Martin Líška, Michal Růžička, and Petr Sojka. “Normalization of Digital Mathematics Library Content”. In: Joint Proceedings of the 24th OpenMath Workshop, the 7th Workshop on Mathematical User Interfaces (MathUI), and the Work in Progress Section of the Conference on Intelligent Computer Mathematics. (Bremen, Germany, 2012-07-09/2012-07-13). Ed. by James Davenport, Johan Jeuring, Christoph Lange, and Paul Libbrecht. CEUR Workshop Proceedings 921. Aachen, 2012, pp. 91–103. URL: http://ceur-ws.org/Vol-921/wip-05.pdf (visited on 2013-01-13).

[Gri10] José Grimm. “Producing MathML with Tralics”. In: Towards a Digital Mathematics Library. Ed. by Petr Sojka. Paris, France: Masaryk University, 2010-07, pp. 105–117. ISBN: 978-80-210-5242-0. URL: http://dml.cz/dmlcz/702579 (visited on 2013-01-09).

[Jip06] P. Jipsen. Text-based input formats for mathematical formulas. The Evolution of Mathematical Communication in the Age of Digital Libraries, IMA “Hot Topics” Workshop, USA. 2006. URL: http://math.chapman.edu/~jipsen/talks/IMA2006/JipsenIMA2006talk.pdf (visited on 2013-01-06).

[KMP12] Michael Kohlhase, Bogdan A. Matican, and Corneliu-Claudiu Prodescu. “MathWebSearch 0.5: Scaling an Open Formula Search Engine”. In: Intelligent Computer Mathematics. Ed. by Johan Jeuring et al. Vol. 7362. Lecture Notes in Computer Science. Springer Berlin Heidelberg, 2012, pp. 342–357. ISBN: 978-3-642-31373-8. DOI: 10.1007/978-3-642-31374-5_23. URL: http://dx.doi.org/10.1007/978-3-642-31374-5_23.

[Knu86] Donald E. Knuth. The TeXbook. Vol. A. Computers and Typesetting. Reading, MA, USA: Addison-Wesley, 1986, pp. ix + 483. ISBN: 0-201-13447-0.


[Koh+08] Michael Kohlhase, Ştefan Anca, Constantin Jucovschi, Alberto González Palomo, and Ioan A. Şucan. “MathWebSearch 0.4. A Semantic Search Engine for Mathematics”. Manuscript. 2008. URL: http://mathweb.org/projects/mws/pubs/mkm08.pdf (visited on 2013-01-12).

[Koh06] Michael Kohlhase. OMDoc–An Open Markup Format for Mathematical Documents [version 1.2]. Foreword by Allan Bundy. Ed. by J. G. Carbonell and J. Siekmann. Vol. 4180. Lecture Notes in Computer Science. Springer Berlin / Heidelberg, 2006-08. ISBN: 978-3-540-37898-3. DOI: 10.1007/11826095.

[Kre08] Vlastimil Krejčíř. “Building Czech Digital Mathematics Library upon DSpace System”. In: Towards a Digital Mathematics Library. Ed. by Petr Sojka. Birmingham, UK: Masaryk University, 2008-07, pp. 117–126. ISBN: 978-80-210-4658-0. URL: http://dml.cz/dmlcz/702539 (visited on 2013-01-09).

[KS06] Michael Kohlhase and Ioan Sucan. “A Search Engine for Mathematical Formulae”. In: Artificial Intelligence and Symbolic Computation. Ed. by Jacques Calmet, Tetsuo Ida, and Dongming Wang. Vol. 4120. Lecture Notes in Computer Science. Springer Berlin / Heidelberg, 2006, pp. 241–253. ISBN: 978-3-540-39728-1. DOI: 10.1007/11856290_21.

[Lam86] Leslie Lamport. LaTeX: A Document Preparation System. Reading, Massachusetts: Addison-Wesley, 1986.

[Líš10] Martin Líška. “Vyhledávání v matematickém textu”. Bachelor’s thesis. Masaryk University, Faculty of Informatics, Brno, Czech Republic, 2010-05-24. URL: http://is.muni.cz/th/255768/fi_b/ (visited on 2013-01-12).

[Líš13] Martin Líška. “Evaluation of Mathematics Retrieval”. MA thesis. Masaryk University, Faculty of Informatics, Brno, Czech Republic, 2013-01-07. URL: http://is.muni.cz/th/255768/fi_m/ (visited on 2013-01-11).

[Lí+11] Martin Líška, Petr Sojka, Michal Růžička, and Petr Mravec. “Web Interface and Collection for Mathematical Retrieval”. In: Towards a Digital Mathematics Library. Ed. by Petr Sojka and Thierry Bouche. Bertinoro, Italy: Masaryk University, 2011-07, pp. 77–84. ISBN: 978-80-210-5542-1. URL: http://dml.cz/dmlcz/702604 (visited on 2013-01-13).

[LM06] Paul Libbrecht and Erica Melis. “Methods to Access and Retrieve Mathematical Content in ActiveMath”. In: Mathematical Software - ICMS 2006. Ed. by Andrés Iglesias and Nobuki Takayama. Vol. 4151. Lecture Notes in Computer Science. Springer Berlin / Heidelberg, 2006, pp. 331–342. ISBN: 978-3-540-38084-9. DOI: 10.1007/11832225_33.


[MG08] Jozef Mišutka and Leo Galamboš. “Extending Full Text Search Engine for Mathematical Content”. In: Towards a Digital Mathematics Library. Ed. by Petr Sojka. Birmingham, UK: Masaryk University, 2008-07, pp. 55–67. ISBN: 978-80-210-4658-0. URL: http://dml.cz/dmlcz/702546 (visited on 2013-01-13).
[MG11] Jozef Mišutka and Leo Galamboš. “System Description: EgoMath2 As a Tool for Mathematical Searching on Wikipedia.org”. In: Intelligent Computer Mathematics. Ed. by James Davenport, William Farmer, Josef Urban, and Florian Rabe. Vol. 6824. Lecture Notes in Computer Science. Springer Berlin / Heidelberg, 2011, pp. 307–309. ISBN: 978-3-642-22672-4. DOI: 10.1007/978-3-642-22673-1_30.
[Mih+10] Filej Miha, Michal Růžička, Martin Sárfy, and Petr Sojka. “Metadata Editing and Validation for a Digital Mathematics Library”. In: Towards a Digital Mathematics Library. Ed. by Petr Sojka. Paris, France: Masaryk University, 2010-07, pp. 57–62. ISBN: 978-80-210-5242-0. URL: http://dml.cz/dmlcz/702573 (visited on 2013-01-13).
[MM07] Robert Miner and Rajesh Munavalli. “An Approach to Mathematical Search Through Query Formulation and Data Normalization”. In: Towards Mechanized Mathematical Assistants. Ed. by Manuel Kauers, Manfred Kerber, Robert Miner, and Wolfgang Windsteiger. Vol. 4573. Lecture Notes in Computer Science. Springer Berlin / Heidelberg, 2007, pp. 342–355. ISBN: 978-3-540-73083-5. DOI: 10.1007/978-3-540-73086-6_27.
[MY03] Bruce R. Miller and Abdou Youssef. “Technical Aspects of the Digital Library of Mathematical Functions”. English. In: Annals of Mathematics and Artificial Intelligence 38 (1-3 2003), pp. 121–136. ISSN: 1012-2443. DOI: 10.1023/A:1022967814992.
[NBZ09] David Novák, Michal Batko, and Pavel Zezula. “Generic Similarity Search Engine Demonstrated by an Image Retrieval Application”. In: Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval. New York, NY, USA, 2009, pp. 840–840. ISBN: 978-1-60558-483-6. URL: http://portal.acm.org/citation.cfm?id=1572160.
[Ngh+12] Minh-Quoc Nghiem, Kristianto Giovanni Yoko, Yuichiroh Matsubayashi, and Akiko Aizawa. “Automatic Approach to Understanding Mathematical Expressions Using MathML Parallel Markup Corpora”. In: The 26th Annual Conference of the Japanese Society for Artificial Intelligence. (Yamaguchi city, Yamaguchi Prefecture, Japan, 2012-06-12). 2012. URL: https://kaigi.org/jsai/webprogram/2012/pdf/712.pdf (visited on 2012-11-23).
[NIS12] National Institute of Standards and Technology (NIST). NIST Digital Library of Mathematical Functions. Version 1.0.5. Online companion to [Olv+10]. 2012-10-01. URL: http://dlmf.nist.gov/ (visited on 2012-11-18).
[Nov+07] David Novák, Michal Batko, Vlastislav Dohnal, and Pavel Zezula. “Scaling up the Image Content-based Retrieval”. In: Second DELOS Conference – Working Notes. Pisa, Italy, 2007, pp. 1–10. ISBN: 2-912335-36-1. URL: http://www.delos.info.
[NTC12] NTCIR Project. NTCIR Pilot Task: Math Task. 2012-12-19. URL: http://ntcir-math.nii.ac.jp/ (visited on 2012-12-28).
[Olv+10] F. W. J. Olver, D. W. Lozier, R. F. Boisvert, and C. W. Clark, eds. NIST Handbook of Mathematical Functions. Print companion to [NIS12]. New York, NY: Cambridge University Press, 2010.
[Rák11] Jiří Rákosník. “Recent Development of the DML-CZ and Its Current State”. In: Towards a Digital Mathematics Library. Bertinoro, Italy, July 20–21st, 2011. Ed. by Petr Sojka and Thierry Bouche. Bertinoro, Italy: Masaryk University, 2011-07, pp. 9–14. ISBN: 978-80-210-5542-1. URL: http://dml.cz/dmlcz/702597 (visited on 2013-01-09).
[RS10] Michal Růžička and Petr Sojka. “Data Enhancements in a Digital Mathematics Library”. In: Towards a Digital Mathematics Library. Ed. by Petr Sojka. Paris, France: Masaryk University, 2010-07, pp. 69–76. ISBN: 978-80-210-5242-0. URL: http://dml.cz/dmlcz/702575 (visited on 2013-01-13).
[Růž+12] Michal Růžička, David Formánek, Martin Líška, and Maroš Kucbel. MIR@MU. MathML Normalization. 2012-12-02. URL: https://mir.fi.muni.cz/mathml-normalization/ (visited on 2012-12-16).
[Růž08] Michal Růžička. “Automated Processing of TeX-typeset Articles for a Digital Library”. In: Towards a Digital Mathematics Library. Ed. by Petr Sojka. Birmingham, UK: Masaryk University, 2008-07, pp. 167–176. ISBN: 978-80-210-4658-0. URL: http://dml.cz/dmlcz/702533 (visited on 2013-01-13).
[RS08] Radim Řehůřek and Petr Sojka. “Automated Classification and Categorization of Mathematical Knowledge”. In: Intelligent Computer Mathematics – Proceedings of 7th International Conference on Mathematical Knowledge Management MKM 2008. Ed. by Serge Autexier et al. Vol. 5144. Lecture Notes in Computer Science LNCS/LNAI. Berlin, Heidelberg: Springer-Verlag, 2008-07, pp. 543–557.
[RS10] Radim Řehůřek and Petr Sojka. “Software Framework for Topic Modelling with Large Corpora”. In: Proceedings of LREC 2010 workshop New Challenges for NLP Frameworks. Valletta, Malta: ELRA, 2010-05-22, pp. 45–50. URL: http://is.muni.cz/publication/884893/en.
[SL11a] Petr Sojka and Martin Líška. “Indexing and Searching Mathematics in Digital Libraries. Architecture, Design and Scalability Issues”. In: Intelligent Computer Mathematics, Proceedings of 18th Symposium, Calculemus 2011, and 10th International Conference, MKM 2011. Ed. by James H. Davenport, William M. Farmer, Josef Urban, and Florian Rabe. Vol. 6824. Lecture Notes in Computer Science. Bertinoro, Italy: Springer Berlin Heidelberg, 2011, pp. 228–243. ISBN: 978-3-642-22672-4. DOI: 10.1007/978-3-642-22673-1_16.
[SL11b] Petr Sojka and Martin Líška. “The Art of Mathematics Retrieval”. In: Proceedings of the ACM Conference on Document Engineering, DocEng 2011. Mountain View, CA: Association of Computing Machinery, 2011-09, pp. 57–60. ISBN: 978-1-4503-0863-2. DOI: 10.1145/2034691.2034703.
[SLR11] Petr Sojka, Martin Líška, and Michal Růžička. “Building Corpora of Technical Texts. Approaches and Tools”. In: Fifth Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN 2011. (2011-12-02/2011-12-04). Karlova Studánka, Czech Republic: Tribun EU, 2011, pp. 71–82. ISBN: 978-80-263-0077-9. URL: http://raslan2011.nlp-consulting.net/program/paper01.pdf (visited on 2013-01-13).
[Soj+12] Petr Sojka, Krzysztof Wojciechowski, Nicolas Houillon, Michal Růžička, and Radim Hatlapatka. Toolset for Image and Text Processing and Metadata Enhancements – Value Release. Deliverable D7.3 of EU CIP-ICT-PSP project 250503 EuDML: The European Digital Mathematics Library. 2012-03. URL: https://project.eudml.org/sites/default/files/D7.3.pdf (visited on 2013-01-09).
[Soj08] Petr Sojka, ed. Towards a Digital Mathematics Library. Birmingham, UK: Masaryk University, 2008-07. ISBN: 978-80-210-4658-0. URL: http://dml.cz/dmlcz/702564 (visited on 2013-01-13).
[Soj10] Petr Sojka, ed. Towards a Digital Mathematics Library. Paris, France: Masaryk University, 2010-07. ISBN: 978-80-210-5242-0. URL: http://dml.cz/dmlcz/702567 (visited on 2013-01-13).


[Soj12] Petr Sojka. “Exploiting Semantic Annotations in Math Information Retrieval”. In: Proceedings of ESAIR 2012 c/o CIKM 2012. Ed. by Jaap Kamps, Jussi Karlgren, Peter Mika, and Vanessa Murdock. Maui, Hawaii, USA: Association for Computing Machinery, 2012, pp. 15–16. ISBN: 978-1-4503-1717-7. DOI: 10.1145/2390148.2390157.
[Spr12] Springer. About LaTeXSearch. 2012-01-12. URL: http://latexsearch.com/static/about.jsp (visited on 2012-01-12).
[SR07] Petr Sojka and Radim Řehůřek. “Classification of Multilingual Mathematical Papers in DML-CZ”. In: Proceedings of Recent Advances in Slavonic Natural Language Processing – RASLAN 2007. Ed. by Petr Sojka and Aleš Horák. Karlova Studánka, Czech Republic: Masaryk University, 2007-12, pp. 89–96. ISBN: 978-80-210-4471-5.
[Sta+10] Heinrich Stamerjohanns, Michael Kohlhase, Deyan Ginev, Catalin David, and Bruce Miller. “Transforming Large Collections of Scientific Publications to XML”. English. In: Mathematics in Computer Science 3 (3 2010), pp. 299–307. ISSN: 1661-8270. DOI: 10.1007/s11786-010-0024-7.
[Suz+03] Masakazu Suzuki, Fumikazu Tamari, Ryoji Fukuda, Seiichi Uchida, and Toshihiro Kanahori. “INFTY – An integrated OCR system for mathematical documents”. In: Proceedings of ACM Symposium on Document Engineering 2003. Ed. by C. Vanoirbeek, C. Roisin, and E. Munson. Grenoble, France: ACM, 2003, pp. 95–104.
[Syl+10] Wojtek Sylwestrzak, José Borbinha, Thierry Bouche, Aleksander Nowiński, and Petr Sojka. “EuDML – Towards the European Digital Mathematics Library”. In: Towards a Digital Mathematics Library. Ed. by Petr Sojka. Paris, France: Masaryk University, 2010-07, pp. 11–24. ISBN: 978-80-210-5242-0. URL: http://dml.cz/dmlcz/702569 (visited on 2013-01-09).
[WME09] WME & MKM Group in Lanzhou University. MathDex. [2009]. URL: http://wme.lzu.edu.cn/mathsearch/MathDex.html (visited on 2012-11-13).
[You05] Abdou S. Youssef. “Information Search And Retrieval of Mathematical Contents: Issues And Methods”. In: The ISCA 14th International Conference on Intelligent and Adaptive Systems and Software Engineering (IASSE-2005). (Toronto, Canada, 2005-07-20). 2005. URL: http://www.seas.gwu.edu/~ayoussef/search/IASSE.pdf (visited on 2012-12-08).


A Examples of Languages for Description of Mathematical Notations

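A.1 TeX

$x^2 + y^2$

Listing 2: TeX coding of the formula $x^2 + y^2$

A.2 Presentation MathML

[Listing 3 body: XML markup lost in extraction; surviving token text: x2 + y2]

Listing 3: Presentation MathML coding of the formula $x^2 + y^2$

A.3 Content MathML

[Listing 4 body: XML markup lost in extraction; surviving token text: superscript x 2 superscript y 2]

Listing 4: Content MathML coding of the formula $x^2 + y^2$

A.4 OpenMath

[Listing 5 body: XML markup lost in extraction; surviving token text: 2 2]

Listing 5: OpenMath coding of the formula $x^2 + y^2$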
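The difference between the Presentation and Content MathML encodings of A.2 and A.3 can be sketched by generating both for $x^2 + y^2$. This is only an illustration under stated assumptions: the element choices (`mrow`, `msup`, `mi`, `mn`, `mo` for presentation; `apply`, `plus`, `power`, `ci`, `cn` for content) are standard MathML elements, but they are not necessarily the exact markup of the original listings, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

MATHML_NS = "http://www.w3.org/1998/Math/MathML"


def presentation_mathml() -> str:
    """Presentation MathML for x^2 + y^2: encodes the visual layout."""
    math = ET.Element("math", xmlns=MATHML_NS)
    row = ET.SubElement(math, "mrow")
    for name in ("x", "y"):
        if name == "y":
            # The operator is just another token between the two powers.
            ET.SubElement(row, "mo").text = "+"
        power = ET.SubElement(row, "msup")  # base followed by superscript
        ET.SubElement(power, "mi").text = name
        ET.SubElement(power, "mn").text = "2"
    return ET.tostring(math, encoding="unicode")


def content_mathml() -> str:
    """Content MathML for x^2 + y^2: encodes the operator tree."""
    math = ET.Element("math", xmlns=MATHML_NS)
    total = ET.SubElement(math, "apply")
    ET.SubElement(total, "plus")  # operator comes first, arguments follow
    for name in ("x", "y"):
        power = ET.SubElement(total, "apply")
        ET.SubElement(power, "power")
        ET.SubElement(power, "ci").text = name
        ET.SubElement(power, "cn").text = "2"
    return ET.tostring(math, encoding="unicode")


if __name__ == "__main__":
    print(presentation_mathml())
    print(content_mathml())
```

The presentation form says only that a 2 is typeset as a superscript of x, whereas the content form states that a power operation is applied; this is why the two encodings behave differently when indexed for search.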


B Examples of MathML Code from Different Sources

B.1 “Hand Made” Formula

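[Listing 6 body: XML markup lost in extraction; surviving token text: x2 + y2]

Listing 6: Example of the “hand made” formula $x^2 + y^2$

B.2 Tralics

[Listing 7 body: XML markup lost in extraction; surviving token text: x 2 + y 2]

Listing 7: Example of Tralics generated MathML of the formula $x^2 + y^2$

B.3 LaTeXML

[Listing 8 body: XML markup lost in extraction; surviving token text: x 2 + y 2 superscript x 2 superscript y 2 x^{2}+y^{2}]

Listing 8: Example of LaTeXML generated MathML of the formula $x^2 + y^2$

B.4 InftyReader

[Listing 9 body: XML markup lost in extraction; surviving token text: x 2 + y 2]

Listing 9: Example of InftyReader generated MathML from a PDF document containing only the formula $x^2 + y^2$ in its body

B.5 MaxTract

[Listing 10 body: XML markup lost in extraction; surviving token text: x y]

Listing 10: Example of XHTML + MathML generated by the development version of MaxTract from a PDF document containing only the formula $x^2 + y^2$ in its body

B.6 MATLAB

generate::MathML(x^2 + y^2, Content = FALSE, Annotation = FALSE)

[Listing 11 output: XML markup lost in extraction; surviving token text: x 2 + y 2]

Listing 11: Example of MathML export of the formula $x^2 + y^2$ by the MATLAB 7.9.0 MuPAD symbolic engine

B.7 Wolfram Alpha

[Listing 12 body: XML markup lost in extraction; surviving token text: x 2 + y 2]

Listing 12: Example of the MathML export of the Wolfram Alpha input query ‘x^2 + y^2’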


C Publications

The attached publications are listed in Section 4.2 on page 22.

The full list of my publications can be seen at: https://is.muni.cz/osoba/mruzicka?lang=en#publikace

Automated Processing of TeX-Typeset Articles for a Digital Library

Michal Růžička

Masaryk University, Faculty of Informatics Botanická 68a, 602 00 Brno, Czech Republic E-mail: [email protected]

Abstract. Experience in setting up a comprehensive journal processing system based on the TeX typesetting system with the CEDRAM workflow is described, following the example of the Archivum Mathematicum journal. The system automates the preparation of issues and simultaneously generates the materials needed for the Czech Digital Mathematics Library project (DML-CZ). The second part of the article describes the process of transformation of archival born-digital articles into a DML-CZ-suitable format.

Key words: TeX, publishing system, digital mathematical library, DML-CZ, metadata

1 Introduction

Since 2005, a digital mathematics library has been under development in the Czech Republic. The goal of the Czech Digital Mathematics Library project (DML-CZ) [1,2,3,4,5] is to preserve in digital form the major part of the mathematical literature ever published in the Czech lands, and to provide free access to the digital content and bibliographical data [6]. From the viewpoint of the contents of the digital library, there are three main periods of time that must be addressed within the project.

1. A retro-digitization period: documents that are available only in paper format and must be digitized for the needs of a digital library.
2. A retro-born-digital period: documents that are already born-digital but have been made without awareness of the digital library. The format of these documents is often not suitable for the needs of the digital library.
3. A born-digital period: documents that are born-digital and made in such a way as to meet the needs of both the publisher and the digital library.

This article discusses the processing of the Archivum Mathematicum journal [7] in order to acquire both the retro-born-digital and born-digital data needed for the DML-CZ project.

Petr Sojka (editor): DML 2008, Towards Digital Mathematics Library, pp. 167–176. © Masaryk University, 2008, ISBN 978-80-210-4658-0

2 The Retro-Born-Digital Period of the Archivum Mathematicum Journal

The Archivum Mathematicum journal has been published using AMS-TeX and LaTeX since 1992. During this period, there have been several changes in style files, and the initial mixture of AMS-TeX and LaTeX sources nearly became a LaTeX amsart.cls monoculture. Since 1992, there have also been some changes in the editorial staff of the journal. Consequently, it has not been possible to collect the source codes for all the issues, which further complicated the whole task.

2.1 Extraction of Bibliographical Metadata

It was especially necessary to collect bibliographical metadata for the DML-CZ project, more specifically, the list of references from every article that included one. Further metadata about the articles and issues were already available from other sources.

Differences Between AMS-TeX and LaTeX Sources. As already mentioned, the format of the source codes of the articles was not homogeneous and varied not only from issue to issue but also among the articles within one issue. In general, there were two major formats, each covering roughly 50% of the articles: articles written using AMS-TeX and articles written using the LaTeX document class amsart.cls. Over the course of time, there was a trend towards the latter. Besides requiring a slightly different process of metadata extraction, the two groups differed in one major respect: AMS-TeX has a group of logical macros for bibliography marking, so it was possible to preserve the semantic information of all bibliography items even in the output.¹ This contrasts with the LaTeX thebibliography environment, which contains visual but not logical markup.

Preprocessing of Articles. The internal format of DML-CZ metadata is XML. Therefore it was desirable to extract metadata from the original TeX format directly into XML. A very good tool for transforming LaTeX documents into XML is the Tralics program [8,9]. Tralics is a LaTeX translator, so it was necessary to perform some preprocessing of the AMS-TeX articles. Inasmuch as only article bibliographies were extracted, the LaTeX articles were also preprocessed in a similar way in order to prepare input files in the LaTeX format containing only the bibliography.

¹ Regrettably, not all authors used these macros properly, and a non-negligible portion of the AMS-TeX articles had items such as publisher, year of publication, and so on marked using a common \paperinfo macro without further structuring.

For both AMS-TeX and LaTeX articles, scripts were prepared (in this case in the language of the ex program²). These scripts transform the source code of a regular AMS-TeX/LaTeX article into a minimal LaTeX document that is ready for further processing by Tralics. The workflow can be seen in Figure 1. The following is an example of a minimal LaTeX document derived from an AMS-TeX article:

    \documentclass{archivum}
    \begin{document}
    \Refs
    \ref\key1\by Gancarzewicz, J., Michor P. W.\paper Natural... \endref
    \ref\key2\by Zajtz, A.\paper On the order of natural... \endref
    ...
    \endRefs
    \end{document}

Fig. 1. Schema of the Archivum Mathematicum retro-born-digital period workflow

² The ex program is a part of the widespread vim text editor.

Conversion of LaTeX Sources into XML by Tralics. The minimal LaTeX document mentioned above is ready for further processing by Tralics. It was again necessary to prepare two different configurations for the AMS-TeX and LaTeX sets of bibliographical macros. These configuration files instructed Tralics in the translation of input TeX macros into an output XML document.

In order to make the Tralics configuration as simple as possible, the Tralics configuration files were made in such a way as to produce ‘neutral’ XML output containing just logically marked bibliographical data reflecting the original AMS-TeX markup (in the case of articles originally written in the AMS-TeX language). The Tralics configuration files contained a special definition of the AMS-TeX bibliographical macros using Tralics-specific commands. The bibliographical macros defined in this way took their arguments and enclosed them in an XML element named after the original macro in the output.

The translation of the ‘neutral’ XML files into the desired final XML format was performed using XSLT (see Figure 1). The following is an example of an output XML file:

[XML markup lost in extraction; surviving text content:]
[1] Natural... Gancarzewicz, J., Michor P. W. ...
[2] On the order of natural... Zajtz, A. ...
...
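The macro-to-element mapping described above (each bibliographical macro encloses its argument in an XML element named after the macro) can be sketched in a few lines. This is not the Tralics configuration itself, which is not reproduced in the text; it is a Python illustration of the same idea. The root element name `Refs`, the function name, and the chosen subset of AMS-TeX field macros are assumptions.

```python
import re
import xml.etree.ElementTree as ET

# Illustrative subset of AMS-TeX bibliography field macros (an assumption;
# the real configuration would cover every macro used in the articles).
FIELD_MACROS = ("key", "by", "paper", "jour", "vol", "yr", "pages", "paperinfo")

# One bibliography item: everything between \ref and \endref.
REF_RE = re.compile(r"\\ref(.*?)\\endref", re.DOTALL)
# A field macro name, not followed by another letter (so \paper != \paperinfo).
FIELD_RE = re.compile(r"\\(%s)(?![A-Za-z])" % "|".join(FIELD_MACROS))


def refs_to_neutral_xml(tex: str) -> ET.Element:
    """Wrap each macro argument in an element named after the macro."""
    root = ET.Element("Refs")
    for body in REF_RE.findall(tex):
        ref = ET.SubElement(root, "ref")
        # re.split with one capture group yields
        # [text before first macro, name1, value1, name2, value2, ...]
        parts = FIELD_RE.split(body)
        for name, value in zip(parts[1::2], parts[2::2]):
            ET.SubElement(ref, name).text = value.strip()
    return root


if __name__ == "__main__":
    sample = r"""\Refs
\ref\key1\by Gancarzewicz, J., Michor P. W.\paper Natural... \endref
\ref\key2\by Zajtz, A.\paper On the order of natural... \endref
\endRefs"""
    print(ET.tostring(refs_to_neutral_xml(sample), encoding="unicode"))
```

Because the element names mirror the source macros, the logical structure survives the conversion, and a later XSLT step only has to rename and reorder elements.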

The articles originally written in LaTeX did not contain any logical markup. The ‘neutral’ XML produced by Tralics therefore reflected visual markup rather than semantic structure. Thus the XSLT was made to produce only plain text output with minimal markup: every bibliographical record was divided into ‘authors’, ‘title’ and ‘suffix’ fields. As the visual markup varied slightly between different authors and articles, this method was not robust enough and further human checking was necessary.

2.2 Conversion of Articles from PostScript into PDF

The DML-CZ digital library requires not only metadata about the articles but also the articles themselves in electronic form. Due to the changes in style files and incomplete source codes, it was not possible to recompile all the articles. Even a small change in the output is strongly incompatible with the purposes of a digital library. Fortunately, nearly all the articles of the retro-born-digital period were available as PostScript files. However, the form of these files was not fully suitable to the needs of the digital library. The desired final format of the articles was PDF.

Automated Modifications of the PostScript Files. The first problem of the PostScript files was their BoundingBox, the smallest axis-aligned rectangle that entirely encloses all elements on the page. The PostScript files were incorrect in terms of both the BoundingBox and the paper format information. The position of the text on the page was also incorrect.

The sheer number of such articles dictated that the whole process of PostScript correction be automated. The BoundingBox of each PostScript file was detected by the ps2eps utility from the standard TeX Live distribution [10] and fixed within the PostScript file. Using the real BoundingBox value it was also possible to calculate a correct position for the text on the page. See Figure 1.
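How a corrected BoundingBox yields a text position can be sketched as follows. The snippet reads the standard `%%BoundingBox:` comment from a PostScript file and computes the shift that centres the box on the page. The centring rule and the A4 page size (595 × 842 pt) are assumptions made for this illustration; the text does not state which positioning rule or paper format the journal actually used, and the function names are hypothetical.

```python
import re
from typing import Tuple

# DSC comment of the form "%%BoundingBox: llx lly urx ury" (integer points).
BBOX_RE = re.compile(
    r"^%%BoundingBox:\s*(-?\d+)\s+(-?\d+)\s+(-?\d+)\s+(-?\d+)", re.MULTILINE
)


def read_bounding_box(ps_text: str) -> Tuple[int, int, int, int]:
    """Return (llx, lly, urx, ury) from the first numeric BoundingBox comment."""
    match = BBOX_RE.search(ps_text)
    if match is None:
        raise ValueError("no numeric %%BoundingBox comment found")
    llx, lly, urx, ury = (int(value) for value in match.groups())
    return llx, lly, urx, ury


def centering_offset(
    bbox: Tuple[int, int, int, int],
    page_width: float = 595,   # assumed A4 width in points
    page_height: float = 842,  # assumed A4 height in points
) -> Tuple[float, float]:
    """Translation (dx, dy) that centres the bounding box on the page."""
    llx, lly, urx, ury = bbox
    dx = (page_width - (urx - llx)) / 2 - llx
    dy = (page_height - (ury - lly)) / 2 - lly
    return dx, dy


if __name__ == "__main__":
    header = "%!PS-Adobe-2.0\n%%BoundingBox: 100 200 400 700\n"
    print(centering_offset(read_bounding_box(header)))
```

A script of this kind only computes the shift; applying it would still require rewriting the PostScript file, for example by prepending a translate operation.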

Substitution of Bitmap Fonts. The second problem with the PostScript files involved embedded low-resolution bitmap fonts, which are unfavourable to the future needs of digital library users.

Bitmap fonts with a fixed resolution (300 DPI in this case) are appropriate for use at that particular resolution. However, compared with outline fonts, the visual quality of bitmap fonts tends to be poor when scaled or otherwise transformed. Nowadays, publications are printed at much higher resolutions, so low-resolution bitmap fonts are less suitable than outline ones. Moreover, publications in a digital library are very often read on a computer screen. Computer screens usually have a much lower resolution than 300 DPI and electronic publications are frequently scaled on the screen. Thus, bitmap fonts are not appropriate even for this purpose. Therefore, several ways of exchanging the original bitmap fonts for their outline alternatives were investigated.

All the PostScript files were made with the dvips program. There have been several changes in font embedding since 1992: bitmap fonts with a resolution of 300 DPI were embedded in the older articles and outline fonts in the newer ones.

Some methods of font substitution are mentioned in [11]. However, the FixFont program [12] mentioned in that article did not succeed and gave no helpful error message. The FontRep Adobe Acrobat plug-in [12], also mentioned in [11], is unavailable from the plug-in homepage and there is no contact information for its author.

Finally, the PStill program [13] was partially successful. PStill is able to substitute bitmap fonts in a dvips-produced PostScript file as a part of the conversion of this file into PDF. However, PStill depends on the presence of the names of the fonts used in the comments in the PostScript code. Older versions of the dvips program did not include these comments. Therefore, the bitmap fonts could not be substituted in all the PostScript files. The remainder of the articles were translated by the well-known Ghostscript program suite. See Figure 1.

3 The Born-Digital Period of the Archivum Mathematicum Journal

The Czech Digital Mathematics Library is going to archive not only retro-digitized and retro-born-digital publications but also new publications. Therefore, a publishing system able to generate the materials needed for the digital library is required.

Nowadays, publishers often use some kind of automated document workflow [14], frequently based on XML. XML formats allow users to separate the visual representation of data from content. This is important not only for database publishing but also for deriving document metadata. The TeX typesetting system is good at separating format from content as well. Thus a publishing workflow could easily be based on TeX and/or XML [15]. A combination of the TeX system and XML has also been chosen for the new processing system of the Archivum Mathematicum journal.

3.1 CEDRAM Base

The system of the Archivum Mathematicum journal is based on a system [16,17] used for the French CEDRAM project [18]. Journals collated around the CEDRAM project use a common processing system for the preparation of issues. This system is based on the TeX typesetting system and the Tralics LaTeX to XML translator. Both these components are also used in the new Archivum Mathematicum system.

One part of the metadata for the DML-CZ project is the list of references from every article that includes one. To avoid losing semantic information, as happened with the retro-born-digital articles written using the LaTeX thebibliography environment, the BibTeX program is used for typesetting bibliographies. Preserving the structure in bibliographical metadata allows us to provide functions such as searching in particular fields in bibliographies and grouping publications written by the same authors, as well as generating web pages with different visual representation for different parts of a bibliographical record. The structure of the bibliographical metadata corresponds to the structure of BibTeX bibliography files. As BibTeX is used for article preparation, this is a very natural way to structure bibliographical metadata.

The Tralics program is a really important part of the system. The conjunction of Tralics and BibTeX bibliographies is used for the generation of article metadata in an XML format. The ability of Tralics to use BibTeX bibliographical databases directly makes the generation of metadata considerably simpler.

3.2 Workflow of an Issue

From the user’s point of view, the new system is not too different from the ‘traditional’ way of issue preparation. The cedram.cls class file used is based on the amsart.cls class file with only a slightly extended set of user macros. Because the CEDRAM class file derives from amsart.cls and changes the macro set only minimally, and because amsart.cls had been used in the past, the transfer of the Archivum Mathematicum journal was simple. The preparation of the articles is nearly the same as in the past, and practically all the new actions are processed automatically. The base of every issue under control of the new system is a ‘driving’ file and a set of independent articles in separate directories (see Figure 2). The following is an example of a driving file:

    \documentclass[AM,english,RedoBibTeX,Volume,Couverture,XML]{cedram}
    % volume number, issue number, month, year
    \IssueInfo{44}{1}{}{2008}
    \SetFirstPage{1}

    \begin{document}
    \makefront
    \articles
    \includearticle{article1}
    \includearticle{article2}
    ...
    \makeback
    \end{document}

All of the processing is driven directly from the class file using the TeX \write18 feature. This TeX command allows the user to carry out ordinary system commands³ directly from the TeX source code. In this way, article metadata are translated into XML directly using Tralics (Step 2b in Figure 2). The compilation of the ‘driving’ file by the pdflatex program (Step 1 in Figure 2) starts a large set of automated actions. In general, all the articles are compiled independently. The compilation produces a PDF of each article (Step 2a) and these are subsequently merged into the final issue PDF file (Step 3). The cover of the issue in PDF format, which also contains an automatically generated table of contents, is produced as well (Step 4). A side-benefit of this compilation is the creation of metadata (Step 2b). A big advantage of this way of processing articles is the complete isolation of each article. Any unwanted interference is eliminated.

³ The \write18 feature has to be explicitly enabled by specifying -shell-escape or a similar option on the TeX command line.
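The order of the steps and the isolation of articles can be sketched as a simple build plan. The plan below only lists the steps; it does not run anything. The per-article command lines and the `Step` and `plan_issue` names are assumptions made for illustration: the text names pdflatex, the -shell-escape option and Tralics, but not the exact invocations, and it does not name the tools used for merging (Step 3) or for the cover (Step 4), so those commands are left empty.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Step:
    label: str                    # step label as used in Figure 2
    action: str                   # what the step does
    command: Optional[List[str]]  # illustrative command, None if tool unknown
    workdir: str                  # directory in which the step runs


def plan_issue(driving_file: str, articles: List[str]) -> List[Step]:
    """List the steps triggered by compiling the driving file of an issue."""
    steps = [
        Step("1", "compile the driving file",
             ["pdflatex", "-shell-escape", driving_file], "."),
    ]
    for name in articles:
        # Each article is processed in its own directory, so a failure or a
        # macro redefinition in one article cannot affect another one.
        steps.append(Step("2a", f"compile {name} to PDF",
                          ["pdflatex", f"{name}.tex"], name))
        steps.append(Step("2b", f"translate metadata of {name} to XML",
                          ["tralics", f"{name}.tex"], name))
    steps.append(Step("3", "merge article PDFs into the issue PDF", None, "."))
    steps.append(Step("4", "produce the cover with table of contents", None, "."))
    return steps


if __name__ == "__main__":
    for step in plan_issue("issue.tex", ["article1", "article2"]):
        print(step.label, step.workdir, step.action)
```

In the system described here these steps are launched from within the class file itself via \write18, not from an external script; the plan only makes the ordering explicit.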

Fig. 2. Schema of the Archivum Mathematicum born-digital period workflow

3.3 Extensions to the CEDRAM Workflow

The new processing system of the Archivum Mathematicum journal contains some extensions to the CEDRAM workflow. The compilation of the issue and further actions are driven by the standard Unix make program using Makefiles.

Metadata on both the whole issue and every article included are automatically generated as a set of XML files by Tralics during the compilation of documents (Step 2b in Figure 2). The format of these XML files is nearly the same as in the CEDRAM workflow. The final XML files in DML-CZ format are subsequently created by XSLT (Step 2c). The DML-CZ XML files, all the articles in the PDF format, and the plain text of these articles are subsequently packed into a common ZIP archive. This archive can be sent to the appropriate person directly by e-mail with the proper Makefile target.

The system also creates an electronic version of the Archivum Mathematicum journal. Web pages containing information about the articles are made automatically using the issue metadata. These web pages and the articles in both the PDF and the PostScript format are then saved in a separate directory.

Furthermore, the system contains minor supportive tools such as the creation of a database review form and the generation of lists of ‘suspicious’ (unnaturally long) words in the source codes of the articles, which helps to reveal typing errors.
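The ‘suspicious words’ check mentioned above can be sketched as follows. The text does not say how the original tool is implemented or what length counts as unnaturally long, so the threshold of 20 letters, the stripping of comments and control sequences, and the function name are all assumptions of this illustration.

```python
import re
from typing import List


def suspicious_words(tex_source: str, max_length: int = 20) -> List[str]:
    """Return words longer than max_length letters, longest first."""
    # Remove TeX comments (an unescaped % up to the end of the line).
    text = re.sub(r"(?<!\\)%.*", "", tex_source)
    # Remove control sequences such as \emph so macro names are not reported.
    text = re.sub(r"\\[A-Za-z@]+", " ", text)
    # A word is a maximal run of letters (Unicode-aware, no digits or "_").
    words = re.findall(r"[^\W\d_]+", text)
    long_words = {word for word in words if len(word) > max_length}
    return sorted(long_words, key=lambda word: (-len(word), word))


if __name__ == "__main__":
    source = "The \\emph{quick} transformationofthedocuments fox\n"
    print(suspicious_words(source))
```

A missing space between words is the typical error this catches: two or three ordinary words fused together easily exceed the threshold, while genuine long terms can be reviewed by hand.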

4 Conclusion

Since 2008, the Archivum Mathematicum journal has been using the new publishing system. The first issue of this journal was published using this system, and some other journals are considering using the system. The final goal is the preparation of a comprehensive journal processing system based on the TeX typesetting system that would automate the preparation of issues and simultaneously generate the materials needed for the DML-CZ project. In addition to the documents for print and metadata for the digital library, the system also generates supporting outputs for editorial staff. The web pages of the journal's electronic edition are generated automatically, as is the database review form.

The transformation of the articles of the retro-born-digital period was another part of the project. Bibliographical metadata of the 1992–2007 period have been extracted and translated into the DML-CZ metadata-rich format. The articles were automatically processed, corrected and saved in the form needed for the DML-CZ project.

Acknowledgement. This research was supported by grant reg. no. 1ET200190513 of the Academy of Sciences of the Czech Republic.

References

1. Sojka, P.: From Scanned Image to Knowledge Sharing. In Tochtermann, K., Maurer, H., eds.: Proceedings of I-KNOW ’05: Fifth International Conference on Knowledge Management, Graz, Austria, Know-Center in coop. with Graz Uni, Joanneum Research and Springer Pub. Co. (June 2005) 664–672.
2. Bartošek, M., Lhoták, M., Rákosník, J., Sojka, P., Šárfy, M.: DML-CZ: The Objectives and the First Steps. In Borwein, J., Rocha, E. M., Rodrigues, J. F., eds.: CMDE 2006: Communicating Mathematics in the Digital Era. A. K. Peters, MA, USA (2008) 69–79.
3. Sojka, P., Panák, R., Mudrák, T.: Optical Character Recognition of Mathematical Texts in the DML-CZ Project. Technical report, Masaryk University, Brno (September 2006) presented at CMDE 2006 conference in Aveiro, Portugal.
4. Bartošek, M., Krejčíř, V.: Jak se dělá digitální matematická knihovna. In Sborník konference AKP 2007, Liberec, Czech Republic (2007). Available from WWW: http://dml.muni.cz/docs/akp2007-sbornik.pdf.
5. Czech Digital Mathematics Library [online]. [cit. 2008-05-30]. Available from WWW: http://dml.cz/.
6. Czech Digital Mathematics Library: About DML-CZ [online]. [cit. 2008-06-22]. Available from WWW: http://dml.cz/about/.
7. Archivum Mathematicum [online]. Masaryk University, Brno. Last modified 14 May 2008 [cit. 2008-05-18]. Available from WWW: http://www.emis.de/journals/AM/.
8. Grimm, J.: Tralics, a LaTeX to XML Translator. In Proceedings of EuroTeX, TUGboat 24(3) (2003) 377–388.
9. Tralics: a LaTeX to XML translator [online]. Last modified $Date: 2008/05/13 09:32:16 $ [cit. 2008-05-18]. Available from WWW: http://www-sop.inria.fr/apics/tralics/.
10. TeX Live [online]. $Date: 2008/05/17 00:21:31 $ [cit. 2008-05-25]. Available from WWW: http://www.tug.org/texlive/.
11. Probets, S., Brailsford, D.: Substituting outline fonts for bitmap fonts in archived PDF files. Software-Practice and Experience. 33(9) (2003) 885–899.
12. Research — Fonts [online]. [cit. 2008-05-25]. Available from WWW: http://www.eprg.org/research/fonts/.
13. Siegert, F.: PStill: ...generate, reprocess, normalize and extract content for PDF, EPS and PS. [online]. [cit. 2008-05-25]. Available from WWW: http://www.pstill.com/.
14. Krell, H.: What’s New With Springer Production? [online]. [cit. 2008-05-28]. Available from WWW: http://www.springer.com/societies?SGWID=0-40801-12-481803-0.
15. Interview of Kaveh Bazargan and CV Radhakrishnan -- co-directors of River Valley Technologies [online]. Interview completed 2006-09-20, $Date: 2006/06/28 23:30:02 $ [cit. 2008-05-28]. Available from WWW: http://www.tug.org/interviews/interview-files/river-valley.html.
16. Bouche, T.: A pdfLaTeX-based automated journal production system. In Proceedings of EuroTeX 2006, TUGboat 27(1) (2006) 45–50.
17. Bouche, T.: CEDRICS: When CEDRAM Meets Tralics. (2008) In: Sojka Petr (editor): DML 2008 – Towards Digital Mathematics Library, Birmingham, UK, July 27th, 2008, pp. 153–165.
18. Centre de diffusion de revues académiques mathématiques [Center for diffusion of mathematic journals] [online]. [cit. 2008-05-25]. Available from WWW: http://www.cedram.org/.

Data Enhancements in a Digital Mathematical Library

Michal Růžička and Petr Sojka

Masaryk University, Faculty of Informatics, Botanická 68a, 602 00 Brno, Czech Republic, [email protected], [email protected]

Abstract. The quality of a digital mathematical library depends on the formats and quality of the data it offers. We show several enhancements of the (meta)data of the Czech Digital Mathematics Library DML-CZ. We discuss a possible minimalist modification of regular LaTeX documents that would simplify generating basic metadata describing the article in an XML/MathML format. We also show a proof of concept of a method that enables us to include the LaTeX source code of mathematical expressions in pdfTeX-generated PDFs in such a way that the reader can Copy & Paste the code from the PDF viewer. This code, hidden in the PDF file, can also be used for LaTeX math indexing.

Key words: metadata generation, XML, MathML, PDF, copy-math

1 Introduction

Since 2005, a digital mathematics library has been under development in the Czech Republic. The goal of the Czech Digital Mathematics Library project (DML-CZ) [3] is to preserve in digital form the major part of the mathematical literature ever published in the Czech lands, and to provide free and public access to the digital content and bibliographical data. The DML-CZ development was officially completed at the end of 2009. The aim of this article is to give a short summary of some of the techniques that facilitated the success of this project.

A LaTeX document workflow consists of several steps, some of which can be reworked to enhance the final versions of documents that are stored in a digital repository. Besides postprocessing the final PDF files [5], we can modify the processing of the document that a journal editor typically does (Section 2) and enrich the document source code itself (Section 3). In this article we show how a slight modification of regular LaTeX documents and classes enabled us to prepare DML-CZ metadata with only minor changes to the current workflow of the editors of the mathematical journals involved.

The EuDML project [4] has already been launched and it is hoped that the DML-CZ results can be applied to it. Although the Czech Digital Mathematics Library project is officially finished, its result is here and we intend to continue developing it further. One possible contribution could be our method of including the LaTeX source code of mathematical expressions in pdfTeX-generated PDFs in such a way that the reader can Copy & Paste it directly from the PDF viewer. A PDF file of this kind could also be used for LaTeX math indexing. A proof of concept of this technique is shown in the second part of this article.

Petr Sojka (editor): DML 2010, Towards a Digital Mathematics Library, pp. 69–76. © Masaryk University, 2010. ISBN 978-80-210-5242-0

2 Minimalist XML Metadata Extraction

Although the greater part of the DML-CZ project was retro-digitization (scanning, OCR and finally processing the paper-only documents into a digital format), future development of the library depends on how new issues of the mathematical journals are processed. With this in mind, it has been necessary to prepare software support that enables the editors of the journals involved to prepare DML-CZ data easily.

The first approach was a complex system inspired by the French CEDRAM project [8,2]. It automates many of the standard procedures of journal issue preparation [10]. Although this system is used by the editors of Archivum Mathematicum [1], not every editor was willing to adopt such a complex system, which would seriously disrupt the current workflow.

We therefore prepared a minimalist set of LaTeX macros in the form of a LaTeX macro package. This package can be easily customized to meet the needs of a particular journal document class or style file. The package itself does not transform the LaTeX source code to XML. Rather, it exports selected parts of the LaTeX document verbatim to an external file in such a way that the file forms a simple LaTeX document. This occurs without any expansion of the LaTeX code: TeX toks registers are used together with the standard LaTeX output system (\newwrite, \openout, \write, \closeout). This file is subsequently processed by a journal-independent Tralics-based procedure, which is described in the next section.

The Tralics program [9,7], a LaTeX to XML translator, has proved itself¹ an adequate translator of the LaTeX code to XML. Tralics is the indispensable part of the system. Its engine is able to process regular LaTeX code, which spares us from converting the LaTeX code to plain text directly; nor do we have to deal with LaTeX macro expansion or the complexity of its syntax. Tralics outputs a UTF-8 encoded XML file. This output is finally processed by an XSLT processor, which furnishes the DML-CZ metadata in its final form. A schema of the process can be seen in Figure 1.

The metadata is generated automatically from the same source code at the same time as the final PDF document is created. Thus, we can be sure the metadata is correct and up to date, unlike the situation in which the editors prepare metadata ‘by hand’ or generate it asynchronously.

Even if the editor used another incarnation of TeX instead of LaTeX, it should still be possible to export the necessary data in such a way that the result

1 Tralics is also used in the complex system of the Archivum Mathematicum journal.
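The last step of this pipeline, turning the intermediate XML into a metadata record, can be illustrated with a minimal sketch. The workflow described above uses an XSLT stylesheet; the Python below only mimics such a mapping with the standard library. The element names (`title`, `author`, `keyword`) and the layout of the resulting record are hypothetical and are not the actual Tralics output or the DML-CZ metadata schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the XML that a LaTeX-to-XML translator might emit
# for the exported metadata document; the real element names may differ.
SAMPLE_XML = """<article>
  <title>On a   class of
     examples</title>
  <author>A. Author</author>
  <author>B. Author</author>
  <keyword>metadata</keyword>
</article>"""


def extract_metadata(xml_text: str) -> ET.Element:
    """Map the intermediate XML onto a flat, illustrative metadata record."""
    source = ET.fromstring(xml_text)
    record = ET.Element("metadata")
    # Copy the title, normalising whitespace left over from the LaTeX source.
    title = ET.SubElement(record, "title")
    title.text = " ".join((source.findtext("title") or "").split())
    # Number the authors so that their order in the article is preserved.
    for order, author in enumerate(source.findall("author"), start=1):
        node = ET.SubElement(record, "author", order=str(order))
        node.text = (author.text or "").strip()
    for keyword in source.findall("keyword"):
        ET.SubElement(record, "keyword").text = (keyword.text or "").strip()
    return record


if __name__ == "__main__":
    print(ET.tostring(extract_metadata(SAMPLE_XML), encoding="unicode"))
```

Because the record is derived from the same exported source as the PDF, rerunning such a step after every change keeps the two in sync, which is the property the section relies on.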


3 Copy Math—a Proof of Concept

The DML-CZ project stores the full texts of the articles as PDF files, as do many other digital libraries. PDF is widely adopted and very often used for electronic publications. Thanks to pdfTeX, PDF is also the de facto standard output format of modern TeX distributions.

Being capable of high-quality mathematical typesetting, TeX is widely used. LaTeX mathematical notation is well known, effective, and used not only in LaTeX documents but also in a variety of other projects, such as Wikipedia. Thus, LaTeX source code is usually a good choice for a plain text representation of mathematical expressions. Users and maintainers of repositories of digital documents themselves demand plain text for the content of PDF documents; in Japan, regular PDF documents are processed using OCR (optical character recognition) techniques to obtain a plain text representation of math from PDFs [6,11].

Unfortunately, pdfTeX-produced PDF documents do not provide their readers with this kind of output when they use the Copy & Paste functions of their preferred PDF reader. A LaTeX document with the following body yields the pdfTeX-generated PDF shown in Figure 2:

    \begin{document}
    Text $\Pi(x) = \pi(x) + \frac{1}{2}\pi(x^{1/2}) + \frac{1}{3}\pi(x^{1/3}) + \cdots$ text.
    \end{document}

The content of the document is selected properly, but the result of the Copy operation is a malformed mixture of Unicode characters.

Fig. 2. CopyMath disabled PDF document

To address this inconvenience we decided to use the ActualText property of the PDF language to mark the region of the mathematical expression inside the PDF document and allow PDF readers to provide their users with the LaTeX source code of the expression. Figure 3 shows the PDF file that resulted from the same document with our experimental CopyMath macro package switched on.

Fig. 3. CopyMath enabled PDF document

Mathematical expressions are not selected visually; the result of the Copy operation is the original LaTeX source code itself:

Text $\Pi (x) = \pi (x) + \frac {1}{2}\pi (x^{1/2}) + \frac {1}{3}\pi (x^{1/3}) + \cdots $ text.
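To make the mechanism more concrete, the following minimal sketch builds the kind of marked-content wrapper that carries the replacement text. It is written in Python purely for illustration and is not part of the CopyMath package; the function names are hypothetical. It assumes that ASCII source code can be hex-encoded byte by byte and that any other text is written as UTF-16BE with a byte order mark, which is the Unicode form of PDF text strings.

```python
def to_pdf_hex_string(text: str) -> str:
    """Hex-encode text as the body of a PDF hex string (no angle brackets)."""
    try:
        data = text.encode("ascii")
    except UnicodeEncodeError:
        # Non-ASCII text: UTF-16BE preceded by the byte order mark FE FF.
        data = b"\xfe\xff" + text.encode("utf-16-be")
    return data.hex().upper()


def wrap_with_actual_text(source: str, content: str) -> str:
    """Surround page content with a marked-content span carrying ActualText."""
    opening = f"/Span << /ActualText <{to_pdf_hex_string(source)}> >> BDC"
    return f"{opening}\n{content}\nEMC"


if __name__ == "__main__":
    # The middle line stands in for the glyph-drawing operators of the formula.
    print(wrap_with_actual_text(r"$a+b$", "% formula glyphs are drawn here"))
```

A viewer that honours ActualText copies the decoded string instead of the glyphs between BDC and EMC, which is why the copied text can be the LaTeX source rather than a sequence of font characters.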

The implementation is not easy because we want the package to be as user friendly as possible: users should not be forced to modify their mathematical expressions in any way, and \usepackage{copymath} should cater for all their needs. However, this requires nonstandard modifications of the LaTeX mathematical environments.

To implement CopyMath we need to add \pdfliteral at the beginning and end of every mathematical environment. The dollar sign ($) is made active (\catcode`$=13) and redefined. It is necessary to keep track of nested mathematical environments (e.g. $a\mbox{$b$}c$), and the double-dollar display-math syntax ($$a + b$$) adds another layer of complication.

To redefine LaTeX mathematical environments (\begin{math}...\end{math}, \begin{eqnarray}...\end{eqnarray} etc.) we keep the original definitions of their opening (\let\normalequation\equation) and closing commands. The environment is then redefined using our auxiliary macros. The opening command is replaced by a macro that scans tokens until the closing command of the mathematical environment is reached, and nested environments must be taken into account while scanning. The scanned content of the mathematical environment is used to prepare the \pdfliteral code. The \pdfliteral code and the original content of the mathematical environment are then used by another auxiliary macro that takes the place of the closing command of the original mathematical environment. Here is an example of CopyMath macro definitions:

    %% Auxiliary macros.
    \newcounter{nestedmath}
    \setcounter{nestedmath}{0}
    %
    \newtoks\copymath@envgetbuffera
    \newtoks\copymath@envgetbufferb
    %
    % #1 = environment being collected, #2 = text up to the next \end,
    % #3 = name of the environment closed by that \end.
    \long\def\copymath@envget#1#2\end #3{%
      \copymath@envgetbuffera=\expandafter{\copymathenvput}%
      \def\copymath@envtempa{#3}\def\copymath@envtempb{#1}%
      \ifx\copymath@envtempa\copymath@envtempb%
        % Our own \end was found: store the last chunk and close.
        \copymath@envgetbufferb={#2}%
        \def\copymath@envgetnext{\end{#1}}%
      \else%
        % Another environment was closed: keep its \end and go on scanning.
        \copymath@envgetbufferb={#2\end{#3}}%
        \def\copymath@envgetnext{\copymath@envget{#1}}%
      \fi%
      \global\edef\copymathenvput{%
        \the\copymath@envgetbuffera
        \the\copymath@envgetbufferb}%
      \copymath@envgetnext}
    %
    \long\def\copymathenvget#1{%
      \gdef\copymathenvput{}\copymath@envget{#1}}
    %

    %% $
    \let\@origensuredmath=\@ensuredmath
    %
    \def\normalinlinemath#1{%
      \ifnum\value{nestedmath}>0
        \@origensuredmath{#1}%
      \else%
        \addtocounter{nestedmath}{1}%
        \pdfliteral{/Span << /ActualText<\pdfescapehex{\detokenize{$#1$}%
          }> >> BDC}%
        $#1$%
        \addtocounter{nestedmath}{-1}%
        \pdfliteral{EMC}%
      \fi}
    %
    \let\@ensuredmath\normalinlinemath
    %
    \catcode`$=13

    %% \begin{equation}...\end{equation}
    \let\normalequation\equation
    \let\normalendequation\endequation
    \renewenvironment{equation}%
      {\copymathenvget{equation}}%
      {\ifnum\value{nestedmath}>0
        \message{You cannot nest equation}%
      \else%
        \normalequation%
        \addtocounter{nestedmath}{1}%
        \pdfliteral{/Span << /ActualText<\pdfescapehex{%
          \detokenize{\begin{equation}}\copymathenvput\detokenize{%
          \end{equation}}}> >> BDC}%
        \copymathenvput%
        \addtocounter{nestedmath}{-1}%
        \pdfliteral{EMC}%
      \fi
      \normalendequation}
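The scanning strategy of the auxiliary macro \copymath@envget may be easier to follow outside TeX. The sketch below restates it in Python on plain strings: text is consumed up to each \end{...}, closing commands of other environments are kept as part of the body, and scanning stops at the first \end of the requested environment. The function name and the regular expression are assumptions made for this illustration; like the macro, the sketch does not handle an environment nested inside another environment of the same name.

```python
import re

# Matches a closing command such as \end{equation} and captures the name.
END_PATTERN = re.compile(r"\\end\s*\{([^}]*)\}")


def scan_environment_body(text: str, env: str) -> tuple[str, str]:
    """Return (body, rest), where body is the text before the first \\end{env}.

    Closing commands of other environments stay inside the body.
    """
    position = 0
    while True:
        match = END_PATTERN.search(text, position)
        if match is None:
            raise ValueError(f"no \\end{{{env}}} found")
        if match.group(1) == env:
            return text[:match.start()], text[match.end():]
        # A different environment was closed: keep it and continue scanning.
        position = match.end()


if __name__ == "__main__":
    source = r"a \begin{array}{c} b \end{array} c \end{equation} tail"
    body, rest = scan_environment_body(source, "equation")
    print(repr(body))
    print(repr(rest))
```

The collected body plays the role of \copymathenvput: it is used once for the ActualText string and once more to typeset the formula.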

Unfortunately, it seems that this approach is not as universal as expected. For example, it is not possible to use this kind of macro redefinition directly for AMS-LaTeX mathematical environments; these would require a more complex macro redefinition. Another possible solution could be preprocessing of the source code using an external tool. This approach, however, would need to deal with the complexity of the LaTeX syntax.

4 Conclusions

Minimalist modifications of the current editorial workflow proved to be an easy way of moving mathematical journal editors to a digital-library-friendly state. Tralics provides us with sufficient functionality to perform this easily and with platform independence. The CopyMath macro package shows an alternative route to improving pdfTeX-generated PDFs, but the proper redefinition of all possible mathematical environments cannot be expected to be easy.

References

1. Archivum Mathematicum. [online], http://www.emis.de/journals/AM/, Masaryk University, Brno, Czech Republic. Last modified December 18, 2009. [cit. 2010-04-25].
2. Centre de diffusion de revues académiques mathématiques. [online], http://www.cedram.org/, [Center for diffusion of mathematic journals]. [cit. 2008-05-25].
3. Czech Digital Mathematics Library. [online], http://dml.cz/, [cit. 2010-04-24].
4. EuDML: The European Digital Mathematics Library. [online], http://www.eudml.eu/, This page was last modified on 20 January 2010, at 08:09. [cit. 2010-04-25].
5. Hatlapatka, R., Sojka, P.: PDF Enhancements Tools for a Digital Library. In: Sojka, P. (ed.) Proceedings of DML 2010, pp. 69–76. Masaryk University Press, Paris, France (Jul 2010).
6. Infty Project: Research Project on Mathematical Information Processing. [online], http://www.inftyproject.org/en/, [cit. 2010-06-02].
7. Tralics: a LaTeX to XML translator. [online], http://www-sop.inria.fr/apics/tralics/, Last modified $Date: 2009/11/24 17:17:03 $ [cit. 2010-04-24].
8. Bouche, T.: A pdfLaTeX-based automated journal production system. TUGboat 27(1), 45–50 (2006), In Proceedings of EuroTeX 2006.
9. Grimm, J.: Tralics, a LaTeX to XML Translator. TUGboat 24(3), 377–388 (2003), In Proceedings of EuroTeX.
10. Růžička, M.: Automated Processing of TeX-Typeset Articles for a Digital Library. In: Sojka, P. (ed.) DML 2008 – Towards Digital Mathematics Library. pp. 167–176 (2008), Birmingham, UK, July 27th, 2008.
11. Suzuki, M., Kanahori, T., Ohtake, N., Yamaguchi, K.: An Integrated OCR Software for Mathematical Documents and Its Output with Accessibility. In: Computers Helping People with Special Needs. Lecture Notes in Computer Science, vol. 3119, pp. 648–655. Springer (2004), 9th International Conference ICCHP 2004, Paris, July 2004.

Metadata Editing and Validation for a Digital Mathematics Library

Miha Filej 1, Michal Růžička 2, Martin Šárfy 3, and Petr Sojka 2

1 University of Ljubljana, Faculty of Computer and Information Science, Tržaška 25, 1000 Ljubljana, Slovenia, [email protected]
2 Masaryk University, Faculty of Informatics, Botanická 68a, 602 00 Brno, Czech Republic, [email protected], [email protected]
3 Masaryk University, Institute of Computer Science, Botanická 68a, 602 00 Brno, Czech Republic, [email protected]

Abstract. For preparing and validating metadata for the Digital Mathematics Library DML-CZ, a new tool, the Metadata Editor, has been developed. This paper outlines the procedures for the linguistic and geographical localization of its components. Also mentioned are such aspects as the dynamic generation of editing forms based on the XML Schema, the validation procedures, and support for semiautomatic procedures regarding quality assurance.

Key words: DML-CZ, Metadata Editor, internationalization, translation, localization, validation, XML, forms, Ruby, Perl, JavaScript

1 Introduction

Since 2005, the Czech Digital Mathematics Library project (DML-CZ) [3] has been under development in the Czech Republic. An important part of the project has been the development of the Metadata Editor [4,11], a client–server web application designed to manage, edit, and validate each article’s metadata and full texts prior to their integration into the digital library.

The Metadata Editor is open-source software (http://dme.sourceforge.net/) and, having proven its efficiency, is now in use in a variety of other environments. These include the Faculty of Arts of Masaryk University and the Kramerius project of the Moravian Library [13]; the Editor may also be used in the EuDML project [5]. In this article we present some recent developments of the Metadata Editor.

2 On-line Submissions and Validation

The viability of a digital library rests with new acquisitions emerging mainly in the form of born-digital publications. The born-digital inputs to the Metadata Editor come from different sources, primarily from editors of various journals.

Petr Sojka (editor): DML 2010, Towards a Digital Mathematics Library, pp. 57–62. © Masaryk University, 2010. ISBN 978-80-210-5242-0

To assure a smooth integration of a new publication into the Metadata Editor database, it has to satisfy a particular data format specification available to all the contributors. For this reason, it was necessary to set up a safe and comfortable interface between the contributors and the Metadata Editor. Because the Metadata Editor is a web application, it is easy to provide the users with direct on-line access based on a private user account in the Editor. After logging in, the user can upload a new delivery directly and it is automatically assigned to the appropriate journal. The new entries are automatically validated, so that the user gets warnings about inappropriate data formats, while flawed submissions are rejected completely. This obviates later corrections and helps the users to prepare data in the required format.

3 Dynamic Generation of Editing Forms

One of the most important functions of the Metadata Editor is to facilitate interactive modification of metadata. The operators are allowed to browse the contents of the Metadata Editor database and make the necessary adjustments through the web-based interface of the relevant forms. Since the metadata language is formally defined by an XML Schema, it is possible to generate the forms dynamically based on the XML Schema definition. The mechanism consists of server-side and client-side scripting. The XML Schema is processed on the server by a Perl script that generates JavaScript code, which is included in the web page and sent to the client. This JavaScript code runs in the web browser of the end user and generates a form matching the language defined by the source XML Schema. Not all features of XML Schema are supported, but the mechanism is powerful enough to satisfy the requirements. In addition to being a part of the Metadata Editor, a generalized version of the forms generator is available as a standalone open-source project [9].
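The idea of deriving a form from a schema can be sketched in a few lines. The example below is only an illustration under strong simplifications: it is written in Python and emits HTML directly, whereas the Metadata Editor uses a Perl script that generates JavaScript. The sample schema, the type-to-input mapping and the function name are hypothetical.

```python
import xml.etree.ElementTree as ET
from html import escape

XS = "{http://www.w3.org/2001/XMLSchema}"

# Hypothetical, heavily simplified schema; a real metadata schema is richer.
SAMPLE_SCHEMA = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="article">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="title" type="xs:string"/>
        <xs:element name="year" type="xs:gYear"/>
        <xs:element name="pages" type="xs:positiveInteger" minOccurs="0"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>"""

# Illustrative mapping from schema types to HTML input types.
INPUT_TYPES = {"xs:positiveInteger": "number", "xs:gYear": "number"}


def schema_to_form(schema_text: str) -> str:
    """Generate a minimal HTML form for the typed elements of a schema."""
    root = ET.fromstring(schema_text)
    rows = []
    for element in root.iter(XS + "element"):
        xsd_type = element.get("type")
        if xsd_type is None:
            continue  # container element without a simple type
        name = escape(element.get("name", ""), quote=True)
        input_type = INPUT_TYPES.get(xsd_type, "text")
        # Elements are required unless the schema marks them as optional.
        required = "" if element.get("minOccurs") == "0" else " required"
        rows.append(
            f'<label>{name} <input name="{name}" type="{input_type}"{required}></label>'
        )
    return "<form>\n" + "\n".join(rows) + "\n</form>"


if __name__ == "__main__":
    print(schema_to_form(SAMPLE_SCHEMA))
```

Keeping the schema as the single source of truth means that a change to the metadata language changes the form without any manual editing of the user interface.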

4 Internationalizing the Metadata Editor

4.1 Internationalization, Translation, Localization

In a nutshell, adapting the user interface of an existing application to new languages involves changing the output in a way that will please the current user. While translation could easily be considered the most important part of this process, it is not enough by itself: both translation and localization are required. When dealing with source and target regions that are not similar, a complete localization of an application is difficult to achieve.

Common parts of an application that have to be localized are time and date formats. The way a time or date is displayed (the number of digits used, the separators, the order of the date components, whether the 24- or 12-hour format is used) can vary from region to region. In addition, the time zones in which users reside may differ. The process of localization has to ensure that every date output by the application is displayed relative to the corresponding time zone.

Depending on the degree of internationalization that needs to be performed and the locales that need to be supported, more specific issues may be encountered: pluralization, unit conversion (metric vs. imperial, currencies etc.), and right-to-left text orientation. Particular attention has to be paid to words or phrases that have different meanings due to cultural differences and may even be offensive.

4.2 Implementation

The Metadata Editor is built using a variety of technologies and programming languages. The part that interacts with the user is mostly handled by Ruby [8], which requires support from end libraries. In the past, there were various (incompatible) internationalization solutions in the Ruby ecosystem, each solving its own set of problems. In 2007 an effort to provide a generalized library emerged, resulting in I18n [10], the library that is now the de facto standard for the internationalization of Ruby applications. Being a general solution, it does not provide complex internationalization facilities; instead it defines an interface that lets other libraries extend its functionality and remain compatible with each other at the same time.

I18n provides two basic methods, I18n.translate and I18n.localize (due to frequent use abbreviated to I18n.t and I18n.l, respectively). I18n.t handles translation by mapping an explicitly defined namespaced key to a string in a natural language. This approach differs from the popular GNU gettext [6], which maps a string in a natural language to a string in another natural language (although gettext’s .mo and .po files can still be used with I18n to store the translations). Explicit programmer-defined keys should result in greater maintainability: they simplify the way translations are reused throughout the application and avoid the problem that two identical sentences used in different contexts in one language may require different translations in another language, and vice versa. I18n.l takes various objects such as times and dates and localizes them according to the defined localization rules.

I18n’s pluggable back-ends allow internationalized data to be stored in different ways. In addition to the gettext format mentioned above, YAML files, various relational databases and key-value stores are available as storage options. By defining the interface for implementing a back-end, the I18n library enables programmers to build a custom storage solution that suits their needs.
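The difference between key-based lookup and a pluggable storage back-end can be shown with a small sketch. It is written in Python and is not the Ruby I18n API: the classes `DictBackend` and `Translator`, the key names and the stored strings are all hypothetical, and the fallback to a default locale is an assumption of this example.

```python
from datetime import date

# Hypothetical in-memory store: locale -> nested namespaces -> strings.
TRANSLATIONS = {
    "en": {"article": {"save": "Save article"}, "date": {"format": "%m/%d/%Y"}},
    "cs": {"article": {"save": "Uložit článek"}, "date": {"format": "%d.%m.%Y"}},
}


class DictBackend:
    """Minimal storage back-end; others could read YAML files or a database."""

    def __init__(self, data):
        self.data = data

    def lookup(self, locale, key):
        node = self.data[locale]
        for part in key.split("."):
            node = node[part]
        return node


class Translator:
    """Front-end that only depends on the back-end's lookup() method."""

    def __init__(self, backend, default_locale="en"):
        self.backend = backend
        self.default_locale = default_locale

    def translate(self, key, locale=None):
        """Translate a namespaced key, falling back to the default locale."""
        try:
            return self.backend.lookup(locale or self.default_locale, key)
        except KeyError:
            return self.backend.lookup(self.default_locale, key)

    def localize(self, value: date, locale=None):
        """Format a date using the pattern stored for the locale."""
        return value.strftime(self.translate("date.format", locale))


if __name__ == "__main__":
    translator = Translator(DictBackend(TRANSLATIONS))
    print(translator.translate("article.save", "cs"))
    print(translator.localize(date(2010, 4, 28), "cs"))
```

Because the key `article.save` names a place in the application rather than an English sentence, the same English wording used elsewhere can receive a different translation simply by living under a different key.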

4.3 Choosing a Locale

Apart from altering the code to replace hardcoded strings with calls to methods that translate and localize them, logic that handles switching between the locales needs to be introduced into the application. Since the Metadata Editor is a web application, the locale has to be set per request. With the help of sessions and cookies it is possible to persist a given locale between requests of the same user, so the question remains: which locale is to be used the first time (for a new user with no cookies)?

There is no reliable way to determine a locale for a user’s first request, but a web application is able to take a guess based on a few hints. The HTTP/1.1 protocol defines the Accept-Language header [7], and at first it may be tempting to use the information provided by the user agent to set the default locale for the user, but there are several things to take into account [12]. Many users are unaware of this setting, which was probably set when the user agent was installed and might not conform to their preferences. The user agent may send a request that only defines the language without specifying the region (e.g. instead of de-DE, de-CH or de-AT, indicating German as spoken in Germany, Switzerland or Austria, respectively, only de may be requested). If users do not access the application from their own machine, the inferred locale may be inappropriate, especially when they are in a foreign country. Last but not least, the header may not be set at all.

Another clue from which the locale can be inferred is the user’s IP address. With the help of a database or an external geolocation service it is possible to determine the user’s geographical origin, but this approach shares many of the shortcomings described above.

It is important that the application is not bound to its guess but allows users to set their own preference at any point of the interaction. Whenever a locale is explicitly chosen, it is safe to assume it as the default for future requests from the same user. To sum up, the logic for setting a locale has to consider (from highest to lowest priority): the previously set preference, the locale guessed from the HTTP headers, the locale guessed from the source IP address, and the default locale.

Ideally the logic would set the locale as soon as the request is received, at the beginning of the interaction with the user, i.e. at the programme input. Then, when computing the output, dedicated functions would perform translation and localization depending on the locale that was set.
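The priority order summarized above can be written down as a short sketch. The Python below is illustrative only and is not taken from the Metadata Editor: the list of supported locales, the function names and the simplified handling of quality values in Accept-Language are assumptions, and the geolocation guess is passed in as a ready-made value rather than looked up.

```python
SUPPORTED = ("en", "cs", "de")  # hypothetical set of available locales
DEFAULT_LOCALE = "en"


def parse_accept_language(header: str) -> list[str]:
    """Return the language tags of an Accept-Language header, best first."""
    weighted = []
    for index, item in enumerate(header.split(",")):
        tag, _, params = item.strip().partition(";")
        tag = tag.strip()
        if not tag or tag == "*":
            continue
        quality = 1.0
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0
        if quality > 0:
            # Sort by descending quality, then by position in the header.
            weighted.append((-quality, index, tag))
    return [tag for _, _, tag in sorted(weighted)]


def match_supported(tag):
    """Map a tag such as 'de-CH' to a supported locale, or return None."""
    if not tag:
        return None
    primary = tag.split("-")[0].lower()
    return primary if primary in SUPPORTED else None


def choose_locale(stored=None, accept_language="", geo_guess=None):
    """Apply the priority order: preference, header, IP-based guess, default."""
    candidates = [stored, *parse_accept_language(accept_language), geo_guess]
    for candidate in candidates:
        locale = match_supported(candidate)
        if locale:
            return locale
    return DEFAULT_LOCALE


if __name__ == "__main__":
    print(choose_locale(accept_language="fr, de-AT;q=0.7"))
    print(choose_locale(stored="cs", accept_language="de"))
```

Reducing a regional tag such as de-AT to its primary language is one simple way to cope with requests that specify only a language or an unsupported region; an application with region-specific translations would need a finer match.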

4.4 Refactoring, Dangers, Precautions

The effort of adapting an application to another language depends on the difference between the source and the target language. Given that adaptation to a broader set of languages is preferable, the codebase requires major alteration, a process that is prone to mistakes. Since the Metadata Editor is a relatively complex codebase, taking precautions against introducing bugs is all the more important. The desired result of internationalization is that, at least when rendering in the original locale, the output matches the output of the programme before it was adapted. To assure this, a set of specifications is needed.

Automated software testing is strongly encouraged in the Ruby community, and the past few years have seen an evolution of tools and practices for unit, functional and integration testing. In 2008 a tool called Cucumber [2] was introduced. It differs from other solutions in that specifications (called features) are written not in Ruby but in a language called Gherkin [1]. This domain-specific language serves two purposes: documentation and automated tests. It allows software behaviour to be described irrespective of how that behaviour is implemented. Gherkin’s grammar has only a few simple rules and reads like spoken language. This allows feature specifications to be written and understood not only by programmers but by domain experts as well, thus increasing the value of the specifications.

While Cucumber itself is written in Ruby, it can be used to test code written in other languages, which makes it suitable for covering the non-Ruby parts of the Metadata Editor. Figure 1 shows that Cucumber communicates with the application at the framework level, which offers better control over the request parameters than direct communication at the application server or web server level would provide.

Fig. 1. Cucumber integration diagram (components shown: client request/response, web server, application server, framework interface, application code; Cucumber test runner, feature specifications)

5 Conclusions

The Metadata Editor is a live, continuously developing project. New features are added as needed. The on-line input and validation service was added to provide users with a comfortable and safe interface for data inclusion, the user interface is dynamically generated based on the formal definition of the metadata, and the localization of the Metadata Editor is in progress. The Metadata Editor is used in several projects and will possibly be used in the EuDML project as well.

References

1. aslakhellesoy / cucumber. [online], http://wiki.github.com/aslakhellesoy/cucumber/gherkin, Last edited by zwyan2009, 2 days ago [cit. 2010-04-28].
2. Cucumber: Behaviour driven development with elegance and joy. [online], http://cukes.info/, [cit. 2010-04-28].
3. Czech Digital Mathematics Library. [online], http://dml.cz/, [cit. 2010-04-24].
4. Digitization Metadata Editor. [online], http://dme.sourceforge.net/, [cit. 2010-04-28].
5. EuDML: The European Digital Mathematics Library. [online], http://www.eudml.eu/, This page was last modified on 20 January 2010, at 08:09. [cit. 2010-04-25].
6. gettext. [online], http://www.gnu.org/software/gettext/, Updated: $Date: 2010/01/31 14:51:43 $ [cit. 2010-04-28].
7. HTTP/1.1: Header Field Definitions: Accept-Language. [online], http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.4, [cit. 2010-04-28].
8. Ruby Programming Language. [online], http://www.ruby-lang.org/en/, [cit. 2010-04-28].
9. SchemaForms. [online], http://sforms.sourceforge.net/, [cit. 2010-05-30].
10. svenfuchs / i18n. [online], http://github.com/svenfuchs/i18n, [cit. 2010-04-28].
11. Bartošek, M., Kovář, P., Šárfy, M.: DML-CZ Metadata Editor: Content Creation System for Digital Libraries. In: Sojka, P. (ed.) DML 2008 – Towards Digital Mathematics Library. pp. 139–151 (2008), Birmingham, UK, July 27th, 2008.
12. Honomichl, L.: Accept-Language used for locale setting. [online], http://www.w3.org/International/questions/qa-accept-lang-locales, Last substantive update 2003-09-17 12:15 GMT. This version 2006-11-25 16:35 GMT [cit. 2010-04-28].
13. Šárfy, M.: Metadatový editor pro digitální knihovny. In: Knihovny současnosti 2009. pp. 140–154. Brno (2009), http://www.sdruk.cz/sec/2009/sbornik/2009-6-140.pdf, Seč u Chrudimi, CZ, June 23rd, 2009. ISBN 978-80-86249-54-4

Building Corpora of Technical Texts: Approaches and Tools

Petr Sojka, Martin Líška, and Michal Růžička

Faculty of Informatics, Masaryk University Botanická 68a, 602 00 Brno, Czech Republic [email protected], [email protected], [email protected]

Abstract. Building corpora of technical texts in the Science, Technology, Engineering, and Mathematics (STEM) domain has its specific needs, especially the handling of mathematical formulae. In particular, there is no widely accepted format to represent and handle math. We present an approach based on multiple representations of mathematical formulae that has been used for math retrieval, similarity and clustering of a mathematical corpus. We provide an overview of our toolset, summarize our experiments to date and propose further research directions and approaches.

Key words: mathematical corpora; information retrieval of mathematics; representation of mathematical formulae; math search and indexing; normalization of MathML; canonicalization

1 Introduction

Leading research in empirical linguistics builds on large (e.g. web-scale) corpora such as those created by Google (Google Books Corpus, Google Scholar) or by the Sketch Engine (TenTen Corpora). Such corpora allow for natural language processing (NLP) of a new quality level to solve such tasks as more relevant information retrieval, document clustering, classification and similarity, thesauri and ontology building, better word sense disambiguation, machine translation and many others. However, in these mainstream research activities, minority languages or domain specifics are neglected. One such neglected ‘language’ is the language of mathematics, typical of Science, Technology, Engineering, and Mathematics (STEM) documents.

The mainstream NLP workflow for building corpora starts with tokenization, which is usually not aware of mathematical formulae or equations. Math is usually supported neither by optical character recognition (OCR) tools, nor by applications that generate PDF or (X)HTML. The use and representation of math on the web is far from settled. As a consequence, no mainstream tools support this niche market of ‘the Queen of sciences’.

In previous projects that involved building Digital Mathematics Libraries (DML) such as DML-CZ [1] and EuDML [2], we had to deal with the fact that NLP corpora tools were unable to handle corpora of math texts, let alone build them. We therefore devised some tools for adequate support of mathematical formulae in NLP and information retrieval (IR) tasks. A proper semantic and math-aware representation is a necessary prerequisite for efficient and effective NLP processing of STEM corpora.

As the first step, this task involved the design of a representation of math formulae (Section 2). Then, to build mathematical corpora, we had to preprocess and normalize heterogeneous inputs (Section 3) into this new representation. It was also necessary to design ways of math retrieval (indexing and search are crucial, cf. Section 4). Our aim is to support math-aware document clustering, similarity and disambiguation (Section 5). We summarize our findings in Section 6.

Aleš Horák, Pavel Rychlý (Eds.): Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN 2011, pp. 69–79, 2011. © Tribun EU 2011

2 Math Representations

Mathematicians and other authors of STEM documents encode quantities and relations using formulae and equations in compact, often two-dimensional, notation. These objects have to be represented in a unique way in the global STEM document handling system. There are numerous ways of notating the same mathematical object, which have evolved in particular geographical locations or languages. This is an example of different notations for a binomial coefficient:

\[ \binom{n}{r} = \frac{n!}{r!\,(n-r)!} = C^{n}_{r} = {}_{n}C_{r} = C(n, r) \]

When searching STEM documents, it should be possible to find the same objects within a corpus, and assign them the same representation even though authors have used different notation. As is the case with text handling, where words with the same meaning are treated and indexed the same, formulae require the same treatment. The matter is complicated by the fact that there are different formats for handling mathematics: TeX, MathML, OpenMath, etc.

2.1 TeX

Authors prefer the compact and logical notation of TeX. The American Mathematical Society (AMS) extended the standard plain TeX and LaTeX notation with AMS packages (so-called AMS-LaTeX) for commutative diagrams, aligned equations, etc. The namespace of AMS-LaTeX macros is nowadays the de facto standard for the typesetting of mathematical documents, and this namespace is also supported in the metadata of DML-CZ (only this namespace is allowed and supported there, e.g. by conversion to MathML).

TeX math notation is in such demand that even for Word there is a plug-in by Design Science that allows formulae to be entered in TeX notation. This is much quicker and more convenient than choosing the symbols from numerous menus and symbol tables. Nevertheless, using TeX notation for indexing purposes is a disaster, as the example of the LaTeXSearch application by Springer (http://latexsearch.com) shows. Authors are so creative in their use of macro expansion and TeX language formatting that the different notations cannot be coped with by simple string similarity. Formula structure and other types of similarity have to be used in the formula representation for similarity computation. For this, the tree structure of XML (MathML) is better, as it is understood by the majority of math-aware software developers.

2.2 MathML

In the world of applications and software interfaces, MathML usually wins, as it is supported by W3C and AMS. The extensibility of TeX’s macro namespace is a nightmare to support in software without the complete TeX macro expansion engine, and here MathML clearly wins. The MathML DTD allows easy validation of formulae and processing with XML tools. There are even recently developed portable tools like MathJax, a JavaScript library that displays mathematics in web browsers. It supports both LaTeX and MathML markup, as it attempts to convert LaTeX on the fly into an appropriate markup language – HTML or MathML.

2.3 Set of M-terms

A mathematical document contains mathematical formulae, which are integral to the content of the document. As mentioned in the previous sections, these formulae are usually represented in TeX if authored by humans, or in MathML (presentation or mixed content-presentation) if produced by machines.

To be able to search for such structural information using a fulltext indexing approach as in the Math Indexer and Searcher (MIaS) system [3,4], a convenient representation needs to be selected. This representation needs to be a trade-off between the TeX-powered authors’ part of the world and a machine-friendly structural and semantic notation that preserves as much information as possible, such as Content MathML. In the MIaS system we opted for Presentation MathML as it stands, we find, exactly halfway. It is relatively easy to obtain by converting the author’s TeX markup and it still holds the structural information necessary for machine processing. It is also easily extensible by Content MathML trees capturing the semantics of formulae.

Such mathematical markup needs to be preprocessed before indexing, mainly to give the user the best possible search experience. For each formula in the text, the system produces several representations which are stored in the index and are searchable in the same way as regular textual terms. These are called M-terms.

M-terms are translated from XML to a linear string form. In this form they are stored by the indexing core. This representation omits any XML markup that would be redundant in such a form, such as start and end tags, and replaces it with brackets to prevent ambiguity. Most of the attributes setting the visual behaviour of the expression can also be left out, since they do not hold any information pertaining to the meaning of the formula. This representation can be further compacted by substituting single characters for tag names to decrease storage space requirements. For example, the simple expression $a^2 + b$ in its XML form

    <mrow>
      <msup><mi>a</mi><mn>2</mn></msup>
      <mo>+</mo>
      <mi>b</mi>
    </mrow>

is translated to the linear form

    mrow(msup(mi(a)mn(2))mo(+)mi(b))

Based on a custom tag name dictionary, where mrow = R, msup = J, mi = I, mn = N and mo = O, this is further compacted to

    R(J(I(a)N(2))O(+)I(b))

A set of sub-M-terms is generated for each input formula. It consists of subformula-weight pairs. For this particular expression, it is:

    { (mi(a), 0.08166666),
      (mn(2), 0.08166666),
      (msup(mi(a)mn(2)), 0.11666667),
      (mo(+), 0.11666667),
      (mi(b), 0.11666667),
      (mrow(mi(b)mo(+)msup(mi(a)mn(2))), 0.16666667),
      (msup(mi(1)mn(2)), 0.093333334),
      (mrow(mi(1)mo(+)msup(mi(2)mn(2))), 0.13333334),
      (msup(mi(a)mn(¶)), 0.058333334),
      (mrow(mi(b)mo(+)msup(mi(a)mn(¶))), 0.083333336),
      (msup(mi(1)mn(¶)), 0.046666667),
      (mrow(mi(1)mo(+)msup(mi(2)mn(¶))), 0.06666667) }

These formulae are derived from the original one and their level of similarity is expressed by the weight factor. This representation not only captures the structural similarity of mathematical formulae, it also copes with different variable names and with mathematical properties of operators (commutativity). As such, the representation of formulae by an M-term set with weights is directly usable for indexing or for document similarity computations.

To provide these and other uses of this representation, we have set up a RESTful web service, where for each input formula one can get the set of M-terms as they would be indexed in the MIaS system. An example of use can be found here: http://aura.fi.muni.cz:8085/mias4gensim/mathprocess?mterm=a+b
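The translation from XML to the linear form, and the subsequent compaction with a tag dictionary, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the MIaS implementation: the function names `linearize` and `local_name` are hypothetical, the tag dictionary is the one quoted in the example above, and weighting, operand ordering and variable unification are left out.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Optional

# Tag dictionary taken from the example in the text.
TAG_CODES = {"mrow": "R", "msup": "J", "mi": "I", "mn": "N", "mo": "O"}


def local_name(tag: str) -> str:
    """Drop an XML namespace prefix such as '{http://...}mi' (hypothetical helper)."""
    return tag.rsplit("}", 1)[-1]


def linearize(node: ET.Element, codes: Optional[Dict[str, str]] = None) -> str:
    """Turn a Presentation MathML subtree into a bracketed linear string.

    Start and end tags are replaced by name(...); attributes are ignored.
    If `codes` is given, tag names found in it are replaced by their code.
    """
    name = local_name(node.tag)
    if codes is not None:
        name = codes.get(name, name)
    if len(node) == 0:
        # Leaf element: keep its text content.
        inner = (node.text or "").strip()
    else:
        # Inner element: concatenate the linear forms of the children.
        inner = "".join(linearize(child, codes) for child in node)
    return name + "(" + inner + ")"


formula = ET.fromstring(
    "<mrow><msup><mi>a</mi><mn>2</mn></msup><mo>+</mo><mi>b</mi></mrow>"
)
print(linearize(formula))
print(linearize(formula, TAG_CODES))
```

Passing the tag dictionary as an optional argument keeps the readable and the compacted form in one function, so both can be compared on the same input.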

3 Mathematical Corpora

3.1 Normalization

When building mathematical corpora using MathML as a language for preserving mathematical formulae, it turns out to be very useful to process and normalize the MathML that is being stored. This is necessary because one mathematical formula can be encoded in MathML in different forms, using different sequences of characters in the source code, while its meaning stays the same. For example, the formula $x^2 + y^2$ can be encoded in MathML in the form:

    [MathML listing, markup lost in extraction: x 2 + y 2]

But some other author can use this form:

    [MathML listing, markup lost in extraction: x2 + y2]

To be able to find documents that contain our formula in any of these codings, we need one normalized form that will be stored in the index. Subsequently, any query for this formula in any coding has to be transformed to the normalized form first.

Moreover, examples of documents from the real world (the PubMed Central digital library workflow) show that validation of MathML source code is not enough. Elbow et al. [5] demonstrate the well-known fact that the main target of current authors is print output; consequently one can find, for example, a MathML fragment [markup lost in extraction: 75] as the source code of the number ‘75’. These anomalies have to be sorted out before publishing and indexing in a repository.

For semantically identical formulae there exist infinitely many ways of representing them in MathML. For NLP handling it would be convenient to have one canonical representation of a formula.

3.2 Canonicalization

Proper MathML normalization (canonicalization) is not easy given that MathML is a very complex markup language. Some existing tools we have tested fail when run over a set of MathML test documents [6] that were designed to cover a wide range of MathML features.

Our approach to MathML normalization has so far involved a trial use of UMCL (Universal Maths Conversion Library; http://inova.ufr-info-p6.jussieu.fr/maths/umcl) [7,8]. The main purpose of the UMCL tool set is to enable transcription of MathML formulae to Braille national codes. Related to this task is also the need for MathML formulae unification. The UMCL transformation of MathML to Canonical MathML is carried out using a set of XSL stylesheets [9].

With minor modifications, the UMCL MathML transformation was used in the WebMIaS interface [10] (see Section 4) that can be used to search over our MREC corpus (see Section 3.3). This showed the benefits of formulae normalization in practice: a search for the $x^2 + y^2$ formula using the first form of MathML code from the previous section found no results. However, for the second form of MathML, which is the result of the UMCL XSL transformation of the first form, there were 36,817 hits in MREC corpus version 2011.4.

Unfortunately, the MathML canonicalization module of the UMCL tool set is not as powerful as we thought at the beginning. Using the W3C MathML Test Suite [6] mentioned above, some weak points in the UMCL normalization process have been identified. Among other things, there are problems with MathML tags like ‘mphantom’, ‘mfenced’, ‘mglyph’, ‘mmultiscripts’, ‘mover’ and ‘mstyle’ that are not properly converted. Furthermore, attributes of MathML elements are not reported in the UMCL canonicalized MathML. These problems were discussed with the UMCL developers but no fast and clear solution seems to be available.

Due to these problems, UMCL in its current version does not seem to be directly applicable to the MREC corpus and further research in this area is definitely necessary.

3.3 Corpus MREC

To provide a test platform for mathematical search tools, we are building a corpus of mathematical texts. We call this corpus MREC. MREC is based on arXMLiv [11], a project of Michael Kohlhase’s group at Jacobs University Bremen. arXMLiv documents come from arXiv.org but have been translated to XML by the arXMLiv project. These documents cover different STEM areas: Physics, Mathematics, Computer Science, Quantitative Biology, Quantitative Finance and Statistics.

However, MREC is not an exact copy of the arXMLiv content. It contains just a subset of arXMLiv. arXMLiv puts transformed documents into several classes (successful, complete with errors and incomplete) depending on the results of the transformation. MREC contains papers from the conversion classes successful and complete with errors (missing macros); see Table 1. We have collected 439,423 documents in well-formed XHTML, containing mathematical formulae in valid MathML.

Table 1. Documents collected from arXMLiv

arXMLiv transformation result class       Quantity
successful (no problem)                     65,874
successful (warning)                       291,879
complete with errors (missing macros)       81,670
All documents                              439,423

Moreover, several modifications of the files were necessary, from our point of view, in order to make the documents well-formed and valid. These modifications include removing unnecessary attributes, namespace proxies, ‘div’ elements nested in ‘span’ elements and so on. MREC consists of well-formed XHTML documents. MathML is used for the representation of mathematical formulae.

Although MREC is under constant development, it is necessary for both archive and comparison purposes to produce stable release versions. For this reason, there are several versions of the MREC corpus available at http://nlp.fi.muni.cz/projekty/eudml/MREC/.

The first public version of MREC, version 2011.3.324, consists of 324,060 documents. The resulting corpus size was 53 GB uncompressed, 6.7 GB compressed. The documents contained 112,055,559 formulae in total, from which 2,129,261,646 mathematical expressions were indexed. The resulting index size was approximately 45 GB.

The newer version of MREC, version 2011.4.439, consists of 439,423 scientific documents containing 158,106,118 mathematical formulae. 2,910,314,146 expressions were indexed and the resulting size of the index is 63 GB. The sizes of the uncompressed and compressed corpus are 124 GB and 15 GB, respectively.
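The release figures above imply a few derived quantities that are useful when comparing versions: formulae per document, indexed expressions per input formula, and the compression ratio of the corpus. The short Python sketch below computes them from the numbers quoted in the text; the dictionary layout and the function name `derived_ratios` are illustrative choices, not part of MREC or MIaS.

```python
# Figures copied from the descriptions of the two MREC releases.
RELEASES = {
    "2011.3.324": {
        "documents": 324_060,
        "formulae": 112_055_559,
        "indexed": 2_129_261_646,
        "raw_gb": 53.0,
        "packed_gb": 6.7,
    },
    "2011.4.439": {
        "documents": 439_423,
        "formulae": 158_106_118,
        "indexed": 2_910_314_146,
        "raw_gb": 124.0,
        "packed_gb": 15.0,
    },
}


def derived_ratios(stats):
    """Return per-document and per-formula ratios for one release."""
    return {
        "formulae_per_document": stats["formulae"] / stats["documents"],
        "indexed_per_formula": stats["indexed"] / stats["formulae"],
        "compression_ratio": stats["raw_gb"] / stats["packed_gb"],
    }


for version, stats in RELEASES.items():
    ratios = derived_ratios(stats)
    print(version, {name: round(value, 2) for name, value in ratios.items()})
```

In both releases each input formula yields on the order of 18 to 19 indexed expressions, which reflects the subformulae and generalized variants that are stored for every formula.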

4 Math Retrieval

Search functionality is nowadays a key means of finding one’s way in the vast amount of information "out there" and of obtaining the information we seek. Just as websites providing special content such as images and videos enable searching for these tokens, portals providing mathematical content such as EuDML [12] should also be able to search for formulae.

In our view, the optimal way of doing so is to provide a simple Google-like interface where one can pose mathematical and textual query tokens one alongside the other. Search results returned for a textual query can then be finely constrained by adding a formula to the query and vice versa. We present this approach in the WebMIaS interface [10].

For example, posting the simple query $x^2 + y^2$ in our web interface returns 36,817 results. Adding one more keyword, Euclid, reduces the number of results to only 97, all of which contain this textual term. Conversely, searching only for Euclid returns 848 results and by adding the $x^2 + y^2$ expression, we get the same 97 matches (MREC 2011.4.439).

To implement a math-aware IR system, in addition to the web interface it was necessary to create an index to be consulted during query evaluation. We use our M-term representation for this, as described in detail in [3,4]. We have evaluated the system’s speed. As shown in Table 2, the indexing time of the MIaS system scales linearly with the number of documents. This gives feasible response times even for our billions of indexed subformulae.

Table 2. Indexing scalability test results (run on a machine with 448 GiB RAM and eight 8-core 64-bit Intel Xeon™ X7560 2.26 GHz processors).

 # Docs   Input formulae   Indexed formulae   run-time [ms]   CPU time [ms]
 10,000        3,406,068         64,008,762       2,145,063       2,102,770
 50,000       18,037,842        333,716,261      11,382,709      10,871,500
100,000       36,328,126        670,335,243      23,066,679      21,992,100
200,000       72,030,095      1,326,514,082      46,143,472      44,006,180
300,000      108,786,856      2,005,488,153      71,865,018      66,998,550
350,000      125,974,221      2,318,482,748      83,199,724      77,886,160
439,423      158,106,118      2,910,314,146     104,829,757      97,393,301
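One simple way to check the linear scaling reported in Table 2 is to divide the indexing run-time by the number of documents in each row: under linear scaling the per-document cost should stay roughly constant. The Python sketch below does this with the run-time column of the table; the names `ROWS` and `ms_per_document` are illustrative and not part of MIaS.

```python
# (number of documents, indexing run-time in ms), copied from Table 2.
ROWS = [
    (10_000, 2_145_063),
    (50_000, 11_382_709),
    (100_000, 23_066_679),
    (200_000, 46_143_472),
    (300_000, 71_865_018),
    (350_000, 83_199_724),
    (439_423, 104_829_757),
]


def ms_per_document(rows):
    """Average indexing run-time per document for every table row."""
    return [run_time / docs for docs, run_time in rows]


costs = ms_per_document(ROWS)
for (docs, _), cost in zip(ROWS, costs):
    print(f"{docs:>8,} documents: {cost:6.1f} ms per document")

# Ratio of the most to the least expensive row; close to 1 means linear scaling.
print("max/min ratio:", round(max(costs) / min(costs), 3))
```

The per-document cost stays within a narrow band across a more than fortyfold increase in corpus size, which is consistent with the linear scaling stated in the text.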

5 Further Research Directions in Math Similarity, Clustering and Disambiguation

In mathematics, the Mathematics Subject Classification (MSC) is used by most journals today, being supported and developed by both Mathematical Reviews (MR) and Zentralblatt Math (ZMath). Our research so far [13] has shown that machine-learned classification and similarity tasks are tractable and can be supported by DMLs.

However, previous research paid very little attention to the representation of mathematics. Either textual tokens alone were used, or the formulae were split into variables, constants and operators, and used in a ‘bag of words’ for documents. Such a representation is insufficient given that it does not convey the structure of formulae, and neither does it pay attention to semantically similar formulae (e.g. written with different variable names, ordered differently as a + b vs. b + a, etc.).

We are currently using the Gensim [14] system to evaluate the possibility of using M-terms instead of the usual tokenization, and to compare the effects this new representation has on similarity and clustering with those of non-math-aware representations. We believe that the M-term representation will significantly improve the quality of document similarity metrics computed by Gensim.

Further improvements could be achieved by employing cutting-edge results on semantic disambiguation. The symbol f might play the rôle (have the meaning) of a variable, a functional, a (linear) function, and potentially a dozen other meanings. To achieve greater relevance in searching and better document clustering, even mathematical formulae should be disambiguated at this level, as authors are usually reluctant to do so in the (LaTeX) sources or in Content MathML. Our representation method easily accommodates these refinements: one just needs to add a notation and weighting for the similarity of new terms representing Content MathML (semantics).

There have been recent attempts to bring NLP approaches to math corpora handling [15,16].
The most consistent problem remains the high degree of ambiguity in mathematical formulae and the nonexistence of tagged, disambiguated math data.

There is a promising approach, called LDA-frames [17], to distinguishing the roles of words (math tokens) depending on the contexts of their use in a corpus. It uses statistics to distinguish different roles based on different structural patterns of word usage in corpora. We are considering the possibility of using a fuzzy version of Formal Concept Analysis (FCA) [18] to identify the rôles of math tokens in formulae and, in combination with LDA-frames, to disambiguate them.

6 Summary and Conclusions

In this paper, we have identified and described the problems we have faced when building MREC, a nontrivial corpus of STEM documents. We have suggested the M-term representation for math-aware indexing and similarity computations. We have reported current results in math-aware indexing and searching. We have discussed future research directions towards fully fledged math-aware corpora processing, such as math-aware document similarity or disambiguation of math symbols in formulae.

Acknowledgements This work has been partially supported by the Ministry of Education of CR within the Center of Basic Research LC536 and by the European Union through its Competitiveness and Innovation Programme (Information and Communications Technologies Policy Support Programme, “Open access to scientific information”, Grant Agreement No. 250503).

References

1. Bartošek, M., Lhoták, M., Rákosník, J., Sojka, P., Šárfy, M.: DML-CZ: The Objectives and the First Steps. In: Borwein, J., Rocha, E.M., Rodrigues, J.F., eds.: CMDE 2006: Communicating Mathematics in the Digital Era. A. K. Peters, MA, USA (2008) 69–79.
2. Borbinha, J., Bouche, T., Nowiński, A., Sojka, P.: Project EuDML – A First Year Demonstration. In: Davenport, J.H., Farmer, W.M., Urban, J., Rabe, F., eds.: Intelligent Computer Mathematics. Proceedings of 18th Symposium, Calculemus 2011, and 10th International Conference, MKM 2011. Volume 6824 of Lecture Notes in Artificial Intelligence, LNAI., Berlin, Germany, Springer-Verlag (2011) 281–284 http://dx.doi.org/10.1007/978-3-642-22673-1_21.
3. Sojka, P., Líška, M.: Indexing and Searching Mathematics in Digital Libraries – Architecture, Design and Scalability Issues. In: Davenport, J.H., Farmer, W.M., Urban, J., Rabe, F., eds.: Intelligent Computer Mathematics. Proceedings of 18th Symposium, Calculemus 2011, and 10th International Conference, MKM 2011. Volume 6824 of Lecture Notes in Artificial Intelligence, LNAI., Berlin, Germany, Springer-Verlag (2011) 228–243 http://dx.doi.org/10.1007/978-3-642-22673-1_16.
4. Sojka, P., Líška, M.: The Art of Mathematics Retrieval. In: Proceedings of the ACM Conference on Document Engineering, DocEng 2011, Mountain View, CA, Association of Computing Machinery (2011) 57–60 http://doi.acm.org/10.1145/2034691.2034703.
5. Elbow, A., Krick, B., Kelly, L.: PMC Tagging Guidelines: A case study in normalization. In: Proceedings of the Journal Article Tag Suite Conference 2011, National Center for Biotechnology Information (2011) http://www.ncbi.nlm.nih.gov/books/NBK62090/#elbow-S8.
6. W3C: MathML Test Suite (2010) http://www.w3.org/Math/testsuite/.
7. Archambault, D., Stöger, B., Batušić, M., Fahrengruber, C., Miesenberger, K.: A software model to support collaborative mathematical work between Braille and sighted users. In: Proceedings of the ASSETS 2007 Conference (9th International ACM SIGACCESS Conference on Computers and Accessibility), ACM (2007) 115–122 http://portal.acm.org/ft_gateway.cfm?id=1296864&type=pdf.
8. Archambault, D., Berger, F., Moço, V.: Overview of the “Universal Maths Conversion Library”. In: Pruski, A., Knops, H., eds.: Assistive Technology: From Virtuality to Reality: Proceedings of 8th European Conference for the Advancement of Assistive Technology in Europe AAATE 2005, Lille, France, Amsterdam, The Netherlands, IOS Press (2005) 256–260.
9. Archambault, D., Moço, V.: Canonical MathML to Simplify Conversion of MathML to Braille Mathematical Notations. In: Miesenberger, K., Klaus, J., Zagler, W., Karshmer, A., eds.: Computers Helping People with Special Needs. Volume 4061 of Lecture Notes in Computer Science. Springer Berlin / Heidelberg (2006) 1191–1198 http://dx.doi.org/10.1007/11788713_172.
10. Líška, M., Sojka, P., Růžička, M., Mravec, P.: Web Interface and Collection for Mathematical Retrieval. In: Sojka, P., Bouche, T., eds.: Proceedings of DML 2011, Bertinoro, Italy, Masaryk University (2011) 77–84 http://www.fi.muni.cz/~sojka/dml-2011-program.html.
11. Stamerjohanns, H., Kohlhase, M., Ginev, D., David, C., Miller, B.: Transforming Large Collections of Scientific Publications to XML. Mathematics in Computer Science 3 (2010) 299–307 http://dx.doi.org/10.1007/s11786-010-0024-7.
12. Sylwestrzak, W., Borbinha, J., Bouche, T., Nowiński, A., Sojka, P.: EuDML – Towards the European Digital Mathematics Library. In: Sojka, P., ed.: Proceedings of DML 2010, Paris, France, Masaryk University (2010) 11–24 http://dml.cz/dmlcz/702569.
13. Řehůřek, R., Sojka, P.: Automated Classification and Categorization of Mathematical Knowledge. In: Autexier, S., Campbell, J., Rubio, J., Sorge, V., Suzuki, M., Wiedijk, F., eds.: Intelligent Computer Mathematics – Proceedings of 7th International Conference on Mathematical Knowledge Management MKM 2008. Volume 5144 of Lecture Notes in Computer Science LNCS/LNAI., Berlin, Heidelberg, Springer-Verlag (2008) 543–557.
14. Řehůřek, R., Sojka, P.: Software Framework for Topic Modelling with Large Corpora. In: Proceedings of LREC 2010 workshop New Challenges for NLP Frameworks, Valletta, Malta, ELRA (2010) 45–50 http://is.muni.cz/publication/884893/en, software available at http://nlp.fi.muni.cz/projekty/gensim.
15. Anca, Ş.: Natural Language and Mathematics Processing for Applicable Theorem Search. Master’s thesis, Jacobs University, Bremen (2009) https://svn.eecs.jacobs-university.de/svn/eecs/archive/msc-2009/aanca.pdf.
16. Grigore, M., Wolska, M., Kohlhase, M.: Towards context-based disambiguation of mathematical expressions. Math-for-Industry Lecture Note Series 22 (2009) 262–271.
17. Materna, J.: LDA-Frames: an Unsupervised Approach to Generating Semantic Frames. In: Proceedings of CICLING 2012, Springer-Verlag (2012) 12 pages, submitted.
18. Bělohlávek, R.: Concept lattices and order in fuzzy logic. Annals of Pure and Applied Logic 128(1–3) (2004) 277–298.

Web Interface and Collection for Mathematical Retrieval
WebMIaS and MREC

Martin Líška, Petr Sojka, Michal Růžička, and Petr Mravec

Masaryk University, Faculty of Informatics, Botanická 68a, 602 00 Brno, Czech Republic [email protected], [email protected], [email protected]

Abstract. We demonstrate searching of mathematical expressions in technical digital libraries on the MREC collection of 439,423 real scientific documents with more than 158 million mathematical formulae. Our solution, the WebMIaS system, allows the retrieval of mathematical expressions written in TeX or MathML. TeX queries are converted on the fly into tree representations of Presentation MathML, which is used for indexing. WebMIaS allows complex queries composed of plain text and mathematical formulae, using MIaS (Math Indexer and Searcher), a math-aware search engine based on the state-of-the-art system Lucene. MIaS implements proximity math indexing with a subformulae similarity search.

Keywords: math indexing and retrieval, mathematical digital libraries, information systems, information retrieval, mathematical content search, document ranking of mathematical papers, math text mining, WebMIaS, MIaS, Tralics, TeX, UMCL, Lucene

1 Introduction

The gateway to the vast treasures held in digital libraries’ content is entered by searching. The Google generation is starting to demand a simple Google-like interface to access digital content, even on a global scale. The mainstream technologies and interfaces are developed only for plain text, without support for handling mathematical formulae: documents are represented as a bag of words, in a simple vector space model. Scientific and technical documents are full of indexes, exponents, and complex mathematical expressions, even in basic paper metadata, titles and abstracts.

Our experience with Google Scholar shows that not handling mathematical expressions in citations causes severe problems. For example, the paper by Kováčik and Rákosník [3] appears there as more than twenty different papers¹, mainly because of different and wrong representations (produced by different OCR) of the mathematics in the paper metadata (title).

Although there have been several attempts to solve the mathematics search problem, none of them has, as yet, fulfilled expectations. For example, Springer offers LaTeXSearch², based just on TeX math string matching, which does not take into account the structural or semantic similarity of mathematical expressions at all.

We have created the web interface WebMIaS for our MIaS (Math Indexer and Searcher) system [6], which indexes hundreds of thousands³ of mathematical documents. We demonstrate a solution built on the state-of-the-art fulltext indexing engine Lucene™; we have added ‘math-awareness’ to it as a plug-in. To test the system, we have created (Section 2) and indexed (Section 3) the MREC collection of hundreds of thousands of mathematical documents. In Section 4 we describe the transformations needed during querying and indexing (canonicalization of MathML). The WebMIaS web interface is then presented in Section 5. The reader finds final remarks in Section 6.

¹ cf. http://scholar.google.com/scholar?q=Kovacik+Rakosnik
² http://www.latexsearch.com/

Petr Sojka, Thierry Bouche (editors): DML 2011, Towards a Digital Mathematics Library, pp. 77–84. © Masaryk University, 2011 ISBN 978-80-210-5542-1

Table 1: Documents collected from arXMLiv

arXMLiv transformation result class       Quantity
successful (no problem)                     65,874
successful (warning)                       291,879
complete with errors (missing macros)       81,670
All documents                              439,423

2 Mathematical Retrieval Collection MREC

To evaluate our system, we have built a corpus of mathematical texts called MREC. We downloaded documents from arXMLiv⁴ [8], where TeX documents from arXiv.org are transformed into XML documents. For the representation of mathematical formulae, MathML, a W3C standard, is used. The documents used come from different scientific areas (Physics, Mathematics, Computer Science, Quantitative Biology, Quantitative Finance and Statistics). arXMLiv⁵ sorts transformed documents into several classes, based on the return value of the transformation to MathML: successful, complete with errors, incomplete and none. MREC does not contain the full arXiv, only documents from the conversion classes successful and complete with errors (missing macros); see Table 1. We have collected 439,423 documents in well-formed XHTML, containing mathematical formulae in valid MathML.

We hope that this corpus might be used for benchmarking mathematical retrieval; thus we have named it MREC (Mathematical REtrieval Collection) and made it available for this purpose at [4]. In our web interface for math searches we currently use this corpus of real mathematical papers.

³ LaTeXSearch currently searches only three million formulae.
⁴ http://kwarc.info/projects/arXMLiv/
⁵ http://arxmliv.kwarc.info/

3 Math-aware Indexing

We have developed a math-aware, full-text based search engine called MIaS (Math Indexer and Searcher) [6]. It processes documents containing mathematical notation in Presentation MathML format; however, it filters out all unnecessary presentational elements as well as any other MathML notation (Content MathML or other markup). MIaS allows users to search for mathematical formulae as well as the textual content of documents.

Since mathematical expressions are highly structured and have no canonical form, our system pre-processes formulae in several steps to increase the chance of matching two equal expressions with different notation and/or non-equal, but similar formulae. By analogy with natural language searching, MIaS searches not only for whole sentences (whole formulae), but also for single words and phrases (subformulae down to single variables, symbols, constants, etc.). For every input formula and its subformulae, MIaS creates several differently generalized representations to allow similarity searching of mathematics.

For calculating the relevance of matched expressions to the user’s query, MIaS uses a heuristic weighting of indexed terms, which accordingly affects the scores of matched documents and thus the order of results. Weights are assigned to a formula according to its complexity, its level in the input formula tree and its level of generalization. At the end of all these processing steps, formulae are converted from XML nodes to a compacted linear string form which can be handled by the indexing core.
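The idea of indexing every subformula in several generalized variants can be illustrated with a small Python sketch. It is not the MIaS tokenizer: the function names are hypothetical, the input is assumed to be namespace-free Presentation MathML, and the heuristic weights and the ordering of operands are omitted. The two generalizations shown, numbering variables in order of appearance and replacing numeric constants by the placeholder `¶`, are assumptions modelled on the M-term notation `mi(1)` and `mn(¶)`.

```python
import xml.etree.ElementTree as ET

CONST = "¶"  # assumed placeholder standing for any numeric constant


def make_renamer(unify_vars, unify_consts):
    """Build a function that rewrites the text of leaf elements."""
    ids = {}

    def rename(tag, text):
        if unify_vars and tag == "mi":
            # Variables are numbered in order of first appearance.
            return str(ids.setdefault(text, len(ids) + 1))
        if unify_consts and tag == "mn":
            return CONST
        return text

    return rename


def linear(node, rename):
    """Bracketed linear form of a subtree, with leaf texts rewritten."""
    if len(node) == 0:
        inner = rename(node.tag, (node.text or "").strip())
    else:
        inner = "".join(linear(child, rename) for child in node)
    return node.tag + "(" + inner + ")"


def representations(formula):
    """List every subformula; compound ones also in generalized variants."""
    result = []
    for sub in formula.iter():
        if len(sub) == 0:
            # Single symbols are kept literally in this sketch.
            result.append(linear(sub, make_renamer(False, False)))
            continue
        variants = []
        for unify_vars in (False, True):
            for unify_consts in (False, True):
                text = linear(sub, make_renamer(unify_vars, unify_consts))
                if text not in variants:
                    variants.append(text)
        result.extend(variants)
    return result


formula = ET.fromstring(
    "<mrow><msup><mi>a</mi><mn>2</mn></msup><mo>+</mo><mi>b</mi></mrow>"
)
for term in representations(formula):
    print(term)
```

A query such as c^2 would then share the generalized term `msup(mi(1)mn(2))` with the indexed formula, which is what allows matching across different variable names.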

4 System Workflow

The top-level indexing scheme is shown in Figure 1. Document and query processing is done separately for plain text terms and mathematical terms. Indexing of mathematics is done by our Presentation MathML tokenizer implemented in Java for Apache Lucene™ 3.1 and Lucene Solr™ 3.1, taking advantage of the open Lucene architecture.

MathML notation in the query and in indexed documents is normalized into Canonical MathML [1] to increase the precision of the system. For conversion into this normalized MathML format we use the software library UMCL (Universal Maths Conversion Library). The main purpose of the UMCL toolset is the transcription of MathML formulae to Braille national codes. Related to our task is also the need for MathML formulae unification. The UMCL transformation of MathML to Canonical MathML is carried out using a set of XSL stylesheets. This transformation was integrated into the WebMIaS system with only the slightest modifications: the UMCL transformation adds attributes of the form id="formula:xx" to every node of the output MathML. This is not necessary for the purposes of WebMIaS, as it adds ‘noise’ to the formulae and increases the size of the index. Thus, these attributes are not added to the Canonical MathML used by WebMIaS.
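The effect of leaving out the id="formula:xx" attributes can be shown on a toy fragment. The Python sketch below removes such attributes after the fact, whereas WebMIaS avoids generating them in the first place; the function name `strip_formula_ids` and the sample markup are hypothetical.

```python
import xml.etree.ElementTree as ET


def strip_formula_ids(root):
    """Remove id attributes of the form 'formula:...' from every element."""
    for node in root.iter():
        if node.get("id", "").startswith("formula:"):
            del node.attrib["id"]
    return root


# Hypothetical canonicalizer output with an id on every node.
sample = (
    '<mrow id="formula:1">'
    '<mi id="formula:2">x</mi>'
    '<mo id="formula:3">+</mo>'
    '<mi id="formula:4" mathvariant="bold">y</mi>'
    "</mrow>"
)

cleaned = strip_formula_ids(ET.fromstring(sample))
print(ET.tostring(cleaned, encoding="unicode"))
```

Other attributes are left untouched here; whether they should be kept is a separate normalization decision.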

Our latest experiments with canonical forms of MathML generated by the UMCL transformation show that it not only increases the fairness of similarity ranking, but also helps to match a query against the indexed form of MathML. For example, if the user asked the system for the $x^2 + y^2$ formula using MathML of the form

    [MathML listing, markup lost in extraction: x 2 + y 2]

the system would not be able to find any similar formulae due to omission of the element in the MathML. Provided that the MathML canonicalization of the query is done prior to the search, the canonical form of the query

    [MathML listing, markup lost in extraction: x2 + y2]

results in 36,817 hits in MREC 2011.4.

For a user-friendly demonstration of math-aware information retrieval, we have built the web interface WebMIaS (see Figure 2).

5 WebMIaS

WebMIaS demonstrates the possibility of querying mathematical content on a large scale. This has been facilitated by the full indexing of the mathematical corpus MREC. In the user interface (UI) we tried to mimic the simplicity of Google. In addition to the standard textual query terms, mathematical terms (mterms) may appear in the query as well, adding to the document score with a weight that depends on the similarity of the matched formula to the queried one. An mterm can be written either in MathML or in TeX notation enclosed in two dollar signs. Since most mathematicians are used to using TeX compact


Fig. 2: WebMIaS web interface

of differently complex queries (mixed, non-mixed, more/less complex single/multiple formulae). The resulting average query time was 469 ms. It is very difficult to evaluate mathematical search results and to verify the soundness of our design. For a given set of queries, there should exist beforehand a complete list of the documents ordered by their relevance to the query, with which the actual results can be compared. So far we have evaluated the system empirically using our WebMIaS demo interface, which is publicly available at http://nlp.fi.muni.cz/projekty/eudml/mias/. It currently works on our mathematical corpus MREC version 2011.4 with 158,106,118 input formulae and 2,910,314,146 indexed (sub)formulae.

Web Interface and Collection for Mathematical Retrieval
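If a relevance list of the kind described above were available, the actual results could be scored against it. The following Python sketch shows two standard measures, precision at k and average precision. It is an assumption that these measures would suit this evaluation; the source does not name any metric, and the document ids and relevance judgements below are invented for illustration.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the first k retrieved document ids that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


def average_precision(retrieved, relevant):
    """Mean of precision@rank taken at each rank where a relevant document appears."""
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


# Hypothetical ranked result list and hypothetical set of relevant documents.
retrieved = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}

p2 = precision_at_k(retrieved, relevant, 2)
ap = average_precision(retrieved, relevant)
print(p2, ap)
```

Both measures use only a set of relevant documents; comparing against a fully ordered relevance list, as the text suggests, would call for a rank-aware measure instead.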

Table 2: Scalability test results (run on a machine with 448 GiB RAM and eight 8-core 64-bit Intel Xeon™ X7560 2.26 GHz processors).

# Docs    Input formulae   Indexed formulae   Indexing run-time [ms]   Indexing CPU time [ms]
 10,000        3,406,068         64,008,762                2,145,063                2,102,770
 50,000       18,037,842        333,716,261               11,382,709               10,871,500
100,000       36,328,126        670,335,243               23,066,679               21,992,100
200,000       72,030,095      1,326,514,082               46,143,472               44,006,180
300,000      108,786,856      2,005,488,153               71,865,018               66,998,550
350,000      125,974,221      2,318,482,748               83,199,724               77,886,160
439,423      158,106,118      2,910,314,146              104,829,757               97,393,301
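One way to read Table 2 is to divide the indexing run-time by the number of documents; if the system scales roughly linearly, this per-document cost should stay nearly constant. The Python sketch below does this arithmetic on the figures from the table. The helper name `per_document_ms` is ours, and the check is an illustration rather than part of the original evaluation.

```python
# Rows of Table 2:
# (documents, input formulae, indexed formulae, run-time [ms], CPU time [ms])
rows = [
    (10_000, 3_406_068, 64_008_762, 2_145_063, 2_102_770),
    (50_000, 18_037_842, 333_716_261, 11_382_709, 10_871_500),
    (100_000, 36_328_126, 670_335_243, 23_066_679, 21_992_100),
    (200_000, 72_030_095, 1_326_514_082, 46_143_472, 44_006_180),
    (300_000, 108_786_856, 2_005_488_153, 71_865_018, 66_998_550),
    (350_000, 125_974_221, 2_318_482_748, 83_199_724, 77_886_160),
    (439_423, 158_106_118, 2_910_314_146, 104_829_757, 97_393_301),
]


def per_document_ms(table):
    """Return (documents, indexing run-time in ms per document) for each row."""
    return [(docs, runtime_ms / docs) for docs, _, _, runtime_ms, _ in table]


per_doc = per_document_ms(rows)
for docs, ms in per_doc:
    print(f"{docs:>7} docs: {ms:6.1f} ms per document")
```

The per-document run-time stays between roughly 214 and 240 ms over the whole range, which is consistent with near-linear scaling of indexing time with collection size.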

6 Conclusion

We have demonstrated a fully functioning information retrieval interface, WebMIaS, capable of retrieving both text and mathematics from full texts in Presentation MathML. The system scales well and is capable of being deployed in several digital libraries. As our development was motivated by future deployment in the EuDML (footnote 6) project [9], the experience gained with WebMIaS will be carried over to the EuDML UI. Another area of planned long-term research is support for Content MathML, handled in a way similar to the current handling of Presentation MathML. The architectural design is suited to it, but as most of the mathematics within EuDML will be in Presentation MathML taken from PDFs, this is not currently a high priority.

Acknowledgements. This work has been in part financed by the European Union through its Competitiveness and Innovation Programme (Information and Communications Technologies Policy Support Programme, “Open access to scientific information”, Grant Agreement No. 250503).

References

1. Archambault, D., Moço, V.: Canonical MathML to Simplify Conversion of MathML to Braille Mathematical Notations. In: Miesenberger, K., Klaus, J., Zagler, W., Karshmer, A. (eds.) Computers Helping People with Special Needs, Lecture Notes in Computer Science, vol. 4061, pp. 1191–1198. Springer Berlin / Heidelberg (2006), http://dx.doi.org/10.1007/11788713_172
2. Grimm, J.: Producing MathML with Tralics. In: Sojka [5], pp. 105–117, http://dml.cz/dmlcz/702579
3. Kováčik, O., Rákosník, J.: On spaces L^{p(x)} and W^{k,p(x)}. Czechoslovak Mathematical Journal 41, 592–618 (1991), http://dml.cz/dmlcz/102493
4. MREC – Mathematical REtrieval Collection, http://nlp.fi.muni.cz/projekty/eudml/MREC/index.html

6 http://eudml.eu

5. Sojka, P. (ed.): Towards a Digital Mathematics Library. Masaryk University, Paris, France (Jul 2010), http://www.fi.muni.cz/~sojka/dml-2010-program.html
6. Sojka, P., Líška, M.: Indexing and Searching Mathematics in Digital Libraries – Architecture, Design and Scalability Issues. In: Davenport, J.H., Farmer, W., Rabe, F., Urban, J. (eds.) Proceedings of CICM Conference 2011 (Calculemus/MKM). Lecture Notes in Artificial Intelligence, LNAI, vol. 6824, pp. 228–243. Springer-Verlag, Berlin, Germany (Jul 2011)
7. Stamerjohanns, H., Ginev, D., David, C., Misev, D., Zamdzhiev, V., Kohlhase, M.: MathML-aware Article Conversion from LaTeX. In: Sojka, P. (ed.) Proceedings of DML 2009. pp. 109–120. Masaryk University, Grand Bend, Ontario, CA (Jul 2009), http://dml.cz/dmlcz/702561
8. Stamerjohanns, H., Kohlhase, M., Ginev, D., David, C., Miller, B.: Transforming Large Collections of Scientific Publications to XML. Mathematics in Computer Science 3, 299–307 (2010), http://dx.doi.org/10.1007/s11786-010-0024-7
9. Sylwestrzak, W., Borbinha, J., Bouche, T., Nowiński, A., Sojka, P.: EuDML – Towards the European Digital Mathematics Library. In: Sojka [5], pp. 11–24, http://dml.cz/dmlcz/702569

We will also be grateful for a helping hand in recruiting new members, promoting the association, and editing the thematic web pages of the association's new website, which is being prepared, as well as for constructive suggestions for new activities or comments on the current running of the association.

Masaryk University, Faculty of Informatics, Botanická 68a, 602 00 Brno

A Scientific Journal Processing System with the Capability of Exporting to a Digital Library

Petr Sojka, Michal Růžička

The processing of scientific, and especially mathematical, journals is based on TeX and related technologies. Publishers usually also make articles available and publish them electronically in parallel. They create a version optimized for web distribution, a version for archiving, and possibly also a version optimized for on-screen reading. This article describes a designed and implemented workflow for processing several mathematical journals that deposit their production in the Czech Digital Mathematics Library DML-CZ and subsequently in the European Digital Mathematics Library EuDML.

Publishing Scientific Journals

In the academic world, the well-known "publish or perish" is heard from many sides. The number of articles published each year grows exponentially, and the technical sciences and mathematics are no exception. It is necessary to publish faster and to provide high-quality metadata for search engines, since most users reach articles through search portals. A large percentage of scientific publications and journals, especially mathematical ones, is typeset with TeX. In India, Lithuania, and elsewhere in the world there are companies with hundreds of employees that typeset and process these publications for publishers such as Elsevier or Springer. Publishers archive their production in systems such as Portico (footnote 1) for catastrophic scenarios, and they also make it available to subscribers and to Google in their digital libraries, such as SpringerLink (footnote 2) or the ACM Digital Library.

1 2

More than half of all readers find their scientific documents through Google or Google Scholar, so it is essential that full texts are known and provided to Google's crawlers even when they sit behind a subscription moving wall (a principle whereby older articles are freely available, while the several most recent volumes are accessible only from the Internet address ranges of subscribers). More than ten scientific mathematical journals are published in the Czech Republic and Slovakia, and all of their editorial offices use TeX for typesetting. In 2005, a grant was obtained for the project of the Czech Digital Mathematics Library DML-CZ () [1], which would make published articles available to the scientific community and expose them for indexing by Google (Scholar). The project was successful, and since 2010 it has been followed by the project of the European Digital Mathematics Library EuDML (), through which more than 300,000 pages of scientific mathematical texts are further disseminated to readers. The goal of the DML-CZ project was to process articles by digitization (the retro-digital period), to process articles for which primary data already existed but were often incomplete (the retro-born-digital period), and to take over, as automatically as possible, the data of newly published issues from journal editorial offices (so-called born-digital data, created electronically).

The DML-CZ Born-Digital System

Within the DML-CZ project, a method was designed for processing future journal issues and adding them to the project repository. Since mathematical journals are typically typeset with TeX, the born-digital systems of the DML-CZ project are also largely built on TeX technologies. The main idea was to obtain born-digital data for DML-CZ as a by-product of the editorial offices' work when publishing new issues of their journals. Two paths were possible here: to create a new comprehensive editorial system that not only generates the DML-CZ metadata but also automates as many as possible of the activities carried out when preparing a new issue of a mathematical journal, or to make only minimal changes to the established editorial workflow and, with the lowest possible demands on the editorial office, extend it with the generation of outputs for the DML-CZ project.

The Comprehensive Born-Digital System

The pilot project for the transition to the born-digital mode was the Brno mathematical journal Archivum Mathematicum, published at the Faculty of Science of Masaryk University. Here we took the first path and prepared a brand-new comprehensive system. We were inspired by the system [2] used in the French project CEDRAM [3].

In this system, the editor brings all articles into the required form using markup that does not differ much from standard LaTeX markup, and places them, together with any external files (figures etc.), into the corresponding directory structure. For each issue, the editor then prepares a simple TeX document, the control file, which defines the order of the individual articles, the volume, number, and year of publication of the given journal issue, and so on. At this point the editorial system is able to automatically
• compile all articles of the issue and typeset them into a single resulting document in PDF and PostScript formats,
• generate the issue cover with the table of contents and the correct volume, number, etc. in PDF and PostScript formats,
• generate mirrored print masters with crop marks in PDF and PostScript formats,
• generate the electronic form of the journal for publication on the web,
• generate forms for the review process,
• generate (and on request automatically send) metadata for the DML-CZ project,
• generate other auxiliary outputs.

The workflow of the DML-CZ born-digital editorial system is shown in Figure 1. The whole processing is controlled directly from the document class using the TeX command \write18 (footnote 3), so to start the whole automated processing the user only needs to compile the control file of the issue with TeX. The individual articles are compiled completely separately (the system takes care of setting the correct number of the first page of each article, etc.), so there is no risk of conflicts between macros used in individual articles. The resulting journal issue is created by joining the individual output PDF documents.

The core of the system, which generates the DML-CZ metadata in XML format, is built on the open-source tool Tralics [4, 5] (its use is described in more detail in a later section of this article, on page 8). Tralics processes only files specially prepared for it: LaTeX documents with minimal additional markup, which are created during the compilation of the articles thanks to a special definition of some of the macros used in the articles. These minimal documents contain the original LaTeX source text of the items that are to become part of the metadata. To simplify the configuration of Tralics, the metadata are not produced directly in the final DML-CZ markup. The XML document produced by Tralics is converted to its final form using XSLT. The DML-CZ metadata have two parts: a description of the article itself (title, authors, language of the article, abstract, etc.) and a list of the literature cited in the article. For the use

3 This command allows the user to execute ordinary operating system commands directly from the TeX source text. For security reasons, its use must be explicitly enabled by the user by adding -shell-escape or a similar argument on the command line when TeX is started.

form of the reference lists in all articles, and at the same time high-quality, structurally marked-up data for generating the DML-CZ metadata are prepared automatically.

The Minimalistic Born-Digital System

Some editorial offices have long used their own proven, established way of working. Although they wanted to join the DML-CZ project as born-digital contributors, instead of introducing a new integrated system they preferred the second way of generating DML-CZ source data: a minimal extension of the existing workflow. Here we again used a proven component of the integrated system, the Tralics tool; an example of its use is the subject of the following section. As with the comprehensive born-digital system, the markup in use had to be slightly modified to structurally mark the individual data items for the DML-CZ metadata; for typesetting reference lists, however, BibTeX is no longer used, and special structural markup was introduced instead. Whichever path an editorial office chose, it is very advantageous for the DML-CZ project that the data are created directly from the original source texts during the preparation of a new issue. This guarantees that articles appear in the digital library in the same form in which they were printed.

Tralics

Tralics () is a converter from LaTeX to XML. It is an open-source tool of French origin distributed under the CeCILL licence (footnote 4). Tralics is multi-platform; besides the source code, prebuilt binary versions of the tool are available for the GNU/Linux, Apple Mac OS X, and Microsoft Windows operating systems (footnote 5). Unlike some other tools, e.g. TeX4ht [6], Tralics is primarily intended for converting documents into generic, structurally marked-up XML, not for converting TeX documents into a visually similar form in another format (e.g. HTML). It also works differently. TeX4ht, for example, compiles the input document directly with TeX into DVI. Only a special set of macros is loaded, which adds \special{...} commands with notes for TeX4ht to the output file. These notes are used when post-processing the resulting DVI file to create the HTML form of the document [7]. Tralics, by contrast, itself translates the source text of the LaTeX document directly. In doing so, it is able to expand even complex user-defined macros and the like. In addition, it can process BibTeX bibliographic databases. If the reference list is included in the document in this way, Tralics processes it in a single step together with the document itself.

The ability to process LaTeX source text directly and to perform TeX macro expansion is a very strong argument for using Tralics, because it gives very good results and allows detailed configuration of the conversion. The very variety of ways in which even simple constructions can be written in TeX is the reason why it would be difficult to obtain the data (even the small subset needed for DML-CZ) directly from the source texts in another way, e.g. with a Perl script. Using Tralics is much more convenient and the results are better.

4 The CeCILL licence () is similar to and compatible with the GNU GPL, but takes into account the specifics of the French legal system.
5 To run on MS Windows, libraries from the Cygwin project () are used.

Tralics Configuration

Configuring Tralics is very natural for a TeX user. For the macros of the original document, their transformation into XML is defined in standard TeX notation. For this purpose Tralics provides a set of new TeX commands with which the form of the XML output can be defined. Macro definitions are stored in configuration files that mirror the structure of their standard LaTeX counterparts: a LaTeX document class / package (.cls/.sty) corresponds to a Tralics .clt/.plt configuration file of the same name. Definitions that are not tied to any document class / macro package are of course also possible. The Tralics source package also includes a set of configuration files for the basic LaTeX document classes and macro packages.

An Example of Using Tralics

Although precompiled executables are available on the Tralics home page () (particularly useful on MS Windows, where a C++ compiler is not available by default; see footnote 6), it is always advisable to also download the archive with the source code of the given program version. The archive also contains (in the confdir/ directory) the set of prepared configuration files mentioned above and a set of tests (run by the Test/alltests script) for verifying that Tralics works.

Compilation from source on a Unix operating system is simple and requires only a C++ compiler. After unpacking the archive, it is enough to change to the src/ directory and start the compilation with the make command. The result is the executable file src/tralics. The compiled program can be tested right in this location by changing to the Test directory and running the alltests script. The script uses src/tralics to translate the TeX files in the Test directory and compares the XML outputs with the reference files in the Modele directory. If the tests pass without errors, Tralics is ready for use: we will need the compiled tralics program and the directory with the standard configuration, confdir/, both of which can be moved anywhere in the system as we see fit.

    ~/tralics2.13.6$ mkdir ~/tralics
    ~/tralics2.13.6$ mv confdir/ src/tralics ~/tralics/
    ~/tralics2.13.6$ cd ~/tralics/

6 On Windows, running the program also requires the dynamic library cygwin1.dll. It is not necessary to install the whole Cygwin environment () to run Tralics; it suffices to place the required library in the same directory as the compiled Tralics. The library itself is part of the cygwin-.tar.bz2 package, which can be downloaded from one of the mirrors of the project's FTP server (), in the Czech Republic e.g. at .

Now we can try converting a simple LaTeX document to XML.

    ~/tralics$ vim helloworld.
    ...
    ~/tralics$ cat helloworld.tex
    \documentclass{article}
    \def\hello{\uppercase{h}ello}
    \begin{document}
    \hello{} world!

    Příliš žluťoučký kůň úpěl ďábelské ódy.
    \end{document}

The document can be translated with the command:

    ~/tralics$ ./tralics -confdir=confdir/ -utf8 -utf8output helloworld.tex

The -confdir parameter specifies the location of the directory with the configuration files directly when the program is started, so its location in the system does not matter. Arguments selecting the input and output encoding (-utf8, -utf8output) are also used. Tralics supports only a limited set of encodings, but UTF-8 is among them. The simplest approach is therefore to convert the Tralics input documents to this encoding where necessary. If the original encoding of a document is known, this can easily be automated; moreover, UTF-8 can encode any Unicode character, so no input encoding of the original document will pose a problem. The result of the translation is the file helloworld.xml:

Hello world!

Příliš žluťoučký kůň úpěl ďábelské ódy.

This is thus indeed a structurally marked-up XML transcription of the original document. It can also be seen that Tralics correctly processed the user-defined macro \hello. The document type declaration refers to the DTD classes.dtd, which, however, is not part of the standard configuration. We would have to define the grammar of our documents ourselves. With an appropriate setting (e.g. in a configuration file), the referenced DTD can easily be changed to another one. Commands can also be defined there independently of a particular document class / macro package.

    ~/tralics$ vim hwconfig.tcf
    ...
    ~/tralics$ cat hwconfig.tcf
    DocType = hello world.dtd

    BeginAlias
    std report book article minimal
    End

    BeginCommands
    \def\world{world}
    End

    ~/tralics$ vim helloworld.tex
    ...
    ~/tralics$ cat helloworld.tex
    \documentclass{article}
    \def\hello{\uppercase{h}ello}
    \begin{document}
    \hello{} \world!

    Příliš žluťoučký kůň úpěl ďábelské ódy.
    \end{document}

    ~/tralics$ ./tralics -confdir=confdir/ -configfile=./hwconfig.tcf \
    > -utf8 -utf8output helloworld.tex
    ...
    Output written on helloworld.xml (234 bytes).
    No error found.
    (For more information, see transcript file helloworld.log)

    ~/tralics$ cat helloworld.xml


Hello world!

Příliš žluťoučký kůň úpěl ďábelské ódy.

Tralics offers several new macros and environments (footnote 7) for defining the structure of the XML output.

    ~/tralics$ vim helloworld.tex
    ...
    ~/tralics$ cat helloworld.tex
    \documentclass{article}
    \def\hello{\uppercase{h}ello}
    \begin{document}
    \hello{} \world!

    \begin{xmlelement}{pokus0}
    \begin{xmlelement}{pokus1}
    Obsah prvního testovacího elementu.
    \end{xmlelement}
    \begin{xmlelement}{pokus2}
    Obsah druhého testovacího elementu.
    \end{xmlelement}
    \AddAttToLast{attposlední}{hodnotaposlední}
    \AddAttToCurrent{attsoučasný}{hodnotasoučasný}
    \end{xmlelement}

    \xbox{xboxelement}{Obsah \emph{xbox} elementu.}
    \xmlelt{xmleltelement}{Obsah \emph{xmlelt} elementu.}

    Příliš žluťoučký kůň úpěl ďábelské ódy.
    \end{document}

    ~/tralics$ ./tralics -confdir=confdir/ -configfile=./hwconfig.tcf \
    > -utf8 -utf8output helloworld.tex
    ...

    ~/tralics$ cat helloworld.xml

Hello world!

7 Documentation is available on the project home page ().

Obsah prvního testovacího elementu. Obsah druhého testovacího elementu.

Obsah xbox elementu.

Obsah xmlelt elementu.

Příliš žluťoučký kůň úpěl ďábelské ódy.

By default, Tralics converts mathematical expressions into the XML language MathML.

    ~/tralics$ vim matematika.tex
    ...
    ~/tralics$ cat matematika.tex
    \documentclass{article}
    \begin{document}
    Pythagorovu větu vyjadřuje tato rovnice:
    \[a^2 + b^2 = c^2\]
    \end{document}

    ~/tralics$ ./tralics -confdir=confdir/ -utf8 -utf8output \
    > -entnames=false matematika.tex
    ...

    ~/tralics$ cat matematika.xml

Pythagorovu větu vyjadřuje tato rovnice:

a 2 + b 2 = c 2

With the -nomathml argument, however, the use of MathML can be suppressed. Mathematical expressions are then inserted into the document in the form of the LaTeX source text of the expression.

    ~/tralics$ ./tralics -confdir=confdir/ -utf8 -utf8output \
    > -entnames=false -nomathml matematika.tex
    ...

    ~/tralics$ cat matematika.xml

Pythagorovu větu vyjadřuje tato rovnice:

a^2 + b^2 = c^2

If we insert a reference list into the document using the standard LaTeX thebibliography environment without more detailed structural markup of its content, the result will naturally not carry any deeper information about its structure either.

    ~/tralics$ vim matematika.tex
    ...
    ~/tralics$ cat matematika.tex
    \documentclass{article}
    \begin{document}
    Pythagorovu větu vyjadřuje tato rovnice:
    \[a^2 + b^2 = c^2\]

    Citovaná literatura: \cite{texbook,latex,chicago}.

    \begin{thebibliography}{9}
    \bibitem{texbook} Donald~E. Knuth. \textit{The \TeX book}.
    Addison-Wesley, 1984.
    \bibitem{latex} Leslie Lamport. \textit{\LaTeX : A~Document
    Preparation System}. Addison-Wesley, 1986.
    \bibitem{chicago} \textit{The Chicago Manual of Style},
    pages 400--401. University of Chicago Press, thirteenth edition, 1982.
    \end{thebibliography}
    \end{document}

    ~/tralics$ ./tralics -confdir=confdir/ -utf8 -utf8output \
    > -entnames=false -nomathml matematika.tex
    ...

    ~/tralics$ cat matematika.xml

Pythagorovu větu vyjadřuje tato rovnice:

a^2 + b^2 = c^2

Citovaná literatura: , , .

Donald E. Knuth. The TeXbook. Addison-Wesley, 1984.

Leslie Lamport. LaTeX: A Document Preparation System. Addison-Wesley, 1986.

The Chicago Manual of Style, pages 400–401. University of Chicago Press, thirteenth edition, 1982.

Tralics can, however, also work with the bibliographic databases of the BibTeX program. These already contain information about the structure of each record, and this information is preserved in the output XML document. Unlike compilation with standard LaTeX, Tralics itself takes care of processing the bibliographic database. There is therefore no need to call external programs or to run the compilation several times.

    ~/tralics$ vim matematika.tex
    ...
    ~/tralics$ cat matematika.tex
    \documentclass{article}
    \begin{document}
    Pythagorovu větu vyjadřuje tato rovnice:
    \[a^2 + b^2 = c^2\]

    Citovaná literatura: \cite{texbook,latex,chicago}.

    \bibliographystyle{abbrv}
    \bibliography{databazeliteratury}
    \end{document}

    ~/tralics$ vim databazeliteratury.bib
    ...
    ~/tralics$ cat databazeliteratury.bib
    @INBOOK{chicago,
      title = "The Chicago Manual of Style",
      publisher = "University of Chicago Press",
      edition = "Thirteenth",
      year = 1982,
      pages = "400--401",
      key = "Chicago"
    }
    @BOOK{texbook,
      author = "Donald E. Knuth",
      title= "The {{\TeX}book}",
      publisher = "Addison-Wesley",
      year = 1984
    }
    @BOOK{latex,
      author = "Leslie Lamport",
      title = "{\LaTeX}: {A} Document Preparation System",
      publisher = "Addison-Wesley",
      year = 1986
    }

    ~/tralics$ ./tralics -confdir=confdir/ -utf8 -utf8output \
    > -entnames=false -nomathml matematika.tex
    ~/tralics$ cat matematika.xml

Pythagorovu větu vyjadřuje tato rovnice:

a^2 + b^2 = c^2

Citovaná literatura: , , .

The Chicago Manual of Style Thirteenth University of Chicago Press 1982 400–401

The TeXbook Addison-Wesley 1984
LaTeX: A Document Preparation System Addison-Wesley 1986

Tralics in the Minimalistic Born-Digital System

As mentioned above, Tralics is also the core of the minimalistic born-digital systems, whose goal is to interfere as little as possible with the working habits of the editorial offices. Since Tralics is a translator from LaTeX to XML, it was also desirable to avoid a dependence on LaTeX as the format used for preparing articles, because different editorial offices may use different incarnations of TeX. In the minimalistic implementations, therefore, the original editorial macros that work with the items forming part of the DML-CZ metadata (titles and authors of articles, abstracts, lists of keywords, reference lists, etc.) were modified. The modification was made so that it has no effect on the typeset output, but the macro arguments are written in their original form to an auxiliary file during compilation. (In addition, they are wrapped in Tralics macros that identify the meaning of each item.) Some new macros also write their parameters there. These macros were created for the purposes of generating the DML-CZ metadata even though they are not strictly necessary for typesetting the journal (and have no visible effect in the typeset output); they specify, for example, the language or the type of the article. Where editorial offices do not use BibTeX for typesetting reference lists, the largest extension of the set of macros occurred precisely here. The new macros are intended for marking the individual items of each record, because for the needs of DML-CZ the citation lists, too, must be structurally marked up as well as possible.


If macros defined directly by the author of an article are used in metadata items, their definitions will not be contained in the auxiliary file processed by Tralics. Tralics will therefore not know how to handle these macros. For this purpose a special environment was defined in which the definitions of the user macros used in an article can be written for Tralics. During compilation, the content of this environment is saved to the file \jobname.dml.ult, which Tralics loads when converting the auxiliary file.

Conclusion

The system described in this article has proven its usability in practice: it is actively used within the DML-CZ project (from which data are also taken over into the project of the European Digital Mathematics Library EuDML). The extent of its use is as follows:
• the journal Archivum Mathematicum published by Masaryk University (using the comprehensive system since 2009),
• the proceedings of the DML conference (Towards a Digital Mathematics Library) published by Masaryk University (using an adapted minimalistic system since 2009),
• the journal Communications in Mathematics (published until 2011 under the name Acta Mathematica et Informatica Universitatis Ostraviensis) published by the University of Ostrava (using an adapted minimalistic system since 2010),
• the journal Acta Universitatis Palackianae Olomucensis, Facultas Rerum Naturalium, Mathematica published by Palacký University Olomouc (using an adapted minimalistic system since 2010),
• the journal Kybernetika published by the Institute of Information Theory and Automation of the Academy of Sciences of the Czech Republic (using an adapted minimalistic system since 2010).

A similar system was also used internally within the DML-CZ project when preparing the metadata of publications from the retro-born-digital period, namely for preparing the metadata of
• eight proceedings of the NAFSA conference (Nonlinear Analysis, Function Spaces and Applications),
• the journal Acta Universitatis Palackianae Olomucensis, Facultas Rerum Naturalium, Mathematica since 2003,
• the journal Acta Mathematica et Informatica Universitatis Ostraviensis since 2005,
• the journal Applications of Mathematics (published by the Institute of Mathematics of the Academy of Sciences of the Czech Republic) since 1994,
• the journal Czechoslovak Mathematical Journal (published by the Institute of Mathematics of the Academy of Sciences of the Czech Republic) since 1992,
• the journal Kybernetika since 1998,
• the journal Mathematica Bohemica (published by the Institute of Mathematics of the Academy of Sciences of the Czech Republic) since 1992,
• the journal Pokroky matematiky, fyziky a astronomie (published by the Union of Czech Mathematicians and Physicists) since 1993.

We hope that the described use of TeX technologies will also find application among future contributors to the DML-CZ project, and that the described systems will thus help in its further development.

References

[1] Miroslav Bartošek, Martin Lhoták, Jiří Rákosník, Petr Sojka, and Martin Šárfy. DML-CZ: The Objectives and the First Steps. In J. Borwein, E. M. Rocha, and J. F. Rodrigues, editors, CMDE 2006: Communicating Mathematics in the Digital Era, pages 69–79. A. K. Peters, MA, USA, 2008.
[2] Thierry Bouche. A pdfLaTeX-based automated journal production system. TUGboat, 27(1):45–50, 2006.
[3] Centre de diffusion de revues académiques mathématiques. [online]. [cit. 2010-01-13]. URL: .
[4] Apics Team. Tralics: a LaTeX to XML translator. [online], October 2009. [cit. 2009-11-14]. URL: .
[5] Thierry Bouche. CEDRICS: When CEDRAM Meets Tralics. In Petr Sojka, editor, Towards Digital Mathematics Library, Proceedings of the DML 2008 workshop, pages 153–165.
[6] Eitan M. Gurari. TeX4ht: LaTeX and TeX for Hypertext. [online], June 2008. [cit. 2010-02-18]. URL: .
[7] CV Radhakrishnan. HCode : a web notebook extrapolating TeX4ht. [online], September 2009. [cit. 2010-02-18]. URL: .

Summary: A Scientific Journal Processing System with the Capability of Exporting to a Digital Library

The production workflow for publishing scientific, and especially mathematical, journals is based on TeX and related technologies. Publishers usually also prepare papers in electronic form and make them available in a digital library, optimized for digital delivery and possibly for on-screen reading as well. This paper describes a designed and implemented production workflow for several mathematical journals that archive their production in the Czech Digital Mathematics Library DML-CZ, from which it is subsequently available in the European Digital Mathematics Library EuDML.

Masaryk University, Faculty of Informatics, Botanická 68a, 602 00 Brno

Normalization of Digital Mathematics Library Content MathML Canonicalization

David Formánek, Martin Líška, Michal Růžička, and Petr Sojka

Masaryk University, Faculty of Informatics Botanická 68a, 602 00 Brno, Czech Republic [email protected], [email protected], [email protected], [email protected]

Abstract. This paper discusses the need for data normalization in a Digital Mathematics Library (DML). Specifically, emphasis is given to canonicalizing formulae encoded in Presentation MathML notation, which is becoming available in several DMLs and is used by DML applications. Canonicalization is a prerequisite for advanced processing, namely math-enabled full-text searching, semantic filtering, and automated classification. Different sources of MathML and their specifics are described. Several use cases of possible formula canonicalization transformations are listed and discussed in detail. Finally, the findings are summarized and the design of a to-be-developed canonicalization tool is outlined.

Keywords: MathML normalization, canonicalization, digital mathematics libraries, DML, presentation MathML

1 Motivation

Modern Digital Mathematics Libraries (DML) such as EuDML [18,5] base their services on paper semantics, i.e. full-text handling, including mathematical formulae, as well as basic metadata and Mathematics Subject Classification (MSC) codes. Mathematical literature is widely dispersed across a large number of publishers, making it very difficult to collect full texts from these heterogeneous sources. This situation is very different from other libraries, such as PubMed Central for biomedical and life sciences, where publishers have an agreed workflow using the NLM Journal Publishing Tag Set and tools developed with funding from the National Institutes of Health.

Full paper texts have to be 'homogenized', i.e. converted to some uniform representation, in order for math-aware full-text searches [15] and paper similarity computations [11,12] to work properly. These tasks are usually handled using a bag-of-words representation of the document text (the vector space model): every term (word, lemma) has its own dimension, and the number of occurrences of a term gives its value. Non-textual terms such as mathematical formulae are mostly not taken into account. This creates another challenge for DMLs, as mathematical formulae are the essence of mathematical publications. There is an average of 380 mathematical formulae per arXiv paper in the MREC database [8]. It has been reported [21] that even a single histogram of mathematical symbols is sufficient for domain classification of a paper in the mathematical domain. To reliably represent a paper for DML processing, including handling the mathematics, it is necessary to

1. select a canonical representation of the non-textual structural entities appearing in full texts (mathematical symbols, formulae, and equations); and
2. decide on equivalence classes for these entities (e.g., which formulae should be considered equal for given DML tasks such as search, similarity computation, formulae editing, and conversion of math into Braille).
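The two requirements above can be illustrated with a minimal Python sketch. It is not the tool discussed in this paper: `canonical_key` is a hypothetical stand-in that only drops attributes and layout whitespace, and the example formulae are written without the MathML namespace for brevity. Formulae that share a key fall into the same equivalence class.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def canonical_key(mathml: str) -> str:
    """Hypothetical canonical form: drop attributes and layout whitespace."""
    root = ET.fromstring(mathml)
    for node in root.iter():
        node.attrib.clear()
        node.text = (node.text or "").strip()
        node.tail = None
    return ET.tostring(root, encoding="unicode")


def equivalence_classes(formulae):
    """Group formula names whose canonical keys coincide."""
    classes = defaultdict(list)
    for name, mathml in formulae.items():
        classes[canonical_key(mathml)].append(name)
    return sorted(classes.values())


# Illustrative inputs: "a" and "b" differ only in attributes and whitespace.
formulae = {
    "a": "<math><mi>x</mi><mo>+</mo><mi>y</mi></math>",
    "b": "<math display='block'>\n  <mi> x </mi> <mo>+</mo> <mi>y</mi>\n</math>",
    "c": "<math><mi>x</mi><mo>-</mo><mi>y</mi></math>",
}
groups = equivalence_classes(formulae)
```

Which differences a key should ignore depends on the DML task; a search index and a Braille converter may well need different keys.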

In this paper, we discuss the options for selecting the canonical representations of formulae to be used in DML tools, and the canonicalization process, i.e. the process of computing this canonical representation from a variety of different sources and formats.

Our primary motivation is the natural requirement for our own (Web)MIaS system, which currently uses Presentation MathML [14], to operate correctly and offer the expected search behaviour to users regardless of the MathML input source. When a user posts a query to the system, the system must abstract it from the underlying notational differences in order to behave correctly. This requirement becomes more pressing with the growing number of different sources of MathML. Currently there are three sources (LaTeXML, Tralics, and user input), and the number is expected to increase. If they are not correctly normalized, the system misbehaves and appears to users as if it simply does not work, however good the underlying design is.

We have used the UMCL library [1,2] for canonicalization in our MIaS system so far. However, we have found that the deficiencies of the software are so severe (change of formulae semantics, slowness, ...) [7, chapter 5], and the need for canonicalization so important, that we have decided to design and implement a new canonicalization tool from scratch.

This paper is structured as follows: in Section 2, different sources of mathematics are described and their differences are discussed. The core part of this paper is Section 3, where several use cases of possible canonical representation and canonicalization are documented and suggested. We conclude with Section 5, and present a plan for future work.

2 MathML Sources

To store mathematical formulae in our documents we have chosen MathML¹, an XML-based language, as a widely used, formally defined, but still evolving standard. The widespread use of MathML and its XML base mean that this language is supported by various tools in the whole document workflow. More importantly, MathML can be used as a common language among the advanced mathematical software packages that are extensively used by working mathematicians.

¹ More precisely, Presentation MathML, as there are currently significantly more real-life resources using this form of MathML than Content MathML.

Normalization of Digital Mathematics Library Content

On the author end of the document workflow, the MathML code can be ‘hand made’ using simple plain text editors such as MS Windows Notepad, or something more comfortable, such as the specialized XML editors that are usually part of various integrated development environments. For example, the formula x^2 + y^2 can be written as shown in Listing 1.

Listing 1: Example of the ‘hand made’ formula x^2 + y^2

However, the XML nature of MathML makes the coding of more complex formulae rather long for manual construction. Software tools are a more frequent source of MathML. MathML can be generated as an output or data exchange format of complex specialized programs, such as Maple, Matlab, and Mathematica [9,20,22], or of web services, such as the well-known Wolfram Alpha [23], that are extensively used by mathematicians to support their work.

generate::MathML(x^2 + y^2, Content = FALSE, Annotation = FALSE)

Listing 2: Example of MathML export of the formula x^2 + y^2 by the Matlab 7.9.0 MuPAD symbolic engine

Listing 3: Example of the MathML export of the Wolfram Alpha input query ‘x^2 + y^2’

On the consumer end of the document workflow, MathML can be used as an input for mathematical programs and services (Maple, Matlab, Mathematica, Wolfram Alpha, etc.) or simply displayed, usually as part of an XHTML web page, in a web browser with MathML support.

However, a large number of mathematical documents are produced using the TeX typesetting system and authored in TeX markup. Thus, it is necessary to be able to convert the TeX source code of mathematical formulae to the MathML language.

Our main motivation is the WebMIaS system. For more complex input formulae, it would be uncomfortable for the user to construct queries manually in MathML, as the code would be very complicated. The well-known LaTeX syntax is far more appropriate for manual input. Therefore, we need a conversion from LaTeX to MathML as part of the WebMIaS input routine.

There are several tools that are able to convert TeX markup to the MathML language. For example, arXMLiv [16] employs LaTeXML [19]. The EuDML project and our WebMIaS [8] system internally use Tralics [6].

Listing 4: Example of LaTeXML generated MathML of formula x^2 + y^2

Listing 5: Example of Tralics generated MathML of formula x^2 + y^2

A frequent type of document in a DML is the older paper that is unavailable in any digital format, or is available only in an ‘end’ format such as PDF that is suitable for reading and printing but is not appropriate for direct MathML processing. These documents can be a significant part of the DML content collection, so they are worth further processing.

Documents available in hard copy only can be scanned and processed using the InftyReader [17] optical character recognition (OCR) software. InftyReader has a unique feature for detecting mathematical formulae in a scanned document. These formulae can be subsequently saved as MathML.

Listing 6: Example of InftyReader generated MathML from a PDF document containing only the formula x^2 + y^2 in its body

Born-digital PDF documents with no available source code can be processed using the MaxTract software [3,4], which is under intensive development as part of the EuDML project. MaxTract generates a LaTeX source or an XHTML+MathML representation of the document based on an optical analysis of the positions of characters on the page. The analysis is supported by information from the fonts embedded in the processed document.

Listing 7: Example of XHTML + MathML generated by the development version of MaxTract from a PDF document containing only the formula x^2 + y^2 in its body

During the MathDex project, it became clear that the most time- and resource-consuming task in building a math search engine and database is the normalization and conversion of heterogeneous sources [10]. As shown in Listings 1–6, MathML can vary slightly depending on how the code was obtained, even for a trivial formula like x^2 + y^2. In a DML project, there can be differences in the final MathML encoding even for semantically and structurally similar formulae, because the MathML originates from different sources. In Section 3, we discuss several more complicated examples of possible ambiguities in MathML that have to be normalized to allow math searches and similarity computation.
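A minimal Python sketch of how such differences could be measured. The two encodings below are hypothetical illustrations, not actual output of any of the tools named above, and the MathML namespace is omitted for brevity. The function `tag_histogram` is likewise an illustrative helper: it counts element names so that two encodings of the same formula can be compared structurally.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def tag_histogram(mathml: str) -> Counter:
    """Count how often each element name occurs in a MathML fragment."""
    return Counter(node.tag for node in ET.fromstring(mathml).iter())


# Two hypothetical encodings of x^2 + y^2 (illustrative only).
compact = (
    "<math><msup><mi>x</mi><mn>2</mn></msup><mo>+</mo>"
    "<msup><mi>y</mi><mn>2</mn></msup></math>"
)
verbose = (
    "<math display='block'><mrow>"
    "<msup><mi>x</mi><mn>2</mn></msup><mo>+</mo>"
    "<msup><mi>y</mi><mn>2</mn></msup>"
    "</mrow></math>"
)

# Elements present in the verbose encoding but not in the compact one.
difference = tag_histogram(verbose) - tag_histogram(compact)
```

Here the two strings differ although they describe the same formula; the histogram difference isolates the extra grouping element, while the `display` attribute is a second, attribute-level difference that a canonicalizer would also have to handle.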

3 Use Cases

Using our public working demo of the WebMIaS system, we discovered several discrepancies in the form of MathML generated by the real-time TeX to MathML converter we currently use (Tralics) and by the MathML canonicalizer from the UMCL library. We employed the UMCL canonicalization module to try to normalize the users’ MathML input and the MathML produced by the LaTeXML converter contained in the arXMLiv collection. Then we went through the Presentation MathML specification and gathered a list of possible reformatting rules we could perform.

The goal is to reduce the possible MathML scripts with the same semantics and mathematical structure to just one representation. Such a canonicalized representation is convenient for many applications, as was described in Sections 1 and 2.

Analyzing the possible inconsistencies and ambiguities of MathML encoded formulae raised design and strategy questions. Conceptual decisions for handling different types of similar constructions and completely different formulae need to be made. More specifically, should we try to keep the MathML compact and reduce the number of nodes in transformations, or should we try to add nodes for better disambiguation? Another question is: should our future canonicalization tool produce MathML that is valid according to its schema? Unquestionably, this feature would be nice to have for many reasons and possible applications, but it certainly adds more requirements and takes much more effort to design and implement, covering not only true/false validation but also validation of functional correctness.

Below we describe and discuss proposals for transformations that can be performed with relatively minor difficulty. The list is not complete and is subject to further evaluation.

3.1 Removing Elements and Attributes

Many of the MathML elements used in Presentation MathML make little or no contribution to the semantics of the formula, and therefore also to indexing and searching of formulae. These are usually elements that alter the appearance of formulae in some way: space-like elements such as mspace, mpadded, mphantom, maligngroup, and malignmark. They may occasionally have some semantic meaning, but we prefer to canonicalize similar formulae into one representation rather than risk treating the same formulae as different. Therefore, these elements are best omitted. The content of the mtext element should be indexed as normal text before removal.

Most element attributes are similarly undesirable. Many are used for formatting, affecting only the appearance of rendered formulae (for example, the attributes linebreak and indentalign of the mo element). Others might have some slight semantic significance, but are very uncommon and usually not very important; we think these attributes should be removed. However, several exceptions exist. For instance, the element mfrac is used for fractions, but its meaning changes when the attribute linethickness is set to 0, which expresses a binomial coefficient. The attributes of the element mfenced are also important (see Listing 9). The attribute mathvariant can also influence formula semantics and therefore should be preserved in all possible elements. For example, the MIaS system makes use of this attribute so that hits carrying the matching mathvariant font attribute are ranked as more relevant.

Listing 8: Example of omission

Listing 9: Example of omission of unnecessary attributes in mfrac
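The removal rules of this subsection can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the set `KEPT_ATTRIBUTES` is one reading of the exceptions named in the text, the function name `strip_presentation` is hypothetical, the input is written without the MathML namespace, and space-like elements are dropped together with their whole subtree (for mpadded one might prefer to keep the wrapped content).

```python
import xml.etree.ElementTree as ET

# Space-like elements named in the text; mtext would be indexed separately.
SPACE_LIKE = {"mspace", "mpadded", "mphantom", "maligngroup", "malignmark"}
# Attributes the text singles out as semantically relevant (assumed list).
KEPT_ATTRIBUTES = {"linethickness", "open", "close", "separators", "mathvariant"}


def strip_presentation(node: ET.Element) -> None:
    """Recursively drop space-like children and non-essential attributes."""
    for name in list(node.attrib):
        if name not in KEPT_ATTRIBUTES:
            del node.attrib[name]
    for child in list(node):
        if child.tag in SPACE_LIKE:
            node.remove(child)  # simplification: removes the whole subtree
        else:
            strip_presentation(child)


root = ET.fromstring(
    "<math><mi>x</mi><mspace width='1em'/><mo>+</mo>"
    "<mfrac bevelled='true' linethickness='0'><mi>a</mi><mi>b</mi></mfrac></math>"
)
strip_presentation(root)
result = ET.tostring(root, encoding="unicode")
```

In this example the mspace element and the bevelled attribute disappear, while linethickness="0" survives because it changes the meaning of mfrac.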

3.2 Unifying Fences

There are two approaches to creating fenced formulae. One is more semantic and uses the mfenced element with the open, close, and separators attributes to describe delimiters and separators. The other places fence symbols directly within mo elements, and the fenced formula is enclosed in an mrow element to group the elements together. Although the first approach seems to be valid, we prefer the second one as it is more universal and allows easier conversion; e.g., converting an addition to mfenced with the attribute separators set to + would be invalid. As shown in Listing 10, mfenced elements are replaced by a more general mrow element, and fence and separator symbols are added as mo elements. Fenced elements are further enclosed in an mrow element so that they can be treated as a single expression when needed. We could also consider unifying the symbols used as separators and delimiters.

Listing 10: Two ways of writing interval [x, y)
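A minimal Python sketch of the fence unification described above, assuming MathML without a namespace and a simplified reading of the mfenced defaults (parentheses as fences, a comma as separator, the last separator repeated when too few are given). The function name `unfence` is hypothetical.

```python
import xml.etree.ElementTree as ET


def unfence(node: ET.Element) -> None:
    """Rewrite every mfenced in place as mrow with explicit mo fences."""
    for child in node:
        unfence(child)
    if node.tag != "mfenced":
        return
    opening = node.get("open", "(")
    closing = node.get("close", ")")
    # Whitespace inside the separators attribute is ignored.
    separators = "".join(node.get("separators", ",").split())
    arguments = list(node)

    # Inner mrow: the arguments with separator operators between them.
    inner = ET.Element("mrow")
    for index, argument in enumerate(arguments):
        if index > 0 and separators:
            operator = ET.SubElement(inner, "mo")
            operator.text = separators[min(index - 1, len(separators) - 1)]
        inner.append(argument)

    # Outer mrow: opening fence, fenced content, closing fence.
    for argument in arguments:
        node.remove(argument)
    node.tag = "mrow"
    node.attrib.clear()
    ET.SubElement(node, "mo").text = opening
    node.append(inner)
    ET.SubElement(node, "mo").text = closing


root = ET.fromstring('<math><mfenced open="[" close=")"><mi>x</mi><mi>y</mi></mfenced></math>')
unfence(root)
result = ET.tostring(root, encoding="unicode")
```

For the interval [x, y) this yields an outer mrow holding the two fence operators and an inner mrow holding x, the comma, and y, so the fenced content can still be addressed as a single expression.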

3.3 Mrow Minimizing

The mrow element is used for grouping other elements. Its most common use is to obtain the correct number of child elements of some parent element (e.g. mfrac needs two child elements). We can detect unnecessary occurrences of mrow by comparing the number of its child elements and siblings with the number of child elements required by the parent element. Parents requiring only one child element actually accept any number of elements, which are treated as if they were enclosed in a single inferred mrow element. Hence, the grouping element is redundant and can be removed. In any case, the impact of the transformations on any form of processing of the canonicalized notation must be taken into account, and the structure of the formulae must not be violated. For instance, after removing the enclosing mfenced element we ought to wrap the fenced formula in an mrow if it is not already wrapped.

Listing 11: Example of removal after optimization (formula √−1)
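The unwrapping rule can be sketched as follows in Python. The set `INFERRED_MROW` lists parents that, to our understanding of the MathML specification, treat their children as one inferred mrow; the list, the helper names, and the namespace-free input are assumptions of this sketch. An mrow is only unwrapped when it is the sole child of such a parent and carries no attributes, so the arity of elements such as mfrac is never disturbed.

```python
import xml.etree.ElementTree as ET

# Parents whose children are treated as a single inferred mrow (assumed list).
INFERRED_MROW = {"math", "msqrt", "mstyle", "merror", "mpadded",
                 "mphantom", "menclose", "mtd"}


def minimize_mrows(node: ET.Element) -> None:
    """Unwrap an mrow that is the only child of an inferred-mrow parent."""
    for child in node:
        minimize_mrows(child)
    while (node.tag in INFERRED_MROW and len(node) == 1
           and node[0].tag == "mrow" and not node[0].attrib):
        wrapper = node[0]
        node.remove(wrapper)
        node.extend(list(wrapper))  # splice the grandchildren into the parent


def minimized(mathml: str) -> str:
    """Convenience wrapper: parse, minimize, serialize."""
    root = ET.fromstring(mathml)
    minimize_mrows(root)
    return ET.tostring(root, encoding="unicode")


square_root = "<math><msqrt><mrow><mo>-</mo><mn>1</mn></mrow></msqrt></math>"
fraction = ("<math><mfrac><mrow><mi>a</mi><mo>+</mo><mi>b</mi></mrow>"
            "<mn>2</mn></mfrac></math>")
```

Calling `minimized(square_root)` removes the redundant mrow inside msqrt, whereas `minimized(fraction)` leaves the numerator's mrow in place because mfrac requires exactly two children.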

3.4 Sub-/Superscripts Handling

The msubsup element, used for attaching a subscript and a superscript to another element at the same time, is redundant: the same thing can be expressed as a combination of msub and msup elements. The order of the elements is important. When both elements are used, we prefer to place msub within msup (see Listing 12) because a subscript is usually more closely related to the base expression. A similar problem and solution apply to the triad of elements munder, mover, and munderover. Both msubsup and munderover can be used for limits of integration or bounds of summations; therefore, we should use only one canonical representation.

Listing 12: Two ways of expressing x_1^2
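A minimal Python sketch of this rewriting, assuming namespace-free MathML. The mapping for munderover (munder placed within mover) is chosen here by analogy with the msub-within-msup preference stated above and is an assumption of the sketch, as is the function name `split_scripts`.

```python
import xml.etree.ElementTree as ET

# Combined script elements and the (inner, outer) pair that replaces them.
SPLIT = {"msubsup": ("msub", "msup"), "munderover": ("munder", "mover")}


def split_scripts(node: ET.Element) -> None:
    """Rewrite msubsup/munderover in place as two nested script elements."""
    for child in node:
        split_scripts(child)
    if node.tag not in SPLIT or len(node) != 3:
        return
    inner_tag, outer_tag = SPLIT[node.tag]
    base, lower, upper = list(node)
    for child in (base, lower, upper):
        node.remove(child)
    inner = ET.SubElement(node, inner_tag)  # base with its lower script
    inner.extend([base, lower])
    node.append(upper)                      # upper script attaches outside
    node.tag = outer_tag


root = ET.fromstring("<math><msubsup><mi>x</mi><mn>1</mn><mn>2</mn></msubsup></math>")
split_scripts(root)
result = ET.tostring(root, encoding="unicode")
```

The subscript stays attached to the base inside msub, and the superscript is applied to that whole group by the enclosing msup.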

3.5 Applying Functions

There are many ways to express functions. The entity &ApplyFunction; (function application) should be used, but we cannot rely on that, so we suggest removing this operator for the purpose of unification. The opposite approach, adding the function application operator where it was omitted, could be rather tricky and could lead to ambiguities. The name of the function should occur in the mi element, but it can also be considered an operator and be placed in the mo element. The arguments of a function can be fenced with parentheses, an mfenced element, or both. We chose a canonical representation without the entity, with mrow and parentheses (see Listing 14). Other ambiguities can be caused by different invisible operators. For example, two identifiers in a subscript with no operator between them usually mean multiplication, but they can mean separation too.
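The chosen canonical form can be sketched in Python as follows. Several points are assumptions of this sketch rather than part of the proposal: the function-application operator is recognized by its Unicode character U+2061, the set `KNOWN_FUNCTIONS` is a purely illustrative list of function names, parentheses are only added inside math and mrow containers, and the input is namespace-free.

```python
import xml.etree.ElementTree as ET

APPLY_FUNCTION = "\u2061"  # invisible character behind &ApplyFunction;
KNOWN_FUNCTIONS = {"sin", "cos", "tan", "log", "exp"}  # illustrative list only
SEQUENCES = {"math", "mrow"}  # containers where a name precedes its argument


def is_parenthesized(node: ET.Element) -> bool:
    """True for mfenced, an opening '(' operator, or an mrow starting with one."""
    if node.tag == "mfenced":
        return True
    first = node[0] if node.tag == "mrow" and len(node) else node
    return first.tag == "mo" and (first.text or "").strip() == "("


def unify_functions(node: ET.Element) -> None:
    """Drop function-application operators and parenthesize bare arguments."""
    for child in list(node):
        if child.tag == "mo" and (child.text or "").strip() == APPLY_FUNCTION:
            node.remove(child)
        else:
            unify_functions(child)
    if node.tag not in SEQUENCES:
        return
    children = list(node)
    for position in range(len(children) - 1):
        name, argument = children[position], children[position + 1]
        is_function = (name.tag in {"mi", "mo"}
                       and (name.text or "").strip() in KNOWN_FUNCTIONS)
        if is_function and not is_parenthesized(argument):
            wrapper = ET.Element("mrow")
            ET.SubElement(wrapper, "mo").text = "("
            wrapper.append(argument)
            ET.SubElement(wrapper, "mo").text = ")"
            node[position + 1] = wrapper


root = ET.fromstring("<math><mi>sin</mi><mo>&#x2061;</mo><mi>x</mi></math>")
unify_functions(root)
result = ET.tostring(root, encoding="unicode")
```

Relying on a fixed list of function names is the weak point of this sketch; it sidesteps, rather than solves, the ambiguity of invisible operators mentioned above.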

4 Design Considerations

The design and implementation decisions for the canonicalization application depend on the purpose of the new canonicalizer. Even though the use of the math content by different tools might be similar, experience shows that one size can hardly fit all applications. Thus the main design imperatives are modularity, simplicity, extensibility, and flexibility, so that the canonicalizer can be easily modified when the needs of the applications change. With different data, the canonicalizer might change even for different types of math-aware search.

Listing 13: Using or not using the operator for function application

Listing 14: Adding parentheses to sine function argument

The examples in the subsections of the previous section form a set of modules that perform the necessary MathML tree transformations as recursive procedures on MathML trees. Given the expected size of the input data set, the efficiency and speed of the canonicalization application are also critical parameters: in our MREC [8] corpora there are 168,000,000 formulae to canonicalize. Thus, the use of standard XSL transformations does not seem to be appropriate, as the UMCL example showed. Another key decision is the handling of invalid input MathML and the question of valid MathML on the output, as mentioned in Section 3. As the (Web)MIaS system as well as other core parts of the EuDML system (Lucene) use the Java platform, it seems natural to use Java also for the implementation of the canonicalization application.
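The modular design can be sketched as a pipeline of interchangeable tree transformations. The sketch below uses Python only for brevity (the text above proposes Java); the two modules, the `canonicalize` function, and the name `search_profile` are hypothetical, and the blanket attribute removal is a deliberately crude example module that a real configuration would refine (for instance to keep mathvariant).

```python
import xml.etree.ElementTree as ET
from typing import Callable, Iterable

# A module is any in-place transformation of a MathML tree.
Module = Callable[[ET.Element], None]


def drop_attributes(node: ET.Element) -> None:
    """Example module: remove every attribute in the subtree."""
    for descendant in node.iter():
        descendant.attrib.clear()


def drop_mspace(node: ET.Element) -> None:
    """Example module: remove mspace elements recursively."""
    for child in list(node):
        if child.tag == "mspace":
            node.remove(child)
        else:
            drop_mspace(child)


def canonicalize(mathml: str, modules: Iterable[Module]) -> str:
    """Apply the configured modules in order and serialize the result."""
    root = ET.fromstring(mathml)
    for module in modules:
        module(root)
    return ET.tostring(root, encoding="unicode")


source = "<math display='block'><mi>x</mi><mspace width='1em'/><mi>y</mi></math>"
search_profile = [drop_attributes, drop_mspace]  # one possible configuration
canonical = canonicalize(source, search_profile)
```

Because each application passes its own list of modules, a search index and a similarity tool can share the same modules while applying different subsets of them.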

5 Conclusions and Future Work

We consider MathML canonicalization important for the proper functioning of several math-aware applications that handle documents in DMLs. We have defined the problems and enumerated the most important use cases as modules of a newly designed canonicalizer.

We are currently finishing the design and implementation of a first version of the application, which will be used for the task of math indexing in the MIaS system employed in the EuDML project. By evaluating this task we will verify our design decisions, and we plan to use the application for other tools working with math full-text data (semantic similarity tools such as gensim [12]).

Acknowledgements. This work was partially supported by the European Union through its Competitiveness and Innovation Programme (Information and Communication Technologies Policy Support Programme, ‘Open access to scientific information’, Grant Agreement No. 250503, a project of the European Digital Mathematics Library, EuDML).

References

1. Archambault, D., Berger, F., Moço, V.: Overview of the “Universal Maths Conversion Library”. In: Pruski, A., Knops, H. (eds.) Assistive Technology: From Virtuality to Reality: Proceedings of 8th European Conference for the Advancement of Assistive Technology in Europe AAATE 2005, Lille, France. pp. 256–260. IOS Press, Amsterdam, The Netherlands (Sep 2005)
2. Archambault, D., Moço, V.: Canonical MathML to Simplify Conversion of MathML to Braille Mathematical Notations. In: Miesenberger, K., Klaus, J., Zagler, W., Karshmer, A. (eds.) Computers Helping People with Special Needs, Lecture Notes in Computer Science, vol. 4061, pp. 1191–1198. Springer Berlin / Heidelberg (2006), http://dx.doi.org/10.1007/11788713_172
3. Baker, J.B., Sexton, A.P., Sorge, V.: A linear grammar approach to mathematical formula recognition from PDF. In: Proceedings of the Conferences in Intelligent Computer Mathematics, CICM 2009. LNAI, vol. 5625, pp. 201–216. Springer (2009)
4. Baker, J.B., Sexton, A.P., Sorge, V.: Towards reverse engineering of PDF documents. In: Sojka, P., Bouche, T. (eds.) Towards a Digital Mathematics Library, DML 2011. pp. 65–75. Masaryk University Press, Bertinoro, Italy (July 2011), http://hdl.handle.net/10338.dmlcz/702603
5. Borbinha, J., Bouche, T., Nowiński, A., Sojka, P.: Project EuDML – A First Year Demonstration. In: Davenport, J.H., Farmer, W.M., Urban, J., Rabe, F. (eds.) Intelligent Computer Mathematics. Proceedings of 18th Symposium, Calculemus 2011, and 10th International Conference, MKM 2011. Lecture Notes in Artificial Intelligence, LNAI, vol. 6824, pp. 281–284. Springer-Verlag, Berlin, Germany (Jul 2011), http://dx.doi.org/10.1007/978-3-642-22673-1_21
6. Grimm, J.: Producing MathML with Tralics. In: Sojka [13], pp. 105–117, http://dml.cz/dmlcz/702579
7. Jarmar, M.: Conversion of Mathematical Documents into Braille. Master’s thesis, Faculty of Informatics (Jan 2012), https://is.muni.cz/th/172981/fi_m/?lang=en
8. Líška, M., Sojka, P., Růžička, M., Mravec, P.: Web Interface and Collection for Mathematical Retrieval. In: Sojka, P., Bouche, T. (eds.) Proceedings of DML 2011. pp. 77–84. Masaryk University, Bertinoro, Italy (Jul 2011), http://www.fi.muni.cz/~sojka/dml-2011-program.html
9. Maplesoft, a division of Waterloo Maple Inc.: MathML – Maple Help (Apr 2012), http://www.maplesoft.com/support/help/Maple/view.aspx?path=MathML
10. Munavalli, R., Miner, R.: MathFind: A Math-Aware Search Engine. In: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval. pp. 735–735. SIGIR ’06, ACM, New York, NY, USA (2006), http://doi.acm.org/10.1145/1148170.1148348
11. Řehůřek, R., Sojka, P.: Automated Classification and Categorization of Mathematical Knowledge. In: Autexier, S., Campbell, J., Rubio, J., Sorge, V., Suzuki, M., Wiedijk, F. (eds.) Intelligent Computer Mathematics – Proceedings of 7th International Conference on Mathematical Knowledge Management MKM 2008. Lecture Notes in Computer Science LNCS/LNAI, vol. 5144, pp. 543–557. Springer-Verlag, Berlin, Heidelberg (Jul 2008)
12. Řehůřek, R., Sojka, P.: Software Framework for Topic Modelling with Large Corpora. In: Proceedings of LREC 2010 workshop New Challenges for NLP Frameworks. pp. 45–50. ELRA, Valletta, Malta (May 2010), http://is.muni.cz/publication/884893/en, software available at http://nlp.fi.muni.cz/projekty/gensim
13. Sojka, P. (ed.): Towards a Digital Mathematics Library. Masaryk University, Paris, France (Jul 2010), http://www.fi.muni.cz/~sojka/dml-2010-program.html
14. Sojka, P., Líška, M.: Indexing and Searching Mathematics in Digital Libraries (Mar 2011), submitted to MKM 2011
15. Sojka, P., Líška, M.: Indexing and Searching Mathematics in Digital Libraries – Architecture, Design and Scalability Issues. In: Davenport, J.H., Farmer, W.M., Urban, J., Rabe, F. (eds.) Intelligent Computer Mathematics. Proceedings of 18th Symposium, Calculemus 2011, and 10th International Conference, MKM 2011. Lecture Notes in Artificial Intelligence, LNAI, vol. 6824, pp. 228–243. Springer-Verlag, Berlin, Germany (Jul 2011), http://dx.doi.org/10.1007/978-3-642-22673-1_16
16. Stamerjohanns, H., Kohlhase, M., Ginev, D., David, C., Miller, B.: Transforming Large Collections of Scientific Publications to XML. Mathematics in Computer Science 3, 299–307 (2010), http://dx.doi.org/10.1007/s11786-010-0024-7
17. Suzuki, M., Tamari, F., Fukuda, R., Uchida, S., Kanahori, T.: INFTY – An integrated OCR system for mathematical documents. In: Vanoirbeek, C., Roisin, C., Munson, E. (eds.) Proceedings of ACM Symposium on Document Engineering 2003. pp. 95–104. ACM, Grenoble, France (2003)
18. Sylwestrzak, W., Borbinha, J., Bouche, T., Nowiński, A., Sojka, P.: EuDML – Towards the European Digital Mathematics Library. In: Sojka [13], pp. 11–24, http://dml.cz/dmlcz/702569
19. The LaTeXML project: The LaTeXML Developer Portal (Apr 2012), https://trac.mathweb.org/LaTeXML/
20. The MathWorks, Inc.: MuPAD – Matlab (May 2012), http://www.mathworks.com/discovery/mupad.html
21. Watt, S.M.: Mathematical Document Classification via Symbol Frequency Analysis. In: Sojka, P. (ed.) Towards Digital Mathematics Library – Proceedings of DML 2008. pp. 29–40. Masaryk University, Birmingham, UK (Jul 2008), http://www.fi.muni.cz/~sojka/dml-2008-program.xhtml
22. Wolfram: Mathematica Import/Export Format: MathML (Apr 2012), http://reference.wolfram.com/mathematica/ref/format/MathML.html
23. Wolfram Alpha LLC: Wolfram Alpha (Apr 2012), http://www.wolframalpha.com/