Augmenting Mathematical Formulae for More Effective Querying & Efficient Presentation

Augmenting Mathematical Formulae for More Effective Querying & Efficient Presentation vorgelegt von Diplom-Physiker Moritz Schubotz geb. in Offenbach am Main von der FakultätIV { Elektrotechnik und Informatik der Technischen UniversitätBerlin zur Erlangung des akademischen Grades Doktor der Naturwissenschaften { Dr. rer. nat. { genehmigte Dissertation Promotionsausschuss: Vorsitzender: Prof. Dr. Odej Kao Gutachter: Prof. Dr. Volker Markl Gutachter: Prof. Abdou Youssef, PhD Gutachter: Prof. James Pitman, PhD Tag der wissenschaftlichen Aussprache: 31. März2017 Berlin 2017 ii Abstract Mathematical Information Retrieval (MIR) is a research area that focuses on the Information Need (IN) of the Science, Technology, Engineering and Mathematics (STEM) domain. Unlike traditional Information Retrieval (IR) research, that extracts information from textual data sources, MIR takes mathematical formulae into account as well. This thesis makes three main contributions: 1. It analyses the strengths and weaknesses of current MIR systems and establishes a new MIR task for future evaluations; 2. Based on the analysis, it augments mathematical notation as a foundation for future MIR systems to better fit the IN from the STEM domain; and 3. It presents a solution on how large web publishers can efficiently present mathematics to satisfy the INs of each individual visitor. With regard to evaluation of MIR systems, it analyses the first international MIR task and proposes the Math Wikipedia Task (WMC). In contrast to other tasks, which evaluate the overall performance of MIR systems based on an IN, that is described by a combination of textual keywords and formulae, WMC was designed to gain insights about the math-specific aspects of MIR systems. In addition to that, this thesis investigates how different factors of similarity measures for mathematical expressions influence the effectiveness of MIR results. Based on the aforementioned evaluations, this thesis proposes to rethink the funda- mentals of MIR systems. MIR systems should elevate the internal representation of mathematics and use a more semantic rather than syntactic representation for the retrieval algorithms. This approach simplifies MIR research by defining three or- thogonal MIR research challenges: (1) Augmentation; (2) Querying; and (3) Efficient Execution. As augmentation target, this thesis proposes the concept of context-free formulae visualized by the idea of Formula Home Page (FHP). By visiting a FHP, a mathematically literate person can fully understand the formula semantics without iii context or additional resources. As a first step towards unsupervised formula augmentation, this thesis introduces Mathematical Language Processing (MLP). MLP extracts knowledge about individual formulae from their surrounding text. To achieve that, it borrows concepts from Natural Language Processing (NLP) and adapts them to the specifics of mathematical language. To finally satisfy the user's mathematical IN, formulae (i.e., data representing mathematical semantics) need to be presented to the user. Given the large variety of users and information systems, delivering math in a robust, scalable, fast and accessible way, was an open research problem. This thesis investigates different approaches to solve this problem and demonstrates the feasibility of a service-oriented multi-format approach which was implemented and is known as the Mathoid math rendering service. This implementation improves the math rendering for all Wikimedia sites including Wikipedia in production. iv Kurzfassung Die digitale Revolution hat die Informationsbeschaffung grundlegend verändert.Das Internet ist zum ersten Anlaufpunkt zur Befriedigung des täglichen Informationsbe- darfs avanciert - sowohl im privaten, als auch im professionellen Leben. Dies gilt auch fürdie Disziplinen Mathematik, Ingenieurswissenschaften, Natur und Tech- nik (MINT). Der hohe Anteil mathematischer Ausdrücke, die in MINT-Fächerin in- tegraler Bestandteil der Schriftsprache sind, stellt eine besondere Herausforderung fürSysteme wie Suchmaschinen und Literaturempfehlungsdienste dar. Mit dieser Thematik beschäftigt sich das Forschungsgebiet Mathematical Information Retrieval (MIR). Einige Probleme, wie beispielsweise die Disambiguierung, könnendurch Adap- tion korrespondierender Methoden aus der Computerlinguistik gelöstwerden. Viele Aspekte erfordern jedoch auch vollständigneue Lösungen. Die Anwendungsszenarien fürbessere Verarbeitungs- und Analyseverfahren von Tex- ten mit einem hohen Anteil mathematischer Notation sind vielfältigund reichen von der Literaturrecherche wissenschaftlicher Texte, über die Vermeidung von Plagiaten bis zur Verbesserung von Lernsoftware fürdie MINT-Fächer in Schulen und Univer- sitäten. Die vorliegende Dissertation leistet die folgenden Beiträgezur MIR-Forschung: 1. Analyse der Stärken und Schwächen bereits bestehender MIR-Systeme und En- twicklung eines standardisierten Evaluationssystems zur Quantifizierung der Ef- fektivitätvon MIR- Systemen. 2. Erforschung von Verfahren zur automatischen, semantischen Anreicherung mathematischer Ausdrücke. 3. Entwicklung eines Lösungsvorschlags fürdie effiziente und skalierbare Darstel- lung mathematischer Inhalte. Basierend auf der Analyse bereits bestehender MIR-Systeme, wird in dieser Arbeit eine Dreiteilung der MIR-Forschung vorgeschlagen: (1) Augmentierung; (2) Anfra- gengenerierung und (3) Effiziente Ausführung. v Es wird ein Evaluationsverfahren zur Quantifizierung der Effektivitätvon MIR-Sys- temen entwickelt, bestehend aus einem auf Wikipedia basierenden Testkorpus, einer Aufgabenliste und einem vollautomatischen Auswertungssystem der Messergebnisse. Im Gegensatz zu herkömmlichen Evaluationsverfahren, bei denen die Aufgaben aus Schlagwortenlisten bestehen, verwendet das hier vorgestellte Verfahren Formelmuster. Das Evaluationsverfahren war Teil des ersten offiziellen, internationalen Wettbewerbs fürMIR-Systeme. Darüber hinaus wurde das Evaluationsverfahren auch außerhalb des Wettbewerbs zur Evaluation von MIR-Systemen verwendet und von anderen Wis- senschaftlern weiterentwickelt. In einem Prozess, der als "Mathematical Language Processing\ (MLP) bezeichnet wird, werden mathematische Bezeichner durch Informationen aus dem umgebenden Text semantisch angereichert. In einem zweiten Schritt wird nicht nur der umgebende Text eines Bezeichners betrachtet, sondern die Gesamtheit der Texte aus ähnlichen Themengebieten analysiert, um die Bedeutungen einzelner Bezeichner zu identifizieren und die Effektivitätder semantischen Anreicherung weiter zu verbessern. In einem weiteren Schritt wird die Darstellung und Verarbeitung von mathematischen Formeln in Wikipedia grundlegend verbessert. Dazu werden die mathematischen Ausdrücke, die bis zu diesem Zeitpunkt in Bilddateien dargestellt wurden, in HTML5- Code umgewandelt. Dies ermöglicht eine schnellere und skalierbare Verarbeitung der mathematischen Inhalte in Wikipedia. Seit Mai 2016 wird dieses Verfahren weltweit auf allen Wikipediaseiten mit mathematischen Ausdrücken verwendet. vi Acknowledgements This thesis would not have been possible without the collaboration and support of nu- merous individuals and institutions. I am especially grateful to my doctoral advisor, Professor Volker Markl. Moreover, I gratefully acknowledge Professor Abdou Youssef for collaboration and advise in different projects and Professor James Pitman forhis support for this thesis and the research field as a whole. I wish to thank Dr. Howard Cohl for the open exchange of ideas and our collaboration on the National Institute of Standards and Technology (NIST) Digital Repository of Mathematical Formulae (DRMF) project. Furthermore, I wish to thank Marcus Leich, Dr. Howard Cohl and Norman Meuschke for their valuable feedback and proofreading of the manuscript. I also wish to thank Professor Akiko Aizawa, Professor Michael Kohlhase and Professor Bela Gipp for fruitful discussions at different stages of my thesis. I also thank collab- orating researchers for their input including Alan Sexton, Dr. Bruce Miller, Deyan Ginev, Juan Soto, Dr. Peter Krautzberger, Frédéric Wang and Professor Volker Sorge. Furthermore, I wish to thank my students Alexey Grigorev, Robert Pagel, David Veenhuis, AndréGreiner-Petter, Malte Schwarzer, Julian Hilbigson, Duc Linh Tran, Tobias Uhlich and Thanh Phuong Luu. I especially acknowledge all the people, that I collaborated with during code devel- opment, program design, code review, testing and infrastructure setup, especially Gabriel Wicke, Dr. Marko Obrovac, Terry Chay, Matthew Flaschen, Goran Topic, Bryan Davis, Dr. Scott Ananin, Daniel Kinzler, Derk-Jan Hartmann, Ed Sanders, Andrew Otto, Andrew Bogott, Erik Moeller, James Forrester, Lydia Pintscher, Quim Gil, Raimond Spekking, Alexandros Kosiaris, FrédéricWang, Dr. Bruce Miller and Deyan Ginev. I also gratefully acknowledge the NIST in Gaithersburg for inviting me as a foreign guest researcher in 2014, 2015 and 2016 as well as the National Institute for Infor- matics in Tokyo for my research stay in 2014. Furthermore, I wish to thank the Wikimedia Foundation for their travel grants for my participation in the Wikimedia Hackathon 2016 and 2017 as well as their financial support during my research visit in 2013. I thank ACM and SIGIR first for their conference travel grants. Furthermore, I thank my colleagues at Technische UniversitätBerlin, my family and friends for their support during my work on this thesis. vii viii Contents

Augmenting Mathematical Formulae for More Effective Querying & Efficient Presentation

Tractatus Logico-Philosophicus</Em>

First Order Logic

EA-1092; Environmental Assessment and FONSI for Decontamination and Dismantlement June 1995

Every Nonsingular C Flow on a Closed Manifold of Dimension Greater Than

TUGBOAT Volume 39, Number 2 / 2018 TUG 2018 Conference

Semi-Automated Correction Tools for Mathematics-Based Exercises in MOOC Environments

Frege and the Logic of Sense and Reference

Toward a Logic of Everyday Reasoning

Transforming the Arχiv to XML

1 Installation Texlive 2019

Real and Imaginary Parts of Decidability-Making Gilbert Giacomoni

Formalizing Common Sense Reasoning for Scalable Inconsistency-Robust Information Integration Using Direct Logictm Reasoning and the Actor Model