
Variable Typing: Assigning Meaning to Variables in Mathematical Text


Yiannos A. Stathopoulos♠  Simon Baker♠♣  Marek Rei♠♦  Simone Teufel♠
♠Computer Laboratory, University of Cambridge, United Kingdom
♣Language Technology Lab, University of Cambridge, United Kingdom
♦The ALTA Institute, University of Cambridge, United Kingdom
{yiannos.stathopoulos,simon.baker,marek.rei,simone.teufel}@cl.cam.ac.uk

Abstract

Information about the meaning of mathematical variables in text is useful in NLP/IR tasks such as symbol disambiguation, topic modeling and mathematical information retrieval (MIR). We introduce variable typing, the task of assigning one mathematical type (multi-word technical terms referring to mathematical concepts) to each variable in a sentence of mathematical text. As part of this work, we also introduce a new annotated data set composed of 33,524 data points extracted from scientific documents published on arXiv. Our intrinsic evaluation demonstrates that our data set is sufficient to successfully train and evaluate current classifiers from three different model architectures. The best performing model is evaluated on an extrinsic task: MIR, by producing a typed formula index. Our results show that the best performing MIR models make use of our typed index, compared to a formula index only containing raw symbols, thereby demonstrating the usefulness of variable typing.

1 Introduction

Scientific documents, such as those from Physics and Computer Science, rely on mathematics to communicate ideas and results. Written mathematics, unlike general text, follows strong domain-specific conventions governing how content is presented. According to Ganesalingam (2008), the sense of mathematical text is conveyed through the interaction of two contexts: the textual context (flowing text) and the mathematical (or symbolic) context (mathematical formulae).

In this work, we introduce a new task that focuses on one particular interaction: the assignment of meaning to variables by surrounding text in the same sentence (data for the task is available at https://www.cst.cam.ac.uk/~yas23/). For example, in the sentence

    Let P be a parabolic subgroup of GL(n) with Levi decomposition P = MN, where N is the unipotent radical.

the variables P and N in the symbolic context are assigned the meanings "parabolic subgroup" and "unipotent radical" by the textual context surrounding them, respectively.

We will refer to the task of assigning one mathematical type to each variable in a sentence as variable typing. We use mathematical types (Stathopoulos and Teufel, 2016) as variable denotation labels. Types are multi-word phrases drawn from the technical terminology of the mathematical discourse that label mathematical objects (e.g., "set"), algebraic structures (e.g., "monoid") and instantiable notions (e.g., "cardinality of a set"). In the sentence presented earlier, the phrases "parabolic subgroup", "Levi decomposition" and "unipotent radical" are examples of types.

Typing variables may be beneficial to other natural language processing (NLP) tasks, such as topic modeling applied to documents that assign meaning to variables consistently (e.g., "E" is "energy" consistently in some branches of physics). In mathematical information retrieval (MIR), for instance, enriching formulae with types may improve precision. For example, the formulae x + y and a + b can be considered α-equivalent matches. However, if a and b are matrices while x and y are vectors, the match is likely to be a false positive. Typing information may be helpful in reducing such instances and improving retrieval precision.

Variable typing differs from similar tasks in three fundamental ways. First, meaning – in the form of mathematical types – is explicitly assigned to variables, rather than arbitrary mathematical expressions. Second, variable typing is carried out at the sentential level, with valid type assignments for variables drawn from the sentences in which

Proceedings of NAACL-HLT 2018, pages 303–312, New Orleans, Louisiana, June 1–6, 2018. © 2018 Association for Computational Linguistics

they occur, rather than from larger contexts, such as documents. Third, denotations are drawn from a pre-determined list of types, rather than from free-form text in the surrounding context of each variable.

As part of our work, we have constructed a new data set for variable typing that is suitable for machine learning (Section 4) and is distributed under the Open Data Commons license. We propose and evaluate three models for typing variables in mathematical documents based on current machine learning architectures (Section 5). Our intrinsic evaluation (Section 6) suggests that our models significantly outperform the state-of-the-art SVM model by Kristianto et al. (2012, 2014) (originally developed for description extraction) on our data set. More importantly, our intrinsic evaluation demonstrates that our data set is sufficient to successfully train and evaluate classifiers from three different architectures. We also demonstrate that our variable typing task and data are useful in MIR in our extrinsic evaluation (Section 7).

2 Related Work

The task of extracting semantics for variables from the linguistic context was first proposed by Grigore et al. (2009) with the intention of disambiguating symbols in mathematical expressions. Grigore et al. took operators listed in OpenMath content dictionaries (CDs) as concepts and used noun clusters to model their semantics. A bag of nouns is extracted from the operator description in the dictionary and enriched manually using terms taken from online lexical resources. The cluster that maximises the similarity (based on Pointwise Mutual Information (PMI) and DICE) between nouns in the cluster and the local context of a target formula is taken to represent its meaning.

Wolska et al. (2011) used the Cambridge dictionary of mathematics and the mathematics subject classification hierarchy to manually construct taxonomies used to assign meaning to simple expressions. Simple expressions are defined by the authors to be mathematical formulae taking the form of an identifier, which may have super/subscripted expressions of arbitrary complexity. Lexical features surrounding simple expressions are used to match the context of candidate expressions to suitable taxonomies using a combination of PMI and DICE (Wolska et al., 2011). Wolska et al. report a precision of 66%.

Quoc et al. (2010) used a rule-based approach to extract descriptions for formulae (phrases or sentences) from surrounding context. In a similar approach, Kristianto et al. (2012) applied pattern matching on sentence parse trees and a "nearest noun" approach to extract descriptions. These rule-based methods have been shown to perform well for recall but poorly for precision (Kristianto et al., 2012). However, Kristianto et al. (2012) note that domain-agnostic parsers are confused by mathematical expressions, making rule-based methods sensitive to parse tree errors. Both rule-based extraction methods were outperformed by Support Vector Machines (SVMs) (Kristianto et al., 2012, 2014).

Schubotz et al. (2016) use hierarchical named topic clusters, referred to as namespaces, to model the semantics of mathematical identifiers. Namespaces are derived from a document collection of 22,515 Wikipedia articles. A vector-space approach is used to cluster documents into namespaces using mini-batch K-means clustering. Clusters beyond a certain purity threshold are selected and converted into namespaces by extracting phrases that assign meaning to identifiers in the selected clusters. Schubotz et al. (2016) take a ranked approach at determining the phrase that best assigns meaning to a particular identifier. The authors report F1 scores of 23.9% and 56.6% for their definition extraction methods.

In contrast, we assign meaning exclusively to variables, using denotations from a pre-computed dictionary of mathematical types, rather than free-form text. Types as pre-identified, compositionally constructed denotational labels enable efficient determination of relatedness between mathematical concepts. In our extrinsic MIR experiment (Section 7), the mathematical concept that two or more types are derived from is identified by locating their common parent type – the supertype – on a suffix trie. Topically related types that do not share a common supertype can be identified using an automatically constructed type embedding space (Stathopoulos and Teufel (2016), Section 5.1), rather than manually curated namespaces or fuzzy term clusters.

3 The Variable Typing Task

We define the task of variable typing as follows. Given a sentence containing a pre-identified set of

304 variables V and types T , variable typing is the task 4 Variable Typing Data Set of classifying all edges V T as either existent × (positive) or non-existent (negative). We have constructed an annotated data set of sentences for building variable typing classifiers. However, not all elements of V T are valid × The sentences in our corpus are sourced from the edges. Invalid edges are usually instances of type Mathematical REtrieval Corpus (MREC) (L´ıskaˇ parameterisation, where some type is parame- et al., 2011), a subset of arXiv (over 439,000 terised by what appears to be a variable. For ex- papers) with all LAT X formulae converted to ample, the set of candidate edges for the sentence E MathML.

Train Dev Test Total We now consider the q-exterior of V and V ∗, cf. [21]. Sentences 5,273 841 1,689 7,803 Positive edges 1,995 457 1,049 3,501 Negative edges 15,164 4,386 10,473 30,023 Total edges 17,159 4,843 11,522 33,524 would include (V , exterior ) and (V ∗, exterior algebra) but not Table 1: Data set . (q, exterior algebra). Such edges are identified using pattern matching (Java regular expressions) The data set is split into a standard train- and are not presented to annotators or recorded in ing/development/test machine learning partition- the data set. ing scheme as outlined in Table1. The idea behind this scheme is to train and evaluate new models on Our definition of “variable” mirrors that of standardised data partitions so that results can be “simple ” proposed by Grigore et al. directly comparable. (2009): instances of formulae in the discourse are considered to be “typeable variables” if they are 4.1 Sentence Sampling only composed of a single, potentially scripted The structure and role of sentences in mathemat- base identifier. ical papers may vary according to their location Variable typing, as defined in this work, is based in the discourse. For example, sentences in the on four assumptions: (1) typings occur at the sen- “Introduction” – intended to introduce the sub- tential level and variables in a sentence can only be ject matter – can be expected to differ in structure assigned a type phrase occurring in that sentence, from those in a proof, which tend to be short, for- (2) variables and types in the sentence are known a mal statements. Our sampling strategy is designed priori, (3) edges in each sentence are independent to control for this diversity in sentence structure. of one another, and (4) edges in one sentence are First, we sentence-tokenised and transformed each independent of those in other sentences – given a document in the MREC into a graph that encodes variable v in sentence s, type assignment for v is its section structure. 
Document graphs also take agnostic of other typings involving v from other into account blocks of text unique to the math- sentences. ematical discourse such as theorems, proofs and The decision to constrain variable typing at the definitions. Then, we sampled sentences for our sentential level is motivated by empirical studies data set by distribution according to their location (Grigore et al., 2009;G odert¨ , 2012). Grigore et al. in the source arXiv document. (2009) have shown that the majority of variables Variables in each MREC document are identi- are introduced and declared in the same sentence. fied via a parser that recognises the variable de- In addition, mathematical text tends to be com- scription given in Section3. Our variable parser is posed of local contexts, such as theorems, lemmas designed to operate on symbol layout trees (SLTs) and proofs (Ganesalingam, 2008). (Schellenberg et al., 2012) – trees representing the The assumptions introduced above simplify the 2-dimensional presentation layout of mathemati- task of variable typing without sacrificing the gen- cal formulae. We identified 28.6 million sentences eralisability of the task. For example, cases where that contain variables. the same variable is assigned multiple conflicting The distribution of sentences according to (a) types from different sentences within a document the type of discourse/math block of origin and (b) can be collected and resolved using a type disam- the of unique types in the sentence is re- biguation algorithm. constructed by putting sentences into bins based

on the value of these features. Sentences are selected from the bins at random in proportion to their size. The training, development and test samples have been produced via repeated application of this sample-by-distribution strategy over the set of all sentences that contain variables.

4.2 Extended Type Dictionary

The type dictionary distributed by Stathopoulos and Teufel (2016) contains 10,601 automatically detected types from the MREC. However, the MREC contains 2.9 million distinct technical terms, many of which might also be types. Therefore, the seed dictionary is too small to be used with variable typing at scale, since types from the seed dictionary will be sparsely present in sampled sentences. To overcome this problem, we used the double suffix trie algorithm (DSTA) to automatically expand the type dictionary. The algorithm makes use of the fact that most types are compositional (Stathopoulos and Teufel, 2016): longer subtypes can be constructed out of shorter supertypes by attaching pre-modifiers (e.g., a "Riemannian manifold" can be considered a subtype of "manifold").

The DSTA takes two lists of technical terms as input – the seed dictionary of types and the MREC master list (2.9 million technical terms). First, technical terms on both lists are word-tokenised. Then, all technical terms in the seed dictionary (the known types) are placed onto the known types suffix trie (KTST). Additional types are generated from single word types on the KTST by expanding them with one of 40 prefixes observed in the corpus. For example, the type "algebra" might generate the supertype "coalgebra". These are also added on the KTST as known types.

Technical terms in the KTST are copied onto the candidate type suffix trie (CTST) and are labeled as types. Next, the technical terms on the master list are inserted into the CTST. Technical terms in the master list that have known types from the seed dictionary as their suffix on the CTST are also marked as types. A new dictionary of types (in the form of a list of technical terms) is produced by traversing the CTST and recording all phrases that have a known type as their suffix. This way, we have expanded the type dictionary from 10,601 types to approximately 1.23 million technical terms, from which an updated KTST can be produced.

4.3 Human Annotation and Agreement

Two of the authors jointly developed the annotation scheme and guidelines using sentences sampled by distribution as discussed in Section 4.1. Sentences sampled for this purpose are excluded from subsequent sampling. The labeling scheme, presented in Table 2, implements the assumptions of the variable typing task – each variable in a sentence is assigned exactly one label: either one type from the sentence or one of six fixed labels for special situations.

An annotation experiment was carried out using two authors as annotators to investigate (a) how intuitive the task of typing is to humans and (b) the reliability of the annotation scheme. For this purpose, a further 1,000 sentences were sampled (and removed) from the pool and organised into two subsamples, each with 554 sentences. The subsamples have an overlap of 108 sentences with a total of 182 edges, which are used to measure inter-annotator agreement.

We report annotator agreement for three separate cases. The first case reflects whether annotators agree that a variable can be typed or not by its context. A variable falls into the first category if it is assigned a type from the sentential context and in the latter category if it is assigned one of the six fixed labels from Table 2. In this case, agreement is substantial (Cohen's K = 0.80, N = 182, k = 2, n = 2). The second case is for instances where both annotators believe a variable can be typed by its sentential context – the variable is assigned a type by both annotators. In this case, Cohen's Kappa is not applicable because the number of labels varies: there are as many labels as there are types in the sentence. Instead, we report accuracy as the proportion of decisions where annotators agree over all decisions: 90.9%. In the last case, where both annotators agree that a variable is not a type (i.e., is assigned one of the six fixed labels), agreement has been found to be moderate (Fleiss' K = 0.61, N = 123, k = 2, n = 6).

The bulk of the annotation was carried out by one of the author-annotators and was produced by repeated sampling by distribution (as described in Section 4.1). Sentences in the bulk sample are combined with the 554 sentences annotated by the author during the annotation experiment to produce a final data set composed of 7,803 sentences. The training, test and development sets have been produced using the established 70% for training,

Label – Description
One label per type instance – One label per instance of any type in the sentence.
Type Unknown – The type of the variable is not in the scope of the sentence.
Type Present but Undetected – The type of the variable is in the scope of the sentence but is not in the dictionary.
Parameterisation – Variable is part of an instance of parameterisation.
Index – Variable is an instance of indexing (numeric or non-numeric).
Number – Variable is implied to be a number by the textual context (e.g., "the n-th element...").
Formula is not a variable – Label used to mark data errors. For example, in some instances end-of-proof symbols are encoded as identifiers in the corpus and are mistaken for variables.

Table 2: Labels for special typing situations.
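To make the relationship between the labels of Table 2 and the edge classification task concrete, the following sketch unrolls one annotated sentence into positive and negative V × T edges. The paper does not publish an implementation; the function and label names here are illustrative, and any variable assigned one of the six fixed labels simply contributes only negative edges.

```python
# Sketch: unrolling per-variable annotations into binary V x T edges.
# Each variable carries exactly one label: a type phrase from the sentence,
# or one of the six fixed labels of Table 2 (names below are illustrative).

FIXED_LABELS = {
    "TYPE_UNKNOWN", "TYPE_PRESENT_BUT_UNDETECTED", "PARAMETERISATION",
    "INDEX", "NUMBER", "NOT_A_VARIABLE",
}

def edges_for_sentence(variables, types, annotation):
    """annotation maps each variable to one type from `types`
    or to one of the six fixed labels above."""
    edges = []
    for v in variables:
        label = annotation[v]
        for t in types:
            positive = label not in FIXED_LABELS and label == t
            edges.append((v, t, positive))
    return edges

# Example sentence from the introduction: P and N are typed, and every
# other (variable, type) pair becomes a negative edge.
edges = edges_for_sentence(
    variables=["P", "N"],
    types=["parabolic subgroup", "unipotent radical", "Levi decomposition"],
    annotation={"P": "parabolic subgroup", "N": "unipotent radical"},
)
positives = [(v, t) for v, t, pos in edges if pos]
```

Under this reading, two variables and three types yield six candidate edges, two of which are positive, which mirrors the roughly 1:9 positive-to-negative ratio reported in Table 1.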

20% for test and 10% for development data set partitioning strategy. Each partition is sampled by distribution in order to model training and predicting typings over complete discourse units, such as documents.

5 Experiments

We compare three models for variable typing to two baselines: the "nearest type" baseline and the SVM proposed by Kristianto et al. (2014). One of our models is an extension of the latter baseline with both type and variable-centric features. The other two models are based on deep neural networks: a convolutional neural network and a bidirectional LSTM.

We treat the task of typing as binary classification: every possible typing in a sentence is presented to a classifier which, in turn, is expected to make a "type" or "not-type" decision. We say that an edge is positive if it connects a variable to a type in the sentence and negative otherwise.

5.1 Computing a Type Embedding Space

We use the extended dictionary of types (Section 4.2) to pre-train a type embedding space. Computed over the MREC, a type embedding space includes embeddings for both words and types (as atomic lexical tokens). These vectors are used by our deep neural networks to model the distributional meaning of words and types. The type embedding space is constructed using the process described by Stathopoulos and Teufel (2016): occurrences of extended dictionary type phrases in the MREC are substituted with unique atomic lexical units before the text is passed on to word2vec.

5.2 Models for Variable Typing

Nearest Type baseline (NT) Given a variable v, the nearest type baseline takes the edge that minimises the word distance between v and some type in the sentence to be the positive edge. This baseline is intended to approximate the "nearest noun" baseline (Kristianto et al., 2012, 2014), which we cannot directly compute due to the fact that noun phrases in the text become parts of types.

Support Vector Machine (Kristianto et al.) (SVM) This is an implementation of the features and linear SVM described by Kristianto et al. (2012). Furthermore, we use the same value for hyperparameter C (the soft margin cost parameter) used by Kristianto et al. (2012). Due to the class imbalance in our data set, we have used inversely proportional class weighting (as implemented in scikit-learn). L2-normalisation is also applied.

Extended Support Vector Machine (SVM+) We have extended the SVM proposed by Kristianto et al. (2012) with features that are type and variable-centric, such as the 'base symbol of a candidate variable' and 'first letter in the candidate type'. A description of these extended features is listed in Table 4. We applied automatic class weighting and L2-normalisation. We have found that C = 2 is optimal for this model by fine-tuning over the development set.

Convolutional Neural Network (Convnet) We use a Convnet to classify each of the V × T assignment edges as either positive or negative, where V are the variables in the input text and T are the types. Unlike the SVM models, we do not use any hand-crafted features, but only the inputs (Table 3) and the pre-trained embeddings (Section 5.1). The input is a tensor that encodes the inputs described in Table 3. We use the embeddings to represent the input tokens. In addition, we concatenate two dimensions to the input for each token: one dimension to denote (using 1 or 0) whether a given token is a type and another dimension to denote whether a token is a variable.

The model has a set of different sized filters, and each filter size has an associated number of filters to be applied (all are hyperparameters to

Name – Description
Token – A word in the sentence. If the token is a formula (including a variable), the token is '@@@'. Types are represented by the key of their embedding vector.
Token class – An integer: 0 for a normal word, 1 for a type, 2 for a variable and 3 to indicate that a variable token is part of the edge being considered.
Type of Interest – If the token is a type and it is part of the edge being considered, this field takes the value 'TYPE', or '-' otherwise.

Table 3: Input and features to neural network typing models.
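The inputs of Table 3 can be illustrated with a small sketch that encodes one sentence for a single candidate edge. This is not the authors' code: the helper names are hypothetical, and tokenisation plus the merging of type phrases into single tokens are assumed to have happened upstream.

```python
# Sketch: the three per-token inputs of Table 3 for one candidate edge (v, t).
# Formula tokens (including variables) are replaced by '@@@'; type phrases
# are assumed to be pre-merged into single tokens keyed by embedding entry.

def encode(tokens, kinds, edge):
    """tokens: sentence tokens (type phrases pre-merged);
    kinds: 'word' | 'type' | 'var' per token;
    edge: (variable_index, type_index) of the edge being considered."""
    v_i, t_i = edge
    token_in, class_in, interest_in = [], [], []
    for i, (tok, kind) in enumerate(zip(tokens, kinds)):
        token_in.append("@@@" if kind == "var" else tok)
        if kind == "word":
            class_in.append(0)          # normal word
        elif kind == "type":
            class_in.append(1)          # type token
        else:
            # 3 marks the variable that belongs to the edge under consideration
            class_in.append(3 if i == v_i else 2)
        interest_in.append("TYPE" if (kind == "type" and i == t_i) else "-")
    return token_in, class_in, interest_in

tokens = ["Let", "P", "be", "a", "parabolic_subgroup"]
kinds  = ["word", "var", "word", "word", "type"]
token_in, class_in, interest_in = encode(tokens, kinds, edge=(1, 4))
```

For the edge (P, parabolic subgroup) this produces the token sequence with P masked as '@@@', token classes [0, 3, 0, 0, 1], and 'TYPE' marked only on the candidate type.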

Orientation – Description
Type – Number of words in the candidate type.
Type – The base type of each candidate type.
Type and Variable – The first letter in the type and base symbol of the candidate variable.
Type – The grammatical number of the type as it appears in the sentence.
Variable – The variables and symbols in the candidate variable layout graph (one string per symbol).
Variable – The number of distinct symbols in the candidate variable layout graph.
Variable – The base symbol of the candidate variable layout graph.
Variable – The directions (Above, Below, Up-left, Up-right, Down-left, Down-right, Next) in which a candidate symbol has neighbouring symbols.
Variable – Operators in the mathematical context of the candidate variable layout graph.
Sentence – Prefix sequence: tokens from start of sentence to a (exclusive).
Sentence – Middle sequence: tokens between a and b (exclusive).
Sentence – Suffix sequence: tokens between b (exclusive) and end of sentence.

Table 4: SVM+ features. For each edge e, let a be the position of its left-most component (variable or type) and b the position of its right-most component (variable or type).

the model). The filters are applied to the input text (i.e. convolutions), then max-pooled, flattened and concatenated, and a dropout layer (p = 0.5) is applied before the result is fed into a multilayer perceptron (MLP), with the number of hidden layers and their hidden units as hyperparameters. Finally, a softmax layer is used to output a binary decision.

The model is implemented using the Keras library, with binary cross-entropy as the loss function and the ADAM optimizer (Kingma and Ba, 2014). We tune the aforementioned hyperparameters on the development data and use balanced oversampling with replacement in order to adjust for the class imbalance in the data. Our tuned hyperparameters are as follows: filter window sizes (2 to 12, then 14, 16, 18, 20) with an associated number of filters (300 for the first five, 200 for the next four, 100 for the next three, then 75, 70, 50). One hidden layer of the MLP with 512 units is used, with batch size 50.

Bidirectional LSTM (BiLSTM) The architecture takes as input a sequence of words, which are then mapped to word embeddings. For each token in the input sentence, we also include the inputs described in Table 3. In addition, the model uses one string feature we refer to as the "supertype". If the token is a type, then this feature is the string key of the embedding vector of its supertype, or "NONE" otherwise.

These features are mapped to a separate embedding space and then concatenated with the word embedding to form a single task-specific word representation. This allows us to capture useful information about each word, and also designate which words to focus on when processing the sentence.

We use a neural sequence labeling architecture based on the work of Lample et al. (2016) and Rei and Yannakoudakis (2016). The constructed word representations are given as input to a bidirectional LSTM (Hochreiter and Schmidhuber, 1997), and a context-specific representation of each word is created by concatenating the hidden representations from both directions. A hidden layer is added on top to combine the features from both directions. Finally, we use a softmax output layer that predicts a probability distribution over positive or negative assignment for a given edge.

We also make use of an extension of neural sequence labeling that combines character-based word representations with word embeddings using a predictive gating operation (Rei et al., 2016). This allows our model to capture character-level patterns and estimate representations for previously unseen words. In this framework, an alternative word repre-

sentation is constructed from individual characters, by mapping characters to an embedding space and processing them with a bidirectional LSTM. This representation is then combined with a regular word embedding by dynamically predicting element-wise weights for a weighted sum, allowing the model to choose for each feature whether to take the value from the word-level or character-level representation.

The LSTM layer size was set to 200 in each direction for both word- and character-level components; the hidden layer d was set to size 50. During training, sentences were grouped into batches of size 64. Performance on the development set was measured at every epoch and training was stopped when performance had not improved for 10 epochs; the best-performing model on the development set was then used for evaluation on the test set.

6 Intrinsic Evaluation

Evaluation is performed over edges, rather than sentences, in the test set. We measure performance using precision, recall and F1-score. We use the non-parametric paired randomisation test to detect significant differences in performance across classifiers.

The Convnet and BiLSTM models are trained and evaluated with as many sentences as there are edges: the source sentence is copied for each input edge, with inputs modified to reflect the relation of interest. We employed early stopping and dropout to avoid overfitting with these models.

               Precision (%)  Recall (%)  F1-score (%)
    NT         30.30          82.94       44.39
    SVM        55.39          76.36       64.21
    SVM+       71.11          72.74       71.91
    Convnet    80.11          70.26       74.86
    BiLSTM     83.11          74.77       78.98

Table 5: Model performance summary. All figures are statistically significant (p < 0.01) according to the randomisation test.

Table 5 shows the performance results of all classifiers considered. All three proposed models have significantly outperformed the NT baseline and Kristianto et al.'s (2014) state-of-the-art SVM. The best performing model is the bidirectional LSTM (F1 = 78.98%), which has significantly outperformed all other models (α = 0.01).

According to the results in Table 5, both deep neural network models have significantly outperformed classifiers based on other paradigms. This is consistent with the intuition that the language of mathematics is formulaic: we expect deep neural networks to effectively recognise patterns and identify correlations between tokens. The neural models outperform SVM+ despite the fact that the latter is a product of laborious manual feature engineering. In contrast, no manual feature engineering has been performed on the Convnet model (or indeed on any of the deep neural network models).

The nearest type (NT) baseline demonstrates high recall but low precision. This is not surprising since the NT baseline is not capable of making a negative decision: it always assigns some type to all variables in a given sentence.

7 Extrinsic Evaluation

We demonstrate that our data set and variable typing task are useful using a mathematical information retrieval (MIR) experiment. The hypothesis for our MIR experiment is two-fold: (a) types identified in the textual context for the variable typing task are also useful for text-based mathematical retrieval and (b) substituting raw symbols with types in mathematical expressions will have an observable effect on MIR.

In order to motivate the second hypothesis, consider the following natural language query:

    Let x be a vector. Is there another vector y such that x + y will produce the zero element?

In the context of MIR, mathematical expressions are represented using SLTs (Pattaniyil and Zanibbi, 2014) that are constructed by parsing presentation MathML. The expression "x + y" is represented by the SLT in figure 1(a). The variable typing classifier and the type disambiguation algorithm determine the types of the variables x and y as "vector". Thus, the variable nodes in figure 1(a) will be substituted with their type, producing the SLT in figure 1(b).

The example query can be satisfied by identifying a vector y such that, when added to x, will produce the zero vector. This operation is abstract in mathematics and extends to objects beyond vectors, including integers. In an untyped formula index, there is no distinction between instances of x + y where the variables are integers or vectors. As a result, documents where both variables are integers might also be returned. In contrast,

[Figure 1 (image): two symbol layout trees in linear form. (a) Symbol Layout Tree for x + y: nodes x (VAR), + (OP) and y (VAR), linked by ADJ edges. (b) Symbol Layout Tree with type substitution: nodes vector (VAR), + (OP) and vector (VAR), linked by ADJ edges.]

Figure 1: (a) SLT representation of the expression x + y, (b) typed SLT for the expression x + y. a typed formula index will return instances of the (2014) for further details. typed SLT in figure1(b) where the variables are Tangent scoring proceeds as follows. For each vectors, as opposed to integers. Therefore, a typed query formula, the symbol pair tuples are gener- index can reduce the number of false positives and ated and matched exactly to those in the document increase precision. index. Let C denote the set of matched index for- Four MIR retrieval models are introduced in mulae and s the number of symbol pairs in any | | Section 7.3 designed to control for text index- given expression s in C. For each s in C, recall (R) C ing/retrieval so that the effects of type-aware vs is said to be | | , where C and Q are the num- Q | | | | type-agnostic formula indexing and scoring can be bers of tuples| in| C and the query formula Q re- isolated. These models make use of the Tangent C spectively, and precision (P) is | s | . Candidate s is formula indexing and scoring functions (Pattaniyil assigned the F score of these precision| | and recall and Zanibbi, 2014), which we have implemented. values. The mathematical context score for a given We use the Cambridge University Math IR Test document d and query with formulae e1, . . . , en is Collection (CUMTC) (Stathopoulos and Teufel, n 2015) which is composed of 120 research-level ej t1(d, ej) m(d, e1, . . . , en) = | | ·n mathematical information needs and 160 queries. i=1 ei Xj=1 | | The CUMTC is ideal for our evaluation for two P where e represents the number of tuples in ex- reasons. First, topics in the CUMTC are expressed | j| in natural language and are rich in mathematical pression ej and t1(d, ej) represents the top F-score types. This allows us to directly apply our best for expression ei in document d. 
The final score performing variable typing model (BiLSTM) in for document d is a linear combination of the our retrieval experiment in order to extract vari- math context score above and its Lucene text score able typings for documents and queries. Second, (L(d)): the CUMTC uses the MREC as its underlying doc- λ L(d) + (1 λ) m(d, e1, . . . , en) ument collection, which enables downstream eval- × − × uation in an optimal setting for variable typing. 7.2 Typed Tangent Indexing and Scoring We have applied the BiLSTM variable typing 7.1 Tangent Formula Indexing and Scoring model to obtain variable typings for all symbols in Given a mathematical formula, the Tangent index- the documents in the MREC. For each document ing algorithm starts from the root node of an SLT in the collection our adapted Tangent formula in- and generates symbol pair tuples in a depth-first dexer first groups the variable typing edges for that manner. Symbol pair tuples record parent/child document according to the variable identifier in- relationships between SLT nodes, the distance volved. Subsequently, our typed indexing process (number of edges) and vertical offset between applies a type disambiguation algorithm to deter- them. At each step in the traversal, the index is mine which of the candidate types associated with updated to record one tuple representing the rela- the variable will be designated as its type. tionship between the current node and every node For a variable v in document d, our type dis- in the path to the SLT root. We have also im- ambiguation algorithm first looks at the known plemented Tangent’s method of indexing matrices, types suffix trie (KTST) containing all 1.23 mil- but we refer the reader to Pattaniyil and Zanibbi lion types in order to find a common parent be-
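One way to realise this disambiguation step is sketched below. This is our own illustrative code, not the authors' implementation: we assume the KTST can be modelled as a lookup over reversed word sequences of known types, so that the common parent of a set of candidate types is their longest shared word-suffix that is itself a known type (e.g., "abelian group" and "finite group" share the parent "group"); all function names are ours.

```python
from collections import Counter

def common_supertype(candidates, known_types):
    """Longest shared word-suffix of the candidate types that is
    itself a known type (the role played by the KTST), else None."""
    # Reverse each type's word sequence so shared suffixes align as prefixes.
    rev = [tuple(reversed(t.split())) for t in candidates]
    prefix = []
    for words in zip(*rev):
        if len(set(words)) != 1:  # words diverge: shared suffix ends here
            break
        prefix.append(words[0])
    # Back off to the longest shared suffix that is a known type.
    while prefix:
        candidate = " ".join(reversed(prefix))
        if candidate in known_types:
            return candidate
        prefix.pop()
    return None

def disambiguate(candidates, known_types):
    """Pick one type for a variable: the common supertype if one
    exists, otherwise a simple majority vote among the candidates."""
    if not candidates:
        return "*"  # missing-type symbol
    supertype = common_supertype(set(candidates), known_types)
    if supertype is not None:
        return supertype
    return Counter(candidates).most_common(1)[0][0]
```

Under this sketch, `disambiguate(["abelian group", "finite group"], kt)` yields "group" when "group" is a known type, while candidates with no common parent fall back to the majority vote.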

For a variable v in document d, our type disambiguation algorithm first looks at the known types suffix trie (KTST), containing all 1.23 million types, in order to find a common parent between the candidate types. If a common supertype T is discovered, then v is said to be of type T. Otherwise, the type disambiguation algorithm uses a simple majority vote amongst the candidates to determine the final type for variable v.

The type disambiguation algorithm is applied to every typing group until all variable typings have been processed. Variable groups with no type candidates (e.g., no variable typings have been extracted for a variable) are assigned a missing type symbol ("*"). Subsequently, variables in the SLT of each formula in d are replaced with their type or the missing type symbol. An index, referred to as the typed index, is generated by applying the Tangent indexing process on the modified SLTs. The same process is applied to query formulae at query time in order to facilitate typed matching and scoring.

7.3 Results

We have replicated runs of the Lucene vector-space model (VSM) and BM25 models presented by Stathopoulos and Teufel (2016) on the CUMTC. Furthermore, we introduce four models based on Tangent indexing and scoring that represent different strategies in handling types in text and formulae. We refer to a model as typed if it uses the type-substituted version of the Tangent index and untyped otherwise.

Text with types removed (RT): The Lucene score L(d) is computed over a text index with type phrases completely removed. This model is intended to isolate the performance of retrieval on the formula index alone. We consider both typed and untyped instances of this model.

Text with types (TY): The Lucene score is computed over a text index that treats type phrases as atomic lexical tokens. This model is intended to simulate type-aware text that enables the application of variable typing. Both typed and untyped instances of this model are considered.

Optimal values for the linear combination parameter λ are obtained using 13 queries in the "development set" of the CUMTC. We report mean average precision (MAP) for our models computed over all 160 queries in the main CUMTC. MAPs obtained over the CUMTC are low due to the difficulty of the queries rather than an unstable evaluation (Stathopoulos and Teufel, 2016). The paired randomisation test is used to test for significance in retrieval performance gains between the models.

           VSM     BM25
MAP       .076     .079

           RT/typed   RT/untyped   TY/typed   TY/untyped
MAP         .046        .052         .139        .083
λ_opt       .9          .9           .9          .4

Table 6: MIR model performance summary.

The results of our MIR experiments are presented in Table 6. The best performing model is TY/typed, which significantly outperforms all other baselines (p-value < 0.05 for the comparison with BM25 and p-value < 0.01 for all other models). The TY/typed model yields almost double the MAP performance of its untyped counterpart (TY/untyped, .083 MAP). In contrast, the RT/typed and RT/untyped models perform comparably (no significant difference) but poorly. This drop in MAP performance suggests that type phrases are beneficial for text-based retrieval of mathematics. Retrieval models employing formula indexing seem to be affected by the presence of types both in the text and in the formula index. The TY/typed model outperforms the TY/untyped model, which in turn outperforms RT/untyped. This suggests that gains in retrieval performance are strongest when types are used in both text and formula retrieval – models using either approach alone do not perform as well. These results demonstrate that variable typing is a valuable task in MIR.

8 Conclusions

This work introduces the new task of variable typing and an associated data set containing 33,524 labeled edges in 7,803 sentences. We have constructed three variable typing models and have shown that they outperform current state-of-the-art methods developed for similar tasks. The BiLSTM model is the top performing model, achieving 79% F1-score. This model is then evaluated in an extrinsic downstream task, MIR, where we augmented Tangent formula indexing with variable typing. A retrieval model employing the typed Tangent index outperforms all considered retrieval models, demonstrating that our variable typing task, data and trained model are useful in downstream applications. We make our variable typing data set available through the Open Data Commons license.

References

Mohan Ganesalingam. 2008. The Language of Mathematics. Ph.D. thesis, Cambridge University Computer Laboratory.

Winfried Gödert. 2012. Detecting multiword phrases in mathematical text corpora. CoRR, abs/1210.0852.

Mihai Grigore, Magdalena Wolska, and Michael Kohlhase. 2009. Towards context-based disambiguation of mathematical expressions. In The Joint Conference of ASCM 2009 and MACIS 2009: 9th International Conference on Asian Symposium on Computer Mathematics and 3rd International Conference on Mathematical Aspects of Computer and Information Sciences, Fukuoka, Japan, December 14–17, 2009, Selected Papers, pages 262–271. Fukuoka: Kyushu University, Faculty of Mathematics.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9.

Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Giovanni Yoko Kristianto, Minh-Quoc Nghiem, Yuichiroh Matsubayashi, and Akiko Aizawa. 2012. Extracting definitions of mathematical expressions in scientific papers. In JSAI.

Giovanni Yoko Kristianto, Goran Topic, and Akiko Aizawa. 2014. Exploiting textual descriptions and dependency graph for searching mathematical expressions in scientific papers.

Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. In Proceedings of NAACL-HLT 2016.

Martin Líška, Petr Sojka, Michal Růžička, and Petr Mravec. 2011. Web interface and collection for mathematical retrieval: WebMIaS and MREC. In Towards a Digital Mathematics Library, pages 77–84, Bertinoro, Italy. Masaryk University.

Nidhin Pattaniyil and Richard Zanibbi. 2014. Combining TF-IDF text retrieval with an inverted index over symbol pairs in math expressions: The Tangent math search engine at NTCIR 2014. In Proceedings of the 11th NTCIR Conference on Evaluation of Information Access Technologies, NTCIR-11, National Center of Sciences, Tokyo, Japan, December 9–12, 2014.

Minh Nghiem Quoc, Keisuke Yokoi, Yuichiroh Matsubayashi, and Akiko Aizawa. 2010. Mining coreference relations between formulas and text using Wikipedia.

Marek Rei, Gamal K. O. Crichton, and Sampo Pyysalo. 2016. Attending to characters in neural sequence labeling models. In Coling 2016.

Marek Rei and Helen Yannakoudakis. 2016. Compositional sequence labeling models for error detection in learner writing. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics.

Thomas Schellenberg, Bo Yuan, and Richard Zanibbi. 2012. Layout-based substitution tree indexing and retrieval for mathematical expressions. Pages 82970I–82970I-8.

Moritz Schubotz, Alexey Grigorev, Marcus Leich, Howard S. Cohl, Norman Meuschke, Bela Gipp, Abdou S. Youssef, and Volker Markl. 2016. Semantification of identifiers in mathematics for better math information retrieval. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '16, pages 135–144, New York, NY, USA. ACM.

Yiannos Stathopoulos and Simone Teufel. 2015. Retrieval of research-level mathematical information needs: A test collection and technical terminology experiment. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL 2015, July 26–31, 2015, Beijing, China, Volume 2: Short Papers, pages 334–340.

Yiannos Stathopoulos and Simone Teufel. 2016. Mathematical information retrieval based on type embeddings and query expansion. In Proceedings of the 26th International Conference on Computational Linguistics, Coling 2016, December 11–16, 2016, Osaka, Japan, pages 334–340.

Magdalena Wolska, Mihai Grigore, and Michael Kohlhase. 2011. Using discourse context to interpret object-denoting mathematical expressions. In Towards a Digital Mathematics Library, Bertinoro, Italy, July 20–21, 2011, pages 85–101. Masaryk University Press.