
Extracting glossary sentences from scholarly articles: A comparative evaluation of pattern bootstrapping and deep analysis

Melanie Reiplinger1  Ulrich Schäfer2  Magdalena Wolska1∗∗

1Computational Linguistics, Saarland University, D-66041 Saarbrücken, Germany
2DFKI Language Technology Lab, Campus D 3 1, D-66123 Saarbrücken, Germany
{mreiplin,magda}@coli.uni-saarland.de, [email protected]

∗∗Corresponding author

Abstract

The paper reports on a comparative study of two approaches to extracting definitional sentences from a corpus of scholarly discourse: one based on bootstrapping lexico-syntactic patterns and another based on deep analysis. Computational Linguistics was used as the target domain and the ACL Anthology as the corpus. Definitional sentences extracted for a set of well-defined concepts were rated by domain experts. Results show that both methods extract high-quality definition sentences intended for automated glossary construction.

1 Introduction

Specialized glossaries serve two functions. Firstly, they are linguistic resources summarizing the terminological basis of a specialized domain. Secondly, they are knowledge resources, in that they provide definitions of the concepts which the terms denote. Glossaries find obvious use as sources of reference. A survey on the use of lexicographical aids in specialized translation showed that glossaries are among the top five resources used (Durán-Muñoz, 2010). Glossaries have also been shown to facilitate reception of texts and acquisition of knowledge during study (Weiten et al., 1999), while self-explanation of reasoning by referring to definitions has been shown to promote understanding (Aleven et al., 1999). From a machine-processing point of view, glossaries may be used as input for domain ontology induction; see, e.g., (Bozzato et al., 2008).

The process of glossary creation is inherently dependent on expert knowledge of the given domain, its concepts and language. In the case of scientific domains, which constantly evolve, glossaries need to be regularly maintained: updated and continually extended. Manual creation of specialized glossaries is therefore costly. As an alternative, fully- and semi-automatic methods of glossary creation have been proposed (see Section 2).

This paper compares two approaches to corpus-based extraction of definitional sentences intended to serve as input for a specialized glossary of a scientific domain. The bootstrapping approach acquires lexico-syntactic patterns characteristic of definitions from a corpus of scholarly discourse. The deep approach uses syntactic and semantic processing to build structured representations of sentences, based on which 'is-a'-type definitions are extracted. In the present study we used Computational Linguistics (CL) as the target domain of interest and the ACL Anthology as the corpus.

Computational Linguistics, as a specialized domain, is rich in technical terminology. As a cross-disciplinary domain at the intersection of linguistics, computer science, artificial intelligence, and mathematics, it is interesting as far as glossary creation is concerned in that its scholarly discourse ranges from descriptive informal to formal, including mathematical notation. Consider the following two descriptions of Probabilistic Context-Free Grammar (PCFG):


(1) A PCFG is a CFG in which each production A → α in the grammar's set of productions R is associated with an emission probability P(A → α) that satisfies a normalization constraint

    ∑_{α : A→α ∈ R} P(A → α) = 1

and a consistency or tightness constraint [...]

(2) A PCFG defines the probability of a string of words as the sum of the probabilities of all admissible phrase structure parses (trees) for that string.

While (1) is an example of a genus-differentia definition, (2) is a valid description of PCFG which neither has the typical copula structure of an "is-a"-type definition, nor does it contain the level of detail of the former. (2) is, however, well-usable for a glossary. The bootstrapping approach extracts definitions of both types. Thus, at the subsequent glossary creation stage, alternative entries can be used to generate glossaries of different granularity and formal detail, e.g., targeting different user groups.

Outline. Section 2 gives an overview of related work. Section 3 presents the corpora and the preprocessing steps. The bootstrapping procedure is summarized in Section 4 and deep analysis in Section 5. Section 6 presents the evaluation methodology and the results. Section 7 presents an outlook.

2 Related Work

Most of the existing definition extraction methods – be it targeting definitional question answering or glossary creation – are based on mining part-of-speech (POS) and/or lexical patterns typical of definitional contexts. Patterns are then filtered heuristically or using machine learning based on features which refer to the contexts' syntax, lexical content, punctuation, layout, position in discourse, etc.

DEFINDER (Muresan and Klavans, 2002), a rule-based system, mines definitions from online medical articles in lay language by extracting sentences using cue-phrases, such as "x is the term for y", "x is defined as y", and punctuation, e.g., hyphens and brackets. The results are analyzed with a statistical parser. Fahmi and Bouma (2006) train supervised learners to classify concept definitions from medical pages of the Dutch Wikipedia using the "is a" pattern and apply a lexical filter (stopwords) to the classifier's output. Besides other features, the position of a sentence within a document is used which, due to the encyclopaedic text character of the corpus, allows setting the baseline precision at above 75% by classifying the first sentences as definitions. Westerhout and Monachesi (2008) use a complex set of grammar rules over POS, syntactic chunks, and entire definitory contexts to extract definition sentences from an eLearning corpus; machine learning is used to filter out incorrect candidates. Gaudio and Branco (2009) use only POS information in a supervised-learning approach, pointing out that lexical and syntactic features are domain- and language-dependent. Borg et al. (2009) use genetic programming to learn rules for typical linguistic forms of definition sentences in an eLearning corpus and genetic algorithms to assign weights to the rules. Velardi et al. (2008) present a fully-automatic end-to-end methodology of glossary creation. First, TermExtractor acquires domain terminology and GlossExtractor searches for definitions on the web using Google queries constructed from a set of manually compiled lexical definitional patterns. Then, the search results are filtered using POS and chunk information as well as term distribution properties of the domain of interest. Filtered results are presented to humans for manual validation, upon which the system updates the glossary. The entire process is automated.

Bootstrapping as a method of linguistic pattern induction was used for learning hyponymy/is-a relations already in the early 90s by Hearst (1992). Various variants of the procedure – for instance, exploiting POS information, double pattern-anchors, or semantic information – have recently been proposed (Etzioni et al., 2005; Pantel and Pennacchiotti, 2006; Girju et al., 2006; Walter, 2008; Kozareva et al., 2008; Wolska et al., 2011). The method presented here is most similar to Hearst's; however, we acquire a large set of general patterns over POS tags alone, which we subsequently optimize on a small manually annotated corpus subset by lexicalizing the verb classes.

3 The Corpora and Preprocessing

The corpora. Three corpora were used in this study. At the initial stage, two development corpora were used: a digitalized early draft of the Jurafsky–Martin textbook (Jurafsky and Martin, 2000) and the WeScience Corpus, a set of Wikipedia articles in the domain of Natural Language Processing (Ytrestøl et al., 2009).1 The former served as a source of seed domain terms with definitions, while the latter was used for seed pattern creation.

For acquisition of definitional patterns and pattern refinement we used the ACL Anthology, a digital archive of scientific papers from conferences, workshops, and journals on Computational Linguistics and Language Technology (Bird et al., 2008).2 The corpus consisted of 18,653 papers published between 1965 and 2011, with a total of 66,789,624 tokens and 3,288,073 sentences. This corpus was also used to extract the sentences for the evaluation, using both extraction methods.

1 http://moin.delph-in.net/WeScience
2 http://aclweb.org/anthology/

Preprocessing. The corpora have been sentence- and word-tokenized using regular-expression-based sentence boundary detection and tokenization tools. Sentences have been part-of-speech tagged using the TnT tagger (Brants, 2000) trained on the Penn Treebank (Marcus et al., 1993).3

3 The accuracy of tokenization and tagging was not verified.

Next, domain terms were identified using the C-Value approach (Frantzi et al., 1998). C-Value is a domain-independent method of automatic multi-word term recognition that rewards high-frequency and high-order n-gram candidates, but penalizes those which frequently occur as sub-strings of another candidate. The 10,000 top-ranking multi-word token sequences, according to C-Value, were used.

Domain terms. The set of domain terms was compiled from the following sub-sets: 1) the 10,000 automatically identified multi-word terms, 2) the set of terms appearing on the margins of the Jurafsky–Martin textbook, the intuition being that these are domain-specific terms which are likely to be defined or explained in the text along which they appear, and 3) a set of 5,000 terms obtained by expanding frequent abbreviations and acronyms retrieved from the ACL Anthology corpus using simple pattern matching. The token spans of domain terms have been marked in the corpora, as these are used in the course of definition pattern acquisition (Section 4.2).
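The C-Value scoring scheme referred to above can be made concrete with a short sketch. The following Python fragment is a minimal illustration of the measure of Frantzi et al. (1998) over pre-counted candidate n-grams; candidate extraction, frequency counting, and the linguistic filters of the original method are omitted, and the helper `contains` is our own simplification, not the implementation used in this study.

    import math
    from collections import defaultdict

    def contains(longer, shorter):
        """True if `shorter` occurs as a contiguous sub-sequence of `longer`."""
        m = len(shorter)
        return any(longer[i:i + m] == shorter for i in range(len(longer) - m + 1))

    def c_value(candidates):
        """Rank multi-word term candidates (tuples of tokens mapped to
        corpus frequencies) with the C-Value measure (Frantzi et al., 1998)."""
        # For every candidate, collect the longer candidates containing it.
        nested_in = defaultdict(list)
        for longer in candidates:
            for shorter in candidates:
                if len(shorter) < len(longer) and contains(longer, shorter):
                    nested_in[shorter].append(longer)

        scores = {}
        for term, freq in candidates.items():
            weight = math.log2(len(term))     # rewards higher-order n-grams
            if term not in nested_in:
                scores[term] = weight * freq
            else:
                # penalize terms that mostly occur inside longer candidates
                containers = nested_in[term]
                nested = sum(candidates[c] for c in containers)
                scores[term] = weight * (freq - nested / len(containers))
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

    print(c_value({("hidden", "markov", "model"): 40,
                   ("markov", "model"): 44,
                   ("language", "model"): 70}))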

Seed terms:
  machine translation
  language model
  neural network
  reference resolution
  finite(-| )state automaton
  hidden markov model
  speech synthesis
  semantic role label(l)?ing
  context(-| )free grammar
  ontology
  generative grammar
  dynamic programming
  mutual information

Seed patterns:
  T .* (is|are|can be) used
  T .* called
  T .* (is|are) composed
  T .* involv(es|ed|e|ing)
  T .* perform(s|ed|ing)?
  T \( or .*? \)
  task of .* T .*? is
  term T .*? refer(s|red|ring)?

Table 1: Seed domain terms (top) and seed patterns (bottom) used for bootstrapping; T stands for a domain term.

4 Bootstrapping Definition Patterns

Bootstrapping-based extraction of definitional sentences proceeds in two stages. First, aiming at recall, a large set of lexico-syntactic patterns is acquired: starting with a small set of seed terms and patterns, term and pattern "pools" are iteratively augmented by searching for matching sentences in the ACL Anthology while acquiring candidates for definition terms and patterns. Second, aiming at precision, the general patterns acquired at the first stage are systematically optimized on a set of annotated extracted definitions.

4.1 Seed Terms and Seed Patterns

As seed terms to initialize pattern acquisition, we chose terms which are likely to have definitions. Specifically, from the top-ranked multi-word terms, ordered by C-Value, we selected those which were also in either the Jurafsky–Martin term list or the list of expanded frequent abbreviations. The resulting 13 seed terms are shown in the top part of Table 1.

Seed definition patterns were created by inspecting definitional contexts in the Jurafsky–Martin and WeScience corpora. First, 12 terms from Jurafsky–Martin and WeScience were selected to find domain terms with which they co-occurred in simple "is-a" patterns. Next, the corpora were searched again to find sentences in which the term pairs in "is-a" relation occur. Non-definition sentences were discarded. Finally, based on the resulting definition sentences, 22 seed patterns were constructed by transforming the definition phrasings into regular expressions. A subset of the seed patterns extracted in this way is shown in the bottom part of Table 1.4

4 Here and further in the paper, regular expressions are presented in Perl notation.
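To illustrate how a seed pattern from Table 1 is applied, the following minimal sketch instantiates the T placeholder with a concrete term and matches a made-up sentence. The paper does not show the system's actual matching code, so the instantiation details here are our own assumptions.

    import re

    def instantiate(seed_pattern, term):
        """Fill the T placeholder of a Table 1 seed pattern with a domain
        term and compile the resulting Perl-style regular expression."""
        return re.compile(seed_pattern.replace("T", f"({term})"), re.IGNORECASE)

    pattern = instantiate(r"T .* (is|are|can be) used", r"hidden markov models?")
    sentence = "Hidden Markov models of this kind are used for part-of-speech tagging."
    print(bool(pattern.search(sentence)))   # True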

4.2 Acquiring Patterns

Pattern acquisition proceeds in two stages. First, based on the seed sets, candidate defining terms are found and ranked. Then, new patterns are acquired by instantiating existing patterns with pairs of likely co-occurring domain terms, searching for sentences in which the term pairs co-occur, and creating POS-based patterns. These steps are summarized below.

Finding definiens candidates. Starting with a set of seed terms and a set of definition phrases, the first stage finds sentences with the seed terms in the T-placeholder position of the seed phrases. For each term, the set of extracted sentences is searched for candidate defining terms (other domain terms in the sentence) to form term-term pairs which, at the second stage, will be used to acquire new patterns.

Two situations can occur: a sentence may contain more than one domain term (other than one of the seed terms), or the same domain terms may be found to co-occur with multiple seed terms. Therefore, term-term pairs are ranked.

Ranking. Term-term pairs are ranked using four standard measures of association strength: 1) co-occurrence count; 2) pointwise mutual information (PMI); 3) refined PMI, which compensates the bias toward low-frequency events by multiplying PMI with the co-occurrence count (Manning and Schütze, 1999); and 4) mutual dependency (MD), which compensates the bias toward rare events by subtracting the co-occurrence's self-information (entropy) from its PMI (Thanopoulos et al., 2002). The measures are calculated based on the corpus for co-occurrences within a 15-word window. Based on experimentation, mutual dependency was found to produce the best results, and therefore it was used in ranking definiens candidates in the final evaluation of patterns. The top-k candidates make up the set of defining terms to be used in the pattern acquisition stage. Table 2 shows the top-20 candidate defining terms for the term "Lexical Functional Grammar", according to mutual dependency.

Domain term: lexical functional grammar (LFG)
Candidate defining terms: phrase structure grammar, formal system, functional unification grammar, grammatical representation, phrase structure, generalized phrase, functional unification, binding theory, syntactic theories, functional structure, grammar formalism(s), grammars, linguistic theor(y|ies)

Table 2: Candidate defining phrases of the term "Lexical Functional Grammar (LFG)".
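The four association measures used for this ranking can be written compactly. The sketch below assumes that raw counts from the fixed 15-word co-occurrence window have already been collected; the counting code itself is omitted, and the numbers in the usage line are toy values.

    import math

    def association_scores(pair_count, x_count, y_count, total):
        """The four measures of association strength from Section 4.2,
        computed from window co-occurrence counts."""
        p_xy = pair_count / total
        p_x, p_y = x_count / total, y_count / total
        pmi = math.log2(p_xy / (p_x * p_y))
        return {
            "count": pair_count,
            "pmi": pmi,
            # multiplying by the count compensates PMI's low-frequency bias
            "refined_pmi": pmi * pair_count,
            # PMI minus the pair's self-information -log2 p(x,y), i.e.
            # log2(p(x,y)^2 / (p(x)p(y))) (Thanopoulos et al., 2002)
            "mutual_dependency": pmi + math.log2(p_xy),
        }

    print(association_scores(pair_count=25, x_count=120, y_count=300, total=100000))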
Pattern and domain term acquisition. At the pattern acquisition stage, definition patterns are retrieved by 1) coupling terms with their definiens candidates, 2) extracting sentences that contain the pair within the specified window of words, and finally 3) creating POS-based patterns corresponding to the extracted sentences. All co-occurrences of each term together with each of its defining terms within the fixed window size are extracted from the POS-tagged corpus. At each iteration, new definiendum and definiens terms are also found by applying the newly abstracted patterns to the corpus and retrieving the matching domain terms.

The newly extracted sentences and terms are filtered (see "Filtering" below). The remaining data constitute new instances for further iterations. The linguistic material between the two terms in the extracted sentences is taken to be an instantiation of a potential definition pattern, for which its POS pattern is created as follows:

• The defined and defining terms are replaced by placeholders, T and DEF.

• All the material outside the T and DEF anchors is removed; i.e., the resulting patterns have the form 'T ... DEF' or 'DEF ... T'.

• Assuming that the fundamental elements of a definition pattern are verbs and noun phrases, all tags except verb, noun, modal, and the infinitive marker "to" are replaced by placeholders denoting any string; punctuation is preserved, as it has been observed to be informative in detecting definitions (Westerhout and Monachesi, 2008; Fahmi and Bouma, 2006).

• Sequences of singular and plural nouns and proper nouns are replaced with a noun phrase placeholder, NP; it is expanded to match complex noun phrases when applying the patterns to extract definition sentences.

The new patterns and terms are then fed as input to the acquisition process to extract more sentences and again abstract new patterns.

Filtering. In the course of pattern acquisition, information on term-pattern co-occurrence frequencies is stored and relative frequencies are calculated: 1) for each term, the percentage of seed patterns it occurs with, and 2) for each pattern, the percentage of seed terms it occurs with. These are used in the bootstrapping cycles to filter out terms which do not occur as part of a sufficient number of patterns (possibly false-positive definiendum candidates) and patterns which do not occur with a sufficient number of terms (insufficient generalizing behavior).

Moreover, the following filtering rules are applied: Abstracted POS-pattern sequences of the form 'T .+ DEF' and 'DEF T' are discarded,5 the former because it is not informative, the latter because it is rather an indicator of compound nouns than of definitions. From the extracted sentences, those containing negation are filtered out; negation is contra-indicative of definition (Pearson, 1996). For the same reason, auxiliary constructions with "do" and "have" are excluded unless, in the case of the latter, "have" is followed by two past participle tags, as in, e.g., "has been/VBN defined/VBN (as)."

5 '.+' stands for at least one arbitrary character.
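As an illustration of the abstraction step described in the list above, the following sketch turns one POS-tagged sentence with marked term spans into a 'T ... DEF' pattern. It is a simplified rendering of the described procedure, not the system's code; the tag handling and NP collapsing are approximations.

    import re

    KEEP = re.compile(r"^(V|N|MD|TO)")   # verb, noun, modal, infinitive "to"

    def abstract_pattern(tagged, t_end, def_start):
        """Abstract a tagged sentence (list of (token, POS) pairs) into a
        POS-based 'T ... DEF' pattern: keep verb/noun/modal/TO tags and
        punctuation, wildcard everything else, collapse noun runs to NP."""
        out = ["T"]
        for token, pos in tagged[t_end:def_start]:
            if KEEP.match(pos):
                out.append(pos)
            elif not pos[0].isalpha():          # punctuation is preserved
                out.append(token)
            elif out[-1] != ".*":               # anything else: one wildcard
                out.append(".*")
        out.append("DEF")
        # collapse runs of noun tags into a single NP placeholder
        return re.sub(r"NN\S*(?: NN\S*)*", "NP", " ".join(out))

    tagged = [("PCFG", "NN"), ("is", "VBZ"), ("a", "DT"), ("CFG", "NN"),
              ("in", "IN"), ("which", "WDT"), ("each", "DT"),
              ("production", "NN"), ("of", "IN"), ("the", "DT"),
              ("grammar", "NN")]
    print(abstract_pattern(tagged, t_end=1, def_start=10))
    # -> T VBZ .* NP .* NP .* DEF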
4.3 Manual Refinement

While the goal of the bootstrapping stage was to find as many candidate patterns for good definition terms as possible, the purpose of the refinement stage is to aim at precision. Since the automatically extracted patterns consist only of verb and noun phrase tags between the definiendum and its defining term anchors, they are too general.

In order to create more precise patterns, we tuned the pattern sequences based on a small development sub-corpus of the extracted sentences, which we annotated. The development corpus was created by extracting sentences using the most frequent patterns instantiated with the term which occurred with the highest percentage of seed patterns. The term "ontology" appeared with more than 80% of the patterns and was used for this purpose. The sentences were then manually annotated as to whether they are true-positive or false examples of definitions (101 and 163 sentences, respectively).

Pattern tuning was done by investigating which verbs are and which are not indicative of definitions, based on the positive and negative example sentences. Table 3 shows the frequency distribution of verbs (or verb sequences) in the annotated corpus which occurred more than twice.

  # Definitions         # Non-definitions
  25 is/VBZ             24 is/VBZ
   8 represents/VBZ     14 contains/VBZ
   6 provides/VBZ        9 employed/VBD
   6 contains/VBZ        6 includes/VBZ
   6 consists/VBZ        4 reflects/VBZ
   3 serves/VBZ          3 uses/VBZ
   3 describes/VBZ       3 typed/VBN
   3 constitutes/VBZ     3 provides/VBZ
   3 are/VBP             3 learning/VBG

Table 3: Subset of verbs occurring in sentences matched by the most frequently extracted patterns.

Abstracting over POS sequences of the sentences containing definition-indicative verbs, we created 13 patterns, extending the automatically found patterns, that yielded 65% precision on the development set, matching 51% of the definition sentences and reducing noise to 17% non-definitions. Patterns resulting from verb tuning were used in the evaluation. Examples of the tuned patterns are shown below:

  T VBZ DT JJ? NP .* DEF
  T , NP VBZ IN NP .* DEF
  T , .+ VBZ DT .+ NP .* DEF

The first pattern matches both our introductory example definitions of the term "PCFG" (cf. Section 1), with 'T' as a placeholder for the term itself, 'NP' denoting a noun phrase, and 'DEF' one of the term's defining phrases: in the first case, (1), "grammar", in the second case, (2), "probabilities". The examples, annotated with the matched pattern elements, are shown below:6

[PCFG]T [is]VBZ [a]DT [CFG]NP [in which each production A → α in the].* [grammar]DEF's set of productions R is associated with an emission probability . . .

A [PCFG]T [defines]VBZ [the]DT [probability]DEF of a string of words as the sum of the probabilities of all admissible phrase structure parses (trees) for that string.

6 Matching pattern elements in square brackets; tags from the pattern subscripted.

5 Deep Analysis for Definition Extraction

An alternative, largely domain-independent approach to the extraction of definition sentences is based on the sentence-semantic index generation of the ACL Anthology Searchbench (Schäfer et al., 2011). Deep syntactic parsing with semantic predicate-argument structure extraction of each of the approx. 3.3 million sentences in the 18,653-paper ACL Anthology corpus is used for our experiments. We briefly describe how in this approach we get from the sentence text to the semantic representation.

The preprocessing is shared with the bootstrapping-based approach for definition sentence extraction, namely PDF-to-text extraction, sentence boundary detection (SBR), and trigram-based POS tagging with TnT (Brants, 2000). The tagger output is combined with information from a named entity recognizer that in addition delivers hypothetical information on citation expressions.

The combined result is delivered as input to the deep parser PET (Callmeier, 2000) running the open-source HPSG (Pollard and Sag, 1994) grammar for English (ERG; Flickinger (2002)). The deep parser is made robust and fast through a careful combination of several techniques, e.g.: (1) chart pruning: directed search during parsing to increase performance and coverage for longer sentences (Cramer and Zhang, 2010); (2) chart mapping: a framework for integrating preprocessing information from the PoS tagger and named entity recognizer in exactly the way the deep grammar expects it (Adolphs et al., 2008);7 (3) a statistical parse ranking model (WeScience; (Flickinger et al., 2010)).

7 PoS tagging, e.g., helps the deep parser to cope with words unknown to the deep lexicon, for which default entries based on the PoS information are generated on the fly.

The parser outputs a sentence-semantic representation in the MRS format (Copestake et al., 2005) that is transformed into a dependency-like variant (Copestake, 2009). From these DMRS representations, predicate-argument structures are derived. These are indexed with structure (semantic subject, predicate, direct object, indirect object, adjuncts) using a customized Apache Solr8 server. Matching of arguments is left to Solr's standard analyzer for English with stemming; exact matches are ranked higher than partial matches.

8 http://lucene.apache.org/solr

The basic semantics extraction algorithm consists of the following steps: 1) calculate the closure for each (D)MRS elementary predication based on the EQ (variable equivalence) relation, and group the predicates and entities in each closure respectively; 2) extract the relations of the groups, which results in a graph as a whole; 3) recursively traverse the graph, form one semantic tuple for each predicate, and fill in the information under its scope, i.e., subject, object, etc.

The semantic structure extraction algorithm generates multiple predicate-argument structures for coordinated sentence (sub-)structures in the index. Moreover, explicit negation is recognized, and negated sentences are excluded from the task for the same reasons as in the bootstrapping approach above (see Section 4.2, "Filtering").

Further details of the deep parsing approach are described in (Schäfer and Kiefer, 2011). In the Searchbench online system,9 the definition extraction can be tested with any domain term T by using statement queries of the form 's:T p:is'.

9 http://aclasb.dfki.de
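To make the statement-query mechanism concrete, the toy sketch below mimics matching a 's:T p:is' query against indexed predicate-argument tuples. The field names and tuple values are our own illustration of the kind of structure described above, not the Searchbench's actual index schema.

    # Hypothetical predicate-argument tuples as the index might hold them;
    # field names are illustrative, not the real Searchbench schema.
    tuples = [
        {"subject": "PCFG", "predicate": "be", "dobject": "CFG",
         "negated": False},
        {"subject": "parser", "predicate": "use", "dobject": "chart pruning",
         "negated": False},
        {"subject": "model", "predicate": "be", "dobject": "appropriate",
         "negated": True},   # "... is not appropriate": excluded
    ]

    def statement_query(index, term):
        """Mimic 's:T p:is': copula tuples whose semantic subject is the
        domain term; negated sentences are excluded (cf. Section 4.2)."""
        return [t for t in index
                if t["subject"].lower() == term.lower()
                and t["predicate"] == "be"
                and not t["negated"]]

    print(statement_query(tuples, "pcfg"))
    # -> [{'subject': 'PCFG', 'predicate': 'be', 'dobject': 'CFG', ...}]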

6 Evaluation

For the evaluation, we selected 20 terms, shown in Table 4, which can be considered domain terms in the domain of computational linguistics.

  integer linear programming (ILP)
  conditional random field (CRF)
  support vector machine (SVM)
  latent semantic analysis (LSA)
  combinatory categorial grammar (CCG)
  lexical-functional grammar (LFG)
  probabilistic context-free grammar (PCFG)
  discourse representation theory (DRT)
  discourse representation structure (DRS)
  phrase-based machine translation (PSMT;PBSMT)
  statistical machine translation (SMT)
  multi-document summarization (MDS)
  word sense disambiguation (WSD)
  semantic role labeling (SRL)
  coreference resolution
  conditional entropy
  cosine similarity
  mutual information (MI)
  default unification (DU)
  computational linguistics (CL)

Table 4: Domain terms used in the rating experiment.

Five general terms, such as 'English text' or 'web page', were also included in the evaluation as a control sample; since general terms of this kind are not likely to be defined in scientific papers in CL, their definition sentences were of low quality (false positives). We do not include them in the summary of the evaluation results for space reasons. "Computational linguistics", while certainly a domain term in the domain, is not likely to be defined in the articles in the ACL Anthology; however, the term as such should rather be included in a glossary of computational linguistics, therefore we included it in the evaluation.

Due to the lack of gold-standard glossary definitions in the domain, we performed a rating experiment in which we asked domain experts to judge top-ranked definitional sentences extracted using the two approaches. Below we briefly outline the evaluation setup and the procedure.

6.1 Evaluation Data

A set of definitional sentences for the 20 domain terms was extracted as follows:

Lexico-syntactic patterns (LSP). For the lexico-syntactic patterns approach, sentences extracted by the set of refined patterns (see Section 4.3) were considered for evaluation only if they contained at least one of the term's potential defining phrases as identified by the first stage of the glossary extraction (Section 4.2). Acronyms were allowed as fillers of the domain term placeholders.

The candidate evaluation sentences were ranked using single-linkage clustering in order to find subsets of similar sentences; the tf.idf-based cosine between vectors of lemmatized words was used as a similarity function. As in (Shen et al., 2006), the longest sentence was chosen from each of the clusters. Results were ranked by considering the size of a cluster as a measure of how likely it is to represent a definition: the larger the cluster, the higher it was ranked. The five top-ranked sentences for each of the 20 terms were used for the evaluation.
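The clustering-based ranking just described can be sketched with standard tooling. In the fragment below, the 0.6 distance threshold is an assumed value, the vectorizer's default analyzer stands in for the lemmatization used in the study, and the sklearn/scipy stack is our choice of illustration, not the paper's tooling.

    from collections import defaultdict
    from scipy.cluster.hierarchy import fcluster, linkage
    from scipy.spatial.distance import pdist
    from sklearn.feature_extraction.text import TfidfVectorizer

    def rank_candidates(sentences, threshold=0.6):
        """Single-linkage clustering over tf.idf vectors; keep the longest
        sentence per cluster and rank clusters by size (cf. Section 6.1).
        Expects at least two (ideally lemmatized) candidate sentences."""
        vectors = TfidfVectorizer().fit_transform(sentences).toarray()
        distances = pdist(vectors, metric="cosine")   # 1 - cosine similarity
        labels = fcluster(linkage(distances, method="single"),
                          t=threshold, criterion="distance")
        clusters = defaultdict(list)
        for sentence, label in zip(sentences, labels):
            clusters[label].append(sentence)
        # larger cluster -> more likely a definition -> higher rank
        ranked = sorted(clusters.values(), key=len, reverse=True)
        return [max(cluster, key=len) for cluster in ranked][:5]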
Deep analysis (DA). The only pattern used for deep analysis extraction was 'subject:T predicate:is', with 'is' restricted by the HPSG grammar to be the copula relation and not an auxiliary, such as in passive constructions, etc. The five top-ranked sentences – as per Solr's matching algorithm – extracted with this pattern were used for the evaluation.

In total, 200 candidate definition sentences for the 20 domain terms were evaluated, 100 per extraction method. Examples of candidate glossary sentences extracted using both methods, along with their ratings, are shown in the appendix.

6.2 Evaluation Method

Candidate definition sentences were presented to 6 human domain experts by a web interface displaying one sentence at a time in random order. Judges were asked to rate the sentences on a 5-point ordinal scale with the following descriptors:10

5: The passage provides a precise and concise description of the concept

4: The passage provides a good description of the concept

3: The passage provides useful information about the concept, which could enhance a definition

2: The passage is not a good enough description of the concept to serve as a definition; for instance, it is too general, unfocused, or a subconcept/superconcept of the target concept is defined instead

1: The passage does not describe the concept at all

10 Example definitions at each scale point, selected by the authors, were shown for the concept "hidden markov model".

The judges participating in the rating experiment were PhD students, postdoctoral researchers, or researchers of comparable expertise, active in the areas of computational linguistics/natural language processing/language technology. One of the raters was one of the authors of this paper. The raters were explicitly instructed to think along the lines of "what they would like to see in a glossary of computational linguistics terms".

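The agreement and significance figures reported in the next subsection can be reproduced in outline with standard libraries. The sketch below uses toy numbers rather than the study's data, and the `krippendorff` package is an assumption of this illustration, not a tool named in the paper.

    import numpy as np
    import krippendorff                  # pip install krippendorff (assumed)
    from scipy.stats import wilcoxon

    # One row per judge, one column per rated sentence; np.nan = not rated.
    # Toy ratings, not the study's data.
    ratings = np.array([[5, 4, 3, 2, np.nan, 1, 3, 4],
                        [4, 4, 3, 1, 2.0,    1, 3, 5],
                        [5, 3, 3, 2, 2.0,    2, 4, 4]])
    alpha = krippendorff.alpha(reliability_data=ratings,
                               level_of_measurement="ordinal")

    # Ratings paired by rank: the rank-k LSP sentence vs. the rank-k DA
    # sentence for each term (cf. footnote 11); again toy numbers.
    lsp = [4, 3, 5, 2, 3, 4, 1, 5]
    da  = [3, 3, 4, 2, 4, 4, 2, 4]
    stat, p = wilcoxon(lsp, da)
    print(f"ordinal alpha = {alpha:.2f}, Wilcoxon p = {p:.2f}")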
6.3 Results

Figure 1 shows the distribution of ratings across the five scale points for the two systems. Around 57% of the LSP ratings and 60% of the DA ratings fall within the top three scale points (positive ratings), and 43% and 40%, respectively, within the bottom two scale points (low ratings). Krippendorff's ordinal α (Hayes and Krippendorff, 2007) was 0.66 (1,000 bootstrapped samples), indicating a modest degree of agreement, at which, however, tentative conclusions can be drawn.

[Figure 1: Distribution of ratings across the 5 scale points; LSP: lexico-syntactic patterns, DA: deep analysis]

Figure 2 shows the distribution of mode ratings of the individual domain terms used in the evaluation. Definitions of 6 terms extracted using the LSP method were rated most frequently at 4 or 5, as opposed to the majority of ratings at 3 for most terms in the case of the DA method.

[Figure 2: Mode values of ratings per method for the individual domain terms; see Table 4]

A Wilcoxon signed-rank test was conducted to evaluate whether domain experts favored definitional sentences extracted by one of the two methods.11 The results indicated no significant difference between ratings of definitions extracted using LSP and DA (Z = 0.43, p = 0.68).

11 Definition sentences for each domain term were paired by the rank assigned by the extraction methods: the rank-1 DA sentence with the rank-1 LSP sentence, etc.; see Section 6.1.

Now, considering that the ultimate purpose of the sentence extraction is glossary creation, we were also interested in how the top-ranked sentences were rated; that is, assuming we were to create a glossary using only the highest-ranked sentences (according to the methods' ranking schemes; see Section 6.1), we wanted to know whether one of the methods proposes rank-1 candidates with higher ratings, independently of the magnitude of the difference. A sign test indicated no statistically significant difference in the ratings of the rank-1 candidates between the two methods.

7 Conclusions and Future Work

The results show that both methods have the potential of extracting good-quality glossary sentences: the majority of the extracted sentences provide at least useful information about the domain concepts. However, both methods need improvement.

The rating experiment suggests that the concept of definition quality in a specialized domain is largely subjective (borderline acceptable agreement overall and α = 0.65 for rank-1 sentences). This calls for a modification of the evaluation methodology and for additional tests of consistency of ratings. The low agreement might be remedied by introducing a blocked design in which groups of judges would evaluate definitions of a small set of concepts with which they are most familiar, rather than a large set of concepts from various CL sub-areas.

An analysis of the extracted sentences and their ratings12 revealed that deep analysis reduces noise in sentence extraction. Bootstrapping, however, yields more candidate sentences with good or very good ratings. While in the present work pattern refinement was based only on verbs, we observed that the presence and position of (wh-)determiners and prepositions might also be informative. Further experiments are needed 1) to find out how much specificity can be allowed without blocking the patterns' productivity and 2) to exploit the complementary strengths of the methods by combining them.

Since both approaches use generic linguistic resources and preprocessing (POS tagging, named-entity extraction, etc.), they can be considered domain-independent. To our knowledge, this is, however, the first work that attempts to identify definitions of Computational Linguistics concepts. Thus, it contributes to evaluating pattern bootstrapping and deep analysis in the context of the definition extraction task in our own domain.

12 Not included in this paper for space reasons.

Acknowledgments

The C-Value algorithm was implemented by Mihai Grigore. We are indebted to our colleagues from the Computational Linguistics department and DFKI in Saarbrücken who kindly agreed to participate in the rating experiment as domain experts. We are also grateful to the reviewers for their feedback. The work described in this paper has been partially funded by the German Federal Ministry of Education and Research, projects TAKE (FKZ 01IW08003) and Deependance (FKZ 01IW11003).

References

P. Adolphs, S. Oepen, U. Callmeier, B. Crysmann, D. Flickinger, and B. Kiefer. 2008. Some Fine Points of Hybrid Natural Language Parsing. In Proceedings of the 6th LREC, pages 1380–1387.
V. Aleven, K. R. Koedinger, and K. Cross. 1999. Tutoring Answer Explanation Fosters Learning with Understanding. In Artificial Intelligence in Education, pages 199–206. IOS Press.
S. Bird, R. Dale, B. Dorr, B. Gibson, M. Joseph, M.-Y. Kan, D. Lee, B. Powley, D. Radev, and Y. F. Tan. 2008. The ACL Anthology Reference Corpus: A Reference Dataset for Bibliographic Research in Computational Linguistics. In Proceedings of the 6th LREC, pages 1755–1759.
C. Borg, M. Rosner, and G. Pace. 2009. Evolutionary Algorithms for Definition Extraction. In Proceedings of the 1st Workshop on Definition Extraction, pages 26–32.
L. Bozzato, M. Ferrari, and A. Trombetta. 2008. Building a Domain Ontology from Glossaries: A General Methodology. In Proceedings of the 5th Workshop on Semantic Web Applications and Perspectives, pages 1–10.
T. Brants. 2000. TnT – a statistical part-of-speech tagger. In Proceedings of ANLP, pages 224–231.
U. Callmeier. 2000. PET – A Platform for Experimentation with Efficient HPSG Processing Techniques. Natural Language Engineering, 6(1):99–108.
A. Copestake, D. Flickinger, I. A. Sag, and C. Pollard. 2005. Minimal Recursion Semantics: an Introduction. Research on Language and Computation, 3(2–3):281–332.
A. Copestake. 2009. Slacker semantics: why superficiality, dependency and avoidance of commitment can be the right way to go. In Proceedings of the 12th EACL Conference, pages 1–9.
B. Cramer and Y. Zhang. 2010. Constraining robust constructions for broad-coverage parsing with precision grammars. In Proceedings of the 23rd COLING Conference, pages 223–231.
I. Durán-Muñoz. 2010. eLexicography in the 21st century: New challenges, new applications, volume 7, chapter Specialised lexicographical resources: a survey of translators' needs, pages 55–66. Presses Universitaires de Louvain.

O. Etzioni, M. Cafarella, D. Downey, A.-M. Popescu, T. Shaked, S. Soderland, D. S. Weld, and A. Yates. 2005. Unsupervised named-entity extraction from the Web: an experimental study. Artificial Intelligence, 165:91–134.
I. Fahmi and G. Bouma. 2006. Learning to identify definitions using syntactic features. In Proceedings of the EACL Workshop on Learning Structured Information in Natural Language Applications, pages 64–71.
D. Flickinger, S. Oepen, and G. Ytrestøl. 2010. WikiWoods: Syntacto-semantic annotation for English Wikipedia. In Proceedings of the 7th LREC, pages 1665–1671.
D. Flickinger. 2002. On building a more efficient grammar by exploiting types. In Collaborative Language Engineering. A Case Study in Efficient Grammar-based Processing, pages 1–17. CSLI Publications, Stanford, CA.
K. Frantzi, S. Ananiadou, and H. Mima. 1998. Automatic recognition of multi-word terms: the C-value/NC-value method. In Proceedings of the 2nd European Conference on Research and Advanced Technology for Digital Libraries, pages 585–604.
R. Del Gaudio and A. Branco. 2009. Language independent system for definition extraction: First results using learning algorithms. In Proceedings of the 1st Workshop on Definition Extraction, pages 33–39.
R. Girju, A. Badulescu, and D. Moldovan. 2006. Automatic discovery of part-whole relations. Computational Linguistics, 32(1):83–135.
A. F. Hayes and K. Krippendorff. 2007. Answering the call for a standard reliability measure for coding data. Communication Methods and Measures, 1(1):77–89.
M. A. Hearst. 1992. Automatic acquisition of hyponyms from large text corpora. In Proceedings of the 14th COLING Conference, pages 539–545.
D. Jurafsky and J. H. Martin. 2000. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Prentice Hall Series in Artificial Intelligence. 2nd Ed. Online draft (June 25, 2007).
Z. Kozareva, E. Riloff, and E. Hovy. 2008. Semantic Class Learning from the Web with Hyponym Pattern Linkage Graphs. In Proceedings of the 46th ACL Meeting, pages 1048–1056.
C. D. Manning and H. Schütze. 1999. Foundations of statistical natural language processing. MIT Press, Cambridge, MA, USA.
M. P. Marcus, B. Santorini, and M. A. Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19:313–330.
S. Muresan and J. Klavans. 2002. A method for automatically building and evaluating dictionary resources. In Proceedings of the 3rd LREC, pages 231–234.
P. Pantel and M. Pennacchiotti. 2006. Espresso: Leveraging Generic Patterns for Automatically Harvesting Semantic Relations. In Proceedings of the 21st COLING and the 44th ACL Meeting, pages 113–120.
J. Pearson. 1996. The expression of definitions in specialised texts: A corpus-based analysis. In Proceedings of Euralex-96, pages 817–824.
C. Pollard and I. A. Sag. 1994. Head-Driven Phrase Structure Grammar. Studies in Contemporary Linguistics. University of Chicago Press, Chicago.
U. Schäfer and B. Kiefer. 2011. Advances in deep parsing of scholarly paper content. In R. Bernardi, S. Chambers, B. Gottfried, F. Segond, and I. Zaihrayeu, editors, Advanced Language Technologies for Digital Libraries, number 6699 in LNCS, pages 135–153. Springer.
U. Schäfer, B. Kiefer, C. Spurk, J. Steffen, and R. Wang. 2011. The ACL Anthology Searchbench. In Proceedings of ACL-HLT 2011, System Demonstrations, pages 7–13, Portland, Oregon, June.
Y. Shen, G. Zaccak, B. Katz, Y. Luo, and O. Uzuner. 2006. Duplicate Removal for Candidate Answer Sentences. In Proceedings of the 1st CSAIL Student Workshop.
A. Thanopoulos, N. Fakotakis, and G. Kokkinakis. 2002. Comparative evaluation of collocation extraction metrics. In Proceedings of the 3rd Language Resources Evaluation Conference, pages 620–625.
P. Velardi, R. Navigli, and P. D'Amadio. 2008. Mining the Web to Create Specialized Glossaries. IEEE Intelligent Systems, pages 18–25.
S. Walter. 2008. Linguistic description and automatic extraction of definitions from German court decisions. In Proceedings of the 6th LREC, pages 2926–2932.
W. Weiten, D. Deguara, E. Rehmke, and L. Sewell. 1999. University, Community College, and High School Students' Evaluations of Textbook Pedagogical Aids. Teaching of Psychology, 26(1):19–21.
E. Westerhout and P. Monachesi. 2008. Creating glossaries using pattern-based and machine learning techniques. In Proceedings of the 6th LREC, pages 3074–3081.
M. Wolska, U. Schäfer, and The Nghia Pham. 2011. Bootstrapping a domain-specific terminological taxonomy from scientific text. In Proceedings of the 9th International Conference on Terminology and Artificial Intelligence (TIA-11), pages 17–23. INALCO, Paris.
G. Ytrestøl, D. Flickinger, and S. Oepen. 2009. Extracting and annotating Wikipedia sub-domains. In Proceedings of the 7th Workshop on Treebanks and Linguistic Theories, pages 185–197.

Appendix

Rated glossary sentences for 'word sense disambiguation (WSD)' and 'mutual information (MI)'. As shown in Figure 2, for WSD, mode ratings of LSP sentences were higher, while for MI it was the other way round.

word sense disambiguation (WSD)

LSP sentences (mode of ratings | sentence):
5 | WSD is the task of determining the sense of a polysemous word within a specific context (Wang et al., 2006).
4 | Word sense disambiguation or WSD, the task of identifying the correct sense of a word in context, is a central problem for all natural language processing applications, and in particular machine translation: different senses of a word translate differently in other languages, and resolving sense ambiguity is needed to identify the right translation of a word.
3 | Unlike previous applications of co-training and self-training to natural language learning, where one general classifier is build to cover the entire problem space, supervised word sense disambiguation implies a different classifier for each individual word, resulting eventually in thousands of different classifiers, each with its own characteristics (learning rate, sensitivity to new examples, etc.).
3 | NER identifies different kinds of names such as "person", "location" or "date", while WSD distinguishes the senses of ambiguous words.
1 | This paper presents a corpus-based approach to word sense disambiguation that builds an ensemble of Naive Bayesian classifiers, each of which is based on lexical features that represent co-occurring words in varying sized windows of context.

DA sentences (mode of ratings | sentence):
5 | Word Sense Disambiguation (WSD) is the task of formalizing the intended meaning of a word in context by selecting an appropriate sense from a computational lexicon in an automatic manner.
{4,5} | Word Sense Disambiguation (WSD) is the process of assigning a meaning to a word based on the context in which it occurs.
2 | Word sense disambiguation (WSD) is a difficult problem in natural language processing.
{1,2} | word sense disambiguation, Hownet, , co-occurrence Word sense disambiguation (WSD) is one of the most difficult problems in NLP.
1 | There is a general concern within the field of word sense disambiguation about the inter-annotator agreement between human annotators.

mutual information (MI)

LSP sentences (mode of ratings | sentence):
5 | According to Fano (1961), if two points (words), x and y, have probabilities P(x) and P(y), then their mutual information, I(x,y), is defined to be I(x,y) = log₂(P(x,y) / (P(x)P(y))); informally, mutual information compares the probability of observing x and y together (the joint probability) with the probabilities of observing x and y independently (chance).
3 | Mutual information, I(v; c/s), measures the strength of the statistical association between the given verb v and the candidate class c in the given syntactic position s.
{1,3} | In this equation, pmi(i, p) is the pointwise mutual information score (Church and Hanks, 1990) between a pattern, p (e.g. consist-of), and a tuple, i (e.g. engine-car), and maxpmi is the maximum PMI score between all patterns and tuples.
2 | Note that while differential entropies can be negative and not invariant under change of variables, other properties of entropy are retained (Huber et al., 2008), such as the chain rule for conditional entropy which describes the uncertainty in Y given knowledge of X, and the chain rule for mutual information which describes the mutual dependence between X and Y.
2 | The first term of the conditional probability measures the generality of the association, while the second term of the mutual information measures the co-occurrence of the association.

DA sentences (mode of ratings | sentence):
4 | Mutual information (Shannon and Weaver, 1949) is a measure of mutual dependence between two random variables.
4 | 3 Theory Mutual information is a measure of the amount of information that one random variable contains about another random variable.
{1,3} | Conditional mutual information is the mutual information of two random variables conditioned on a third one.
1 | Thus, the mutual information is log₂ 5 or 2.32 bits, meaning that the joint probability is 5 times more likely than chance.
1 | Thus, the mutual information is log₂ 0, meaning that the joint is infinitely less likely than chance.
