TermSuite: Terminology Extraction with Term Variant Detection

Damien Cram, Béatrice Daille
LINA - UMR CNRS 6241, Université de Nantes, France

Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, System Demonstrations, pages 13–18, Berlin, Germany, August 7-12, 2016. © 2016 Association for Computational Linguistics

Abstract

We introduce TermSuite, a Java- and UIMA-based toolkit to build terminologies from corpora. TermSuite follows the classic two steps of terminology extraction tools, the identification of term candidates and their ranking, but implements new features. It is multilingually designed, scalable, and handles term variants. We focus on the main components: UIMA Tokens Regex for defining term and variant patterns over word annotations, and the grouping component for clustering terms and variants, which works at both the morphological and syntactic levels.

1 Introduction

Terminologies play a central role in NLP applications such as information retrieval, information extraction, or ontology acquisition. A terminology is a coherent set of terms that constitutes the vocabulary of a domain. It also reflects the conceptual system of that domain. A term can be a single-word term (SWT), such as rotor, or a complex term. Complex terms are either compounds, such as broadband, or multi-word terms (MWT), such as frequency band. Terms are functional classes of lexical items used in discourse, and as such they are subject to linguistic variations such as modification or coordination.

As specialized domains are poorly covered by general dictionaries, Term Extraction Tools (TET) that extract terminology from corpora have been developed since the early nineties. This first generation of TET (Cabré et al., 2001) was monolingually designed, was not scalable, and did not handle term variants, except for ACABIT (Daille, 2001) and FASTR (Jacquemin, 2001). Handling term variants has always been a pain in the neck for TET.

The current generation of TET improves on various aspects. As an example, TermoStat (http://termostat.ling.umontreal.ca/) deals with several Romance languages, processes texts of up to 30 megabytes, and proposes a first structuring based on lexical inclusion. TermSuite goes a step further: it is multilingually designed, scalable, and handles term variants. It is able to perform term extraction from languages that behave differently from a linguistic point of view. Complex terms in languages such as German and Russian are mostly compounds, while in Romance languages they are MWT. TermSuite extracts single-word terms and any kind of complex term. For some generic domains and some applications, large amounts of data have to be processed. TermSuite is scalable and has been applied to corpora of 1.1 gigabytes on a personal computer configuration. Finally, TermSuite identifies a broad range of term variants, from spelling to syntactic variants, that may be used to structure the extracted terminology with various conceptual relations.

Since the first TermSuite release (Rocheteau and Daille, 2011), several enhancements to TET have been made. We developed UIMA Tokens Regex, a tool to define term and variant patterns using word annotations within the UIMA framework (Ferrucci and Lally, 2004), and a grouping tool to cluster terms and variants. Both tools are designed to treat all linguistic kinds of complex terms in a uniform way.

After a brief reminder of TermSuite's general architecture, we present its term spotting tool UIMA Tokens Regex, its variant grouping tool, and the variant specifications we designed for English, French, Spanish, German, and Russian. Finally, we provide some figures and considerations about TermSuite resources and behaviour.

2 TermSuite architecture

TET are dedicated to computing the termhood and the unithood of a term candidate (Kageura and Umino, 1996). Two steps make up the core of the terminology extraction process (Pazienza et al., 2005):

1. Spotting: Identification and collection of term-like units in the texts, mostly a subset of nominal phrases;

2. Filtering and sorting: Filtering of the extracted term-like units that may not be terms, syntactically or terminologically; sorting of the term candidates according to their unithood, their terminological degree, and their interest for the target application.

TermSuite adopts these two steps. Term-like units are collected with the following NLP pipeline: tokenization, POS tagging, lemmatization, stemming, splitting, and MWT spotting with UIMA Tokens Regex. They are ranked according to the most popular termhood measure. In order to improve the term extraction process and to provide a first structuring of the term candidates, a component dedicated to term variant recognition has been added. Indeed, term variant recognition improves the outputs of term extraction: the ranking of the term candidates is more accurate and more terms are detected (Daille and Blancafort, 2013).

Figure 2 shows the output of TermSuite TET within the graphical interface. The main window shows the terms ranked according to termhood. A term candidate may group miscellaneous term variants. When a term is highlighted, the occurrences spotted by UIMA Tokens Regex are shown in the bottom window and the term features in the right window.

3 Spotting multiword terms

We designed a component in charge of spotting multi-word terms and their variants in text. It is based on UIMA Tokens Regex (http://github.com/JuleStar/uima-tokens-regex/), a concise and expressive language coupled with an efficient rule engine. UIMA Tokens Regex allows the user to define rules over a sequence of UIMA annotations, i.e. over tokens of the corpus, each rule being in the form of a regular expression. Compared to RUTA (Kluegl et al., 2016), UIMA Tokens Regex operates only on annotations that appear sequentially, which is the case for word annotations. The occurrence recognition engine has thus been implemented as a finite-state machine with linear complexity.

3.1 Syntax

The UIMA Tokens Regex syntax is formally defined by an ANTLR (http://antlr.org/) grammar and is inspired by Stanford TokensRegex (Chang and Manning, 2014).

Matchers. Before regular expressions can be defined over annotations, each annotation needs to be atomically matchable. That is why UIMA Tokens Regex defines a syntax for matchers. A matcher can be of three types:

    [Boolean Exp]     an expression matching the values of annotation attributes.
    /String RegExp/   a valid Java regular expression matched against the text covered by the annotation.
    .                 the dot matches any annotation.

The Boolean Exp within brackets is a combination of atomic boolean expressions, the boolean operators & and |, and parentheses. An atomic boolean expression is of the form:

    property op literal

where property is an annotation feature defined in the TermSuite UIMA type system, op is one of ==, !=, <, <=, >, and >=, and literal is either a string, a boolean (true or false), or a number (integer or double).

Rules. Rules are named regular expressions that are defined as follows:

    term "rule name": TokensRegex;

where TokensRegex is a sequence of quantified matchers. The quantifiers are:

    ?       0 or 1
    *       0 or more
    +       at least 1
    {n}     exactly n
    {m,n}   between m and n

3.2 Engine

The UIMA Tokens Regex engine parses the list of rules and creates a finite-state automaton for each of them. The engine feeds the automata with the sequence of UIMA annotations of the preprocessed input document. The engine implements the default behaviour of a regular expression engine: it is greedy, backtracking, picking out the first alternative, and impatient. Every time an automaton (i.e. a rule) matches, TermSuite generates a rule occurrence and stores the offset indexes of the matched text.

3.3 Application to terminology extraction

Example. In the TermSuite type system, the values of the feature category are the part-of-speech (POS) tags. Rule an below extracts MWT composed of one or several adjectives followed by a noun.

    term "an": [category=="adjective"]+ [category=="noun"] ;

Matcher predefinition. For the sake of both readability and reusability, UIMA Tokens Regex allows the user to predefine matchers. Thus, Rule an can be expressed concisely as A+ N using the matchers N and A:

    matcher N: [category=="noun"];
    matcher Vpp: [V & mood=="participle" & tense=="past"];
    matcher A: [(Vpp | category=="adjective") & lemma!="same" & lemma!="other"];
    matcher C: /^(and|or)$/;
    matcher D: [category=="determiner" & subCategory != "possessive"];
    matcher P: [category=="adposition" & subCategory=="preposition"];
    term "an": A+ N ;
    term "npn": N P D? N ;
    term "acan": ~D A C A N ;

Rule acan extracts coordination variants that [...]

Lexical filtering. Matcher A above shows an example of lexical filtering that prohibits occurrences of the listed lemmas in the pattern. For example, Rule an will not match the term candidate same energy.

Contextual filtering. Contextual POS are preceded by a tilde (~). Rule acan shows an example of contextual filtering. A determiner must occur for the pattern to be matched, but it will not be part of the collected MWT.

4 Variant grouping

TermSuite is able to gather terms according to syntactic and morphological variant patterns that are defined with YAML syntax (Ben-Kiki et al., 2005).

4.1 Syntax

A variant rule states a set of conditions that two term candidates must fulfil to be paired. It consists of:

    a rule name            a string expression between double quotes ("), ended by a colon (:),
    a source pattern and
    a target pattern       sequences of matcher labels,
    a boolean expression   a logical expression on source and target term features, denoted by rule. The field rule is interpreted by a Groovy engine and must be defined in valid Groovy syntax.

Example. The example below is the simplest variant grouping rule defined for English.

    "S-I-NN-(N|A)":
      source: N N
      target: N N N, N A N
      rule: s[0]==t[0] && s[1]==t[2]

This rule is named S-I-NN-(N|A).
