TermSuite: Terminology Extraction with Term Variant Detection

Damien Cram, Béatrice Daille
LINA - UMR CNRS 6241, Université de Nantes, France
[email protected], [email protected]

Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics - System Demonstrations, pages 13–18, Berlin, Germany, August 7-12, 2016. © 2016 Association for Computational Linguistics

Abstract

We introduce TermSuite, a Java and UIMA-based toolkit to build terminologies from corpora. TermSuite follows the classic two steps of terminology extraction tools, the identification of term candidates and their ranking, but implements new features. It is multilingually designed, scalable, and handles term variants. We focus on the main components: UIMA Tokens Regex for defining term and variant patterns over annotations, and the grouping component for clustering terms and variants, which works at both the morphological and syntactic levels.

1 Introduction

Terminologies play a central role in NLP applications such as information retrieval, information extraction, or ontology acquisition. A terminology is a coherent set of terms that constitutes the vocabulary of a domain. It also reflects the conceptual system of that domain. A term can be a single-word term (SWT), such as rotor, or a complex term. Complex terms are either compounds such as broadband, or multi-word terms (MWT) such as frequency band. Terms are functional classes of lexical items used in discourse, and as such they are subject to linguistic variations such as modification or coordination.

As specialized domains are poorly covered by general dictionaries, Term Extraction Tools (TET) that extract terminology from corpora have been developed since the early nineties. This first generation of TET (Cabré et al., 2001) was monolingually designed, not scalable, and did not handle term variants, except for ACABIT (Daille, 2001) and FASTR (Jacquemin, 2001). Handling term variants has always been a difficulty for TET.

The current generation of TET improves on various aspects. As an example, TermoStat [1] deals with several Romance languages, processes texts of up to 30 megabytes, and proposes a first structuring based on lexical inclusion. TermSuite goes a step further: it is multilingually designed, scalable, and handles term variants. It is able to perform term extraction from languages that behave differently from a linguistic point of view. Complex terms in languages such as German and Russian are mostly compounds, while in Romance languages they are MWT. TermSuite extracts single-word terms and any kind of complex term. For some generic domains and some applications, large amounts of data have to be processed. TermSuite is scalable and has been applied to corpora of 1.1 gigabytes on a personal computer. Finally, TermSuite identifies a broad range of term variants, from spelling to syntactic variants, that may be used to structure the extracted terminology with various conceptual relations.

[1] http://termostat.ling.umontreal.ca/

Since the first TermSuite release (Rocheteau and Daille, 2011), several enhancements have been made. We developed UIMA Tokens Regex, a tool to define term and variant patterns using word annotations within the UIMA framework (Ferrucci and Lally, 2004), and a grouping tool to cluster terms and variants. Both tools are designed to treat all linguistic kinds of complex terms in a uniform way.

After a brief reminder of the general architecture of TermSuite, we present its term spotting tool UIMA Tokens Regex, its variant grouping tool, and the variant specifications we designed for English, French, Spanish, German, and Russian. Finally, we provide some figures and considerations about TermSuite resources and behaviour.

2 TermSuite architecture

TET are dedicated to computing the termhood and the unithood of a term candidate (Kageura and Umino, 1996). Two steps make up the core of the terminology extraction process (Pazienza et al., 2005):

1. Spotting: identification and collection of term-like units in the texts, mostly a subset of nominal phrases;

2. Filtering and sorting: filtering out the extracted term-like units that may not be terms, syntactically or terminologically; sorting the term candidates according to their unithood, their terminological degree, and their interest for the target application.

TermSuite adopts these two steps. Term-like units are collected with the following NLP pipeline: tokenization, POS tagging, lemmatization, stemming, splitting, and MWT spotting with UIMA Tokens Regex. They are ranked according to a popular termhood measure, the weirdness ratio (see Section 6). In order to improve the term extraction process and to provide a first structuring of the term candidates, a component dedicated to term variant recognition has been added. Indeed, term variant recognition improves the output of term extraction: the ranking of the term candidates is more accurate and more terms are detected (Daille and Blancafort, 2013).

Figure 2 shows the output of TermSuite within the graphical interface. The main window shows the terms ranked by termhood. A term candidate may group miscellaneous term variants. When a term is highlighted, the occurrences spotted by UIMA Tokens Regex are shown in the bottom window and the term features in the right window.

3 Spotting multiword terms

We designed a component in charge of spotting multi-word terms and their variants in text. It is based on UIMA Tokens Regex [2], a concise and expressive language coupled with an efficient rule engine. UIMA Tokens Regex allows the user to define rules over a sequence of UIMA annotations, i.e. over tokens of the corpus, each rule being in the form of a regular expression. Compared to RUTA (Kluegl et al., 2016), UIMA Tokens Regex operates only on annotations that appear sequentially, which is the case for word annotations. The occurrence recognition engine has thus been implemented as a finite-state machine with linear complexity.

[2] http://github.com/JuleStar/uima-tokens-regex/

3.1 Syntax

UIMA Tokens Regex syntax is formally defined by an ANTLR [3] grammar and inspired by Stanford TokensRegex (Chang and Manning, 2014).

[3] http://antlr.org/

Matchers. Before defining regular expressions over annotations, each annotation needs to be atomically matchable. That is why UIMA Tokens Regex defines a syntax for matchers. A matcher can be of three types:

    [Boolean Exp]     an expression matching the values of annotation attributes.
    /String RegExp/   a valid Java regular expression matching against the text covered by the annotation.
    .                 the dot matches any annotation.

The Boolean Exp within brackets is a combination of atomic boolean expressions, boolean operators & and |, and parentheses. An atomic boolean expression is of the form:

    property op literal

where property is an annotation feature defined in the TermSuite UIMA type system, op is one of ==, !=, <, <=, >, and >=, and literal is either a string, a boolean (true or false), or a number (integer or double).

Rules. Rules are named regular expressions that are defined as follows:

    term "rule name": TokensRegex;

where TokensRegex is a sequence of quantified matchers. The quantifiers are:

    ?       0 or 1
    *       0 or several
    +       at least 1
    {n}     exactly n
    {m,n}   between m and n
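To make the matcher and quantifier semantics concrete, here is a minimal Python sketch of matching a sequence of quantified matchers against a list of tokens. It is not the UIMA Tokens Regex engine or its API. Tokens are plain dicts, matchers are Python predicates, and `category_is`, `match_at`, and `find_all` are hypothetical helper names introduced only for this illustration. The sketch also assumes leftmost, non-overlapping matches, which the text does not specify.

```python
from typing import Callable, Dict, List, Optional, Tuple

Token = Dict[str, str]
Matcher = Callable[[Token], bool]
# A quantified matcher: (matcher, min repetitions, max repetitions or None if unbounded).
Quantified = Tuple[Matcher, int, Optional[int]]


def category_is(value: str) -> Matcher:
    """Hypothetical stand-in for the matcher [category=="value"]."""
    return lambda tok: tok["category"] == value


def match_at(pattern: List[Quantified], tokens: List[Token], start: int) -> Optional[int]:
    """Return the exclusive end index of a match starting at `start`, or None."""
    if not pattern:
        return start
    matcher, lo, hi = pattern[0]
    # Count how many consecutive tokens the first matcher accepts (capped by hi).
    count = 0
    while (start + count < len(tokens)
           and (hi is None or count < hi)
           and matcher(tokens[start + count])):
        count += 1
    # Greedy with backtracking: try the longest repetition first, then shorter ones.
    for n in range(count, lo - 1, -1):
        end = match_at(pattern[1:], tokens, start + n)
        if end is not None:
            return end
    return None


def find_all(pattern: List[Quantified], tokens: List[Token]) -> List[Tuple[int, int]]:
    """Collect leftmost, non-overlapping matches (an assumption of this sketch)."""
    spans, i = [], 0
    while i < len(tokens):
        end = match_at(pattern, tokens, i)
        if end is not None and end > i:
            spans.append((i, end))
            i = end
        else:
            i += 1
    return spans


# [category=="adjective"]+ [category=="noun"]
ADJ_PLUS_NOUN = [(category_is("adjective"), 1, None), (category_is("noun"), 1, 1)]
# [noun] [adposition] [determiner]? [noun]
NOUN_PREP_DET_NOUN = [
    (category_is("noun"), 1, 1),
    (category_is("adposition"), 1, 1),
    (category_is("determiner"), 0, 1),
    (category_is("noun"), 1, 1),
]

SENTENCE = [
    {"text": "the", "category": "determiner"},
    {"text": "large", "category": "adjective"},
    {"text": "offshore", "category": "adjective"},
    {"text": "wind", "category": "noun"},
    {"text": "turbine", "category": "noun"},
    {"text": "rotates", "category": "verb"},
]
ENERGY_OF_THE_WIND = [
    {"text": "energy", "category": "noun"},
    {"text": "of", "category": "adposition"},
    {"text": "the", "category": "determiner"},
    {"text": "wind", "category": "noun"},
]
```

Calling `find_all(ADJ_PLUS_NOUN, SENTENCE)` returns the token span covering "large offshore wind". The `?` quantifier is encoded as the bounds (0, 1), so `NOUN_PREP_DET_NOUN` matches the phrase with or without the determiner.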

3.2 Engine

The UIMA Tokens Regex engine parses the list of rules and creates a finite-state automaton for each of them. The engine feeds the automata with the sequence of UIMA annotations of the preprocessed input document. The engine implements the default behaviour of a regular expression engine: it is greedy, backtracking, picking out the first alternative, and impatient. Every time an automaton (i.e. a rule) matches, TermSuite generates a rule occurrence and stores the offset indexes of the matched text.

3.3 Application to terminology extraction

Example. In the TermSuite type system, the values of the feature category are the part-of-speech (POS) tags. Rule an below extracts MWT composed of one or several adjectives followed by a noun.

    term "an": [category=="adjective"]+ [category=="noun"] ;

Matcher predefinition. For the sake of both readability and reusability, UIMA Tokens Regex allows the user to predefine matchers. Thus, Rule an can be expressed concisely as A+ N using the matchers N and A:

    matcher N: [category=="noun"];
    matcher Vpp: [V & mood=="participle"
                    & tense=="past"];
    matcher A: [(Vpp | category=="adjective")
                    & lemma!="same"
                    & lemma!="other"];
    matcher C: /^(and|or)$/;
    matcher D: [category=="determiner"
                    & subCategory != "possessive"];
    matcher P: [category=="adposition"
                    & subCategory=="preposition"];
    term "an": A+ N ;
    term "npn": N P D? N ;
    term "acan": ~D A C A N ;

Rule acan extracts coordination variants that match the "adjective conjunction adjective noun" pattern, such as onshore and offshore locations. In Rule npn, the quantifier ? expresses an optional determiner, so the rule can extract both MWT: energy of wind and energy of the wind.

Features. The annotation features available in the TermSuite type system are category, subCategory, lemma, and stem, as well as inflectional features such as mood, tense, or case.

Lexical filtering. Matcher A above shows an example of lexical filtering, which prohibits occurrences of the listed lemmas in the pattern. For example, Rule an will not match the term candidate same energy.

Contextual filtering. Contextual POS are preceded by a tilde (~). Rule acan shows an example of contextual filtering. A determiner must occur for the pattern to be matched, but it is not part of the collected MWT.

4 Variant grouping

TermSuite is able to gather terms according to syntactic and morphological variant patterns that are defined with YAML syntax (Ben-Kiki et al., 2005).

4.1 Syntax

A variant rule states a set of conditions that two term candidates must fulfil to be paired. It consists of:

    a rule name: a string expression between double quotes ("), ended by a colon (:);
    a source pattern and a target pattern: sequences of matcher labels;
    a boolean expression: a logical expression on source and target term features, denoted by rule. The field rule is interpreted by a Groovy engine and must be defined in valid Groovy syntax.

Example. The example below is the simplest variant grouping rule defined for English.

    "S-I-NN-(N|A)":
      source: N N
      target: N N N, N A N
      rule: s[0]==t[0] && s[1]==t[2]

This rule is named S-I-NN-(N|A). It states that one term candidate (the source) must be of pattern N N, and the second term candidate (the target) of pattern N N N or N A N. The rule field states that s[0], the first noun of the source, has the same lemma as t[0], the first noun of the target. Likewise, s[1] and t[2] must share the same lemma. For example, this variant grouping rule is satisfied by the two terms turbine structure and turbine base structure.
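The pairing logic of such a rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: TermSuite evaluates the rule field with a Groovy engine, whereas here a Python lambda stands in for it, and the names `Word`, `VariantRule`, `pattern_of`, and `is_variant` are hypothetical and not part of TermSuite.

```python
from typing import Callable, List, NamedTuple, Sequence


class Word(NamedTuple):
    label: str  # matcher label of the word, such as "N" or "A"
    lemma: str


class VariantRule(NamedTuple):
    name: str
    source: str             # source pattern, e.g. "N N"
    targets: Sequence[str]  # accepted target patterns
    # Stand-in for the Groovy `rule` field: receives source and target lemmas.
    condition: Callable[[List[str], List[str]], bool]


def pattern_of(term: List[Word]) -> str:
    """Sequence of matcher labels of a term, e.g. "N N N"."""
    return " ".join(word.label for word in term)


def is_variant(rule: VariantRule, source: List[Word], target: List[Word]) -> bool:
    """True if (source, target) satisfies both the patterns and the condition."""
    if pattern_of(source) != rule.source:
        return False
    if pattern_of(target) not in rule.targets:
        return False
    # lemma is the default feature compared by s[i] == t[j].
    s = [word.lemma for word in source]
    t = [word.lemma for word in target]
    return rule.condition(s, t)


S_I_NN = VariantRule(
    name="S-I-NN-(N|A)",
    source="N N",
    targets=("N N N", "N A N"),
    condition=lambda s, t: s[0] == t[0] and s[1] == t[2],
)

turbine_structure = [Word("N", "turbine"), Word("N", "structure")]
turbine_base_structure = [
    Word("N", "turbine"), Word("N", "base"), Word("N", "structure"),
]
wind_turbine_structure = [
    Word("N", "wind"), Word("N", "turbine"), Word("N", "structure"),
]
```

With these definitions, `is_variant(S_I_NN, turbine_structure, turbine_base_structure)` holds, while the pair (turbine structure, wind turbine structure) is rejected because the first lemmas differ.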

Word features. The rule field expresses conditions on word features. The two main features used for grouping are lemma and stem. lemma is the default one, which is why stating s[0] == t[0] is equivalent to s[0].lemma == t[0].lemma. The rule "S-PI-NN-P" below makes use of the stem property. An example of grouping is effect of rotation and rotational effect, where rotational is derived from rotation.

    "S-PI-NN-P":
      source: N P N
      target: A N, N N
      rule: s[0]==t[1] && s[2].stem==t[0].stem

Morphological variants. TermSuite implements Compost, a multilingual splitter (Loginova Clouet and Daille, 2014) that decides whether a term composed of one graphic unit is a SWT or a compound. For compounds, it gives one or several candidate analyses ranked by their scores. We only keep the best split. The compound elements are reachable when TermSuite applies the variant grouping rules. The syntax of YAML variant rules allows the user to express morphological variants between two terms:

    "M-I-EN-N|A":
      source: N [compound]
      target: N N, A N
      rule: s[0][0]==t[0][0] && s[0][1] == t[1]

In the rule M-I-EN-N|A above, the tag [compound] after the source pattern states that the source has to be a morphosyntactic compound. In the rule field, words are accessed with two indexes: the first index refers to the POS position in the source or target pattern, and the second index gives access to the features of a compound component. As examples, this rule groups the two term candidates windfarm and windmill farm, and also hydropower and hydroelectric power.

4.2 Engine

Term variant grouping applies to term pairs with a complexity of O(n^2), where n is the number of term candidates extracted by UIMA Tokens Regex. TermSuite copes with this issue by pre-indexing each term candidate with all its pairs of single-word lemmas. For example, the term of length 3 offshore wind turbine has three indexing keys: (offshore, wind), (offshore, turbine), and (turbine, wind). The grouping engine operates over all terms sharing the same indexing key, for all indexing keys. Therefore, the O(n^2) complexity applies only to small subsets of term candidates, and the weight of variant grouping in the overall terminology extraction process is quite reasonable (see Section 7).

5 Language grammars

We define MWT spotting rules and variant grouping rules for the five languages supported by TermSuite: Fr, En, Es, De, and Ru. Table 1 shows the number of rules by language for MWT spotting and for term variant grouping.

          MWT   Variants
    en    43    41
    fr    35    37
    de    20    30
    es    62    40
    ru    18    16

Table 1: Numbers of rules provided in TermSuite

6 Ranking by termhood

Term candidates are ranked according to their termhood, which is measured with the weirdness ratio (WR). WR is the quotient of the relative frequencies of a term in the domain-specific corpus $\mathcal{C}$ and in a general language corpus $\mathcal{G}$:

\[
\mathrm{WR}(t, \mathcal{C}) = \frac{f_{norm}(t, \mathcal{C})}{f_{norm}(t, \mathcal{G})} \qquad (1)
\]

where $f_{norm}$ stands for the normalized frequency of a term in a given corpus, i.e. the average number of its occurrences every 1000 words, and $\mathcal{G}$ is a general language corpus.

6.1 General language corpus

The general language corpora used for computing WR are part of the compilation of newspapers provided by CLEF 2004 (Jones et al., 2005). These corpora cover numerous and miscellaneous topics, which is useful for obtaining a corpus representative of the general language. The general language corpora that we use to compute the frequencies of term candidates are:

    Newspaper        Lang   Size   Nb words
    Der Spiegel      De     382M   60M
    Glasgow Herald   En     302M   28M
    Agencia EFE      Es     1.1G   171M
    Le Monde         Fr     1.1G   82M
    Izvestia         Ru     66M    5.8M
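As a worked illustration of Equation (1), the following Python sketch computes the normalized frequencies and the weirdness ratio from raw occurrence counts. It is a sketch under stated assumptions and not TermSuite code: the function names are hypothetical, the counts in the usage note are invented for the example, and the `floor_occurrences` parameter is an assumption of this sketch, since the text does not say how a term that never occurs in the general corpus is handled.

```python
import math


def normalized_frequency(occurrences: float, corpus_size_in_words: int) -> float:
    """f_norm: average number of occurrences of a term every 1000 words."""
    if corpus_size_in_words <= 0:
        raise ValueError("corpus size must be positive")
    return 1000.0 * occurrences / corpus_size_in_words


def weirdness_ratio(
    occ_domain: int,
    size_domain: int,
    occ_general: int,
    size_general: int,
    floor_occurrences: float = 1.0,
) -> float:
    """WR(t, C) = f_norm(t, C) / f_norm(t, G), following Equation (1).

    ASSUMPTION: a term absent from the general corpus would give a zero
    denominator. This sketch replaces a zero count with `floor_occurrences`;
    this smoothing choice is hypothetical and not taken from the paper.
    """
    f_domain = normalized_frequency(occ_domain, size_domain)
    f_general = normalized_frequency(
        max(occ_general, floor_occurrences), size_general
    )
    return f_domain / f_general


def log_weirdness(wr: float) -> float:
    """Base-10 logarithm of WR."""
    return math.log10(wr)
```

For instance, with invented counts of 50 occurrences in a 330,000-word domain corpus and 42 occurrences in a 28-million-word general corpus, `weirdness_ratio(50, 330_000, 42, 28_000_000)` is about 101, so the term is roughly a hundred times more frequent in the domain corpus and its base-10 logarithm is slightly above 2.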

6.2 WR behaviour

Figure 1 gives the WR distribution on the English part of the domain-specific monolingual comparable corpora for Wind Energy [4] [EOL]. [EOL] is available for seven languages and has a minimum size of 330K words per language. The x-axis of Figure 1 is set to the base-10 logarithm of WR, hence a value of 2 means that the term candidate is 100 times more frequent in the specific corpus C than in G.

[4] http://www.lina.univ-nantes.fr/taln/maven/wind-energy.tgz

[Figure 1 plot: x-axis "Logarithmic of Weirdness Ratio - log(wr)", ticks from 0 to 5; y-axis ticks from 0 to 8,000]

Figure 1: Distribution of WR base-10 logarithm over all terms extracted by TermSuite on English [EOL].

We distinguish two sets of terms in Figure 1. The first one, starting around 0 and extending to log(wr) ≈ 2, contains the terms that are not domain specific since they occur in both the specialised and the general language corpora. The second set, from the peak at log(wr) ≈ 2 to the upper bound, contains both the terms that appear much more frequently in C than in G and the terms that never occur in G. Actually, the first peak at log(wr) ≈ 2 refers to terms that occur once in C and never in G, the second, lower peak refers to terms that occur twice in C and never in G, and so on.

We did not provide the distributions for other [EOL] languages nor for other corpora, because their WR distributions are similar. For all configurations, the first peak always appears at log(wr) ≈ 2 and the upper bound at log(wr) ≈ 5. As a result of the analysis of the WR distribution, we set 2 as the default value of the log(wr) threshold for accepting candidates as terms.

7 Performances

TermSuite operates on English [EOL] in 11 seconds with the following technical configuration: Ubuntu 14.04, 16 GB RAM, Intel(R) Core(TM) i7-4800MQ (4x2,2.7Ghz). We detail the execution times of each main component with the use of two part-of-speech taggers, TreeTagger [5] (TT) and Mate [6]:

                                      TT      Mate
    Tokenizer                         1.3s    idem
    POS/Lemmatiser                    2.4s    81s
    Stemmer                           0.67s   idem
    MWT Spotter                       4.8s    idem
    Morph. Compound Finder            0.14s   idem
    Syntactic Term Gatherer           0.23s   idem
    Graphical Term Gatherer           0.27s   idem
    Total (without UIMA overheads)    9.8s    88.5s

[5] http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/
[6] https://code.google.com/p/mate-tools/

Scalability. Time complexity is linear. Processing the Agencia EFE corpus (cf. Section 6.1), the biggest tested so far (171 million words), takes 101 minutes. This performance shows a very satisfactory vertical scalability in the context of smaller domain-specific corpora. No kind of parallelism has been implemented so far, not even Java multi-threading, which is the best opportunity for optimization if improved performance is required.

8 Release

TermSuite is a Java (7+) program. It can be used in three ways: the Java API, the command line API, or the graphical user interface shown in Figure 2. Its only third-party dependency is TreeTagger, which needs to be installed separately and referenced in the TermSuite configuration.

TermSuite is released under the Apache 2.0 licence. The source code and all its components and linguistic resources are released on Github [7]. The latest release, currently 2.1, is available on Maven Central [8]. All links, documentation, resources, and guides about TermSuite are available on its official website:

http://termsuite.github.io/

[7] https://github.com/termsuite/
[8] Maven group id is fr.univ-nantes.termsuite

Acknowledgements

TermSuite development is supported by ISTEX, French Excellence Initiative of Scientific and Technical Information.

Figure 2: TermSuite graphical user interface

References

Oren Ben-Kiki, Clark Evans, and Brian Ingerson. 2005. Yaml ain't markup language (yaml™) version 1.1. yaml.org, Tech. Rep.

M. Teresa Cabré, Rosa Estopà Bagot, and Jordi Vivaldi Platresi. 2001. Automatic term detection: A review of current systems. In D. Bourigault, C. Jacquemin, and M.-C. L'Homme, editors, Recent Advances in Computational Terminology, volume 2 of Natural Language Processing, pages 53–88. John Benjamins.

Angel X. Chang and Christopher D. Manning. 2014. TokensRegex: Defining cascaded regular expressions over tokens. Technical Report CSTR 2014-02, Department of Computer Science, Stanford University.

Béatrice Daille and Helena Blancafort. 2013. Knowledge-poor and knowledge-rich approaches for multilingual terminology extraction. In Proceedings, 13th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing), page 14p, Samos, Greece.

Béatrice Daille. 2001. Qualitative terminology extraction. In D. Bourigault, C. Jacquemin, and M.-C. L'Homme, editors, Recent Advances in Computational Terminology, volume 2 of Natural Language Processing, pages 149–166. John Benjamins.

David Ferrucci and Adam Lally. 2004. UIMA: an architectural approach to unstructured information processing in the corporate research environment. Natural Language Engineering, 10:327–348.

Christian Jacquemin. 2001. Spotting and Discovering Terms through Natural Language Processing. Cambridge: MIT Press.

Gareth J. F. Jones, Michael Burke, John Judge, Anna Khasin, Adenike Lam-Adesina, and Joachim Wagner. 2005. Multilingual Information Access for Text, Speech and Images: 5th Workshop of the Cross-Language Evaluation Forum, CLEF 2004, Bath, UK, September 15-17, 2004, Revised Selected Papers, chapter Dublin City University at CLEF 2004: Experiments in Monolingual, Bilingual and Multilingual Retrieval, pages 207–220. Springer Berlin Heidelberg, Berlin, Heidelberg.

Kyo Kageura and Bin Umino. 1996. Methods of automatic term recognition: a review. Terminology, 3(2):259–289.

Peter Kluegl, Martin Toepfer, Philip-Daniel Beck, Georg Fette, and Frank Puppe. 2016. UIMA ruta: Rapid development of rule-based information extraction applications. Natural Language Engineering, 22(1):1–40.

Elizaveta Loginova Clouet and Béatrice Daille. 2014. Splitting of Compound Terms in non-Prototypical Compounding Languages. In Workshop on Computational Approaches to Compound Analysis, COLING 2014, pages 11–19, Dublin, Ireland, August.

Maria Teresa Pazienza, Marco Pennacchiotti, and Fabio Massimo Zanzotto. 2005. Terminology extraction: An analysis of linguistic and statistical approaches. In S. Sirmakessis, editor, Proceedings of the NEMIS 2004 Final Conference, volume 185 of Studies in Fuzziness and Soft Computing, pages 225–279. Springer Berlin Heidelberg.

J. Rocheteau and B. Daille. 2011. TTC TermSuite - A UIMA Application for Multilingual Terminology Extraction from Comparable Corpora. In Proceedings of the 5th International Joint Conference on Natural Language Processing (IJCNLP 2011), Thailand, November. Asian Federation of ACL.
