Tbxtools: a Free, Fast and Flexible Tool for Automatic Terminology Extraction
Total Page:16
File Type:pdf, Size:1020Kb
TBXTools: A Free, Fast and Flexible Tool for Automatic Terminology Extraction Antoni Oliver Merce` Vazquez` Universitat Oberta de Catalunya Universitat Oberta de Catalunya [email protected] [email protected] Abstract assisted translation, thesaurus construction, classi- fication, indexing, information retrieval, and also The manual identification of terminology text mining and text summarisation (Heid and Mc- from specialized corpora is a complex task Naught, 1991; Frantzi and Ananiadou, 1996; Vu et that needs to be addressed by flexible al., 2008). tools, in order to facilitate the construction The automatic terminology extraction tools de- of multilingual terminologies which are veloped in recent years allow easier manual term the main resources for computer-assisted extraction from a specialized corpus, which is a translation tools, machine translation or long, tedious and repetitive task that has the risk ontologies. The automatic terminology of being unsystematic and subjective, very costly extraction tools developed so far either use in economic terms and limited by the current a proprietary code or an open source code, available information. However, existing tools that is limited to certain software func- should be improved in order to get more consistent tionalities. To automatically extract terms terminology and greater productivity (Gornostay, from specialized corpora for different pur- 2010). poses such as constructing dictionaries, In the last few years, several term extraction thesauruses or translation memories, we tools have been developed, but most of them are need open source tools to easily integrate language-dependent: French and English –Fastr new functionalities to improve term selec- (Jacquemin, 1999) and Acabit (Daille, 2003); tion. This paper presents TBXTools, a Portuguese –Extracterm (Costa et al., 2004) and free automatic terminology extraction tool ExATOlp (Lopes et al., 2009); Spanish-Basque that implements linguistic and statistical –Elexbi (Hernaiz et al., 2006); Spanish-German methods for multiword term extraction. –Autoterm (Haller, 2008); Arabic (Boulaknadel The tool allows the users to easily iden- et al., 2008); Slovene and English –Luiz (Vin- tify multiword terms from specialized cor- tar, 2010); English and Italian –KX (Pianta and pora and also, if needed, translation candi- Tonelli, 2010); or English and German (Gojun et dates from parallel corpora. In this paper al., 2012). we present the main features of TBXTools Some tools are adapted to a specialized domain: along with evaluation results for term ex- TermExtractor (Sclano and Velardi, 2007), Ter- traction, both using statistical and linguis- Mine (Ananiadou et al., 2009) or BioYaTeA (Go- tic methodology, for several corpora. lik et al., 2013), for example. Specific tools have been developed to extract corpus-specific lexical 1 Introduction items comparing technical and non-technical cor- Automatic terminology extraction (ATE) is a rel- pus: TermoStat (Drouin, 2003). And other tools evant natural language processing task involv- are based on under-resourced language –TWSC ing terminology which has been used to iden- (Pinnis et al., 2012)–, or use semantic and con- tify domain-relevant terms applying computa- textual information –Yate (Vivaldi and Rodr´ıguez, tional methods (Oliver et al., 2007a; Foo, 2012). 2001). Automatic term extraction is a relevant task that Furthermore, there was TermSuite, which was can be useful for a wide range of tasks, such as developed during the European project TTC (Ter- ontology learning, machine translation, computer- minology Extraction, Translation Tools and Com- 473 Proceedings of Recent Advances in Natural Language Processing, pages 473–479, Hissar, Bulgaria, Sep 7–9 2015. parable Corpora). This project focused on Nowadays TBXTools does not have a user in- the automatic or semi-automatic acquisition of terface, but it will be developed in the future. At aligned bilingual terminologies for computer- present the extraction is done by means of sim- assisted translation and machine translation. To ple Python scripts calling the TBXTools class. In this end, automatic terminology extraction is part this paper we will see the code of some of these of the process of identifying terminologies from scripts. Several examples of scripts can be found comparable corpora (Blancafort et al., 2010). in the TBXTools distribution. This paper presents TBXTools, a free automatic term extraction tool which allows multiword terms 2.2 Statistical Terminology Extraction from specialized corpora to be identified easily, The statistical strategy for terminology extraction combining statistical and linguistic methods. is based on the calculation of n-grams, that is, the This paper is structured as follows: in the next combination of n words appearing in the corpus. section we present the TBXTools implementation After this calculation, filtering with stop words is and statistical and linguistic methods, as well as performed, eliminating all the candidates begin- the automatic finding of translation equivalents. ning or ending with a word from a list. Some nor- The experimental settings are described in detail malizations, such as case normalization, nesting in section 3. The paper concludes with some final detection and morphological normalization, can remarks and ideas for future work. be performed. Here we can see a complete code for terminology extraction: 2 TBXTools from TBXTools import * 2.1 Description e=TBXTools() e.load_sl_corpus("corpus.txt") TBXTools is a Python class that implements a set e.load_stop_l1("stop-eng.txt") of methods for ATE along with other utilities re- e.set_nmin(2) e.set_nmax(3) lated to terminology management. This tool has a e.statistical_term_extraction() free software licence and can be downloaded from e.case_normalization() SourceForge1. TBXTools is an evolution of pre- e.nesting_detection() e.load_morphopatterns("morpho-eng.txt") vious tools developed by the authors (Oliver and e.morpho_normalization() Vazquez,` 2007; Oliver et al., 2007b).The tool is e.save_term_candidates("candidates.txt") still under development but it already implements The code, as can be seen, is very simple. First a set of methods that permit the following func- of all, we import TBXTools and create a TBX- tionalities: Tools object, called e in the example. This code calculates the term candidates from the corpus in Statistical term extraction using n-grams and • the corpus.txt file using the stop words in the stop words and allowing some normalizations: stop.txt file. Afterwards, we fix the minimum n capital letter normalization, morphological nor- to 2 and the maximum to 3, in order to calculate malization and nested candidate detection. bigrams and trigrams term candidates. The next Linguistic term extraction using morpho- step in the code performs the statistical term ex- • syntactic pattern and a tagged corpus. Any ex- traction. After that, the following normalizations ternal tagger and a connection with a server run- are implemented: ning Freeling (Padro´ and Stanilovsky, 2012) are Case normalization: it tries to collapse the same implemented. The tool uses an easy formal- • ism for the expression of patterns, allowing the term appearing with a different case: for ex- use of regular expressions and lemmatization of ample, “interest rate”, “Interest Rate” and “IN- some of the components, if required. TEREST RATE” into “interest rate”. Detection of translation candidates in parallel Nesting detection: sometimes shorter term can- • • corpora, using a statistical strategy. didates are not terms in and of themselves, but are part of a longer term. For example, the bi- Automatic learning of morphological patterns • gram term candidate “national central” is a part from a list of reference terms. of the trigram term candidate “national central 1http://sourceforge.net/projects/tbxtools/ bank”. 474 Morphological normalization: it tries to col- from TBXTools import * • lapse several forms at the same time into a sin- import codecs e=TBXTools() gle form, for example, to collapse the plural e.load_tabtxt_corpus("corpus.txt") term candidate “economic policies” into “eco- e.load_stop_l2("stop.txt") nomic policy”. To perform this normalization, ... tr=e.get_statistical_translation_ a simple set of morphological patterns is used. candidate(t, candidates=5) After all these normalizations, the term candi- print(t,tr) dates are saved into the text file candidates.txt. ... The candidates are stored in descending fre- quency order and the value of frequency is also With this code we load a parallel corpus and a stored, as in the following example: list of stop words for the target language. Then we calculate the translation equivalent (tr) from 53 euro banknotes 51 central bank the term (t) and ask to return 5 candidates. The 47 payment institution output would as follows: 23 payment instrument payment institution entidad de pago: 2.3 Linguistic Terminology Extraction servicios de pago:dinero electronico:´ entidad de credito:Estado´ miembro: To perform linguistic terminology extraction we need a POS-tagged corpus. The tagging can be performed with any tagger offering In this example we want to find the translation lemma and POS tags. TBXTools can be eas- of “payment institution” and we get 5 candi- ily used with Freeling. In the following exam- dates in Spanish. In this case the first one is the ple we will perform linguistic extraction from correct one (“entidad de pago”). a tagged corpus (ct.txt) using a set of patterns (p) and storing the