Proceedings of the 6th International Workshop on Computational Terminology (COMPUTERM 2020), pages 72–79
Language Resources and Evaluation Conference (LREC 2020), Marseille, 11–16 May 2020
© European Language Resources Association (ELRA), licensed under CC-BY-NC

Supporting terminology extraction with dependency parses

Małgorzata Marciniak, Piotr Rychlik, Agnieszka Mykowiecka
Institute of Computer Science, Polish Academy of Sciences
Jana Kazimierza 5, 01-248 Warsaw, Poland
{mm, rychlik, agn}@ipipan.waw.pl

Abstract
A terminology extraction procedure usually consists of selecting candidates for terms and ordering them according to their importance for the given text or set of texts. Depending on the method used, a list of candidates contains different fractions of grammatically incorrect, semantically odd and irrelevant sequences. The aim of this work was to improve term candidate selection by reducing the number of incorrect sequences using a dependency parser for Polish.

Keywords: terminology extraction, candidates filtering, dependency parsing, prepositional phrases

1. Introduction

Extracting important domain related phrases is a part of very many NLP tasks such as …, indexing or text classification. Depending on a particular scenario, either more precise or more robust solutions are preferable. In our terminology extraction work, the aim is to prepare preliminary lists for building terminology resources or for text indexing. As manual checking of the prepared list is expensive, we are interested in a solution in which the top of the ordered candidates list is of the highest quality. One of the problems of all term extraction methods is the fact that some extracted sequences are incorrect. The sequences recognized using statistical methods or shallow grammars can sometimes be semantically odd or even incorrect at the syntactic level. We identify two types of errors. In the first, the shallow patterns cover only part of the phrase, e.g., resolution microscopy. In the second, parts of two independent phrases are merged into a sequence which does not form a coherent phrase, e.g., high resolution microscopy designed.

The aim of this work was to improve term candidate selection by reducing the number of incorrect sequences using a dependency parser for Polish. The answer to the question whether using a deep parser improves term identification would have been evident if the parsing were perfect. In such a case, at least all syntactically incorrect phrases (the errors of the second type mentioned above) would have been eliminated. However, errors of the first type are rather hard to identify on syntactic grounds. Moreover, dependency analysis classifies all modifiers as adjuncts; some of them are necessary term parts and indicate a particular subtype, e.g., basic income, while others are just modifications which specify frequency, intensity or quality features and do not constitute a part of a term, e.g., bigger income. That is why we propose a hybrid approach, not just dependency parsing.

In this paper, we will not discuss the computational aspects of dependency parsing. Although it can significantly slow down the extraction process, it might still be useful in cases where the potential user wants to improve the quality of the output. Besides, not all sentences of the processed text need to be analyzed by a dependency parser, but only those containing examined terms.

2. Related Work

Terminology extraction (sometimes under the name of keyword/keyphrase extraction) is quite a popular NLP task which is tackled by several tools available both as open access and commercial systems. An overview of biomedical terminology extraction is presented in (Lossio-Ventura et al., 2016); several keyphrase extraction systems described in the scientific literature were later presented in (Merrouni et al., 2019). The latter paper mainly describes solutions which were proposed within the area of … or artificial intelligence, while quite a lot of other approaches were proposed at more natural language processing and terminology extraction oriented venues, e.g., TermSuite (Cram and Daille, 2016) and Sketch Engine (Kilgarriff et al., 2014). Competitions in automatic term extraction have also been organised, e.g., at the SemEval workshop (Kim et al., 2010) or (Augenstein et al., 2017).

Terminology extraction systems can be divided into two groups. In one group, term extraction is treated as any other extraction task and is usually solved as a classification task using statistical methods, e.g., CRF (Zhang, 2008; Yu et al., 2012), or deep learning methods, e.g., (Zhang et al., 2016; Meng et al., 2017). The other approach, also adopted by the extraction tool we use (TermoPL), comes from collocation/phrase recognition work. Most of the term extraction systems which were developed along these lines follow the standard three phase procedure consisting of text preprocessing, potential term selection and term scoring. Text preprocessing depends on the source of the texts and the language in which they are written, and usually consists of filtering out unnecessary information, tokenization and sometimes POS tagging. As a lot of work was done for English, most approaches to candidate selection are based on selecting just word n-grams on the basis of simple frequency based statistics, e.g., (Rose et al., 2010), or on shallow grammars usually written as a set of regular expressions over POS tags, e.g., (Cram and Daille, 2016). Deep syntactic grammars are hardly used at all. One solution in which a dependency grammar is used to extract term candidates is described in Gamallo (2017). Dependency parses were also analyzed in Liu et al. (2018). All the above approaches to candidate selection are approximate (for different reasons), i.e. some term candidates are improper while others are omitted. In our work, we used shallow grammars with an additional specification of morphological value dependencies. As Polish is an inflectional language, this approach allows a lot of grammatically incorrect phrases to be filtered out while, at the same time, it is not limited to the sequences recognized properly by a deep parser for Polish, which for a specific domain might not have enough coverage.

The second step of the process – candidate ranking – is also carried out in very different ways. The frequency of a term or frequency based coefficients play the most prominent role. The most popular is tf-idf, but the C-value (Frantzi et al., 2000), used in this paper, has also become widely used. Unlike many other coefficients, the C-value takes into account not only the longest phrases or sequences of a given length, but also sequences included in other, longer, sequences. Although in some approaches the ranking procedure may be very complex, the idea of an additional phase of filtering improperly built pre-selected phrases, as suggested in our paper, is not very popular. There are, however, some solutions with a post-filtering phase, e.g. (Liu et al., 2015), in which the candidates are compared to different external terminology resources. This approach was not adopted in our work, as it cannot be used to identify new terms and it requires resources adequate for a specific domain. Another postulated modification of the overall processing schema is the final re-ranking procedure adopted in (Gamallo, 2017).

As in many other NLP tasks, evaluation of the terminology extraction results is both crucial and hard to perform. Evaluation can either be performed manually or automatically. In the first case, apart from the cost of the evaluation, the main problem is that sometimes it is hard to judge whether a particular term is domain related or comes from general language. Automatic evaluation requires terminological resources (which, even if they exist, are usually not complete), or preparing a gold standard labelled text (which has similar problems to direct manual evaluation). In statistical methods, the automatic evaluation procedure is usually used. In (Merrouni et al., 2019), the results of several systems show overall very poor recall (0.12–0.5) and a little higher precision (0.25–0.7), with the F1 measure usually below 0.3. Manual verification usually covers the top few hundred terms, which are judged by a domain expert to be domain related terms or not. In this approach, only the precision of the results can be evaluated at a reasonable cost. Gamallo (2017) reports a precision of 0.93 for the first 800 terms extracted from English biomedical texts using an approach similar to that adopted by us. In (Marciniak and Mykowiecka, 2014), the then existing version of TermoPL gave a precision of 0.85 for the 800 top positions of the terms list obtained from medical clinical reports. The recall counted on four reports (a very small dataset) was 0.8. The poorer results obtained for Polish data are mainly caused by the poor quality of the text, with many errors and missing punctuation marks (both commas and dots).

The results of the two groups of methods described above cannot be directly compared, but the good quality of the linguistically based methods is the reason why we want to develop this approach to terminology extraction.

3. Tools Description

3.1. TermoPL

As the baseline method of term selection for our experiments we chose the one implemented in the publicly available tool TermoPL (Marciniak et al., 2016). The tool operates on text tagged with POS and morphological feature values and uses a shallow grammar to select the term candidates. Grammar rules operate on forms, lemmas and morphological tags of the tokens. They thus allow for imposing the agreement requirements important for recognizing phrase borders in inflectional languages, such as Polish. TermoPL has a built-in grammar describing basic Polish noun phrases and also allows for defining custom grammars for other types of phrases. The program was originally developed for the Polish language, so it is capable of handling the relatively complex structural tagset of Polish (Przepiórkowski et al., 2012). It is also possible to redefine this tagset and process texts in other languages.

To eliminate sub-sequences with borders crossing strong collocations, the NPMI (Bouma, 2009) based method of identifying the proper sub-sequences was proposed (Marciniak and Mykowiecka, 2015). According to this method, subphrase borders are subsequently identified between the tokens with the smallest NPMI coefficient (counted for bigrams on the basis of the whole corpus). So, if a bigram constitutes a strong collocation, the phrase is not divided in this place, and this usually blocks the creation of semantically odd nested phrases.

The final list of terms is ordered according to the C-value adapted for taking one word terms into account. The C-value is a frequency dependent coefficient, but it takes into account not only the occurrences of the longest phrase; it also counts occurrences of its sub-sequences.

3.2. COMBO

In our experiments we use a publicly available Polish dependency parser – COMBO (Rybak and Wróblewska, 2018). COMBO is a neural net based jointly trained tagger, lemmatizer and dependency parser. It assigns labels based on features extracted by a biLSTM network. The system uses both fully connected and dilated convolutional neural architectures. The parser is trained on the Polish Dependency Bank (http://zil.ipipan.waw.pl/PDB). In our work we used the version trained on PDB annotated with a set of relations extended specifically for Polish (http://zil.ipipan.waw.pl/PDB/DepRelTypes).

4. Data Description

The experiment was conducted on the textual part of economics articles taken from Polish Wikipedia. It was collected in 2011 as part of the Polish Nekst project (POIG.01.01.02-14-013/10). The data contains 1219 articles that have economics related headings, together with those linked to them. The data was processed by the Concraft tagger (Waszczuk, 2012), which uses the Morfeusz morphological dictionary and a guesser module for unknown words. The processed text has about 460K tokens in around 20,000 sentences. There are about 46,600 different token types of 17,900 different lemmas or known word forms within the text.
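The C-value ranking described in Sec. 3.1. can be illustrated with a short sketch. The code below implements the basic C-value of Frantzi et al. (2000) over a toy frequency dictionary; it is our simplified illustration, not TermoPL's implementation, and it omits TermoPL's adaptation for one-word terms.

```python
from math import log2

def c_value(phrases):
    """Basic C-value (Frantzi et al., 2000) for candidate phrases.

    `phrases` maps a phrase (a tuple of tokens) to its corpus frequency.
    A phrase nested in longer candidates is penalized by the average
    frequency of those longer candidates.
    """
    # For every phrase, find the longer candidate phrases containing it.
    containers = {p: [] for p in phrases}
    for longer in phrases:
        for shorter in phrases:
            if len(shorter) < len(longer) and any(
                longer[i:i + len(shorter)] == shorter
                for i in range(len(longer) - len(shorter) + 1)
            ):
                containers[shorter].append(longer)

    scores = {}
    for p, freq in phrases.items():
        nested_in = containers[p]
        if not nested_in:
            scores[p] = log2(len(p)) * freq
        else:
            avg = sum(phrases[q] for q in nested_in) / len(nested_in)
            scores[p] = log2(len(p)) * (freq - avg)
    return scores
```

For example, a two-word candidate seen 5 times but nested in a single longer candidate seen 3 times scores log2(2) · (5 − 3) = 2.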

NPP : $NAP NGEN*;
NAP[agreement] : AP* N AP*;
NGEN[case = gen] : NAP;
AP : ADJ | PPAS | ADJA DASH ADJ;
N[pos = subst, ger];
ADJ[pos = adj];
ADJA[pos = adja];
PPAS[pos = ppas];
DASH[form = "-"];

Figure 1: The built-in grammar. It represents noun phrases comprised of nominal phrases built from nouns or gerunds, optionally modified by adjectival phrases located either before or after them. Nominal phrases can be modified by any number of nominal phrases in the genitive.

NP : NPPINST | PPP | NPAPGEN;
PPP : NPAPGEN PREP NAP+;
NPPINST : NPAPGEN NINST NGEN*;
NPAPGEN : $NAP NGEN*;
NAP[agreement] : AP* N AP*;
NGEN[case = gen] : NAP;
NINST[case = inst] : NAP;
AP : ADJ | PPAS | ADJA DASH ADJ;
N[pos = subst, ger];
ADJ[pos = adj];
ADJA[pos = adja];
PPAS[pos = ppas];
DASH[form = "-"];
PREP[pos = prep];

Figure 2: The final grammar, with added modifiers being noun phrases in the instrumental case and prepositional phrases.

5. Phrase identification

A selection of candidate phrases is performed by a shallow grammar defined over lemmatized and morphologically annotated text. TermoPL recognizes the maximal sequences of tokens which meet the conditions set out in a grammar. The built-in grammar, see Fig. 1, recognizes noun phrases where the head element can be modified by adjectives appearing before or after it, such as międzynarodowe stosunki gospodarcze 'international economic relations'. All these elements must agree in number, case and gender, which is marked in the rules by the agreement parameter. The noun phrase can be modified by another noun phrase in the genitive, e.g., ubezpieczenie [odpowiedzialności cywilnej]gen 'insurance of civil responsibility'. All these elements can be combined, e.g., samodzielny publiczny zakład [opieki zdrowotnej]gen 'independent public health care'. The $ character marks a token or a group of tokens which should be replaced by their nominal forms when base forms are generated. It does not affect the type of phrase being recognized. In the economics texts, the built-in grammar collects 61,966 phrases when the NPMI driven selection method is used (without the NPMI it collects 82,930 phrases).

The built-in grammar does not cover noun phrases modified by prepositional phrases, which quite often create important terms, e.g., spółka z ograniczoną odpowiedzialnością 'limited liability company'. This decision was made because it was difficult to recognize the role of a prepositional phrase in a sentence. A phrase very similar to the one above, e.g., umowa spółki z uniwersytetem 'a company agreement with the university' (word for word: 'agreement' 'company' 'with' 'university'), should not lead to the conclusion that spółka z uniwersytetem creates a term – both nouns firma 'company' and uniwersytet 'university' are complements of the noun umowa 'agreement'. If the only criterion is a shallow grammar, we are unable to distinguish between such uses.

When analyzing the results obtained by the grammar defined in Fig. 1, we realised that some nominal phrases can have a noun phrase complement in the instrumental case. This applies to phrases such as handel [ropą naftową]instr 'trading in petroleum', gospodarka [nieruchomościami]instr 'management of real estate', opodatkowanie [podatkiem dochodowym]instr 'taxation of income'. But a similar problem as for prepositional phrases occurs for noun complements in the instrumental case, as we don't know if they are complements of a preceding nominal phrase or if they refer to another element in the sentence. For example: rząd obłożył [papierosy] [akcyzą]instr (word for word: 'government' 'charged' 'cigarettes' 'excise duty') 'the government charged cigarettes with excise duty', where akcyzą 'excise duty' is the complement of obłożył 'charged' and not of papierosy 'cigarettes'.

Both constructions described above, i.e. prepositional modifiers and noun complements in the instrumental case, are taken into account in the grammar given in Fig. 2. It collects 72,758 phrases when the NPMI driven selection method is used, which is over 10,000 more than for the built-in grammar (without the NPMI the grammar collects 113,687 phrases). Although the number of new terms is high, there are only a couple of new top candidates on our list. The top 100 terms contain three correct phrases with prepositional modifiers: spółka z ograniczoną odpowiedzialnością 'limited liability company', ustawa o rachunkowości 'accounting act' and podatek od towarów 'tax on goods', and no term with a noun complement in the instrumental case. The first such phrase, prawo o publicznym obrocie papierami wartościowymi 'law on public trading of securities', is in position 637.

As we wanted to know how productive the above grammatical constructions are, we have defined two grammars describing them alone. This allows us to check how many phrases might be introduced to the term candidate list by these constructions.
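To make the agreement requirement of the NAP rule concrete, the following toy Python check (our simplification, not TermoPL's rule engine) accepts an AP* N AP* sequence only if every adjectival element agrees with the single nominal head in number, case and gender:

```python
def nap_match(tokens):
    """Toy check of NAP[agreement]: AP* N AP*.

    `tokens` is a list of dicts with keys 'pos', 'num', 'case', 'gen'.
    Exactly one noun/gerund head is required; every other token must be
    adjectival and agree with the head in number, case and gender.
    """
    heads = [i for i, t in enumerate(tokens) if t["pos"] in ("subst", "ger")]
    if len(heads) != 1:
        return False
    head = tokens[heads[0]]
    for i, t in enumerate(tokens):
        if i == heads[0]:
            continue
        if t["pos"] not in ("adj", "ppas"):   # AP: adjectives or participles
            return False
        if any(t[f] != head[f] for f in ("num", "case", "gen")):
            return False
    return True
```

For a phrase like międzynarodowe stosunki gospodarcze, where all tokens share plural number and nominative case, the check succeeds; changing the case of one adjective makes it fail, which is how disagreeing sequences are kept out of the candidate list.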

type                       absolute   relative
correct term                    452       0.66
incorrect modification          114       0.17
incorrect – other reason        120       0.17

Table 1: Manual evaluation of the top phrases with a preposition modifier (number of phrases, absolute and relative).

type                       absolute   relative
correct term                     84       0.76
incorrect modification           10       0.09
incorrect – other reason         16       0.14

Table 3: Manual evaluation of selected phrases with an instrumental modifier (number of phrases, absolute and relative).

frequency   absolute   relative
30-38              5       0.01
20-29              5       0.01
10-19             29       0.04
5-9              101       0.15
3-4              244       0.35
2                302       0.44

Table 2: Number of top phrases with a preposition modifier in different frequency groups.

frequency   absolute   relative
10-16              4       0.04
5-9               11       0.10
3-4               20       0.18
2                 75       0.68

Table 4: Number of top phrases with an instrumental modifier in different frequency groups.

The first dedicated grammar (NPPP) defines nominal phrases with a prepositional modifier. It consists of all the rules given in Fig. 2 except the first, third and seventh one. When the NPMI method is used, the grammar selects 22,150 terms. We evaluated all phrases which occurred at least 2 times and have a C-value of at least 3.0, i.e. 686 phrases. The results are given in Tab. 1 – 66.6% of them are correct phrases (i.e. for these phrases the precision is 0.666), 16.5% are phrases where a prepositional phrase does not modify the preceding noun phrase, and for 16.9% the reason for not accepting the phrase is different. Many incorrect phrases are incomplete, such as różnica między sumą przychodów uzyskanych 'difference between the sum of revenues obtained', which is a part of różnica między sumą przychodów uzyskanych z tytułu … a kosztami uzyskania przychodów 'difference between the sum of revenues from … and tax deductible costs'.

The second grammar (NPInst) defines nominal phrases modified by noun phrases in the instrumental case. It consists of all the rules given in Fig. 2 except the first and the second one. It selects fewer phrases, namely 1390. As there were only 44 phrases with a C-value of at least 3.0, we evaluated all 110 phrases which occurred at least twice. The results are given in Tab. 3. An example of an incorrect phrase recognised by the grammar is budownictwo kosztorysantem ('architecture' 'estimator'), which actually is built up from two phrases and occurred three times in sentences similar to the following: w [budownictwie kosztorysantem] jest rzeczoznawca 'in architecture, an appraiser is an estimator'. Tab. 4 gives the frequency of the evaluated phrases. The statistics show that such constructions are not common. Moreover, we observe that only a small number of nouns and gerunds (acting as nouns) were used to create valid phrases in our data. These are: obrót 'trading' (21), zarządzanie 'management' (21), handel 'trade' (9), opodatkowanie 'taxation' (5), gospodarka/gospodarowanie 'management' (4).

6. Filtering phrases with COMBO

In the postprocessing phase, we match the phrases found by the NPPP and NPInst TermoPL grammars to fragments of the dependency trees generated by COMBO for all sentences containing these phrases. We imposed a few simple constraints that must be satisfied by the matched tree fragments. The first one concerns prepositional phrases. If the preposition in the phrase being examined is associated with a part of the sentence that is not included in the phrase, then this phrase is certainly not a valid term. In other words, no link in the dependency tree is allowed to connect something from outside the matched fragment with the preposition that lies inside this fragment. Examples of a good and a bad prepositional phrase are shown in Fig. 3 and 4, respectively. The first of these phrases, spółka z ograniczoną odpowiedzialnością 'limited liability company', is a good example of an economics term. The second one, przedsiębiorstwo pod własną firmą, is a nonsense phrase with the word for word translation 'enterprise under own company', which is equally nonsensical.

Figure 3: Dependency graph corresponding to the correct prepositional phrase spółka z ograniczoną odpowiedzialnością 'company with limited liability'.

It turns out that there are phrases that in some sentences are good candidates for terms, and in others not. The string podatek od dochodu, which has the word for word translation 'tax from income', can be a noun modifier, e.g., [podatek od dochodu] należy zapłacić w terminie do … 'income tax must be paid by …', or it can be a valency constraint in the following sentence:
Wyliczając kwotę do zapłaty należy odjąć [podatek] [od dochodu] 'When calculating the amount to be paid, tax must be deducted from the income.' In the first example (see Fig. 5), the term podatek od dochodu is accepted by the constraint we mentioned above. In the second example, the same constraint rejects this phrase as a term (see Fig. 6).

Figure 4: Dependency graph corresponding to the incorrect prepositional phrase przedsiębiorstwo pod własną firmą, from prowadzi przedsiębiorstwo pod własną firmą 'runs an enterprise under its own (company) name'.

Figure 5: Accepted term podatek od dochodu 'tax from income'.

Figure 6: Rejected term podatek od dochodu, from odjąć podatek od dochodu 'deduct tax from income'.

The second constraint we impose on dependency graphs concerns the consistency of the matched fragment. A fragment of the graph corresponding to the examined phrase is consistent if, passing from the node considered as the head of the phrase, we pass through all its nodes. Fig. 7 presents an inconsistent graph for the phrase koszty dojazdów środkami (with the word for word translation 'travel costs by means'), which is syntactically correct, but without sense. However, when we consider the broader context, the phrase pokrycie kosztów dojazdów środkami komunikacji miejscowej 'coverage of travel costs by local transport', we obtain a phrase that makes sense and has a consistent graph. The graph for the phrase podatek od dochodu depicted in Fig. 6 is also inconsistent with this constraint (although it would anyway be rejected by the first rule described above).

Figure 7: Graph inconsistency for the phrase koszty dojazdów środkami, within pokrycie kosztów dojazdów środkami komunikacji miejscowej 'coverage of costs of travel by means of local transport'.

Finally, we eliminate graphs that correspond to some types of truncated phrases. They are depicted in Fig. 8-10. Fig. 8 shows an example in which a named entity phrase should not be divided. The phrase Ustawa o Funduszu Kolejowym 'Act on the Railway Fund' may not be shortened to the phrase Ustawa o Funduszu 'Act on the Fund', although it is still acceptable at the syntactic level. The other two examples show situations in which an adjective or participle, modifying an object or a complement, should not be cut from the phrase on its right side, as they are usually necessary components of terms. The phrase podatek dochodowy od osób fizycznych 'personal income tax' (see Fig. 9) cannot be shortened to podatek dochodowy od osób. Similarly, the phrase opodatkowanie podatkiem dochodowym 'taxation on income' (see Fig. 10) cannot be shortened to opodatkowanie podatkiem. Sometimes, truncated phrases can be identified by their inconsistent graphs, as shown in Fig. 7.

Figure 8: Truncated named entity (ne) phrase: Ustawa o Funduszu Kolejowym 'Act on the Railway Fund'.

Figure 9: Truncated phrase podatek dochodowy od osób, from podatek dochodowy od osób fizycznych 'personal income tax'.

Figure 10: Rejected term opodatkowanie podatkiem, from opodatkowanie podatkiem dochodowym 'taxation with income tax'.

We can now use all the above constraints to filter phrases. If a phrase is supported by more than 50% of its dependency trees (which means that these trees satisfy all constraints), it is considered a good term candidate. Otherwise, it is rejected.

7. Evaluation of the method

We compare the manual evaluation of all phrases obtained by the two separate grammars with the results of the filtering described in Sec. 6. The filtering gives binary information: correct/incorrect phrase.
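The first constraint of Sec. 6. can be sketched as follows, assuming the parse is encoded as a simple child-to-head index map; the encoding and the function name are our illustration, not TermoPL's implementation or COMBO's API:

```python
def preposition_ok(heads, span, prep):
    """Check the first dependency constraint for a candidate phrase:
    no dependency link may connect the preposition inside the matched
    fragment with a word outside it.

    `heads` maps each non-root token index to the index of its head,
    `span` is the set of token indices covered by the phrase,
    `prep` is the index of the preposition inside the phrase.
    """
    # The preposition must not depend on a word outside the phrase...
    if prep in heads and heads[prep] not in span:
        return False
    # ...and no word outside the phrase may depend on the preposition.
    return all(h != prep for child, h in heads.items() if child not in span)
```

With a plausible parse of the Fig. 3 example (the preposition z depending on spółka inside the phrase) the candidate passes; with a plausible parse of the Fig. 4 sentence (pod attached to the verb prowadzi, outside przedsiębiorstwo pod własną firmą) it is rejected.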

We assume that the result is proper if an incorrect phrase is manually classified as incorrect modification or incorrect 'other reason'. Tables 5-6 give the evaluation of phrases with prepositional modifiers and phrases with instrumental modifiers respectively, as classified by the dependency parser. The results depicted there show that the proposed approach is not precise enough. For phrases with prepositional modifiers, 74% of the correct phrases are correctly classified as valid terms, but about twenty percent of the proper terms are discarded. There are even more incorrect sequences which are classified as correct (about a quarter). For instrumental modifications, there are far fewer incorrect sequences accepted as good, while the percentage of correct terms which are classified as bad is even higher than for prepositional modifiers. The answer to the question whether these results are due to our classification strategy not being good enough or to the insufficient quality of the parser needs further research.

                      COMBO
Manual eval.    correct   incorrect
correct             365          87
incorrect           131         103

Table 5: Comparison of the manual evaluation of the phrases with a preposition modifier with the dependency parser filtering. The results achieved for classifying a phrase as correct by the parser: precision=0.74, recall=0.81.

                      COMBO
Manual eval.    correct   incorrect
correct              50          34
incorrect             4          22

Table 6: Evaluation of the phrases with an instrumental modifier filtered by the dependency parser. The results achieved for classifying a phrase as correct by the parser: precision=0.93, recall=0.60.

8. Results

In this section, we analyze the results of TermoPL using the extended grammar given in Fig. 2. The set of phrases for which the C-value is at least 3.0 is called hereinafter top3.0. For the plain method of term selection (without NPMI), the top3.0 consists of 5,935 terms. Phrases with prepositional modifiers are 11.9% of the top3.0 set; 7.6% of them are correct phrases and 4.3% are incorrect ones.

Then, we test if the NPMI method can prevent us from introducing incorrect phrases with prepositional modifiers into the top3.0 set. For some phrases, the NPMI method reduces their C-value, which means they are pushed towards the end of the list. Moreover, some phrases may not even appear on the term list. The top3.0 set for TermoPL with NPMI consists of 5,078 phrases. The statistics are given in Tab. 7, where the 'out top3.0' column indicates the number of phrases whose C-value fell below the 3.0 level, and the 'out' column indicates the number of phrases which disappeared from the list. This method introduced 487 prepositional phrases into the top3.0, which is 9.5%; 7.7% of them are correct phrases and 1.8% are incorrect ones. Tab. 8 gives the location of the 391 correct phrases on the top3.0 list ordered by C-value.

type               in top3.0   out top3.0   out
correct term             391           27    34
incorr. modif.            67           15    32
incorr. – other           59           33    38
total                    487           75   104

Table 7: Phrases with a preposition modifier – with NPMI.

C-value     position    absolute   relative
<50-278)    1-78               1       0.2%
<20-50)     79-343             9       2.3%
<10-20)     344-899           31       7.9%
<5-10)      900-2129         113      28.9%
<3-5)       2130-5078        237      60.6%

Table 8: Distribution of the correct phrases with prepositional modifiers in the top3.0 of TermoPL with NPMI.

type                   in top3.0
correct-accepted             365
incorrect-accepted           131
correct-deleted               86
incorrect-deleted            101

Table 9: Phrases with a preposition modifier filtered by the dependency parser from the plain TermoPL results.

As we expected, the application of the NPMI method in candidate phrase recognition reduces the number of incorrect phrases in the top3.0. In our experiment, they drop from 4.3% to 1.8% of the whole top3.0. It slightly declines the share of phrases with prepositional modifiers on the top3.0 list, from 11.9% to 9.5%. Moreover, it seems that this method works better for phrases which are incorrect because of 'other reasons' (e.g. truncated ones), as it eliminates 71 of the 130 such phrases in the top3.0 (i.e. 54.6%), while for incorrect modifiers it eliminates 47 of 114 phrases, which is 41.2%.

For nominal phrases with instrumental modifiers, only 37 correct and 2 incorrect (because of 'other reasons') phrases were on the top3.0 list generated without NPMI. Statistics are given in Tab. 3. This gives 0.66% of all top3.0 phrases, of which 0.62% are correct new phrases. When the NPMI method is used, the top3.0 list contains 30 correct and 2 incorrect phrases with instrumental modifiers, which gives 0.63% of all top3.0 phrases, including 0.59% correct new phrases.

To assess the usefulness of the dependency parsing, we checked how many phrases with prepositional modifiers were accepted or deleted from the top3.0 of the TermoPL

The results for prepositional modifiers are given in Tab. 9. Filtering prepositional phrases with the dependency grammar removes 187 phrases, 101 of which were incorrect (i.e., their removal was justified).

9. Conclusion

The purpose of this work was to test whether dependency parsing can be useful in filtering out incorrectly constructed phrases in automatic terminology extraction. We tested this approach on phrases containing prepositional modifiers and nominal modifiers in the instrumental case.

We realised that noun phrases with prepositional modifiers are important in the terminology extraction task, as they constitute about 10% of the top term phrases. Phrases with instrumental case modifiers are much less important, as they make up only 0.65% of the top phrases. However, it is worth noting that there are only two incorrect such phrases among the top3.0: these constructions are much rarer, and the most frequent ones usually form correct terms.

About 6% of the phrases on the top3.0 list generated without applying NPMI and filtered with the dependency parser are correct prepositional phrases, and about 2% are incorrect ones. These results seem slightly worse than those obtained by the NPMI method alone. It turns out that dependency parsing filters out an additional 43 incorrect phrases from the top3.0 list when the NPMI method is applied. Unfortunately, it also filters out 85 correct phrases. This observation requires further investigation.

As the quality and efficiency of dependency parsing are constantly improving, we hope that these methods will better support the selection of term candidates. We also plan to check how the proposed filtering methods work on terms with other syntactic structures.

Acknowledgements

Work financed as part of the investment in the CLARIN-PL research infrastructure funded by the Polish Ministry of Science and Higher Education.

Bibliographical References

Augenstein, I., Das, M., Riedel, S., Vikraman, L., and McCallum, A. (2017). SemEval 2017 Task 10: ScienceIE – extracting keyphrases and relations from scientific publications. In Proceedings of the 11th International Workshop on Semantic Evaluation, pages 546–555. Association for Computational Linguistics.

Bouma, G. (2009). Normalized (pointwise) mutual information in collocation extraction. In Christian Chiarcos, et al., editors, From Form to Meaning: Processing Texts Automatically, Proceedings of the Biennial GSCL Conference, pages 31–40. Tübingen: Gunter Narr Verlag.

Cram, D. and Daille, B. (2016). TermSuite: Terminology extraction with term variant detection. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics — System Demonstrations, pages 13–18. Association for Computational Linguistics.

Frantzi, K., Ananiadou, S., and Mima, H. (2000). Automatic recognition of multi-word terms: the C-value/NC-value method. International Journal on Digital Libraries, 3:115–130.

Gamallo, P. (2017). Citius at SemEval-2017 Task 2: Cross-lingual similarity from comparable corpora and dependency-based contexts. In Proceedings of the 11th International Workshop on Semantic Evaluation, pages 226–229. Association for Computational Linguistics.

Kilgarriff, A., Baisa, V., Bušta, J., Jakubíček, M., Kovář, V., Michelfeit, J., Rychlý, P., and Suchomel, V. (2014). The Sketch Engine: ten years on. Lexicography, 1:7–36.

Kim, S. N., Medelyan, O., Kan, M.-Y., and Baldwin, T. (2010). SemEval-2010 Task 5: Automatic keyphrase extraction from scientific articles. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 21–26. Association for Computational Linguistics.

Liu, W., Chung, B. C., Wang, R., Ng, J., and Morlet, N. (2015). A genetic algorithm enabled ensemble for unsupervised medical term extraction from clinical letters. Health Information Science and Systems.

Liu, Y., Zhang, T., Quan, P., Wen, Y., Wu, K., and He, H. (2018). A novel parsing-based automatic domain terminology extraction method. In Shi Y. et al., editors, Computational Science – ICCS 2018, Lecture Notes in Computer Science, vol. 10862, pages 796–802. Springer, Cham.

Lossio-Ventura, J. A., Jonquet, C., Roche, M., and Teisseire, M. (2016). Biomedical term extraction: overview and a new methodology. Information Retrieval Journal, pages 1573–7659.

Marciniak, M. and Mykowiecka, A. (2014). Terminology extraction from medical texts in Polish. Journal of Biomedical Semantics, 24.

Marciniak, M. and Mykowiecka, A. (2015). Nested term recognition driven by word connection strength. Terminology, 2:180–204.

Marciniak, M., Mykowiecka, A., and Rychlik, P. (2016). TermoPL — a flexible tool for terminology extraction. In Nicoletta Calzolari, et al., editors, Proceedings of the Tenth International Conference on Language Resources and Evaluation, LREC 2016, pages 2278–2284, Portorož, Slovenia. European Language Resources Association (ELRA).

Meng, R., Zhao, S., Han, S., He, D., Brusilovsky, P., and Chi, Y. (2017). Deep keyphrase generation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 582–592, Vancouver, Canada. Association for Computational Linguistics.

Merrouni, Z. A., Frikh, B., and Ouhbi, B. (2019). Automatic keyphrase extraction: a survey and trends. Journal of Intelligent Information Systems.

Adam Przepiórkowski, et al., editors. (2012). Narodowy Korpus Języka Polskiego. Wydawnictwo Naukowe PWN.

Rose, S., Engel, D., Cramer, N., and Cowley, W. (2010). Automatic keyword extraction from individual documents. Text Mining: Applications and Theory, pages 1–20.

Rybak, P. and Wróblewska, A. (2018). Semi-supervised

neural system for tagging, parsing and lemmatization. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 45–54. Association for Computational Linguistics.

Waszczuk, J. (2012). Harnessing the CRF complexity with domain-specific constraints: the case of morphosyntactic tagging of a highly inflected language. In Proceedings of the 24th International Conference on Computational Linguistics (COLING 2012), pages 2789–2804.

Yu, F., Xuan, H., and Zheng, D. (2012). Key-phrase extraction based on a combination of CRF model with document structure. In Eighth International Conference on Computational Intelligence and Security, pages 406–410.

Zhang, Q., Wang, Y., Gong, Y., and Huang, X. (2016). Keyphrase extraction using deep recurrent neural networks on Twitter. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 836–845, Austin, Texas. Association for Computational Linguistics.

Zhang, C. (2008). Automatic keyword extraction from documents using conditional random fields. Journal of Computational Information Systems, 4(3):1169–1180.
