Evaluating Automatic Term Extraction Methods on Individual Documents

Antonio Šajatović, Maja Buljan, Jan Šnajder, Bojana Dalbelo Bašić
University of Zagreb, Faculty of Electrical Engineering and Computing,
Text Analysis and Knowledge Engineering Lab, Zagreb, Croatia
{antonio.sajatovic, maja.buljan, jan.snajder, bojana.dalbelo}@fer.hr

Proceedings of the Joint Workshop on Multiword Expressions and WordNet (MWE-WN 2019), pages 149–154, Florence, Italy, August 2, 2019. © 2019 Association for Computational Linguistics.

Abstract

Automatic Term Extraction (ATE) extracts terminology from domain-specific corpora. ATE is used in many NLP tasks, including Computer Assisted Translation, where it is typically applied to individual documents rather than the entire corpus. While corpus-level ATE has been extensively evaluated, it is not obvious how the results transfer to document-level ATE. To fill this gap, we evaluate 16 state-of-the-art ATE methods on full-length documents from three different domains, on both the corpus and document levels. Unlike existing studies, our evaluation is more realistic, as we take into account all gold terms. We show that no single method is best in corpus-level ATE, but C-Value and KeyConceptRelatedness surpass others in document-level ATE.

1 Introduction

The aim of Automatic Term Extraction (or Recognition) (ATE) is to extract terms – single words or multiword expressions (MWEs) representing domain-specific concepts – from a domain-specific corpus. ATE is widely used in many NLP tasks, such as information retrieval and machine translation. Moreover, Computer Assisted Translation (CAT) tools often use ATE methods to aid translators in finding and extracting translation equivalent terms in the target language (Costa et al., 2016; Oliver, 2017).

While corpus-based approaches to terminology extraction are the norm when building large-scale termbases (Warburton, 2014), a survey we conducted¹ showed that translators are most often interested in ATE from individual documents of various lengths, rather than entire corpora, since they typically translate one document at a time.

A task related to ATE is Automatic Keyword and Keyphrase Extraction (AKE), which deals with the extraction of single words and MWEs from a single document. Unlike ATE, which aims to capture domain-specific terminology, keywords and keyphrases extracted by AKE should capture the main topics of a document. Consequently, there will be only a handful of representative keyphrases for a document (Turney, 2000). In spite of these differences, several AKE methods were adapted for ATE (Zhang et al., 2016).

While corpus-level ATE methods, as well as AKE methods, have been extensively evaluated in the literature, it is not obvious how the results transfer to document-level ATE, which is how ATE is typically used for CAT. In this paper, we aim to close this gap and present an evaluation study that considers both corpus- and document-level ATE. We evaluate 16 state-of-the-art ATE methods, including modified AKE methods. Furthermore, addressing another deficiency in existing evaluations, we evaluate the methods using a complete set of gold terms, making the evaluation more realistic.

¹ Survey results available at http://bit.ly/2LwrTkv.

2 Related Work

Most ATE methods begin with the extraction and filtering of candidate terms, followed by candidate term scoring and ranking. Because of divergent implementations of the candidate extraction and filtering step, many existing ATE evaluations are not directly comparable. Zhang et al. (2008) were among the first to compare several scoring and ranking methods, using the same candidate extraction and filtering step and the UAP metric on a custom Wikipedia corpus and the GENIA corpus (Kim et al., 2003). In a follow-up work, they developed JATE 2.0 (Zhang et al., 2016), with 10 ATE methods available out-of-the-box, which were evaluated on GENIA and ACL RD-TEC (Zadeh and Handschuh, 2014) using the "precision at K" metric. A similar toolkit, ATR4S (Astrakhantsev, 2018), which implements 15 ATE methods, was evaluated on even more datasets using "average precision at K". All abovementioned studies were carried out at the corpus level, and rely on exact matching between extracted terms and a subset of gold terms. The latter makes such evaluations unrealistic, because it disregards the contribution of the candidate extraction and filtering step. The subset is selected by considering only the gold terms that appear in the output above the cutoff at rank K, which is used to discriminate between real terms and non-terms. A general consensus is that there is no single best method (Zhang et al., 2008; Astrakhantsev, 2018; Zhang et al., 2018).

To the best of our knowledge, we are the first to carry out a document-level ATE evaluation and take into account all gold terms instead of only a subset. To this end, we use a single ATE toolkit, to allow for a direct comparison among different term-ranking methods by using the same preprocessing and filters. Our toolkit of choice is ATR4S, because it has the most diverse set of methods, many of which are state-of-the-art.

3 Term Extraction Methods

ATE methods may be roughly grouped by the type of information used for scoring the term candidates (Astrakhantsev, 2018). Due to the sheer number of ATE methods, we only describe the main principle behind each group and list the main methods. In the evaluation, we consider a total of 16 methods from ATR4S, covering all groups.

Frequency. Most methods rest on the assumption that a higher term candidate frequency implies a higher likelihood that a candidate is an actual term. Among these are AverageTermFrequency (Zhang et al., 2016), ResidualIDF (Zhang et al., 2016) (adapted from AKE), TotalTF-IDF (Evans and Lefferts, 1995), C-Value (Frantzi et al., 2000), Basic (Buitelaar et al., 2013), and ComboBasic (Astrakhantsev et al., 2015). Two notable ATE-adapted AKE methods, not provided in ATR4S, are Chi-Square (Matsuo and Ishizuka, 2004) and Rapid Keyword Extraction (Rose et al., 2010).

Context. A handful of methods adopt the distributional hypothesis (Harris, 1954) and consider the context in which the term candidate appears, such as DomainCoherence (Buitelaar et al., 2013) and NC-Value (Frantzi et al., 2000).

Reference corpora. Several methods compare the domain corpus and reference corpus term frequencies, assuming that the difference between them can be used to distinguish terms from non-terms. Domain pertinence (DomPertinence) (Meijer et al., 2014) is the simplest one, while Relevance (Peñas et al., 2001) and Weirdness (Ahmad et al., 1999) can be considered its modifications.

Topic modeling. Topic information can also be used instead of term frequency information, as in NovelTM (Li et al., 2013).

Wikipedia. Several methods use Wikipedia instead of term frequency to distinguish between candidate and actual terms, such as LinkProbability (Astrakhantsev, 2014) and KeyConceptRelatedness (Astrakhantsev, 2014). In addition to Wikipedia, KeyConceptRelatedness also relies on keyphrase extraction and semantic relatedness.

Re-ranking. Methods from this group use other ATE methods as features, and attempt to learn the importance of each feature in an unsupervised or supervised setting. Glossary Extraction (Park et al., 2002) extends Weirdness, while Term Extraction (Sclano and Velardi, 2007) further extends Glossary Extraction. SemRe-Rank (Zhang et al., 2018) is a generic approach that incorporates semantic relatedness to re-rank terms. Both da Silva Conrado et al. (2013) and Yuan et al. (2017) use a variety of features in a supervised binary term classifier. A weakly supervised bootstrapping approach called fault tolerant learning (Yang et al., 2010) has been extended for deep learning (Wang et al., 2016). The following methods are the only ones from this group available in ATR4S and therefore the only ones evaluated: PostRankDC (Buitelaar et al., 2013) combines DomainCoherence with Basic, while both PU-ATR (supervised) (Astrakhantsev, 2014) and Voting (unsupervised) (Zhang et al., 2008) use the same five features as implemented in ATR4S. In our study, we distinguish between the original Voting5 and its variant, Voting3, in which the two Wikipedia-based features are removed to gauge their impact.

4 Evaluation

Datasets. There exists a number of ATE datasets compiled using various criteria, comprised of abstracts or full-length documents. As our focus is document-level ATE, our criteria were that the dataset has to consist of full-length documents and be manually annotated. This ruled out the two most popular datasets used in most previous work, GENIA and ACL RD-TEC, as the former consists of abstracts only and the latter is not manually annotated.

Table 1: Full-length document dataset statistics.

Dataset   # Docs   # Terms   % MWEs   Avg. terms/doc
Patents       16      1585       86              151
TTCm          37       160       55               51
TTCw         102       190       72               33

Metric. Reachable True Positives (RTP) are the intersection of the extracted candidate terms after filtering and the gold set terms. To separate real terms from non-terms based on their scores, a cutoff at rank K has to be set. Setting K equal to |RTP| is the default choice in the majority of previous work (Zhang et al., 2016; Astrakhantsev, 2018; Zhang et al., 2018), but any such metric can easily become too optimistic because |RTP| ≤ |ATP|, i.e., the evaluation becomes oblivious to the candidate extraction and filtering step.

To obtain a more realistic score, we calculate ATP for both corpus- and document-level ATE. In the former, ATP is equal to the entire gold set, while in the latter we build the gold set of each document by checking if the lemma of any term from the gold set is a substring of the entire lemmatized document.
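To make the frequency-based group concrete, the following is a minimal sketch of the classic C-Value score (Frantzi et al., 2000), which discounts a candidate's frequency by the frequency of longer candidates that contain it. The data and function names are illustrative, not taken from ATR4S; note that the plain `log2(|a|)` factor zeroes out single-word candidates, which toolkit implementations typically address with a `log2(|a| + 1)` variant.

```python
from math import log2

def c_value(candidates):
    """Score term candidates with the classic C-Value formula.

    `candidates` maps each candidate term (a tuple of words) to its
    corpus frequency. A candidate nested inside longer candidates is
    discounted, since part of its frequency is explained by the longer terms.
    """
    def contains(longer, shorter):
        # True if `shorter` occurs as a contiguous word subsequence of `longer`
        n, m = len(longer), len(shorter)
        return any(longer[i:i + m] == shorter for i in range(n - m + 1))

    scores = {}
    for a, freq in candidates.items():
        # T_a: frequencies of longer candidates that contain `a`
        nests = [f for b, f in candidates.items()
                 if len(b) > len(a) and contains(b, a)]
        if nests:
            scores[a] = log2(len(a)) * (freq - sum(nests) / len(nests))
        else:
            scores[a] = log2(len(a)) * freq
    return scores
```

For example, with candidates {("contact", "lens"): 10, ("soft", "contact", "lens"): 4}, the nested bigram is scored as log2(2) · (10 − 4) = 6, while the trigram keeps its full frequency.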
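The reference-corpus group is even simpler: Weirdness (Ahmad et al., 1999) is the ratio of a term's normalized frequency in the domain corpus to its normalized frequency in a general reference corpus. A sketch, where the add-one smoothing on the reference counts is our own assumption (to cope with terms absent from the reference corpus), not part of the original method:

```python
def weirdness(freq_domain, total_domain, freq_ref, total_ref):
    """Weirdness: ratio of relative frequencies, domain vs. reference corpus.

    Scores far above 1 suggest domain-specific terms; scores near or
    below 1 suggest general-language words.
    """
    rel_domain = freq_domain / total_domain
    rel_ref = (freq_ref + 1) / (total_ref + 1)  # assumed add-one smoothing
    return rel_domain / rel_ref
```

A word that is frequent in a patent corpus but unseen in a general corpus gets a very large score, while a word with comparable relative frequency in both corpora scores close to 1.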
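The unsupervised Voting re-ranker combines the rankings produced by several base ATE methods. The sketch below uses an equally weighted sum of reciprocal ranks, which is one common formulation of rank-based voting; the exact weighting used by Zhang et al. (2008) and by the ATR4S implementation may differ.

```python
def voting(rankings):
    """Combine several base rankings of the same candidate terms.

    `rankings` is a list of lists, each ordering the candidates from best
    to worst according to one base ATE method. Each method contributes
    1/rank to a candidate's score; ties in feature weights are assumed.
    """
    scores = {}
    for ranking in rankings:
        for rank, term in enumerate(ranking, start=1):
            scores[term] = scores.get(term, 0.0) + 1.0 / rank
    # Return candidates ordered by combined score, best first
    return sorted(scores, key=scores.get, reverse=True)
```

With three base rankings ["a", "b", "c"], ["b", "a", "c"], and ["a", "c", "b"], candidate "a" accumulates 1 + 1/2 + 1 = 2.5 and wins the combined ranking.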
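The document-level evaluation described above can be sketched as follows: the per-document gold set is built by lemma substring matching against the lemmatized document, and precision is then computed at a cutoff K. Function names are illustrative; note that raw substring matching can overmatch across word boundaries (e.g., "term" inside "terminal"), which a real pipeline would guard against.

```python
def document_gold_set(gold_lemmas, doc_lemmas):
    """Per-document gold set: lemmatized gold terms that occur as a
    substring of the lemmatized document text."""
    doc_text = " ".join(doc_lemmas)
    return {term for term in gold_lemmas if term in doc_text}

def precision_at_k(ranked_terms, gold, k):
    """Fraction of the top-k ranked candidates that are gold terms."""
    return sum(1 for term in ranked_terms[:k] if term in gold) / k
```

For a document whose lemmatized text contains both "neural network" and "term", a ranking whose top two entries include one gold term yields a precision at K=2 of 0.5.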
