Identifying English and Hungarian Light Verb Constructions: a Contrastive Approach
Total Page:16
File Type:pdf, Size:1020Kb
View metadata, citation and similar papers at core.ac.uk brought to you by CORE provided by SZTE Publicatio Repozitórium - SZTE - Repository of Publications Identifying English and Hungarian Light Verb Constructions: A Contrastive Approach Veronika Vincze1,2, Istvan´ Nagy T.2 and Richard´ Farkas2 1Hungarian Academy of Sciences, Research Group on Artificial Intelligence [email protected] 2Department of Informatics, University of Szeged nistvan,rfarkas @inf.u-szeged.hu { } Abstract 2 Light Verb Constructions Here, we introduce a machine learning- based approach that allows us to identify Light verb constructions (e.g. to give advice) are light verb constructions (LVCs) in Hun- a subtype of multiword expressions (Sag et al., garian and English free texts. We also 2002). They consist of a nominal and a verbal present the results of our experiments on component where the verb functions as the syn- the SzegedParalellFX English–Hungarian tactic head, but the semantic head is the noun. The parallel corpus where LVCs were manu- verbal component (also called a light verb) usu- ally annotated in both languages. With ally loses its original sense to some extent. Al- our approach, we were able to contrast though it is the noun that conveys most of the the performance of our method and define meaning of the construction, the verb itself can- language-specific features for these typo- not be viewed as semantically bleached (Apres- logically different languages. Our pre- jan, 2004; Alonso Ramos, 2004; Sanroman´ Vi- sented method proved to be sufficiently ro- las, 2009) since it also adds important aspects to bust as it achieved approximately the same the meaning of the construction (for instance, the scores on the two typologically different beginning of an action, such as set on fire, see languages. Mel’cukˇ (2004)). The meaning of LVCs can be only partially computed on the basis of the mean- 1 Introduction ings of their parts and the way they are related to In natural language processing (NLP), a signifi- each other, hence it is important to treat them in a cant part of research is carried out on the English special way in many NLP applications. language. However, the investigation of languages LVCs are usually distinguished from productive that are typologically different from English is or literal verb + noun constructions on the one also essential since it can lead to innovations that hand and idiomatic verb + noun expressions on might be usefully integrated into systems devel- the other (Fazly and Stevenson, 2007). Variativ- oped for English. Comparative approaches may ity and omitting the verb play the most significant also highlight some important differences among role in distinguishing LVCs from productive con- languages and the usefulness of techniques that are structions and idioms (Vincze, 2011). Variativity applied. reflects the fact that LVCs can be often substituted In this paper, we focus on the task of identify- by a verb derived from the same root as the nomi- ing light verb constructions (LVCs) in English and nal component within the construction: productive Hungarian free texts. Thus, the same task will be constructions and idioms can be rarely substituted carried out for English and a morphologically rich by a single verb (like make a decision – decide). language. We compare whether the same set of Omitting the verb exploits the fact that it is the features can be used for both languages, we in- nominal component that mostly bears the seman- vestigate the benefits of integrating language spe- tic content of the LVC, hence the event denoted cific features into the systems and we explore how by the construction can be determined even with- the systems could be further improved. For this out the verb in most cases. Furthermore, the very purpose, we make use of the English–Hungarian same noun + verb combination may function as an parallel corpus SzegedParalellFX (Vincze, 2012), LVC in certain contexts while it is just a productive where LVCs have been manually annotated. construction in other ones, compare He gave her a 255 Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pages 255–261, Sofia, Bulgaria, August 4-9 2013. c 2013 Association for Computational Linguistics ring made of gold (non-LVC) and He gave her a and Kuhn (2009) argued that multiword expres- ring because he wanted to hear her voice (LVC), sions can be reliably detected in parallel corpora hence it is important to identify them in context. by using dependency-parsed, word-aligned sen- In theoretical linguistics, Kearns (2002) distin- tences. Sinha (2009) detected Hindi complex guishes between two subtypes of light verb con- predicates (i.e. a combination of a light verb and structions. True light verb constructions such as a noun, a verb or an adjective) in a Hindi–English to give a wipe or to have a laugh and vague ac- parallel corpus by identifying a mismatch of the tion verbs such as to make an agreement or to Hindi light verb meaning in the aligned English do the ironing differ in some syntactic and se- sentence. Many-to-one correspondences were also mantic features and can be separated by various exploited by Attia et al. (2010) when identifying tests, e.g. passivization, WH-movement, pronom- Arabic multiword expressions relying on asym- inalization etc. This distinction also manifests in metries between paralell entry titles of Wikipedia. natural language processing as several authors pay Tsvetkov and Wintner (2010) identified Hebrew attention to the identification of just true light verb multiword expressions by searching for misalign- constructions, e.g. Tu and Roth (2011). However, ments in an English–Hebrew parallel corpus. here we do not make such a distinction and aim to To the best of our knowledge, parallel corpora identify all types of light verb constructions both have not been used for testing the efficiency of an in English and in Hungarian, in accordance with MWE-detecting method for two languages at the the annotation principles of SZPFX. same time. Here, we investigate the performance The canonical form of a Hungarian light verb of our base LVC-detector on English and Hungar- construction is a bare noun + third person singular ian and pay special attention to the added value of verb. However, they may occur in non-canonical language-specific features. versions as well: the verb may precede the noun, or the noun and the verb may be not adjacent due 4 Experiments to the free word order. Moreover, as Hungarian is a morphologically rich language, the verb may In our investigations we made use of the Szeged- occur in different surface forms inflected for tense, ParalellFX English-Hungarian parallel corpus, mood, person and number. These features will be which consists of 14,000 sentences and contains paid attention to when implementing our system about 1370 LVCs for each language. In addition, for detecting Hungarian LVCs. we are aware of two other corpora – the Szeged Treebank (Vincze and Csirik, 2010) and Wiki50 3 Related Work (Vincze et al., 2011b) –, which were manually an- notated for LVCs on the basis of similar principles Recently, LVCs have received special interest in as SZPFX, so we exploited these corpora when the NLP research community. They have been au- defining our features. tomatically identified in several languages such as To automatically identify LVCs in running English (Cook et al., 2007; Bannard, 2007; Vincze texts, a machine learning based approach was ap- et al., 2011a; Tu and Roth, 2011), Dutch (Van de plied. This method first parsed each sentence and Cruys and Moiron,´ 2007), Basque (Gurrutxaga extracted potential LVCs. Afterwards, a binary and Alegria, 2011) and German (Evert and Ker- classification method was utilized, which can au- mes, 2003). tomatically classify potential LVCs as an LVC or Parallel corpora are of high importance in the not. This binary classifier was based on a rich fea- automatic identification of multiword expressions: ture set described below. it is usually one-to-many correspondence that is The candidate extraction method investi- exploited when designing methods for detecting gated the dependency relation among the verbs multiword expressions. Caseli et al. (2010) de- and nouns. Verb-object, verb-subject, verb- veloped an alignment-based method for extracting prepositional object, verb-other argument (in the multiword expressions from Portuguese–English case of Hungarian) and noun-modifier pairs were parallel corpora. Samardziˇ c´ and Merlo (2010) an- collected from the texts. The dependency labels alyzed English and German light verb construc- were provided by the Bohnet parser (Bohnet, tions in parallel corpora: they pay special attention 2010) for English and by magyarlanc 2.0 to their manual and automatic alignment. Zarrieß (Zsibrita et al., 2013) for Hungarian. 256 The features used by the binary classifier can be Csirik, 2010) in the case of Hungarian. Then, we categorised as follows: investigated whether the lemmatised verbal com- Morphological features: As the nominal com- ponent of the candidate was one of these fifteen ponent of LVCs is typically derived from a verbal verbs. The lemma of the noun was also applied stem (make a decision) or coincides with a verb as a lexical feature. The nouns found in LVCs (have a walk), the VerbalStem binary feature fo- were collected from the above-mentioned corpora. cuses on the stem of the noun; if it had a verbal Afterwards, we constructed lists of lemmatised nature, the candidates were marked as true. The LVCs got from the other corpora. POS-pattern feature investigates the POS-tag se- quence of the potential LVC. If it matched one pat- Syntactic features: As the candidate extraction tern typical of LVCs (e.g. verb + noun) the methods basically depended on the dependency candidate was marked as true; otherwise as false.