Dictionary of Paraphrases of Czech Complex Predicates with Light Verbs

ParaDi: Dictionary of Paraphrases of Czech Complex Predicates with Light Verbs Petra Barancˇ´ıkova´ and Vaclava´ Kettnerova´ Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University, Malostranske´ nam´ estˇ ´ı 25, 118 00, Praha, Czech Republic, [email protected], [email protected] Abstract kiss’ vs. va´snivˇ e/nˇ eˇznˇ e/letmo/*manˇ zelsky/*mˇ a-´ jove/*smrtelnˇ eˇ pol´ıbit ‘kiss passionately/tender- We present a new freely available dic- ly/fleetingly/*marriagely/*Mayly/?fatally’. Easier tionary of paraphrases of Czech complex modification of CPs is usually considered as the predicates with light verbs, ParaDi. Can- main motivation for their widespread use (Brinton didates for single predicative paraphrases and Akimoto, 1999). of selected complex predicates have been In this paper, we present ParaDi, a dictionary extracted automatically from large mono- of single predicative verb paraphrases of Czech lingual data using word2vec. They have CPs. We restricted the dictionary only to CPs been manually verified and further refined. that consist of light verbs and predicative nouns, We demonstrate one of many possible ap- which represent the most frequent and central type plications of ParaDi in an experiment with of CPs in the Czech language. improving machine translation quality. ParaDi was built on a semi-automatic basis. First, candidates for single verb paraphrases of 1 Introduction selected CPs have been automatically identified Multiword expressions (MWEs) pose a serious in large monolingual data using word2vec, a shal- challenge for both foreign speakers and many NLP low neural network. The list of these candidates tasks (Sag et al., 2002). From various multiword has been then manually checked and further re- expressions, those that involve verbs are of great fined. In many cases, if CPs are to be correctly significance as verbs represent the syntactic center paraphrased by the identified single predicative of a sentence. verbs, these verbs require certain semantic and/or In this paper, we focus on one particular type syntactic modifications. of Czech multiword expressions – on complex It has been widely acknowledged that many predicates with light verbs (CPs). CPs consist NLP applications – let us mention, e.g. informa- of a light verb and another predicative element tion retrieval (Wallis, 1993), question answering, – a predicative noun, an adjective, an adverb or machine translation (Madnani and Dorr (2013); a verb; the pairs function as single predicative Callison-Burch et al. (2006); Marton et al. (2009)) units. As such, most CPs have their single pred- or machine translation evaluation (Kauchak and icative counterparts by which they can be para- Barzilay (2006); Zhou et al. (2006); Barancˇ´ıkova´ phrased, e.g. the CPs dat´ polibek and dat´ pusu et al. (2014)) – can benefit from paraphrases. ‘give a kiss’ can be both paraphrased by pol´ıbit Here we show how the dictionary providing ‘to kiss’. high quality data can be integrated into an ex- In contrast to their single predicative para- periment with improving statistical machine trans- phrases, CPs manifest much greater flexibility lation quality. If translated separately, CPs of- in their modification, c.f. adjectival modifiers ten cause errors in machine translation. In our of the CP dat´ polibek ‘give a kiss’ and the cor- experiment, we use the dictionary to simplify responding adverbial modifiers of its single verb Czech source sentences before translation by re- paraphrase pol´ıbit ‘to kiss’ in dat´ va´snivˇ y/n´ eˇz-ˇ placing CPs with their respective single predica- ny/letm´ y/man´ zelskˇ y/m´ ajov´ y/smrt´ ´ıc´ı polibek ‘give tive verb paraphrases. Human annotators have a passionate/tender/fleeting/marriage/May/fatal evaluated quality of the translated simplified sen- 1 Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017), pages 1–10, Valencia, Spain, April 4. c 2017 Association for Computational Linguistics tences higher than of the original sentences con- combined with information on syntactic and/or tain CPs. semantic properties of CPs are employed (e.g. This paper is structured as follows. First, related Bannard (2007), Fazly et al. (2005)). The auto- work on CPs generally and on their paraphrases is matic detection benefits especially from parallel introduced (Section 2). Second, the paraphrasing corpora representing valuable sources of data in model for CPs is thoroughly described, especially which CPs can be automatically recognized via the selection of CPs, an automatic extraction of word alignment, see e.g. (Chen et al., 2015), candidates for their paraphrases and their manual (de Medeiros Caseli et al., 2010), (Sinha, 2009), evaluation (Section 3). Third, the resulting data (Zarrießand Kuhn, 2009). and the structure of the lexical space of the dictio- Work on paraphrasing CPs is still not exten- nary are discussed (Section 4). Finally, in order to sive. A paraphrasing model has been proposed present one of many practical applications of this within the Meaning Text Theory(Zolkovskijˇ and ↔ dictionary, a random sample of paraphrases from Mel’cuk,ˇ 1965). Its representation of CPs by the ParaDi dictionary is used in a machine trans- means of lexical functions and rules applied in lation experiment (Section 5). the paraphrasing model are thoroughly described in (Alonso Ramos, 2007). Further, Fujita et al. 2 Related Work (2004) present a paraphrasing model which takes A theoretical research on CPs with light verbs has advantage of semantic representation of CPs by a long history, which can be traced back to Jes- lexical conceptual structures. Similarly as our pro- persen (1965). An ample literature devoted to this posed dictionary of paraphrases, this model also language phenomenon so far is characterized by takes into account changes in the grammatical cat- an enormous diversity in used terms and analyses, egory of voice and changes in morphological cases see esp. (Amberber et al., 2010) and (Alsina et al., of arguments, which have appeared to be highly 1997). Here we use the term CP with the light verb relevant for the paraphrasing task. for a collocation within which the verb – not re- taining its full semantic content – provides rather 3 Paraphrase Model grammatical functions (incl. syntactic structure) In this section, the process of paraphrase extrac- and to which individual semantic properties are tion is described in detail. First, we present the se- primarily contributed by the noun (Algeo, 1995). lection of CPs (Section 3.1). For their paraphras- The information on CPs is a part of several ing, we had initially intended to use some of exist- lexical resources containing manually annotated ing sources of paraphrases, however, they turned data. For instance, CPs are represented in syn- out to be completely unsatisfactory for our task.4 tactically rich annotated corpora from the family Word2vec is a group of shallow neural networks of the Prague Dependency Treebanks: the Prague generating representations of words in a continu- 1 Dependency Treebank 3.0 (PDT) and the Prague ous vector space depending on contexts they ap- 2 Czech-English Dependency Treebank 2.0 , see pear in (Mikolov et al., 2013). In line with dis- (Bejcekˇ et al., 2013) and (Hajicˇ et al., 2012). Fur- tributional hypothesis (Harris, 1954), semantically ther, the PropBank3 project has been recently en- hanced with the information on CPs; the anno- 4We used the ParaPhrase DataBase (PPDB), (Ganitke- tation scheme of CPs in PropBank is thoroughly vitch and Callison-Burch, 2014; Ganitkevitch et al., 2013) the largest paraphrase database available for the Czech lan- described in (Hwang et al., 2010). Finally, the guage. PPDB has been created automatically from large par- Hungarian corpus of CPs based on the data from allel data and it comes in several sizes ranging from S to the Szeged Treebank has been built (Vincze and XXL. However, the bigger its size, the bigger the amount of noise. We chose the size L as a reasonable trade-off between Csirik, 2010). quality and quantity. We combined the phrasal paraphrases, At present, one of trending topics in NLP many-to-one and one-to-many. We lemmatized and tagged the collection of PPDB using the state-of-the-art POS tagger community is an automatic identification of CPs. Morphodita (Strakova´ et al., 2014). Even though this collec- In this task, various statistical measures often tion contains almost 400k lemmatized paraphrases in total, it contained only 54 candidates for single predicative verb para- 1 http://ufal.mff.cuni.cz/pdt3.0 phrases of CP. Only 2 of these 45 candidates these candidates 2http://ufal.mff.cuni.cz/pcedt2.0/en/ have been detected correctly, the rest was noise in PPDB. As index.html a result, we chose not to use parallel data in our task but we 3https://verbs.colorado.edu/˜mpalmer/ have adopted another approach applying word2vec, a neural projects/ace.html network based model to large monolingual data. 2 similar words are mapped close to each other CPs Verbs Nouns (measured by the cosine similarity) so we can ex- First dataset 726 49 612 pect CPs and their single verb paraphrases to have Second dataset 1640 126 699 similar vector space distribution. Union 2257 154 1061 Word2vec computes vectors for single tokens. As CPs represent MWEs, their preprocessing was Table 1: The number of unique CPs, light verbs necessary: CPs have to be first identified and con- and predicative nouns from two datasets. Their nected into a single token (Section 3.2). union has been used in the paraphrase extraction Particular settings of our model for an auto- task. matic extraction of candidates for single predica- Corpus Sentences Tokens tive verb paraphrases are presented in Section 3.3. CNK2000 2.78 121.81 Finally, a manual evaluation of the extracted can- CNK2005 7.95 122.99 didates, including their further annotation with CNK2010 8.18 122.48 semantic and syntactic information, is described Czeng 1.0 14.83 206.05 (Section 3.4).

Load more