ParaDi: Dictionary of Paraphrases of Czech Complex Predicates with Light Verbs

Petra Barančíková and Václava Kettnerová
Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University, Malostranské náměstí 25, 118 00, Praha, Czech Republic
[email protected], [email protected]

Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017), pages 1–10, Valencia, Spain, April 4. © 2017 Association for Computational Linguistics

Abstract

We present a new freely available dictionary of paraphrases of Czech complex predicates with light verbs, ParaDi. Candidates for single predicative paraphrases of selected complex predicates have been extracted automatically from large monolingual data using word2vec. They have been manually verified and further refined. We demonstrate one of many possible applications of ParaDi in an experiment with improving machine translation quality.

1 Introduction

Multiword expressions (MWEs) pose a serious challenge for both foreign speakers and many NLP tasks (Sag et al., 2002). Among the various multiword expressions, those that involve verbs are of great significance, as verbs represent the syntactic center of a sentence.

In this paper, we focus on one particular type of Czech multiword expressions – on complex predicates with light verbs (CPs). CPs consist of a light verb and another predicative element – a predicative noun, an adjective, an adverb or a verb; the pairs function as single predicative units. As such, most CPs have single predicative counterparts by which they can be paraphrased, e.g. the CPs dát polibek and dát pusu ‘give a kiss’ can both be paraphrased by políbit ‘to kiss’.

In contrast to their single predicative paraphrases, CPs manifest much greater flexibility in their modification, cf. adjectival modifiers of the CP dát polibek ‘give a kiss’ and the corresponding adverbial modifiers of its single verb paraphrase políbit ‘to kiss’ in dát vášnivý/něžný/letmý/manželský/májový/smrtící polibek ‘give a passionate/tender/fleeting/marriage/May/fatal kiss’ vs. vášnivě/něžně/letmo/*manželsky/*májově/*smrtelně políbit ‘kiss passionately/tenderly/fleetingly/*marriagely/*Mayly/?fatally’. Easier modification of CPs is usually considered the main motivation for their widespread use (Brinton and Akimoto, 1999).

In this paper, we present ParaDi, a dictionary of single predicative paraphrases of Czech CPs. We restricted the dictionary only to CPs that consist of light verbs and predicative nouns, which represent the most frequent and central type of CPs in the Czech language.

ParaDi was built on a semi-automatic basis. First, candidates for single verb paraphrases of selected CPs have been automatically identified in large monolingual data using word2vec, a shallow neural network. The list of these candidates has then been manually checked and further refined. In many cases, if CPs are to be correctly paraphrased by the identified single predicative verbs, these verbs require certain semantic and/or syntactic modifications.

It has been widely acknowledged that many NLP applications – let us mention, e.g. information retrieval (Wallis, 1993), question answering, machine translation (Madnani and Dorr (2013); Callison-Burch et al. (2006); Marton et al. (2009)) or machine translation evaluation (Kauchak and Barzilay (2006); Zhou et al. (2006); Barančíková et al. (2014)) – can benefit from paraphrases.

Here we show how the dictionary, providing high quality data, can be integrated into an experiment with improving statistical machine translation quality. If translated separately, CPs often cause errors in machine translation. In our experiment, we use the dictionary to simplify Czech source sentences before translation by replacing CPs with their respective single predicative verb paraphrases. Human annotators have rated the quality of the translated simplified sentences higher than that of the original sentences containing CPs.

This paper is structured as follows. First, related work on CPs generally and on their paraphrases is introduced (Section 2). Second, the paraphrasing model for CPs is thoroughly described, especially the selection of CPs, the automatic extraction of candidates for their paraphrases and their manual evaluation (Section 3). Third, the resulting data and the structure of the lexical space of the dictionary are discussed (Section 4). Finally, in order to present one of many practical applications of this dictionary, a random sample of paraphrases from the ParaDi dictionary is used in a machine translation experiment (Section 5).

2 Related Work

Theoretical research on CPs with light verbs has a long history, which can be traced back to Jespersen (1965). The ample literature devoted to this language phenomenon so far is characterized by an enormous diversity in the terms and analyses used, see esp. (Amberber et al., 2010) and (Alsina et al., 1997). Here we use the term CP with the light verb for a collocation within which the verb – not retaining its full semantic content – provides rather grammatical functions (incl. syntactic structure) and to which individual semantic properties are primarily contributed by the noun (Algeo, 1995).

The information on CPs is a part of several lexical resources containing manually annotated data. For instance, CPs are represented in syntactically rich annotated corpora from the family of the Prague Dependency Treebanks: the Prague Dependency Treebank 3.0 (PDT)¹ and the Prague Czech-English Dependency Treebank 2.0², see (Bejček et al., 2013) and (Hajič et al., 2012). Further, the PropBank³ project has been recently enhanced with the information on CPs; the annotation scheme of CPs in PropBank is thoroughly described in (Hwang et al., 2010). Finally, the Hungarian corpus of CPs based on the data from the Szeged Treebank has been built (Vincze and Csirik, 2010).

¹ http://ufal.mff.cuni.cz/pdt3.0
² http://ufal.mff.cuni.cz/pcedt2.0/en/index.html
³ https://verbs.colorado.edu/~mpalmer/projects/ace.html

At present, one of the trending topics in the NLP community is the automatic identification of CPs. In this task, various statistical measures, often combined with information on syntactic and/or semantic properties of CPs, are employed (e.g. Bannard (2007), Fazly et al. (2005)). The automatic detection benefits especially from parallel corpora representing valuable sources of data in which CPs can be automatically recognized via word alignment, see e.g. (Chen et al., 2015), (de Medeiros Caseli et al., 2010), (Sinha, 2009), (Zarrieß and Kuhn, 2009).

Work on paraphrasing CPs is still not extensive. A paraphrasing model has been proposed within the Meaning↔Text Theory (Žolkovskij and Mel’čuk, 1965). Its representation of CPs by means of lexical functions and the rules applied in the paraphrasing model are thoroughly described in (Alonso Ramos, 2007). Further, Fujita et al. (2004) present a paraphrasing model which takes advantage of the semantic representation of CPs by lexical conceptual structures. Similarly to our proposed dictionary of paraphrases, this model also takes into account changes in the grammatical category of voice and changes in morphological cases of arguments, which have proved to be highly relevant for the paraphrasing task.

3 Paraphrase Model

In this section, the process of paraphrase extraction is described in detail. First, we present the selection of CPs (Section 3.1). For their paraphrasing, we had initially intended to use existing sources of paraphrases; however, they turned out to be completely unsatisfactory for our task.⁴

⁴ We used the ParaPhrase DataBase (PPDB) (Ganitkevitch and Callison-Burch, 2014; Ganitkevitch et al., 2013), the largest paraphrase database available for the Czech language. PPDB has been created automatically from large parallel data and it comes in several sizes ranging from S to XXL. However, the bigger its size, the bigger the amount of noise. We chose the size L as a reasonable trade-off between quality and quantity. We combined the phrasal paraphrases, many-to-one and one-to-many. We lemmatized and tagged the collection of PPDB using the state-of-the-art POS tagger Morphodita (Straková et al., 2014). Even though this collection contains almost 400k lemmatized paraphrases in total, it contained only 54 candidates for single predicative verb paraphrases of CP. Only 2 of these 45 candidates have been detected correctly, the rest was noise in PPDB. As a result, we chose not to use parallel data in our task but we have adopted another approach applying word2vec, a neural network based model, to large monolingual data.

Word2vec is a group of shallow neural networks generating representations of words in a continuous vector space depending on the contexts they appear in (Mikolov et al., 2013). In line with the distributional hypothesis (Harris, 1954), semantically similar words are mapped close to each other (measured by the cosine similarity), so we can expect CPs and their single verb paraphrases to have a similar vector space distribution.

Word2vec computes vectors for single tokens. As CPs represent MWEs, their preprocessing was necessary: CPs have to be first identified and connected into a single token (Section 3.2).

Particular settings of our model for the automatic extraction of candidates for single predicative verb paraphrases are presented in Section 3.3. Finally, the manual evaluation of the extracted candidates, including their further annotation with semantic and syntactic information, is described (Section 3.4).

3.1 CPs Selection

Two different datasets of CPs, containing together 2,257 unique CPs, have been used. As both these datasets have been manually created, they allow us to achieve the desired quality of the resulting data.

The first dataset resulted from the experiment examining the native speakers’ agreement on the interpretation of light verbs (Kettnerová et al., 2013). CPs in this dataset consist of collocations of light verbs and predicative nouns expressed by a prepositionless case (e.g., položit otázku ‘put a question’), by a simple prepositional case (e.g., dát do pořádku ‘put in order’), and by a complex prepositional group (e.g., přejít ze smíchu do pláče ‘go from laughing to crying’).

The second dataset resulted from a project aiming to enhance the high coverage valency lexicon of Czech verbs, VALLEX,⁵ with the information on CPs (Kettnerová et al., 2016). In this case, only the nominal collocates expressed in the prepositionless accusative were selected, as they represent the central type of Czech CPs. As frequency and saliency have been taken as the main criteria for their selection, the resulting set represents a valuable source of CPs for Czech.

⁵ http://ufal.mff.cuni.cz/vallex/3.0/

The overall number of CPs in the datasets is presented in Table 1. The union of CPs from these datasets – 2,257 CPs in total – has been used in the paraphrase candidates extraction task.

                  CPs    Verbs   Nouns
First dataset     726    49      612
Second dataset    1640   126     699
Union             2257   154     1061

Table 1: The number of unique CPs, light verbs and predicative nouns from two datasets. Their union has been used in the paraphrase extraction task.

3.2 Data Preprocessing

For word2vec training, only monolingual data – generally easily obtainable in a large amount – is necessary. We have used large lemmatized corpora of Czech texts: SYN2000 (Čermák et al., 2000), SYN2005 (Čermák et al., 2005), SYN2010 (Křen et al., 2010) and CzEng 1.0 (Bojar et al., 2011). As these four large corpora with almost 600 million tokens in total have turned out to be insufficient, they have been extended with the data from the Czech Press – a large collection of contemporary news texts containing more than 4,000 million tokens. The overall statistics on all datasets are presented in Table 2.

Corpus        Sentences   Tokens
CNK2000       2.78        121.81
CNK2005       7.95        122.99
CNK2010       8.18        122.48
Czeng 1.0     14.83       206.05
Czech Press   258.40      4018.89
Total         292.14      4592.22

Table 2: Basic statistics of datasets (numbers in millions of units).

To generate CP paraphrases, all the selected CPs (Section 3.1) had to be automatically identified in the given corpora. For the identification of the CPs, we proceeded from light verbs. First, all verbs in the corpora were detected. From these verbs, only those that represent light verbs as parts of the selected CPs were further processed. For each identified light verb, each noun phrase in the context of ±4 words from the given light verb was extracted in case the verb and the given noun phrase can combine in some of the selected CPs.

Further, as word2vec generates representations of single word units, every detected noun phrase was connected with its respective light verb into a single word unit. In case some light verb could combine with more than one noun phrase into CPs, or in case one noun phrase could be connected with more than one light verb, we have followed the principle that every verb should be connected to at least one candidate in order to maximize the number of identified CPs.
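The identification and joining step of Section 3.2 can be illustrated with a small sketch. It is not the authors' implementation: it assumes that each noun phrase has already been reduced to a single lemma, that a part-of-speech flag per token is available, and that the pairing principle is approximated by a simple greedy heuristic (verbs with the fewest candidates choose first, ties go to the nearest noun). The function name, its parameters and the use of an underscore as the joining character are choices made for this illustration.

```python
from typing import Dict, List, Set, Tuple


def join_complex_predicates(
    lemmas: List[str],
    is_verb: List[bool],
    cp_pairs: Set[Tuple[str, str]],
    window: int = 4,
) -> List[str]:
    """Merge each light verb with one noun from its +/- `window` context."""
    # Collect candidate noun positions for every light-verb position.
    candidates: Dict[int, List[int]] = {}
    for i, (lemma, verb) in enumerate(zip(lemmas, is_verb)):
        if not verb:
            continue
        lo, hi = max(0, i - window), min(len(lemmas), i + window + 1)
        found = [
            j for j in range(lo, hi)
            if j != i and not is_verb[j] and (lemma, lemmas[j]) in cp_pairs
        ]
        if found:
            candidates[i] = found

    # Greedy heuristic (an assumption of this sketch): verbs with the fewest
    # options are served first so that as many verbs as possible get a noun.
    assigned: Dict[int, int] = {}  # verb position -> noun position
    used: Set[int] = set()
    for i in sorted(candidates, key=lambda v: (len(candidates[v]), v)):
        free = [j for j in candidates[i] if j not in used]
        if free:
            j = min(free, key=lambda n: abs(n - i))  # nearest free noun
            assigned[i] = j
            used.add(j)

    # Rebuild the sentence: the verb becomes "verb_noun", the noun is dropped.
    out: List[str] = []
    for i, lemma in enumerate(lemmas):
        if i in used:
            continue
        out.append(f"{lemma}_{lemmas[assigned[i]]}" if i in assigned else lemma)
    return out
```

Calling `join_complex_predicates(["on", "mít", "velký", "problém"], [False, True, False, False], {("mít", "problém")})` yields a token list in which the verb and its noun form one unit, which is the shape of input word2vec needs in order to learn a single vector for the whole CP.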

For example, if there were two light verbs v1 and v2 in a sentence and v1 had a candidate c1, while v2 had two candidates c1 and c2, v1 was connected with c1 and v2 with c2. In case this principle was not sufficient, the light verb was assigned the closest noun phrase on the basis of word order.

When each noun phrase was connected with at most one light verb and each light verb was connected with at most one noun phrase, we joined the noun phrases to their respective light verbs into single word units with the underscore character and erased the noun phrases from their original position in sentences.

For example, after identifying the light verb mít ‘have’ in a sentence and the prepositionless noun phrase problém ‘problem’ in its context on the above principles, the given light verb and the given noun phrase have been connected into the resulting single word unit mít_problém; this whole unit then replaced the verb mít ‘have’ in the sentence, while the noun phrase problém ‘problem’ was deleted from the sentence.

On this basis, almost 8.5 million instances of CPs were identified in the corpora; 99.9% of them are instances of CPs with more than 100 occurrences in the corpora. However, only 1,776 unique CPs were detected – almost 500 CPs from the selected datasets (Section 3.1) did not occur even once. The rank and frequency of selected CPs identified in the corpora are presented in Table 3.

rank    CP                                                   frequency
1.      mít problém ‘have a problem’                         319,791
2.      mít možnost ‘have a possibility’                     300,330
3.      mít šanci ‘have a chance’                            292,340
...     ...                                                  ...
998.    vznést žalobu ‘bring charges’                        535
...     ...                                                  ...
1775.   vést k sebevyvracení ‘lead to self-refutation’       1
1776.   dojít k flagelantství ‘flagellation takes place’     1

Table 3: The ranking of the CPs identified in the corpora, based on their frequency.

3.3 Word2vec Model

To the resulting data, we have applied gensim, a freely available word2vec implementation (Řehůřek and Sojka, 2010). In particular, we have used a model of vector size 500 with the continuous bag of words (CBOW) training algorithm and negative sampling.

As it is impossible for the model to learn anything about a rarely seen word, we have set the minimum number of word occurrences to 100 in order to limit the size of the vocabulary to reasonable words. This requirement also filtered uncommonly used CPs from the CPs identified in the corpora: from 1,776 CPs only 1,486 CPs fulfilled the given limit.

After training the model, for each of the 1,486 CPs we have extracted the 30 words with the most similar vectors. From these 30 words, we have selected up to ten single verbs closest to the given CP. These verbs were taken as candidates for single predicative verb paraphrases of the given CP.

As a result, 8,921 verbs in total, corresponding to 3,735 unique verb lemmas, have been selected as candidates for single predicative verb paraphrases of the given 1,486 CPs.

3.4 Annotation Process

In this section, the annotation process of the extracted 8,921 candidates for single predicative verb paraphrases of CPs is thoroughly described. Manual processing of the extracted single verbs allowed us to evaluate the results of the adopted method.

Let us repeat that word2vec generates semantically similar words depending on the contexts they appear in. However, not only words having the same meaning can have a similar vector space representation. Words with the opposite meaning (e.g. ‘finish’ vs. ‘start’), a more specific meaning (‘finish’ vs. ‘graduate’) or even a different meaning can be extracted, as they can appear in similar contexts as well. Manual evaluation of the extracted candidates for single verb paraphrases is thus necessary.

In the manual evaluation, two annotators have been asked to indicate for each instance of the extracted candidates for single verb paraphrases of a CP whether it represents a paraphrase of the given CP, or not. For example, the single verbs upřednostňovat and preferovat ‘to prefer’ are paraphrases of the CP dávat přednost ‘to give a preference’ while the verb srazit ‘to run down’ is not.

Moreover, single verbs antonymous with the respective CPs have been indicated as well, as in particular contexts they can also function as a paraphrase. For example, depending on context, both extracted single verbs stoupnout ‘to rise’ and poklesnout ‘to drop’ can function as paraphrases of the CP zaznamenat propad ‘to experience a drop’: the latter has a meaning synonymous with the given CP, while the meaning of the former is antonymous.

Further, when the annotators have determined a certain candidate as a single verb paraphrase of a CP, they have taken the following three morphological, syntactic and semantic aspects into account.

First, they had to pay special attention to the morphological expression of arguments. Changes in their morphological expression reflect different syntactic perspectives from which the action denoted by the given CP and its single verb paraphrase is viewed. For example, the single verb potrestat ‘to punish’ can serve as a paraphrase of the CP dostat trest ‘to get a punishment’ in a sentence; however, the semantic roles of the subject and the object are switched.

Second, in some cases the reflexive morpheme se/si, reflecting the inchoative meaning, had to be added to single predicative verb paraphrases so that their meaning corresponds to the meaning of their respective CPs. For example, the CP mít problém ‘have a problem’ can be paraphrased by the verb trápit only on the condition that the reflexive morpheme is attached to the verb lemma: trápit se ‘to worry’.

Third, some single predicative verbs function as paraphrases of particular CPs only if nouns in these CPs have certain adjectival modifications. These paraphrases have been assigned the given adjectives during the annotation.

As the above three features are not mutually exclusive, they can combine. For example, the verb zaměstnat ‘to hire’ is a paraphrase of the CP nalézt uplatnění ‘to find a use’ but both the reflexive morpheme se and a modification by the adjective pracovní ‘working’ are required.

To summarize, for each identified single predicative verb paraphrase v of a CP l, the annotators have chosen from the following options:

• v is a synonymous paraphrase of l (without any modification of the context)
  e.g., mít zájem ‘to be interested’ and chtít ‘to want’

• v is an antonym of l (the modification of the context is necessary)
  e.g., zaznamenat propad ‘to experience a drop’ and stoupnout ‘to rise’

• v is a paraphrase of l but changes in the morphological expression of arguments are necessary
  e.g., dostat nabídku ‘to get an offer’ and nabídnout ‘to offer’

• v is a paraphrase of l but the reflexive morpheme se/si has to be added (the modification of the verb lemma is necessary)
  e.g., nést název ‘to be called’ and nazývat se ‘to be called’

• v is a paraphrase of l with a particular adjectival modification (the adjective modifier of the noun should be present)
  e.g., podat oznámení ‘to make an announcement’ can be paraphrased as žalovat ‘to sue’ only if the noun oznámení is modified with the adjective trestní ‘criminal’

• v is not a paraphrase of l

As a result of the annotation process, the total number of the indicated single verb paraphrases of CPs was 2,177. For 999 CPs at least one single verb paraphrase has been found. The highest number of single verb paraphrases indicated for one CP has been eight; it has been the CP vznést dotaz ‘to ask a question’. Figure 1 shows the number of paraphrases per CP.

Table 4 presents more detailed results of the annotation. It shows the frequency of additional morphological, syntactic and semantic features.

                        synonyms   antonyms
no constraints          1607       51
+ reflexive morpheme    353        2
+ voice change          173        5
+ an adjective          53         –
total                   2177       58

Table 4: The basic statistics on the annotation. The synonyms column does not add up as the conditions are not mutually exclusive, as mentioned earlier.
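The candidate selection of Section 3.3 (the 30 nearest words, of which up to ten verbs are kept) can be sketched without any particular word2vec library. The snippet below is only an illustration under stated assumptions: the vectors are a plain dictionary rather than a trained gensim model, and the `is_verb` predicate stands in for whatever morphological tagging is available. The function names and default parameters are hypothetical.

```python
import math
from typing import Callable, Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def paraphrase_candidates(
    cp: str,
    vectors: Dict[str, Sequence[float]],
    is_verb: Callable[[str], bool],
    top_n: int = 30,
    max_verbs: int = 10,
) -> List[str]:
    """Return up to `max_verbs` verbs among the `top_n` nearest neighbours of `cp`."""
    target = vectors[cp]
    # Rank every other vocabulary item by similarity to the CP token.
    neighbours = sorted(
        (word for word in vectors if word != cp),
        key=lambda word: cosine(target, vectors[word]),
        reverse=True,
    )[:top_n]
    # Keep only single verbs, preserving the similarity order.
    return [word for word in neighbours if is_verb(word)][:max_verbs]
```

Because the neighbourhood is defined by distributional similarity alone, the returned list can contain antonyms or unrelated verbs, which is why the manual annotation described in Section 3.4 is applied to every candidate.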

Figure 1: The number of single predicative verb paraphrases and antonymous verbs per CPs in the ParaDi dictionary.

4 Dictionary of Paraphrases

    'lverb': 'zaznamenat',
    [{'noun': 'propad',
      'morph': '4',
      'synonyms': [
          {'lemma': 'poklesnout'},
          {'lemma': 'klesnout'},
          {'lemma': 'propadnout', 'reflexive': 'se'}
      ],
      'antonyms': [
          {'lemma': 'stoupnout'}
      ],
      ...
    ]
    }

Figure 2: The lexical representation of the CP zaznamenat propad ‘to record a slump’.
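As a rough illustration of how an entry shaped like the one in Figure 2 might be consumed, the sketch below parses a small JSON fragment and lists the synonymous paraphrases of one CP, attaching the reflexive morpheme where the entry requires it. The exact top-level layout of the released file is not fully recoverable from the figure, so the `nouns` key that holds the list of predicative nouns, the `SAMPLE` string and both helper functions are assumptions made for this example only.

```python
import json
from typing import Dict, List

# Hypothetical fragment modelled on Figure 2; the "nouns" key is an assumption.
SAMPLE = """
[{"lverb": "zaznamenat",
  "nouns": [{"noun": "propad",
             "morph": "4",
             "synonyms": [{"lemma": "poklesnout"},
                          {"lemma": "klesnout"},
                          {"lemma": "propadnout", "reflexive": "se"}],
             "antonyms": [{"lemma": "stoupnout"}]}]}]
"""


def render(verb: Dict[str, str]) -> str:
    """Return the verb lemma, followed by its reflexive morpheme if any."""
    reflexive = verb.get("reflexive")
    return f"{verb['lemma']} {reflexive}" if reflexive else verb["lemma"]


def synonyms(entries: List[dict], lverb: str, noun: str) -> List[str]:
    """List synonymous paraphrases of the CP `lverb` + `noun`, in stored order."""
    for entry in entries:
        if entry["lverb"] != lverb:
            continue
        for cp in entry["nouns"]:
            if cp["noun"] == noun:
                return [render(v) for v in cp.get("synonyms", [])]
    return []  # the CP is not in the dictionary


entries = json.loads(SAMPLE)
```

With this layout, `synonyms(entries, "zaznamenat", "propad")` returns the three paraphrases shown in the figure, with the reflexive morpheme attached to the last one; an unknown CP simply yields an empty list.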

2,235 single predicative verbs indicated by the annotators as synonymous or antonymous verbs of 999 CPs (Section 3.4) form the lexical stock of ParaDi, a dictionary of single verb paraphrases of Czech CPs. The format of the ParaDi dictionary has been designed with respect to both human and machine readability. The dictionary is represented in JSON, as it is a flexible and language-independent data format.

The lexical entries in the dictionary describe individual light verbs. Under light verb keys, all predicative nouns constituting CPs with the given light verb are listed. The predicative nouns are lemmatized; the information on their morphology is included under their morph keys, the values of which are prepositionless and prepositional cases.

Each CP in the lexical entry might be assigned one or two lists of single predicative verbs: one for synonymous paraphrases and the other for antonymous verbs. Paraphrases in the lists are sorted based on the distance from their respective CP in the vector space. Moreover, each verb may be assigned one or more of the following features:

• voice change – indicating changes in the morphosyntactic expression of arguments,

• adjective – indicating necessary adjectival modification,

• reflexive – indicating that the reflexive morpheme is necessary.

An illustrative example of the lexical representation of paraphrases in ParaDi is presented in Figure 2. It displays the lexical entry of the CP zaznamenat propad ‘to record a slump’. Under the light verb zaznamenat ‘to record’, there is a list of nouns that combine with this light verb into CPs. In the case of the noun propad ‘slump’, the noun is expressed by the prepositionless accusative. This CP has three single verb paraphrases (poklesnout ‘to decrease’, klesnout ‘to drop’, propadnout se ‘to slump’) and one antonymous verb (stoupnout ‘to increase’). The paraphrase propadnout ‘to slump’ needs to have the reflexive morpheme se.

ParaDi is freely available at the following URL: http://hdl.handle.net/11234/1-1969

5 Machine Translation Experiment

We have taken advantage of the ParaDi dictionary in a machine translation experiment in order to verify its benefit for one of the key NLP tasks. We have selected 50 random CPs from the dictionary. For each of them, we have randomly extracted one sentence from our data containing the given CP. This set of sentences is referred to as BEFORE. By substituting each CP with its first (i.e. closest in the vector space) paraphrase on the basis of the dictionary, we have created a new dataset AFTER.

We have translated both these datasets – BEFORE and AFTER – using two freely available MT systems – Google Translate (GT)⁶ and Moses⁷ in the Czech to English setting.

⁶ http://translate.google.com
⁷ http://quest.ms.mff.cuni.cz/moses/demo.php

We have used crowdsourcing for the evaluation of the resulting translations. Both options were presented in a randomized order and the annotators were instructed to choose whether one translation is better or whether they have the same quality.

We have collected almost 300 comparisons. We measured inter-annotator agreement using Krippendorff’s alpha (Krippendorff, 2007), a reliability coefficient developed to measure the agreement between judges. The inter-annotator agreement has achieved 0.58, i.e. moderate agreement.

The results (see Table 5) are very promising: the annotators preferred translations of AFTER (i.e. with single predicative verbs) to BEFORE (i.e. with CPs) more often than the reverse (45% vs. 30% for Moses, 44% vs. 33% for GT). The results are consistent for both translation systems.

Source    Moses   GT
BEFORE    30%     33%
AFTER     45%     44%
TIE       25%     23%

Table 5: Results of the experiment. The first column shows the source of the better ranked sentence from the pairwise comparison or whether they tied.

However, it is clear from the example in Table 6 that even though the change in the source sentence was minimal, the translations changed substantially as both the translation models are phrase-based. Based on this fact, we can expect that not only the difference in quality between translations of CPs and their respective synonymous verbs was evaluated. This low quality translation was inevitably reflected in lower inter-annotator agreement, typical for machine translation evaluation (Bojar et al., 2013).

Source   BEFORE   Fotbalisté Budějovic opět nedali branku
                  Football players Budějovice again did not give gate
                  Football players of Budějovice didn’t make a goal again
         AFTER    Fotbalisté Budějovic opět neskórovali
                  Football players Budějovice again did not score
                  Football players of Budějovice didn’t score again
GT       BEFORE   Footballers Budejovice again not given goal
         AFTER    Footballers did not score again Budejovice
Moses    BEFORE   Footballers Budějovice again gave the gate
         AFTER    Footballers Budějovice score again

Table 6: An example of the translated sentences. The judges unanimously agreed that AFTER translations are better than BEFORE. Moses translated the CP dát branku literally word by word and the meaning of this translation is far from the meaning of the source sentence.

6 Conclusion

We have presented ParaDi, a semiautomatically created dictionary of single verb paraphrases of Czech complex predicates with light verbs. We have shown that such paraphrases are automatically obtainable from large monolingual data with a manual verification. ParaDi represents the core of such a dictionary, which can be further enriched. We have demonstrated one of its possible applications, namely an experiment with improving machine translation quality. However, the dictionary can be used in many other NLP tasks (text simplification, information retrieval, etc.) and can be similarly created for other languages.

Acknowledgments

The research reported in this paper has been supported by the Czech Science Foundation GA ČR, grant No. GA15-09979S. This work has been using language resources developed and/or stored and/or distributed by the LINDAT-Clarin project of the Ministry of Education, Youth and Sports of the Czech Republic, project No. LM2015071.

References

John Algeo. 1995. Having a look at the expanded predicate. In B. Aarts and Ch. F. Meyer, editors, The Verb in Contemporary English: Theory and Description, pages 203–217. Cambridge University Press, Cambridge.

Margarita Alonso Ramos. 2007. Towards the synthesis of support verb constructions: Distribution of syntactic actants between the verb and the noun. In L. Wanner and I. A. Mel’čuk, editors, Selected Lexical and Grammatical Issues in the Meaning-Text Theory, pages 97–137. John Benjamins Publishing Company, Amsterdam, Philadelphia.

Alex Alsina, Joan Bresnan, and Peter Sells, editors. 1997. Complex Predicates. CSLI Publications, Stanford.

Mengistu Amberber, Brett Baker, and Mark Harvey, editors. 2010. Complex Predicates in Cross-Linguistic Perspective. Cambridge University Press, Cambridge.

Colin Bannard. 2007. A measure of syntactic flexibility for automatically identifying multiword expressions in corpora. In Proceedings of the Workshop on a Broader Perspective on Multiword Expressions, MWE ’07, pages 1–8, Stroudsburg, PA, USA. Association for Computational Linguistics.

Petra Barančíková, Rudolf Rosa, and Aleš Tamchyna. 2014. Improving Evaluation of English-Czech MT through Paraphrasing. In Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, and Joseph Mariani, editors, Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC 2014), pages 596–601, Reykjavík, Iceland. European Language Resources Association.

Eduard Bejček, Eva Hajičová, Jan Hajič, Pavlína Jínová, Václava Kettnerová, Veronika Kolářová, Marie Mikulová, Jiří Mírovský, Anna Nedoluzhko, Jarmila Panevová, Lucie Poláková, Magda Ševčíková, Jan Štěpánek, and Šárka Zikánová. 2013. Prague Dependency Treebank 3.0.

Ondřej Bojar, Zdeněk Žabokrtský, Ondřej Dušek, Petra Galuščáková, Martin Majliš, David Mareček, Jiří Maršík, Michal Novák, Martin Popel, and Aleš Tamchyna. 2011. Czech-English parallel corpus 1.0 (CzEng 1.0). LINDAT/CLARIN digital library at Institute of Formal and Applied Linguistics, Charles University in Prague.

Ondřej Bojar, Christian Buck, Chris Callison-Burch, Christian Federmann, Barry Haddow, Philipp Koehn, Christof Monz, Matt Post, Radu Soricut, and Lucia Specia. 2013. Findings of the 2013 Workshop on Statistical Machine Translation. In Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 1–44, Sofia, Bulgaria, August. Association for Computational Linguistics.

Laurel Brinton and Minoji Akimoto, editors. 1999. Collocational and Idiomatic Aspects of Composite Predicates in the History of English. John Benjamins Publishing Company, Amsterdam, Philadelphia.

Chris Callison-Burch, Philipp Koehn, and Miles Osborne. 2006. Improved Statistical Machine Translation Using Paraphrases. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, HLT-NAACL ’06, pages 17–24, Stroudsburg, PA, USA. Association for Computational Linguistics.

František Čermák, Renata Blatná, Jaroslava Hlaváčová, Jan Kocek, Marie Kopřivová, Michal Křen, Vladimír Petkevič, Věra Schmiedtová, and Michal Šulc. 2000. SYN2000: balanced corpus of written Czech. LINDAT/CLARIN digital library at Institute of Formal and Applied Linguistics, Charles University in Prague.

František Čermák, Jaroslava Hlaváčová, Milena Hnátková, Tomáš Jelínek, Jan Kocek, Marie Kopřivová, Michal Křen, Renata Novotná, Vladimír Petkevič, Věra Schmiedtová, Hana Skoumalová, Johanka Spoustová, Michal Šulc, and Zdeněk Velíšek. 2005. SYN2005: balanced corpus of written Czech. LINDAT/CLARIN digital library at Institute of Formal and Applied Linguistics, Charles University in Prague.

Wei-Te Chen, Claire Bonial, and Martha Palmer. 2015. English light verb construction identification using lexical knowledge. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, AAAI’15, pages 2375–2381. AAAI Press.

Helena de Medeiros Caseli, Carlos Ramisch, Maria das Graças Volpe Nunes, and Aline Villavicencio. 2010. Alignment-based extraction of multiword expressions. Language Resources and Evaluation, 44(1-2):59–77.

Afsaneh Fazly, Ryan North, and Suzanne Stevenson. 2005. Automatically distinguishing literal and figurative usages of highly polysemous verbs. In Proceedings of the ACL-SIGLEX Workshop on Deep Lexical Acquisition, DeepLA ’05, pages 38–47, Stroudsburg, PA, USA. Association for Computational Linguistics.

Atsushi Fujita, Kentaro Furihata, Kentaro Inui, Yuji Matsumoto, and Koichi Takeuchi. 2004. Paraphrasing of Japanese light-verb constructions based on lexical conceptual structure. In Proceedings of the Workshop on Multiword Expressions: Integrating Processing, MWE ’04, pages 9–16, Stroudsburg, PA, USA. Association for Computational Linguistics.

Juri Ganitkevitch and Chris Callison-Burch. 2014. The Multilingual Paraphrase Database. In The 9th edition of the Language Resources and Evaluation Conference, Reykjavik, Iceland, May. European Language Resources Association.

Juri Ganitkevitch, Benjamin Van Durme, and Chris Callison-Burch. 2013. PPDB: The Paraphrase Database. In Proceedings of the 2013 Conference of

8 the North American Chapter of the Association for Klaus Krippendorff. 2007. Computing Krippendorff’s Computational Linguistics: Human Language Tech- Alpha Reliability. Technical report, University of nologies, pages 758–764, Atlanta, Georgia, June. Pennsylvania, Annenberg School for Communica- Association for Computational Linguistics. tion.

Jan Hajic,ˇ Eva Hajicovˇ a,´ Jarmila Panevova,´ Petr Nitin Madnani and Bonnie J. Dorr. 2013. Generat- Sgall, Ondrejˇ Bojar, Silvie Cinkova,´ Eva Fucˇ´ıkova,´ ing Targeted Paraphrases for Improved Translation. Marie Mikulova,´ Petr Pajas, Jan Popelka, Jirˇ´ı Se- ACM Trans. Intell. Syst. Technol., 4(3):40:1–40:25, mecky,´ Jana Sindlerovˇ a,´ Jan Stˇ epˇ anek,´ Josef Toman, July. Zdenkaˇ Uresovˇ a,´ and Zdenekˇ Zabokrtskˇ y.´ 2012. Announcing prague czech-english dependency tree- Yuval Marton, Chris Callison-Burch, and Philip bank 2.0. In Proceedings of the 8th International Resnik. 2009. Improved Statistical Machine Trans- Conference on Language Resources and Evaluation lation Using Monolingually-derived Paraphrases. In (LREC 2012), pages 3153–3160, Istanbul,˙ Turkey. Proceedings of the 2009 Conference on Empirical ELRA, European Language Resources Association. Methods in Natural Language Processing: Volume 1 - Volume 1, EMNLP ’09, pages 381–390, Strouds- Zellig Harris. 1954. Distributional structure. Word, burg, PA, USA. Association for Computational Lin- 10(23):146–162. guistics.

Jena D. Hwang, Archna Bhatia, Claire Bonial, Aous Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Mansouri, Ashwini Vaidya, Nianwen Xue, and Dean. 2013. Efficient estimation of word represen- Martha Palmer. 2010. PropBank Annotation of tations in vector space. CoRR, abs/1301.3781. Multilingual Light Verb Constructions. In Proceed- ings of the Fourth Linguistic Annotation Workshop, Radim Rehˇ u˚rekˇ and Petr Sojka. 2010. Software pages 82–90, Uppsala, Sweden, July. Association Framework for Topic Modelling with Large Cor- for Computational Linguistics. pora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45– Otto Jespersen. 1965. A Modern on 50, Valletta, Malta, May. ELRA. Historical Principles VI., Morphology. A Modern English Grammar on Historical Principles. George Ivan A. Sag, Timothy Baldwin, Francis Bond, Ann Allen & Unwin Ltd., London. Copestake, and Dan Flickinger. 2002. Multiword David Kauchak and Regina Barzilay. 2006. Para- Expressions: A Pain in the Neck for NLP. In In phrasing for Automatic Evaluation. In Proceedings Proc. of the 3rd International Conference on Intelli- of the Main Conference on Human Language Tech- gent Text Processing and Computational Linguistics nology Conference of the North American Chap- (CICLing-2002, pages 1–15. ter of the Association of Computational Linguistics, HLT-NAACL ’06, pages 455–462, Stroudsburg, PA, R. Mahesh K. Sinha. 2009. Mining Complex Predi- USA. Association for Computational Linguistics. cates in Hindi Using a Parallel Hindi-English Cor- pus. In Proceedings of the Workshop on Multiword Vaclava´ Kettnerova,´ Marketa´ Lopatkova,´ Eduard Expressions: Identification, Interpretation, Disam- Bejcek,ˇ Anna Vernerova,´ and Marie Podobova.´ biguation and Applications, MWE ’09, pages 40– 2013. Corpus Based Identification of Czech 46, Stroudsburg, PA, USA. Association for Compu- Light Verbs. In Katar´ına Gajdosovˇ a´ and Adriana´ tational Linguistics. 
Zˇ akov´ a,´ editors, Proceedings of the Seventh Interna- tional Conference Slovko 2013; Natural Language Jana Strakova,´ Milan Straka, and Jan Hajic.ˇ 2014. Processing, Corpus Linguistics, E-learning, pages Open-Source Tools for Morphology, Lemmatiza- 118–128, Ludenscheid,¨ Germany. Slovak National tion, POS Tagging and Named Entity Recognition. Corpus, L’. Stˇ ur´ Institute of Linguistics, Slovak In Proceedings of 52nd Annual Meeting of the As- Academy of Sciences, RAM-Verlag. sociation for Computational Linguistics: System Demonstrations, pages 13–18, Baltimore, Mary- Vaclava´ Kettnerova,´ Petra Barancˇ´ıkova,´ and Marketa´ land, June. Association for Computational Linguis- Lopatkova.´ 2016. Lexicographic Description of tics. Czech Complex Predicates: Between Lexicon and Grammar. In Proceedings of the XVII EURALEX Veronika Vincze and Janos´ Csirik. 2010. Hungarian International Congress. Corpus of Light Verb Constructions. In Proceed- ings of the 23rd International Conference on Com- Michal Kren,ˇ Toma´sˇ Barton,ˇ Vaclav´ Cvrcek,ˇ Milena putational Linguistics, COLING ’10, pages 1110– Hnatkov´ a,´ Toma´sˇ Jel´ınek, Jan Kocek, Renata 1118, Stroudsburg, PA, USA. Association for Com- Novotna,´ Vladim´ır Petkevic,ˇ Pavel Prochazka,´ Veraˇ putational Linguistics. Schmiedtova,´ and Hana Skoumalova.´ 2010. SYN2010: balanced corpus of written Czech. LIN- Alexander K. Zolkovskijˇ and Igor A. Mel’cuk.ˇ DAT/CLARIN digital library at Institute of For- 1965. O vozmoznomˇ metode i instrumentax se- mal and Applied Linguistics, Charles University in manticeskogoˇ sinteza. Naucno-texniˇ ceskajaˇ infor- Prague. macija, (6).

Peter Wallis. 1993. Information Retrieval based on Paraphrase.

Sina Zarrieß and Jonas Kuhn. 2009. Exploiting Translational Correspondences for Pattern-independent MWE Identification. In Proceedings of the Workshop on Multiword Expressions: Identification, Interpretation, Disambiguation and Applications, MWE ’09, pages 23–30, Stroudsburg, PA, USA. Association for Computational Linguistics.

Liang Zhou, Chin-Yew Lin, and Eduard Hovy. 2006. Re-evaluating Machine Translation Results with Paraphrase Support. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, EMNLP ’06, pages 77–84, Stroudsburg, PA, USA. Association for Computational Linguistics.
