A Resource of Corpus-Derived Typed Predicate Argument Structures for Croatian

CROATPAS: A Resource of Corpus-derived Typed Predicate Argument Structures for Croatian

Costanza Marini, University of Pavia, Department of Humanities, costanza.marini01@universitadipavia.it
Elisabetta Ježek, University of Pavia, Department of Humanities, [email protected]

Abstract

The goal of this paper is to introduce CROATPAS, the Croatian sister project of the Italian Typed-Predicate Argument Structure resource (TPAS1, Ježek et al. 2014). CROATPAS is a corpus-based digital collection of verb valency structures with the addition of semantic type specifications (SemTypes) to each argument slot, which is currently being developed at the University of Pavia. Salient verbal patterns are discovered following a lexicographical methodology called Corpus Pattern Analysis (CPA, Hanks 2004 & 2012; Hanks & Pustejovsky 2005; Hanks et al. 2015), whereas SemTypes – such as [HUMAN], [ENTITY] or [ANIMAL] – are taken from a shallow ontology shared by both TPAS and the Pattern Dictionary of English Verbs (PDEV2, Hanks & Pustejovsky 2005; El Maarouf et al. 2014). The theoretical framework the resource relies on is Pustejovsky's Generative Lexicon theory (1995 & 1998; Pustejovsky & Ježek 2008), in light of which verbal polysemy and metonymic argument shifts can be traced back to compositional operations involving the variation of the SemTypes associated to the valency structure of each verb. The corpus used to identify verb patterns in CROATPAS is the Croatian Web as Corpus (hrWac 2.2, RELDI PoS-tagged) (Ljubešić & Erjavec 2011), which contains 1.2 billion types and is available on the Sketch Engine3 (Kilgarriff et al. 2014). The potential uses and purposes of the resource range from multilingual pattern linking between compatible resources to computer-assisted language learning (CALL).

1 Introduction

Nowadays, we live in a time when digital tools and resources for language technology are constantly mushrooming all around the world. However, we should remind ourselves that some languages need our attention more than others if they are not to face – to put it in Rehm and Hegele's words – "a steadily increasing and rather severe threat of digital extinction" (2018: 3282). According to the findings of initiatives such as the META-NET White Paper Series (Tadić et al. 2012; Rehm et al. 2014), we can state that Croatian is unfortunately among the 21 out of 24 official languages of the European Union that are currently considered under-resourced. As a matter of fact, Croatian "tools and resources for […] deep parsing, machine translation, text semantics, discourse processing, language generation, dialogue management simply do not exist" (Tadić et al. 2012: 77). This observation is only strengthened by the update study carried out by Rehm et al. (2014), which shows that, in comparison with other European languages, Croatian has weak to no support as far as text analytics technologies go and only fragmentary support when it comes to resources such as corpora, lexical resources and grammars. In this framework, a semantic resource such as CROATPAS could play its part not only in NLP (e.g. multilingual pattern linking with other existing compatible resources), but also in automatic machine translation, computer-assisted language learning (CALL) and theoretical and applied cross-linguistic studies.

The paper is structured as follows: first, a detailed overview of the resource is presented (Section 2), followed by its theoretical underpinnings (Section 3) and a summary of the Croatian-specific challenges we faced while building the resource editor (Section 4). An overview of the existing related works is given in Section 5. Finally, Section 6 hints at the creation of a multilingual resource linking CROATPAS, TPAS (Italian) and PDEV (English) patterns and explores CROATPAS's potential for computer-assisted L2 teaching and learning.

1 http://tpas.fbk.eu (last visited on July 12th 2019)
2 http://pdev.org.uk (last visited on July 12th 2019)
3 https://www.sketchengine.eu/ (last visited on July 12th 2019)

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

2 Resource overview

CROATPAS, i.e. the Croatian Typed-Predicate Argument Structure resource, is the Croatian equivalent of the Italian TPAS resource (Ježek et al. 2014): a corpus-derived collection of Croatian verb argument structures whose argument slots have been annotated using semantic type specifications (SemTypes). The first version of the resource is currently being developed at the University of Pavia with the technical assistance of Lexical Computing Ltd. in the person of Vít Baisa and will be released in 2020 through an Open Access graphical user interface on the website of the Language Centre of the University of Pavia (CLA)4.

4 https://cla.unipv.it/?page_id=53723 (last visited on July 12th 2019)

CROATPAS contains a sample of 100 medium-frequency Croatian verbs whose Italian translational counterparts are already available in the TPAS resource: 26 of these verbs are Croatian translational equivalents of Italian "coercive verbs", i.e. verbs that instantiate metonymic shifts in one of their senses (Ježek & Quochi 2010), while the remaining 74 are Croatian translational equivalents of a sample of Italian fundamental verbs, i.e. verbs belonging to that group of approximately 2000 lexemes deemed essential for communicating in Italian and that can be found in any sort of text (De Mauro 2016).

Our 74-verb sample was selected as follows: we first extracted the frequency counts for all the 452 fundamental verbs on De Mauro's list from a reduced version of the ItWAC (Baroni & Kilgarriff, 2006), which contains over 900 million tokens and is available on the Sketch Engine (Kilgarriff et al. 2014). We then selected our 74 Italian candidates around the median frequency value after taking out the first and the last 20 verbs on the list. Finally, the Croatian translational equivalents for these verbs were chosen using the 2017 Zanichelli Italian/Croatian bilingual dictionary Croato compatto, edited by Aleksandra Špikić.
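As a rough sketch of this selection step (not the script actually used for CROATPAS), the following Python snippet ranks lemmas by corpus frequency, trims the 20 most and 20 least frequent ones, and keeps the lemmas closest to the median rank; the frequency figures and the helper name select_medium_frequency are invented placeholders, not the real ItWAC counts.

```python
# Illustrative sketch only: pick medium-frequency lemmas "around the median"
# after removing the 20 most and 20 least frequent verbs from the ranked list.
# The counts below are invented placeholders, not actual ItWAC frequencies.

freq_counts = {
    "essere": 9_500_000, "avere": 7_200_000, "bere": 310_000,
    "mangiare": 280_000, "dipingere": 95_000,   # ... 452 lemmas in the real list
}

def select_medium_frequency(counts: dict[str, int],
                            trim: int = 20, sample_size: int = 74) -> list[str]:
    """Return `sample_size` lemmas centred on the median frequency rank."""
    ranked = sorted(counts, key=counts.get, reverse=True)    # most frequent first
    trimmed = ranked[trim:-trim] if len(ranked) > 2 * trim else ranked
    mid = len(trimmed) // 2                                  # median rank
    start = max(0, mid - sample_size // 2)
    return trimmed[start:start + sample_size]

candidates = select_medium_frequency(freq_counts)
```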
The theoretical framework the resource relies on is Pustejovsky's Generative Lexicon theory (1995 & 1998; Pustejovsky & Ježek 2008), in light of which verbal polysemy and metonymic shifts can be traced back to compositional operations involving the contextual variation of the SemTypes associated to the valency structure of each verb.

CROATPAS rests on four key components, namely:

1) a representative corpus of Croatian;
2) a shallow ontology of SemTypes;
3) a methodology for corpus analysis;
4) adequate corpus tools.

As for the first component, the corpus used to identify verb patterns is the Croatian Web as Corpus (hrWac 2.2, RELDI PoS-tagged) (Ljubešić & Erjavec, 2011), containing 1.2 billion types and available on the Sketch Engine (Kilgarriff et al. 2014). We chose to work with the Croatian Web as Corpus since the reference corpus for the Italian TPAS resource is a reduced version of the Italian Web as Corpus (Baroni & Kilgarriff, 2006), so as to make the two resources as comparable as possible.

As for the shallow ontology of Semantic Type labels, CROATPAS is based on the same hierarchy of 180 SemTypes shared by TPAS and the PDEV project, which originates from the Brandeis Shallow Ontology (BSO) (Pustejovsky et al. 2004) and its initial 65 labels. As pointed out by Ježek (2014: 890), SemTypes "are not abstract categories but semantic classes discovered by generalizing over the statistically relevant list of collocates that fill each position". For example, the Croatian lexical set for the SemType [BEVERAGE] in the context of the verb pair PITI/POPITI (= TO DRINK, imperfective/perfective) contains, among others: {vodu = water, kavu = coffee, koktel = cocktail, vino = wine, čaj = tea, pivo = beer, limonadu = lemonade}, as shown in the following pattern string from the resource.

Figure 1 – One of the pattern strings of PITI
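Since Figure 1 itself cannot be reproduced here, the Python sketch below gives an informal picture of the kind of information such a pattern string carries: a verb, its argument slots with case and SemType, and the corpus-derived lexical set filling each slot. The class and field names are ours for illustration only and do not reflect CROATPAS's internal data format.

```python
from dataclasses import dataclass, field

@dataclass
class ArgumentSlot:
    """One argument position of a verbal pattern."""
    case: str                                        # e.g. "NOM", "ACC"
    sem_type: str                                    # SemType label, e.g. "[BEVERAGE]"
    lexical_set: set[str] = field(default_factory=set)  # corpus-derived collocates

@dataclass
class Pattern:
    """A typed predicate-argument structure for one verb sense."""
    verb: str
    slots: list[ArgumentSlot]
    gloss: str                                       # informal description of the sense

# The [BEVERAGE] pattern of PITI/POPITI discussed in the text.
piti_beverage = Pattern(
    verb="piti/popiti",
    slots=[
        ArgumentSlot("NOM", "[HUMAN]"),
        ArgumentSlot("ACC", "[BEVERAGE]",
                     {"vodu", "kavu", "koktel", "vino", "čaj", "pivo", "limonadu"}),
    ],
    gloss="[[HUMAN]] drinks [[BEVERAGE]]",
)
```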
The corpus analysis methodology used for both TPAS and CROATPAS is a lexicographical methodology called Corpus Pattern Analysis (CPA, Hanks 2004 & 2012; Hanks & Pustejovsky 2005; Hanks et al. 2015), which is based on the Theory of Norms and Exploitations (TNE, Hanks 2004, 2013). TNE divides word uses into two main classes: conventional uses (norms) and deviations from the norms (exploitations). CPA's potential lies in the fact that it does not try to identify meaning in isolation, but rather associates it with prototypical contexts, thus focusing on the norms. The standard CPA procedure requires:

1) sampling concordances for each verb;
2) identifying its typical patterns – i.e. senses – while going through the corpus lines;
3) assigning SemTypes to the argument slots.

[…] and are able to provide different interpretations for a wide range of lexical phenomena. The principle of co-composition, for instance, offers an alternative take on verbal polysemy with respect to traditional accounts. If we consider lexical items expressing verb arguments to be as semantically active and influential as the verb itself (Pustejovsky 2002: 421), we do not need to think of verbs as polysemous, but rather conceive their meaning as contextually defined by the SemTypes of the surrounding arguments. For instance, if we apply this reasoning to the Croatian verb pair PITI/POPITI (= TO DRINK, imperfective/perfective), we can notice how its meaning changes depending on what is said to be "drunk", namely a [BEVERAGE] (1), a [DRUG] (2) or a {GOAL} (3).

(1) [[HUMAN]NOM] PIJE [[BEVERAGE]ACC]
Djeca ne piju kavu.
Children don't drink coffee.
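Read computationally, co-composition amounts to letting the object's SemType pick out the relevant pattern of the verb. The toy Python mapping below merely restates the (1)–(3) listing above; since examples (2) and (3) are not reproduced in this excerpt, their entries are left as bare sense labels, and the fallback case echoes TNE's notion of exploitation. It is our own illustration, not part of the resource.

```python
# Toy reading of co-composition for PITI/POPITI: the object's SemType selects
# which of the senses (1)-(3) mentioned in the text applies. Only sense (1)'s
# pattern string is shown in this excerpt, so (2) and (3) stay as labels.
SENSE_BY_OBJECT_TYPE = {
    "[BEVERAGE]": "sense (1): [[HUMAN]NOM] PIJE [[BEVERAGE]ACC]",
    "[DRUG]": "sense (2)",
    "{GOAL}": "sense (3)",
}

def interpret_piti(object_sem_type: str) -> str:
    """Map the object's SemType to the corresponding sense of PITI/POPITI."""
    return SENSE_BY_OBJECT_TYPE.get(
        object_sem_type,
        "no matching norm: a candidate exploitation in TNE terms",
    )

print(interpret_piti("[BEVERAGE]"))   # -> sense (1): [[HUMAN]NOM] PIJE [[BEVERAGE]ACC]
```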