IJCoL Italian Journal of Computational Linguistics

3-1 | 2017 Emerging Topics at the Third Italian Conference on Computational Linguistics and EVALITA 2016

Electronic version
URL: http://journals.openedition.org/ijcol/411
DOI: 10.4000/ijcol.411
ISSN: 2499-4553

Publisher: Accademia University Press

Electronic reference
IJCoL, 3-1 | 2017, “Emerging Topics at the Third Italian Conference on Computational Linguistics and EVALITA 2016” [Online], online since 01 June 2017, connection on 28 January 2021. URL: http://journals.openedition.org/ijcol/411; DOI: https://doi.org/10.4000/ijcol.411

IJCoL is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

editors in chief

Roberto Basili, Università degli Studi di Roma Tor Vergata
Simonetta Montemagni, Istituto di Linguistica Computazionale “Antonio Zampolli” - CNR

advisory board

Giuseppe Attardi, Università degli Studi di Pisa (Italy)
Nicoletta Calzolari, Istituto di Linguistica Computazionale “Antonio Zampolli” - CNR (Italy)
Nick Campbell, Trinity College Dublin (Ireland)
Piero Cosi, Istituto di Scienze e Tecnologie della Cognizione - CNR (Italy)
Giacomo Ferrari, Università degli Studi del Piemonte Orientale (Italy)
Eduard Hovy, Carnegie Mellon University (USA)
Paola Merlo, Université de Genève (Switzerland)
John Nerbonne, University of Groningen (The Netherlands)
Joakim Nivre, Uppsala University (Sweden)
Maria Teresa Pazienza, Università degli Studi di Roma Tor Vergata (Italy)
Hinrich Schütze, University of Munich (Germany)
Marc Steedman, University of Edinburgh (United Kingdom)
Oliviero Stock, Fondazione Bruno Kessler, Trento (Italy)
Jun-ichi Tsujii, Artificial Intelligence Research Center, Tokyo (Japan)

editorial board

Cristina Bosco, Università degli Studi di Torino (Italy)
Franco Cutugno, Università degli Studi di Napoli (Italy)
Felice Dell’Orletta, Istituto di Linguistica Computazionale “Antonio Zampolli” - CNR (Italy)
Rodolfo Delmonte, Università degli Studi di Venezia (Italy)
Marcello Federico, Fondazione Bruno Kessler, Trento (Italy)
Alessandro Lenci, Università degli Studi di Pisa (Italy)
Bernardo Magnini, Fondazione Bruno Kessler, Trento (Italy)
Johanna Monti, Università degli Studi di Sassari (Italy)
Alessandro Moschitti, Università degli Studi di Trento (Italy)
Roberto Navigli, Università degli Studi di Roma “La Sapienza” (Italy)
Malvina Nissim, University of Groningen (The Netherlands)
Roberto Pieraccini, Jibo, Inc., Redwood City, CA, and Boston, MA (USA)
Vito Pirrelli, Istituto di Linguistica Computazionale “Antonio Zampolli” - CNR (Italy)
Giorgio Satta, Università degli Studi di Padova (Italy)
Gianni Semeraro, Università degli Studi di Bari (Italy)
Carlo Strapparava, Fondazione Bruno Kessler, Trento (Italy)
Fabio Tamburini, Università degli Studi di Bologna (Italy)
Paola Velardi, Università degli Studi di Roma “La Sapienza” (Italy)
Guido Vetere, Centro Studi Avanzati IBM Italia (Italy)
Fabio Massimo Zanzotto, Università degli Studi di Roma Tor Vergata (Italy)

editorial office

Danilo Croce, Università degli Studi di Roma Tor Vergata
Sara Goggi, Istituto di Linguistica Computazionale “Antonio Zampolli” - CNR
Manuela Speranza, Fondazione Bruno Kessler, Trento

Registered with the Court of Trento, no. 14/16 of 6 July 2016

Biannual journal of the Associazione Italiana di Linguistica Computazionale (AILC)
© 2017 Associazione Italiana di Linguistica Computazionale (AILC)


managing editor (direttore responsabile): Michele Arnese

Publication made available under the terms of the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 license




ISBN 978-88-99982-64-5

Accademia University Press
via Carlo Alberto 55, I-10123 Torino
[email protected]
www.aAccademia.it/IJCoL_3_1


Accademia University Press is a registered trademark owned by LEXIS Compagnia Editoriale in Torino srl.

IJCoL, Volume 3, Number 1 - June 2017

Emerging Topics at the Third Italian Conference on Computational Linguistics and EVALITA 2016

CONTENTS

Editorial Note
Roberto Basili, Simonetta Montemagni 7

Panta rei: Tracking Semantic Change with Distributional Semantics in Ancient Greek
Martina A. Rodda, Marco S. G. Senaldi, Alessandro Lenci 11

Distributed Representations of Lexical Sets and Prototypes in Causal Alternation Verbs
Edoardo Maria Ponti, Elisabetta Jezek, Bernardo Magnini 25

Determining the Compositionality of Noun-Adjective Pairs with Lexical Variants and Distributional Semantics
Marco S. G. Senaldi, Gianluca E. Lebani, Alessandro Lenci 43

LU4R: adaptive spoken Language Understanding For Robots
Andrea Vanzo, Danilo Croce, Roberto Basili, Daniele Nardi 59

For a performance-oriented notion of regularity in inflection: the case of Modern Greek conjugation
Stavros Bompolas, Marcello Ferro, Claudia Marzi, Franco Alberto Cardillo, Vito Pirrelli 77

EVALITA Goes Social: Tasks, Data, and Community at the 2016 Edition
Pierpaolo Basile, Francesco Cutugno, Malvina Nissim, Viviana Patti, Rachele Sprugnoli 93

Editorial Note

Roberto Basili∗, Università di Roma Tor Vergata
Simonetta Montemagni∗∗, ILC-CNR, Pisa

Here we are at the fourth issue of the Italian Journal of Computational Linguistics (IJCoL), the Italian journal of computational linguistics published by the Associazione Italiana di Linguistica Computazionale (AILC - www.ai-lc.it). Now in its third year of publication, the journal is establishing itself as an important venue for promoting and disseminating the computational linguistics research carried out within the national community from different and complementary perspectives, which bring together, in a dialectical relationship, the humanistic and the mathematical-formal points of view, the theoretical and the applied.

The increasingly central role that language plays in today’s communication processes is confirmed daily by reflections and announcements that even the traditional media have embraced. The essentially linguistic nature of communication in online social networks, as well as the competitive advantage that industry now associates with the computational capacity to process language in ever-growing volumes of data, together confirm the strategic importance of research in this area, which promises to be fertile ground for social, industrial and economic innovation. The topics revolving around language and computation provide unique opportunities to understand, on the one hand, the ways in which machines can process (written and spoken) linguistic productions and support advanced applications and, on the other, to contribute to an ever better understanding of the ways in which language works and changes over time, across space and through different channels and means of communication. The contributions in this volume connect well with these two dimensions of research, in line with the spirit of the journal, which aims to be a forum where the different souls of computational linguistics enter into dialogue with one another.

Following the tradition of the first two issues, this is a miscellaneous volume collecting research work by young researchers who emerged at the CLiC-it 2016 conference, held in Naples on 5-6 December 2016, as particularly promising figures in the landscape of Italian computational linguistics. These contributions, selected from across the thematic areas of the conference, testify to original and innovative lines of research within the Italian community, and especially among its youngest protagonists. The articles were selected through an iterative peer-review process: each article underwent three rounds of evaluation by different committees, as a contribution to the conference, as a candidate for the CLiC-it 2016 “Best Young Paper” and “Distinguished Young Paper” awards and, finally, in its extended version, as a scientific journal article. These articles are joined by an invited contribution offering a survey of, and a critical reflection on, the experience of EVALITA 2016, the evaluation campaign of language technologies for written and spoken Italian which, as per tradition, helps take stock of the effectiveness and quality of methods for analyzing Italian across a rich and varied repertoire of application tasks.

∗ Dept. of Enterprise Engineering - Via del Politecnico 1, 00133 Rome. E-mail: [email protected]
∗∗ Istituto di Linguistica Computazionale “Antonio Zampolli” (ILC-CNR) - Via Moruzzi 1, 56124 Pisa. E-mail: [email protected]

In its 2016 edition as well, EVALITA served as an essential link between the empirical and applied side of research and methodological and theoretical reflection.

The contributions in this volume are organized into two macro-sections: the first collects original and innovative research contributions, while the second provides an account of the EVALITA 2016 evaluation campaign and outlines its lines of development for the future. Within the first macro-section, the first four contributions share the use of distributional semantic models: it is interesting to note that in 2016 Distributional Semantics turned out to be the keyword most frequently used by authors to characterize their CLiC-it contributions. It is no surprise, then, that 4 of the 5 selected contributions fall within this research paradigm. They differ at various levels, ranging from the goals of the research and the languages addressed, to the models of the statistical distributions of words in corpora, to the semantic properties considered and the characterization of the linguistic contexts used to determine the semantic spaces. Here we limit ourselves to pointing out the different scenarios, both theoretical and applied, within which such techniques have been employed.

The work by Rodda, Senaldi and Lenci proposes a method for studying semantic change based on the variation of distributional semantic spaces, demonstrating the potential contribution of distributional semantics from a diachronic perspective. The method was tested in a study aimed at identifying the areas affected by semantic change in the Ancient Greek lexicon between the pre-Christian and Christian era.

The article by Ponti, Jezek and Magnini uses distributional semantic techniques to explore open linguistic questions on the syntax-semantics interface, concerning the lexical-semantic properties of verbal arguments. The study, conducted on the class of verbs showing the causative-inchoative alternation, brought to light important differences in the properties of arguments, which represent non-uniform categories whose distribution around a prototype varies significantly across argument positions, suggesting the existence of different selectional restrictions.

In the contribution by Senaldi, Lebani and Lenci, the distributional similarity between the vector of a given expression and the vectors of its lexical variants underlies a set of indices aimed at discriminating between idiomatic and compositional expressions: the proposed indices are shown to be more effective than those previously put forward in the distributional literature on compositionality.

Unlike the previous contributions, which show how distributional semantic models can significantly help shed light on open questions in theoretical linguistics, Vanzo, Croce, Basili and Nardi use distributional semantics within an applied Human-Robot Interaction scenario. In particular, they show how it plays a significant role in the structured learning of a Semantic Role Labeling process for robotic interfaces.

The macro-section closes with the contribution by Bompolas, Ferro, Marzi, Cardillo and Pirrelli, which shows the role and impact of approaches based on the notion of paradigm for word processing and learning. With experiments conducted on three languages (Modern Greek, Italian and German), the authors show that different verb classes are learned as a function of their degree of transparency and predictability. These results, in line with psycholinguistic evidence, significantly strengthen the hypothesis of the mental lexicon as an emergent integrative system.


The volume closes with the invited contribution by the organizers of the fifth edition of the EVALITA evaluation campaign. In 2016, six shared tasks, some of them new, were organized together with a competition sponsored by IBM, attracting 34 participants overall. One of the novelties of this edition is the focus on social media language, which led to the creation of a resource with multi-level annotations (PoS tags, sentiment information, named entities and linking, and factuality information) that was shared across different tasks, also allowing for greater interplay among them. As for the methods and techniques underlying the systems that participated in EVALITA 2016, it is interesting to note that Deep Learning turned out to be the keyword most frequently used by authors to characterize their contributions, which also shows the complementarity, in 2016, of EVALITA with respect to CLiC-it. Building on the analysis of the 2016 edition, the organizers of EVALITA conclude by outlining interesting prospects for the development of the evaluation campaign in the years to come. To the reader, then, we leave the pleasure of exploring the themes and stimuli that the journal, in this issue as well, continues to gather.

Editorial Note Summary

We are pleased to announce the fourth issue of the Italian Journal of Computational Linguistics (IJCoL), published by the Italian Association of Computational Linguistics (AILC, www.ai-lc.it). The journal, in its third year of publication, is becoming an important venue for promoting and disseminating research results achieved by the Italian computational linguistics community from different and complementary perspectives (e.g. humanistic vs. computational, or theoretical vs. applied).

The increasingly central role that language plays in today’s communication processes is acknowledged daily. The essentially linguistic nature of communication in social networks, as well as the competitive advantage that industry today associates with the computational capacity to handle language in big data, together confirm the strategic importance of research in this area, creating a fertile ground for social, industrial and economic innovation. The topics revolving around language and computation provide unique opportunities, on the one hand, to understand the ways in which machines can process language productions (both written and oral) and to define advanced applications and, on the other hand, to contribute to a deeper understanding of the ways in which language works and changes over time, across space and through different channels and means of communication. Contributions to this volume are well linked to these two dimensions of research, in line with the spirit of the journal, which presents itself as a forum where the different souls of computational linguistics confront each other and are combined.

As in the case of the first two IJCoL issues, this is a miscellaneous volume that collects research work by young researchers who emerged as particularly promising within the CLiC-it 2016 Conference, held in Naples on 5-6 December 2016. These contributions testify to original and innovative research lines of the Italian computational linguistics community, and in particular of its youngest protagonists. The articles were evaluated through an iterative peer-review process carried out by different committees: as a contribution to the CLiC-it conference; as a candidate for the CLiC-it 2016 “Best Young Paper” and “Distinguished Young Paper” awards; and finally, in its extended version, as a scientific journal article. This issue also includes an invited contribution by the organizers of the EVALITA 2016 evaluation campaign.


The contributions in this issue are organized into two macro-sections, the first collecting original and innovative research contributions, and the second providing an overview of EVALITA 2016 and outlining future developments.

The first four papers illustrate different applications of distributional semantic models: it is interesting to note that in 2016 Distributional Semantics was the keyword most frequently used by CLiC-it authors to characterize their contributions. The papers differ at various levels, ranging from the research goal and the language(s) dealt with to the computational techniques used to model word co-occurrence statistics, the semantic properties taken into account and the contexts used in determining semantic spaces.

The work by Rodda, Senaldi and Lenci proposes an innovative method for studying semantic change starting from the variation of distributional semantic spaces, thus demonstrating the potential contribution of distributional semantics to diachronic studies. The method was tested in a case study aimed at identifying the areas of semantic change in the Ancient Greek lexicon between the pre-Christian and Christian era.

The paper by Ponti, Jezek and Magnini uses distributional semantics to investigate open linguistic issues on the syntax-semantics interface, with particular emphasis on the lexical-semantic properties of verbal arguments. The study, carried out on the class of causative-inchoative verbs, has brought to light important differences in the properties of arguments.

In the contribution by Senaldi, Lebani and Lenci, the distributional similarity between the vector of a given expression and the vectors of its lexical variants is used as the basis of a set of indices aimed at discriminating between idiomatic and compositional expressions; these indices turned out to be more effective than those proposed so far in the distributional literature on compositionality.

Unlike the previous contributions, which show how distributional semantic models can significantly contribute to shedding light on open theoretical linguistics issues, Vanzo, Croce, Basili and Nardi use distributional semantic models within an application scenario: Human-Robot Interaction. In particular, they show the beneficial impact of distributional semantic lexicons on the structured learning of a Semantic Role Labeling component in robotic interfaces.

This section is closed by the paper by Bompolas, Ferro, Marzi, Cardillo and Pirrelli, who demonstrate the role and impact of paradigm-based approaches to word processing and learning. The results of a case study simulating the acquisition of Modern Greek conjugation, compared with evidence from German and Italian, support a view of the mental lexicon as an emergent integrative system.

The volume closes with the invited contribution of the organizers of the fifth edition of the EVALITA evaluation campaign. In 2016, six shared tasks were organized together with a challenge sponsored by IBM, which globally attracted 34 participants. One of the novelties of this edition is the focus on social media language, which led to the creation of a resource with multi-level annotations (PoS tags, sentiment information, named entities and linking, and factuality information) that has been used across different tasks.
As for the methods and techniques underlying the EVALITA 2016 participating systems, it is worth reporting that Deep Learning was the keyword most frequently used by the authors to characterize their systems; this also demonstrates the complementarity of EVALITA with respect to CLiC-it in 2016. EVALITA’s organizers conclude their paper by outlining interesting lines of development for future EVALITA campaigns. The synthetic view provided above does not exhaust the wide range of topics touched upon by the papers in this issue; this leaves the reader the pleasure of discovering the themes and stimuli that the journal, in this issue as well, continues to collect.

Panta rei: Tracking Semantic Change with Distributional Semantics in Ancient Greek

Martina A. Rodda∗, Scuola Normale Superiore di Pisa
Marco S. G. Senaldi∗∗, Scuola Normale Superiore di Pisa

Alessandro Lenci†, Università di Pisa

We present a method to explore semantic change as a function of variation in distributional semantic spaces. In this paper, we apply this approach to automatically identify the areas of semantic change in the lexicon of Ancient Greek between the pre-Christian and Christian era. Distributional Semantic Models are used to identify meaningful clusters and patterns of semantic shift within a set of target words, defined through a purely data-driven approach. The results emphasize the role played by the diffusion of Christianity and by technical languages in determining semantic change in Ancient Greek and show the potentialities of distributional models in diachronic semantics.

1. Introduction

Distributional Semantics is grounded on the assumption that the meaning of a word can be described as a function of its collocates in a corpus. This suggests that diachronic meaning shifts can be traced through changes in the distribution of these collocates over time (Sagi, Kaufmann, and Clark 2011). While some studies focused on testing the explanatory power of this method over frequency- and syntax-based approaches (Wijaya and Yeniterzi 2011; Kulkarni et al. 2015), more advanced contributions to the field explored how distributional models can be used to test competing hypotheses about semantic change (Xu and Kemp 2015), or to investigate the productivity of constructions in diachrony (Perek 2016). The results attest the explanatory power of distributional methods in modeling diachronic shifts in meaning.

In this paper, we propose a method to identify semantic change through the Representational Similarity Analysis (RSA) (Kriegeskorte and Kievit 2013) of distributional vector spaces built from diachronic corpora. RSA is a method extensively used in neuroscience to test cognitive and computational models by comparing the geometry of their representation spaces (Edelman 1998). Stimuli are represented with a representational dissimilarity matrix that contains a measure of the dissimilarity relations of the stimuli with each other. Different matrices are compared to evaluate the correspondence of the representational spaces built from different sources (e.g., behavioral and neuroimaging data). We argue that this method can be applied to compare distributional representations of the lexicon at different temporal stages. The hypothesis is that the elements in the lexical spaces showing larger geometrical variations over time correspond to the lexical areas that underwent major semantic changes. To the best of our knowledge, this is the first time RSA is used in diachronic distributional semantics.

Here we present a case study that applies RSA to track patterns of semantic change within the lexicon of Ancient Greek. We focus on the first few centuries AD, when the rise of Christianity caused a deep and widespread cultural shift within the Hellenic world. We predict that this shift will be reflected in the Greek lexicon of the time. In contrast with past studies (Boschetti 2009; O'Donnell 2005), we apply a bottom-up approach to the detection of semantic change, with no prior definition of a list of lemmas to be analyzed. The goal is to develop a quantitative “discovery procedure” to detect lexical semantic changes, enabling the researcher to discuss and interpret any meaningful patterns that may arise.

From a methodological standpoint, this study aims to show how Distributional Semantics can be applied fruitfully to such a small and literary corpus as the collection of Ancient Greek texts. The results will also highlight the ways in which Distributional Semantics can complement the intuition of the researcher in analyzing semantic change in Ancient Greek, providing a useful tool for future studies in Classics. A distributional approach seems particularly suited to philological research, as it is already common and intuitive for researchers in this field to determine the exact meaning, usage restrictions, and stylistic connotations of a word by analyzing the context in which it occurs, especially when no other sources (such as ancient lexica) are available. Distributional Semantics provides the tools to perform a similar task not just on a much larger scale, but drawing information from the whole corpus; as such, it has the potential to highlight patterns in semantic change that would not otherwise be noticeable.

∗ Scuola Normale Superiore - Piazza dei Cavalieri 7, 56126 Pisa, Italy. E-mail: [email protected]
∗∗ Scuola Normale Superiore - Piazza dei Cavalieri 7, 56126 Pisa, Italy. E-mail: [email protected]
† CoLing Lab, University of Pisa - Via S. Maria 36, 56126 Pisa, Italy. E-mail: [email protected]

2. Related Work

The past few years have seen the rise of a series of studies tackling diachronic semantic change via computational methods. As pointed out by Sagi, Kaufmann, and Clark (2011), the increasing availability of computational tools for analyzing and manipulating large data sets and corpora allows for testing hypotheses and detecting statistical trends in a large-scale perspective that does not hinge on the intuitions of the linguist or the philologist. Crucially, most of this research has relied on a diachronic application of the distributional hypothesis (Harris 1954) by modeling semantic shift as a change in the co-occurrence patterns of a given lemma over time.

In Sagi, Kaufmann, and Clark (2011)’s proposal, the semantic narrowing or broadening of English words in the 1150-1710 period is modeled as an increase or decrease in density of the vector space populated by all the token occurrences of a given word in the various decades. The mean cosine similarity between all the token vectors of dog, for instance, decreases over time since it shifts from denoting a specific breed of dog to indicating Canis familiaris exemplars in general. Contrariwise, the mean cosine similarity between the token vectors of hound increases through the decades, since it originally meant ‘dog’ in general and ended up referring to dogs bred for hunting.

Gulordava and Baroni (2011) resort to the American English section of the Google Books Ngram corpus, a collection of more than 5 million digitized books published between the sixteenth century and today (Michel et al. 2011), to build vector representations for words at two different time spans (the 60s and the 90s). The cosine similarity between the vector of a given word in the 60s space and the vector of the same word in the 90s space is then used as a measure of semantic shift for that term. These two time spans are taken into consideration in light of the major technological innovations that occurred in the 90s and presumably affected the English lexicon. Such a distributional approach is shown by the authors to complement the results of a simpler frequency-based one, already proposed by Michel et al. (2011), which for instance interprets the increase in relative frequency of a given term over time as a signal of its acquired popularity and therefore of its semantic shift.

As Wijaya and Yeniterzi (2011) highlight, such a method falls short of describing the nature of the investigated changes and of spotting more gradual shifts that are not reflected in frequency variations. In their work, k-means clustering and Topics-Over-Time (Wang and McCallum 2006), a time-dependent topic model, are exploited to observe how and when the topics surrounding a given word change in diachrony. Results clearly bring to light words that changed their meaning over time (e.g. gay from ‘frolicsome’ to ‘homosexual’ around the 70s) and words that acquired additional meanings (e.g. mouse from ‘long-tailed animal’ to ‘computer device’ around the 80s-90s).

Kulkarni et al. (2015) compare the frequency-based approach with a syntactic one, which tracks variations in the probability distribution of POS tags given a target word in the different time snapshots of a corpus, and with a best-performing word embedding-based one (Mikolov et al. 2013), which learns word vectors for different time periods, warps the vector spaces into a unique coordinate system and creates a distributional time series for every word to assess its semantic displacement across time. With respect to Wijaya and Yeniterzi (2011), they also propose an algorithm for detecting the exact semantic change point in the time series built for each word with each of the three methods presented above. Their approach is also shown to be scalable and applicable to spotting shifts over different time spans, namely a century of written books in the Google Books Ngram corpus, years of Twitter posts, and a decade of Amazon movie reviews.

Diachronic distributional semantics is instead employed by Xu and Kemp (2015) to corroborate the law of parallel change against the law of differentiation as regards the semantic behavior of synonyms over time. Synonym pairs like impending and imminent thus tend to evolve semantically in parallel rather than going different routes, perhaps by virtue of analogical forces that tend to maintain relationship patterns between words.

Another application of the distributional approach to a diachronic corpus is carried out by Perek (2016), who investigates the productivity of the “V the hell out of NP” construction from 1930 to 2009. The vectors of the verbs occurring in this construction are analyzed with multidimensional scaling and clustering to pinpoint the preferred semantic domains of the construction in its diachronic evolution, while a mixed-effects logistic regression analysis shows that the density of the semantic space of the construction around a given word in a certain period is predictive of that word joining the construction in the subsequent period.

Hamilton, Leskovec, and Jurafsky (2016b) evaluate the performance of different kinds of word embeddings (PPMI, SVD, word2vec) in detecting attested historical semantic shifts (e.g. broadcast from ‘scatter’ to ‘transmit’) on cross-linguistic data, by measuring changes in pair-wise similarities and the semantic displacement of a given lemma across time, and run a series of regression analyses that reveal two general statistical laws of semantic change, namely that frequent words evolve at a slower rate and polysemous ones mutate faster. In a second study (Hamilton, Leskovec, and Jurafsky 2016a), they make use of both a global and a local neighborhood measure of semantic change to disentangle shifts due to cultural changes from purely linguistic ones. While the first index, which measures the cosine similarity between the vectors of the same word in consecutive decades, fares better in spotting purely linguistic changes for verbs, the second one, which keeps track of the changes in the nearest neighbors of a word over time, performs better in detecting culturally motivated changes (e.g. virus from ‘infectious disease’ to ‘unauthorized and harmful computer program’).

3. Materials and Methods

3.1 The Corpus

The corpus used for this study is based on the TLG-E (Thesaurus Linguae Graecae) collection of Ancient Greek literary texts. This corpus does not include inscriptions, private letters or non-literary papyri, but it does include several fragmentary texts in both poetry and prose genres. Texts were divided into two sub-corpora, the former spanning the 7th to the 1st century BC (pre-Christian era), the latter the 1st to the 5th century AD (early Christian era). The pre-Christian sub-corpus contains 6,795,253 tokens, while the Christian sub-corpus totals 29,051,269 tokens.

Table 1
Percentage distribution of the main textual genres in the BC-era and AD-era subcorpora (a given text may belong to more than one genre at once).

Genre              BC era    AD era
Epic poetry        2.3%      0.3%
Historiography     13.79%    15.43%
Iambus and lyric   13%       6.24%
Tragedy            6.7%      2.6%
Comedy             12.88%    0%
Philosophy         14.86%    47.87%
Astronomy          2.54%     7.10%
Medicine           3.84%     19.17%
Mathematics        5.71%     2.09%

As Table 1 clearly shows, the two subsections are rather heterogeneous as regards the distribution of the main textual genres that compose them. When inspecting percentage values, keep in mind that a given text may partake of more than one genre at once. While the percentage of poetical texts (epic, iambic and lyric poetry) and theatrical texts (tragedy and comedy) diminishes from the BC to the AD era, the AD centuries are characterized by a greater diffusion of philosophical and technical (e.g. astronomical and medical) writings, with the exception of mathematical writings, which decrease from 5.71% to 2.09%. The percentage of historiographical works, finally, does not appear to vary considerably across the centuries.

Texts were lemmatized using Morpheus (Crane 1991). This parser is estimated to reach approximately 80% accuracy in lemmatizing Ancient Greek (Boschetti 2009, p. 60). Minor issues with the lemmatization are therefore to be expected, and will be mentioned and discussed in the Results section. Generally speaking, they fall into two categories. The most basic issue arises when some inflected forms of a lemma are erroneously lemmatized separately (examples are visible in Table 5, where, e.g., the comparative and superlative of the adjective ταχύς “takhýs; swift” appear as distinct lemmas); this kind of mis-lemmatization, however, should not have a significant impact on the semantic analysis, since said redundant lemmas will effectively have the same meaning, and can be expected to behave in similar ways.

Cases where forms of a word are lemmatized under an entirely unrelated lemma could, on the other hand, affect the results in a more significant way, but they appear to be very rare (the main example that can be detected in our data concerns forms of ψυχή “psykhé; soul” being erroneously lemmatized under ψῦχος “psŷkhos; cold”: see section 4.3 below).

3.2 Building the Distributional Vector Spaces

Distributional Semantic Models (Lenci 2008; Turney and Pantel 2010) implement the distributional hypothesis advanced by Harris (1954), whereby linguistic expressions that are similar in meaning tend to occur in similar contexts. In these models, target linguistic expressions are represented as vectors in a high-dimensional space, where each dimension of the vectors records the co-occurrence statistics of the target elements with some contextual features, e.g. the content words occurring in a fixed contextual window on the left and on the right of the target. By virtue of their representation with distributional vectors, words are encoded as points in a semantic space (Sahlgren 2006), and geometric measures of vector similarity or distance, like the cosine (Turney and Pantel 2010), are exploited to model their semantic similarity.

Like previous applications of distributional semantics to Ancient Greek (Boschetti 2009), we built two vector spaces from the TLG corpus, one from the pre-Christian subsection (BC-Space henceforth) and one from the Christian subsection (AD-Space henceforth). After filtering out stop-words (mainly particles and connectives) and lemmas occurring with a frequency below 100 tokens, the pre-Christian and Christian sub-corpora contain, respectively, 4,109 and 10,052 lemmas, which were used both as targets and as dimensions in our vector spaces. A vector space model was then built for each sub-corpus using the DISSECT toolkit (Dinu, Pham, and Baroni 2013). Co-occurrences were computed within a window of 11 words (5 content words to the right and to the left of each target word). Association scores were weighted using positive point-wise mutual information (PPMI) (Turney and Pantel 2010), a statistical association measure that quantifies whether two words x and y co-occur more often than expected by chance, setting negative values to zero:

$$\mathrm{PPMI}(x, y) = \max\left(0,\ \log \frac{P(x, y)}{P(x)\,P(y)}\right) \qquad (1)$$

The resulting matrices were reduced to 300 latent dimensions with Singular Value Decomposition (SVD) (Deerwester et al. 1990).
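For illustration, here is a minimal NumPy sketch of one way to implement the PPMI weighting of Equation (1) and the SVD reduction just described. The experiments themselves were run with the DISSECT toolkit, so the function names and the raw count matrix `counts` below are illustrative assumptions, not the paper's code.

```python
import numpy as np

def ppmi_matrix(counts):
    """Weight a raw co-occurrence count matrix with PPMI, as in Equation (1).

    counts: (n_targets, n_contexts) array of co-occurrence frequencies;
    every target and context lemma is assumed to occur at least once.
    """
    total = counts.sum()
    p_xy = counts / total                            # joint probabilities P(x, y)
    p_x = counts.sum(axis=1, keepdims=True) / total  # marginal P(x)
    p_y = counts.sum(axis=0, keepdims=True) / total  # marginal P(y)
    with np.errstate(divide="ignore"):               # log(0) -> -inf, clipped below
        pmi = np.log(p_xy / (p_x * p_y))
    return np.maximum(pmi, 0.0)                      # keep positive associations only

def reduce_to_latent_dimensions(ppmi, k=300):
    """Project the PPMI matrix onto its first k latent dimensions via SVD.

    For vocabularies of several thousand lemmas, a truncated solver
    (e.g. scipy.sparse.linalg.svds) would be preferable to a full SVD.
    """
    u, s, _ = np.linalg.svd(ppmi, full_matrices=False)
    return u[:, :k] * s[:k]
```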

3.3 RSA of the Distributional Vector Spaces

We have adapted the RSA method to discover semantic changes between the two vector spaces:

1. we identified the words occurring in both sub-corpora with a frequency higher than 100 tokens, obtaining 3,977 lemmas;

2. we built a representational similarity matrix (RSM) from the BC-Space (RSM_BC) and one from the AD-Space (RSM_AD). Each RSM is a square matrix indexed horizontally and vertically by the 3,977 lemmas and containing in each cell the cosine similarity of a lemma with the other lemmas in a vector space (this is a minor variation with respect to the original RSA method, which instead uses dissimilarity matrices). An RSM is a global representation of the semantic space geometry in a given period: vectors represent lemmas in terms of their position relative to the other lemmas in the semantic space;

3. for each lemma, we computed the Pearson correlation coefficient between its vector in RSM_BC and the corresponding vector in RSM_AD.

The Pearson coefficient measures the degree of semantic shift across the two temporal slices. The lower the correlation, the more a word changed its meaning.
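The three steps above amount to a few lines of linear algebra. The sketch below is an illustrative reimplementation rather than the authors' code: `space_bc`, `space_ad` and `shared_lemmas` are assumed names for the SVD-reduced spaces of Section 3.2 and the 3,977 shared lemmas, and the SciPy/scikit-learn routines stand in for whatever numerical library was actually used.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics.pairwise import cosine_similarity

def semantic_shift_ranking(space_bc, space_ad, shared_lemmas):
    """Rank shared lemmas by semantic shift between the two diachronic spaces.

    space_bc, space_ad: dicts mapping each lemma to its SVD-reduced vector.
    Returns the lemmas sorted from most to least changed.
    """
    bc = np.vstack([space_bc[w] for w in shared_lemmas])
    ad = np.vstack([space_ad[w] for w in shared_lemmas])
    rsm_bc = cosine_similarity(bc)  # RSM_BC: pairwise cosines in the BC-Space
    rsm_ad = cosine_similarity(ad)  # RSM_AD: pairwise cosines in the AD-Space
    # Step 3: correlate each lemma's similarity profile across the two periods;
    # the lower the Pearson coefficient, the stronger the semantic shift.
    score = {w: pearsonr(rsm_bc[i], rsm_ad[i])[0]
             for i, w in enumerate(shared_lemmas)}
    return sorted(shared_lemmas, key=score.get)
```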

4. Discussion of Results

This section focuses on the words that underwent the biggest changes, i.e. those with the lowest correlation scores. The primary goal is to establish whether these words can be clustered into meaningful groups. This would allow us to pinpoint the areas within the lexicon of Ancient Greek that underwent a significant semantic shift during the earliest centuries of Christianity.

4.1 Qualitative Analysis

The 50 lemmas with the lowest correlation coefficients were scrutinized by hand, in order to establish whether meaningful subgroups emerge. (This list of words is not reproduced here due to space constraints; it is a subset of the 200 words used to build the plot in section 4.3.) The findings in this section, while inevitably limited by the intuition of the researcher, will provide the starting point for the more sophisticated analyses performed in the following sections. The lemmas under consideration form a somewhat heterogeneous collection, including relatively common verbs such as ἕπομαι “hépomai; follow”, as well as some proper nouns. This notwithstanding, two promising subsets of words emerge even at this preliminary stage (see the examples in Table 2).

Table 2
Some examples of lemmas undergoing the most substantial semantic change.

Lemma                       BC era meaning    AD era meaning
CHRISTIAN TERMS
παραβολή parabolé           ‘comparison’      ‘parable’
λαός laós                   ‘people’          ‘the Christians’
κτίσις ktísis               ‘founding’        ‘creation’
TECHNICAL TERMS
ὑπόστασις hypóstasis        ‘foundation’      ‘substance’
δύναμις dýnamis             ‘power’           ‘property (of beings)’
ῥητός rhetós                ‘stated’          ‘literal (vs. allegorical)’

The first group comprises several nouns designating eminently Christian concepts, such as παραβολή (“parabolé; parable”, previously “comparison”), λαός (“laós”, used for the Christian community as opposed to non-Christians, previously “people”), and κτίσις (“ktísis; creation”, previously “founding, settling”). These findings are in line with the idea that the diffusion of Christianity played a substantial role in driving semantic change in the first centuries AD (cf. Boschetti (2009)). Other Christian terms, such as θεός (“theós; God”), ἄγγελος (“ángelos; angel”, previously “messenger”), πατήρ (“patér; father”), and υἱός (“hyiós; son”), also occur among the 100 words with the lowest correlation coefficients. The shift undergone by words such as τόκος (“tókos; childbirth”) is also likely to be connected to their occurrence in Christian contexts, even though it is hard to define this as a “meaning shift” stricto sensu. Such cases, and the theoretical issues they bring about, will be discussed separately in section 4.2, in light of the results of the nearest neighbor analysis.

Another group of lemmas comprises technical terms whose usage seems to have undergone a specialization or a shift from one domain of knowledge to another. These include words such as ὑπόστασις (“hypóstasis; substance”, previously “sediment, foundation”), δύναμις (“dýnamis; property (of beings)”, previously “power”), or ῥητός (“rhetós; literal” as opposed to “allegorical”, previously “stated”). When the lemmas in this group refer to metaphysical concepts or exegetical terms, the influence of Christian thought may also be present. Within this category as well, one finds cases such as ἐνιαύσιος (“eniáusios; annual”), where the meaning of the word can hardly be assumed to have changed in the strictest sense, but its context of usage (as will be made clear by the nearest neighbor analysis in the next section) has shifted towards technical literature.

Together, the most clear-cut examples of these two groups (including those for which a semantic shift will be recognizable thanks to the nearest neighbor analysis performed in the following section) account for about half of the 50 words that underwent the most substantial semantic change. There is, of course, a measure of subjectivity in judging which words shifted towards a Christian or technical meaning; the findings in this section, however, can be supported through a more refined analysis.

4.2 Analysis of Nearest Neighbors

Nearest neighbor analysis proves especially useful when it comes to detecting shifts in meaning that would not be predictable through simple observation. Thus, for instance, the neighbors of μοῖρα (“môira”, another highly polysemous lemma, with meanings spanning from “part” to “destiny”) in the AD-Space come exclusively from the domain of astronomy and geometry (see Table 4; note that διάμετρον “diámetron; daily ration” is likely to be a lemmatization error for διάμετρος “diámetros; diameter”), showing a strong specialization towards a technical usage (“degree” or “division” of the Zodiac). Similarly, among the neighbors of the apparently anodyne noun ζυγόν (“zygón; yoke”) one finds the constellations Λέων (“Léon; Leo”), Σκορπίον (“Skorpíon; Scorpius”), Παρθένος (“Parthénos; Virgo”), and Τοξότης (“Toxótes; Sagittarius”), revealing a shift in usage towards the astronomical sense, where Ζυγόν is the name of the constellation and Zodiac sign “Libra”. This word, however, is the only name of a constellation that appears among the last 50 lemmas according to the correlation coefficient; in any case, the presence of words such as ὑποτάσσω (“hypotásso; to set, to submit”), δούλειος (“dóuleios; slavish”), and φορτίον (“fortíon; load”) among the nearest neighbors in the AD-Space shows that the astronomical meaning did not become as predominant as in the case of μοῖρα.


Table 3
Examples of nearest neighbors in the BC- and AD-Space.

πνεῦμα: ‘breath’ → ‘spirit’
BC-Space NNs: ἀήρ aér ‘air’; ὑγρός hygrós ‘moist’; θερμός thermós ‘hot’; ψυχρός psykhrós ‘cold’; ὑγράζω hygrázo ‘to be wet’; θερμαίνω thermáino ‘to heat’; πυκνός pyknós ‘compact’; ἀναπνοή anapnoé ‘breathing’; ψυχρόομαι psykhróomai ‘to be chilly’; θερμότης thermótes ‘heat’
AD-Space NNs: θεάομαι theáomai ‘to contemplate’; ἀληθινός alethinós ‘true’; αἰών aión ‘aevum’; κτίσις ktísis ‘creation’; υἱός hyiós ‘son’; θεός theós ‘God’; πατήρ patér ‘God the Father’; δοξάζω doxázo ‘magnify’; οἰκονομία oikonomía ‘administration’; πληρόω pleróo ‘to fill’

δύναμις: ‘power’ → ‘property (of beings)’
BC-Space NNs: προάγω proágo ‘to lead forward’; πολιορκία poliorkía ‘siege’; ἀθροίζω athróizo ‘to gather’; στρατόπεδον stratópedon ‘encampment’; στρατιώτης stratiótes ‘soldier’; παράταξις parátaxis ‘line of battle’; ἀναζεύγνυμι anazéugnymi ‘to yoke’; καταπλήσσω kataplésso ‘to strike down’; Καρχηδόνιος Karkhedónios ‘Carthaginian’; ἀναλαμβάνω analambáno ‘to take up’
AD-Space NNs: ἐνέργεια enérgeia ‘activity’; μετέχω metékho ‘to partake of’; ἐνεργέω energéo ‘to be in action’; κινητικός kinetikós ‘related to motion’; φύς phýs ‘son’; οὐσία ousía ‘substance’; ἰδιότης idiótes ‘specific property’; φύσις phýsis ‘nature’; ποιότης poiótes ‘quality’; δισσός dissós ‘twofold’

A similarly surprising result comes from the geographical adjective Ποντικός (“Pontikós; coming from Pontus”), whose nearest neighbors shift from proper names and philosophical terms in the pre-Christian age (an association due, without doubt, to the usage of “Ponticus” as an epithet for authors, e.g. Heraclides) to names of currency and trade wares, probably as a reflection of the integration of Pontus as a Roman province (with the obvious repercussions on trade) in the 1st century AD. This is not, strictly speaking, a shift in meaning, but in real-world reference and usage; as such, it is parallel to cases such as θεός, where the most relevant change is in the cultural context.

Specialization towards a narrower usage is not, however, the only possible route of semantic change for technical terms: some of these appear to have moved from one domain to another. The case of πνεῦμα, whose semantic domain shifts from physics to metaphysics and philosophy (see Table 3 above), has already been discussed. Another example is σύμπτωμα (“sýmptoma”, with the generic meaning of “chance occurrence”), whose top three neighbors in the BC-Space are λογισμός (“logismós; calculation, reasoning”), θεωρέω (“theoréo; to contemplate”), and προερέω (“proeréo; to predict”); in the AD-Space, in their place we find πυρετέω (“pyretéo; to be feverish”), νόσημα (“nósema; disease”), and πυρετός (“pyretós; fever”), revealing a shift from the philosophical to the medical domain (i.e. from “property” to “symptom”). Another example, this time spanning the technical and Christian domains, is παραβολή (“parabolé; parabola, parable”, among other possible meanings), whose neighbors in the BC-Space mostly have to do with geometry, while in the AD-Space they pertain to the domain of biblical and literary exegesis.


Table 4
Examples of nearest neighbors for astronomical terms.

μοῖρα: ‘part, portion’ → ‘degree, division (of the Zodiac)’
BC-Space NNs: ἕπομαι hépomai ‘to follow’; δύω dýo ‘to plunge in, to enter’; μένος ménos ‘might, spirit’; κέω kéo ‘to lie down, to rest’; γαῖα gâia ‘earth’; ἀστήρ astér ‘star’; ἦμαρ êmar ‘day’; τόσος tósos ‘so much (as)’; λείπω léipo ‘to leave’; αὐτίκα autíka ‘at once’
AD-Space NNs: ἔγγιστος éngistos ‘nearest, next’; ζῳδιακός zodiakós ‘Zodiac’; ἰσημερινός isemerinós ‘equinoctial’; πάροδος párodos ‘passage, entrance’; διάμετρον diámetron [‘diameter’*]; τμῆμα tmêma ‘section, sector’; Κριός Kriós ‘Aries’; μεσουρανέω mesouranéo ‘to culminate’; κέντρον kéntron ‘center’; μεσημβρινός mesembrinós ‘of noon, southern’

ζυγόν: ‘yoke’ → ‘Libra’
BC-Space NNs: κέω kéo ‘to lie down, to rest’; ὦμος ômos ‘shoulder’; ἕπομαι hépomai ‘to follow’; πούς póus ‘foot’; μέση mése ‘middle string’; μέσος mésos ‘middle’; δόρυ dóry ‘shaft, spear’; μοῖρα môira ‘part, portion’; λαιός laiós ‘left’; γόνυ góny ‘knee’
AD-Space NNs: ὑποτάσσω hypotásso ‘to set; to submit’; δούλειος dóuleios ‘slavish’; Λέων Léon ‘Leo’; κυριεύω kyriéuo ‘to be lord’; φορτίον fortíon ‘load’; δουλεύω douléuo ‘to be slave’; Σκορπίον Skorpíon ‘Scorpius’; Παρθένος Parthénos ‘Virgo’; Τοξότης Toxótes ‘Sagittarius’; ἐλεύθερος eléutheros ‘free’

* see in-text discussion.

The nearest neighbors of ῥητός, one of the lemmas that had already been singled out as a promising example of a shift towards a technical meaning in the qualitative analysis, show a similar evolution from the mathematical to the exegetical domain.

There are also sporadic cases where the shift in meaning seems to be from a more technical usage in the BC-Space to a more generalized meaning in the AD-Space. A representative example is the verb δίειμι (“díeimi; to go through”). Its nearest neighbors in the BC-Space all come from the domain of physics, and are indeed strongly specialized towards indicating properties of matter (see Table 5; some more minor issues with lemmatization make an appearance here, with the same adjective being categorized as two different lemmas, but since these lemmas seem to behave in a similar fashion, the impact on the results can be supposed to be minimal). In the AD-Space, the physical domain seems to have disappeared entirely, with the synonym διέρχομαι (“diérkhomai; to go through”) now taking pride of place among the nearest neighbors. Of course, it is also possible that the appearance of this kind of pattern for a limited number of lemmas might be due to the different size of the two sub-corpora.

Finally, as in the qualitative analysis, we find examples of lemmas where the shift seems to have to do with a different context of usage rather than a thorough meaning change.


Table 5
Nearest neighbors for δίειμι.

δίειμι: ‘to go through’
BC-Space NNs: πυκνός pyknós ‘compact’; λεπτόν leptón ‘thin’*; λεπτός leptós ‘thin’*; ξηρά xerá ‘dry’*; παχύς pakhýs ‘thick’; ψυχρός psykhrós ‘cold’; ξηρός xerós ‘dry’*; ὑγρός hygrós ‘moist’; μανός manós ‘sparse’; ὑγρότης hygrótes ‘moisture’
AD-Space NNs: διέρχομαι diérkhomai ‘to go through’; θάσσων thásson ‘swifter’*; διεξέρχομαι diexérkhomai ‘to pass through’; ἀξιόλογος axiólogos ‘remarkable’; ὁπόσος hopósos ‘as much (as)’; τάχιστος tákhistos ‘very swift’*; χωρίον khoríon ‘place’; διέξειμι diéxeimi ‘to pass through’; ἀποχωρέω apokhoréo ‘to go away’; πλεῖστος pléistos ‘(the) most, greatest, largest’

* see in-text discussion.

Perhaps the most clear-cut case is the locative adverb αὐτόθεν (“autóthen; from this very spot, immediately”), whose nearest neighbors in the BC-Space are entirely generic (including words such as ἄγνυμι “ágnymi; to break” and ναῦς “náus; ship”), while in the AD-Space they seem to pertain mostly to the domain of logical and mathematical reasoning (with words such as ὑπόθεσις “hypóthesis; hypothesis”, ἀκόλουθος “akólouthos; following, consequent”, and ἀποδείκνυμι “apodéiknymi; to prove, to demonstrate”). In this case, just as for τόκος in section 4.1, it is hard to posit a “meaning shift” of any sort, but we can envisage a technical context of usage becoming predominant.

Cases in which the change in context does not seem to straightforwardly translate to a shift in meaning draw attention to one of the subtlest implications of the results presented here. Given the small dimensions of the corpus, it is sometimes difficult to rule out an influence of the genre of the texts analyzed on the distribution of results: for instance, the impact of technical usage on the meaning of many of the terms that underwent the most significant semantic change might be connected to the presence of a higher number of philosophical and technical treatises in the AD-Space. As we showed in Section 3.1, the percentage of works classified as “philosophical” in the TLG categorization system does indeed rise steeply in the AD-corpus (47.87%, as opposed to 14.86% in the BC-corpus), but the increase is less noticeable for other technical genres (e.g. astronomical writings, 2.54% to 7.10%, and medical writings, 3.84% to 19.17%), while the percentage of mathematical writings is actually lower in the AD-corpus (2.09%) than in the BC-corpus (5.71%). Note that, since the same work can be categorized as belonging to more than one genre in the TLG, percentages for different genres need to be kept apart.

Further research should undoubtedly highlight the effect of corpus composition; a focus on shorter periods of time might be of interest for future studies, since, for instance, the rise of technical prose writing is widely recognized as being a characteristic of the Hellenistic Age (cf. e.g. Gutzwiller (2007, p. 154-167); note that, for the aims of this study, texts from this period are included in the BC-Space, not the AD-Space). A documented change in the proportion of different possible usages of a word, however, is in itself a very informative result, especially in a field such as Classics, where the analysis of (literary) texts is paramount. Indeed, the shift towards Christian usage for several terms can in itself be described as the introduction of an entirely new genre of Christian writings, but this would sidestep the issue that there has been a noticeable change in the usage of these words (and, by definition, their meaning, according to the Distributional Hypothesis).

4.3 t-SNE Plot

As a final analysis, we embedded the RSM_AD vectors for the 200 words with the lowest correlation coefficients with the corresponding RSM_BC vectors in a two-dimensional space using t-SNE (Figure 1), a technique for dimensionality reduction and data visualization that overcomes some of the limitations of standard multidimensional scaling (Van der Maaten and Hinton 2008). This procedure allows for easy identification of clusters, thus revealing the semantic relations between the most recent meanings of the words that underwent the greatest semantic change. While the analysis in the previous sections was aimed at detecting patterns of semantic shift between the BC-Space and AD-Space, the purpose of the t-SNE plot is to investigate whether there is any significant relationship between the meanings of the words that underwent such a shift; because of this difference in purpose, the information contained in the plot is limited to one semantic space. For the same reasons, the potential issues about the composition of the corpus and the impact of genre, as sketched at the end of section 4.2 above, are not relevant for the discussion here. (A sketch of how such an embedding can be computed is given below.)

A number of small clusters can be observed in the plot. Near the left periphery, the most relevant group (in blue) is composed of terms pertaining to (Christian) theology, from κύριος (“kýrios; Lord”), λαός and θεός, to παρουσία (“parousía; Advent”), ποιμήν (“poimén; shepherd”), τέρας (“téras; sign, portent”), and οὐρανός (“ouranós; heaven”). The position of ψῦχος (“psŷkhos; cold”) near this cluster is due to the mis-lemmatization of some inflected forms of ψυχή (“psykhé; soul”) under this lemma, as revealed by nearest neighbor analysis (see section 3.1 above). To the left of this group, a small cluster of terms (in light blue) pertaining to Christian exegesis (ῥητός, παραβολή, διασαφέω “diasaphéo; to illustrate”) can be recognized. At the far right of the plot, diametrically opposed to the previous clusters, another small group of Christian terms can be recognized; this includes πατήρ, υἱός, πνεῦμα, and potentially καρδία (“kardía; heart”) and σάρξ (“sárx; flesh”).

The upper portion of the plot (in green) houses technical terms from the domains of medicine (the upper-most group, spanning the personal name ῾Ιπποκράτης “Hippokrátes; Hippocrates”, the nouns διάθεσις “diáthesis; condition” and σύμπτωμα, the verb καταπλάσσω “kataplásso; to apply a plaster/poultice”, and the adjective πρόσφατος “prósphatos; fresh”) and of astronomy and geometry (difficult to distinguish, from μοῖρα and πάροδος “párodos; passage” to ἄκρος “ákros; top-most” and δισσός “dissós; two-fold”). Philosophical terminology (in red) can be found in the lower right area (δύναμις, ὑπόστασις, etc.), while a separate cluster of terms pertaining to moral philosophy (ἐπιτήδειος “epitédeios; suitable”, ἱκανός “hikanós; sufficient”, ἐπιμελής “epimelés; careful”, all clustering around the crucial term ἄλυπος “álypos; without pain, painless”) is visible nearer to the center of the plot (in brown). Some smaller groups are also noticeable, such as μνᾶ (“mnâ; mina”) and δραχμή (“drakhmé; drachma”), both units of currency, on the left (in orange), and πρώτιστος (“prótistos; the very first”) and Τίμαιος (the proper name Tímaios, Latin Timaeus), both connected to (Neo-)Platonic philosophy, on the right (in red).

All in all, despite the inevitable amount of noise, the plot in Figure 1 supports the findings detailed so far.
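As a rough sketch of how such an embedding can be computed, assuming `rsm_ad` is the RSM_AD matrix of Section 3.3 as a NumPy array, `lemmas` the ordered list of shared lemmas indexing its rows, and `shift_score` the per-lemma Pearson coefficients (scikit-learn's TSNE stands in here for the original t-SNE implementation):

```python
import numpy as np
from sklearn.manifold import TSNE

def tsne_coordinates(rsm_ad, lemmas, shift_score, n=200, seed=0):
    """Embed the RSM_AD rows of the n most-changed lemmas into two dimensions."""
    most_changed = sorted(lemmas, key=shift_score.get)[:n]
    rows = np.array([lemmas.index(w) for w in most_changed])
    profiles = rsm_ad[rows]                   # AD-Space similarity profiles
    coords = TSNE(n_components=2, random_state=seed).fit_transform(profiles)
    return dict(zip(most_changed, coords))    # lemma -> (Dim 1, Dim 2)
```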


Figure 1
Relative positions within the AD-space of the 200 words with the lowest correlation scores (axes: t-SNE Dim 1 and Dim 2). Dimensionality reduction was performed using t-SNE.

semantic changes in the Greek lexicon between the pre-Christian and the Christian eras affected the domains of religion (in a broader sense) and/or technical language. Within these domains, some finer-grained relations between words that went through a significant semantic shift can be observed.
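To make the visualization step concrete, the following minimal sketch shows how such a projection can be computed; the matrix name and parameters are illustrative rather than taken from the paper, and a random matrix stands in for the actual RSM-AD vectors.

```python
# Minimal sketch of the t-SNE projection behind Figure 1; assumes the
# RSM-AD vectors of the 200 lowest-correlation lemmas are available as
# a (200, n_dims) matrix. Names and parameters here are illustrative.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
ad_vectors = rng.standard_normal((200, 300))  # placeholder for real vectors

# t-SNE preserves local neighbourhoods when mapping to 2D, which is why
# semantically related lemmas surface as visible clusters in the plot.
coords = TSNE(n_components=2, perplexity=30.0, random_state=0).fit_transform(ad_vectors)
print(coords.shape)  # (200, 2): one (Dim 1, Dim 2) point per lemma
```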

5. Conclusions and future work

This paper shows how Distributional Semantics can be used as an exploratory tool to detect semantic change. In this case study on Ancient Greek, the proposed method based on distributional RSA not only confirms the hypothesis that the diffusion of Christianity was a crucial cause of semantic change in the Greek lexicon, but also allows for the identification of unexpected patterns of evolution, such as the specialization in the usage of technical terms. From a methodological standpoint, the fact that the results obtained from such a small corpus of purely literary texts are both meaningful and informative is of great relevance. The nearest neighbor analysis performed in section 4.2 brought to light several patterns of change, which proved informative both as regards the evolution of some semantic domains between the BC- and the AD-space and

the potential effects of the composition of the corpus (in itself a potentially interesting source of information for Ancient Greek). The t-SNE plot, by showing how the words that underwent the most relevant meaning shifts tend to form semantically motivated clusters, provided a further opportunity to detect areas of the lexicon that underwent significant semantic change. As far as broader methodological issues are concerned, the choice to adopt a data-driven approach proved fruitful, in that it brought to light directions of change that were not expected a priori. For traditional research in Classics, a computational approach to the lexicon of Ancient Greek is compelling because it provides new information about a language for which the judgments of native speakers are unavailable (cf. Perek 2016). The results of this study show how Distributional Semantics can complement the findings of the philologist, as well as help discover patterns of lexical change that would otherwise be impossible to grasp beyond an intuitive level. Nonetheless, a few issues remain open and could benefit from a more fine-grained investigation in future studies. First and foremost, it could be interesting to observe which parts of speech tend to change first, e.g. whether nouns or verbs (Dubossarsky, Weinshall, and Grossman 2016), and whether specific genres are more prone to change than others. Secondly, a targeted study of a more restricted period right after or right before the advent of Christianity (rather than the twelve-century time span considered here) could help confirm that the shifts we detected were primarily due to the spread of Christianity itself, which would then have represented a major breaking point, and rule out the possibility that a more natural and broad-spectrum change was already taking place.

References
Boschetti, Federico. 2009. A Corpus-based Approach to Philological Issues. Ph.D. thesis, University of Trento.
Crane, Gregory. 1991. Generating and parsing classical Greek. Literary and Linguistic Computing, 6(4):243–245.
Deerwester, Scott, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391.
Dinu, Georgiana, Nghia The Pham, and Marco Baroni. 2013. DISSECT — DIStributional SEmantics Composition Toolkit. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 31–36, Sofia, Bulgaria, August 4–9.
Dubossarsky, Haim, Daphna Weinshall, and Eitan Grossman. 2016. Verbs change more than nouns: A bottom-up computational approach to semantic change. Lingue e linguaggio, 15(1):7–28.
Edelman, Shimon. 1998. Representation is representation of similarities. Behavioral and Brain Sciences, 21:449–467.
Gulordava, Kristina and Marco Baroni. 2011. A distributional similarity approach to the detection of semantic change in the Google Books Ngram corpus. In Proceedings of the GEMS 2011 Workshop on GEometrical Models of Natural Language Semantics, pages 67–71, Edinburgh, Scotland, July 31.
Gutzwiller, Kathryn J. 2007. A Guide to Hellenistic Literature. Blackwell Publishing.
Hamilton, William L., Jure Leskovec, and Dan Jurafsky. 2016a. Cultural shift or linguistic drift? Comparing two computational measures of semantic change. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2116–2122, Austin, Texas, USA, November 1–5.
Hamilton, William L., Jure Leskovec, and Dan Jurafsky. 2016b. Diachronic word embeddings reveal statistical laws of semantic change. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pages 1489–1501, Berlin, Germany, August 7–12.
Harris, Zellig S. 1954. Distributional structure. Word, 10(2-3):146–162.
Kriegeskorte, Nikolaus and Roger A. Kievit. 2013. Representational geometry: Integrating cognition, computation, and the brain. Trends in Cognitive Sciences, 17(8):401–412.


Kulkarni, Vivek, Rami Al-Rfou, Bryan Perozzi, and Steven Skiena. 2015. Statistically significant detection of linguistic change. In Proceedings of the 24th International World Wide Web Conference, pages 625–635, Florence, Italy, May 18–22.
Lenci, Alessandro. 2008. Distributional semantics in linguistic and cognitive research. Italian Journal of Linguistics, 20(1):1–31.
Michel, Jean-Baptiste, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, et al. 2011. Quantitative analysis of culture using millions of digitized books. Science, 331(6014):176–182.
Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems, pages 3111–3119, Lake Tahoe, Nevada, USA, December 5–10.
O'Donnell, Matthew Brook. 2005. Corpus Linguistics and the Greek of the New Testament. Number 6. Sheffield Phoenix Press.
Perek, Florent. 2016. Using distributional semantics to study syntactic productivity in diachrony: A case study. Linguistics, 54(1):149–188.
Sagi, Eyal, Stefan Kaufmann, and Brady Clark. 2011. Tracing semantic change with latent semantic analysis. In Kathryn Allan and Justyna A. Robinson, editors, Current Methods in Historical Semantics. Mouton de Gruyter, pages 161–183.
Sahlgren, Magnus. 2006. The Word-Space Model: Using distributional analysis to represent syntagmatic and paradigmatic relations between words in high-dimensional vector spaces. Ph.D. thesis, Stockholm University.
Turney, Peter D. and Patrick Pantel. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37:141–188.
Van der Maaten, Laurens and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605.
Wang, Xuerui and Andrew McCallum. 2006. Topics over time: A non-Markov continuous-time model of topical trends. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 424–433, Philadelphia, Pennsylvania, USA, August 20–23.
Wijaya, Derry Tanti and Reyyan Yeniterzi. 2011. Understanding semantic change of words over centuries. In Proceedings of the 2011 International Workshop on DETecting and Exploiting Cultural diversiTy on the Social Web, pages 35–40, Glasgow, United Kingdom, October 24–28.
Xu, Yang and Charles Kemp. 2015. A computational evaluation of two laws of semantic change. In Proceedings of the 37th Annual Meeting of the Cognitive Science Society, pages 2703–2708, Pasadena, California, July 22–25.

Distributed Representations of Lexical Sets and Prototypes in Causal Alternation Verbs

Edoardo Maria Ponti∗ Università di Cambridge
Elisabetta Jezek∗∗ Università degli Studi di Pavia

Bernardo Magnini† Fondazione Bruno Kessler

Lexical sets contain the words filling an argument slot of a verb, and are in part determined by selectional preferences. The purpose of this paper is to unravel the properties of lexical sets through distributional semantics. We investigate 1) whether lexical sets behave as prototypical categories with a centre and a periphery; 2) whether they are polymorphic, i.e. composed of sub-categories; 3) whether the distance between lexical sets of different arguments is explanatory of verb properties. In particular, our case study is the lexical sets of causative-inchoative verbs in Italian. Having studied several vector models, we find that 1) based on spatial distance from the centroid, object fillers are scattered uniformly across the category, whereas intransitive subject fillers lie on its edge; 2) a correlation exists between the number of verb senses and that of clusters discovered automatically, especially for intransitive subjects; 3) the distance between the centroids of object and intransitive subject is correlated with other properties of verbs, such as their cross-lingual tendency to appear in the intransitive pattern rather than the transitive one. This paper is noncommittal with respect to the hypothesis that this connection is underpinned by a semantic reason, namely the spontaneity of the event denoted by the verb.

1. Introduction

The arguments of a verb are the “slots” that have to be filled to satisfy its valency (subject, object, etc.). Hence, verbs display so-called selectional restrictions over the possible fillers occupying these slots (Séaghdha and Korhonen 2014), which play a major role in determining the verb's meaning. Moreover, the selection of the fillers happens in accordance with the different senses of a verb. Lists of the possible fillers occurring with a certain pattern of a verb can be collected in a corpus-driven fashion: these lists are named “lexical sets” (Hanks 1996). Several approaches in Computational Linguistics have managed to inspect lexical sets and their patterns of variation (Montemagni, Ruimy, and Pirrelli 1995; McKoon and Macfarland 2000), as well as to use them as features for verb classification (McCarthy 2000; Joanis, Stevenson, and James 2008). On the other hand, selectional preferences have been employed for many tasks in Natural Language Processing, including Word Sense Disambiguation (Resnik 1997; McCarthy and Carroll 2003, inter alia), Metaphor Processing (Shutova, Teufel, and Korhonen 2013), Information Extraction (Pantel et al.

∗ E-mail: [email protected]
∗∗ E-mail: [email protected]
† E-mail: [email protected]

© 2017 Associazione Italiana di Linguistica Computazionale

2007), Discourse Relation Classification (Ponti and Korhonen 2017), Dependency Parsing (Mirroshandel and Nasr 2016), and Semantic Role Labeling (Gildea and Jurafsky 2002; Zapirain et al. 2013). We aim to establish a new method of analysis for lexical sets. In particular, we address the following key questions: are lexical sets Aristotelian categories (yes-no membership) or prototypical categories (graded membership)? If the latter holds true, are the fillers arranged homogeneously or do they cluster around some sub-categories? Finally, is the relation between lexical sets from different patterns of the same verb informative about the verb's meaning? We address these questions from a distributional semantics perspective. In fact, distributional semantics provides a mapping between each filler and a vector lying in a continuous multi-dimensional space. By virtue of this, lexical sets can be treated as continuous categories, where members can be either central or peripheral. Moreover, this allows us to quantify the distance between vectors with spatial measures. In order to test the relevance of this approach for linguistic theory, we focus on a case study, namely verbs undergoing the causative-inchoative alternation. Solving the above-mentioned issues may help clarify some of the vexed questions about this class of verbs. These verbs show both transitive and intransitive patterns: the object of the former and the subject of the latter play the same semantic role of patient. Based on the cross-lingual frequency of each of these patterns and the direction of morphological derivation, it is possible to sort these verbs onto a scale (Haspelmath 1993; Samardzic and Merlo 2012; Haspelmath et al. 2014), which can plausibly be interpreted semantically as the “spontaneity” of the corresponding event (see § 2.2). We investigate the existence of any asymmetry between the lexical sets of the transitive object and the intransitive subject, and, if so, the connection of this asymmetry with the spontaneity scale. The structure of the paper is as follows. In § 2, we define the core notions of this study, including lexical sets, causative-inchoative verbs, and distributional semantics. § 3 presents the method and the data, whereas § 4 reports the results of the experiments. Finally, § 5 draws the conclusions and § 6 proposes possible future lines of research.

2. Definitions and Previous Work

In this section, we describe in detail the notions underlying the subject matter of the research (lexical sets, § 2.1), the case study (causative-inchoative verbs, § 2.2), and the method (distributional semantics, § 2.3). At the same time, we present the previous work concerning each of these aspects.

2.1 Lexical Sets

A lexical set can be defined as the set of words that occupy a specific argument position within a single verb sense, such as {gun, bullet, shot, projectile, rifle...} for the sense ‘to shoot’ of to fire, or {employer, teacher, attorney, manager...} for its sense ‘to dismiss’. The notion of lexical set was first introduced by Hanks (1996). Its purpose is to explain how the semantics of verbs is affected by the patterns of complements they are found with. Hanks' approach is justified by the pervasiveness of patterns in corpora: these patterns are instantiated by specific lexical items typically occurring in the argument positions. These items, called fillers, form sets belonging to different patterns of meaning. Hanks and Pustejovsky (2005) and Hanks and Jezek (2008) propose an ontology where fillers are clustered into semantic types, i.e. categories such as [[Location]], [[Event]], [[Vehicle]], and [[Emotion]]. These form a hierarchy that branches into more specific types.


However, these categories are problematic, as lexical sets tend to “shimmer” (Jezek and Hanks 2010): their membership tends to change according to the verb they associate with. The shimmering nature of lexical sets is not an accidental phenomenon. Rather, it stems from the fact that verb selectional restrictions may cut across conceptual categories due to the specific predication introduced by the verb. For example, both wash and amputate typically select [[Body Part]] as their direct object. Nevertheless, they select different prototypical members of the set, as can be seen in the examples below from Jezek and Hanks (2010), where only shared members are underlined:

(1) [[Human]] wash [[Body Part]] where [[Body Part]]: {hand, hair, face, foot, mouth...}

(2) [[Human]] amputate [[Body Part]] where [[Body Part]]: {leg, limb, arm, hand, finger...}

Lexical sets can be extracted from corpora automatically. This operation hinges upon traditional techniques for the acquisition of subcategorisation frames and selectional restrictions. The former capture the syntactic pattern (Brent 1991; Schulte im Walde 2000) or semantic frame pattern (Baker, Fillmore, and Lowe 1998) in which each verb is found. The latter create probabilistic models of preferences over fillers, which are evaluated intrinsically through human judgments (Brockmann and Lapata 2003) and extrinsically through disambiguation tasks (Van de Cruys 2014).

2.2 Causative-Inchoative Verbs

In principle, lexical sets can be constructed for every verb. In this work, however, we limit our inquiry to a specific subset of verbs, namely causative-inchoative verbs in Italian. This provides a testbed for our methods of analysis, which can easily be extended to other classes of verbs and alternations. The choice of this specific subset is due to the fact that understanding the internal structure of lexical sets and their relations seems to be crucial to solving the problems surrounding this class of verbs, including asymmetries between transitive objects and intransitive subjects and their relation with the spontaneity scale (see below). Causative-inchoative verbs alternate. In other terms, they appear in two patterns, either as transitive or intransitive. In the former, an agent brings about a change of state affecting a patient; in the latter, the change of the same patient is presented as spontaneous (e.g. break, as in “Mary broke the key” vs. “the key broke”). The verbs in the two patterns can be expressed either by the same form or by two distinct forms cross-lingually. In the second case, the forms can be morphologically asymmetrical: one has a derivative affix and the other does not. Otherwise, the forms are suppletive, being completely unrelated (e.g. kill/die). The members of causative-inchoative verbs that retain the same form or are morphologically related in both patterns vary cross-lingually (Montemagni, Ruimy, and Pirrelli 1995). Alternations regarding physical change-of-state and manner-of-motion verbs are found in English, whereas they are limited to psychological and physical changes-of-state in Italian. In Japanese and Salish languages, verbs like arrive and appear also alternate (Alexiadou 2010). From a semantic point of view, Italian causative-inchoative verbs are required to be telic and to have an inanimate patient (Cennamo 1995). Morpho-syntactically, they are generated from an asymmetrical derivation, called “anti-causativisation.” The intransitive form is sometimes marked with the pronominal clitic si: its presence is mandatory,

optional, or forbidden according to verb-specific rules (Cennamo and Jezek 2011). Because of this, many different categorisations of Italian causative-inchoative verbs have been attempted (Folli 2002; Jezek 2003). Causative-inchoative verbs in general are endowed with peculiar properties. Haspelmath (1993) claims that verbs with a cross-lingual preference for a marked causative form denote a more “spontaneous” situation. Spontaneity is intended by the author as the likelihood that the occurrence of the event denoted by the verb does not require the intervention of an agent. In this way, a correlation between the form and the meaning of these verbs was borne out. Moreover, Samardzic and Merlo (2012) and Haspelmath et al. (2014) demonstrated that verbs appearing more frequently (intra- and cross-lingually) in the inchoative form tend to morphologically derive the causative form. Here, the correlation links form and frequency. Vice versa, situations entailing agentive participation prefer to mark the inchoative form and occur more frequently in the causative form. However, what remains uncertain is whether spontaneous and agentive variants of the same verbs differ in their lexical sets. Atkins and Levin (1995) and Levin and Hovav (1995) argued that selectional restrictions for spontaneous verbs (named internal causation verbs) are stricter because their event unfolds due to some inherent property of the patient: referents without this capability are excluded, contrary to what happens with agentive verbs (named external causation verbs). This capability is defined as “teleological” by Copley and Wolff (2014). However, McKoon and Macfarland (2000) reported contradicting results from corpus-based analyses that did not find any significant difference in the content of lexical sets of spontaneous and agentive verbs (although from a sample of less than 100 sentences and only 5 categories). Instead, they reported a difference between transitive objects and intransitive subjects, in that the former contained a larger amount of abstract nouns compared to the latter.

2.3 Distributional Semantics

Having established the domain, we need to provide a reliable method of inquiry. Previous works based on set theory treated lexical sets as Aristotelian categories, of which a filler is either a member or not. For instance, Montemagni, Ruimy, and Pirrelli (1995) collected lexical sets manually and employed set intersection as a measure of similarity between two sets. Research in psychology, however, has long since demonstrated that the members of a linguistic set are found in a radial continuum where the most central element is the prototype of its category, and those at the periphery are less representative (Rosch 1973; Lakoff 1987).¹ The full exploitation of the semantic information inherent in the argument fillers of verbs can take advantage of some recent developments in distributional semantics. Efficient algorithms have been devised to map each word of a vocabulary into a corresponding vector of n real numbers, which can be thought of as a sequence of coordinates in an n-dimensional space (Mikolov et al. 2013a, inter alia). This mapping is yielded by unsupervised machine learning, based on the assumption that the meaning of a word can be inferred from its context, i.e. its neighbouring words in texts. This is known as the Distributional Hypothesis (Harris 1954; Firth 1957). Distributed models have some relevant properties: first, the geometric closeness of two vectors corresponds to the similarity

1 For previous work on lexical sets considering prototypicality in the context of the notion of shimmering, see Jezek and Hanks (2010).

in meaning of the corresponding words. Moreover, their dimensions possibly retain a semantic interpretation, such that non-trivial analogies can be established among words. Word vectors allow us to capture the spatial continuum implied by the notion of prototype. Previous works showed that word vectors can be clustered to imitate linguistic categories, because each cluster captures the ‘semantic landscape’ of a word (Hilpert and Perek 2015). The center of these clusters can be interpreted as the prototype of the corresponding category, and the proximity of the cluster members to the center as the degree of their prototypicality. In fact, the cluster members are not scattered randomly, but rather are arranged according to the internal structure of the cluster (Dubossarsky, Weinshall, and Grossman 2016). The prototypicality of the cluster members provides an explanation for linguistic phenomena, such as resistance to diachronic change of meaning (Geeraerts 1997; Dubossarsky et al. 2015). In this work, we extend the usage of the notion of prototypicality from meaning-based categories derived through vector quantization to grammatically defined categories (i.e. lexical sets) derived through dependency parsing. Moreover, we go beyond the estimation of the distance of each word from the centroid of its category: in particular, we propose new methods to assess the internal diversity in terms of sub-categories and the distance between lexical sets themselves.

Table 1
List of 20 causative-inchoative verbs and count of their fillers for each argument slot.

Lemma                  Translation      S     O
chiuder(si)            to close         289   606
aprir(si)              to open          195   1337
aumentare              to increase      534   998
romper(si)             to break         83    419
riempir(si)            to fill          58    166
raccoglier(si)         to gather        85    505
connetter(si)          to connect       39    134
divider(si)            to split         129   246
finire                 to stop          1092  721
uscire/portare fuori   to go/put out    325   638
alzar(si)              to arise/raise   75    304
scuoter(si)            to rock          10    69
bruciare               to burn          75    174
congelare              to freeze        10    30
girare                 to spin          155   243
seccare                to dry           15    14
svegliar(si)           to awake/wake    68    89
scioglier(si)          to melt          94    143
(far) bollire          to boil          2     2
affondare              to sink          18    74


3. Data and Method

We sourced the lexical sets from a sample of ItWac, a wide corpus gathered by crawling texts from the Italian web domain using medium-frequency vocabulary as seeds (Baroni et al. 2009). This sample was further enriched with morpho-syntactic information through the graph-based MATE-tools parser (Bohnet 2010). We trained and evaluated this parser on the Italian treebank inside the collection of Universal Dependencies v1.3 (Nivre et al. 2016). The evaluation on gold standard data suggests how many errors we expect to affect the predictions on the new data, i.e. the ItWac corpus: these errors are then propagated to the following steps for the extraction of lexical sets. According to the LAS metric, the relevant dependency relations scored 0.751 for dobj (direct object), 0.719 for nsubj (subject), and 0.691 for nsubjpass (subject of a passive verb). A target group of 20 causative-inchoative verbs was taken from Haspelmath et al. (2014): they are listed in Table 1, together with the counts of the extracted lexical sets for the relevant semantic macro-roles (see below). Once the sentences were parsed, the target verbs were identified inside the dependency trees. The lemmas of these verbs and the forms of their arguments were stored in a database. Argument fillers were grouped according to the semantic macro-roles defined by Dixon (1994), rather than their dependency relations: subjects of transitive verbs (A), subjects of intransitive verbs (S), and objects (O). In particular, the subjects of verb forms accompanied by the si-clitic and those without an object depending on the same verb were treated as S.² These operations resulted in a database structured as follows: in each row, a verb appears alongside the fillers it occurs with in a specific sentence. For example, compare an original sentence and its corresponding entry in Example 3:

(3) Plinio il Vecchio non cita più il Po di Adria perché l'argine dell'Adige si era rotto ed era confluito nella Filistina.
‘Pliny the Elder doesn't mention the Po of Adria anymore because the bank of the Adige had broken and had merged with the Filistina.’

Verb       A       S       O
citare     Plinio  _       Po
rompere    _       argine  _
confluire  _       _       _
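As an illustration of the grouping rules just described, the following hedged sketch encodes them as a small function; the boolean flags are hypothetical names, since in practice this information would be read off the MATE-tools dependency trees.

```python
# Hedged sketch of the macro-role assignment described above (Dixon 1994).
# The flag names are hypothetical; real input comes from parsed trees.
def macro_role(deprel, verb_has_object, verb_has_si_clitic, verb_is_passive):
    if deprel == "dobj":
        return "O"
    if deprel == "nsubjpass" or (deprel == "nsubj" and verb_is_passive):
        return "O"  # passive subjects are treated as O (see footnote 2)
    if deprel == "nsubj":
        # si-clitic or no co-occurring object -> intransitive subject (S)
        return "S" if (verb_has_si_clitic or not verb_has_object) else "A"
    return None  # not one of the tracked argument slots

# "l'argine si era rotto": subject with si-clitic and no object -> S
print(macro_role("nsubj", verb_has_object=False,
                 verb_has_si_clitic=True, verb_is_passive=False))  # S
```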

Because of the nature of vector models, we made the following design choices to deal with special linguistic phenomena. We discarded everything but the head of multi-word nouns, such as Plinio of Plinio il Vecchio, to preserve the one-to-one mapping between words and vectors. We did not distinguish proper nouns from common nouns, such as Po and argine, since their representations lie in the same multi-dimensional space. Subjects in ellipsis or co-reference were neglected, since no pragmatic annotation of the sentences was available: for instance, Adige should appear as S in the entry for confluire, which is left blank instead. Finally, polysemous words and homonyms were

2 Subjects of verbs inflected in the passive voice were treated as O, instead.

represented by a single form, hence a single vector, since their representations conflate all the relevant meanings. For instance, Adige is both a river and a location, but these are not distinguished in the vector model. The database was later collapsed by verb lemma, so that each lemma became associated with three sets of fillers (one per macro-role). Each of these sets is a corpus-based lexical set. Compared to manually picked lexical sets, they are noisier but less sparse: the vastness of the data mitigates the errors in the parsing step. Moreover, the automation of lexical set extraction gives access to the fillers of virtually every verb: resources based on manual selection like T-PAS (Jezek et al. 2014), on the other hand, are limited to a small number of verbs. Afterwards, each of the argument fillers was mapped to a vector according to three different pre-trained models. The vectors are generated in an unsupervised fashion by back-propagating the gradient from the loss of a task to update randomly initialized embeddings. Each model, however, relies on a different task:

- CBOW stands for Continuous Bag of Words and is part of the Word2Vec suite (Mikolov et al. 2013b). The Italian model was developed by Dinu, Lazaridou, and Baroni (2015) through negative sampling. 300-dimensional representations were obtained by training a binary classifier that discriminates whether a pair of a target word w and a context c belongs to an actual text or not. True contexts are obtained from a window of 5 words on either side of the target word; false contexts are drawn randomly from a noise distribution. Moreover, infrequent words are pruned. On the other hand, a preprocessing step called subsampling deletes words from windows whose frequency f is higher than a threshold t, with probability p = 1 − t/f. The text from which word-context pairs are sampled is the full ItWac corpus, consisting of 1.6 billion tokens. The representations resulting from this algorithm are organised by topical similarity or relatedness.

- fastText embeddings were trained on the dump of the Italian Wikipedia through a character-level skip-gram (Bojanowski et al. 2017). The skip-gram algorithm predicts the surrounding elements given a target element. The elements of Bojanowski's version are characters: each word is represented by a bag of character n-grams, and its representation consists of the sum of the representations of each of them. The vector dimensionality is 300; random sequences are drawn with a ratio of 5 to 1 compared to true sequences, with a probability proportional to the square root of the uni-gram frequency. The size of the context window is uniformly sampled from values between 1 and 5. The rejection threshold for subsampling is 10⁻⁴. The usage of character n-grams makes this vector model sensitive to morphological similarity.

- Polyglot vectors result from a classifier distinguishing between sequences taken from the dump of the Italian Wikipedia and corrupted sequences (Al-Rfou, Perozzi, and Skiena 2013). The context window size was 5, and the dimensionality of the vectors 64. Subsampling discarded words not appearing in the ranking of the 100 thousand most frequent ones.
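As a concrete illustration of how fillers can be mapped to vectors from any such pre-trained model, here is a hedged sketch using gensim; the model file name is hypothetical and stands in for whichever of the three models is loaded.

```python
# Hedged sketch of the filler-to-vector mapping; the file name is
# hypothetical and stands for any of the three pre-trained models.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("itwac_cbow_300d.vec")
lexical_set = ["argine", "porta", "finestra"]
# Skip out-of-vocabulary fillers (e.g. words pruned during training).
filler_vectors = [vectors[word] for word in lexical_set if word in vectors]
```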

In order to measure spatial distances inside these vector models, many different metrics are available, including pure geometrical (Euclidean) distance. In this work, we

rely on the popular metric of cosine distance (Cha 2007). Assume that a and b are vectors, and that a_i and b_i are their i-th components, respectively. The cosine similarity between a and b is then defined as follows:

\cos(\theta) = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2}\,\sqrt{\sum_{i=1}^{n} b_i^2}} \qquad (1)

The opposite, namely the cosine distance d, is simply defined as d = 1 − cos(θ). As for the values that d(a, b) can assume, the minimum is 0 (the vectors completely overlap) and the maximum is 1 (orthogonal vectors).
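Equation (1) and the derived distance translate directly into code; the following is a plain NumPy rendering.

```python
# Direct NumPy rendering of Equation (1) and the cosine distance d.
import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    cos_theta = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return 1.0 - cos_theta

print(cosine_distance(np.array([1.0, 2.0]), np.array([2.0, 4.0])))  # ~0.0: same direction
print(cosine_distance(np.array([1.0, 0.0]), np.array([0.0, 1.0])))  # 1.0: orthogonal
```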

4. Experiments

In order to answer the questions outlined in the introduction, we devised three experiments. In all of them, we assume that the S and O lexical sets might not overlap, as borne out by Montemagni, Ruimy, and Pirrelli (1995) and McCarthy (2000). The first (§ 4.1) investigates the internal structure of lexical sets and how their members aggregate around a prototype. Based on psycholinguistic theories (see § 2.3), we expect vectors to lie on a continuum. As for causative-inchoative verbs in particular, objects were shown to be more homogeneous than subjects (McKoon and Macfarland 2000), hence they should form denser clusters. The second experiment (§ 4.2) deals with the polymorphism of lexical sets, i.e. the number of distinct sub-categories they contain. We expect the number of sub-clusters to be proportional to the number of verb senses. Finally, the third experiment (§ 4.3) is aimed at studying how lexical sets of different arguments of the same verb are related. In particular, we predict that the distance between the intransitive subject and the transitive object varies depending on the spontaneity of the causative-inchoative verb (see § 2.2). In fact, intransitive subjects and transitive objects of spontaneous verbs should be limited to referents with “teleological capability” (Atkins and Levin 1995; Levin and Hovav 1995), hence creating dense groups possibly located far from each other. Fillers of agentive verbs instead should show a higher degree of overlap because of their looser constraints.

4.1 Prototypicality: Distance from Centroid

Once the fillers have been mapped to their corresponding vectors, a lexical set appears as a group of points in a multi-dimensional model. The centre of this group is the Euclidean mean of the vectors, which is a vector itself and is called the centroid. In the first experiment, we measure the cosine distance of every vector member of a lexical set from the centroid estimated from all the other members.³ This leave-one-out setting aims at avoiding biases due to outliers (such as phraseological usages or misspellings). In semantic terms, this measure should correspond to assessing how far a filler is from its prototype. We obtained two sets (S and O) of cosine distance values for each verb: these can be plotted as boxes and whiskers, as in Figure 1. The cosine distances are represented as a scatter plot: at its side, a box informs about the average (horizontal line), the median⁴

³ Every filler was weighted proportionally to its absolute frequency.
⁴ The median is the value separating the higher half of the ordered values from the lower half.

Figure 1
Boxes and whiskers of vector distances from the centroid, one panel per verb: (a) chiudere, (b) aprire, (c) aumentare, (d) rompere, (e) riempire, (f) raccogliere, (g) connettere, (h) dividere, (i) finire, (j) uscire, (k) alzare, (l) scuotere, (m) bruciare, (n) congelare, (o) girare, (p) seccare, (q) svegliare, (r) sciogliere, (s) bollire, (t) affondare. Lexical sets for objects are in red, for subjects in blue. The columns in each subplot represent different models: from left to right, CBOW, fastText, and Polyglot.


(diamond), the second and third quartiles (rectangle), and the extremes (bars) of the distribution. Firstly, we observe the same trend for both intransitive subjects and objects, which depends on the algorithm. The distributions as a whole tend to hover around higher values for CBOW and lower values for Polyglot; fastText instead lies somewhere in between, and its central quartiles cover a short span of values. Secondly, there is a systematic gap between the medians for S and O. Excluding bollire, for which data are insufficient, the object median is lower 18 times out of 19 for CBOW. On the other hand, for Polyglot the preference is very mild (11 out of 18), and no clear pattern emerges for fastText.
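A hedged sketch of the leave-one-out measurement is given below; for brevity it omits the frequency weighting mentioned in footnote 3, and the matrix name is illustrative.

```python
# Leave-one-out distances from the centroid (Section 4.1): each filler is
# compared to the centroid of all *other* fillers of the same lexical set.
# Frequency weighting (footnote 3) is omitted here for brevity.
import numpy as np

def loo_centroid_distances(vectors: np.ndarray) -> np.ndarray:
    """vectors: (n_fillers, n_dims) matrix for one lexical set."""
    n = len(vectors)
    total = vectors.sum(axis=0)
    distances = np.empty(n)
    for i, v in enumerate(vectors):
        centroid = (total - v) / (n - 1)  # centroid without filler i
        cos = np.dot(v, centroid) / (np.linalg.norm(v) * np.linalg.norm(centroid))
        distances[i] = 1.0 - cos
    return distances
```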

4.2 Polymorphism: Sub-Clusters

The distance from the prototype examined in § 4.1 does not account for the complex internal structure of a lexical set, which aggregates several sub-categories. Figure 2 shows a visual example of this sub-organisation by plotting heatmaps of the density of vectors reduced to 2 dimensions through t-SNE (Maaten and Hinton 2008): it is possible to observe spots in isolation and in aggregation. In order to assess the polymorphism of

Figure 2
Heatmaps of the density (low is white, high is blue) of vectors reduced to 2 dimensions through t-SNE. They belong to the S lexical sets of aumentare (4 senses) and finire (12 senses); panels: (a) aumentare CBOW, (b) aumentare fastText, (c) aumentare Polyglot, (d) finire CBOW, (e) finire fastText, (f) finire Polyglot.

lexical sets, i.e. the number of different shades of meaning they contain, we cluster the vectors in each of them and contrast this measure with the number of verb senses. Vector clustering is performed with the X-Means algorithm (Pelleg and Moore 2000), an extension of the simpler K-Means algorithm (MacQueen 1967). The latter performs vector quantisation: it assigns the vectors in a space to a pre-specified number of k clusters. The algorithm converges to a local optimum by initialising k arbitrary means inside the model and then iterating two steps: first, it assigns each vector to the nearest mean, minimising a measure of variance; second, it recalculates the means based on the newly obtained clusters. Thus, for vectors v, clusters C, and centroids µ, its objective can be formalised as:

\arg\min_{C} \sum_{i=1}^{k} \sum_{v \in C_i} \lVert v - \mu_i \rVert^2 \qquad (2)

In addition to these steps, X-Means performs a further operation, namely deciding if and where to split clusters in two to create new ones. Iterations start from k = 2 up to a pre-specified upper bound (in our case, k = 20): after a split, X-Means estimates through a test whether the old cluster or the two new clusters fit the data better. In our case, this test is the Bayesian Information Criterion (BIC), defined as follows: given a split M, the maximised log-likelihood \hat{L}(V) of the vectors V, the number of parameters k, and the number of data points n, \mathrm{BIC}(M) = \hat{L}(V) - \frac{k}{2} \ln n. The number of clusters provided by X-Means upon convergence is displayed in Table 2. Moreover, we provide the number of senses of each verb according to WordNet for Italian (Artale, Magnini, and Strapparava 1997) in a separate column. At first glance, the results show that lexical sets tend to have a similar number of clusters across the algorithms, which is surprising considering the different natures of these representations. However, this might be due to a bias: X-Means is possibly inclined to make similar decisions for sets of identical cardinality under a low upper bound. We estimated Pearson's correlation between the number of clusters and the number of verb senses, reported in the bottom rows of the table: the values of the coefficient mirror the strength of the correlations, ranging from -1 (negative), through 0 (absent), to 1 (positive). The p-value instead stands for the confidence with which we can exclude the null hypothesis (absence of correlation). Results for CBOW and fastText reveal a mild positive correlation. This is especially evident for subjects, which appear to be a better cue for predicting the number of verb senses. On the other hand, results are inconclusive for Polyglot, as the p-value is not significant enough.
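The following sketch illustrates BIC-guided selection of k in the spirit of X-Means; for simplicity it scans k with plain K-Means instead of recursively splitting clusters, and uses a spherical-Gaussian log-likelihood proxy, so it is an approximation of the procedure rather than the authors' implementation.

```python
# Approximate, hedged sketch of BIC-guided selection of k (cf. X-Means).
import numpy as np
from sklearn.cluster import KMeans

def bic(vectors, labels, centers):
    # Spherical-Gaussian proxy: log-likelihood ~ negative within-cluster
    # sum of squares; penalty k/2 * ln(n) as in the formula above.
    n, k = len(vectors), len(centers)
    log_l = -sum(np.sum((vectors[labels == i] - c) ** 2)
                 for i, c in enumerate(centers))
    return log_l - (k / 2.0) * np.log(n)

def best_k(vectors, k_max=20):
    scores = {}
    for k in range(2, min(k_max, len(vectors) - 1) + 1):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(vectors)
        scores[k] = bic(vectors, km.labels_, km.cluster_centers_)
    return max(scores, key=scores.get)  # k with the highest BIC score
```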

4.3 Spontaneity: Distance between Centroids

In §§ 4.1–4.2, the lexical sets of the same verb have been considered independently from each other. We now assess whether any relation holds between them by gauging the cosine distance between the centroid of S and the centroid of O for each verb. This operation is aimed at finding to what extent these two lexical sets overlap, and at unveiling possible asymmetries. In order to estimate whether spontaneity (see § 2.2) affects this degree of overlap, we compared the ranking of our sample of verbs according to the ratio of the cross-lingual frequencies of their transitive and intransitive forms (Haspelmath et al. 2014) with a ranking based on their centroid distances.


Table 2
Number of verb senses and clusters inside lexical sets upon convergence of X-Means.

Lemma        Senses  CBOW S  CBOW O  fastText S  fastText O  Polyglot S  Polyglot O
chiudere     9       10      13      15          13          10          13
aprire       5       10      16      12          16          12          12
aumentare    4       16      14      15          14          16          16
rompere      6       9       14      10          14          10          13
riempire     3       2       10      10          15          8           14
raccogliere  6       2       15      8           14          2           14
connettere   3       2       8       9           10          3           7
dividere     2       12      11      13          13          12          11
finire       12      16      15      16          16          16          14
uscire       12      19      15      14          14          15          13
alzare       6       6       12      10          16          11          13
scuotere     1       1       10      1           10          1           8
bruciare     11      8       13      13          14          2           11
congelare    4       2       2       4           2           2           2
girare       6       11      11      12          13          13          13
seccare      4       2       5       8           7           2           3
svegliare    1       7       8       9           7           10          8
sciogliere   7       14      10      16          12          14          11
bollire      2       -       -       -           -           -           -
affondare    6       5       6       7           12          6           7

Pearson's r          0.572   0.493   0.596       0.458       0.326       0.389
p-value              0.011   0.032   0.007       0.049       0.173       0.100

In Figure 3, we plot the ranking based on cross-lingual frequency against the cosine distance: it emerges that the latter tends to increase for more spontaneous verbs, i.e. verbs with a preference for the intransitive pattern. This tendency is straightforward for all the vector models, as the LOESS lines suggest. Moreover, after ranking the verbs based on the cosine distance between the S and O lexical sets, we estimated two correlation metrics with respect to the frequency-based ranking: Spearman and Kendall. Their coefficients, with the corresponding p-values, are reported in Table 3: they demonstrate a mild-to-strong positive correlation for both CBOW and Polyglot. However, the p-value does not allow us to exclude the null hypothesis (absence of correlation) for fastText with reasonable certainty.
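The two rank-correlation tests reported in Table 3 can be computed as in the hedged sketch below; the rankings are placeholders, not the paper's data.

```python
# Sketch of the rank-correlation tests of Section 4.3 on placeholder
# rankings (the real ones come from Haspelmath et al. 2014 and from the
# S-O centroid distances).
from scipy.stats import kendalltau, spearmanr

freq_ratio_rank = [1, 2, 3, 4, 5, 6]      # placeholder frequency ranking
centroid_dist_rank = [2, 1, 3, 5, 4, 6]   # placeholder distance ranking

rho, rho_p = spearmanr(freq_ratio_rank, centroid_dist_rank)
tau, tau_p = kendalltau(freq_ratio_rank, centroid_dist_rank)
print(f"Spearman rho={rho:.3f} (p={rho_p:.3f}); Kendall tau={tau:.3f} (p={tau_p:.3f})")
```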

5. Discussion

The questions at the heart of these experiments were: how are lexical sets structured? In particular, do their elements distribute uniformly in the space, or rather gather together (near or far from the prototype)? Are they polymorphic, i.e. composed of more than one sub-category? Moreover, what is the degree of overlap between lexical sets of different

Figure 3
On the x-axis, the ranking based on cross-lingual form frequencies as reported by Haspelmath et al. (2014). On the y-axis, the cosine distances between the centroids of the S and O lexical sets in Italian. Lines are LOESS regressions, and shaded areas their confidence regions. Vector models: CBOW, fastText, and Polyglot.

argument slots? In this section, we discuss the answers inferred from the results and analyse the specific behaviour of each vector model. In the first experiment (§ 4.1), the members of the O lexical sets are scattered from the centre to the periphery. On the other hand, the members of the S lexical sets lie in a more compact range of distances, mostly farther from the centroid. This implies that O behaves more like a radial category, whereas S just populates the periphery. From a linguistic point of view, this means that the content of O is more homogeneous, whereas S is more heterogeneous: this finding is in line with previous work (McKoon and Macfarland 2000). In the second experiment (§ 4.2), we observed a mild positive correlation between the number of clusters and the number of verb senses. Hence, it is possible


Table 3
Spearman and Kendall correlations between a ranking based on lexical set distance in Italian and another based on cross-lingual frequency ratio.

           CBOW    fastText  Polyglot
Spearman   0.560   0.420     0.490
p-value    0.010   0.069     0.028
Kendall    0.411   0.305     0.358
p-value    0.012   0.064     0.030

to conclude that polymorphic verbs accept more categories of referents as their possible argument fillers. This holds true especially for S. In the third experiment (§ 4.3), we established the existence of a correlation between the verb ranking based on the cross-lingual ratio of intransitive and transitive forms and the ranking based on the cosine distance between the S and O centroids in Italian. From a linguistic point of view, this is possibly due to the constraints on the referents of spontaneous verbs (Atkins and Levin 1995; Levin and Hovav 1995), called teleological capability (Copley and Wolff 2014): this makes the sets clear-cut and possibly far from each other. This adds another piece to the puzzle of the so-called spontaneity scale: Figure 4 shows a synopsis of our result in the context of the correlations established in previous works. Solid lines stand for correlations proven on the basis of corpus evidence. The dotted line, on the other hand, suggests the existence of an underlying motivation for the correlations (i.e. spontaneity), which nonetheless remains unproven and undetermined in its nature. Its possible validation is left to future research, but remains tricky due to its purely semantic nature. The conclusions can be drawn from more than one vector model, although the results are not significant for all of them. In particular, fastText does not show any tendency with respect to the distribution of distances, nor is it possible to exclude the null hypothesis for the correlation between the distance of the S and O centres and the frequency of the intransitive pattern. This might be due to the fact that fastText is based on characters, hence on morphological information, rather than capturing topical relatedness. Moreover, there are possible biases that plague the experiments. Firstly, the homogeneity of the O lexical set might be an artifact, because the sample of objects is usually wider and hence more representative. Instead, the heterogeneity of the S lexical set is in part due to its method of extraction: sometimes transitive subjects (A) are also treated as S, because of either unexpressed objects or parsing mistakes.

6. Conclusions and Future Work

Our work provided evidence that the lexical sets of Italian causative-inchoative verbs are non-uniform categories, whose distribution around the prototype varies to a great extent. This distribution is sensitive to the argument slot: transitive objects display a more uniform distribution of distances from the prototype, whereas the fillers of intransitive subjects lie on the edge of the category. This difference might be due to different selectional restrictions applied to the object. Moreover, the number of verb senses


[Figure 4 diagram nodes: “Spontaneous” (connected by a dotted edge marked “?”), “Frequently Intransitive”, “Unmarked Intransitive”, “Distant S and O centres”.]
Figure 4
Synopsis of the correlations among features of causative-inchoative verbs. The measures are based on the Kendall tau test (τ) and Spearman's ranking test (ρ).

appeared to play a role with respect to the polymorphism of lexical sets: intuitively, the more shades of meaning a verb has, the more argument types match its selectional preferences. Finally, a correlation was discovered between the cosine distance of the lexical sets of a given verb in Italian and the cross-lingual behaviour of its translations, i.e. the tendency to appear more frequently as intransitive or as transitive. This finding has to be paired with the previously established correlation between the latter and the cross-lingual tendency to morphologically derive the intransitive form or the transitive one. To amend the limitations mentioned in § 5 (noise from parsing and extraction, lexical sets not fully representative), further research should resort to an enhanced database with a wider sample, try to reduce the parsing error with state-of-the-art parsers, and add sense disambiguation for polysemous word forms (Grave, Obozinski, and Bach 2013). It should also test further pre-trained vector models, in order to try to replicate these results. In particular, the new vector models could be optimized for similarity through semantic lexica (Faruqui et al. 2015) or based on syntactic dependencies (Séaghdha 2010). The experiments in this work may be extended to other languages, either individually or through a multi-lingual word embedding model (Faruqui and Dyer 2014). In fact, cross-lingual correlations are more clear-cut than those emerging from single languages (Haspelmath et al. 2014).

Acknowledgements

This work was originally presented as a poster at ESSLLI 2016 and then submitted to CLiC-it 2016 (Ponti, Jezek, and Magnini 2016), where it was awarded the ‘best young paper’ prize. As a reward, it is kindly hosted by this journal in a revised and extended form. We would like to thank the anonymous reviewers for their insightful and generous comments.


References
Al-Rfou, Rami, Bryan Perozzi, and Steven Skiena. 2013. Polyglot: Distributed word representations for multilingual NLP. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pages 183–192, Sofia, Bulgaria, 8–9 August 2013.
Alexiadou, Artemis. 2010. On the morpho-syntax of (anti-)causative verbs. In Malka Rappaport Hovav, Edit Doron, and Ivy Sichel, editors, Syntax, Lexical Semantics and Event Structure, pages 177–203.
Artale, Alessandro, Bernardo Magnini, and Carlo Strapparava. 1997. WordNet for Italian and its use for lexical discrimination. In AI*IA 97: Advances in Artificial Intelligence, pages 346–356, Rome, Italy, 17–19 September 1997. Springer.
Atkins, Beryl T.S. and Beth Levin. 1995. Building on a corpus: A linguistic and lexicographical look at some near-synonyms. International Journal of Lexicography, 8(2):85–114.
Baker, Collin F., Charles J. Fillmore, and John B. Lowe. 1998. The Berkeley FrameNet project. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, volume 1, pages 86–90, Montreal, Quebec, Canada, 10–14 August 1998.
Baroni, Marco, Silvia Bernardini, Adriano Ferraresi, and Eros Zanchetta. 2009. The WaCky wide web: A collection of very large linguistically processed web-crawled corpora. Language Resources and Evaluation, 43(3):209–226.
Bohnet, Bernd. 2010. Very high accuracy and fast dependency parsing is not a contradiction. In Proceedings of the 23rd International Conference on Computational Linguistics, pages 89–97, Beijing, China, 23–27 August 2010.
Bojanowski, Piotr, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.
Brent, Michael R. 1991. Automatic acquisition of subcategorization frames from untagged text. In Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics, pages 209–214, Berkeley, California, 18–21 June 1991.
Brockmann, Carsten and Mirella Lapata. 2003. Evaluating and combining approaches to selectional preference acquisition. In Proceedings of the Tenth Conference of the European Chapter of the Association for Computational Linguistics, volume 1, pages 27–34, Budapest, Hungary, 12–17 April 2003.
Cennamo, Michela. 1995. Transitivity and VS order in Italian reflexives. Sprachtypologie und Universalienforschung, 48(1-2):84–105.
Cennamo, Michela and Elisabetta Jezek. 2011. The anticausative alternation in Italian. In Giovanna Massariello Merzagora and Serena Dal Maso, editors, I luoghi della traduzione. Le interfacce. Bulzoni Editore, pages 809–823.
Cha, Sung-Hyuk. 2007. Comprehensive survey on distance/similarity measures between probability density functions. City, 1(2):1.
Copley, Bridget and Phillip Wolff. 2014. Theories of causation should inform linguistic theory and vice versa. In Bridget Copley and Fabienne Martin, editors, Causation in Grammatical Structures. Oxford University Press, pages 11–57.
Dinu, Georgiana, Angeliki Lazaridou, and Marco Baroni. 2015. Improving zero-shot learning by mitigating the hubness problem. In Proceedings of the 3rd International Conference on Learning Representations (ICLR 2015), Workshop Track, San Diego, California, 7–9 May 2015.
Dixon, Robert M.W. 1994. Ergativity. Cambridge University Press.
Dubossarsky, Haim, Yulia Tsvetkov, Chris Dyer, and Eitan Grossman. 2015. A bottom up approach to category mapping and meaning change. In Word Structure and Word Usage. Proceedings of the NetWordS Final Conference, pages 66–70, Pisa, Italy, 30 March – 1 April 2015.
Dubossarsky, Haim, Daphna Weinshall, and Eitan Grossman. 2016. Verbs change more than nouns: A bottom-up computational approach to semantic change. Lingue e linguaggio, 15(1):7–28.
Faruqui, Manaal, Jesse Dodge, Sujay K. Jauhar, Chris Dyer, Eduard Hovy, and Noah A. Smith. 2015. Retrofitting word vectors to semantic lexicons. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics - Human Language Technologies (NAACL HLT 2015), pages 1606–1615, Denver, Colorado, 31 May – 5 June.
Faruqui, Manaal and Chris Dyer. 2014. Improving vector space word representations using multilingual correlation. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 462–471, Gothenburg, Sweden, 26–30 April 2014.
Firth, John R. 1957. A synopsis of linguistic theory, 1930–1955. In Philological Society of Great Britain, editor, Studies in Linguistic Analysis. Basil Blackwell, pages 1–32.
Folli, Raffaella. 2002. Constructing Telicity in English and Italian. Ph.D. thesis, Oxford University.
Geeraerts, Dirk. 1997. Diachronic Prototype Semantics: A Contribution to Historical Lexicology. Oxford University Press.
Gildea, Daniel and Daniel Jurafsky. 2002. Automatic labeling of semantic roles. Computational Linguistics, 28(3):245–288.
Grave, Edouard, Guillaume Obozinski, and Francis Bach. 2013. Hidden Markov tree models for semantic class induction. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning (CoNLL), pages 94–103, Sofia, Bulgaria, 8–9 August 2013.
Hanks, Patrick. 1996. Contextual dependency and lexical sets. International Journal of Corpus Linguistics, 1(1):75–98.
Hanks, Patrick and Elisabetta Jezek. 2008. Shimmering lexical sets. In Proceedings of the XIII EURALEX International Congress, pages 391–402, Barcelona, Spain, 15–19 July 2008.
Hanks, Patrick and James Pustejovsky. 2005. A pattern dictionary for natural language processing. Revue Française de linguistique appliquée, 10(2):63–82.
Harris, Zellig S. 1954. Distributional structure. Word, 10(2-3):146–162.
Haspelmath, Martin. 1993. More on the typology of inchoative/causative verb alternations. In Bernard Comrie and Maria Polinsky, editors, Causatives and Transitivity, volume 23, pages 87–120.
Haspelmath, Martin, Andreea Calude, Michael Spagnol, Heiko Narrog, and Elif Bamyaci. 2014. Coding causal–noncausal verb alternations: A form–frequency correspondence explanation. Journal of Linguistics, 50(3):587–625.
Hilpert, Martin and Florent Perek. 2015. Meaning change in a petri dish: Constructions, semantic vector spaces, and motion charts. Linguistics Vanguard, 1(1):339–350.
Jezek, Elisabetta. 2003. Classi di verbi tra semantica e sintassi. ETS.
Jezek, Elisabetta and Patrick Hanks. 2010. What lexical sets tell us about conceptual categories. Lexis, 4(7):22.
Jezek, Elisabetta, Bernardo Magnini, Anna Feltracco, Alessia Bianchini, and Octavian Popescu. 2014. T-PAS: A resource of corpus-derived typed predicate-argument structures for linguistic analysis and semantic processing. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC), Reykjavik, Iceland, 26–31 May 2014.
Joanis, Eric, Suzanne Stevenson, and David James. 2008. A general feature space for automatic verb classification. Natural Language Engineering, 14(3):337–367.
Lakoff, George. 1987. Women, Fire, and Dangerous Things: What Categories Reveal about the Mind. University of Chicago Press.
Levin, Beth and Malka Rappaport Hovav. 1995. Unaccusativity: At the Syntax-Lexical Semantics Interface. MIT Press.
Maaten, Laurens van der and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605.
MacQueen, James. 1967. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 281–297, Berkeley, California, 21 June 1965 – 7 January 1966.
McCarthy, Diana. 2000. Using semantic preferences to identify verbal participation in role switching alternations. In Proceedings of the 1st North American Chapter of the Association for Computational Linguistics Conference, pages 256–263, Seattle, Washington, 29 April – 4 May 2000.
McCarthy, Diana and John Carroll. 2003.
Disambiguating nouns, verbs, and adjectives using automatically acquired selectional preferences. Computational Linguistics, 29(4):639–654. McKoon, Gail and Talke Macfarland. 2000. Externally and internally caused change of state verbs. Language, 76(4):833–858. Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. In Proceedings of the International Conference on Learning Representations (ICLR) Workshop, Scottsdale, Arizona, 2–4 May 2013. Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013 (NIPS 2013). Curran Associates, Inc., Lake Tahoe, Nevada, 05–10 December 2013, pages 3111–3119.

41 Italian Journal of Computational Linguistics Volume 3, Number 1

Mirroshandel, Seyed Abolghasem and Alexis Nasr. 2016. Integrating selectional constraints and subcategorization frames in a dependency parser. Computational Linguistics, 42(1). Montemagni, Simonetta, Nilda Ruimy, and Vito Pirrelli. 1995. Ringing things which nobody can ring. a corpus-based study of the causative-inchoative alternation in Italian. Textus, 8(2):1000–1020. Nivre, Joakim, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajic, Christopher D. Manning, Ryan T. McDonald, Slav Petrov, Sampo Pyysalo, and Natalia Silveira. 2016. Universal Dependencies v1: A multilingual treebank collection. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC), pages 1659–1666, Portoroz, Slovenia, 23–28 May 2016. Pantel, Patrick, Rahul Bhagat, Bonaventura Coppola, Timothy Chklovski, and Eduard H. Hovy. 2007. ISP: Learning inferential selectional preferences. In Proceedings of the 2007 Conference of the North American Chapter of the Association for Computational Linguistics - Human Language Technologies (NAACL HLT), pages 564–571, Rochester, New York, 22–27 April 2007. Pelleg, Dan and Andrew W. Moore. 2000. X-means: Extending K-means with efficient estimation of the number of clusters. In Proceedings of the 17th International Conference on Machine Learning (ICML), volume 1, pages 727–734, Stanford, California, 29 June – 2 Jul 2000. Ponti, Edoardo Maria, Elisabetta Jezek, and Bernardo Magnini. 2016. Grounding the lexical sets of causative-inchoative verbs with word embedding. In Proceedings of the Third Italian Conference on Computational Linguistics (CLiC-it 2016), Naples, Italy, 5–6 December 2016. Ponti, Edoardo Maria and Anna Korhonen. 2017. Event-related features in feedforward neural networks contribute to identifying causal relations in discourse. In Proceedings of EACL, Workshop on Linking Models of Lexical, Sentential and Discourse-level Semantics, pages 25–30, Valencia, Spain, 3–7 April 2017. Resnik, Philip. 1997. Selectional preference and sense disambiguation. In Proceedings of the ACL SIGLEX Workshop on Tagging Text with Lexical Semantics: Why, What, and How?, pages 52–57, Washington, D.C., 4 April 1997. Rosch, Eleanor H. 1973. Natural categories. Cognitive psychology, 4(3):328–350. Samardzic, Tanja and Paola Merlo. 2012. The meaning of lexical causatives in cross-linguistic variation. Linguistic Issues in Language Technology, 7(12):1–14. Schulte Im Walde, Sabine. 2000. Clustering verbs semantically according to their alternation behaviour. In Proceedings of the 18th conference on Computational linguistics, volume 2, pages 747–753, Saarbrücken, Germany, 31 July – 4 August 2000. Séaghdha, Diarmuid O. 2010. Latent variable models of selectional preference. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 435–444, Uppsala, Sweden. Séaghdha, Diarmuid O. and Anna Korhonen. 2014. Probabilistic distributional semantics with latent variable models. Computational Linguistics, 40(3):587–631. Shutova, Ekaterina, Simone Teufel, and Anna Korhonen. 2013. Statistical metaphor processing. Computational Linguistics, 39(2):301–353. Van de Cruys, Tim. 2014. A neural network approach to selectional preference acquisition. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 26–35, Doha, Qatar, 25–29 October 2014. Zapirain, Benat, Eneko Agirre, Lluis Marquez, and Mihai Surdeanu. 2013. Selectional preferences for semantic role classification. 
Computational Linguistics, 39(3):631–663.

Determining the Compositionality of Noun-Adjective Pairs with Lexical Variants and Distributional Semantics

Marco S. G. Senaldi∗
Scuola Normale Superiore di Pisa

Gianluca E. Lebani∗∗
Università di Pisa

Alessandro Lenci†
Università di Pisa

In this work we tested whether a series of compositionality indices that compute the distributional similarity between the vector of a given expression and the vectors of its lexical variants can effectively tell apart idiomatic and more compositional expressions in a set of 13 idiomatic and 13 non-idiomatic Italian target noun-adjective constructions. The lexical variants were obtained by replacing the components of the original expressions with semantically related words automatically extracted from Distributional Semantic Models or manually derived from Italian MultiWordnet. Indices based on the Mean or the Centroid cosine similarity between the target and the variant vectors performed comparably to or better than the addition-based measure traditionally reported in the distributional literature on compositionality.

1. Introduction

When an adjective combines with a noun, the semantics of the resulting expression does not always reflect a straightforward combination of their meanings. While a sentence like John has bought a white car entails John has bought something white and John has bought a car, saying that John is a skilled optician certainly entails that he's an optician, but not necessarily that he's skilled in general as a person. Going further, if John is an alleged murderer, we are not even sure he's a murderer at all, and it's not even grammatical to say *John is alleged. Finally, if one were to utter I thought John was the murderer, but actually he was just a red herring, they wouldn't be claiming that John is either red or a herring, but figuratively asserting that John has taken their attention away from the real murderer. All these different entailment patterns exhibited by white car, skilled optician, alleged murderer and red herring show the complexity and variability of the compositionality of adjective-noun (AN henceforth) pairs, i.e. the extent to which the meaning of a phrase as a whole is a function of the meanings of its components and of the syntactic relationship that links them (Partee 1995).

∗ Laboratorio di Linguistica “G. Nencioni”, Scuola Normale Superiore - Piazza dei Cavalieri 7, I-56126 Pisa, Italy. E-mail: [email protected]
∗∗ CoLing Lab, Department of Philology, Literature and Linguistics - Via S. Maria 36, I-56126 Pisa, Italy. E-mail: [email protected]
† CoLing Lab, Department of Philology, Literature and Linguistics - Via S. Maria 36, I-56126 Pisa, Italy. E-mail: [email protected]

© 2017 Associazione Italiana di Linguistica Computazionale

In formal semantic terms (Montague 1970; Kamp 1975), while the denotation of white car is said to be represented by the intersection of the denotations of white and car, and the meaning of skilled optician is conceived as a subset of the denotation of optician, the intensional adjective alleged in alleged murderer is treated as a higher-order property that manipulates the modal parameter that is relevant for the interpretation of murderer (Chierchia and McConnell-Ginet 1990). Lastly, and most interestingly for the study at hand, red herring classifies as an idiom, i.e. a semantically non-compositional multiword expression (MWE) characterized by figurativity, proverbiality and, in most cases, a certain emotional connotation (Cacciari and Glucksberg 1991; Nunberg, Sag, and Wasow 1994; Sag et al. 2002; Cacciari 2014). Furthermore, lack of compositionality entails lack of salva-veritate interchangeability and systematicity (Fodor and Lepore 2002). On the one hand, idioms exhibit greater lexical and morphosyntactic frozenness than literal phrases: while a component of a compositional combination can be replaced with a synonym or a semantically related word without considerably affecting the meaning of the whole expression (e.g., from white car to white automobile, from skilled optician to skilled optometrist and from alleged murderer to alleged killer), performing the same operation on an idiomatic expression (e.g., transforming red herring into red fish) most of the time blocks a possible figurative reading. With respect to systematicity, if we can understand the meaning of white car and of red herring used in the literal sense, we can also understand what white herring and red car mean, but the same reasoning does not apply to red herring taken as an idiom. AN compositionality therefore presents itself as a multifaceted and gradient phenomenon, whereby the interaction between the semantics of the adjective and the semantics of the noun leads to very different results in terms of the opacity of the output phrase. While previous computational literature on AN compositionality has been mainly concerned with the first three cases presented above (i.e. intersective, subsective and intensional), existing computational research on idiomaticity has mainly investigated verb-noun structures. In the present work we therefore decided to focus on the most opaque end of the AN compositionality continuum, by applying to Italian AN pairs a series of compositionality indices we had already devised and tested on Italian idiomatic and non-idiomatic verbal constructions. In developing the compositionality measures in Senaldi, Lebani, and Lenci (2016), we were mainly inspired by two groups of previous computational works. On the one hand, Lin (1999) and Fazly, Cook, and Stevenson (2009) label a given word combination as idiomatic if the Pointwise Mutual Information (PMI) (Church and Hanks 1991) between its component words is higher than the PMIs between the components of a set of lexical variants of this combination. These variants are obtained by replacing the component words of the original expressions with thesaurus-extracted synonyms. On the other hand, some studies have exploited Distributional Semantic Models (DSMs) (Sahlgren 2008; Turney and Pantel 2010), comparing the vector of a given phrase with the single vectors of its subparts (Baldwin et al. 2003; Venkatapathy and Joshi 2005; Fazly and Stevenson 2008) or comparing the vector of a phrase with the vector deriving from the sum or the product of their component vectors (Mitchell and Lapata 2010; Krčmář, Ježek, and Pecina 2013). In our previous work (Senaldi, Lebani, and Lenci 2016), we started from a set of Italian verbal idiomatic (e.g. dare i numeri ‘to lose one’s marbles’, lit. ‘to give the numbers’) and non-idiomatic (e.g. leggere un libro ‘to read a book’) phrases (henceforth targets) and generated lexical variants (simply variants henceforth) by replacing their components with semantic neighbors extracted from a window-based DSM and Italian MultiWordNet (Pianta, Bentivogli, and Girardi 2002). Examples of DSM-generated

lexical variants for dare i numeri are offrire i numeri ‘to offer the numbers’, dare le unità ‘to give the units’ and offrire le unità ‘to offer the units’, while examples of variants for leggere un libro are sfogliare un libro ‘to leaf through a book’, leggere uno scritto ‘to read a work’ and sfogliare uno scritto ‘to leaf through a work’. Then, instead of measuring the association scores between their subparts as in Lin (1999) and Fazly, Cook, and Stevenson (2009), we exploited Distributional Semantics to observe how different the context vectors of our targets were from the vectors of their variants, expecting, say, the vector of idiomatic dare i numeri to be less similar to the vectors of its variants offrire i numeri and dare le unità than the vector of non-idiomatic leggere un libro is to the vectors of its variants sfogliare un libro and leggere uno scritto. Our proposal stemmed from the consideration that a high PMI value does not necessarily imply the idiomatic or multiword status of an expression, but just that its components co-occur more frequently than expected by chance, as in the case of read and book or solve and problem, which are all instances of compositional pairings. By contrast, what reliably distinguishes an idiomatic expression from a collocation-like yet still compositional one is its context of use. Comparing the distributional contexts of the original expressions and of their alternatives should therefore represent a more precise refinement of the PMI-based procedure. Indeed, idiomatic expression vectors were found to be less similar to their variant vectors than compositional expression vectors were. In some of our models, we also kept track of the variants that were not attested in our corpus by representing them as vectors orthogonal to the vector of the original expression, still achieving solid results. In the present contribution, we propose to extend the method in Senaldi, Lebani, and Lenci (2016) to a set of Italian AN pairs, since this kind of structure has usually been left aside in the idiom literature. The performance of our indices is also compared with that of addition-based and multiplication-based measures, which are taken as a reference point in the distributional literature on compositionality modeling (Mitchell and Lapata 2010; Krčmář, Ježek, and Pecina 2013).

2. Related work

The present work inserts itself in a longstanding tradition of studies addressing the computational modeling of compositionality. In the following section, we will first review previous research on the compositionality of AN structures in general. We will then shift our focus to how computational studies have so far tackled idiomatic expressions, which we have argued represent the most opaque end of the compositionality continuum. As will become clear, our investigation on AN idioms combines insights from both these research strands. As for the first group of works, Distributional Semantics (Harris 1954; Lenci 2008; Turney and Pantel 2010) has been extensively applied as a computational model of compositionality. DSMs encode target lexical items as vectors in a high-dimensionality space that register their co-occurrence statistics with some contextual features, like documents in a corpus or words occurring in the same contextual window. Vector similarity measures are then applied to model the semantic relatedness or similarity between the words represented by the distributional vectors. In recent years, this approach has been extended from representing the meaning of single words to modeling the semantics of complex phrases. Mitchell and Lapata (2010) propose three methods for combining vector representations that are regarded as a reference point in the distributional literature on compositionality. The weighted additive model derives the vector of a complex phrase p from the weighted sum of the vectors of its components u and v (which in our case

stand for the adjective and the noun respectively) and roughly corresponds to feature union:

p = αu + βv (1)

The pointwise multiplicative model, which corresponds to feature intersection, multiplies each corresponding pair of dimensions of the u and v vectors to derive the corresponding dimension of the p vector. In this way, mutually exclusive features are reduced to zero in the final vector:

p_i = u_i · v_i (2)

Finally, the dilation model decomposes a head vector v (the noun vector in AN strings) into a parallel and an orthogonal component with respect to the modifier vector u (the adjective vector) and stretches the parallel component by a factor λ:

p = (λ − 1)(u · v)u + (u · u)v (3)

So as to balance the way adjectives and nouns contribute to the meaning of the whole phrase, Guevara (2010) proposes a full additive model, in which the two n-dimensional component vectors are multiplied by two n × n weight matrices before being summed:

p = Au + Bv (4)

The two A and B matrices are estimated by means of partial least squares regression, using u and v as predictors and the corresponding observed AN pair vector as dependent variable. The problem of estimating the A and B matrices in a full additive model is also investigated by Zanzotto et al. (2010), who come up with a linear equation system that is solved by resorting to Moore-Penrose pseudo-inverse matrices (Penrose 1955). To account for the fact that each adjective can interact differently with the semantics of the noun it modifies, Baroni and Zamparelli (2010) propose a lexical function model that draws on the Fregean conception of compositionality as function application and learns adjective-specific functions by predicting the dimensions of the observed AN pair vectors from the dimensions of the component noun vectors. The estimated matrix U is then multiplied by the noun vector v:

p = Uv (5)

The adjective is then conceived of as a function that takes the meaning of the noun as an argument and returns the meaning of the modified noun. Boleda et al. (2013) compare the performance of all the aforementioned compositionality models on intensional vs. non-intensional adjectival modification. Their hypothesis is that the full additive and the lexical function models should achieve better scores in modeling intensional modification, since they represent an attempt to transpose formal semantics higher-order modification into distributional semantic terms. On the contrary, non-intensional adjectival modification should be modeled equally well by the weighted additive and the pointwise multiplicative models, which are supposed to reflect feature combination. Their findings nonetheless show an overall better performance of matrix-based methods, irrespective of the kind of modification at play (intensional vs. non-intensional).
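To make the composition operations above concrete, the following is a minimal numpy sketch of equations (1), (2), (3) and (5); the function names are ours, and the weights are illustrative defaults rather than parameters estimated as in Guevara (2010) or Baroni and Zamparelli (2010):

import numpy as np

def weighted_additive(u, v, alpha=1.0, beta=1.0):
    # Eq. (1): p = alpha*u + beta*v (feature union)
    return alpha * u + beta * v

def pointwise_multiplicative(u, v):
    # Eq. (2): p_i = u_i * v_i (feature intersection)
    return u * v

def dilation(u, v, lam=2.0):
    # Eq. (3): stretch the component of the head v parallel to the modifier u
    return (lam - 1.0) * np.dot(u, v) * u + np.dot(u, u) * v

def lexical_function(U, v):
    # Eq. (5): the adjective is a matrix U applied to the noun vector v
    return U @ v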


To obviate the limitation of the lexical function model in treating rare adjectives, Bride, Van de Cruys, and Asher (2015) come up with a tensor A for adjectival composition, which replaces adjective-specific matrices and is multiplied by the adjective vector u with a tensor dot product. The resulting matrix X is then multiplied with the noun vector. Hartung et al. (2017) apply all the compositional operations listed so far to CBOW word embeddings (Mikolov et al. 2013) of adjectives and nouns, registering superior performances with respect to count-based models in attribute selection and phrase similarity prediction tasks. Finally, Asher et al. (2017) resort to Latent Vector Weighting and tensor factorization to implement the Type Composition Logic (Asher 2011) conception of adjective-noun composition as a combination of two properties respectively representing the contextual contribution of the noun on the adjectival meaning and vice versa. As regards previous computational studies on idiomaticity, two complementary issues have mainly been addressed so far: automatically separating potentially idiomatic strings like spill the beans from strings that can only receive a literal reading like read a book (idiom type identification) and automatically telling apart idiomatic vs. literal usages of a given string in context (idiom token identification; e.g., John finally kicked the bucket after being ill for more than a decade vs. John kicked the bucket after he had accidentally stumbled on it). Since in the present work we will carry out a task of the first kind, we will just review the existing literature on idiom type detection. While some scholars like Graliński (2012) try to rely just on shallow features like metalinguistic markers (e.g. proverbially or literally) and quotation marks to spot the presence of idioms in running text, most research has exploited those linguistic properties that typically distinguish idioms from literals, namely compositionality and lexicosyntactic fixedness. Tapanainen, Piitulainen, and Järvinen (1998) compare the frequency of a target noun as object with the number of verbs that appear with that object, assuming that objects of idiomatic constructions occur with just one or a few verbs at most. McCarthy, Keller, and Carroll (2003) focus on the compositionality of phrasal verbs (e.g. eat up, blow up, etc.), finding a strong correlation between human compositionality judgments and thesaurus-based measures of the overlap between the neighbors of a phrasal verb (e.g. eat up) and those of its simplex verb (e.g. eat). Evert, Heid, and Spranger (2004) and Ritz and Heid (2006) use frequency information to determine the preferred morphosyntactic features of idiomatic expressions which distinguish them from compositional constructions, while Widdows and Dorow (2005) extract asymmetric lexicosyntactic patterns such as A and/or B which never occur in the reversed order B and/or A in their corpus. Such a fixed linear order emerges as a clue of various kinds of relationships between lexeme pairs, among which idiomatic ones. Bannard (2007) studies the syntactic variability of VP idioms, in the form of internal modification via adjectives and passivization. Conditional PMI is used to calculate how the syntactic variation of the pair differs from what would be expected considering the variation of the single lexemes.
Muzny and Zettlemoyer (2013) propose a supervised technique for identifying idioms among Wiktionary lexical entries with lexical and graph-based features extracted from Wiktionary and WordNet. A group of studies employs distributional methods that compare the vector of a phrase with the vectors of its components (Baldwin et al. 2003; Venkatapathy and Joshi 2005; Fazly and Stevenson 2008) or with the vector deriving from the sum or the product of their components (Mitchell and Lapata 2010; Krčmář, Ježek, and Pecina 2013). Among their vector-based indices, Fazly and Stevenson (2008) also include the cosine distance between the vector of a MWE as a whole and the vector of a verb that is morphologically related to the multiword noun, e.g. between make a decision and

decide. Finally, in a similar fashion to our proposal, some scholars have more precisely addressed lexical fixedness as a clue of idiomaticity. Lin (1999) classifies a phrase as non-compositional if the PMI between its components is significantly different from the mutual information of its variants. Each of these alternative forms is obtained by replacing one word in the original phrase with a semantic neighbour. For example, red tape, which means ‘bureaucracy’, receives a high non-compositionality score, because the PMI between red and tape is far higher than the PMI between yellow and tape or black and tape, yellow tape and black tape being the thesaurus-generated variants of red tape. On the other hand, economic impact is labeled as compositional, since its PMI is very similar to the PMIs of its variants like financial impact and economic consequence. Fazly, Cook, and Stevenson (2009) elaborate on Lin’s idea, labeling a given combination as idiomatic if the PMI between its constituents is significantly different from the average PMI between the components of its variants. In Senaldi, Lebani, and Lenci (2016) we built on the aforementioned distributional and variant-based approaches by investigating how similar the vectors of a set of target Italian verbal idiomatic and non-idiomatic constructions were to the vectors of lexical variants of these targets, generated from a window-based DSM and the Italian section of MultiWordNet (Pianta, Bentivogli, and Girardi 2002). All in all, as we expected, idiom vectors appeared to be less similar to their variant vectors than literal phrase vectors were.

3. Our proposal

In this study, we first aim to extend the variant-based method tested in Senaldi, Lebani, and Lenci (2016) on verbal idioms to noun-adjective expressions, which are mostly neglected in the idiom literature, to observe whether our indices can perform effectively in separating idiomatic vs. non-idiomatic AN constructions as well. Secondly, differently from our former work, we compare the variant-based method against conventional additive and multiplicative compositionality indices proposed in the distributional literature (Mitchell and Lapata 2010; Krčmář, Ježek, and Pecina 2013). Finally, besides using a window-based DSM and Italian MultiWordNet (Pianta, Bentivogli, and Girardi 2002) to extract our variants, we also experimented with a syntax-based DSM (Padó and Lapata 2007; Baroni and Lenci 2010) that keeps track of the dependency relations between a given target and its contexts, to see whether variants generated with a different kind of distributional information can lead to improved performances. The rest of the paper is organized as follows: in Section 4 we describe the dataset of target AN constructions we started from and the generation and extraction of the lexical variants for our targets from a Linear DSM, a Structured DSM and the Italian section of MultiWordNet (Pianta, Bentivogli, and Girardi 2002); in Section 5 we describe the collection of human-elicited idiomaticity judgments we used as a gold standard to assess the performance of our measures; in Section 6 we present the compositionality indices we tested on our dataset; both quantitative results and a qualitative error analysis will be provided in Section 7; finally, in Section 8 we provide some concluding remarks and point out possible future research directions.

4. Data extraction

4.1 Extracting the target expressions

All in all, our dataset is composed of 26 types of Italian noun-adjective and adjective-noun combinations, including 13 Italian idioms selected from an idiom dictionary


(Quartu 1993) and then extracted from the itWaC corpus (Baroni et al. 2009), which totals about 1,909M tokens. The frequency of these targets varies from 21 (alte sfere ‘high places’, lit. ‘high spheres’) to 194 (punto debole ‘weak point, weak spot’). The remaining 13 items are compositional pairs of comparable frequencies (e.g., nuova legge ‘new law’ or scrittore famoso ‘famous writer’).

4.2 Extracting lexical variants

As in Senaldi, Lebani, and Lenci (2016), we adopted two different procedures for deriving lexical variants out of our targets, since we were interested in observing whether a fully automatic extraction method like the DSM-based one performed comparably with a more careful but time-consuming manual selection carried out on Italian MultiWordNet (Pianta, Bentivogli, and Girardi 2002).

Linear DSM variants. For both the noun and the adjective of each target, we extracted its top cosine neighbors in a window-based DSM created from the La Repubblica corpus (Baroni et al. 2004) (about 331M tokens). In Senaldi, Lebani, and Lenci (2016) we experimented with different thresholds of selected top neighbors (3, 4, 5 and 6). Since the number of top neighbors that were extracted for each constituent of the target did not significantly affect our performances, we decided to use the maximum number (i.e., 6) for the present study. In the DSM, all the content words occurring more than 100 times were represented as target vectors, ending up with 26,432 vectors, while the top 30,000 content words were used as dimensions. The co-occurrence counts were collected with a context window of ±2 content words around each target word. The co-occurrence matrix was then weighted by Positive Pointwise Mutual Information (PPMI) (Evert 2008), which calculates whether the co-occurrence of two words x and y is more frequent than expected by chance and sets all the negative results to zero:

PPMI(x, y) = max(0, log(P(x, y) / (P(x)P(y)))) (6)
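As an illustration, the PPMI weighting in equation (6) can be sketched in a few lines of numpy; the function name and the dense-matrix representation are our own simplifications (real co-occurrence matrices of this size would normally be kept sparse):

import numpy as np

def ppmi(counts):
    # PPMI weighting of a raw (targets x contexts) co-occurrence matrix, as in eq. (6)
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    p_xy = counts / total                              # joint probabilities P(x, y)
    p_x = counts.sum(axis=1, keepdims=True) / total    # row marginals P(x)
    p_y = counts.sum(axis=0, keepdims=True) / total    # column marginals P(y)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_xy / (p_x * p_y))
    # max(0, PMI): zero out negative and undefined cells
    return np.where(np.isfinite(pmi) & (pmi > 0), pmi, 0.0)

A truncated SVD (e.g., sklearn.decomposition.TruncatedSVD) can then reduce the weighted matrix to the desired number of latent dimensions.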

Finally, Singular Value Decomposition (SVD) (Deerwester et al. 1990) to 300 latent dimensions was run on our initial 30,000-dimension matrix. The variants were finally obtained by combining the adjective with each of the noun’s top 6 neighbors (e.g., punto debole → vantaggio debole ‘weak advantage’, termine debole ‘weak end’, etc.), the noun with each of the adjective’s top 6 neighbors (e.g., punto debole → punto fragile ‘fragile point’, punto incerto ‘uncertain point’, etc.) and finally all the top 6 neighbors of the adjective and the noun with each other (e.g., punto debole → vantaggio fragile ‘fragile advantage’, termine incerto ‘uncertain end’, etc.), ending up with 48 potential Linear DSM variants per target.

Structured DSM variants. While unstructured (i.e., window-based) DSMs just record the words that linearly precede or follow a target lemma when collecting co-occurrence counts, structured DSMs conceive co-occurrences as ⟨w1, r, w2⟩ triples, where r represents the lexico-syntactic pattern or, as in our case, the parser-extracted dependency relation between w1 and w2 (Padó and Lapata 2007; Baroni and Lenci 2010). The grounding assumption of such models is that the syntactic relation linking two words should act as a cue of their semantic relation (Grefenstette 1994; Turney 2006; Padó and Lapata 2007). Indeed, structured DSMs have been shown to perform competitively with or better than linear DSMs in a variety of semantic tasks, like modeling similarity judgments or selectional preferences or detecting synonyms (Baroni and Lenci 2010).
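The combination scheme just described is simple combinatorics; a minimal sketch (the function name and the tuple representation are ours) makes the 6 + 6 + 36 = 48 count explicit:

from itertools import product

def generate_variants(noun, adj, noun_neighbors, adj_neighbors):
    # Candidate lexical variants of an AN target from top-k DSM neighbors;
    # with k = 6 this yields 6 + 6 + 36 = 48 (noun, adjective) candidates.
    variants = [(n, adj) for n in noun_neighbors]             # punto debole -> vantaggio debole
    variants += [(noun, a) for a in adj_neighbors]            # punto debole -> punto fragile
    variants += list(product(noun_neighbors, adj_neighbors))  # punto debole -> vantaggio fragile
    return variants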


Since we wanted to exploit different kinds of distributional information to generate our variants, following the method described in Baroni and Lenci (2010) we created a structured DSM from La Repubblica (Baroni et al. 2004), where all the content words occurring more than 100 times were kept as targets and the co-occurrence matrix was once again weighted via PPMI and reduced to 300 latent dimensions. For each target, we generated 48 virtual lexical variants with the same procedure described for the window-based DSM variants. A sketch of how dependency-based contexts can be collected is given after this subsection.

iMWN variants. For each noun, we extracted the words occurring in the same synsets and its co-hyponyms from Italian MultiWordNet (iMWN) (Pianta, Bentivogli, and Girardi 2002). As for the adjectives, we experimented with two different approaches, extracting just their synonyms in the first case (iMWNsyn variants) and also adding the antonyms in the second case (iMWNant variants). Since antonyms were not available in the Italian section of MultiWordNet, we had to translate them from the English WordNet (Fellbaum 1998). For each noun and adjective, we kept its top 6 iMWN neighbors in terms of cosine similarity in the same DSM used to acquire the linear DSM variants. Once again, this method provided us with up to 48 potential iMWNsyn and 48 potential iMWNant variants per target.
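As an illustration of the structured-DSM idea, the sketch below collects ⟨w1, r, w2⟩ triples from a dependency-parsed corpus; the tuple format is an assumed, simplified CoNLL-style representation, not the authors’ actual pipeline:

from collections import Counter

def dependency_triples(parsed_sentences):
    # Count <w1, r, w2> co-occurrences from dependency-parsed sentences.
    # Each sentence is a list of (lemma, head_index, deprel) tuples,
    # with head_index = -1 for the root (a simplified, assumed format).
    triples = Counter()
    for sent in parsed_sentences:
        for lemma, head, rel in sent:
            if head >= 0:  # skip the root token
                head_lemma = sent[head][0]
                triples[(head_lemma, rel, lemma)] += 1
    return triples

In a structured DSM, the contexts of a target word are then (relation, word) pairs rather than bare neighboring words.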

5. Gold standard idiomaticity judgments

To validate our computational indices, we presented 9 linguistics students with our 26 targets and asked them to rate how idiomatic each expression was on a 1-7 scale, with 1 standing for “totally compositional” and 7 for “totally idiomatic”. The 26 targets were presented together in three different randomized lists. Each list was rated by three subjects. The mean score given to our idioms was 6.10 (SD = 0.77), while the mean score given to compositional expressions was 2.03 (SD = 1.24). A t-test proved this difference to be statistically significant (t = 10.05, p < 0.001). Inter-coder reliability, measured via Krippendorff’s α (Krippendorff 2012), was 0.76. Following established practice, we took this value as proof of the reliability of the elicited ratings (Artstein and Poesio 2008).
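For reference, both statistics can be reproduced with standard tooling; the rating values below are toy placeholders, not the actual elicited data:

import numpy as np
from scipy import stats

# Hypothetical per-item mean ratings on the 1-7 idiomaticity scale (toy values)
idiom_ratings = np.array([6.3, 5.7, 6.7, 6.0, 5.3, 6.7, 6.3, 5.0, 6.7, 6.3, 6.0, 6.7, 5.7])
literal_ratings = np.array([1.7, 2.3, 1.3, 3.0, 2.0, 1.0, 2.7, 4.3, 1.3, 2.0, 1.7, 2.3, 1.0])

t, p = stats.ttest_ind(idiom_ratings, literal_ratings)
print(f"t = {t:.2f}, p = {p:.4f}")

# Inter-coder agreement: the third-party `krippendorff` package exposes
# krippendorff.alpha(reliability_data=..., level_of_measurement="interval"),
# where the reliability data matrix has one row per rater and one column per item.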

6. Calculating compositionality indices

Two kinds of compositionality indices were computed for our 26 idiomatic and non-idiomatic AN targets. The former, described in Subsection 6.1, comprises the variant-based measures we previously tested on verbal idioms (Senaldi, Lebani, and Lenci 2016). The latter, presented in Subsection 6.2, comprises addition-based and multiplication-based measures that have been previously proposed in the distributional literature (Mitchell and Lapata 2010; Krčmář, Ježek, and Pecina 2013).

6.1 Variant-based indices

The variant generation procedure explained in Section 4.2 provided us with automatically generated, merely potential lexical alternatives for our initial constructions, but of course it could not assure us that they were actually attested in our corpus. For each of our 26 targets, we extracted from itWaC all the occurrences we could find of their respective 48 linear DSM, structured DSM, iMWNsyn and iMWNant variants. For every variant type (linear DSM, structured DSM, iMWNsyn and iMWNant) we built a DSM from itWaC representing the 26 targets and their variants as vectors. While the size of the La Repubblica corpus seemed sufficient for the variant extraction

procedure, we resorted to the five-times-bigger itWaC to represent the variants as vectors and compute the compositionality scores, so as to avoid data sparseness and have a considerable number of variants frequently attested in our corpus. Using two different corpora has the additional advantage of showing that the variant method generalizes to corpora of different text genres and sizes. Co-occurrence statistics recorded how many times each target or variant construction occurred in the same sentence with each of the 30,000 top content words in the corpus. The matrices were then weighted with PPMI and reduced to 150 dimensions via SVD. We finally calculated four different indices:

Mean. The mean of the cosine similarities between the vector of a target construction and the vectors of its variants.

Max. The maximum value among the cosine similarities between a target vector and its variants vectors.

Min. The minimum value among the cosine similarities between a target vector and its variants vectors.

Centroid. The cosine similarity between a target vector and the centroid of its variants vectors.
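A minimal numpy sketch of the four indices over the variants that are attested in the corpus (helper names are ours):

import numpy as np

def cosine(x, y):
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

def variant_indices(target_vec, variant_vecs):
    # Mean, Max, Min and Centroid indices for one target,
    # given the vectors of its corpus-attested variants.
    sims = [cosine(target_vec, v) for v in variant_vecs]
    centroid = np.mean(variant_vecs, axis=0)
    return {
        "Mean": float(np.mean(sims)),
        "Max": max(sims),
        "Min": min(sims),
        "Centroid": cosine(target_vec, centroid),
    }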

In our predictions, ranking our 26 targets in ascending order according to each of the four compositionality indices should result in idioms being placed at the top of the ranking and non-idioms at the bottom of it, since idioms are expected to be the least similar to their lexical alternatives. As we said before, the variant creation method presented in Section 4.2 consists in the generation of a list of potential variants for each target construction, but most of these were not actually found in itWaC (Baroni et al. 2009). Let’s consider the Linear DSM variants of the idiom testa calda ‘hothead’. While 11 of the Linear DSM-generated variants were actually retrieved in itWaC, like mano fredda ‘cold hand’ (12 tokens), mano calda ‘warm hand’ (6 tokens) and piede freddo ‘cold foot’ (2 tokens), the other 37 variants, like mano torrida ‘torrid hand’, gamba fresca ‘cool leg’ and spalla umida ‘moist shoulder’, were not attested at all. Since some of our targets had many variants that were not found in itWaC, each measure was computed twice: in the first case we simply did not consider the non-occurring variants; in the second case, we conceived them as vectors orthogonal to their target vector. For the sake of clarity, the first kind of models will henceforth be referred to as no models, while the latter will be labeled orth models. The rationale behind taking this negative evidence into account was that, if lexical alternatives for a given AN pair could not even be traced in the corpus, this should be taken as an additional and stronger clue of its formal idiosyncrasy and idiomatic status. Non-occurring variants were encoded as orthogonal vectors since, given a vector x, the cosine similarity between x and a vector y which is perpendicular to x is equal to 0.0, so that y is the most distant vector from x. Such an implementation reflects the consideration that a non-existing variant is de facto conceivable as an expression whose meaning is the farthest possible from the meaning of the respective target expression and which contributes to tilting its compositionality score towards 0.0. In practical terms, for the Mean, Max and Min indices, this meant automatically setting to 0.0 the cosine similarity between the target vector and the vector of the non-occurring variant at hand. For the Centroid measure, we first computed the cosine similarity between the target vector and the centroid of its attested variants (cs_a). From this initial cosine


Table 1 Number of non-attested variants for each of the four DSM spaces built from Linear DSM, Structured DSM, iMWNsyn and iMWNant variants respectively.

Space            Total zero variants   Zero variants per target
                                       Mean     SD
Linear DSM       810                   31.15    11.21
Structured DSM   1002                  38.54    10.03
iMWNsyn          717                   27.58    14.52
iMWNant          703                   27.04    13.56

value we then subtracted the product of the number of non-attested variants (n), cs_a and a constant factor k. This factor k, which was set to 0.01 in previous investigations, represented the contribution of each zero variant in reducing the target-variants similarity towards 0.0. k was multiplied by the original cosine since we hypothesized that zero variants contributed differently to lowering the target-variants similarity, depending on the construction under consideration:

Centroid = cs_a − (cs_a · k · n) (7)

Table 1 reports how many variants were not attested in our corpus for each of the four spaces we built (Linear DSM, Structured DSM, iMWNsyn and iMWNant). As we can see, the issue of non-attested variants was an across-the-board phenomenon which involved each of the four space types and was only slightly smaller in the iMWN-derived spaces.
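Under these definitions, the orth treatment of non-attested variants reduces to a couple of one-liners (a sketch with our own function names):

def orth_mean(attested_sims, n_zero):
    # Mean index in the orth setting: every non-attested variant
    # contributes a cosine of 0.0, as if orthogonal to the target.
    all_sims = list(attested_sims) + [0.0] * n_zero
    return sum(all_sims) / len(all_sims)

def orth_centroid(cs_a, n_zero, k=0.01):
    # Eq. (7): Centroid = cs_a - (cs_a * k * n)
    return cs_a - cs_a * k * n_zero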

6.2 Addition-based and multiplication-based indices

The indices in Section 6.1 were compared against two of the measures by Mitchell and Lapata (2010) and Krčmář, Ježek, and Pecina (2013). We trained a DSM on itWaC that represented all the content words with token frequency > 300 and our 26 targets as row-vectors and the top 30,000 content words as contexts. The co-occurrence window was still the entire sentence and the weighting was still PPMI. SVD was carried out to 300 final dimensions. Please note that the context vectors of a given word did not include the co-occurrences of a target idiom that was composed of that word (e.g. the vector for punto did not include the contexts of punto debole), so as to make sure that the vector of the idiom as a whole was compared with vectors that actually represented the literal meaning of its constituents. We then computed the following measures:

Additive. The cosine similarity between a target vector and the vector resulting from the component-wise sum of the noun vector and the adjective vector. This roughly corresponded to performing a weighted addition as explained in Section 2 with both weights set to 1, since in our assumption both component vectors equally contributed to the representation of the whole phrase.

Multiplicative. The cosine similarity between a target vector and the vector resulting from the component-wise product of the noun vector and the adjective vector.
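Both baselines follow directly from the composed vectors of Section 2 (a sketch; the cosine helper is the same as in the variant-based sketch above):

import numpy as np

def cosine(x, y):
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

def additive_index(target_vec, noun_vec, adj_vec):
    # Cosine between the observed AN vector and the summed component vectors
    return cosine(target_vec, noun_vec + adj_vec)

def multiplicative_index(target_vec, noun_vec, adj_vec):
    # Cosine between the observed AN vector and the component-wise product
    return cosine(target_vec, noun_vec * adj_vec)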


7. Results and discussion

Our 26 targets were sorted in ascending order for each compositionality score. In each ranking, we predicted idioms (our positives) to be placed at the top and compositional phrases (our negatives) to be placed at the bottom, since we expected idiom vectors to be less similar to the vectors of their variants. First and foremost, we must report that three idioms for every type of variants (Window-based DSM, Structured DSM and iMWN) obtained a 0.0 score for all the variant-based indices since no variants were found in itWaC. Nevertheless, we kept this information in our ranking as an immediate proof of the multiwordness and idiomaticity of such expressions. These were punto debole, passo falso ‘false step’ and colpo basso ‘cheap shot’ for the Structured DSM spaces, punto debole, pecora nera ‘black sheep’ and faccia tosta ‘cheek’ for the iMWN spaces and punto debole, passo falso and zoccolo duro ‘hard core’ for the Window-based DSM spaces. Table 2 reports the 5 best models for Interpolated Average Precision (IAP), the F-measure at the median and Spearman’s ρ correlation with our gold standard idiomaticity judgments respectively. Following Fazly, Cook, and Stevenson (2009), IAP was computed as the average of the interpolated precisions at recall levels of 20%, 50% and 80%; a sketch of this computation is given below. Interestingly, while Additive was the model that best ranked idioms before non-idioms (IAP), closely followed by our variant-based measures, and figured among those with the best precision-recall trade-off (F-measure), Multiplicative performed comparably to the Random baseline. Although at odds with Mitchell and Lapata (2010)’s results, such a poor performance of the Multiplicative model is in line with the findings of Baroni and Zamparelli (2010) and Boleda et al. (2013), who both found Addition to be more effective in modeling AN compositionality. While Baroni and Zamparelli (2010) hypothesize that product-based indices could be disadvantaged by SVD, which can output negative dimensions and therefore lead to counterintuitive component-wise product results, Boleda et al. (2013) suggest that the feature union performed by addition could more accurately represent the semantics of AN structures compared to the massive feature intersection provoked by multiplication, whereby shared dimensions are inflated and mutually exclusive ones are canceled out. As for the ρ correlation with the human ratings, the best score was achieved by one of our variant-based measures, namely Structured DSM Meanorth (-0.68). Additive did not belong to the 5 models with top correlation, but still achieved a high significant ρ score (-0.62). It’s worth noting that, as we wished, all these correlation indices are negative: the more the subjects perceived a target to be idiomatic, the less its vector was similar to its variants. Max and Min never appeared among the best performing measures, with all top models using Mean and Centroid. Moreover, the DSM models that worked best for IAP and F-measure both used dependency-related distributional information, with window-based DSM models not reaching the top 5 ranks. This difference was nonetheless ironed out when looking at the Top ρ models. In Senaldi, Lebani, and Lenci (2016), models encoding zero variants as orthogonal vectors ranked better than the other ones only when predicting speakers’ judgments, while no models scored better as for IAP and F-measure.
In this work, the majority of the best IAP and F-measure models, and de facto all top ρ models, are orth models, thus showing that considering negative evidence about lexical variants is fruitful for AN compositionality estimation. In light of the overall results, generating variants from DSMs emerges as the best method, since these models performed comparably to MultiWordNet-based models but were fully automatic and did not require an intensive and time-consuming manual selection of the variants. Finally, the presence of antonymy-related information in the iMWN models did not appear to influence the performances considerably.
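For concreteness, the IAP of a ranking can be computed as follows (our own sketch, assuming a binary label per ranked target):

import numpy as np

def interpolated_average_precision(labels, recall_levels=(0.2, 0.5, 0.8)):
    # IAP over a ranking: labels[i] is 1 if the i-th ranked item is an idiom.
    # Interpolated precision at recall r is the maximum precision
    # observed at any recall level >= r.
    labels = np.asarray(labels)
    tp = np.cumsum(labels)
    precision = tp / np.arange(1, len(labels) + 1)
    recall = tp / labels.sum()
    return float(np.mean([precision[recall >= r].max() for r in recall_levels]))

# Example: a ranking of 6 targets where the first three are idioms
print(interpolated_average_precision([1, 1, 1, 0, 0, 0]))  # 1.0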


Table 2 Best models ranked by IAP (top), F-measure at the median (middle) and Spearman’s ρ correlation with the speakers’ judgments (bottom) against the multiplicative model and the random baseline (** = p<0.01, *** = p<0.001).

Top IAP Models                 IAP    F      ρ
Additive                       0.85   0.77   -0.62∗∗∗
Structured DSM Meanorth        0.84   0.85   -0.68∗∗∗
iMWNsyn Centroidorth           0.83   0.85   -0.57∗∗
iMWNant Centroidorth           0.83   0.77   -0.52∗∗
iMWNant Meanorth               0.83   0.69   -0.64∗∗∗

Top F-measure Models           IAP    F      ρ
Structured DSM Meanorth        0.84   0.85   -0.68∗∗∗
iMWNsyn Centroidorth           0.83   0.85   -0.57∗∗
Additive                       0.85   0.77   -0.62∗∗∗
iMWNant Centroidorth           0.83   0.77   -0.52∗∗
iMWNsyn Centroidno             0.82   0.77   -0.57∗∗

Top ρ Models                   IAP    F      ρ
Structured DSM Meanorth        0.84   0.85   -0.68∗∗∗
Window-based DSM Meanorth      0.75   0.69   -0.66∗∗∗
iMWNsyn Meanorth               0.77   0.77   -0.65∗∗∗
iMWNsyn Meanno                 0.70   0.69   -0.65∗∗∗
iMWNant Meanorth               0.83   0.69   -0.64∗∗∗

Baselines                      IAP    F      ρ
Multiplicative                 0.58   0.46   0.03
Random                         0.50   0.31   0.05

7.1 Error analysis

In order to understand whether specific items in our dataset could be particularly troublesome for our algorithms, we carried out a qualitative analysis of the most common false positives (FPs henceforth, i.e. literals wrongly labeled as idioms) and false negatives (FNs henceforth, i.e. idioms wrongly classified as compositional expressions). One of the most common FPs was the ambiguous expression pesce grosso ‘big fish’, which is sometimes used in newspapers to denote an influential person inside a criminal organization. Maybe, given the nature of the corpora we selected, it would have been

better to treat this AN pair as an idiom in the first place. As happened in Senaldi, Lebani, and Lenci (2016), other frequent FPs were compositional but collocation-like combinations like gruppo numeroso ‘large group’ or crescita rapida ‘rapid growth’, which contain nouns and adjectives that co-occur very often. On the other hand, while FNs in our previous study mostly consisted of strongly ambiguous expressions liable to both a figurative and a literal reading according to the context, in this case they were evident idioms, like testa calda or punto fermo ‘fundamental point’ (lit. ‘still point’). To discover why our algorithms ended up classifying them as literals, we had a look at the lexical variants that were generated and available for each of them. For testa calda, only 1 Structured DSM variant occurring just 1 time and 2 iMWN variants occurring 1 time were found in itWaC, and this led to a skewed and unreliable compositionality assessment. As regards punto fermo, the variants that were generated, like punto solido ‘solid point’ or passaggio chiaro ‘clear step’, seem to refer to the same semantic field as the original expression and quite predictably exhibited a similar distribution.

8. Conclusions

AN compositionality constitutes a highly multifaceted and complex phenomenon, whereby the interaction of the semantics of the two constituents can lead to different results, from fully transparent to fully opaque word combinations. Since AN structures are usually neglected in the computational literature on idioms, we decided to focus on the most opaque end of the AN compositionality continuum, applying to AN constructions the same variant-based distributional measures we had previously proposed and tested on verbal idioms (Senaldi, Lebani, and Lenci 2016). Once again, effective performances were obtained, therefore confirming that comparing the vector of a given phrase with the vectors of its lexical variants is a reliable way to estimate its compositionality. More specifically, models computing the mean cosine similarity between the target vector and the variant vectors or the cosine similarity between the target vector and the centroid of the variant vectors stood out as the best performing ones, as did models that kept track of the variants that were not found for a given target in the form of orthogonal vectors. Interestingly, our measures performed comparably to or even better than the Additive method proposed in the distributional literature (Krčmář, Ježek, and Pecina 2013), while the Multiplicative one performed considerably worse than all our models, on a par with the Random baseline. This finding mirrored results from previous studies (Baroni and Zamparelli 2010; Boleda et al. 2013) and suggested that the feature intersection carried out by component-wise vector product is not a viable approach for modeling the semantics of AN combinations. Future work will concern testing whether these variant-based measures can be successfully exploited to predict psycholinguistic data about the processing of idiom compositionality and flexibility, together with other corpus-based indices of idiomaticity. Moreover, we plan to extend the comparison of the variant-based approach to matrix- and tensor-based models of AN composition.

References

Artstein, Ron and Massimo Poesio. 2008. Inter-coder agreement for computational linguistics. Computational Linguistics, 34(4):555–596.
Asher, Nicholas. 2011. Lexical meaning in context: A web of words. Cambridge University Press, Cambridge, UK.
Asher, Nicholas, Tim Van de Cruys, Antoine Bride, and Márta Abrusán. 2017. Integrating type theory and distributional semantics: A case study on adjective–noun compositions. Computational Linguistics, 42(4):703–725.
Baldwin, Timothy, Colin Bannard, Takaaki Tanaka, and Dominic Widdows. 2003. An empirical model of multiword expression decomposability. In Proceedings of the ACL 2003 workshop on Multiword Expressions: Analysis, Acquisition and Treatment, pages 89–96, Sapporo, Japan, July 12, 2003.
Bannard, Colin. 2007. A measure of syntactic flexibility for automatically identifying multiword expressions in corpora. In Proceedings of the Workshop on a Broader Perspective on Multiword Expressions, pages 1–8, Prague, Czech Republic, June 28, 2007.
Baroni, Marco, Silvia Bernardini, Federica Comastri, Lorenzo Piccioni, Alessandra Volpi, Guy Aston, and Marco Mazzoleni. 2004. Introducing the La Repubblica Corpus: A Large, Annotated, TEI(XML)-Compliant Corpus of Newspaper Italian. In Proceedings of the 4th International Conference on Language Resources and Evaluation, pages 1771–1774, Lisbon, Portugal, May 26-28, 2004.
Baroni, Marco, Silvia Bernardini, Adriano Ferraresi, and Eros Zanchetta. 2009. The WaCky wide web: a collection of very large linguistically processed web-crawled corpora. Language Resources and Evaluation, 43(3):209–226.
Baroni, Marco and Alessandro Lenci. 2010. Distributional memory: A general framework for corpus-based semantics. Computational Linguistics, 36(4):673–721.
Baroni, Marco and Roberto Zamparelli. 2010. Nouns are vectors, adjectives are matrices: Representing adjective-noun constructions in semantic space. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 1183–1193, Cambridge, MA, October 9-11, 2010.
Boleda, Gemma, Marco Baroni, Louise McNally, and Nghia Pham. 2013. Intensionality was only alleged: On adjective-noun composition in distributional semantics. In Proceedings of the 10th International Conference on Computational Semantics, pages 35–46, Potsdam, Germany, March 19-22, 2013.
Bride, Antoine, Tim Van de Cruys, and Nicholas Asher. 2015. A Generalisation of Lexical Functions for Composition in Distributional Semantics. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, pages 281–291, Beijing, China, July 26-31, 2015.
Cacciari, Cristina. 2014. Processing multiword idiomatic strings: Many words in one? The Mental Lexicon, 9(2):267–293.
Cacciari, Cristina and Sam Glucksberg. 1991. Understanding idiomatic expressions: The contribution of word meanings. Advances in Psychology, 77:217–240.
Chierchia, Gennaro and Sally McConnell-Ginet. 1990. Meaning and Grammar: An Introduction to Semantics. MIT Press, Cambridge, MA.
Church, Kenneth W. and Patrick Hanks. 1991. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22–29.
Deerwester, Scott, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391.
Evert, Stefan. 2008. Corpora and collocations. In Anke Lüdeling and Merja Kytö, editors, Corpus Linguistics. An International Handbook, volume 2. Mouton de Gruyter, Berlin & New York, pages 1212–1248.
Evert, Stefan, Ulrich Heid, and Kristina Spranger. 2004. Identifying morphosyntactic preferences in collocations. In Proceedings of the 4th International Conference on Language Resources and Evaluation, pages 907–910, Lisbon, Portugal, May 26-28, 2004.
Fazly, Afsaneh, Paul Cook, and Suzanne Stevenson. 2009. Unsupervised type and token identification of idiomatic expressions. Computational Linguistics, 1(35):61–103.
Fazly, Afsaneh and Suzanne Stevenson. 2008. A distributional account of the semantics of multiword expressions. Italian Journal of Linguistics, 1(20):157–179.
Fellbaum, Christiane, editor. 1998. WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA.
Fodor, Jerry A. and Ernest Lepore. 2002. The Compositionality Papers. Oxford University Press, Oxford, UK.
Graliński, Filip. 2012. Mining the web for idiomatic expressions using metalinguistic markers. In Proceedings of Text, Speech and Dialogue: 15th International Conference, pages 112–118, Brno, Czech Republic, September 3-7, 2012.


Grefenstette, Gregory. 1994. Explorations in Automatic Thesaurus Discovery. Kluwer Academic Publishers, Dordrecht, the Netherlands. Guevara, Emiliano. 2010. A regression model of adjective-noun compositionality in distributional semantics. In Proceedings of the 2010 Workshop on GEometrical Models of Natural Language Semantics, pages 33–37, Uppsala, Sweden, July 16, 2010. Harris, Zellig S. 1954. Distributional structure. Word, 10(2-3):146–162. Hartung, Matthias, Fabian Kaupmann, Soufian Jebbara, and Philipp Cimiano. 2017. Learning Compositionality Functions on Word Embeddings for Modelling Attribute Meaning in Adjective-Noun Phrases. In Proceedings of the 15th Meeting of the European Chapter of the Association for Computational Linguistics, pages 54–64, Valencia, Spain, April 3-7, 2017. Kamp, Hans. 1975. Two theories about adjectives. In Edward L. Keenan, editor, Formal Semantics of Natural Language. Cambridge University Press, Cambridge, UK, pages 123–155. Krippendorff, Klaus. 2012. Content analysis: An introduction to its methodology. Sage, Los Angeles, London, New Delhi & Singapore. Krčmář, Lubomír, Karel Ježek, and Pavel Pecina. 2013. Determining Compositionality of Expressions Using Various Word Space Models and Measures. In Proceedings of the Workshop on Continuous Vector Space Models and their Compositionality, pages 64–73, Sofia, Bulgaria, August 9, 2013. Lenci, Alessandro. 2008. Distributional semantics in linguistic and cognitive research. Italian Journal of Linguistics, 20(1):1–31. Lin, Dekang. 1999. Automatic identification of non-compositional phrases. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, pages 317–324, College Park, Maryland, June 20-26, 1999. McCarthy, Diana, Bill Keller, and John Carroll. 2003. Detecting a continuum of compositionality in phrasal verbs. In Proceedings of the ACL 2003 workshop on Multiword Expressions: Analysis, Acquisition and Treatment, pages 73–80, Sapporo, Japan, July 12, 2003. Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems, pages 3111–3119, Lake Tahoe, Nevada, December 5-10, 2013. Mitchell, Jeff and Mirella Lapata. 2010. Composition in Distributional Models of Semantics. Cognitive Science, 34(8):1388–1429. Montague, Richard. 1970. Universal grammar. Theoria, 36(3):373–398. Muzny, Grace and Luke S. Zettlemoyer. 2013. Automatic Idiom Identification in Wiktionary. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1417–1421, Seattle, WA, October 19-21, 2013. Nunberg, Geoffrey, Ivan Sag, and Thomas Wasow. 1994. Idioms. Language, 70(3):491–538. Padó, Sebastian and Mirella Lapata. 2007. Dependency-based construction of semantic space models. Computational Linguistics, 33(2):161–199. Partee, Barbara. 1995. Lexical semantics and compositionality. In Lila R. Gleitman, Daniel N. Osherson, and Mark Liberman, editors, An invitation to cognitive science: Language, volume 1. MIT Press, Cambridge, MA, pages 311–360. Penrose, Roger. 1955. A generalized inverse for matrices. Mathematical Proceedings of the Cambridge Philosophical Society, 51(3):406–413. Pianta, Emanuele, Luisa Bentivogli, and Christian Girardi. 2002. MultiWordNet: Developing an Aligned Multilingual Database.
In Proceedings of the First International Conference on Global WordNet, pages 293–302, Mysore, India, January 21-25, 2002. Quartu, Monica B. 1993. Dizionario dei modi di dire della lingua italiana. RCS Libri, Milan, Italy. Ritz, Julia and Ulrich Heid. 2006. Extraction tools for collocations and their morphosyntactic specificities. In Proceedings of the 5th International Conference on Language Resources and Evaluation, pages 1925–1930, Genoa, Italy, May 24-26, 2006. Sag, Ivan A., Timothy Baldwin, Francis Bond, Ann Copestake, and Dan Flickinger. 2002. Multiword Expressions: A Pain in the Neck for NLP. In Proceedings of the 3rd International Conference on Intelligent Text Processing and Computational Linguistics, pages 1–15, Mexico City, Mexico, February 17-23, 2002. Sahlgren, Magnus. 2008. The distributional hypothesis. Italian Journal of Linguistics, 20(1):33–54. Senaldi, Marco S. G., Gianluca E. Lebani, and Alessandro Lenci. 2016. Lexical variability and compositionality: Investigating idiomaticity with distributional semantic models. In Proceedings of the 12th Workshop on Multiword Expressions, pages 21–31, Berlin, Germany, August 11, 2016. Tapanainen, Pasi, Jussi Piitulainen, and Timo Järvinen. 1998. Idiomatic object usage and support verbs. In Proceedings of the 17th International Conference on Computational Linguistics, pages 1289–1293, Montreal, Quebec, Canada, August 10-14, 1998. Turney, Peter D. 2006. Similarity of semantic relations. Computational Linguistics, 32(3):379–416. Turney, Peter D. and Patrick Pantel. 2010. From Frequency to Meaning: Vector Space Models of Semantics. Journal of Artificial Intelligence Research, 37:141–188. Venkatapathy, Sriram and Aravind Joshi. 2005. Measuring the relative compositionality of verb-noun (V-N) collocations by integrating features. In Proceedings of the Joint Conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 899–906, Vancouver, British Columbia, Canada, October 06-08, 2005. Widdows, Dominic and Beate Dorow. 2005. Automatic extraction of idioms using graph analysis and asymmetric lexicosyntactic patterns. In Proceedings of the ACL-SIGLEX Workshop on Deep Lexical Acquisition, pages 48–56, Ann Arbor, Michigan, June 30, 2005. Zanzotto, Fabio Massimo, Ioannis Korkontzelos, Francesca Fallucchi, and Suresh Manandhar. 2010. Estimating Linear Models for Compositional Distributional Semantics. In Proceedings of the 23rd International Conference on Computational Linguistics, pages 1263–1271, Beijing, China, August 23-27, 2010.

LU4R: Adaptive Spoken Language Understanding for Robots

Andrea Vanzo∗ (Sapienza Università di Roma)    Danilo Croce∗∗ (Università di Roma, Tor Vergata)

Roberto Basili† (Università di Roma, Tor Vergata)    Daniele Nardi‡ (Sapienza Università di Roma)

Service robots are expected to operate in specific environments, where the presence of humans plays a key role. It is thus essential to enable natural and effective communication between humans and robots. One of the main features of such robotic platforms is the ability to react to spoken commands, which requires a comprehensive understanding of the user utterance in order to trigger the corresponding robot reaction. Moreover, the correct interpretation of linguistic interactions depends on physical, cognitive and language-dependent aspects related to the environment. In this work, we present the latest version of LU4R - adaptive spoken Language Understanding 4 Robots, a Spoken Language Understanding framework for the semantic interpretation of robotic commands that is sensitive to the operational environment. The overall system is designed according to a Client/Server architecture, so that it can be easily deployed on a wide range of robotic platforms. Moreover, an improved version of HuRIC - the Human-Robot Interaction Corpus - is presented; its main novelty is the extension to commands expressed in Italian. In order to prove the effectiveness of the system, we also report empirical results for both English and Italian, computed over the new HuRIC resource.

1. Introduction

One of the most challenging issues that Service Robotics has been facing in recent years is the need for high-level interaction and collaboration between humans and robots. In such a robotic context, human language is one of the most natural means of communication, owing to its expressiveness and flexibility. However, effective communication in natural language between humans and robots is challenging, also because of the different cognitive abilities involved in the interaction. In fact, for a robot to react to a simple command like "take the pillow on the couch", a number of implicit assumptions should be met. First, at least two entities, a pillow and a couch, must exist in the environment and the speaker must be aware of such entities. Accordingly, the robot must have access to an inner representation of the objects, e.g. an explicit map of the environment.

∗ Dept. of Computer, Control and Management Engineering “Antonio Ruberti” - Via Ariosto 25, 00185 Rome, Italy. E-mail: [email protected]
∗∗ Dept. of Enterprise Engineering - Via del Politecnico 1, 00133 Rome, Italy. E-mail: [email protected]
† Dept. of Enterprise Engineering - Via del Politecnico 1, 00133 Rome, Italy. E-mail: [email protected]
‡ Dept. of Computer, Control and Management Engineering “Antonio Ruberti” - Via Ariosto 25, 00185 Rome, Italy. E-mail: [email protected]

Second, mappings from lexical references to real world entities must be developed or made available. In this respect, the Grounding process (Harnad 1990) links symbols (e.g. words) to the corresponding perceptual information. Hence, robot interactions need to be grounded, as meaning depends on the state of the physical world and the interpretation crucially interplays with perception, as pointed out by psycholinguistic theories (Tanenhaus et al. 1995). The integration of perceptual information derived from the robot's sensors with an ontologically motivated description of the world has been adopted as an augmented representation of the environment, in the so-called semantic maps (Nüchter and Hertzberg 2008). In these maps, the existence of real world objects can be associated with lexical information, in the form of entity names given by a knowledge engineer or spoken by a user for a pointed object, as in Human-Augmented Mapping (Diosi, Taylor, and Kleeman 2005; Gemignani et al. 2016).

While Spoken Language Understanding (SLU) for Interactive Robotics has mostly been carried out over evidence specific to the linguistic level alone (see, for example, (Chen and Mooney 2011; Matuszek et al. 2012)), we argue that this process should be context-aware, in the sense that both the user and the robot live in, and make references to, a shared environment. For example, in the above command, "taking" is the intended action whenever a pillow is actually on the couch, so that "the pillow on the couch" refers to a single argument. On the contrary, the command may refer to a "bringing" action when no pillow is on the couch, so that the pillow and on the couch correspond to different semantic roles. We are interested in an approach to the interpretation of spoken robotic commands that is consistent with (i) the world (with all the entities composing it), (ii) the Robotic Platform (with all its inner representations and capabilities), and (iii) the linguistic information derived from the user's utterance.

We foster here the approach presented in (Bastianelli et al. 2016a), where a machine learning method for Spoken Language Understanding forces the interpretations to be consistent with the environment: this is obtained by extending the linguistic evidence that can be extracted from the uttered commands with perceptual evidence directly derived from the semantic map of the robot. In particular, the interpretation process is modeled as a sequence labeling problem, where the final labeler is trained by applying Structured Learning methods over realistic commands expressed in domestic environments, as in (Bastianelli et al. 2017). The resulting interpretations adhere to Frame Semantics (Fillmore 1985): this well-established theory provides a strong linguistic foundation to the overall process while enforcing its applicability, as it is independent from the wide variety of existing robotic platforms. Such methodologies have been implemented in a free and ready-to-use framework, presented here, whose name is LU4R - an adaptive spoken Language Understanding framework for (4) Robots. LU4R is entirely coded in Java and, thanks to its Client/Server architectural design, it is completely decoupled from the robot, enabling easy and fast deployment on every platform1.
As the aforementioned approaches rely on realistic data, in this work we also present an extended version of HuRIC - a Human-Robot Interaction Corpus - originally introduced in (Bastianelli et al. 2014). HuRIC is a collection of realistic spoken commands that users might address to generic service robots. In this resource, each sentence is labeled with morpho-syntactic and syntactic information (e.g. dependency relations, POS tags, ...), along with its correct interpretation in terms of semantic frames (Baker, Fillmore, and Lowe 1998).

1 LU4R can be downloaded at http://sag.art.uniroma2.it/lu4r.html


We present here a new version of HuRIC that has been enhanced in terms of (i) the number of annotated sentences in English and (ii) a brand new section, where Italian commands have been added (and aligned) to the already existing English counterparts. To the best of our knowledge, this is the first dataset of spoken robotic commands in Italian2. The extended version of HuRIC supports a larger and more significant evaluation of LU4R, which highlights its robustness towards commands expressed in the investigated languages. Specifically, we observed very good performances for both languages; these outcomes are encouraging for the deployment of LU4R (and the underlying methods and psycholinguistic assumptions) in realistic applications.

The rest of the paper is structured as follows. Section 2 provides a short survey of existing approaches to SLU in Human-Robot Interaction. Section 3 describes the semantic analysis process that represents the core of the LU4R system. In Section 4, an architectural description of the entire system is provided, as well as an overall introduction to its integration with a generic robot. Section 5 describes the new release of HuRIC, while in Section 6 we demonstrate the applicability of the proposed system to the interpretation of commands in English and Italian, by reporting our experimental results. Finally, Section 7 draws the conclusions.

2. Related Work

In Robotics, Spoken Language Understanding (SLU) has usually been treated by following two orthogonal approaches: grammar-based and data-driven. Grammar-based systems for speech recognition model language phenomena through the definition of grammars. Moreover, they provide mechanisms to enrich the syntactic structure with semantic information, in order to build a semantic representation during the transcription process (Bos 2002; Bos and Oka 2007). In (Bastianelli et al. 2016b), SLU supporting manifold robotic tasks is performed jointly with speech recognition, through the definition of ad-hoc grammars. This is possible thanks to the Speech Recognition Grammar Specification3, which allows semantic attachments to be injected directly within the grammar specification. Other approaches are based on formal languages, as in (Kruijff et al. 2007), where Combinatory Categorial Grammars (CCG) are applied to spoken dialogues in the context of Human-Augmented Mapping, or exploit template-based algorithms (see (Perera and Veloso 2015)) to extract a semantic interpretation of robotic commands from the corresponding syntactic trees.

Data-driven methods have also been applied to SLU for robotic applications. Examples are (MacMahon, Stankiewicz, and Kuipers 2006) and (Chen and Mooney 2011), where the parsing of route instructions is addressed as a Statistical Machine Translation task between the human language and a synthesized robot language. The same approach is applied in (Matuszek, Fox, and Koscher 2010) to learn a translation model between natural language and formal descriptions of paths. A probabilistic CCG is used in (Matuszek et al. 2012) to map natural navigational instructions into robot-executable commands. The same problem is faced in (Kollar et al. 2010; Duvallet, Kollar, and Stentz 2013), where Spatial Description Clauses are parsed from sentences through sequence labeling approaches. In (Tellex et al. 2011), the authors address natural language instructions about motion and grasping, which are mapped into Generalized Grounding Graphs (G3).

2 The extended version of HuRIC will be released at http://sag.art.uniroma2.it/huric.html
3 http://www.w3.org/TR/speech-grammar/


In (Fasola and Matarić 2013a, 2013b), SLU for pick-and-place instructions is performed through a Bayesian classifier trained over a specific corpus. In (Misra et al. 2016), the authors define a probabilistic approach to ground natural language instructions within a changing environment.

2.1 Contribution

On the one hand, LU4R embodies most of the capabilities, in terms of linguistic generalization, that characterize the data-driven approaches presented above. On the other hand, it introduces several novelties that are missing in the existing literature. First, the interpretation is performed and provided in terms of semantic frames, according to the Frame Semantics theory (Fillmore 1985). Hence, the resulting logic form representing the meaning of a command is supported by a robust linguistic theory. Moreover, as both the proposed semantic parsing approach and the nature of such a theory are domain-independent, the development of SLU in other domains will depend mostly on the existence of training data. Second, the interpretation process is context-dependent whenever additional knowledge derived from perception discriminates among multiple possible interpretations.

3. The Language Understanding Cascade

A command interpretation system for a robotic platform must produce interpretations of user utterances. In this paper, the understanding process is based on the theory of Frame Semantics (Fillmore 1985); in this way, we aim at giving a linguistic and cognitive basis to the interpretations. In particular, we consider the formalization promoted by the FrameNet project (Baker, Fillmore, and Lowe 1998), where actions expressed in user utterances can be modeled as semantic frames. Each frame represents a micro-theory about a real world situation, e.g. the actions of bringing or motion. Such micro-theories encode all the relevant information needed for their correct interpretation. This information is represented in FrameNet via the so-called frame elements, whose role is to specify the entities participating in a frame, e.g. the THEME frame element represents the object that is taken in a bringing action. As an example, let us consider the following sentence: "bring the pillow on the couch" ("porta il cuscino sul divano", in Italian). This sentence can be intended as a command instructing a robot that, in order to achieve the task, has to: (i) move towards a pillow, (ii) pick it up, (iii) move to the couch and, finally, (iv) release the object on the couch. The language understanding cascade should produce its FrameNet-annotated version, that is:

[bring]Bringing [the pillow]THEME [on the couch]GOAL   (1)

or

[porta]Bringing [il cuscino]THEME [sul divano]GOAL   (2)

whenever the command is expressed in Italian. Semantic frames can thus provide a cognitively sound bridge between the actions expressed in the language and the implementation of such actions in the robot world, namely plans and behaviors.
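To make this target representation concrete, the following minimal Java sketch models a frame-based interpretation; all class and field names (FrameInstance, FrameElement, and so on) are illustrative choices of ours, not part of the LU4R API.

// Minimal sketch of a frame-based interpretation; hypothetical names,
// LU4R's actual internal representation may differ.
import java.util.List;

record FrameElement(String role, String span) { }   // e.g. role = "THEME", span = "the pillow"

record FrameInstance(String frame, String lexicalUnit, List<FrameElement> elements) { }

class FrameExample {
    public static void main(String[] args) {
        // "bring the pillow on the couch" -> Bringing(THEME, GOAL), as in Example (1)
        FrameInstance bringing = new FrameInstance(
            "Bringing", "bring",
            List.of(new FrameElement("THEME", "the pillow"),
                    new FrameElement("GOAL", "on the couch")));
        System.out.println(bringing);
    }
}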


Figure 1
The SLU cascade: the transcription hypotheses and the perceived entities feed a pipeline composed of Morpho-Syntactic Analysis, Re-ranking, Action Detection, Argument Identification and Argument Classification, whose output is the interpretation.

Figure 2
Example of a dependency graph and POS tags associated to "bring the pillow on the couch": bring/VB, the/DT, pillow/NN, on/IN, the/DT, couch/NN, connected by the dependency relations ROOT, DOBJ, PREP, POBJ and DET.

The whole SLU process has been designed as a cascade of reusable components, as shown in Figure 1. As we deal with vocal commands, the (possibly multiple) transcription hypotheses produced by an Automatic Speech Recognition (ASR) engine constitute the input of the process. It is composed of four modules, whose final output is the interpretation of an utterance, to be used to implement the corresponding robotic actions. First, Morpho-syntactic and syntactic analysis is performed over the available utterance transcriptions by applying morphological analysis, Part-of-Speech tagging and syntactic analysis. In particular, dependency trees are extracted from the sentence as well as POS tags, as shown in Figure 2. Then, if more than one transcription hypothesis is available, the Re-ranking module can be activated to compute a new ranking of the hypotheses, in order to get the best transcription out of the initial ranking. This module is realized through a learn-to-rank approach, where a Support Vector Machine exploiting a combination of linguistic kernels is applied, according to (Basili et al. 2013). Third, the best transcription is the input of the Action Detection (AD) component: the frames evoked in a sentence are detected, along with the corresponding evoking words, the so-called lexical units. For our recurring sentence, the AD should produce the following interpretation: [bring]Bringing the pillow on the couch. The final step is the Argument Labeling, where the set of frame elements is retrieved for each frame. This process is realized in two sub-steps. First, the Argument Identification (AI) finds the spans of all the possible frame elements, producing the following form: [bring]Bringing [the pillow] [on the couch]. Then, the Argument Classification (AC) assigns the suitable label (i.e. the frame element) to each span, thus returning the final tagging shown in Example (1).

The AD, AI and AC steps are modeled as a sequential labeling task, as in (Bastianelli et al. 2016a). The Markovian formulation of a structured SVM proposed in (Altun, Tsochantaridis, and Hofmann 2003), known as SVMhmm, is applied to implement the sequential labeler. In general, this learning algorithm combines a local discriminative model, which estimates the individual observation probabilities of a sequence, with a global generative approach that retrieves the most likely sequence, i.e. the tags that better explain the whole sequence. In other words, given an input sequence x = (x1 ... xl) ∈ X of feature vectors, SVMhmm learns a model isomorphic to a k-order Hidden Markov Model in order to associate x with a sequence of labels y = (y1 ... yl) ∈ Y.
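For illustration, the following sketch implements a first-order Viterbi search over per-token label scores, i.e. the kind of decoding used to retrieve the most likely tag sequence. It is a didactic stand-in only: the emission and transition scores are assumed to come from a trained model such as SVMhmm, and the actual decoder may differ.

// Didactic first-order Viterbi decoding over per-token label scores.
// emission[t][y]: score of label y at token t (assumed to come from a trained model);
// transition[p][y]: score of moving from label p to label y.
class ViterbiSketch {
    static int[] decode(double[][] emission, double[][] transition) {
        int T = emission.length, K = emission[0].length;
        double[][] score = new double[T][K];
        int[][] back = new int[T][K];
        score[0] = emission[0].clone();
        for (int t = 1; t < T; t++) {
            for (int y = 0; y < K; y++) {
                double best = Double.NEGATIVE_INFINITY;
                for (int p = 0; p < K; p++) {
                    double s = score[t - 1][p] + transition[p][y] + emission[t][y];
                    if (s > best) { best = s; back[t][y] = p; }
                }
                score[t][y] = best;
            }
        }
        // Recover the best final label, then follow the back-pointers.
        int[] path = new int[T];
        int bestLast = 0;
        for (int y = 1; y < K; y++) if (score[T - 1][y] > score[T - 1][bestLast]) bestLast = y;
        path[T - 1] = bestLast;
        for (int t = T - 1; t > 0; t--) path[t - 1] = back[t][path[t]];
        return path;
    }
}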


A sentence s is here intended as a sequence of words wi, each modeled through a feature vector xi and associated with a dedicated label yi, specifically designed for each interpretation step: in any case, the features combine linguistic evidence from the target sentence with features derived from the semantic map (when available), in order to synthesize information about the existence and position of entities around the robot, as discussed in more detail in (Bastianelli et al. 2016a). During training, the SVM algorithm associates words with step-specific labels: linear kernel functions are applied to different types of features, ranging from linguistic to perception-based features, and linear combinations of kernels are used to integrate independent properties. At classification time, given a sentence s = (w1 ... w|s|), SVMhmm efficiently predicts the tag sequence y = (y1 ... y|s|) using a Viterbi-like decoding algorithm.

Notice that both the re-ranking and the semantic parsing phases can be realized in two different settings, depending on the type of features adopted in the labeling process. It is thus possible to rely only upon linguistic information to solve the given task, or also on perceptual knowledge coming from a semantic map. In the first case, which we call the basic setting, the information used to solve the task comes from linguistic inputs, such as the sentence itself or external linguistic resources. These models correspond to the methods discussed in (Bastianelli et al. 2017; Basili et al. 2013). In the second case, the simple setting, perceptual information is made available to the chain and a context-aware interpretation is triggered, as in (Bastianelli et al. 2016a). Such perceptual knowledge is mainly exploited through a linguistic grounding mechanism. This lexically-driven grounding is estimated through distances between fillers (i.e. argument heads) and entity names. Such a semantic distance integrates metrics over word vector descriptions and phonetic similarity. Word semantic vectors are here acquired through corpus analysis, as in Distributional Lexical Semantics paradigms (Turney and Pantel 2010). They allow referential elements, such as lexical fillers (e.g. couch), to be mapped to entities (e.g. a sofa), thus modeling synonymy or co-hyponymy. Conversely, phonetic similarities act as smoothing factors against possible ASR transcription errors (e.g. pitcher vs. picture), allowing the chain to actually cope with spoken language. Once links between fillers and entities have been activated, the sequential labeler is made sensitive to additional features that inject perceptual information both in the learning and in the tagging process, e.g. the presence/absence of referred objects in the environment. As a side effect, the above mechanism provides the robot with a set of linguistically-motivated groundings that can potentially be used for any further grounding process.

This information can be crucial for the correct interpretation of ambiguous commands, which depends on the specific environmental setting in which the robot operates. A straightforward example is the command "bring the pillow on the couch in the living room". Such a sentence may have two different interpretations, according to the configuration of the environment. In fact, whenever the couch is located in the living room, the goal of the Bringing action is the couch and the interpretation will be:

[bring]Bringing [the pillow]THEME [on the couch in the living room]GOAL

Conversely, if the couch is outside the living room, it means that the pillow is probably already on the couch. Hence, the interpretation of the sentence will be different, due to different argument spans, and the living room becomes the goal of the Bringing action:

[bring]Bringing [the pillow on the couch]THEME [in the living room]GOAL


Such ambiguities largely carry over across languages. In fact, the same phenomenon can be observed in the corresponding Italian command "porta il cuscino sul divano in sala da pranzo". However, the proposed approach is robust across different languages, as the disambiguation of the interpretation depends only on the configuration of the environment and not on the targeted language. Additional details about the purely linguistic approach can be found in (Bastianelli et al. 2017), whereas (Bastianelli et al. 2016a) provides a detailed description of the context-aware SLU process.
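As a rough illustration of the lexically-driven grounding described above, the sketch below combines a distributional (cosine) similarity over word vectors with a normalized edit-distance similarity, used here as a crude stand-in for a true phonetic metric; the combination scheme, the weight alpha and all names are our own assumptions, not LU4R internals.

// Illustrative filler-to-entity matching: cosine similarity over word vectors,
// smoothed by a normalized Levenshtein similarity as a crude phonetic proxy.
class GroundingSketch {
    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) { dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i]; }
        return dot / (Math.sqrt(na) * Math.sqrt(nb) + 1e-12);
    }
    static double editSimilarity(String s, String t) {
        int[][] d = new int[s.length() + 1][t.length() + 1];
        for (int i = 0; i <= s.length(); i++) d[i][0] = i;
        for (int j = 0; j <= t.length(); j++) d[0][j] = j;
        for (int i = 1; i <= s.length(); i++)
            for (int j = 1; j <= t.length(); j++)
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + (s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1));
        return 1.0 - (double) d[s.length()][t.length()] / Math.max(s.length(), t.length());
    }
    // Hypothetical combination: alpha weighs distributional vs. phonetic evidence.
    static double matchScore(double[] fillerVec, double[] entityVec,
                             String filler, String entityName, double alpha) {
        return alpha * cosine(fillerVec, entityVec)
             + (1 - alpha) * editSimilarity(filler, entityName);
    }
}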

4. LU4R - adaptive spoken Language Understanding 4 Robots

The architecture of the LU4R system comprises two main actors, as shown in Figure 3: the Robotic Platform and the LU4R chain (or simply LU4R); the processing cascade of the latter has been introduced in the previous Section.

Figure 3
The LU4R architecture: the Robotic Platform (Client) hosts the LU4R Android app (ASR), the LU4R ROS interface (SLU Orchestrator), the Grounding component and a Support Knowledge Base (Domain Model, Semantic Map, User Model, Platform Model); it sends the list of hypotheses for a spoken command, together with the perceived entities, to the LU4R chain (Server), which returns the interpretation as its response.

The Client-Server communication schema between LU4R and the Robot makes LU4R independent of the specific Robotic Platform, in order to maximize its re-usability and integration in heterogeneous robotic settings. The SLU process exhibits semantic capabilities (e.g. disambiguation, predicate detection or grounding into robotic actions and environments) that are designed to be general enough to be representative of a large set of application scenarios.

Obviously, an interpretation must be produced even when no information about the domain/environment is available, i.e. in a scenario involving a blind but speaking robot, or when the actions a robot can perform are not made explicit. This is the case when the command "take the pillow on the couch" is not paired with any additional information and the ambiguity with respect to the evoked frame, i.e. Taking vs. Bringing, cannot be resolved. At the same time, LU4R makes available methods to specialize its semantic interpretation process to individual situations where more information is available about goals, the environment and the robot capabilities. These methods are expected to support the optimization of the core SLU process against a specific interactive robotics setting, in a cost-effective manner. In fact, whenever more information about the environment perceived by the robot (e.g. a semantic map) or about its capabilities is provided, the interpretation of a command can be improved by exploiting a more focused scope.

That is the case when the sentence "take the pillow on the couch" is provided along with information about the presence and possible positions of a pillow on a couch.

In order to better describe the different operating modalities of LU4R, some assumptions about the Robotic Platform must be made explicit: this allows us to precisely establish the functionalities and resources that the robot needs to provide to unlock the more complex processes. This information is used to express the experience that the robot is able to share with the user (i.e. the perceptual knowledge about the environment where the linguistic communication occurs, together with some lexical information and properties about objects in the environment) and some level of awareness about its own capabilities (e.g. the primitive actions that the robot is able to perform, given its hardware components). In the following, each component of the architecture in Figure 3 will be discussed and analyzed.

4.1 The Robotic Platform

The LU4R system contemplates a generic Robotic Platform, whose task, domain and physical setting are not necessarily specified. In order to make the SLU process independent from these specific aspects, we assume that the platform includes, at least, the following modules:

- an Automatic Speech Recognition (ASR) system;
- a SLU Orchestrator;
- a Grounding and Command Execution Engine;
- a Physical Robot.

In developing LU4R, we implemented both the ASR system and a simple SLU Orchestrator. The ASR is realized by the LU4R Android app, exploiting the Android environment, whereas the SLU Orchestrator is implemented as a ROS node, through the LU4R ROS interface. Additionally, the optional Support Knowledge Base component is expected to maintain and provide the contextual information discussed above. While a discussion of the Robotic Platform itself is out of the scope of this work, all the other components are shortly summarized hereafter.

LU4R Android app. An ASR engine transcribes a spoken utterance into one or more transcriptions. In the latest release, the ASR is performed through an ad-hoc Android application, the LU4R Android app (Fig. 4). It relies on the official Google ASR API4 and offers good performance for an off-the-shelf solution. The main requirement of this solution is that the device hosting the software must have an Internet connection, in order to obtain transcriptions of the spoken utterance. The app can be deployed on both Android smartphones and tablets. In the latter case, even though the communication protocol remains the same, the tablet will be part of the robotic platform; it can be equipped with a directional condenser microphone and speakers.

4 http://goo.gl/4ZkdU


Figure 4 The LU4R Android app

The communication with the rest of the system is realized through TCP sockets. In this setting, the LU4R Android app implements a TCP Client, feeding LU4R with lists of hypotheses through a middle layer. To this end, the LU4R ROS interface has been integrated in the loop, acting as the TCP Server. Once a new sentence is uttered by the user, this component outputs a list of hypothesized transcriptions, which are forwarded to the LU4R ROS interface.

LU4R ROS interface. The LU4R ROS interface implements a TCP Server for the LU4R Android app, coded as a ROS node waiting for Client requests. Once a new request is received (a list of transcriptions for a given spoken sentence), this module is in charge of extracting the perceived entities from a structured representation of the environment (here, a sub-component of the Support Knowledge Base) and of sending the list of hypothesized transcriptions to LU4R, along with the list of the perceived entities. The communication protocol requires the serialization of such information into two different JSON objects. However, in order to obtain the desired interpretation, only the list of transcriptions is mandatory: even though environmental information is essential for the perception-driven chain, whenever it is not provided the chain operates in a blind setting. Moreover, this module has been decoupled from the LU4R chain, as it can be employed for other purposes, such as tele-operating the robot by means of a virtual joypad coded into the Android app (Fig. 4). This component, mediating between the LU4R Android app, the LU4R chain and the Robotic Platform, is provided along with the LU4R system, so that robustness in the communication is guaranteed. Hence, the robotic developers are only in charge of: (i) the deployment of the ROS node into the target Robotic System; (ii) the definition of the policies for the acquisition of perceptual knowledge; and (iii) the manipulation of the structure representing the interpretation returned by the LU4R chain. Even though this module is actually a TCP Server for the LU4R Android app, it also represents the Client interface toward the LU4R chain.
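As an illustration, the following sketch serializes such a request with the org.json library; the field names (transcription, confidence, type) follow the description in the text, but the exact wire format of LU4R is not reproduced here, so the snippet should be read as an assumption-laden example rather than the actual protocol.

// Hypothetical serialization of the two JSON objects exchanged with LU4R.
import org.json.JSONArray;
import org.json.JSONObject;

public class RequestSketch {
    public static void main(String[] args) {
        // Hypothesized transcriptions with their ASR confidence scores.
        JSONArray hypotheses = new JSONArray()
            .put(new JSONObject().put("transcription", "take the pillow on the couch")
                                 .put("confidence", 0.91))
            .put(new JSONObject().put("transcription", "take the pillow on the coach")
                                 .put("confidence", 0.64));
        // Perceived entities extracted from the semantic map (optional: without
        // them, the chain falls back to the blind, purely linguistic setting).
        JSONArray entities = new JSONArray()
            .put(new JSONObject().put("type", "couch"))
            .put(new JSONObject().put("type", "pillow"));
        System.out.println(hypotheses);
        System.out.println(entities);
    }
}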

Grounding and Command Execution. Even though the grounding process is placed at the end of the loop, it is discussed here since it is a component of the Robotic Platform. In fact, this process has been completely decoupled from the SLU process, as it may involve perception capabilities and information unavailable to LU4R or, in general, outside the linguistic dimension.

Nevertheless, this situation can be partially compensated by defining mechanisms to exchange some of the grounding information with the linguistic reasoning component. The grounding carried out by the robot is triggered by a logical form expressing one or more actions through logic predicates, which potentially correspond to specific frames. The output of LU4R embodies the produced logic form: the latter exposes the recognized actions, which are then linked to specific robotic operations (primitive actions or plans). Correspondingly, the predicate arguments (e.g. the objects and locations involved in the targeted action) are detected and linked to the objects/entities of the current environment. A fully grounded command is obtained through the complete instantiation of the robot action (or plan) and its final execution.
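As a purely illustrative example of this last step, the sketch below maps a recognized frame to a hypothetical plan of primitive actions; the frame names follow the paper, while the plan vocabulary (goTo, pick, place) is invented for the example and is not part of any specific robotic platform.

// Hypothetical mapping from recognized frames to robot plans.
import java.util.List;

class PlannerSketch {
    static List<String> planFor(String frame, String theme, String goal) {
        return switch (frame) {
            case "Bringing" -> List.of("goTo(" + theme + ")", "pick(" + theme + ")",
                                       "goTo(" + goal + ")", "place(" + theme + ")");
            case "Taking"   -> List.of("goTo(" + theme + ")", "pick(" + theme + ")");
            case "Motion"   -> List.of("goTo(" + goal + ")");
            default         -> List.of();   // unsupported frame: no executable plan
        };
    }
    public static void main(String[] args) {
        // "bring the pillow on the couch" -> Bringing(THEME=pillow, GOAL=couch)
        System.out.println(planFor("Bringing", "pillow", "couch"));
    }
}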

4.2 The LU4R Chain

The LU4R component implements the language understanding cascade described in Section 3. It realizes the SLU service as a black-box component, so that the complexity of each inner sub-task is hidden from the user. It is entirely coded in Java and released as a single Jar file. Morpho-syntactic and syntactic analysis is realized through the Stanford CoreNLP suite (Manning et al. 2014) when English is the targeted language, and through the Chaos parser (Basili and Zanzotto 2002) for Italian commands. The SVMhmm algorithm for the three steps of the semantic analysis (namely, Action Detection, Argument Identification and Argument Classification) is implemented through the KeLP framework (Filice et al. 2015).

The LU4R chain is a service that can be invoked through HTTP communication. It is realized as a server that listens for natural language sentences and outputs an interpretation for them. The communication between the client of the service (the Robotic Platform) and the LU4R chain is described in this Section. The LU4R chain requires an initialization phase, where the process is run and initialized, followed by a service phase, where LU4R is ready to receive requests. The initialization phase corresponds to creating an instance of the chain among the ones defined in the previous Section, i.e. either basic or simple. The basic setting does not contemplate perceptual knowledge during the interpretation process; conversely, the simple configuration relies on perceptual information, enabling a context-sensitive interpretation of the command at the predicate level. During the initialization, a specific output format can also be chosen among the available ones. For example, xdg is the default output format, where the interpretation is given as an eXtended Dependency Graph (XDG), an XML-compliant container (see (Basili and Zanzotto 2002)). In the amr format, the interpretation is given in the Abstract Meaning Representation (see (Banarescu et al. 2013)). Finally, cfr (Command Frame Representation) is a format for the predicates (frames) produced by the chain, defined in (Schneider et al. 2014) in the context of the RoCKIn competition. The language parameter allows to choose the operating language of LU4R; at the moment, only en (English) and it (Italian) are supported.

Once the service has been initialized, it is possible to start requesting the interpretation of user utterances. The server thus waits for messages carrying the utterance transcriptions to be parsed. Each sentence corresponds to a speech recognition hypothesis; hence, it can be paired with the corresponding transcription confidence score, which is useful in the re-ranking phase. The body of the message must then contain the list of hypotheses encoded as a JSON array, called hypotheses, where each entry is a transcription paired with a confidence.
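A minimal client-side sketch of the service phase, using Java's standard HTTP client, might look as follows; the endpoint URL and route are placeholders, as the actual host, port and path depend on how the LU4R server is deployed and are not specified here.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class Lu4rClientSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder endpoint: host, port and route depend on the deployment.
        String endpoint = "http://localhost:9090/service/nlu";
        String body = "{\"hypotheses\": ["
            + "{\"transcription\": \"take the pillow on the couch\", \"confidence\": 0.91}"
            + "]}";
        HttpRequest request = HttpRequest.newBuilder(URI.create(endpoint))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(body))
            .build();
        HttpResponse<String> response =
            HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());   // the interpretation, e.g. in xdg/amr/cfr format
    }
}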


Additionally, when the simple configuration is selected, the input can include the list of entities populating the environment the robot is operating in (e.g. the names of rooms, or of the furniture and objects in the rooms), again encoded as a JSON array. Regardless of the representation of the environment adopted by the robot, this environment-dependent interpretation process requires the following information for each entity "perceived" by the robot (a possible encoding is sketched after the list):

- the type of each entity: it reflects the class the specific entity belongs to (e.g. an object, such as a table, book or pillow, or a location, such as living_room or kitchen);
- the preferredLexicalReference used to refer to a class of objects: it is crucial in order to enable a linguistic grounding between the commands uttered by the user and the entities within the environment. These labels are expected to be provided by the engineer initializing the robot. For example, an entity of the class couch can be referred to by the string sofa. If no label is given, it is derived from the name of the corresponding class, so that couch can be used to refer to the objects of the class couch;
- in case the engineer provides more than one label, these can be specified through alternativeLexicalReference, as a list of alternative namings for a given entity;
- the position of each entity, which is essential to determine shallow spatial relations between entities (e.g. two objects being near or far from each other). To this end, each entity is associated with its coordinates in the world, in terms of planar coordinates (x, y), elevation (z) and an angle for the orientation.
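Putting these attributes together, a single entity entry might be encoded as in the sketch below; the field names type, preferredLexicalReference and alternativeLexicalReference come from the list above, while the nesting of the position object and all concrete values are illustrative assumptions.

// Hypothetical encoding of one perceived entity with the attributes above.
import org.json.JSONArray;
import org.json.JSONObject;

public class EntitySketch {
    public static void main(String[] args) {
        // The position layout (x, y, z, angle) mirrors the attributes listed
        // in the text, but the concrete JSON nesting is hypothetical.
        JSONObject couch = new JSONObject()
            .put("type", "couch")
            .put("preferredLexicalReference", "sofa")
            .put("alternativeLexicalReference", new JSONArray().put("couch").put("settee"))
            .put("position", new JSONObject()
                .put("x", 2.4).put("y", 0.7).put("z", 0.0).put("angle", 1.57));
        System.out.println(couch.toString(2));   // pretty-print with indentation
    }
}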

5. HuRIC 2.0: a multilingual corpus of robotic commands

The computational paradigms adopted in LU4R are based on machine learning techniques and depend strictly on the availability of training data. In order to properly train and test our framework, we are developing a collection of datasets that together form the Human-Robot Interaction Corpus5 (HuRIC), formerly presented in (Bastianelli et al. 2014). HuRIC is based on Frame Semantics and captures cognitive information about situations and events expressed in sentences. Differently from other corpora for Spoken Language Understanding in Human-Robot Interaction, it is neither system nor robot dependent, both with respect to the kind of sentences and with respect to the adopted formalism: HuRIC contains information strictly related to Natural Language Semantics and is decoupled from specific systems. The corpus covers different situations representing possible commands given to a robot in a house environment. HuRIC is composed of different subsets, characterized by different orders of complexity and designed to stress a possible architecture in different ways. Each dataset includes a set of audio files representing robot commands, paired with the correct transcriptions.

5 Available at http://sag.art.uniroma2.it/huric. The download page also contains a detailed description of the release format.


Table 1
HuRIC: some statistics

                           English   Italian
Number of examples            656       214
Number of frames               18        14
Number of predicates          767       241
Number of roles                34        27
Predicates per sentence      1.17      1.13
Sentences per frame         36.44     15.29
Roles per sentence           2.04      1.83

Table 2
Distribution of frames and frame elements in the English dataset (the number of examples per frame is given in parentheses; each frame element is followed by its count)

Motion (143): THEME 23, GOAL 129, DIRECTION 9, PATH 9, MANNER 4, DISTANCE 1, AREA 2, SOURCE 1
Bringing (153): THEME 153, GOAL 95, AGENT 39, BENEFICIARY 56, SOURCE 18, MANNER 1, AREA 1
Cotheme (39): COTHEME 39, SPEED 1, MANNER 9, THEME 4, PATH 1, GOAL 8, AREA 1
Locating (90): PHENOMENON 89, GROUND 34, COGNIZER 10, PURPOSE 5, MANNER 2
Inspecting (29): GROUND 28, DESIRED_STATE 9, INSPECTOR 5, UNWANTED_ENTITY 2
Taking (80): AGENT 8, THEME 80, SOURCE 16, PURPOSE 2
Change_direction (11): THEME 1, DIRECTION 11, ANGLE 3, SPEED 1
Arriving (12): GOAL 11, PATH 5, MANNER 1, THEME 1
Giving (10): RECIPIENT 10, THEME 10, DONOR 4, REASON 1
Placing (52): THEME 52, GOAL 51, AGENT 7, AREA 1
Closure (19): CONTAINER_PORTAL 8, AGENT 7, CONTAINING_OBJECT 11, DEGREE 2
Change_operational_state (49): AGENT 17, DEVICE 49, OPERATIONAL_STATE 43
Being_located (38): THEME 38, LOCATION 34, PLACE 1
Attaching (11): ITEM 6, GOAL 11, ITEMS 1
Releasing (9): THEME 9, GOAL 5
Perception_active (6): PHENOMENON 6, MANNER 1
Being_in_category (11): ITEM 11, CATEGORY 11
Manipulation (5): ENTITY 5

Each sentence is then annotated with lemmas, POS tags, dependency trees6 and Frame Semantics. Semantic frames and frame elements are used to represent the meaning of commands, as, in our view, they reflect the actions a robot can accomplish in a home environment. In this way, HuRIC can potentially be used to train all the modules of the processing chain presented in Section 4. With respect to the previous releases, in order to cover further robotic actions, the release of LU4R required an extension of HuRIC in terms of new frames, such as Change_direction, and, in general, of new frame elements: at the moment, the English subset of HuRIC contains 656 sentences.

6 At the moment of writing, the dependency trees associated to the Italian sentences are still under validation.


Table 3
Distribution of frames and frame elements in the Italian dataset (the number of examples per frame is given in parentheses; each frame element is followed by its count)

Motion (32): GOAL 28, MANNER 1, THEME 3, PATH 2, SOURCE 1, DIRECTION 1
Locating (27): MANNER 2, PHENOMENON 27, GROUND 6, PURPOSE 1
Inspecting (4): GROUND 2, UNWANTED_ENTITY 2, INSTRUMENT 1, DESIRED_STATE_OF_AFFAIRS 2
Bringing (59): THEME 60, GOAL 26, BENEFICIARY 31, SOURCE 8
Cotheme (13): COTHEME 13, GOAL 5, MANNER 6
Placing (18): THEME 18, GOAL 17, AREA 1
Closure (10): CONTAINING_OBJECT 5, CONTAINER_PORTAL 6, DEGREE 1
Giving (7): RECIPIENT 6, THEME 7, DONOR 1
Change_direction (9): DIRECTION 9, ANGLE 3, SPEED 1
Taking (22): THEME 22, SOURCE 8
Being_located (14): LOCATION 14, THEME 12
Being_in_category (4): ITEM 4, CATEGORY 4
Releasing (8): THEME 8, PLACE 3
Change_operational_state (14): DEVICE 14

Most importantly, we extended HuRIC with a first set of 214 commands in Italian. Almost all Italian sentences are translations of the original commands in English, and the corpus also keeps the alignment between those sentences. We believe these alignments will support research in further areas, such as Machine Translation. The number of annotated sentences, the number of frames and further statistics are reported in Table 1. Detailed statistics about the number of sentences for each frame and frame element are reported in Tables 2 and 3 for the English and Italian subsets, respectively.

The current release of HuRIC is provided in a novel XML-based format, whose file extension is hrc. For each command we store: (i) the whole sentence, (ii) the list of the tokens composing it, along with the corresponding lemmas and POS tags, (iii) the dependency relations among tokens, and (iv) the semantics, expressed in terms of frames and frame elements.
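For illustration only, a hypothetical hrc entry could look like the string below; every element and attribute name is invented to mirror the four kinds of information just listed, since the actual schema is documented on the HuRIC release page.

public class HrcSketch {
    public static void main(String[] args) {
        // Hypothetical hrc-like entry: element and attribute names are invented
        // to mirror the information listed above; the real schema may differ.
        String example = """
            <command id="s1">
              <sentence>bring the pillow on the couch</sentence>
              <tokens>
                <token id="1" surface="bring" lemma="bring" pos="VB"/>
                <token id="3" surface="pillow" lemma="pillow" pos="NN"/>
              </tokens>
              <dependencies>
                <dep head="1" dependent="3" relation="dobj"/>
              </dependencies>
              <semantics>
                <frame name="Bringing" lexicalUnit="1">
                  <frameElement name="THEME" span="2..3"/>
                  <frameElement name="GOAL" span="4..6"/>
                </frame>
              </semantics>
            </command>
            """;
        System.out.println(example);
    }
}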

6. Experimental Evaluation

In order to provide evidence about the effectiveness of the proposed solution, we report here an evaluation of the interpretation process of robotic commands in two languages, i.e. English and Italian, in the basic setting. Tables 4 and 5 show the results obtained over the new version of the Human-Robot Interaction Corpus (HuRIC) presented in Section 5; the experiments have been performed on both languages, as HuRIC provides commands in both English and Italian. The results, expressed in terms of Precision, Recall and F1 measure, focus on the semantic interpretation process, in particular on the Action Detection (AD), Argument Identification (AI) and Argument Classification (AC) steps, so that each F1 score measures the quality of a specific module. While in the AD step the F1 refers to the ability to extract the correct frame(s) (i.e. the robot action(s) expressed by the user) evoked by a sentence, in the AI step it evaluates the correctness of the predicted argument spans. Finally, in the AC step the F1 measures the accuracy of the classification of individual arguments.


Table 4
English dataset (mean ± standard deviation over 5 random splits)

      Precision        Recall           F1-Measure
AD    95.14% ± 1.73    95.02% ± 0.37    95.07% ± 0.93
AI    89.95% ± 2.28    89.63% ± 2.00    89.78% ± 2.05
AC    92.15% ± 1.51    92.15% ± 1.51    92.15% ± 1.51

The experiments have been performed in a random split setting, over 5 iterations. During each iteration, the dataset is shuffled and split into three subsets, containing 70%, 10% and 20% of the data, used as training, tuning and testing set, respectively. Accordingly, Tables 4 and 5 also show the standard deviations across the different iterations. We tested each sub-module in isolation, feeding each step with the gold information provided by the previous step in the chain. Moreover, the evaluation has been carried out on the correct transcriptions, i.e. without considering the errors introduced by the Automatic Speech Recognition system. The results over both datasets refer to the basic setting of LU4R, that is, the configuration in which only linguistic information is exploited.

The results on the English commands (Table 4) are encouraging for the application of LU4R in realistic scenarios. In fact, the F1 is higher than 95% in the recognition of the semantic predicates used to express the intended actions (AD). The system is able to recognize the involved entities (AC) with high accuracy as well, with an F1 higher than 92%. This result is remarkable considering the complexity of the task: the classifier has to cope with a high level of uncertainty, as the number of possible semantic roles is sizable, i.e. 34 in total. The most challenging task is recognizing the spans composing a single frame element (AI), where the F1 settles just under 90% (89.78%). One of the most frequent errors concerns the ambiguity of the verb "take". In fact, as explained in the previous sections, the interpretation of this verb may differ (i.e. either Bringing or Taking) depending on the configuration of the environment. As the basic setting does not rely on any kind of perceptual knowledge, the system is not able to correctly discriminate between the two. Hence, the resulting interpretation is more likely to be wrong, as it does not reflect the semantics that is motivated by the environment. In terms of F1 measure, this issue affects the recognition of argument spans (AI) more than the identification of the action(s) (AD), since for each (possibly) wrong frame there can be more than two (possibly) wrong arguments. For example, the sentence "take the pillow on the couch" will probably be recognized as a Taking action, even when it is labeled as Bringing, i.e. when the pillow and the couch are supposed to be far apart in the environment. While the AD step receives just one penalty for the wrongly recognized action, the AI step is penalized twice, as two arguments were expected by the gold standard annotation, i.e. the pillow as THEME and the couch as GOAL, instead of one, i.e. the pillow on the couch as a single THEME argument.

Preliminary experiments in the perception-driven setting suggest that, whenever such knowledge is injected into the learning process, the system is able to mitigate the error rate over these phenomena. In addition, the small standard deviations suggest that the system is rather stable across the different iterations of the experiment and that the results do not depend on specific splits of the entire dataset.
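For reference, the reported scores follow the standard precision/recall/F1 definitions, as in the sketch below; the variable names are generic and not tied to LU4R's evaluation code, and the counts are illustrative.

// Standard precision/recall/F1 from true positives, false positives and
// false negatives; mean and standard deviation are then taken over the
// 5 random splits, as in Tables 4 and 5.
class EvalSketch {
    static double precision(int tp, int fp) { return tp / (double) (tp + fp); }
    static double recall(int tp, int fn)    { return tp / (double) (tp + fn); }
    static double f1(double p, double r)    { return 2 * p * r / (p + r); }

    public static void main(String[] args) {
        double p = precision(90, 10);   // illustrative counts
        double r = recall(90, 15);
        System.out.printf("P=%.4f R=%.4f F1=%.4f%n", p, r, f1(p, r));
    }
}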

Table 5
Italian dataset (mean ± standard deviation over 5 random splits)

      Precision        Recall           F1-Measure
AD    93.59% ± 2.81    88.63% ± 4.25    91.01% ± 3.21
AI    82.80% ± 1.38    82.50% ± 3.47    82.64% ± 2.34
AC    89.93% ± 3.83    89.93% ± 3.83    89.93% ± 3.83

The experiments over the Italian dataset confirm the observations made for the English setting. The system is able to recognize actions (AD) with an F1 measure of 91.01%; again, this suggests that the process of recognizing the intended action(s) is reliable enough to be applicable in real scenarios. As in the English setting, the most challenging step is Argument Identification, where the F1 measure does not exceed 83% (82.64%). The results are promising when compared to the actual size of the dataset: the classifiers are trained on just 80% of the entire Italian dataset, i.e. about 170 sentences on average. Though lower, the accuracy in recognizing the involved entities (AC) is in line with the English experiments, with an F1 score of 89.93%. It seems plausible that the gap in performance and standard deviations with respect to the English dataset is mainly due to the reduced size of the Italian one.

When looking at the errors, we observe again that the introduction of perceptual information could be beneficial for the overall task, especially for the AI step. In fact, the command "porta il bicchiere sul tavolo in cucina" (i.e. bring the glass on the table in the kitchen) cannot be correctly predicted without information about the involved entities, as two different interpretations are plausible. The intended action corresponds to a Bringing one in both cases; nevertheless, the involved roles are substantially different. Whenever the referred table (tavolo) is inside the kitchen (cucina), the table itself represents the goal of the action, whereas if the glass (bicchiere) is on the table, the latter is probably outside the kitchen, which is, instead, the goal of the action. Hence, the lack of perceptual evidence can play a key role in producing these mis-classifications. Though the F1 measures are not directly comparable, as they capture different phenomena, this behavior could explain the bigger gap observed between the AD and AI results in the Italian experiment (8.37%) than in the English one (5.29%). Notice that this phenomenon does not affect the AC task since, as said above, during the experimental evaluation each step is fed with gold standard annotations. At the moment of writing, we are pairing each sentence from both subsets of HuRIC with semantic maps, in order to design proper systematic evaluations also for the simple, i.e. context-aware, setting.

7. Conclusions

In this paper, we presented a comprehensive framework for the robust implementation of natural language interfaces for Human-Robot Interaction (HRI).

It is specifically designed for the automatic interpretation of spoken commands given to robots in domestic environments. The solution proposed here relies on Frame Semantics and supports a structured learning approach to language processing, able to map individual sentence transcriptions to meaningful commands. A hybrid discriminative and generative learning method is proposed to map the interpretation process into a cascade of sentence annotation tasks.

The overall framework and the individual algorithms have been implemented in LU4R, a free and ready-to-use Java processing chain, designed for the cost-effective and rapid deployment of language interfaces on a wide range of robotic platforms. By implementing the approach presented in (Bastianelli et al. 2016a), LU4R's command interpretation is made dependent on the robot's environment; in fact, the adopted training annotations not only express linguistic evidence from the source utterances, but also account for specific perceptual knowledge derived from a reference map. In this way, the aspects of the semantic map useful for interpretation are expressed via feature modeling within the applied structured learning mechanism. Such perceptual knowledge is thus derived from a semantically-enriched implementation of a robot map (i.e. its semantic map): it expresses information about the existence and position of entities surrounding the robot; as this is also available to the user, this information is crucial to disambiguate predicates and role assignments.

The machine learning processes inside LU4R have been trained by using an extended version of HuRIC, the Human-Robot Interaction Corpus. This corpus, originally composed of examples in English, now contains a subset of examples in Italian: on the one hand, this novel corpus supports the development of LU4R for the Italian language; most of all, it will support research on natural language interfaces for robots in that language. The empirical results obtained by LU4R over both languages are quite impressive (about 90% F1 in almost all the evaluations). This (i) confirms the effectiveness of the proposed processing chain and (ii) shows the applicability of the same approach to different languages.

Further effort is required to extend HuRIC with additional sentences, in order to cover a wider range of robotic actions. We are currently working to include the semantic maps associated with each individual sentence: this will support a systematic evaluation of the interpretation process enhanced with perceptual information. Future research will also focus on the extension of the methodology proposed in (Bastianelli et al. 2016a), e.g. by considering spatial relations between entities in the environment or their physical characteristics, such as their color. In the medium/long term, we believe that LU4R will support further and more challenging research topics in the context of HRI, such as interactive question answering or dialogue with robots.

References
Altun, Yasemin, Ioannis Tsochantaridis, and Thomas Hofmann. 2003. Hidden Markov support vector machines. In Proceedings of the 20th International Conference on Machine Learning (ICML), Washington D.C., USA, August 21-24. Baker, Collin F., Charles J. Fillmore, and John B. Lowe. 1998. The Berkeley FrameNet project. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics (COLING-ACL ’98), pages 86–90, Montreal, Quebec, Canada, August 10-14. Banarescu, Laura, Claire Bonial, Shu Cai, Madalina Georgescu, Kira Griffitt, Ulf Hermjakob, Kevin Knight, Philipp Koehn, Martha Palmer, and Nathan Schneider. 2013. Abstract Meaning Representation for sembanking. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, pages 178–186, Sofia, Bulgaria, August.


Basili, Roberto, Emanuele Bastianelli, Giuseppe Castellucci, Daniele Nardi, and Vittorio Perera. 2013. Kernel-based discriminative re-ranking for spoken command understanding in HRI. In AI*IA 2013: Advances in Artificial Intelligence. Springer International, pages 169-180.
Basili, Roberto and Fabio Massimo Zanzotto. 2002. Parsing engineering and empirical robustness. Natural Language Engineering, 8(3):97-120, June.
Bastianelli, Emanuele, Giuseppe Castellucci, Danilo Croce, Roberto Basili, and Daniele Nardi. 2014. HuRIC: a Human Robot Interaction Corpus. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC 2014), Reykjavik, Iceland, May 26-31.
Bastianelli, Emanuele, Giuseppe Castellucci, Danilo Croce, Roberto Basili, and Daniele Nardi. 2017. Structured learning for spoken language understanding in human-robot interaction. The International Journal of Robotics Research, 36(5-7):660-683.
Bastianelli, Emanuele, Danilo Croce, Andrea Vanzo, Roberto Basili, and Daniele Nardi. 2016a. A discriminative approach to grounded spoken language understanding in interactive robotics. In Proceedings of the 25th International Joint Conference on Artificial Intelligence (IJCAI), New York, New York, USA, 9-15 July.
Bastianelli, Emanuele, Daniele Nardi, Luigia Carlucci Aiello, Fabrizio Giacomelli, and Nicolamaria Manes. 2016b. Speaky for robots: The development of vocal interfaces for robotic applications. Applied Intelligence, 44(1):43-66, January.
Bos, Johan. 2002. Compilation of unification grammars with compositional semantics to speech recognition packages. In Proceedings of the 19th International Conference on Computational Linguistics (COLING '02), volume 1, pages 1-7, Taipei, Taiwan, 26-30 August. Association for Computational Linguistics.
Bos, Johan and Tetsushi Oka. 2007. A spoken language interface with a mobile robot. Artificial Life and Robotics, 11(1):42-47.
Chen, David L. and Raymond J. Mooney. 2011. Learning to interpret natural language navigation instructions from observations. In Proceedings of the 25th Conference on Artificial Intelligence (AAAI-11), pages 859-865, San Francisco, California, USA, August 7-11.
Diosi, Albert, Geoffrey R. Taylor, and Lindsay Kleeman. 2005. Interactive SLAM using laser and advanced sonar. In Proceedings of the 2005 International Conference on Robotics and Automation, pages 1103-1108, Barcelona, Spain, April 18-22.
Duvallet, Felix, Thomas Kollar, and Anthony Stentz. 2013. Imitation learning for natural language direction following through unknown environments. In 2013 IEEE International Conference on Robotics and Automation, pages 1047-1053, Karlsruhe, Germany, May 6-10.
Fasola, Juan and Maja J. Matarić. 2013a. Using semantic fields to model dynamic spatial relations in a robot architecture for natural language instruction of service robots. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 143-150, Tokyo, Japan, November 3-7.
Fasola, Juan and Maja J. Matarić. 2013b. Using spatial semantic and pragmatic fields to interpret natural language pick-and-place instructions for a mobile service robot. In Social Robotics: 5th International Conference, ICSR 2013, Bristol, UK, October 27-29, 2013, Proceedings. Springer International Publishing, pages 501-510.
Filice, Simone, Giuseppe Castellucci, Danilo Croce, and Roberto Basili. 2015. KeLP: a kernel-based learning platform for natural language processing. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics (ACL 2015): System Demonstrations, Beijing, China, 26-31 July.
Fillmore, Charles J. 1985. Frames and the semantics of understanding. Quaderni di Semantica, 6(2):222-254.
Gemignani, Guglielmo, Roberto Capobianco, Emanuele Bastianelli, Domenico Daniele Bloisi, Luca Iocchi, and Daniele Nardi. 2016. Living with robots. Robotics and Autonomous Systems, 78(C):1-16, April.
Harnad, Stevan. 1990. The symbol grounding problem. Physica D: Nonlinear Phenomena, 42(1-3):335-346.
Kollar, Thomas, Stefanie Tellex, Deb Roy, and Nicholas Roy. 2010. Toward understanding natural language directions. In Proceedings of the 5th ACM/IEEE International Conference on Human-robot Interaction (HRI '10), pages 259-266, Osaka, Japan, March 2-5.
Kruijff, Geert-Jan M., H. Zender, P. Jensfelt, and Henrik I. Christensen. 2007. Situated dialogue and spatial organization: What, where... and why? International Journal of Advanced Robotic Systems, 4(2).


MacMahon, Matt, Brian Stankiewicz, and Benjamin Kuipers. 2006. Walk the talk: connecting language, knowledge, and action in route instructions. In Proceedings of the 21st National Conference on Artificial Intelligence (AAAI'06), volume 2, pages 1475-1482. AAAI Press.
Manning, Christopher D., Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL): System Demonstrations, pages 55-60, Baltimore, Maryland, USA, June 22-27.
Matuszek, Cynthia, Dieter Fox, and Karl Koscher. 2010. Following directions using statistical machine translation. In Proceedings of the 5th ACM/IEEE International Conference on Human-robot Interaction (HRI '10), pages 251-258, Osaka, Japan, March 2-5. IEEE Press.
Matuszek, Cynthia, Evan Herbst, Luke S. Zettlemoyer, and Dieter Fox. 2012. Learning to parse natural language commands to a robot control system. In Jaydev P. Desai, Gregory Dudek, Oussama Khatib, and Vijay Kumar, editors, Experimental Robotics: The 13th International Symposium on Experimental Robotics, volume 88 of Springer Tracts in Advanced Robotics, pages 403-415. Springer.
Misra, Dipendra K., Jaeyong Sung, Kevin Lee, and Ashutosh Saxena. 2016. Tell Me Dave: Context-sensitive grounding of natural language to manipulation instructions. The International Journal of Robotics Research, 35(1-3):281-300.
Nüchter, Andreas and Joachim Hertzberg. 2008. Towards semantic maps for mobile robots. Robotics and Autonomous Systems, 56(11):915-926.
Perera, Vittorio and Manuela M. Veloso. 2015. Handling complex commands as service robot task requests. In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence (IJCAI 2015), pages 1177-1183, Buenos Aires, Argentina, 25-31 July.
Schneider, Sven, Frederik Hegger, Aamir Ahmad, Iman Awaad, Francesco Amigoni, Jakob Berghofer, Rainer Bischoff, Andrea Bonarini, Rhama Dwiputra, Giulio Fontana, Luca Iocchi, Gerhard Kraetzschmar, Pedro Lima, Matteo Matteucci, Daniele Nardi, and Viola Schiaffonati. 2014. The RoCKIn@Home challenge. In Proceedings of the 41st International Symposium on Robotics (ISR/Robotik 2014), pages 1-7, Munich, Germany, June 2-3.
Tanenhaus, Michael K., Michael J. Spivey-Knowlton, Kathleen M. Eberhard, and Julie C. Sedivy. 1995. Integration of visual and linguistic information during spoken language comprehension. Science, 268:1632-1634.
Tellex, Stefanie, Thomas Kollar, Steven Dickerson, Matthew R. Walter, Ashis Gopal Banerjee, Seth Teller, and Nicholas Roy. 2011. Approaching the symbol grounding problem with probabilistic graphical models. AI Magazine, 34(4):64-76.
Turney, Peter D. and Patrick Pantel. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37(1):141-188, January.

For a Performance-oriented Notion of Regularity in Inflection: The Case of Modern Greek Conjugation

Stavros Bompolas∗ (Università di Patrasso)
Marcello Ferro∗∗ (ILC-CNR)
Claudia Marzi∗∗ (ILC-CNR)
Franco Alberto Cardillo∗∗ (ILC-CNR)
Vito Pirrelli∗∗ (ILC-CNR)

Paradigm-based approaches to word processing/learning assume that word forms are not acquired in isolation, but through associative relations linking members of the same word family (e.g. a paradigm, or a set of forms filling the same paradigm cell). Principles of correlative learning offer a set of equations that are key to modelling this complex dynamic at a considerable level of detail. We use these equations to simulate acquisition of Modern Greek conjugation, and we compare the results with evidence from German and Italian. Simulations show that different Greek verb classes are processed and acquired differentially, as a function of their degrees of formal transparency and predictability. We relate these results to psycholinguistic evidence of Modern Greek word processing, and interpret our findings as supporting a view of the mental lexicon as an emergent integrative system.

1. Introduction

Issues of morphological (ir)regularity have traditionally been investigated through the prism of morphological competence, with particular emphasis on aspects of the internal structure of complex words (Bloomfield 1933; Bloch 1947; Chomsky and Halle 1967; Lieber 1980; Selkirk 1984, among others). Within this framework, one of the most influential theoretical positions is that morphologically, phonologically, and/or semantically transparent words are always processed on-line through their constituent elements, whereas irregular, idiosyncratic (non-transparent) forms are stored and retrieved as wholes in the lexicon (Pinker and Prince 1994). Likewise, Ullman and colleagues (1997) assume that the past tense formation of regular verbs in English requires on-line application of an affixation rule (e.g. walk > walk+ed), while irregular past tense forms, involving stem allomorphy (e.g. drink > drank), are retrieved from the lexicon.
Modern Greek introduces an interesting variation in this picture. First, stem allomorphy and suffixation are not necessarily mutually exclusive processes, but coexist

∗ Laboratory of Modern Greek Dialects, University of Patras, University Campus, 265 04 Rio Patras, Greece. E-mail: [email protected]
∗∗ Istituto di Linguistica Computazionale "A. Zampolli", v. Moruzzi 1, Pisa, Italy. E-mail: [email protected]

in the same inflected forms (e.g. ["lin-o] 'I untie' > ["e-li-s-a] 'I untied', [aGa"p(a)-o] 'I love' > [a"Gapi-s-a] 'I loved'). Secondly, affixation rules may select unpredictable stem allomorphs: [aGa"p(a)-o] 'I love' > [a"Gapi-s-a] 'I loved', [fo"r(a)-o] 'I wear' > ["fore-s-a] 'I wore', [xa"l(a)-o] 'I demolish' > ["xala-s-a] 'I demolished'. These cases suggest that inflectional (ir)regularity is not an all-or-nothing notion in Greek. Different inflectional processes may compound in the same words to provide a challenging word processing scenario (Tsapkini, Jarema, and Kehayia 2002b, 2004).
We offer here a computational simulation of the process of acquiring the Modern Greek verb system from scratch, based on exposure to fully-inflected forms only, with no extra morpho-syntactic or morpho-semantic information being provided. The idea is to investigate how aspects of morphological regularity can impact on early stages of word processing, prior to full lexical access. Our goal is to provide a causal model of the micro-dynamics of morphology-driven, peripheral processing effects, as observed in the experimental and acquisitional evidence on Modern Greek (see section 2).
The simulation is implemented with a particular family of artificial neural networks, named Temporal Self-Organising Maps (TSOMs). Unlike traditional multi-layered perceptrons, TSOMs simulate the dynamic spatial and temporal organisation of memory nodes supporting the processing of time series of symbols, making it possible to monitor the short-term and long-term processing behaviour of a serial memory exposed to an increasingly large set of word forms. TSOMs are thus ideal tools for assessing the differential impact of several aspects of regularity (from formal transparency, to predictability and word typicality) on the behaviour of a connectionist framework.
To anticipate some of our results, the paper provides a performance-oriented account of inflectional regularity in morphology, whereby perception of morphological structure is not the by-product of the design of the human word processor (purportedly segregating rules from exceptions), but rather an emergent property of the dynamic self-organisation of stored lexical representations, contingent on the processing history of inflected word forms, and inherently graded and probabilistic. The evidence is in line with what we know of word processing by human speakers, and illustrates the potential of a single, distributed architecture for word processing (Alegre and Gordon 1999; Baayen 2007) to challenge more traditional, modular hypotheses of grammar-lexicon interaction.

2. The evidence

Modern Greek conjugation, like Italian and unlike English, is stem-based. Each fully inflected verb form requires obligatory suffixation of person, number and tense markers that attach to either a bare or a complex stem in both regular ([aGa"p-o] 'I love' ~ [a"Gapi-s-a] 'I loved') and irregular verbs (["pern-o] 'I take' ~ ["pir-a] 'I took'). Unlike English speakers, Greek speakers must always resort to an inflectional process to understand or produce a fully inflected form, no matter how regular the form is (Terzi, Papapetropoulos, and Kouvelas 2005, page 310). Classifying a Greek verb as either regular or irregular thus requires observation of the stem formation processes whereby agreement and tense suffixes are selected.
Accordingly, it is assumed that the presence or absence of the aspectual marker is a criterion for assessing the degree of regularity of a Greek verb. In particular, so-called "sigmatic" (from the Greek letter σ 'sigma') aorist (i.e. perfective) forms (e.g. [aGa"p-o] ~ [a"Gapi-s-a]) are traditionally considered to be regular, in that they involve a segmentable marker (-s-) combined with phonologically predictable or morphologically

systematic stem-allomorphs. "Asigmatic" past-tense forms (e.g. ["pern-o] ~ ["pir-a]), in contrast, exhibit typical properties of irregular inflection, since they involve unsystematic stem allomorphs (in some cases suppletive stems), and no segmentable affixes marking perfective aspect. This distinction has also been supported by psycholinguistic and acquisitional evidence (Stamouli 2000; Tsapkini, Jarema, and Kehayia 2001, 2002a, 2002b, 2002c, 2004; Mastropavlou 2006; Varlokosta et al. 2008; Stavrakaki and Clahsen 2009a, 2009b; Stavrakaki, Koutsandreas, and Clahsen 2012; Stathopoulou and Clahsen 2010; Konstantinopoulou et al. 2013, among others), leading some scholars to suggest that sigmatic past-tense forms are typically produced on-line by rules, while asigmatic forms are stored in, and accessed from, the mental lexicon (Stavrakaki and Clahsen 2009a, 2009b; Stavrakaki, Koutsandreas, and Clahsen 2012).
However, careful analysis of the Greek verb system appears to question such a sharp processing-storage divide. In particular, Greek data provide the case of a mixed inflectional system where both stored allomorphy and rule-based affixation are simultaneously present in the formation of past tense forms. Ralli (1988, 2005, 2006) proposes a classification of verb paradigms based on two criteria: firstly, the presence vs. absence of the sigmatic affix and, secondly, the presence vs. absence of (systematic) stem allomorphy. As a result, we can define the following three classes of aorist formation processes (Tsapkini, Jarema, and Kehayia 2001, 2002a, 2002b, 2002c, 2004):
(i) an affix-based class, requiring the presence of the aspectual marker -s-, and including verbs with a predictable phonological stem-allomorph (e.g., ["lin-o] 'I untie' ~ ["e-li-s-a], ["Graf-o] 'I write' ~ ["e-Grap-s-a] 'I wrote');
(ii) a mixed class, where active perfective past tense forms are produced by affixation of the aspectual marker -s- to a systematic morphological stem-allomorph (e.g., [mi"l-o] 'I speak' ~ ["mili-s-a] 'I spoke');
(iii) an idiosyncratic verb class whose forms are based on either non-systematic stem-allomorphy (including radical suppletion), or no stem-allomorphy and no (sigmatic) aspectual marker (e.g., ["pern-o] 'I take' ~ ["pir-a] 'I took', ["tro-o] 'I eat' ~ ["e-faG-a] 'I ate', ["krin-o] 'I judge' ~ ["e-krin-a] 'I judged').
It should be noted that, in Greek regular verbs, transparency/systematicity and predictability are not mutually implied (Ralli 2005, 2006). The morphologically-conditioned allomorphy of class-(ii) verbs requires a systematic pattern of perfective stem formation, namely X(a) ~ X + V (e.g. [aGap(a)-] > [aGapi-]), where 'X' is a variable standing for the bare stem, 'V' stands for a vowel, and the subscripted '(a)' indicates an optional a forming a Modern Greek free variant of the imperfective stem (e.g. [aGa"po] ~ [aGa"pao]). The variable V in the perfective stem can be instantiated as i, e or a, and cannot be predicted given the bare stem. On the other hand, the phonologically-conditioned allomorphs of class-(i) verbs (e.g. ["lin-] > ["e-li-s-]) are the outcome of exceptionless phonological rules, which nonetheless obfuscate a full formal correspondence (transparency) between the imperfective stem and the perfective stem.
Psycholinguistic experiments and evidence from language acquisition provide strong empirical support for the hypothesis that the human lexical processor is sensitive to the more nuanced classification of the Greek verb system reported above. Morphological regularity is not an epiphenomenon of the design of the human language faculty and the purported dualism between rule-based and memory-based routes. Rather, it is better understood as the graded result of the varying interaction of phonological, morphotactic and morpho-semantic factors. In particular, lack of full formal nesting between imperfective and perfective stems (as in [Du"lev-o] 'I work' ~ ["Dulep-s-a] 'I worked')

appears to have an extra processing cost for speakers. Tsapkini and colleagues (2002b) compared priming evidence obtained from Greek past-tense forms matched for orthographic overlap (50%) with present forms, in two different conditions of Stimulus-Onset Asynchrony¹ (SOA): a short SOA (35 ms) and a long one (150 ms). In the initial stages of lexical access, facilitation is elicited by both irregular and regular verb primes (compared to unrelated primes). However, not all forms seem to benefit from the augmented processing time permitted by the longer SOA: non-fully-nested allomorphic stems (e.g. [Du"lev-o] ~ ["Dulep-s-a], or ["pern-o] ~ ["pir-a]) elicit significantly less priming than fully-nested allomorphs do (e.g. [mi"l-o] ~ ["mili-s-a]), with intermediate cases of formal overlap (e.g. ["lin-o] > ["e-li-s-a]) getting intermediate facilitation effects. This graded behaviour is confirmed by Voga and Grainger (2004), who found greater priming effects for inflectionally-related word pairs with larger onset overlap. Along the same lines, Orfanidou and colleagues (2011) investigated the effects of formal transparency/opacity and semantic relatedness in both short and long SOA priming for derivationally-related Greek word pairs. They report evidence that semantically transparent primes produced more facilitation in delayed priming than semantically opaque primes, while orthographically transparent primes produced more facilitation in shortened priming than orthographically opaque primes.
To sum up, analysis of Greek data offers evidence of graded levels of morphological regularity, based on the interaction between formal transparency (degrees of stem similarity) and (un)predictability of stem allomorphs. The evidence questions a dichotomous view of storage vs. rule-based processing mechanisms. In fact, no sharp distinction between affix processing and allomorph retrieval can possibly account for the interaction of formal transparency and predictability in Greek word processing.
A growing number of approaches to inflection, from both linguistic and psycholinguistic camps, have developed the view that surface word relations represent a fundamental domain of morphological competence (Matthews 1991; Bybee 1995; Pirrelli 2000; Burzio 2004; Booij 2010; Baayen et al. 2011; Blevins 2016). Learning the morphology of a language amounts to acquiring relations between fully stored lexical forms, which are concurrently available in the speaker's mental lexicon and jointly facilitate the processing of morphologically related forms through patterns of emergent self-organisation. This view presupposes an integrative language architecture, where processing and storage, far from being conceived of as insulated and poorly interacting modules, are, respectively, the short-term and the long-term dynamics of the same underlying process of adaptive specialisation of synaptic connections. Such an integrative architecture, upheld by recent evidence on the neuro-anatomical bases of short-term and long-term memory processes (Wilson 2001; D'Esposito 2007), crucially hinges on Hebbian principles of synaptic plasticity, which are, in turn, in keeping with mathematical models of discriminative learning (Rescorla and Wagner 1972; Ramscar and Yarlett 2007; Ramscar and Dye 2011; Baayen et al. 2011).
The approach strikes us as particularly conducive to modelling the intricacies of Modern Greek inflection and provides an explanatory framework to account for human processing evidence. In what follows, we offer a connectionist implementation of this view.

1 Stimulus-Onset Asynchrony (SOA) is a measure of the amount of time between the start of the priming stimulus and the start of the target word. By varying SOA, one can elicit information about the time course of lexical access, namely the routes and procedures that are involved in earlier and later stages.


3. Computational Modelling

The advent of connectionism in the 1980s popularised the idea that the lexical processor consists of a network of parallel processing units selectively firing in response to sensory stimuli. Arguably, the most important contribution of connectionism to the theoretical debate on lexical modelling at the time is that it explicitly rejected the idea that word recognition and production require a dichotomous choice between storage and processing. However, in spite of the prima facie psycho-computational allure of this view of the lexicon, early connectionist models also embraced a number of unsatisfactory assumptions about word learning and processing: from wired-in conjunctive coding of input symbols in context, to output supervision required by the gradient descent algorithm, to a model of word production as a derivational function mapping one lexical base onto fully inflected forms.
Later connectionist architectures have tried to address all these open issues. In particular, recurrent neural networks have offered a principled solution to (i) the problem of representing time, and (ii) the problem of learning without supervision. In simple recurrent networks (Jordan 1986; Elman 1990), the input to the network at time t is represented by the current level of activation of nodes in the input layer (as in classical connectionist networks) augmented with the level of activation of nodes in the hidden layer at the previous time tick (t-1). In this way, the network keeps track of its activation states and develops a serial memory of previous inputs.
Along the same lines, Temporal Self-Organising Maps (TSOMs) have recently been proposed to model the dynamic topological organisation of memory nodes selectively firing when specific symbols are input to the map in specific temporal contexts (Ferro, Marzi, and Pirrelli 2011; Marzi, Ferro, and Pirrelli 2014; Pirrelli, Ferro, and Marzi 2015). A temporal context is loosely defined as a temporal slot (position) in a time series of input symbols, or a window of surrounding symbols. Context-sensitive node specialisation is not wired into the map's connections at the outset (as in traditional connectionist models), but is something that emerges as a function of input exposure in the course of training (Marzi et al. 2016). High-frequency input sequences develop deeply entrenched connections and highly specialised nodes, functionally corresponding to human expectations for possible continuations. Low-frequency input sequences tend to fire blended node chains, i.e. sequences of nodes that respond to a class of partially overlapping sequences. This is what distinguishes holistic, dedicated memorisation of full forms from chunk-based storage of low-frequency forms, which share memory chunks with other overlapping forms (Marzi and Pirrelli 2015).
TSOMs offer an ecological way to conceptualise human word learning. As suggested by the psycholinguistic literature overviewed in section 2, children store all words they are exposed to, irrespective of degrees of regularity or morphological complexity. In addition to that, the self-organisation of items in the mental lexicon tends to reflect morphologically natural classes, be they inflectional paradigms, inflectional classes, derivational families or compound families, and this has a direct influence on morphological processing. In what follows, we provide a more formal outline of the architecture of TSOMs, to then explore their potential for modelling evidence from Greek inflection.
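For readers unfamiliar with simple recurrent networks, the step described above can be sketched in a few lines of Python (a minimal illustration under assumed sizes and names, not the code of any of the cited models): the hidden state computed at time t-1 is fed back alongside the input at time t, giving the network a serial memory of previous symbols.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 26, 40                         # input symbols and hidden units (illustrative sizes)
W_in = rng.normal(0.0, 0.1, (H, D))   # input -> hidden
W_rec = rng.normal(0.0, 0.1, (H, H))  # hidden(t-1) -> hidden(t): the recurrent "context" copy
W_out = rng.normal(0.0, 0.1, (D, H))  # hidden -> prediction of the next symbol

def step(x_t, h_prev):
    """One time tick: the net sees the current symbol plus its own previous state."""
    h_t = np.tanh(W_in @ x_t + W_rec @ h_prev)
    y_t = W_out @ h_t                 # scores for the upcoming symbol
    return h_t, y_t

h = np.zeros(H)
for ch in "pop":                      # feed a word one symbol at a time
    x = np.zeros(D)
    x[ord(ch) - ord("a")] = 1.0       # one-hot coding of the current letter
    h, y = step(x, h)                 # h now carries a trace of all prior input
```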

3.1 TSOMs

The core of a TSOM consists of an array of nodes with two weighted layers of synaptic connectivity (Figure 1). Input connections link each node to the current input stimulus


[Figure 1: overview of the TSOM architecture, showing the input layer, input connections, map nodes, temporal (re-entrant) connections and IAP connections, from the symbol level to the word level.]
Figure 1
Overview of TSOM architecture.

(e.g. a letter or a sound), represented as a vector of values in the [0, 1] interval, shown to the map at discrete time ticks. Temporal connections link each map node to the pattern of node activation of the same map at the immediately preceding time tick. In Figure 1, these connections are depicted as re-entrant directed arcs, leaving from and to map nodes. Nodes are labelled with the input characters that fire them most strongly. '#' and '$' are special characters, marking the beginning and the end of an input word respectively.
Each time $t$ a stimulus (e.g. an individual character or a phonological segment in a word) is presented in the input layer, activation propagates to all map nodes through input and temporal connections (short-term processing), and the most highly activated node, or Best Matching Unit (BMU), is calculated. Following this short-term step, node connections are made increasingly more sensitive to the current input symbol, by getting their weights $w_{i,j}$ (from the $j$-th input value to the $i$-th node) closer to the current input values $x_j(t)$. The resulting long-term increment $\Delta w_{i,j}$ is an inverse function $G_I(\cdot)$ of the topological distance $d_i(t)$ between node $i$ and the current BMU($t$), and a direct function of the map's spatial learning rate $\gamma_I(E)$ at epoch $E$. $\gamma_I(E)$ is a dynamic parameter that decreases exponentially with learning epochs, defining how readily the map can adjust itself to the input:

$$\Delta w_{i,j}(t) = \gamma_I(E) \cdot G_I(d_i(t)) \cdot \left[ x_j(t) - w_{i,j}(t) \right] \qquad (1)$$

Likewise, temporal connections are synchronised to the activation state of the map at time $t-1$, by increasing the weights $m_{i,h}$ (from the $h$-th node to the $i$-th node) on the connections between BMU($t-1$) and all other nodes of the map. The resulting long-term increment $\Delta m_{i,h}(t)$ is, again, an inverse function $G_T(\cdot)$ of their topological distance $d_i(t)$ from BMU($t$), and a direct function of the learning rate $\gamma_T(E)$ at epoch $E$:


$$\Delta m_{i,h}(t) = \gamma_T(E) \cdot G_T(d_i(t)) \cdot \left[ 1 - m_{i,h}(t) \right]; \qquad h = \mathrm{BMU}(t-1) \qquad (2)$$

Given the BMU at time $t$, the temporal layer encodes the expectation of the current BMU for the node to be activated at time $t+1$. The strength of the connection between consecutively activated BMUs is trained through the following principles of correlative learning, compatible with the Rescorla-Wagner equations (Rescorla and Wagner 1972): given the input bigram $ab$, the connection strength between the BMU of $a$ at time $t$ and the BMU of $b$ at time $t+1$ will:

1. increase if $a$ often precedes $b$ in training (entrenchment);
2. decrease if $b$ is often preceded by a symbol other than $a$ (competition).
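A minimal numpy sketch may make equations (1)-(2) and the two principles above concrete. The Gaussian form of the neighbourhood function and the decrement used to model competition are illustrative assumptions introduced here, not the authors' exact formulation.

```python
import numpy as np

def G(d, sigma=2.0):
    """Neighbourhood: an inverse (here Gaussian) function of topological distance."""
    return np.exp(-(d ** 2) / (2.0 * sigma ** 2))

def long_term_update(W, M, x, d, bmu_prev, gamma_I=0.1, gamma_T=0.1):
    """One long-term update step.
    W: (N, D) input weights w_ij; M: (N, N) temporal weights m_ih;
    x: (D,) current stimulus; d: (N,) topological distance of each node to BMU(t);
    bmu_prev: index of BMU(t-1). Both arrays are modified in place."""
    g = G(d)
    # eq. (1): move input weights of nodes near BMU(t) towards the stimulus
    W += gamma_I * g[:, None] * (x[None, :] - W)
    # eq. (2): strengthen connections from BMU(t-1) into nodes near BMU(t) (entrenchment)
    M[:, bmu_prev] += gamma_T * g * (1.0 - M[:, bmu_prev])
    # competition (an assumed formulation): weaken connections into the same
    # region of the map from all other potential predecessor nodes
    mask = np.ones(M.shape[1], dtype=bool)
    mask[bmu_prev] = False
    M[:, mask] -= gamma_T * g[:, None] * M[:, mask]
```

On a 42x42 map, as used in the experiment below, N = 1764 nodes and D is the size of the symbol inventory.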

The interaction between entrenchment and competition in a TSOM accounts for important dynamic effects of the self-organisation of stored words (Marzi, Ferro, and Pirrelli 2014; Marzi et al. 2016). In particular, high-frequency words tend to recruit specialised (and stronger) chains of BMUs, while low-frequency words are responded to by more "blended" (and weaker) BMU chains.
In what follows, we report how well a TSOM can learn the complexity of the Greek verb system, by controlling factors such as word frequency distribution, degrees of inflectional regularity and word length. Since our main focus here is on the dynamics of word processing, and on how these dynamics change as the TSOM is exposed to more and more input words through learning, we will monitor both the developmental pace of acquisition (e.g. whether the map learns regulars more easily and quickly than irregulars, or vice versa) and the way the TSOM processes inflected forms at the final learning epoch. Other important issues, such as the ability to generalise to unknown forms (Marzi, Ferro, and Pirrelli 2014), or the ability to produce an inflected form on the basis of either a single uninflected base form (Ahlberg, Forsberg, and Hulden 2014; Nicolai, Cherry, and Kondrak 2015) or a pool of abstract morpho-lexical features (Malouf 2016), will not be addressed here.

4. The experiment

From the FREQcount section of the Greek SUBTLEX-GR corpus (Dimitropoulou et al. 2010), we selected the 50 top-ranked Greek paradigms by cumulative token frequency, and sampled 15 forms for each paradigm, for a total of 750 training items. For all 50 paradigms, forms were sampled from a fixed pool of paradigm cells: the full set of present indicative (6) and simple past tense (6) forms, and the singular forms of simple future (3). As we wanted to focus on effects of global paradigm-based organisation of active voice indicative conjugation, we excluded paradigms with systematic gaps, impersonal verbs, and deponent verbs. High-frequency paradigms with suppletive forms and/or non-systematic allomorphy (Ralli 2006, 2014) were included if they met our sampling criteria. Concerning systematic free variants (e.g. [aGa"po] ~ [aGa"pao]), the most frequent form for each cell was selected, to avoid making a distinction between basic and alternative verb forms (Voga, Giraudo, and Anastassiadis-Symeonidis 2012). The training dataset was administered to a 42x42 node map for 100 learning epochs. Upon each learning epoch, all 750 forms were randomly shown to the map as a function of their real word frequency distribution in the reference corpus, fitted in the 1-1001

range. To control for experimental variability, we repeated the experiment 5 times. After training, we assessed how well each map acquired the 750 input forms, using the task of word recall as a probe.
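A sketch of this training protocol is given below. The min-max fit of raw counts into the 1-1001 range and the rounding of fitted frequencies into per-epoch repetition counts are assumptions introduced for illustration, since the paper does not spell out the fitting function.

```python
import numpy as np

def fit_frequencies(raw_counts, lo=1.0, hi=1001.0):
    """Min-max fit of raw corpus counts into the 1-1001 range (assumed form)."""
    raw = np.asarray(raw_counts, dtype=float)
    return lo + (hi - lo) * (raw - raw.min()) / (raw.max() - raw.min())

def one_epoch(words, fitted, rng):
    """One learning epoch: show every form a number of times proportional to its
    fitted frequency, in random order."""
    schedule = np.repeat(np.arange(len(words)), np.round(fitted).astype(int))
    rng.shuffle(schedule)
    return [words[i] for i in schedule]

rng = np.random.default_rng(42)
freqs = fit_frequencies([12, 340, 5, 78])          # toy counts
print(one_epoch(["a", "b", "c", "d"], freqs, rng)[:10])
```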

4.1 Word recall

When the string #pop$ is shown one symbol at a time on the input layer (Figure 1), the activation pattern triggered by each symbol is incrementally overlaid with the patterns generated by all the other symbols in the string. The resulting integrated activation pattern (IAP) is shown in Figure 1 by levels of node activation, represented as shaded nodes. Integrated activation is calculated with the following equation:

$$\hat{y}_i = \max_{t=1,\ldots,k} \{\, y_i(t) \,\}; \qquad i = 1, \ldots, N \qquad (3)$$

where i ranges over the number N of nodes in the map, and t ranges over the symbol positions in an input string k characters long. Intuitively, each node in the IAP is associated with the maximum activation level reached by that node in processing the entire input word. Note that, in Figure 1, the same symbol p, occurring twice in #pop$, activates two different BMUs depending on its position in the string.
After presentation of #pop$, integrated levels of node activation are stored in the weights of a third level of IAP connectivity, linking the map nodes to the lexical map proper (rightmost vector structure in Figure 1). The resulting IAP is not only the short-term processing response of the map to #pop$: the long-term knowledge sitting in its lexical connections makes the current IAP a routinized memory trace of the map's processing response. In fact, a TSOM can reinstate the string #pop$ from its IAP. We call this reverse process of outputting a string from its IAP word recall. The process consists of the following steps:

1. initialise:
   (a) activate the word IAP on the map;
   (b) prompt the map with the start-of-word symbol #;
   (c) integrate the IAP with the temporal expectations of #;
2. calculate the next BMU and output its associated label;
3. if the end-of-word symbol $ was not output:
   (a) integrate the IAP with the temporal expectations of the BMU;
   (b) go back to step 2;
4. stop.
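Both equation (3) and the recall procedure above can be made concrete with a short Python sketch (not the authors' implementation). Here `activate`, `expectations`, `label` and `start_node`, as well as the choice of an element-wise product for "integrating" the IAP with temporal expectations, are assumptions introduced for illustration.

```python
import numpy as np

def integrated_activation_pattern(word, activate):
    """Equation (3): for every node, keep the maximum activation it reached while
    the delimited word was shown one symbol at a time. `activate(symbol)` is an
    assumed routine returning the map's activation vector for one symbol."""
    iap = None
    for symbol in "#" + word + "$":
        y_t = activate(symbol)
        iap = y_t if iap is None else np.maximum(iap, y_t)
    return iap

def recall(iap, expectations, label, start_node, max_len=50):
    """Word recall from an IAP. `expectations(node)` returns the temporal
    expectations encoded on the map's temporal connections; `label(node)` returns
    the character a node responds to; `start_node` is the BMU of '#'."""
    out = []
    activation = iap * expectations(start_node)   # steps 1(a)-(c)
    for _ in range(max_len):
        bmu = int(np.argmax(activation))          # step 2: next BMU
        symbol = label(bmu)
        if symbol == "$":                         # steps 3/4: stop at end-of-word
            break
        out.append(symbol)
        activation = iap * expectations(bmu)      # steps 3(a)-(b): re-bias and loop
    return "".join(out)
```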

A word is recalled correctly from its IAP if all its symbols are output correctly in the appropriate left-to-right order. It should be appreciated that, even when it is applied to the same word items used during training, word recall from the IAP is not a trivial process. Whereas in training each input stimulus is presented with explicit timing information (symbols are administered to the map one at a time), a word IAP is a synchronous activation pattern, where timing information is encoded only implicitly. In fact, accurate recall requires that the TSOM has developed a fine-grained association of map nodes with time events in

training, apportioning specialised time-bound nodes to systematically occurring input sequences. We can thus make the further reasonable assumption that a word has been acquired by a TSOM when the map is in a position to recall the word accurately and consistently from its IAP.

5. Data analysis

Average recall accuracy at epoch 100 turns out to be remarkably high: 99.6% (std = 0.1%). Results are analysed using Linear Mixed Effects (LME) models, with experiment repetitions and training items used as random variables.
First, we compared the pace of acquisition of Greek verb forms with the pace of acquisition of two other conjugation systems of comparable complexity but from different language families: Italian and German. Results for Italian and German were obtained with the same training protocol used for Greek verbs (Marzi et al. 2016): the 50 top-frequency verb paradigms were selected for each of the two languages, and the same pool of 15 forms was sampled from each selected paradigm. Input forms were administered for 100 epochs according to a function of their frequency distribution fitted in the 1-1001 range. Each training experiment was repeated 5 times, and results are averaged over all repetitions.
Figure 2 shows the marginal plot of the interaction between word length and regular vs. irregular verb classes for German, Italian and Greek, using an LME model fitting word learning epochs, with (log) word frequency, inflectional class and word length as fixed effects. In German and Italian, the distinction between regular and irregular paradigms is based on the criterion of absence vs. presence of stem allomorphy across all forms of a paradigm (Marzi et al. 2016). In Greek, we consider regular all paradigms showing a sigmatic perfective stem, and irregular those with an asigmatic perfective stem.
Unlike German and Italian (Figure 2, top and middle panels), where irregulars tend to be acquired systematically later than length-matched regulars and no significant interaction is found, Greek data (Figure 2, bottom panel) show an interesting crossing pattern: shorter irregulars are acquired earlier than length-matched regulars of comparable frequency, but long irregulars are acquired later than long regulars.
Marzi and colleagues (2016) account for the earlier learning epochs of both German and Italian regulars as an effect of stem transparency on cumulative input frequencies. With German and Italian regular verbs, stems are shown to the map consistently more often, since they are transparently nested in all forms of their own paradigm. This makes their acquisition quicker, due to specialised chains of stem-sensitive BMUs getting more quickly entrenched. Once a stem is acquired, it can easily be combined with a common pool of inflectional endings for tense and agreement, simulating an effect of (nearly) instantaneous (or paradigm-based, as opposed to item-based) acquisition.
In contrast, Greek verb classes always present stem allomorphy throughout their paradigms, no matter whether the allomorphy is systematic, phonologically motivated or unsystematic. In regular verbs, where perfective stem formation requires -s- affixation (verb classes i and ii above), perfective stems are systematically longer than their imperfective counterparts, and are acquired after them. Nonetheless, since imperfective stems are fully or partially nested in perfective stems, learning a long regular perfective form is easier (i.e. it takes a comparatively shorter time) than learning an irregular perfective form of comparable length (verb class iii above). This is, again, a regularity-by-transparency effect, and explains why long regular forms tend to be acquired (on average) more easily than long irregular forms.
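As a companion to the LME analyses reported in this section, the following sketch shows how such a model could be fitted in Python with statsmodels. The file name and column names are hypothetical; and since `mixedlm` accepts a single grouping factor, only training items are modelled as random intercepts here (a crossed random effect for experiment repetitions, as fitted for instance with lme4's lmer, would be closer to the analysis described above).

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical long-format table: one row per form per repetition, with columns
# learning_epoch, log_freq, length, regular (True/False), word, repetition.
df = pd.read_csv("greek_learning_epochs.csv")

model = smf.mixedlm(
    "learning_epoch ~ log_freq + length * regular",  # fixed effects and interaction
    data=df,
    groups=df["word"],                               # random intercepts over items
)
print(model.fit().summary())
```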

[Figure 2: three panels (German, top; Italian, middle; Greek, bottom) plotting learning epoch (y axis) against word length (x axis) for regular and irregular verbs.]

Figure 2 Marginal plots of interaction effects between word length, log frequency and inflectional regularity in an LME model fitting word learning epochs in German (top), Italian (middle) and Greek (bottom). Solid lines = regulars, dotted lines = irregulars. See main text for details on criteria for inflectional regularity in the three languages.


[Figure 3: plot of difficulty of recall (y axis) against word length (x axis) for asigmatic irregular, sigmatic morphological and sigmatic phonological verbs.]
Figure 3
Marginal plot of interaction effects between word length (x axis), log frequency and degrees of stem regularity in an LME model fitting "difficulty of recall" (y axis) by TSOMs trained on Greek verb forms. Difficulty of recall is a direct function of the amount of noise filtering required for an input string to be correctly recalled from its IAP. Intuitively, the higher the level of activation of non-target nodes in the IAP, the more difficult the recall of target nodes.

To further investigate the impact of degrees of formal transparency on the processing of Greek verb forms, we conducted an LME analysis of the interaction between word length and classes of (ir)regularity in word recall (Figure 3). When we control for word length, regular verbs with sigmatic morphological allomorphs (e.g. [aGa"p(a)-o] ~ [a"Gapi-s-a], solid line in the plot) are recalled more easily than regular verbs with sigmatic phonological allomorphs (e.g. [Du"lev-o] ~ ["Dulep-s-a], dashed line in the plot). Difficulty of recall is estimated here as a direct function of the amount of filtering on node activation levels required for a word form to be recalled accurately from its IAP (Marzi et al. 2016). Intuitively, low levels of filtering mean that all relevant BMUs in the IAP are activated considerably more strongly than other irrelevant co-activated nodes, and are thus easier to recall. High levels of filtering suggest that the overall memory trace is difficult to recall due to the high activation of many spurious competitors. Note further that both regular classes are easier to recall than asigmatic irregular verbs (dotted line in the plot), which show, in most cases, formally more opaque allomorphs (e.g. ["pern-o] ~ ["pir-a]). As shown by the difference in slope between the solid line and the other two lines of Figure 3, recall difficulty increases with word length, supporting our interpretation of the crossing pattern in the bottom panel of Figure 2.
Finally, we assessed the role of the predictability of both stems and affixes in Greek conjugation, by classes of inflectional regularity. Figure 4 plots how easy it is for a map to predict an upcoming symbol at any position in the input string, given the preceding context. Our dependent variable, "ease of prediction" in the plot of Figure 4, is a function of how often a letter l, at distance k from the stem-ending boundary (to the left of the boundary for negative values, and to its right for positive values), is correctly predicted by the map. Note that ease of prediction is 1 if all letters at position k are always correctly predicted, and equals 0 if no letter in that position is predicted.


[Figure 4: plot of ease of prediction (y axis) against distance to the morpheme boundary (x axis) for asigmatic irregular, sigmatic morphological and sigmatic phonological verbs.]
Figure 4
Marginal plot of interaction effects between the distance to the morpheme boundary (x axis, where x = 0 indicates the first symbol of the affix) and degrees of stem regularity in an LME model fitting "ease of prediction" (y axis) by TSOMs trained on Greek verb forms. Ease of prediction is 1 if all letters at position k from the morpheme boundary are correctly predicted by the TSOM, and it is 0 if no letters are predicted. The distance k has negative values if the letter precedes the boundary, and 0 or positive values if the letter follows the boundary.

Intuitively, the slope of the marginal plot approximates the average number of prediction hits at any position in the input string. In Figure 4, stem predictability and affix predictability are plotted as separate segments for the three classes of inflectional regularity, with x = 0 marking the first symbol after the base stem (e.g. the vowel i in [a"Gapisa]). The general trend shows quicker rates in processing fully transparent stems compared with formally less transparent stems. Conversely, the slope of prediction scores for inflectional markers following transparent stems is less steep than the slope for inflectional markers following less transparent stems.
It should be noted that the drop in prediction rate between the end of the stem and the start of the suffix in regular verbs gets smaller as we move to less transparent stem allomorphs. We take this level of uncertainty to be diagnostic of structural complexity: a drop in the level of expectation for an upcoming node at the morpheme boundary is the effect of a perceived morphological discontinuity on the temporal connectivity of the map. In keeping with distributional accounts of morphological constituency (Hay 2001; Bybee 2002; Hay and Baayen 2005, among others), this effect is also modulated by word frequency.
As expected, uncertainty is at its peak with fully transparent forms, due to the compounded effect of three factors: (i) nested base stems are shared by all inflected forms of a paradigm and are more easily predictable (due to entrenchment); (ii) for the same reason, they are trailed by many different affixes; (iii) they undergo unpredictable perfective stem formation, which cannot easily be generalised across paradigms. Conversely, non-transparent stems undergo several processes of stem alternation (either phonologically or morphologically motivated), and this makes it more difficult to predict their form during processing. However, uncertainty at the stem level

makes suffix selection easier across the morpheme boundary, once the map knows which stem allomorph occurs in the input. In the end, stem allomorphy constrains the number of possible continuations across the morpheme boundary, biasing the map's expectations. This processing dynamic accounts for an advantage in recalling verb forms with less transparent stems when these forms are comparatively shorter than those with more transparent stems (see Figure 2, bottom panel). The advantage progressively shrinks with word length, to become a disadvantage on longer forms.
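For concreteness, the "ease of prediction" measure plotted in Figure 4 can be approximated with the sketch below. Here `predict_next(prefix)` is an assumed helper returning the symbol the trained map most strongly expects after a given prefix, and each form is paired with the index of its stem-ending boundary.

```python
from collections import defaultdict

def ease_of_prediction(forms, boundaries, predict_next):
    """Proportion of correct symbol predictions at each distance k from the
    stem-ending boundary (k = 0 is the first affix symbol, negative k precedes it)."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for form, b in zip(forms, boundaries):
        for t, letter in enumerate(form):
            k = t - b
            hits[k] += int(predict_next(form[:t]) == letter)
            totals[k] += 1
    return {k: hits[k] / totals[k] for k in sorted(totals)}
```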

6. General Discussion

Quantitative analysis of our experimental results highlights a hierarchy of regularity-by-transparency effects on morphological processing. In particular, the evidence offered here emphasises the role of the formal preservation of the stem (or stem transparency) in the paradigm as a key facilitation factor for morphological processing.
Our case study focused on a distinguishing characteristic of Greek conjugation: all verb paradigms, both regular and irregular ones, involve stem allomorphy in past-tense formation. Hence, the difference between regular and irregular verbs cannot be attributed to the categorical presence or absence of stem allomorphy, as is the case with other languages, such as English and Italian (and, to a lesser extent, German). Rather, it should be attributed to the type of stem allomorphy itself. Our finding that fully transparent stems facilitate initial processing of the word and increase the perception of its morphological structure more than opaque stems do is in keeping with a surface-oriented notion of morphological regularity, based on patterns of intra-paradigmatic formal redundancy. In addition, it is consistent with psycholinguistic evidence of human processing, and appears to be in good accord with research in Natural Morphology laying emphasis on effects of iconic preservation of stem forms in regular inflection (Dressler 1996). This lends support to the conclusion that the type of stem allomorphy is what determines the different levels of perceived morphological structure in Modern Greek, crucially involving a regularity-by-transparency interaction, with predictability playing second fiddle.
The present analysis paves the way to a performance-oriented notion of inflectional regularity that may ultimately cut across traditional dichotomous classifications. It is noteworthy that dual-route models of lexical processing, which presuppose a sharp subdivision of work between storage and processing, crucially rely on a categorical, competence-based notion of morphological regularity fitting the inflectional systems of some languages only. Highly-inflecting languages such as Modern Greek appear to exhibit a range of processes of stem formation that are considerably more complex and graded than traditional classifications are ready to admit. In turn, this level of complexity calls for integrative, non-modular architectures of the human lexical processor. We believe that TSOMs provide a promising implementation of such integrative architectures.

References
Ahlberg, Malin, Markus Forsberg, and Mans Hulden. 2014. Semi-supervised learning of morphological paradigms and lexicons. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 569-578, Gothenburg, Sweden, 26-30 April.
Alegre, Maria and Peter Gordon. 1999. Frequency effects and the representational status of regular inflections. Journal of Memory and Language, 40(1):41-61.
Baayen, R. Harald. 2007. Storage and computation in the mental lexicon. In Gonia Jarema and Gary Libben, editors, The Mental Lexicon: Core Perspectives. Elsevier, Amsterdam, pages 81-104.


Baayen, R. Harald, Petar Milin, Dusica Filipović Đurđević, Peter Hendrix, and Marco Marelli. 2011. An amorphous model for morphological processing in visual comprehension based on naive discriminative learning. Psychological Review, 118(3):438-481.
Blevins, James P. 2016. Word and Paradigm Morphology. Oxford University Press, Oxford.
Bloch, Bernard. 1947. English verb inflection. Language, 23(4):399-418.
Bloomfield, Leonard. 1933. Language. Henry Holt and Co., London.
Booij, Geert. 2010. Construction morphology. Language and Linguistics Compass, 4(7):543-555.
Burzio, Luigi. 2004. Paradigmatic and syntagmatic relations in Italian verbal inflection. Volume 258, pages 17-44. John Benjamins, Amsterdam-Philadelphia.
Bybee, Joan. 1995. Regular morphology and the lexicon. Language and Cognitive Processes, 10(5):425-455.
Bybee, Joan. 2002. Word frequency and context of use in the lexical diffusion of phonetically conditioned sound change. Language Variation and Change, 14(3):261-290.
Chomsky, Noam and Morris Halle. 1967. The Sound Pattern of English. Harper and Row, New York.
D'Esposito, Mark. 2007. From cognitive to neural models of working memory. Philosophical Transactions of the Royal Society B: Biological Sciences, 362(1481):761-772.
Dimitropoulou, Maria, Jon Andoni Duñabeitia, Alberto Avilés, Jose Corral, and Manuel Carreiras. 2010. Subtitle-based word frequencies as the best estimate of reading behavior: the case of Greek. Frontiers in Psychology, 1(218):1-12.
Dressler, Wolfgang U. 1996. A functionalist semiotic model of morphonology. In Rajendra Singh, editor, Trubetzkoy's Orphan: Proceedings of the Montréal Roundtable on "Morphonology: contemporary responses" (Montréal, October 1994), volume 144 of Current Issues in Linguistic Theory. John Benjamins Publishing Company, Amsterdam-Philadelphia, pages 67-83.
Elman, Jeffrey L. 1990. Finding structure in time. Cognitive Science, 14(2):179-211.
Ferro, Marcello, Claudia Marzi, and Vito Pirrelli. 2011. A self-organizing model of word storage and processing: implications for morphology learning. Lingue e Linguaggio, 10(2):209-226.
Hay, Jennifer. 2001. Lexical frequency in morphology: is everything relative? Linguistics, 39(6):1041-1070.
Hay, Jennifer B. and R. Harald Baayen. 2005. Shifting paradigms: gradient structure in morphology. Trends in Cognitive Sciences, 9(7):342-348.
Jordan, Michael. 1986. Serial order: A parallel distributed processing approach. Technical Report 8604, University of California.
Konstantinopoulou, Polyxeni, Stavroula Stavrakaki, Christina Manouilidou, and Demetrios Zafeiriou. 2013. Past tense in children with focal brain lesions. Procedia - Social and Behavioral Sciences, 94:196-197.
Lieber, Rochelle. 1980. On the Organization of the Lexicon. Ph.D. thesis, MIT, Cambridge.
Malouf, Robert. 2016. Generating morphological paradigms with a recurrent neural network. San Diego Linguistics Papers, (6):122-129.
Marzi, Claudia, Marcello Ferro, Franco Alberto Cardillo, and Vito Pirrelli. 2016. Effects of frequency and regularity in an integrative model of word storage and processing. Italian Journal of Linguistics, 28(1):79-114.
Marzi, Claudia, Marcello Ferro, and Vito Pirrelli. 2014. Morphological structure through lexical parsability. Lingue e Linguaggio, 13(2):263-290.
Marzi, Claudia and Vito Pirrelli. 2015. A neuro-computational approach to understanding the mental lexicon. Journal of Cognitive Science, 16(4):493-535.
Mastropavlou, Maria. 2006. The effect of phonological saliency and LF-interpretability in the grammar of Greek normally developing and language impaired children. Ph.D. thesis, Aristotle University, Thessaloniki.
Matthews, Peter H. 1991. Morphology. Cambridge University Press, Cambridge.
Nicolai, Garrett, Colin Cherry, and Grzegorz Kondrak. 2015. Inflection generation as discriminative string transduction. In Proceedings of the Annual Conference of the North American Chapter of the ACL, Denver, Colorado, USA, May 31 - June 5.
Orfanidou, Eleni, Matthew H. Davis, and William D. Marslen-Wilson. 2011. Orthographic and semantic opacity in masked and delayed priming: Evidence from Greek. Language and Cognitive Processes, 26(4-6):530-557.
Pinker, Steven and Alan Prince. 1994. Regular and irregular morphology and the psychological status of rules of grammar. In Susan D. Lima, Roberta Corrigan, and Gregory K. Iverson, editors, The Reality of Linguistic Rules. John Benjamins, Amsterdam, pages 321-351.


Pirrelli, Vito. 2000. Paradigmi in morfologia. Un approccio interdisciplinare alla flessione verbale dell'italiano. Istituti Editoriali e Poligrafici Internazionali, Pisa.
Pirrelli, Vito, Marcello Ferro, and Claudia Marzi. 2015. Computational complexity of abstractive morphology. In Matthew Baerman, Dunstan Brown, and Greville G. Corbett, editors, Understanding and Measuring Morphological Complexity. Oxford University Press, Oxford, pages 141-166.
Ralli, Angela. 1988. Eléments de la Morphologie du Grec Moderne: La Structure du Verbe. Ph.D. thesis, University of Montreal.
Ralli, Angela. 2005. Morfologia [Morphology]. Patakis, Athens.
Ralli, Angela. 2006. On the role of allomorphy in inflectional morphology: evidence from dialectal variation. In Giandomenico Sica, editor, Open Problems in Linguistics and Lexicography. Polimetrica, Monza, pages 123-152.
Ralli, Angela. 2014. Suppletion. In Georgios K. Giannakis, editor, Encyclopedia of Ancient Greek Language and Linguistics. Brill, Leiden, pages 341-344.
Ramscar, Michael and Melody Dye. 2011. Learning language from the input: Why innate constraints can't explain noun compounding. Cognitive Psychology, 62(1):1-40.
Ramscar, Michael and Daniel Yarlett. 2007. Linguistic self-correction in the absence of feedback: A new approach to the logical problem of language acquisition. Cognitive Science, 31(6):927-960.
Rescorla, Robert A. and Allan R. Wagner. 1972. A theory of Pavlovian conditioning: variations in the effectiveness of reinforcement and non-reinforcement. In Abraham H. Black and William F. Prokasy, editors, Classical Conditioning II: Current Research and Theory. Appleton-Century-Crofts, New York, pages 64-99.
Selkirk, Elisabeth O. 1984. Phonology and Syntax. The MIT Press, Cambridge.
Stamouli, Spyridoula. 2000. Simfonia, xronos ke opsi stin eliniki idiki glosiki diataraxi [Agreement, tense, and aspect in specific language impairment in Greek]. In Proceedings of the 8th Symposium of the Panhellenic Association of Logopedists, Athens, Greece. Ellinika Grammata.
Stathopoulou, Nikolitsa and Harald Clahsen. 2010. The perfective past tense in Greek adolescents with Down syndrome. Clinical Linguistics & Phonetics, 24(11):870-882.
Stavrakaki, Stavroula and Harald Clahsen. 2009a. Inflection in Williams syndrome: The perfective past tense in Greek. The Mental Lexicon, 4(2):215-238.
Stavrakaki, Stavroula and Harald Clahsen. 2009b. The perfective past tense in Greek child language. Journal of Child Language, 36(1):113-142.
Stavrakaki, Stavroula, Konstantinos Koutsandreas, and Harald Clahsen. 2012. The perfective past tense in Greek children with specific language impairment. Morphology, 22(1):143-171.
Terzi, Arhonto, Spyridon Papapetropoulos, and Elias D. Kouvelas. 2005. Past tense formation and comprehension of passive sentences in Parkinson's disease: Evidence from Greek. Brain and Language, 94(3):297-303.
Tsapkini, Kyrana, Gonia Jarema, and Eva Kehayia. 2001. Manifestations of morphological impairments in Greek aphasia: A case study. Journal of Neurolinguistics, 14(2):281-296.
Tsapkini, Kyrana, Gonia Jarema, and Eva Kehayia. 2002a. A morphological processing deficit in verbs but not in nouns: A case study in a highly inflected language. Journal of Neurolinguistics, 15(3):265-288.
Tsapkini, Kyrana, Gonia Jarema, and Eva Kehayia. 2002b. Regularity revisited: Evidence from lexical access of verbs and nouns in Greek. Brain and Language, 81(1):103-119.
Tsapkini, Kyrana, Gonia Jarema, and Eva Kehayia. 2002c. The role of verbal morphology in aphasia during lexical access. In Elisabetta Fava, editor, Clinical Linguistics: Theory and Applications in Speech Pathology and Therapy, volume 227. Benjamins, Amsterdam-Philadelphia, pages 315-335.
Tsapkini, Kyrana, Gonia Jarema, and Eva Kehayia. 2004. Regularity re-revisited: Modality matters. Brain and Language, 89(3):611-616.
Ullman, Michael T., Suzanne Corkin, Marie Coppola, Gregory Hickok, John H. Growdon, Walter J. Koroshetz, and Steven Pinker. 1997. A neural dissociation within language: Evidence that the mental dictionary is part of declarative memory, and that grammatical rules are processed by the procedural system. Journal of Cognitive Neuroscience, 9(2):266-276.
Varlokosta, Spyridoula, Anastasia Arhonti, Loretta Thomaidis, and Victoria Joffe. 2008. Past tense formation in Williams syndrome: evidence from Greek. In Anna Gavarró and Maria João Freitas, editors, Proceedings of GALA 2007, pages 483-491, Cambridge.



EVALITA Goes Social: Tasks, Data, and Community at the 2016 Edition

Pierpaolo Basile∗, Università degli Studi di Bari Aldo Moro
Malvina Nissim∗, Rijksuniversiteit Groningen

Viviana Patti∗, Università degli Studi di Torino
Rachele Sprugnoli∗, Fondazione Bruno Kessler and Università degli Studi di Trento

Francesco Cutugno∗, Università degli Studi di Napoli “Federico II”

EVALITA, the evaluation campaign of Natural Language Processing and Speech Tools for the Italian language, was organised for the fifth time in 2016. Six tasks, covering both re-runs and completely new tasks, and an IBM-sponsored challenge attracted a total of 34 submissions. An innovative aspect of this edition was the focus on social media data, especially Twitter, and the use of shared data across tasks, yielding a test set with layers of annotation concerning PoS tags, sentiment information, named entities and linking, and factuality information. Differently from the previous edition(s), many systems relied on neural architectures, which generally achieved the best results. From the experience and success of this edition, also in terms of dissemination of information and data, and in terms of collaboration between organisers of different tasks, we collected some reflections and suggestions that prospective EVALITA chairs might be willing to take into account for future editions.

1. Introduction

Shared tasks are a common tool in the Natural Language Processing community to set benchmarks for specific tasks and to facilitate and promote the development of comparable systems. In practice, a group of researchers can set up a specific task, provide development and test data for it, and solicit the participation of research groups in the community, who will develop systems to address the task at hand. Such competitions often take place within larger frameworks, where multiple tasks are organised and coordinated at the same time. A prime example of such frameworks is SemEval1, a well-known series of evaluation campaigns with a specific focus on semantic phenomena. In this contribution, we describe a framework for coordinated evaluation campaigns which, rather than being focused on specific language processing phenomena, is centred on a variety of phenomena for a single language, namely Italian.

∗ Group - Address. E-mail: [email protected]
1 https://en.wikipedia.org/wiki/SemEval

© 2017 Associazione Italiana di Linguistica Computazionale

EVALITA2 is the evaluation campaign of Natural Language Processing and Speech Tools for Italian. Since its first edition in 2007, the aim of the campaign has been to support the development and dissemination of NLP resources and technologies for Italian. To this end, many shared tasks, covering the analysis of both written and spoken language at various levels of processing, have been proposed within EVALITA. EVALITA is an initiative of the Italian Association for Computational Linguistics3 (AILC) and it is endorsed by the Italian Association of Speech Science4 (AISV) and by the NLP Special Interest Group of the Italian Association for Artificial Intelligence5 (AI*IA). Since 2014, EVALITA has been organised in connection with the yearly Italian Conference on Computational Linguistics (CLiC-it), and co-located with it.

In 2016, EVALITA was organised around a set of six shared tasks and an application challenge, and included several novelties compared to previous years. Most of these novelties were introduced on the basis of the outcome of two questionnaires and of the fruitful discussion that took place during the panel “Raising Interest and Collecting Suggestions on the EVALITA Evaluation Campaign” held in the context of the Second Italian Computational Linguistics Conference (CLiC-it 2015)6. For example, the 2016 edition saw a greater involvement of industrial companies in the organisation of tasks, the introduction of a task and a challenge that are strongly application-oriented, and the creation of cross-task shared data. Also, a strong focus was placed on using social media data, so as to promote the investigation of the portability and adaptation of existing tools, up to now mostly developed for the news domain.

In a few words, what characterised the 2016 edition is that EVALITA went social, from a range of perspectives: most tasks used social media data, and the newly introduced IBM challenge dealt with web-based applications. Also the ‘social’ aspects within the community were enhanced: task organisers were encouraged to collaborate on the creation of a shared test set across tasks, and to eventually share all resources with everyone — this has resulted in the creation of a repository that is already accessible7. In addition to the standard webpage, EVALITA also appeared on social channels for the first time, by means of the regular use of a Facebook page8 and a Twitter account9 for updates and dissemination. We believe this has contributed to boosting the number of interested teams and actual participants.

Contributions. This paper offers an overview of the tasks at EVALITA 2016 including, for each, a brief description, a summary of the participating systems, and results, so as to provide a reliable overview of the state of the art for Italian NLP in the targeted areas. For task re-runs we also compare systems and results to those of previous years and, whenever possible, draw comparisons to similar tasks for other languages, especially within the SemEval campaigns. Additionally, we provide some general observations on two of the major innovations in 2016, namely the use of shared data across tasks and the use of data from social media. Focusing on the ‘social’ flavour of EVALITA 2016, we also devote some space to discussing the development of the EVALITA

2 http://www.evalita.it
3 http://www.ai-lc.it/
4 http://www.aisv.it/
5 http://www.aixia.it/
6 http://www.evalita.it/towards2016
7 https://github.com/evalita2016/data
8 https://www.facebook.com/evalita2016/
9 https://twitter.com/EVALITAcampaign

community, in terms of chairs, organisers, and participating teams. On the basis of this experience, as well as of the observations gathered from the questionnaire results (Sprugnoli, Patti, and Cutugno 2016), we finally offer some ideas and recommendations that the organisers of future EVALITA editions might want to take into account.

2. Tasks and Challenge

As in previous editions, both the tasks and the final workshop were collectively organised by several researchers from the community working on Italian language resources and technologies (see Section 4 for more details). As visible in Figure 1, the 2016 edition featured two re-runs of EVALITA 2014 tasks, namely sentiment analysis (SENTIPOLC) and PoS tagging (PoSTWITA). However, while the former was an almost exact replica of the previous edition (see Section 2.1 for specific differences), the latter shifted its focus from the previously used newswire texts to social media data, thereby making it a substantially innovative task in the EVALITA panorama (also because it adopts a universal tagset). The other four tasks, and the challenge, were all newly developed in the context of the 2016 edition, though for some there are connections to previous tasks.

Figure 1 Overview of the tasks organised at EVALITA campaigns 2007–2016.

In the remainder of this section, we provide detailed information about the EVALITA 2016 tasks. First, we describe the four evaluation exercises that shared the

test set (i.e., SENTIPOLC, PoSTWITA, NEEL-it, and FactA—see also Section 3). Next, we report on the speech task (ArtiPhon), followed by the application-oriented task (QA4FAQ) and, lastly, the IBM Challenge. The names of the groups used in the following subsections are taken directly from the reports written by task participants and organisers. We provide a mapping for the abbreviations used in Table 1. Please note that different names can refer to groups formed by the same members (e.g., ILC-CNR and ItaliaNLP) and that the same affiliation can cover different departments of the same institution (e.g., MicroNeel and fbk4faq).

Table 1 Mapping between participating groups and institutions.

GROUP                TASK           INSTITUTION
ILC-CNR              PoSTWITA       Consiglio Nazionale delle Ricerche (CNR)
ItaliaNLP            SENTIPOLC      Consiglio Nazionale delle Ricerche (CNR)
samskara             SENTIPOLC      Consiglio Nazionale delle Ricerche (CNR)
ISTC                 ArtiPhon       Consiglio Nazionale delle Ricerche (CNR)
MicroNeel            NEEL-it        Fondazione Bruno Kessler (FBK)
FBK-HLT-NLP          NEEL-it        Fondazione Bruno Kessler (FBK)
fbk4faq              QA4FAQ         Fondazione Bruno Kessler (FBK)
MARTIN               IBM Challenge  Fondazione Bruno Kessler (FBK)
UniPI                NEEL-it        University of Pisa
UniPisa              PoSTWITA       University of Pisa
UniPI                SENTIPOLC      University of Pisa
CoLingLab            SENTIPOLC      University of Pisa
UniBologna           PoSTWITA       University of Bologna
CoMoDI               SENTIPOLC      University of Bologna
UniBO                SENTIPOLC      University of Bologna
UniDuisburg          PoSTWITA       University of Duisburg-Essen
MIVOQ                PoSTWITA       Mivoq Srl
sisinflab            NEEL-it        Polytechnic University of Bari
UNIMIB               NEEL-it        University of Milano-Bicocca
UniGroningen         PoSTWITA       University of Groningen
ILABS                PoSTWITA       Integris Srl
EURAC                PoSTWITA       EURAC Research
NITMZ                PoSTWITA       National Institute of Technology
NLP-NITMZ            QA4FAQ         National Institute of Technology & IPN Mexico
chiLab4It            QA4FAQ         University of Palermo
ADAPT                SENTIPOLC      Adapt Centre
INGEOTEC             SENTIPOLC      CentroGEO/INFOTEC CONACyT
IntIntUniba          SENTIPOLC      University of Bari
IRADABE              SENTIPOLC      Uni. Pol. de Valencia & Uni. de Paris & Uni. of Turin
SwissCheese          SENTIPOLC      Zurich University of Applied Sciences
tweet2check          SENTIPOLC      Finsa s.p.a.
Unitor               SENTIPOLC      University of Roma Tor Vergata
Appetitoso ChatBot   IBM Challenge  Kloevolution S.r.l. & University of Trento
Stockle              IBM Challenge  INRIA & SciLifeLab

2.1 SENTIPOLC

2.1.1 Task Description, Data, and Evaluation Metrics
SENTIPOLC (SENTIment POLarity Classification) is a sentiment analysis task where systems are required to automatically annotate tweets with a tuple of boolean values indicating the message’s subjectivity, its polarity (positive or negative), and whether

it is ironic or not (Barbieri et al. 2016). The SENTIPOLC task is indeed organised along three subtasks:10

- Task 1 – Subjectivity Classification: a system must decide whether a given message is subjective or objective. In Table 3 the value related to this task is expressed as subj, and allows for values {0,1}.
- Task 2 – Polarity Classification: a system must decide whether a given message is of positive, negative, neutral or mixed sentiment. In our data, positive and negative polarities are not mutually exclusive and each is annotated as a binary category. A tweet can thus be at the same time positive and negative, yielding a mixed polarity, or neither positive nor negative, meaning it is a subjective statement with neutral polarity. Polarity is a valid field only in combination with subjectivity. See (Basile et al. 2014; Barbieri et al. 2016) for further details and examples. In Table 3, overall polarity values are expressed as opos and oneg, each of them allowing for presence or absence ({0,1}).
- Task 3 – Irony Detection: a system must decide whether a given message is ironic or not. Twitter communications include a high percentage of ironic messages (Reyes and Rosso 2014), and because of the polarity-reversing effect that irony can have (one says something “good” to mean something “bad”), systems are heavily affected by it (Bosco, Patti, and Bolioli 2013; Ghosh et al. 2015). In Table 3, this value is reported as iro and allows for values {0,1}.

The data includes an additional layer of annotation, which specifies the literal polarity of a tweet. In non-ironic cases these values correspond to the overall polarity, while in ironic tweets they may differ (see examples in Table 3, values reported as lpos and lneg). This layer is not used directly in any evaluation, but it was provided in case teams wanted to make use of it, especially in dealing with the polarity-reversing property of irony. A minimal sketch of the resulting annotation record follows below.
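To make the scheme concrete, here is a minimal sketch (our illustration, not official task code; field names follow Table 3) that models one annotated tweet and derives the overall polarity label from the opos/oneg pair as defined in Task 2.

```python
from dataclasses import dataclass

@dataclass
class SentipolcAnnotation:
    """One tweet annotated with the six boolean SENTIPOLC fields (Table 3)."""
    subj: int  # 1 = subjective, 0 = objective
    opos: int  # overall positive polarity
    oneg: int  # overall negative polarity
    iro: int   # 1 = ironic
    lpos: int  # literal positive polarity
    lneg: int  # literal negative polarity

    def overall_polarity(self) -> str:
        # The four allowed opos/oneg combinations (see Task 2).
        return {(1, 0): "positive", (0, 1): "negative",
                (1, 1): "mixed", (0, 0): "neutral"}[(self.opos, self.oneg)]

# Third example in Table 3: subjective, negative, ironic (1 0 1 1 0 1).
ann = SentipolcAnnotation(subj=1, opos=0, oneg=1, iro=1, lpos=0, lneg=1)
print(ann.overall_polarity())  # -> negative
```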

Development and Test Data. The full dataset released for the shared task comprises the whole SENTIPOLC 2014 dataset (training and test, TW-SENTIPOLC14, 6421 tweets (Basile et al. 2014)), 1500 tweets from TWitterBuonaScuola (TW-BS, (Stranisci et al. 2016)), and two brand new sets: 500 tweets selected from the TWITA 2015 collection (TW-TWITA15, (Basile and Nissim 2013)), and 1000 (filtered to 989) tweets collected in the context of the NEEL-IT shared task (TW-NEELIT, (Basile et al. 2016)). The subsets of data extracted from existing corpora (TW-SENTIPOLC14 and TW-BS) were revised and enriched according to the new annotation guidelines specifically devised for this task (please consult (Barbieri et al. 2016) for details). The tweets from NEEL-IT and TWITA15, instead, were annotated completely from scratch using CrowdFlower11, a crowdsourcing platform which has also recently been used for a similar annotation task (Nakov et al. 2016). The TWITA15 collection, which comprises the 301 tweets also used as test data in the PoSTWITA (Tamburini et al. 2016), NEEL-IT (Basile et al. 2016) and FactA (Minard, Speranza, and Caselli 2016) EVALITA 2016 shared tasks (see Section 3 for

10 The three tasks are meant to be independent. For example, a team could take part in the polarity classification task without tackling Task 1.
11 http://www.crowdflower.com/


Table 2 Dataset breakdown for SENTIPOLC 2016. We specify whether there was pre-existing annotation in the datasets we used (pre-annot) and which new annotations were added to comply with the SENTIPOLC 2016 guidelines. We also distinguish whether the new annotation was performed by the Crowd (C) or by Experts (E). “l-polarity” stands for literal polarity (lpos and lneg).

SOURCE             PRE-ANNOT  ADDED ANNOTATIONS                  SET    SIZE
TW-SENTIPOLC 2014  yes        l-polarity for ironic tweets (E)   train  6421
TW-NEELIT          no         all (C)                            train   989
total train                                                             7410
TW-BS              yes        polarity to ironic tweets (E),     test   1500
                              l-polarity to ironic tweets (E),
                              potential subj to neutral (E)
TW-TWITA15         no         all (C + E)                        test    500
total test                                                              2000
total                                                                   9410

further details on the cross-task shared data), was additionally annotated by experts, so that the resulting labels are a product of crowd and expert agreement.12

Table 3 Examples of tweets exhibiting a variety of annotation combinations according to the SENTIPOLC 2016 guidelines.

description and example tweet (in Italian)                  subj opos oneg iro lpos lneg

subjective with neutral polarity and no irony                1    0    0    0    0    0
  Primo passaggio alla #strabrollo ma secondo me non era un iscritto

subjective with mixed polarity and no irony                  1    1    1    0    1    1
  Dati negativi da Confindustria che spera nel nuovo governo Monti.
  Castiglione: "Avanti con le riforme" http://t.co/kIKnbFY7

subjective with negative polarity and an ironic twist        1    0    1    1    0    1
  Calderoli: Governo Monti? Banda Bassotti ..infatti loro erano quelli
  della Magliana.. #FullMonti #fuoritutti #piazzapulita

subjective with negative polarity, an ironic twist, and      1    0    1    1    1    0
positive literal polarity
  Ho molta fiducia nel nuovo Governo Monti. Più o meno la stessa
  che ripongo in mia madre che tenta di inviare un’email.

subjective with negative polarity, an ironic twist, and      1    0    1    1    0    0
neutral literal polarity
  arriva Mario #Monti: pronti a mettere tutti il grembiulino?

In Table 3 we show a few examples and their annotation. For a comprehensive set of examples and an explanation of the allowed combinations, please refer to the SENTIPOLC 2016 report (Barbieri et al. 2016).

12 The organisers of SENTIPOLC mention that the CrowdFlower data had to undergo some post-validation for compliance with the guidelines. For all details, please refer to the SENTIPOLC 2016 report (Barbieri et al. 2016).


Evaluation. Evaluation is performed using precision, recall, and F-score, and is defined per subtask.

- Task 1 — Systems are evaluated on the assignment of a 0 or 1 value to the subjectivity field. A response is considered correct or wrong in comparison to the gold standard annotation. We compute precision (p), recall (r) and F-score (F) for each class (subj, obj), and the overall F-score is the average of the two F-scores (see the scoring sketch after this list).
- Task 2 — The coding system allows for four combinations of opos and oneg values: 10 (positive polarity), 01 (negative polarity), 11 (mixed polarity), 00 (no polarity). Accordingly, we evaluate positive and negative polarity independently by computing precision, recall and F-score for both classes (0 and 1). The F-score for each of the two polarity classes is the average of the F-scores of the respective pair. Finally, the overall F-score for Task 2 is given by the average of the F-scores of the two polarities.
- Task 3 — Systems are evaluated on their assignment of a 0 or 1 value to the irony field. A response is considered fully correct or wrong when compared to the gold standard. We measure precision, recall and F-score for each class (ironic, non-ironic), similarly to Task 1. The overall F-score is the average of the F-scores for the ironic and non-ironic classes.
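The sketch below is our reconstruction of the averaging scheme just described, not the official SENTIPOLC scorer: it computes the Task 1 score from gold and predicted subjectivity labels, and the same per-class averaging carries over to Tasks 2 and 3.

```python
def f1_for_class(gold, pred, cls):
    """Precision, recall and F-score of a single class, as defined above."""
    tp = sum(1 for g, p in zip(gold, pred) if g == cls and p == cls)
    fp = sum(1 for g, p in zip(gold, pred) if g != cls and p == cls)
    fn = sum(1 for g, p in zip(gold, pred) if g == cls and p != cls)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def task1_score(gold, pred):
    """Overall Task 1 F-score: average of the subj (1) and obj (0) F-scores."""
    return (f1_for_class(gold, pred, 1) + f1_for_class(gold, pred, 0)) / 2

gold = [1, 1, 0, 1, 0]  # toy gold subjectivity labels
pred = [1, 0, 0, 1, 1]  # toy system predictions
print(round(task1_score(gold, pred), 4))  # -> 0.5833
```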

2.1.2 Participating Systems and Results
A total of 13 teams from 6 different countries participated in at least one of the three SENTIPOLC tasks. Almost all teams participated in both the subjectivity and polarity classification subtasks. Each team had to submit at least one constrained run. Furthermore, teams were allowed to submit up to four runs (2 constrained and 2 unconstrained) in case they implemented different systems. Overall, we received 19, 26, and 12 submitted runs for the subjectivity, polarity, and irony detection tasks, respectively. Most of the submissions were constrained: three teams (UniPI, Unitor and tweet2check) participated with both a constrained and an unconstrained run on both the subjectivity and polarity subtasks. Unconstrained runs were submitted to the polarity subtask only by IntIntUniba.SentiPy and INGEOTEC.B4MSA. Differently from SENTIPOLC 2014, unconstrained systems performed better than constrained ones, with the only exception of UniPI, whose constrained system ranked first in the polarity classification subtask.

A single ranking table was produced for each subtask, where unconstrained runs are properly marked. Notice that only the average F-score was used for global scoring and ranking. For each task, we ran a majority-class baseline to set a lower bound for performance.

Table 4 shows results for the subjectivity classification task. All participant systems show an improvement over the baseline. The highest F-score is achieved by Unitor at 0.7444, which is also the best unconstrained performance (Castellucci, Croce, and Basili 2016). Among the constrained systems, the best F-score is achieved by samskara with F = 0.7184 (Russo and Monachini 2016).

Table 5 shows results for the polarity classification task, which was again the most popular subtask with 26 submissions from 12 teams. Also in this case, all participant systems show an improvement over the baseline. The highest F-score is achieved by UniPI at 0.6638 (Attardi et al. 2016a), which is also the best score among the constrained

99 Italian Journal of Computational Linguistics Volume 3, Number 1

Table 4 Task 1 (Subjectivity Classification): F-scores for constrained (“.c”) and unconstrained (“.u”) runs. After the deadline, two teams reported a conversion error from their internal format to the official one; the resubmitted amended runs are marked with *.

TEAM ID                OBJ     SUBJ    F       METHOD       EMB    OTHER RESOURCES
Unitor.1.u             0.6784  0.8105  0.7444  CNN          word   Tw data, hm-Lex
Unitor.2.u             0.6723  0.7979  0.7351  CNN          word   Tw data, hm-Lex
samskara.1.c           0.6555  0.7814  0.7184  Naive Bayes  -      hm-Lex
ItaliaNLP.2.c          0.6733  0.7535  0.7134  LSTM-SVM     word   Lex
IRADABE.2.c            0.6671  0.7539  0.7105  SVM          -      hm-Lex
INGEOTEC.1.c           0.6623  0.7550  0.7086  SVM          -      Tw data
Unitor.c               0.6499  0.7590  0.7044  CNN          word   -
UniPI.1/2.c            0.6741  0.7133  0.6937  CNN          word   Lex
UniPI.1/2.u            0.6741  0.7133  0.6937  CNN          word   Tw data
ItaliaNLP.1.c          0.6178  0.7350  0.6764  LSTM-SVM     word   Lex
ADAPT.c                0.5646  0.7343  0.6495  -            -      -
IRADABE.1.c            0.6345  0.6139  0.6242  DNN-SVM      -      hm-Lex
tweet2check16.c        0.4915  0.7557  0.6236  -            -      yes (un)
tweet2check14.c        0.3854  0.7832  0.5843  -            -      yes (un)
tweet2check14.u        0.3653  0.7940  0.5797  -            -      yes (un)
UniBO.1.c              0.5997  0.5296  0.5647  -            -      -
UniBO.2.c              0.5904  0.5201  0.5552  -            -      -
Baseline               0.0000  0.7897  0.3949
*SwissCheese.c_late    0.6536  0.7748  0.7142  CNN          word   -
*tweet2check16.u_late  0.4814  0.7820  0.6317  -            -      -

runs. As for the unconstrained runs, the best performance is achieved by Unitor with F = 0.6620 (Castellucci, Croce, and Basili 2016)13.

Table 6 shows results for the irony detection task, which attracted 12 submissions from 7 teams. The highest F-score was achieved by tweet2check at 0.5412 (constrained run) (Di Rosa and Durante 2016). The only unconstrained run was submitted by Unitor, achieving an F-score of 0.4810. While all participating systems show an improvement over the baseline (F = 0.4688), many systems score very close to it, highlighting the complexity of the task14.

Methods. All systems, except CoMoDI, exploited machine learning techniques in a supervised setting. Two main strategies emerged. One involves using linguistically principled approaches to represent tweets and provide the learning framework with valuable information to converge to good results. The other exploits state-of-the-art learning frameworks in combination with word embedding methods over large-scale corpora of tweets. On balance, the latter approach achieved better results in the final rankings.

13 After the deadline, SwissCheese and tweet2check reported a conversion error from their internal format to the official one. The resubmitted amended runs are shown in the table (marked by the * symbol), but the official ranking was not revised.
14 In all the tables above we mark with ‘-’ the cases where a characteristic is not present or where the participants’ report gives no clear information about its presence. Moreover, notice that ADAPT, UniBO and tweet2check did not provide details about their systems.


Table 5 Task 2 (Polarity Classification): F-scores for constrained (“.c”) and unconstrained (“.u”) runs. Amended runs are marked with *.

TEAM ID                POS     NEG     F       METHOD                       EMB    OTHER RESOURCES
UniPI.2.c              0.6850  0.6426  0.6638  CNN                          word   -
Unitor.1.u             0.6354  0.6885  0.6620  CNN                          word   Tw data, hm-Lex
Unitor.2.u             0.6312  0.6838  0.6575  CNN                          word   Tw data, hm-Lex
ItaliaNLP.1.c          0.6265  0.6743  0.6504  LSTM-SVM                     word   Lex
IRADABE.2.c            0.6426  0.6480  0.6453  SVM                          -      hm-Lex
ItaliaNLP.2.c          0.6395  0.6469  0.6432  LSTM-SVM                     word   Lex
UniPI.1.u              0.6699  0.6146  0.6422  CNN                          word   Tw data
UniPI.1.c              0.6766  0.6002  0.6384  CNN                          word   -
Unitor.c               0.6279  0.6486  0.6382  CNN                          word   -
UniBO.1.c              0.6708  0.6026  0.6367  -                            -      -
IntIntUniba.sentipy.c  0.6189  0.6372  0.6281  Linear SVC + Emoji Classifier -     -
IntIntUniba.sentipy.u  0.6141  0.6348  0.6245  -                            -      -
UniBO.2.c              0.6589  0.5892  0.6241  -                            -      -
UniPI.2.u              0.6586  0.5654  0.6120  CNN                          word   Tw data
CoLingLab.c            0.5619  0.6579  0.6099  SVM                          -      hm-Lex
IRADABE.1.c            0.6081  0.6111  0.6096  SVM, DNN                     -      hm-Lex
INGEOTEC.b4msa.u       0.5944  0.6205  0.6075  SVM                          -      Tw data
INGEOTEC.2.c           0.6414  0.5694  0.6054  SVM                          -      -
ADAPT.c                0.5632  0.6461  0.6046  -                            -      -
IntIntUniba.sentiws.c  0.5779  0.6296  0.6037  Rocchio + Naive Bayes        -      -
tweet2check16.c        0.6153  0.5878  0.6016  -                            -      -
tweet2check14.u        0.5585  0.6300  0.5943  -                            -      -
tweet2check14.c        0.5660  0.6034  0.5847  -                            -      -
samskara.1.c           0.5198  0.6168  0.5683  Naive Bayes                  -      Lex
Baseline               0.4518  0.3808  0.4163
*SwissCheese.c_late    0.6529  0.7128  0.6828  CNN                          word   -
*tweet2check16.u_late  0.6528  0.6373  0.6450  -                            -      -

However, with F-scores of 0.7444 (unconstrained) and 0.7184 (constrained) in subjectivity recognition, and 0.6638 (constrained) and 0.6620 (unconstrained) in polarity recognition, we are still far from having solved sentiment analysis on Twitter. For the future, we envisage the definition of novel approaches, for example combining neural network-based learning with a linguistically aware choice of features.

Many teams adopted learning methods already investigated at SENTIPOLC 2014; in particular, Support Vector Machines (SVM) are the most widely adopted learning algorithm. The SVM systems are generally based on specific linguistic/semantic feature engineering, as discussed for example by ItaliaNLP, IRADABE, INGEOTEC or ColingLab. Microblogging-specific features such as emoticons and hashtags are also adopted, for example by ColingLab, INGEOTEC or CoMoDi. In addition, some teams (e.g. ColingLab) adopted Topic Models to represent tweets. Samskara also used feature modelling with a communicative and pragmatic value. CoMoDi is one of the few systems that investigated irony-specific features. Other methods have also been used, such as the Bayesian approach of samskara (achieving good results in polarity recognition), combined with linguistically motivated feature modelling. CoMoDi is the only participant that adopted a rule-based approach, in combination with a rich set of linguistic cues dedicated to irony detection.

Approaches based on Convolutional Neural Networks (CNN) were investigated at SENTIPOLC for the first time in 2016 by a few teams. Deep learning methods adopted by


Table 6 Task 3 (Irony Detection): F-scores for constrained (“.c”) and unconstrained (“.u”) runs. Amended runs are marked with *.

TEAM ID              NON-IRO  IRO     F       METHOD      EMB    OTHER RESOURCES
tweet2check16.c      0.9115   0.1710  0.5412  -           -      -
CoMoDI.c             0.8993   0.1509  0.5251  Rule-based  -      Lex
tweet2check14.c      0.9166   0.1159  0.5162  -           -      -
IRADABE.2.c          0.9241   0.1026  0.5133  SVM         -      hm-Lex
ItaliaNLP.1.c        0.9359   0.0625  0.4992  LSTM-SVM    word   Lex
ADAPT.c              0.8042   0.1879  0.4961  -           -      -
IRADABE.1.c          0.9259   0.0484  0.4872  SVM, DNN    -      hm-Lex
Unitor.2.u           0.9372   0.0248  0.4810  CNN         word   Tw data
Unitor.c             0.9358   0.0163  0.4761  CNN         word   -
Unitor.1.u           0.9373   0.0084  0.4728  CNN         word   Tw data
ItaliaNLP.2.c        0.9367   0.0083  0.4725  LSTM-SVM    word   Lex
Baseline             0.9376   0.0000  0.4688
*SwissCheese.c_late  0.9355   0.1367  0.5361  CNN         word   -

some teams, such as UniPI and SwissCheese, require individual tweets to be modelled through geometrical representations, i.e. vectors. Words in individual tweets are represented through word embeddings, mostly derived with the Word2Vec tool or similar approaches. Unitor extends this representation with additional features derived from Distributional Polarity Lexicons. The majority of teams also used external resources, such as lexicons specific to sentiment analysis tasks. Some teams used already existing lexicons (referred to as Lex in the tables above), such as samskara, ItaliaNLP, CoLingLab and CoMoDi, while others created their own task-specific resources, such as Unitor, IRADABE and CoLingLab (referred to as hm-Lex in the tables above).
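As an illustration of the embedding-based representation strategy discussed above, the sketch below averages pre-trained word vectors into a single tweet vector. The toy embedding table and its values are ours; the participating systems derived theirs from Word2Vec-style training on large Twitter corpora.

```python
import numpy as np

# Toy pre-trained embedding table; real systems load vectors trained
# with Word2Vec or similar tools on millions of tweets.
emb = {"governo": np.array([0.1, 0.3]),
       "monti": np.array([0.2, -0.1]),
       "fiducia": np.array([0.4, 0.5])}

def tweet_vector(tokens, emb, dim=2):
    """Average the embeddings of known tokens; zero vector if none is known."""
    vecs = [emb[t] for t in tokens if t in emb]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

print(tweet_vector(["fiducia", "nel", "governo", "monti"], emb))
```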

Unconstrained runs. Some teams submitted unconstrained results, as they used additional annotated Twitter data for training their systems (Tw data in the tables above). In particular, UniPI used a silver-standard corpus of more than 1M tweets to pre-train the CNN; this corpus is annotated using a polarity lexicon and specific polarised words. Unitor also used external tweets to pre-train their CNN. This corpus is made of the contexts of the tweets in the training material and is automatically annotated, in a semi-supervised fashion, using the classifier trained only on the training material. Moreover, Unitor used distant supervision to label a set of tweets for the acquisition of their so-called Distributional Polarity Lexicon. Distant supervision is also adopted by INGEOTEC to extend the training material for their SVM classifier. For a deeper comparison between participating systems and approaches see (Barbieri et al. 2016).

As a final note, we would like to mention that the distinction between constrained and unconstrained runs, which we still maintained in this edition, becomes less meaningful when we consider that, as shown in the tables above, many constrained systems exploited word embeddings built on huge amounts of additional (Twitter) data. The traditional distinction, which normally focuses on using or not using additional training data annotated according to the task guidelines, was meant to guarantee a fair comparison.


However, as this distinction might generally become a bit blurred, it is worth reflecting on whether it makes sense to use it in future editions.

2.1.3 Links to Other Shared Tasks
Previous EVALITA tasks. SENTIPOLC 2016 is a re-run of SENTIPOLC 2014, which had been introduced then for the first time and had attracted the highest number of participants among EVALITA tasks. The main differences between the 2014 and the 2016 editions lie in the data and in the best-performing algorithms. Regarding annotation, two new fields expressing literal polarity were added, in order to provide insights into the mechanisms behind polarity shifts in the presence of figurative usage. Also, a portion of the data was annotated via CrowdFlower rather than by experts. Regarding the source of data, the test set is drawn from Twitter, but it is composed of a portion of random tweets and a portion of tweets selected via keywords which do not exactly match the selection procedure that led to the creation of the training set. This was intentionally done as a novelty in 2016 to observe the portability of supervised systems, in line with what is suggested in (Basile et al. 2015).

Finally, concerning systems, for the first time at SENTIPOLC neural models were used with success, achieving the best results especially in the unconstrained runs. Although evaluated over a different dataset, the best systems also show better, albeit comparable, performance for subjectivity with respect to 2014's systems, and outperform them for polarity (if we include late submissions). The use of a progress set, as already done at SemEval, would allow for a proper evaluation across the various editions, and would definitely be a welcome innovation at the next edition.

In contrast to the improvement in performance from 2014 to 2016, irony detection appears truly challenging, and the systems' performance drops in 2016 w.r.t. the previous edition. The task's complexity does not depend (only) on the inner structure of irony, but also on the unbalanced data distribution (1 out of 7 examples is ironic in the training set). The examples in the dataset are probably not sufficient to generalise over the structure of ironic tweets. Future campaigns could consider including a larger and more balanced dataset of ironic tweets.

Non-EVALITA tasks. Sentiment classification on Twitter is by now an established task internationally. Such solid and growing interest is reflected in the fact that the sentiment analysis tasks at SemEval (where they now constitute a whole track) have attracted the highest number of participants in recent years (Rosenthal et al. 2014, 2015; Nakov et al. 2016). It is interesting to highlight that the Swiss team SwissCheese, which achieved the best score in polarity classification (including late submissions) at SENTIPOLC 2016, was the top-scoring team also at the ‘twin task’ for English at SemEval-2016 Task 4 (Nakov et al. 2016). Task 10 at SemEval 2015 was concerned with irony in Twitter, but rather than an irony detection task, it was designed as a polarity detection task on tweets already known to be ironic. This is also an avenue that could be explored for Italian. More generally, tasks revolving around the use of non-literal language are becoming more popular: two of the five tasks of the sentiment analysis track at SemEval 2017 are organised around humour-related topics.

2.2 PoSTWITA

2.2.1 Task Description, Data and Evaluation Metrics
PoSTWITA consists in developing systems for the Part-of-Speech (PoS) tagging of tweets: in other words, it concerns the domain adaptation of tools built for stan-


Table 7 Size of the PoSTWITA datasets and an annotated example.

          Development Set  Test Set
# tweets  6,438            300
# tokens  114,967          4,759

Example:
___579013335921885184___
@LudovicaCagnino  MENTION
Grazie            INTJ
AMORE             PROPN

dardized texts to social media data (Tamburini et al. 2016). As for the other EVALITA 2016 tasks devoted to the automatic processing of Twitter posts, the final aim of PoSTWITA is to promote research on the automatic extraction of knowledge from social media texts written in Italian. In order to deal with this type of data, it is crucial to have annotation guidelines and resources that take into consideration the linguistic peculiarities of the Twitter language. To this end, a specific tagset was defined and new datasets were released. As for the tagset, the main labels are inherited from the ones adopted within the Universal Dependencies (UD) project for Italian15, so as to make the resources annotated for the task compliant with the UD treebanks. However, novel tags are introduced to cover three cases: (i) articulated prepositions (ADP_A, della); (ii) clusters made of a verb and a clitic (VERB_CLIT, mangiarlo); (iii) Twitter-specific elements, i.e., emoticons (EMO, :-)), web addresses (URL, http://www.site.it), email addresses (EMAIL, [email protected]), hashtags (HASHTAG, #staisereno) and mentions (MENTION, @someone). The first two labels allow tokenisation to keep such words unsplit, rather than split as in the original UD format.

Once the tagset was defined, a development set (DS) and a test set (TS) were collected and annotated. The DS data are taken from the EVALITA 2014 SENTIPOLC dataset, while the TS is shared with other EVALITA 2016 tasks (see Section 3). As for the annotation, tokenisation and annotation were first performed automatically; then two expert annotators manually corrected the same tweets in parallel, and finally an adjudicator resolved disagreements. Table 7 reports the size of the DS and TS together with an annotated example. No additional resource was distributed to the participants who, however, following an open-task approach, had the opportunity to use external data to develop their systems. Systems' output is evaluated in terms of tagging accuracy, that is, the number of correct PoS tags divided by the total number of tokens in the TS.
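The official metric is simple enough to state in a few lines; the sketch below (ours, reusing the tags of the example in Table 7) computes tagging accuracy from gold and predicted tag sequences.

```python
def tagging_accuracy(gold_tags, pred_tags):
    """Correct PoS tags divided by the total number of tokens (the official metric)."""
    assert len(gold_tags) == len(pred_tags)
    return sum(g == p for g, p in zip(gold_tags, pred_tags)) / len(gold_tags)

# Tokens of the example tweet in Table 7, with Twitter-specific tags.
gold = ["MENTION", "INTJ", "PROPN"]
pred = ["MENTION", "INTJ", "NOUN"]   # a typical noun/proper-noun confusion
print(tagging_accuracy(gold, pred))  # -> 0.666...
```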

2.2.2 Participating Systems and Results
Although 16 teams registered for the task, only 9 submitted a run to be evaluated. Among these groups, 7 are affiliated with universities or other research centres located in Italy and abroad (India, The Netherlands, and Germany) and 2 come from Italian private companies. Table 8 presents the official results of the PoSTWITA runs, with accuracy ranging from 0.7600 to 0.9319. Three systems are based on traditional machine learning algorithms (i.e., CRF, HMM, and SVM) while all the others employ neural networks: the perceptron algorithm in one case and Long Short-Term Memo-

15 http://universaldependencies.org/it/pos/index.html


Table 8 PoSTWITA results in terms of accuracy (ACC.), with details about the main characteristics of the participating systems.

TEAM ID       ACC.    METHOD                    EMBEDDINGS  OTHER RESOURCES
ILC-CNR       0.9319  two-branch Bi-LSTM NN     word&char   yes
UniDuisburg   0.9286  CRF classifier            -           yes
MIVOQ         0.9271  CRF + HMM + SVM           -           yes
UniBologna    0.9246  Stacked Bi-LSTM NN + CRF  word        yes
UniGroningen  0.9225  Bi-LSTM NN                word        yes
UniPisa       0.9157  Bi-LSTM NN + CRF          word&char   yes
ILABS         0.8790  Perceptron algorithm      -           yes
NITMZ         0.8596  HMM bigram model          -           -
EURAC         0.7600  LSTM NN                   word&char   yes

ries (LSTM) in the remaining systems. More specifically, bidirectional LSTMs (BiLSTM) prove to be the architecture of choice, being used by 4 of the 6 top-performing systems. In addition, experiments carried out during the development of the best-scoring system show that a two-branch BiLSTM architecture performs clearly better than a single BiLSTM, with an improvement of about 0.5 points on the DS and 0.84 on the TS (Cimino and Dell'Orletta 2016). The majority of systems use word-level or character-level embeddings as input: compared with the former, the latter perform well by giving a finer representation of words and thus better coping with the noisy language of tweets (Attardi and Simi 2016). In all but one case, additional corpora or resources were employed: word clusters (Horsmann and Zesch 2016), morphological analysers (Tamburini 2016), external dictionaries (Paci 2016), annotated and non-annotated corpora (Stemle 2016). As for this last point, Plank and Nissim (2016) show that adding a small amount of in-domain data is more effective than using data from different genres.

Error analysis on the systems' output highlights that the most challenging distinction in terms of tag assignment is between nouns and proper nouns. In addition, the performances on the DS and on the TS register a large difference: on the DS, top systems reach an accuracy above 0.95, clearly higher than the results on the TS. Indeed, if compared with state-of-the-art systems built for other languages (Owoputi et al. (2013) report an accuracy of 0.93 for English), results on the DS seem particularly good. This may be due to the strong homogeneity of the training data. On the other hand, tweets in the TS covered different topics with respect to the DS: for example, there are only two mentions (@Youtube and @matteorenzi) and four very generic hashtags (#governo, #news, #rt, #lavoro) that are present in both the DS and the TS.

2.2.3 Links to Other Shared Tasks
Previous EVALITA Tasks. Both EVALITA 2007 and EVALITA 2009 hosted an evaluation exercise on PoS tagging, but their focus was on texts with standard forms and syntax. During the first edition of the campaign, the task was designed around a corpus of different genres (newspaper articles, narrative prose, academic and legal texts) and organised in two subtasks based on two tagsets, EAGLES and DISTRIB. The former has a long tradition in computational linguistics while the latter is distributionally and syntactically oriented (Monachini 1995; Bernardi et al. 2005). Among the 11 submitted systems, the majority used SVM or a combination of taggers plus additional resources. In particular, morphological analysers play a crucial role in many of these systems.


Accuracy ranged from 0.8871 to 0.9804 with the EAGLES tagset and from 0.9142 to 0.9768 with the DISTRIB tagset (Tamburini 2007). The 2009 task had a higher complexity due to the adoption of a larger tagset (37 tags with morphological variants and 336 morphed tags) and to the fact that the training and test corpora consisted of texts belonging to different genres (newspaper articles for training and Wikipedia pages for testing). Accuracy was measured on morphed tags and also on tags without morphology, following both a closed and an open approach. As in the previous edition, most of the 8 participating systems were based on a combination of taggers. Results in the open subtask are all above 0.95, while the closed subtask saw a greater variability, with accuracy ranging from 0.9164 to 0.9691. Looking at these previous EVALITA PoS evaluation exercises, it is easy to see that the performance of systems annotating tweets is lower than on standardized texts, so there is room for improvement. It is also worth noting that EVALITA 2016 has witnessed the breakthrough of deep learning techniques for the PoS tagging task as well.

Non-EVALITA Tasks. The EmpiriST 2015 shared task on the automatic linguistic annotation of computer-mediated communication (CMC) and social media had a subtask on the PoS tagging of CMC discourse in German. The dataset is made of data from different CMC genres, including tweets (Beißwenger et al. 2016). Performances are lower than the ones registered at EVALITA 2016, with the best system obtaining an accuracy of 0.9028.

2.3 NEEL-it

2.3.1 Task Description, Data and Evaluation Metrics
The NEEL-it16 task consists in automatically annotating each named entity mention (belonging to one of the following categories: Thing, Event, Character, Location, Organization, Person and Product) in a tweet by linking it to the DBpedia knowledge base (Basile et al. 2016). Tweets represent a great wealth of information useful to understand recent trends and user behaviour in real time. Usually, natural language processing techniques are applied to such pieces of information in order to make them machine-understandable. Named Entity rEcognition and Linking (NEEL) is a particularly useful technique aiming to automatically annotate tweets with named entities. However, due to the noisy nature and shortness of tweets, this technique is more challenging in this context than elsewhere.

NEEL-it followed a setting similar to the NEEL challenge for English microposts on Twitter (Rizzo et al. 2016). Specifically, each task participant is required to: 1) recognise and type each entity mention that appears in the text of a tweet; 2) disambiguate and link each mention to the canonicalised DBpedia 2015-10, which is used as the reference Knowledge Base; 3) cluster together the non-linkable entities, which are tagged as NIL, in order to provide a unique identifier for all the mentions that refer to the same named entity. In the annotation process, a named entity is a string in the tweet representing a proper noun that: 1) belongs to one of the categories specified in a taxonomy and/or 2) can be linked to a DBpedia concept. This means that some concepts have a NIL DBpedia reference17. The preceding article (il, lo, la, etc.) and any other prefix (e.g. Dott., Prof.) or post-posed modifier are excluded from the annotation. Each participant is asked to produce an annotation file with multiple lines, one for each annotation. A line is a

16 http://neel-it.github.io/
17 These concepts belong to one of the categories but have no corresponding concept in DBpedia.

tab-separated sequence of tweet id, start offset, end offset, linked concept in DBpedia, and category. For example, given the tweet with id 288976367238934528:

Chameleon Launcher in arrivo anche per smartphone: video beta privata su Galaxy Note 2 e Nexus 4: Chameleon Laun...

the annotation process is expected to produce the output reported in Table 9.

Table 9 Example of annotations.

id      begin  end  link                                                type
288...  0      18   NIL                                                 Product
288...  73     86   http://dbpedia.org/resource/Samsung_Galaxy_Note_II  Product
288...  89     96   http://dbpedia.org/resource/Nexus_4                 Product
290...  1      15   http://dbpedia.org/resource/Carlotta_Ferlito        Person
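For illustration, the snippet below (ours, not part of the task toolkit) emits one of the Table 9 annotations in the tab-separated format just described.

```python
# One NEEL-it annotation as a tab-separated line:
# tweet id, start offset, end offset, DBpedia link (or NIL), category.
fields = ("288976367238934528", 89, 96,
          "http://dbpedia.org/resource/Nexus_4", "Product")
print("\t".join(str(f) for f in fields))
```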

The annotation process is also expected to link Twitter mentions (@) and hashtags (#) that refer to named entities; for instance, for the tweet with id 290460612549545984:

@CarlottaFerlito io non ho la forza di alzarmi e prendere il libro! Help me

the correct annotation is also reported in Table 9. Participants were allowed to submit up to three runs of their system as TSV files. We encouraged participants to make their systems available to the community to facilitate reuse.

As for the NEEL-it corpus, it consists of both a development set (released to participants as training set) and a test set. Both sets are composed of two TSV files: (1) the tweet id file, i.e., a list of all tweet ids used for training; (2) the gold standard, containing the annotations for all the tweets in the development set, following the format shown in Table 9. The development set was built upon the dataset produced by Basile, Caputo, and Semeraro (2015). This dataset is composed of a sample of 1,000 tweets randomly selected from the TWITA dataset (Basile and Nissim 2013). We updated the gold standard links to the canonicalised DBpedia 2015-10. Furthermore, the dataset underwent another round of annotation performed by a second annotator in order to maximise the consistency of the links. Tweets that presented some conflict were then resolved by a third annotator.

Data for the test set were generated by randomly selecting 1,500 tweets from the SENTIPOLC test data (Barbieri et al. 2016). From this pool, 301 tweets were randomly chosen for the annotation process and represent our Gold Standard (GS). The GS was chosen in coordination with the task organisers of SENTIPOLC (Barbieri et al. 2016), PoSTWITA (Tamburini et al. 2016) and FactA (Minard, Speranza, and Caselli 2016), with the aim of providing a unified framework for multiple layers of annotation (see Section 3). The tweets were split in two batches, each of which was manually annotated by two different annotators. Then, a third annotator intervened in order to resolve the debatable tweets with no exact match between annotations. The whole process was carried out using the BRAT18 web-based tool (Stenetorp et al. 2012). Looking at the annotated data, Person is the most populated category among the NIL instances, along with Organization and Product. In the development

18 http://brat.nlplab.org/


Table 10 NEEL-it results in terms of final score, with details about the main characteristics of the participating systems.

TEAM ID      SCORE   METHOD                    EMBEDDINGS  OTHER RESOURCES
UniPI        0.5034  BiLSTM + text similarity  word        yes
MicroNeel    0.4967  Tint + The Wiki Machine   -           no
FBK-HLT-NLP  0.4932  EntityPro + NewsReader    -           yes
Sisinflab    0.3418  ensemble                  word        no
UNIMIB       0.2220  CRF + supervised linking  word        no

set, the least represented category is Character among the NIL instances, and both Thing and Event among the linked ones. A different behaviour can be found in the test set, where the least represented category is Thing in both NIL and linked instances.

Each participant was asked to submit up to three different runs, and the evaluation was based on the following three metrics:

- STMM (Strong_Typed_Mention_Match). This metric evaluates the micro-average F-1 score for all annotations, considering the mention boundaries and their types. It is a measure of the tagging capability of the system.
- SLM (Strong_Link_Match). This metric is the micro-average F-1 score for annotations, considering the correct link for each mention. It is a measure of the linking performance of the system.
- MC (Mention_Ceaf). This metric, also known as Constrained Entity-Alignment F-measure (Luo 2005), is a clustering metric developed to evaluate clusters of annotations. It evaluates the F-1 score for both NIL and non-NIL annotations in a set of mentions.

The final score for each system is a combination of the aforementioned metrics and is computed as follows:

score = 0.4 × MC + 0.3 × STMM + 0.3 × SLM    (1)

All the metrics were computed using the TAC KBP scorer19.
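Equation (1) can be stated directly in code; the sketch below is our illustration, with made-up per-metric values chosen to fall in the range of the scores reported in Table 10.

```python
def neel_it_score(mc, stmm, slm):
    """Final NEEL-it score, Equation (1): 0.4*MC + 0.3*STMM + 0.3*SLM."""
    return 0.4 * mc + 0.3 * stmm + 0.3 * slm

# Hypothetical per-metric F-scores for one run.
print(round(neel_it_score(mc=0.59, stmm=0.46, slm=0.40), 4))  # -> 0.494
```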

2.3.2 Participating Systems and Results
The task was well received by the NLP community and attracted 17 expressions of interest. Five groups participated actively in the challenge by submitting their system results; each group presented three different runs, for a total of 15 submitted runs. Table 10 summarises the methodology followed by each group and the best performance achieved by each participant.

The best result was reported by UniPI (Attardi et al. 2016b), while MicroNeel.base (Corcoglioniti et al. 2016) and FBK-HLT-NLP (Minard, Qwaider, and Magnini 2016)

19 https://github.com/wikilinks/neleval/wiki/Evaluation

obtain remarkable results, very close to the best performance. It is interesting to notice that all these systems (UniPI, MicroNeel and FBK-HLT-NLP) developed specific techniques for dealing with Twitter mentions, reporting very good results on the tagging metric (with values always above 0.46). All participants made use of supervised algorithms at some point of their tagging/linking/clustering pipeline. UniPI, Sisinflab and UNIMIB exploited word embeddings trained on the development set plus some other external resources (manually annotated corpora, Wikipedia, and TWITA). UniPI and FBK-HLT-NLP built additional training data obtained by active learning and manual annotation. The use of additional resources is allowed by the task guidelines, and both teams have contributed additional data useful for the research community.

2.3.3 Links to Other Shared Tasks
Previous EVALITA Tasks. This is the first edition of the NEEL-it task at EVALITA; however, several tasks on Named Entity Recognition (NER) were organised in previous editions of EVALITA (Speranza 2007, 2009; Bartalesi Lenzi, Speranza, and Sprugnoli 2013). The noisy nature and shortness of tweets make the NER task more difficult: in fact, we report a Mention CEAF of about 59%, while during EVALITA 2007 and 2009 the best performance was about 82%, and in EVALITA 2011 the best performance was 61%. It is important to underline that in EVALITA 2011 the task was based on the automatic transcription of broadcast news.

Non-EVALITA Tasks. The #Microposts (Making Sense of Microposts) workshop has organised a NEEL challenge since 2014. The NEEL-it task is inspired by #Microposts and follows its guidelines and annotation schema. Other tasks related to entity linking in tweets are the Knowledge Base Population (KBP2014) Entity Linking Track20 and the Entity Recognition and Disambiguation Challenge (ERD 2014) (Carmel et al. 2014). It is important to underline that the ERD challenge is not focused on tweets, but its short-text track is performed in the context of search engine queries.

2.4 FactA

2.4.1 Task Description, Data and Evaluation Metrics
FactA (Event Factuality Annotation)21 aims at evaluating the automatic assignment of factuality values to events (Minard, Speranza, and Caselli 2016). In this task, factuality is defined as the committed belief expressed by relevant sources, either the utterer of direct and reported speech or the author of the text, towards the status of an event (Saurí and Pustejovsky 2012), i.e., a situation that happens or occurs, but also a state or a process (Pustejovsky et al. 2003). Specific linguistic markers and constructions help identify the factuality of an event, which can be classified according to 5 values: factual, counterfactual, non-factual, underspecified and no factuality. This classification is assigned by combining the values given to three attributes, namely Certainty, Time and Polarity. The first attribute expresses how sure the source is about the event, the second distinguishes future and underspecified events from all the others, and the third specifies whether an event is affirmed or negated (Tonelli et al. 2014). The following example shows the factuality annotation of the event usciti/“went out”.

20 http://nlp.cs.rpi.edu/kbp/2014/
21 http://facta-evalita2016.fbk.eu


Probabilmente i ragazzi sono usciti di casa tra le 20 e le 21. (“The guys probably went out between 8 and 9 pm.”)

- Certainty: no_certain
- Time: past/present
- Polarity: positive
- Factuality Value: non-factual

FactA is organised around a Main Task focusing on the newswire domain and a Pilot Task dedicated to a particular type of social media text, i.e., tweets. The training set of the Main Task consists of 169 news documents from the Fact-ItaBank corpus (Minard, Marchetti, and Speranza 2014), while the test set is made of 120 Wikinews articles taken from the Italian section of the NewsReader MEANTIME corpus (Speranza and Minard 2015). As for the Pilot Task, the idea was to measure how well systems built for standard language perform on new text types, like tweets. For this reason, only a test set of 301 tweets is provided, corresponding to a subsection of the EVALITA 2016 SENTIPOLC task (Barbieri et al. 2016). An official score and a baseline system are defined: the former calculates the micro-average F1 score (corresponding to the accuracy) on the overall factuality value and also on the single attributes, while the baseline system assigns the most frequent value per attribute on the basis of its frequency in the training data (that is, certain, positive and past/present), so that all events are annotated as factual.
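As a rough illustration of how the attributes combine, the toy function below covers only the two configurations that the text above pins down: the baseline configuration, which yields factual, and the attribute values of the usciti example, which yield non-factual. The full decision table is in the FactA guidelines; the fallback branch here is our own placeholder.

```python
def factuality_value(certainty, time, polarity):
    """Toy mapping from the three FactA attributes to a factuality value."""
    if (certainty, time, polarity) == ("certain", "past/present", "positive"):
        return "factual"        # the baseline configuration described above
    if certainty == "no_certain":
        return "non-factual"    # as in the 'usciti' example
    return "underspecified"     # placeholder for the remaining cases

print(factuality_value("no_certain", "past/present", "positive"))  # -> non-factual
```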

2.4.2 Participating Systems and Results
Although 13 teams registered for the task, none of them took part in it by submitting a run. However, after the official evaluation, the system developed by one of the organisers was tested on both Main and Pilot data. The system, called FactPro, performs multi-class classification: three classifiers, one for each factuality attribute, are built using a Support Vector Machine algorithm exploiting lexical, syntactic and semantic features plus manually defined trigger lists (e.g., lists of linguistic particles that are polarity markers). FactPro outperforms the baseline on both datasets, reaching 0.72 F1 in the assignment of the factuality value on news (+0.05 with respect to the baseline) and 0.56 F1 on tweets (+0.09 with respect to the baseline). The first result can be compared with the performance of De Facto, the tool developed by Saurí and Pustejovsky (2012), which achieves an F1 of 0.80 (micro-averaging) and 0.70 (macro-averaging).

Error analysis reveals four main sources of errors: (i) the unbalanced distribution of some attribute values; (ii) the semantic complexity of some sentences; (iii) the incompleteness of the trigger lists; (iv) the analysis of nominal events. As for tweets, the peculiarities of their language, such as their fragmentary style and the high frequency of imperatives and interrogatives, pose additional challenges to the task. Indeed, the accuracy in the classification of each attribute registers a consistent drop (Polarity: -0.13, Certainty: -0.17, Time: -0.14) on Twitter data with respect to news data.

2.4.3 Links to Other Shared Tasks
Previous EVALITA Tasks. FactA is the first evaluation task for the factuality profiling of events in Italian, so it is not possible to make a direct comparison with any previous exercise organised at EVALITA. Nevertheless, FactA is strictly connected with the EVALITA 2014 EVENTI task (Caselli et al. 2014). EVENTI is the acronym of EValuation of Events aNd Temporal Information, and its aim was to evaluate the performance of Temporal


Table 11 Performance of FactPro.

                        ACCURACY
                        polarity  certainty  time  Factuality Value
MAIN TASK   baseline    0.94      0.86       0.71  0.67
            FactPro     0.92      0.83       0.74  0.72
PILOT TASK  baseline    0.80      0.69       0.55  0.47
            FactPro     0.79      0.66       0.60  0.56

Information Processing systems on Italian texts. To this end, a corpus of news annotated with temporal expressions, events and temporal relations was released. This corpus is a revised and simplified version of Ita-TimeBank (Caselli et al. 2011) from which the training set of FactA was taken. In other words, EVENTI and FactA share the same broad notion of event as inherited from TimeML.

Non-EVALITA Tasks. The task on Event Factuality Classification for Dutch was organised in the context of CLIN26, the 26th Meeting of Computational Linguistics in the Netherlands, in 2016. This task and the one run at EVALITA shared in part the same type of data (i.e., Wikinews articles) and the same annotation guidelines. However, in CLIN26 participants were asked to classify only the certainty and polarity of events. Two groups developed rule-based systems and submitted results: the best system (RuGGed) obtained an F-score of 96.10 for certainty and of 88.20 for polarity (Minard et al. 2016). Other issues related to factuality, such as subjectivity, hedging and modality, are instead the focus of other evaluation exercises. In 2005, the ACE Event Detection and Recognition Task22 asked participants to distinguish between asserted, hypothetical, desired, and promised events by assigning the correct modality value. This evaluation covered different types of texts, such as news and weblogs, both in English and Chinese. The same languages, plus Spanish, are taken into consideration in the TAC KBP Event Tracks organised in 2015, 2016 and 201723. In particular, the Event Nugget task is based on the Rich ERE Annotation Guidelines, where each event has an attribute, called Realis, indicating whether or not that event occurred (Mitamura et al. 2015). Uncertainty, negation and speculation in the biology domain have been addressed both in the CoNLL-2010 Shared Task (Farkas et al. 2010) and, since 2009, in the BioNLP evaluation (Kim et al. 2009), while the detection of the modality of clinically significant events is one of the topics of the 2012 i2b2 challenge and of the SemEval Clinical TempEval tasks in 2015, 2016 and 2017. The systems participating in these tasks adopt different machine learning approaches to detect clinical event modality (e.g., SVM and CRF), reaching an F1 of 0.86 both in the i2b2 challenge and in Clinical TempEval (Sun, Rumshisky, and Uzuner 2013; Bethard et al. 2016). In contrast, the detection of negation and speculation in BioNLP is still far from a level of practical use, with the best system having an F1 below 0.30 (Kim, Wang, and Yasunori 2013).

22 http://www.itl.nist.gov/iad/mig/tests/ace/2005/
23 https://tac.nist.gov/


2.5 ArtiPhon
2.5.1 Task Description, Data, and Evaluation Metrics
In the Articulatory Phone Recognition (ArtiPhon) task (Badino 2016b), participants had to build a speaker-dependent phone recognition system. The training set is delivered in the form of simple audio files and related orthographic transcriptions; for each audio file, a series of articulatory data aligned with the acoustic stimulus is available. The ArtiPhon task at EVALITA 2016 aimed at addressing two open questions in the speech sciences. The first was to evaluate recognition systems trained on a speech corpus with a given speaking style (read speech in this case, where speakers were asked to keep a constant speech rate) and tested on a corpus with a mismatched speaking style (the test set ranged from slow and hyper-articulated speech to fast and hypo-articulated speech). Training and testing data were from the same speaker. The second goal, which motivates the presence of measured articulatory movement data in the training and test corpora, was to investigate to what extent the increase in representational power obtained by adding an articulatory feature set could help in building ASR systems that are more robust to the effects of the mismatch problem and to other noise sources in the ASR domain. Recently, Badino (2016a) and Badino et al. (2016) have proposed an "articulatory" ASR based on deep neural networks (DNNs), extending to this specific domain the influence that DNNs are having in many other fields of speech technology. Task organizers also made available to participants a set of tools to apply DNNs to this challenge; please see the original paper (Badino 2016b) for further details. The evaluation metrics used here are taken from the SCLITE method, based on the Levenshtein distance algorithm and contained in the SCTK Toolkit24. In particular, results are expressed in terms of phoneme correct rate, percentage of substitutions, insertions, and deletions, and average phone error rate (PER henceforth).
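For illustration, the sketch below computes PER with the same Levenshtein alignment that underlies SCLITE scoring; it is a simplified stand-in for the SCTK tooling, not a reimplementation of it.

```python
# Minimal PER computation: align the hypothesis phone sequence against
# the reference with Levenshtein (edit) distance, then normalise by the
# reference length.
def phone_error_rate(reference, hypothesis):
    """reference, hypothesis: lists of phone labels."""
    n, m = len(reference), len(hypothesis)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i          # i deletions
    for j in range(m + 1):
        d[0][j] = j          # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[n][m] / n

# One substitution over four reference phones gives PER = 0.25.
print(phone_error_rate(["a", "r", "t", "i"], ["a", "l", "t", "i"]))
```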

2.5.2 Participating Systems and Results
Only one of the 6 research groups that expressed an interest in the task actually participated. This system, henceforth ISTC (Cosi 2016), did not make use of articulatory data. Two sets of results are presented for the evaluation of this system. In the first one, the system can be considered speaker-dependent, as both training and test datasets were recorded by the same speaker. In the second set of runs, ISTC was trained on a different training corpus and tested on the ArtiPhon test set, thus producing a speaker-independent answer set. ISTC adapted for Italian the KALDI ASR toolkit25, which contains two different DNN setups, namely Dan's and Karel's (Veselý et al. 2013). For both answer sets, a variety of features, feature combinations, and processing methods were used, giving rise to a complex scenario of possible evaluations. As expected, results obtained in the Speaker Dependent sessions are higher than those obtained in the Speaker Independent ones. In the former case, PER was brought down to 15.1% when a combination of Gaussian Mixture Model (GMM) and DNN (Dan's) methods was used. In the Speaker Independent case, the combination GMM+DNN (Dan's) also performs well in terms of PER; however, the absolute best result is obtained with Karel's DNN (PER at 26.9%). A direct comparison between these results and the baseline obtained by the organiser is not straightforward, because the set of phones used in the evaluation is different from (i.e., greater than) that used by ISTC; furthermore, the task organiser presented his results considering performance across speaking styles (from hyper- to hypo-articulated speech). This said, the best performance obtained by the organiser has a PER of 23.5% on hyper-articulated speech using both acoustic and actual articulatory features.

24 https://www.nist.gov/itl/iad/mig/tools 25 http://kaldi-asr.org/


2.5.3 Links to Other Shared Tasks
Previous EVALITA Tasks. ASR and speech technology related tasks have been organised within EVALITA since its second edition in 2009. Tasks ranged from continuous digit recognition to forced alignment (Matassoni, Brugnara, and Gretter 2013; Cutugno, Origlia, and Seppi 2013; Cosi et al. 2014). In these cases, however, the evaluation exercises were approached in a more classic way, with the ultimate aim of establishing a benchmark that had never before been available for Italian. The speaker-independence dimension introduced at ArtiPhon this year constitutes a novelty in speech tasks for Italian.

Non EVALITA Tasks. In recent years, international ASR evaluation campaigns have become sparser26 and focused more on applications27 than on ASR in itself. In particular, considerable attention is now given to new domains such as emotion recognition and speech measurements for affective computing28, elimination of reverberation29, speech analytics30, and automatic spoken translation31. ASR involving articulatory features is still a developing field, and evaluation is made difficult by the overall limited availability of annotated training and test sets.

2.6 QA4FAQ
2.6.1 Task Description, Data and Evaluation Metrics
The goal of this task is to develop a system that retrieves a list of relevant FAQs and corresponding answers related to a query issued by a user (Caputo et al. 2016). To define an evaluation protocol, we need a set of FAQs, a set of user questions and a set of relevance judgements for each question. In order to collect these data, we exploited an application called AQP Risponde, developed by QuestionCube for the Acquedotto Pugliese. AQP Risponde provides a back-end that makes it possible to analyze both the query log and the customers' feedback to discover, for instance, new emerging problems that need to be encoded as FAQs. AQP received about 25,000 questions and collected about 2,500 items of user feedback. We relied on these data to build the dataset for the task. In particular, we provide:

- a knowledge base of 406 FAQs. Each FAQ is composed of a question, an answer, and a set of tags;
- a set of 1,132 user queries, collected by analysing the AQP Risponde system log. From the initial set of queries, we removed queries that contain personal data;
- a set of 1,406 <query, relevant FAQ> pairs that are exploited to evaluate the contestants. We built these pairs by analysing the feedback provided by real users of AQP Risponde. We manually checked the user feedback in order to remove

26 https://www.nist.gov/itl/iad/mig/past-hlt-evaluation-projects
27 See, for example, https://asru2017.org/Challenges.asp
28 http://emotion-research.net/sigs/speech-sig/is16-compare
29 https://reverb2014.dereverberation.com/
30 https://www.nist.gov/itl/iad/mig/nist-2017-pilot-speech-analytic-technologies-evaluation
31 http://www.iwslt.org


noisy or false feedback. The check was performed by two experts of the AQP customer support. We provided a small sample set for system development and a test set for the evaluation. We did not provide a set of training data: AQP is interested in the development of unsupervised systems, because AQP Risponde must be able to achieve good performance without any user feedback. An example of FAQ is reported below:

Question: "Come posso telefonare al numero verde da un cellulare?" (How can I call the toll-free number from a mobile phone?)

Answer: "È possibile chiamare il Contact Center AQP per segnalare un guasto o per un pronto intervento telefonando gratuitamente anche da cellulare al numero verde 800.735.735. Mentre per chiamare il Contact Center AQP per servizi commerciali 800.085.853 da un cellulare e dall'estero è necessario comporre il numero +39.080.5723498 (il costo della chiamata è secondo il piano tariffario del chiamante)." (You can call the AQP Contact Center free of charge, also from a mobile phone, to report a fault or request emergency assistance by dialling the toll-free number 800 735 735...)

Tags: canali, numero verde, cellulare

For example, the previous FAQ is relevant for the query: "Si può telefonare da cellulare al numero verde?" (Is it possible to call the toll-free number from a mobile phone?)

FAQs are provided in both XML and CSV format, using ";" as separator; the file is encoded in UTF-8. Each FAQ is described by the following fields:

- id: a number that uniquely identifies the FAQ;
- question: the question text of the current FAQ;
- answer: the answer text of the current FAQ;
- tag: a set of tags separated by ",".

Test data are provided as a text file in which each line consists of two strings separated by the TAB character: the first is the user query id, while the second is the text of the user query. For example: "1 Come posso telefonare al numero verde da un cellulare?" and "2 Come si effettua l'autolettura del contatore?". Moreover, we provided a simple baseline based on a classical information retrieval model, built using Apache Lucene (ver. 4.10.4)32. During indexing, a document with four fields (id, question, answer, tag) is created for each FAQ. For searching, a query is built for each question taking into account all the question terms. Each field is boosted according to the following scores: question=4, answer=2 and tag=1. For both indexing and search the ItalianAnalyzer is adopted. The top 25 documents for each query are returned as the result set. The baseline is freely available on GitHub33 and was released to participants after the evaluation period; a sketch of its boosted-field strategy is given below. Participants must provide their results in a text file: for each query in the test data, they can provide at most 25 answers, ranked by their systems, and each line of the file must contain three values separated by the TAB character: <query id> <faq id> <score>. Systems are ranked according to accuracy@1 (c@1): we compute the precision of the system by taking into account only the first correct answer. This metric is used for the final ranking of systems. In particular, we also take into account the number of unanswered questions, following the guidelines of the CLEF ResPubliQA Task (Peñas et al. 2009).
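The following sketch approximates the Lucene-based baseline described above. It only illustrates the boosted-field strategy (question=4, answer=2, tag=1) and the top-25 cut-off: the naive whitespace tokenizer and raw term-frequency scores are stand-ins for Lucene's ItalianAnalyzer and ranking model, which this toy code does not reproduce.

```python
# Toy approximation of the boosted-field FAQ retrieval baseline.
BOOSTS = {"question": 4.0, "answer": 2.0, "tag": 1.0}

def tokenize(text):
    # Stand-in for Lucene's ItalianAnalyzer (no stemming, no stopwords).
    return text.lower().split()

def score_faq(query_terms, faq):
    """Sum, over the three searchable fields, the boosted term
    frequencies of the query terms in that field."""
    score = 0.0
    for field, boost in BOOSTS.items():
        field_terms = tokenize(faq[field])
        score += boost * sum(field_terms.count(t) for t in query_terms)
    return score

def search(query, faqs, k=25):
    """Return the top-k (faq id, score) pairs for one user query."""
    terms = tokenize(query)
    ranked = sorted(faqs, key=lambda f: score_faq(terms, f), reverse=True)
    return [(f["id"], score_faq(terms, f)) for f in ranked[:k]]
```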

32 http://lucene.apache.org/ 33 https://github.com/swapUniba/qa4faq


Table 12
QA4FAQ results in terms of accuracy, with details about the main characteristics of the participating systems

TEAM ID     c@1     METHOD                    EMBEDDINGS  OTHER RESOURCES
chiLab4It   0.4439  QuASIt (cognitive model)  -           Wiktionary
baseline    0.4076  Lucene BM25               -           no
fbk4faq     0.3746  Vector similarity         word        no
NLP-NITMZ   0.2125  VSM + Apache Nutch        -           no

The formulation of c@1 is:

\[
c@1 = \frac{1}{n}\left(n_R + n_U \cdot \frac{n_R}{n}\right) \tag{2}
\]

where n_R is the number of questions correctly answered, n_U is the number of questions left unanswered, and n is the total number of questions. A system should not return an answer for a question when it is not confident about its correctness: the goal is to reduce the number of incorrect responses by leaving some questions unanswered, while keeping the number of correct answers as high as possible. Systems should ensure that only wrong answers are withheld, since any reduction in the number of correct answers is punished by the evaluation measure, for both the answered and the unanswered questions.
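The measure can be computed directly from these three counts, as in the following sketch:

```python
# Sketch of the c@1 measure in Equation (2): unanswered questions earn
# partial credit, proportional to the system's accuracy on the
# questions it did answer.
def c_at_1(n_correct, n_unanswered, n_total):
    return (n_correct + n_unanswered * (n_correct / n_total)) / n_total

# 400 correct answers and 100 of 1000 questions left unanswered:
# c@1 = (400 + 100 * 0.4) / 1000 = 0.44, higher than the 0.40 obtained
# by answering those 100 questions incorrectly.
print(c_at_1(400, 100, 1000))
```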

2.6.2 Participating Systems and Results
Thirteen teams registered for the task, but only three of them submitted results for the evaluation. A short description of each system, with its best performance, is reported in Table 12. The systems adopt different strategies, and only one of them (chiLab4It) is based on a typical question answering module. The best performance is obtained by the chiLab4It team (Pipitone, Tirone, and Pirrone 2016), the only one able to outperform the baseline. Since chiLab4It is also the only system exploiting question answering techniques, its good performance demonstrates the effectiveness of question answering in the FAQ domain. All the other participants obtained results below the baseline. Another interesting outcome is that the baseline, which exploits a simple vector space model, achieved remarkable results. In (Fonseca et al. 2016), the authors built a custom development set by paraphrasing the original questions or generating new ones (based on the original FAQ answers), without considering the original FAQ questions. The interesting result is that their system outperformed the baseline on this development set. The authors underline that the development set is quite different from the test set, which sometimes contains short queries and more realistic user requests. This is an interesting observation, since one of the main challenges of our task concerns the variety of language expressions adopted by customers to formulate their information needs.


2.6.3 Links to Other Shared Tasks
Previous EVALITA Tasks. No other task related to QA4FAQ was organized in the previous editions of EVALITA.

Non EVALITA Tasks. The QA4FAQ task is strongly related to the "Answer Selection in Community Question Answering" task recently organized in the context of SemEval 2015 and 2016 (Nakov et al. 2015). That task helps to automate the process of finding good answers to new questions in a community-created discussion forum (e.g., by retrieving similar questions in the forum and by identifying the posts in the answer threads of similar questions that also answer the original one). Moreover, QA4FAQ has some points in common with the Textual Similarity task (Agirre et al. 2015), which has received an increasing amount of attention in recent years.

2.7 Application Challenge

In 2016, for the first time, EVALITA included an application challenge organised by IBM Italy34. The aim of the challenge was to award the most innovative apps employing the services available on Bluemix, the platform of IBM APIs. More specifically, participants were required to build their apps on top of at least one of the Watson APIs for cognitive computing supporting the Italian language. Submitted systems were evaluated by a panel of judges composed of academics and IBM representatives, taking into consideration the creativity and viability of the use case, the intuitiveness of the user experience, the value of the app, and the feasibility and uniqueness of the implementation. The three best submissions, described below and listed according to their final ranking, received a monetary prize.

1. Stockle is a sentiment analysis API and web application focused on the stock trade market. The APIs of AlchemyData News, Yahoo Finance, reddit and Twitter are used to retrieve comments, tweets and news related to a set of companies selected by the user, and the sentiment towards these companies is analysed by the sentiment analysis module of the Alchemy API35.
2. MARTIN (Monitoring and Analysing Real-time Tweets in Italian Natural language)36 is a stand-alone application that makes it possible to scan real-time information on Twitter, compare tweets by pairs of Twitter accounts, analyse the language of tweets and visualize the output of these analyses. To this end, the Twitter APIs are combined with the natural language understanding modules for sentiment analysis and keyword extraction provided by the Alchemy APIs.
3. Appetitoso ChatBot is a dialog system connected to the mobile application and the website of Appetitoso37, a recommendation service that searches for restaurants on the basis of the dishes desired by the user. The IBM Watson Conversation Module is used to create a conversation between the user and the application through an exchange of written messages.

34 http://www.evalita.it/2016/tasks/ibm-challenge
35 https://gingerbeard.alwaysdata.net/stockle/
36 https://dh.fbk.eu/technologies/martin
37 http://www.appetitoso.it/


Table 13
Number of tweets of cross-task shared data. A * indicates instead the number of sentences from newswire documents.

TRAIN      SENTIPOLC  NEEL-it  PoSTWITA  FactA
SENTIPOLC  7410       989      6412      0
NEEL-it    989        1000     0         0
PoSTWITA   6412       0        6419      0
FactA      0          0        0         2723*

TEST       SENTIPOLC  NEEL-it  PoSTWITA  FactA
SENTIPOLC  2000       301      301       301
NEEL-it    301        301      301       301
PoSTWITA   301        301      301       301
FactA      301        301      301       597*+301

3. Cross-task Shared Data

One of the greatest benefits of evaluation campaigns is the creation of benchmark data for a variety of tasks. This requires quite some effort on the part of the organisers, for the development of guidelines and, mostly, for the manual annotation needed to create gold standard sets. One little exploited advantage of such data creation efforts is the possibility of adding layers of annotation related to different phenomena over exactly the same data, so as to facilitate and promote the development and testing of end-to-end systems (Basile et al. 2015). With this in mind, we encouraged task organisers to share datasets so as to annotate the same instances, each task with its respective layer. The tasks involved were SENTIPOLC, PoSTWITA, NEEL-it and FactA. In this section we provide an overview of the shared data in terms of number of instances, together with an example of how the different annotations look over the same sample tweet. The matrix in Table 13 shows the total number of instances per task (on the diagonal) as well as the number of overlapping instances for each task pair. Please note that while the datasets of SENTIPOLC, NEEL-it, and PoSTWITA were composed entirely of tweets, both as training and test data, FactA included tweets only in one of its test sets, as a pilot task. FactA's training and standard test sets are composed of newswire data, which we report in terms of number of sentences. For this reason the number of instances in Table 13 is broken down for FactA's test set: 597 newswire sentences and 301 tweets, the latter being the same as in the other tasks. This first attempt at creating shared data across tasks was completely successful in terms of test data, as the test sets for all four tasks comprise exactly the same 301 tweets, although for SENTIPOLC and FactA this is only a portion of a larger test set. Regarding the training sets, which are obviously larger, there are overlaps, but these are not complete. Specifically, the training sets of PoSTWITA and NEEL-it are almost entirely subsets of SENTIPOLC: 989 of the 1000 tweets that make up NEEL-it's training set are in SENTIPOLC, 6412 of PoSTWITA's 6419 training tweets are also included in the SENTIPOLC training set, and only the training data that SENTIPOLC shares with NEEL-it is not included in PoSTWITA.


Table 14
Annotation of the tweet @juventusfc E come se vogliamo vincerla, forza ragazzi!!!!!!!, with id 601071129810309120

FactA
id      begin  end  element  polarity  certainty    time
601...  31     39   EVENT    POS       NON_CERTAIN  FUTURE

NEEL-it
id      begin  end  link                                       type
601...  1      11   http://dbpedia.org/resource/Juventus_F.C.  Organization

SENTIPOLC
id      subj  opos  oneg  iro  lpos  lneg
601...  1     1     0     0    1     0

PoSTWITA
@juventusfc/MENTION E/CONJ come/ADP se/SCONJ vogliamo/AUX vincerla/VERB_CLIT ,/PUNCT forza/NOUN ragazzi/NOUN !!!!!!!/PUNCT

In Table 14 we show how the very same tweet – @juventusfc E come se vogliamo vincerla, forza ragazzi!!!!!!! (with id 601071129810309120, from the EVALITA 2016 distribution) – has been annotated according to the guidelines of the four tasks. Currently, the annotations overlap, but in practice they are still encoded in separate, unconnected files. In the near future we plan to develop and share specific standards and tools that will allow such annotations to be practically linked and knit together, so that current and future single-layer annotations of different phenomena over the same data can be exploited simultaneously; a minimal sketch of this idea is given below. We believe the cross-task data produced within EVALITA 2016 is an excellent starting point towards producing more data enriched with as many layers of annotation as possible, especially in relation to the EVALITA shared tasks. In order to further promote this, we have also set up a repository to collect and keep track of data creation and development for Italian. All 2016 datasets are available online on the GitHub account of EVALITA 2016: https://github.com/evalita2016/data.
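As a hint of what such linking could look like, the sketch below merges per-task annotation files keyed by tweet id into a single multi-layer record per tweet. The file names and the assumption of JSON records carrying a tweet_id field are hypothetical: the current distributions use task-specific formats.

```python
# Group each task's annotation records by tweet id into one
# multi-layer structure, so the four layers can be consumed together.
import json

def merge_layers(layer_files):
    """layer_files: dict mapping task name -> path to a JSON file whose
    records all carry a 'tweet_id' field (a hypothetical format)."""
    merged = {}
    for task, path in layer_files.items():
        with open(path, encoding="utf-8") as f:
            for record in json.load(f):
                merged.setdefault(record["tweet_id"], {})[task] = record
    return merged

layers = merge_layers({
    "sentipolc": "sentipolc.json",   # hypothetical file names
    "neel_it": "neel_it.json",
    "postwita": "postwita.json",
    "facta": "facta.json",
})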

4. The EVALITA Community

The tasks and the challenge of EVALITA 2016 attracted the interest of a large number of researchers, for a total of 96 individual registrations. Overall, 34 teams composed of more than 60 individual participants submitted their results to one or more tasks of the campaign. A breakdown of the figures per task is shown in Table 15.

As for the members of the teams, they were affiliated with more than 20 different institutions; 4% belong to private companies and 26% work outside Italy38.

Table 15
Registered and actual participants of EVALITA 2016

task           registered  actual
ARTIPHON       6           1
FactA          13          0
NEEL-IT        16          5
QA4FAQ         13          3
PoSTWITA       18          9
SENTIPOLC      24          13
IBM Challenge  6           3
total          96          34

With respect to the 2014 edition, we collected a significantly higher number of registrations (96 vs the 55 collected in 2014), which can be interpreted as a signal that we succeeded in reaching a wider audience of researchers interested in participating in the campaign. This result may also have been positively affected by the novelties introduced in this edition to improve the dissemination of information on EVALITA, e.g., the use of social media such as Twitter and Facebook. The number of teams that actually submitted their runs also increased in 2016 (34 teams vs the 23 participating in the 2014 edition), even though there remained a substantial gap between the number of actual participants and the number of those who registered. In order to better investigate this issue and gather some insights into the reasons for the significant drop in the number of participants with respect to the registrations collected, we ran an online questionnaire specifically designed for those who did not submit any run to the task for which they had registered. In two weeks we collected 14 responses, which show that the main obstacles to actual participation in a task were related to personal issues ("I had an unexpected personal or professional problem outside EVALITA" or "I underestimated the effort needed") or personal choices ("I gave priority to other EVALITA tasks"). As for this last point, NEEL-it and SENTIPOLC were preferred to FactA, which did not have any participants. Another problem mentioned by some of the respondents is that the evaluation period was too short: this issue was highlighted mostly by those who had registered for more than one task. However, the gap between registered and actual participants is not new for EVALITA but also affected all the previous editions of the campaign, as shown in Figure 2. Another general trend in EVALITA is the high percentage of new participants (always above 60%): in particular, Figure 3 highlights that in 2016, 70% of the members of participating teams were at their first experience in the campaign. Twenty-five researchers took part in the task organization: 16% were affiliated with foreign institutions and 20% were representatives of private companies. This last percentage demonstrates that it is easier to involve industrial companies in the organization of tasks than in the participation, possibly due to some reluctance on the part of companies to expose themselves to potential bad publicity should they end up in the lower portions of the rankings. Organizers had 18 different affiliations, and all tasks had at least two organizers from two different institutions. This indicates that organizing a task is a great way to boost cooperation, as shown in Figure 4, where node colour represents affiliations while edge colour indicates tasks39.

38 In Brazil, France, Germany, India, Ireland, Mexico, The Netherlands, Spain, and Switzerland.


Figure 2
Registered and actual participants of EVALITA campaigns

Figure 3
Percentage of new versus recurrent participants in EVALITA

Figure 4
Co-organisation of EVALITA 2016 tasks. Node colours characterise affiliations, so that same colour means same institution; edge colours indicate tasks.

5. Conclusions

EVALITA 2016 was a successful edition, certainly in terms of participation and results, but also in terms of data creation and dissemination, especially with the now available GitHub repository, and in terms of collaboration across tasks.

39 A video showing a dynamic network of task organisers in EVALITA from 2007 to 2016 is available online: http://www.evalita.it/EVALITAcommunity


As for future editions, we suggest that prospective EVALITA chairs continue on the path started in 2016 by working towards an ever stronger involvement of representatives from companies in the organisation of tasks, a balance between research and application tasks, and an ever-increasing development of shared and open data. These three aspects proved useful in boosting cooperation between different private and public institutions and in attracting new researchers to the EVALITA community. Moreover, social media texts turned out to be a very attractive domain, and there is still plenty left to explore. Even within Twitter, sampling data using different strategies has proved potentially challenging for systems, especially for some tasks (e.g., PoSTWITA), so future editions could include more explicit domain adaptation tasks, still within the general domain of social media (see, for example, what has been done for author profiling at PAN 2016 (Rangel et al. 2016)). Obviously, domains other than social media could be explored as well. For instance, the Humanities emerged as one of the most appealing domains in the questionnaires for industrial companies and former participants, and other countries are organising evaluation exercises on it (see, for example, the Translating Historical Text shared task at CLIN 2740). Other innovations can be envisaged for the next campaigns, too, also from an organisational perspective. For example, different evaluation windows for different tasks could be planned instead of having the same evaluation deadlines for all. This flexibility would have an impact on the workload of EVALITA's organisers but, on the other hand, would help teams to participate in multiple tasks without forcing them to concentrate their effort on only one task due to lack of time. As in past editions, EVALITA 2016 served as the optimal forum for the creation and discussion of the most challenging tasks for Italian NLP. Additionally, collaboration between task organisers and between academia and industry was more fruitful than ever. We hope that this kind of active and open cooperation will be carried forward in future editions, too, and that the repository of shared data created in the context of the 2016 edition will continue to be populated, so as to form the reference benchmark for Italian data in a variety of Natural Language Processing tasks.

Acknowledgements

EVALITA 2016 would not have been possible without the invaluable work of those who proposed and organised the tasks, without the participants, and without the support of AILC and especially IBM, who organised and sponsored the challenge included in this edition. We are enormously grateful to all of them. We would also like to thank the organisers of CLiC-it 2016 for hosting the final workshop in Naples, co-located with the conference, and Walter Daelemans for being our invited speaker. Finally, the reviewers provided insightful comments from which the final version of this paper has benefited.

References

Agirre, Eneko, Carmen Banea, Claire Cardie, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Weiwei Guo, Inigo Lopez-Gazpio, Montse Maritxalar, Rada Mihalcea, German Rigau, Larraitz Uria, and Janyce Wiebe. 2015. SemEval-2015 Task 2: Semantic textual similarity, English, Spanish and pilot on interpretability. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pages 252–263, Denver, Colorado, USA, June 4-5.

40 http://www.ccl.kuleuven.be/CLIN27/


Attardi, Giuseppe, Daniele Sartiano, Chiara Alzetta, and Federica Semplici. 2016a. Convolutional neural networks for sentiment analysis on Italian tweets. In Pierpaolo Basile, Anna Corazza, Francesco Cutugno, Simonetta Montemagni, Malvina Nissim, Viviana Patti, Giovanni Semeraro, and Rachele Sprugnoli, editors, Proceedings of Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA), Naples, Italy, December 5-7. Accademia University Press.

Attardi, Giuseppe, Daniele Sartiano, Maria Simi, and Irene Sucameli. 2016b. Using Embeddings for Both Entity Recognition and Linking in Tweets. In Pierpaolo Basile, Anna Corazza, Franco Cutugno, Simonetta Montemagni, Malvina Nissim, Viviana Patti, Giovanni Semeraro, and Rachele Sprugnoli, editors, Proceedings of Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016), Naples, Italy, December 5-7. Accademia University Press.

Attardi, Giuseppe and Maria Simi. 2016. Character Embeddings PoS Tagger vs HMM Tagger for Tweets. In Pierpaolo Basile, Anna Corazza, Franco Cutugno, Simonetta Montemagni, Malvina Nissim, Viviana Patti, Giovanni Semeraro, and Rachele Sprugnoli, editors, Proceedings of Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016), Naples, Italy, December 5-7. Accademia University Press.

Badino, Leonardo. 2016a. Phonetic Context Embeddings for DNN-HMM Phone Recognition. In Proceedings of the 17th Annual Conference of the International Speech Communication Association (INTERSPEECH 2016): Understanding Speech Processing in Humans and Machines, San Francisco, California, USA, September 8-12.

Badino, Leonardo. 2016b. The ArtiPhon Challenge at Evalita 2016. In Pierpaolo Basile, Franco Cutugno, Malvina Nissim, Viviana Patti, and Rachele Sprugnoli, editors, Proceedings of Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA), Naples, Italy, December 5-7. Accademia University Press.

Badino, Leonardo, Claudia Canevari, Luciano Fadiga, and Giorgio Metta. 2016. Integrating articulatory data in deep neural network-based acoustic modeling. Computer Speech & Language, 36:173–195.

Barbieri, Francesco, Valerio Basile, Danilo Croce, Malvina Nissim, Nicole Novielli, and Viviana Patti. 2016. Overview of the EVALITA 2016 SENTiment POLarity Classification Task. In Pierpaolo Basile, Franco Cutugno, Malvina Nissim, Viviana Patti, and Rachele Sprugnoli, editors, Proceedings of Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016), Naples, Italy, December 5-7. Accademia University Press.

Bartalesi Lenzi, Valentina, Manuela Speranza, and Rachele Sprugnoli. 2013. EVALITA 2011: Description and Results of the Named Entity Recognition on Transcribed Broadcast News Task. In Evaluation of Natural Language and Speech Tools for Italian. Revised Selected Papers of the EVALITA 2011 International Workshop, pages 86–97. Springer. Rome, Italy, January 24-25, 2012.
Basile, Pierpaolo, Valerio Basile, Malvina Nissim, and Nicole Novielli. 2015. Deep Tweets: from Entity Linking to Sentiment Analysis. In Cristina Bosco, Sara Tonelli, and Fabio Massimo Zanzotto, editors, Proceedings of the Second Italian Conference on Computational Linguistics (CLiC-it 2015), pages 41–46, Trento, Italy, December 3-4. Accademia University Press.

Basile, Pierpaolo, Annalina Caputo, Anna Lisa Gentile, and Giuseppe Rizzo. 2016. Overview of the EVALITA 2016 Named Entity rEcognition and Linking in Italian Tweets (NEEL-IT) Task. In Pierpaolo Basile, Franco Cutugno, Malvina Nissim, Viviana Patti, and Rachele Sprugnoli, editors, Proceedings of Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA), Naples, Italy, December 5-7. Accademia University Press.

Basile, Pierpaolo, Annalina Caputo, and Giovanni Semeraro. 2015. Entity Linking for Italian Tweets. In Cristina Bosco, Sara Tonelli, and Fabio Massimo Zanzotto, editors, Proceedings of the Second Italian Conference on Computational Linguistics (CLiC-it 2015), pages 36–40, Trento, Italy, December 3-4. Accademia University Press.

Basile, Valerio, Andrea Bolioli, Malvina Nissim, Viviana Patti, and Paolo Rosso. 2014. Overview of the Evalita 2014 SENTIment POLarity Classification Task. In Proceedings of the First Italian Conference on Computational Linguistics CLiC-it 2014 & the Fourth International Workshop EVALITA 2014, pages 50–57, Pisa, Italy, 9-11 December. Pisa University Press.


Basile, Valerio and Malvina Nissim. 2013. Sentiment analysis on Italian tweets. In Proceedings of the 4th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 100–107, Atlanta, Georgia, June. Association for Computational Linguistics.

Beißwenger, Michael, Sabine Bartsch, Stefan Evert, and Kay-Michael Würzner. 2016. EmpiriST 2015: A shared task on the automatic linguistic annotation of computer-mediated communication and web corpora. In Proceedings of the 10th Web as Corpus Workshop (WAC-X) and the EmpiriST Shared Task, pages 44–56, Berlin, Germany, August 7-12.

Bernardi, Raffaella, Andrea Bolognesi, Corrado Seidenari, and Fabio Tamburini. 2005. Automatic induction of a POS tagset for Italian. In Proceedings of the Australasian Language Technology Workshop, pages 176–183, Sydney, Australia, 10-11 December.

Bethard, Steven, Guergana Savova, Wei-Te Chen, Leon Derczynski, James Pustejovsky, and Marc Verhagen. 2016. SemEval-2016 Task 12: Clinical TempEval. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 1052–1062, San Diego, California, June. Association for Computational Linguistics.

Bosco, Cristina, Viviana Patti, and Andrea Bolioli. 2013. Developing Corpora for Sentiment Analysis: The Case of Irony and Senti-TUT. IEEE Intelligent Systems, Special Issue on Knowledge-based Approaches to Content-level Sentiment Analysis, 28(2):55–63.

Caputo, Annalina, Marco de Gemmis, Pasquale Lops, Franco Lovecchio, and Vito Manzari. 2016. Overview of the EVALITA 2016 Question Answering for Frequently Asked Questions (QA4FAQ) Task. In Pierpaolo Basile, Franco Cutugno, Malvina Nissim, Viviana Patti, and Rachele Sprugnoli, editors, Proceedings of Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA), Naples, Italy, December 5-7. Accademia University Press.

Carmel, David, Ming-Wei Chang, Evgeniy Gabrilovich, Bo-June Paul Hsu, and Kuansan Wang. 2014. ERD'14: entity recognition and disambiguation challenge. In ACM SIGIR Forum, volume 48, pages 63–77. ACM.

Caselli, Tommaso, Valentina Bartalesi Lenzi, Rachele Sprugnoli, Emanuele Pianta, and Irina Prodanof. 2011. Annotating events, temporal expressions and relations in Italian: the It-TimeML experience for the Ita-TimeBank. In Proceedings of the 5th Linguistic Annotation Workshop, pages 143–151, Portland, Oregon, USA, June 23-24. Association for Computational Linguistics.

Caselli, Tommaso, Rachele Sprugnoli, Manuela Speranza, and Monica Monachini. 2014. EVENTI: EValuation of Events and Temporal INformation at Evalita 2014. In Proceedings of the First Italian Conference on Computational Linguistics CLiC-it 2014 & the Fourth International Workshop EVALITA 2014, pages 27–34, Pisa, Italy, 9-11 December. Pisa University Press.

Castellucci, Giuseppe, Danilo Croce, and Roberto Basili. 2016. Context-aware convolutional neural networks for Twitter sentiment analysis in Italian. In Proceedings of Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA), Naples, Italy, December 5-7. Accademia University Press.
Cimino, Andrea and Felice Dell'Orletta. 2016. Building the state-of-the-art in POS tagging of Italian Tweets. In Pierpaolo Basile, Anna Corazza, Franco Cutugno, Simonetta Montemagni, Malvina Nissim, Viviana Patti, Giovanni Semeraro, and Rachele Sprugnoli, editors, Proceedings of Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016), Naples, Italy, December 5-7. Accademia University Press.

Corcoglioniti, Francesco, Alessio Palmero Aprosio, Yaroslav Nechaev, and Claudio Giuliano. 2016. MicroNeel: Combining NLP Tools to Perform Named Entity Detection and Linking on Microposts. In Pierpaolo Basile, Anna Corazza, Franco Cutugno, Simonetta Montemagni, Malvina Nissim, Viviana Patti, Giovanni Semeraro, and Rachele Sprugnoli, editors, Proceedings of Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016), Naples, Italy, December 5-7. Accademia University Press.

Cosi, Piero. 2016. Phone Recognition Experiments on ArtiPhon with KALDI. In Pierpaolo Basile, Anna Corazza, Francesco Cutugno, Simonetta Montemagni, Malvina Nissim, Viviana Patti, Giovanni Semeraro, and Rachele Sprugnoli, editors, Proceedings of Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016), Naples, Italy, December 5-7. Accademia University Press.


Cosi, Piero, Vincenzo Galatà, Francesco Cutugno, and Antonio Origlia. 2014. Forced Alignment on Children Speech. In Proceedings of the First Italian Conference on Computational Linguistics CLiC-it 2014 & the Fourth International Workshop EVALITA 2014, pages 124–126, Pisa, Italy, 9-11 December. Pisa University Press.

Cutugno, Francesco, Antonio Origlia, and Dino Seppi. 2013. EVALITA 2011: Forced alignment task. In Evaluation of Natural Language and Speech Tools for Italian. Springer, pages 305–311.

Di Rosa, Emanuele and Alberto Durante. 2016. Tweet2Check evaluation at Evalita Sentipolc 2016. In Proceedings of Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA), Naples, Italy, December 5-7. Accademia University Press.

Farkas, Richárd, Veronika Vincze, György Móra, János Csirik, and György Szarvas. 2010. The CoNLL-2010 shared task: learning to detect hedges and their scope in natural language text. In Proceedings of the Fourteenth Conference on Computational Natural Language Learning - Shared Task, pages 1–12, Uppsala, Sweden, 15-16 July. Association for Computational Linguistics.

Fonseca, Erick R., Simone Magnolini, Anna Feltracco, Mohammed R. H. Qwaider, and Bernardo Magnini. 2016. Tweaking Word Embeddings for FAQ Ranking. In Pierpaolo Basile, Anna Corazza, Franco Cutugno, Simonetta Montemagni, Malvina Nissim, Viviana Patti, Giovanni Semeraro, and Rachele Sprugnoli, editors, Proceedings of Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016), Naples, Italy, December 5-7. Accademia University Press.

Ghosh, Aniruddha, Guofu Li, Tony Veale, Paolo Rosso, Ekaterina Shutova, Antonio Reyes, and Jhon Barnden. 2015. SemEval-2015 Task 11: Sentiment analysis of figurative language in Twitter. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pages 470–475, Denver, Colorado, USA, June 4-5.

Horsmann, Tobias and Torsten Zesch. 2016. Building a Social Media Adapted PoS Tagger Using FlexTag - A Case Study on Italian Tweets. In Pierpaolo Basile, Anna Corazza, Franco Cutugno, Simonetta Montemagni, Malvina Nissim, Viviana Patti, Giovanni Semeraro, and Rachele Sprugnoli, editors, Proceedings of Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016). Accademia University Press.

Kim, Jin-Dong, Tomoko Ohta, Sampo Pyysalo, Yoshinobu Kano, and Jun'ichi Tsujii. 2009. Overview of BioNLP'09 shared task on event extraction. In Proceedings of the Workshop on Current Trends in Biomedical Natural Language Processing: Shared Task, pages 1–9, Boulder, Colorado, USA, June 4-5. Association for Computational Linguistics.

Kim, Jin-Dong, Yue Wang, and Yamamoto Yasunori. 2013. The GENIA event extraction shared task, 2013 edition - overview. In Proceedings of the BioNLP Shared Task 2013 Workshop, pages 8–15, Sofia, Bulgaria, August 8-9. Association for Computational Linguistics.

Luo, Xiaoqiang. 2005. On coreference resolution performance metrics. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 25–32, Vancouver, B.C., Canada, October 6-8. Association for Computational Linguistics.
Matassoni, Marco, Fabio Brugnara, and Roberto Gretter. 2013. EVALITA 2011: Automatic Speech Recognition Large Vocabulary Transcription. In Evaluation of Natural Language and Speech Tools for Italian. Revised Selected Papers of the EVALITA 2011 International Workshop, pages 274–285. Springer. Rome, Italy, January 24-25, 2012.

Minard, Anne-Lyse, Alessandro Marchetti, and Manuela Speranza. 2014. Event Factuality in Italian: Annotation of News Stories from the Ita-TimeBank. In Proceedings of the First Italian Conference on Computational Linguistics CLiC-it 2014, Pisa, Italy, 9-10 December.

Minard, Anne-Lyse, R. H. Mohammed Qwaider, and Bernardo Magnini. 2016. FBK-NLP at NEEL-IT: Active Learning for Domain Adaptation. In Pierpaolo Basile, Anna Corazza, Franco Cutugno, Simonetta Montemagni, Malvina Nissim, Viviana Patti, Giovanni Semeraro, and Rachele Sprugnoli, editors, Proceedings of Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016), Naples, Italy, December 5-7. Accademia University Press.

Minard, Anne-Lyse, Manuela Speranza, and Tommaso Caselli. 2016. The EVALITA 2016 Event Factuality Annotation Task (FactA). In Pierpaolo Basile, Franco Cutugno, Malvina Nissim, Viviana Patti, and Rachele Sprugnoli, editors, Proceedings of Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA), Naples, Italy, December 5-7. Accademia University Press.


Minard, Anne-Lyse, Manuela Speranza, Marieke van Erp, Antske Fokkens, Marten Postma, Piek Vossen, Eneko Agirre, Itziar Aldabe, German Rigau, and Ruben Urizar. 2016. Evaluation tasks in open competitions. Deliverable D10.4. Technical report, NewsReader EU Project.

Mitamura, Teruko, Yukari Yamakawa, Susan Holm, Zhiyi Song, Ann Bies, Seth Kulick, and Stephanie Strassel. 2015. Event nugget annotation: Processes and issues. In Proceedings of the 3rd Workshop on EVENTS at the NAACL-HLT, pages 66–76, Denver, Colorado, USA, May 31 - June 5.

Monachini, Monica. 1995. ELM-IT: An Italian incarnation of the EAGLES-TS. Definition of lexicon specification and classification guidelines. Technical report.

Nakov, Preslav, Lluís Màrquez, Walid Magdy, Alessandro Moschitti, Jim Glass, and Bilal Randeree. 2015. SemEval-2015 Task 3: Answer Selection in Community Question Answering. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pages 269–281, Denver, Colorado, June. Association for Computational Linguistics.

Nakov, Preslav, Alan Ritter, Sara Rosenthal, Fabrizio Sebastiani, and Veselin Stoyanov. 2016. SemEval-2016 Task 4: Sentiment Analysis in Twitter. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 1–18, San Diego, California, June.

Owoputi, Olutobi, Brendan O'Connor, Chris Dyer, Kevin Gimpel, Nathan Schneider, and Noah A. Smith. 2013. Improved part-of-speech tagging for online conversational text with word clusters. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (HLT-NAACL), pages 380–390, Atlanta, Georgia, USA, June 9-15. Association for Computational Linguistics.

Paci, Giulio. 2016. Mivoq EVALITA 2016 PosTwITA tagger. In Pierpaolo Basile, Anna Corazza, Franco Cutugno, Simonetta Montemagni, Malvina Nissim, Viviana Patti, Giovanni Semeraro, and Rachele Sprugnoli, editors, Proceedings of Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016), Naples, Italy, December 5-7. Accademia University Press.

Peñas, Anselmo, Pamela Forner, Richard Sutcliffe, Álvaro Rodrigo, Corina Forăscu, Iñaki Alegria, Danilo Giampiccolo, Nicolas Moreau, and Petya Osenova. 2009. Overview of ResPubliQA 2009: question answering evaluation over European legislation. In Proceedings of the 10th Workshop of the Cross-Language Evaluation Forum, pages 174–196, Corfu, Greece, 30 September - 2 October. Springer.

Pipitone, Arianna, Giuseppe Tirone, and Roberto Pirrone. 2016. ChiLab4It System in the QA4FAQ Competition. In Pierpaolo Basile, Anna Corazza, Franco Cutugno, Simonetta Montemagni, Malvina Nissim, Viviana Patti, Giovanni Semeraro, and Rachele Sprugnoli, editors, Proceedings of Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016), Naples, Italy, December 5-7. Accademia University Press.
Plank, Barbara and Malvina Nissim. 2016. When silver glitters more than gold: Bootstrapping an Italian part-of-speech tagger for Twitter. In Pierpaolo Basile, Anna Corazza, Franco Cutugno, Simonetta Montemagni, Malvina Nissim, Viviana Patti, Giovanni Semeraro, and Rachele Sprugnoli, editors, Proceedings of Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016), Naples, Italy, December 5-7. Accademia University Press.

Pustejovsky, James, José M. Castano, Robert Ingria, Roser Sauri, Robert J. Gaizauskas, Andrea Setzer, Graham Katz, and Dragomir R. Radev. 2003. TimeML: Robust specification of event and temporal expressions in text. New Directions in Question Answering, 3:28–34.

Rangel, Francisco, Paolo Rosso, Ben Verhoeven, Walter Daelemans, Martin Potthast, and Benno Stein. 2016. Overview of the 4th author profiling task at PAN 2016: cross-genre evaluations. In Working Notes of CLEF 2016 - Conference and Labs of the Evaluation Forum, pages 750–784, Évora, Portugal, 5-8 September.

Reyes, Antonio and Paolo Rosso. 2014. On the Difficulty of Automatically Detecting Irony: Beyond a Simple Case of Negation. Knowledge and Information Systems, 40(3):595–614.

Rizzo, Giuseppe, Marieke van Erp, Julien Plu, and Raphaël Troncy. 2016. Making Sense of Microposts (#Microposts2016) Named Entity rEcognition and Linking (NEEL) Challenge. In Proceedings of the 6th Workshop on Making Sense of Microposts (#Microposts2016) co-located with WWW 2016, Montreal, Canada, 11 April.


Rosenthal, Sara, Preslav Nakov, Svetlana Kiritchenko, Saif M. Mohammad, Alan Ritter, and Veselin Stoyanov. 2015. SemEval-2015 Task 10: Sentiment Analysis in Twitter. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), Denver, Colorado, June.

Rosenthal, Sara, Alan Ritter, Preslav Nakov, and Veselin Stoyanov. 2014. SemEval-2014 Task 9: Sentiment Analysis in Twitter. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pages 73–80, Dublin, Ireland, August.

Russo, Irene and Monica Monachini. 2016. Samskara: minimal structural features for detecting subjectivity and polarity in Italian tweets. In Proceedings of Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA), Naples, Italy, December 5-7. Accademia University Press.

Saurí, Roser and James Pustejovsky. 2012. Are you sure that this happened? Assessing the factuality degree of events in text. Computational Linguistics, 38(2):261–299.

Speranza, Manuela. 2007. EVALITA 2007: The Named Entity Recognition Task. Intelligenza Artificiale, 4(2):66–68. Proceedings of the EVALITA 2007 Final Workshop, Rome, Italy, September 10.

Speranza, Manuela. 2009. The Named Entity Recognition Task at EVALITA 2009. In Poster and Workshop Proceedings of the 11th Conference of the Italian Association for Artificial Intelligence, Reggio Emilia, Italy, December 9-12.

Speranza, Manuela and Anne-Lyse Minard. 2015. Cross-language projection of multilayer semantic annotation in the NewsReader Wikinews Italian Corpus (WItaC). In Cristina Bosco, Sara Tonelli, and Fabio Massimo Zanzotto, editors, Proceedings of the Second Italian Conference on Computational Linguistics (CLiC-it 2015), pages 252–257, Trento, Italy, December 3-4. Accademia University Press.

Sprugnoli, Rachele, Viviana Patti, and Franco Cutugno. 2016. Raising Interest and Collecting Suggestions on the EVALITA Evaluation Campaign. In Pierpaolo Basile, Franco Cutugno, Malvina Nissim, Viviana Patti, and Rachele Sprugnoli, editors, Proceedings of Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016), Naples, Italy, December 5-7. Accademia University Press.

Stemle, Egon W. 2016. bot.zen @ EVALITA 2016 - A minimally-deep learning PoS-tagger (trained for Italian Tweets). In Pierpaolo Basile, Anna Corazza, Franco Cutugno, Simonetta Montemagni, Malvina Nissim, Viviana Patti, Giovanni Semeraro, and Rachele Sprugnoli, editors, Proceedings of Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016), Naples, Italy, December 5-7. Accademia University Press.

Stenetorp, Pontus, Sampo Pyysalo, Goran Topić, Tomoko Ohta, Sophia Ananiadou, and Jun'ichi Tsujii. 2012. BRAT: A Web-based Tool for NLP-assisted Text Annotation. In Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics, EACL '12, pages 102–107, Avignon, France. Association for Computational Linguistics.

Stranisci, Marco, Cristina Bosco, Delia Irazú Hernández Farías, and Viviana Patti. 2016. Annotating Sentiment and Irony in the Online Italian Political Debate on #labuonascuola. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), pages 2892–2899, Portorož, Slovenia, 23-28 May. ELRA.
Sun, Weiyi, Anna Rumshisky, and Ozlem Uzuner. 2013. Evaluating temporal relations in clinical text: 2012 i2b2 Challenge. Journal of the American Medical Informatics Association, 20(5):806–813.

Tamburini, Fabio. 2007. EVALITA 2007: The Part-of-Speech tagging task. Intelligenza Artificiale, 4(2):57–73.

Tamburini, Fabio. 2016. A BiLSTM-CRF PoS-tagger for Italian tweets using morphological information. In Pierpaolo Basile, Anna Corazza, Franco Cutugno, Simonetta Montemagni, Malvina Nissim, Viviana Patti, Giovanni Semeraro, and Rachele Sprugnoli, editors, Proceedings of Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016), Naples, Italy, December 5-7. Accademia University Press.

Tamburini, Fabio, Cristina Bosco, Alessandro Mazzei, and Andrea Bolioli. 2016. Overview of the EVALITA 2016 Part Of Speech on TWitter for ITAlian Task. In Pierpaolo Basile, Anna Corazza, Franco Cutugno, Simonetta Montemagni, Malvina Nissim, Viviana Patti, Giovanni Semeraro, and Rachele Sprugnoli, editors, Proceedings of Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016), Naples, Italy, December 5-7. Accademia University Press.


Tonelli, Sara, Rachele Sprugnoli, Manuela Speranza, and Anne-Lyse Minard. 2014. NewsReader guidelines for annotation at document level. Technical report, Fondazione Bruno Kessler.

Veselý, Karel, Arnab Ghoshal, Lukáš Burget, and Daniel Povey. 2013. Sequence-discriminative training of deep neural networks. In Proceedings of the 14th Annual Conference of the International Speech Communication Association (Interspeech 2013): Speech in Life Sciences and Human Societies, pages 2345–2349, Lyon, France, 25-29 August.
