Proceedings of the Joint Workshop on Multiword Expressions and Electronic Lexicons (MWE-LEX 2020)
Recommended publications
-
Machine-Translation Inspired Reordering As Preprocessing for Cross-Lingual Sentiment Analysis
Master in Cognitive Science and Language Master’s Thesis September 2018 Machine-translation inspired reordering as preprocessing for cross-lingual sentiment analysis Alejandro Ramírez Atrio Supervisor: Toni Badia Abstract In this thesis we study the effect of word reordering as preprocessing for Cross-Lingual Sentiment Analysis. We try different reorderings in two target languages (Spanish and Catalan) so that their word order more closely resembles the one from our source language (English). Our original expectation was that a Long Short Term Memory classifier trained on English data with bilingual word embeddings would internalize English word order, resulting in poor performance when tested on a target language with different word order. We hypothesized that the more the word order of any of our target languages resembles the one of our source language, the better the overall performance of our sentiment classifier would be when analyzing the target language. We tested five sets of transformation rules for our Part of Speech reorderings of Spanish and Catalan, extracted mainly from two sources: two papers by Crego and Mariño (2006a and 2006b) and our own empirical analysis of two corpora: CoStEP and Tatoeba. The results suggest that the bilingual word embeddings that we are training our Long Short Term Memory model with do not improve any English word order learning by part of the model when used cross-lingually. There is no improvement when reordering the Spanish and Catalan texts so that their word order more closely resembles English, and no significant drop in result score even when applying a random reordering to them making them almost unintelligible, neither when classifying between 2 options (positive-negative) nor between 4 (strongly positive, positive, negative, strongly negative). -
D1.3 Deliverable Title: Action Ontology and Object Categories Type
Deliverable number: D1.3 Deliverable Title: Action ontology and object categories Type (Internal, Restricted, Public): PU Authors: Daiva Vitkute-Adzgauskiene, Jurgita Kapociute, Irena Markievicz, Tomas Krilavicius, Minija Tamosiunaite, Florentin Wörgötter Contributing Partners: UGOE, VMU Project acronym: ACAT Project Type: STREP Project Title: Learning and Execution of Action Categories Contract Number: 600578 Starting Date: 01-03-2013 Ending Date: 30-04-2016 Contractual Date of Delivery to the EC: 31-08-2015 Actual Date of Delivery to the EC: 04-09-2015 Content: 1. EXECUTIVE SUMMARY; 2. INTRODUCTION; 3. OVERVIEW OF THE ACAT ONTOLOGY STRUCTURE; 4. OBJECT CATEGORIZATION BY THEIR SEMANTIC ROLES IN THE INSTRUCTION SENTENCE; 5. HIERARCHICAL OBJECT STRUCTURES; 6. MULTI-WORD OBJECT NAME RESOLUTION; 7. CONCLUSIONS AND FUTURE WORK; 8. REFERENCES -
Collocation Classification with Unsupervised Relation Vectors
Collocation Classification with Unsupervised Relation Vectors Luis Espinosa-Anke1, Leo Wanner2,3, and Steven Schockaert1 1School of Computer Science, Cardiff University, United Kingdom 2ICREA and 3NLP Group, Universitat Pompeu Fabra, Barcelona, Spain {espinosa-ankel,schockaerts1}@cardiff.ac.uk [email protected] Abstract Lexical relation classification is the task of predicting whether a certain relation holds between a given pair of words. In this paper, we explore to which extent the current distributional landscape based on word embeddings provides a suitable basis for classification of collocations, i.e., pairs of words between which idiosyncratic lexical relations hold. First, we introduce a novel dataset with collocations categorized according to lexical functions. Second, we conduct experiments on a subset of this benchmark, comparing it in particular to the well known DiffVec dataset. In these experiments, in addition to simple word vector arithmetic operations, we also investigate the role of unsupervised relation vectors as a complementary input. Morphosyntactic relations have been the focus of work on unsupervised relational similarity, as it has been shown that verb conjugation or nominalization patterns are relatively well preserved in vector spaces (Mikolov et al., 2013; Pennington et al., 2014a). Semantic relations pose a greater challenge (Vylomova et al., 2016), however. In fact, as of today, it is unclear which operation performs best (and why) for the recognition of individual lexico-semantic relations (e.g., hyperonymy or meronymy, as opposed to cause, location or action). Still, a number of works address this challenge. For instance, hypernymy has been modeled using vector concatenation (Baroni et al., 2012), vector difference and component-wise squared difference (Roller et al., 2014) as input to linear -
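The pair-encoding operations named in this excerpt (vector concatenation, vector difference, and component-wise squared difference) can be sketched as follows. This is a minimal illustration on plain Python lists; the function names and the toy vectors are assumptions made for the example and are not taken from the paper.

```python
from typing import List

Vector = List[float]


def concat(a: Vector, b: Vector) -> Vector:
    """Concatenate the two word vectors into one feature vector."""
    return a + b


def diff(a: Vector, b: Vector) -> Vector:
    """Component-wise difference a - b."""
    return [x - y for x, y in zip(a, b)]


def squared_diff(a: Vector, b: Vector) -> Vector:
    """Component-wise squared difference (a - b)^2."""
    return [(x - y) ** 2 for x, y in zip(a, b)]


# Toy 2-dimensional "embeddings" (hypothetical values).
head = [1.0, 2.0]
collocate = [3.0, 5.0]

features = {
    "concat": concat(head, collocate),
    "diff": diff(head, collocate),
    "squared_diff": squared_diff(head, collocate),
}
```

Any of these feature vectors could then be passed to a classifier; which one works best for a given relation is exactly the open question the excerpt raises.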
Comparative Evaluation of Collocation Extraction Metrics
Comparative Evaluation of Collocation Extraction Metrics Aristomenis Thanopoulos, Nikos Fakotakis, George Kokkinakis Wire Communications Laboratory Electrical & Computer Engineering Dept., University of Patras 265 00 Rion, Patras, Greece {aristom,fakotaki,gkokkin}@wcl.ee.upatras.gr Abstract Corpus-based automatic extraction of collocations is typically carried out employing some statistic indicating concurrency in order to identify words that co-occur more often than expected by chance. In this paper we are concerned with some typical measures such as the t-score, Pearson’s χ-square test, log-likelihood ratio, pointwise mutual information and a novel information theoretic measure, namely mutual dependency. Apart from some theoretical discussion about their correlation, we perform comparative evaluation experiments judging performance by their ability to identify lexically associated bigrams. We use two different gold standards: WordNet and lists of named-entities. Besides discovering that a frequency-biased version of mutual dependency performs the best, followed close by likelihood ratio, we point out some implications that usage of available electronic dictionaries such as the WordNet for evaluation of collocation extraction encompasses. 1. Introduction Collocational information is important not only for second language learning but also for many natural language processing tasks. Specifically, in natural language generation and machine translation it is necessary to ensure generation of lexically correct expressions; for example, “strong”, unlike “powerful”, modifies “coffee” but not “computers”. [...] dependence, several metrics have been adopted by the corpus linguistics community. Typical statistics are the t-score (TSC), Pearson’s χ-square test (χ2), log-likelihood ratio (LLR) and pointwise mutual information (PMI). However, usually no systematic comparative evaluation is accomplished. For example, Kita et al. (1993) accomplish partial and intuitive comparisons between metrics, while Smadja (1993) -
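Two of the association measures listed above, pointwise mutual information and the t-score, can be computed directly from bigram counts. The sketch below uses the common textbook formulations (PMI as the log ratio of observed to expected co-occurrence probability, and the t-score approximated as (O - E) / sqrt(O)); these may differ in detail from the exact variants evaluated in the paper, and the counts are made-up values chosen only for illustration.

```python
import math


def pmi(c_xy: int, c_x: int, c_y: int, n: int) -> float:
    """Pointwise mutual information in bits: log2(P(x,y) / (P(x) P(y)))."""
    return math.log2((c_xy * n) / (c_x * c_y))


def t_score(c_xy: int, c_x: int, c_y: int, n: int) -> float:
    """t-score approximation (O - E) / sqrt(O).

    O is the observed bigram count and E = c_x * c_y / n is the count
    expected if the two words occurred independently.
    """
    expected = c_x * c_y / n
    return (c_xy - expected) / math.sqrt(c_xy)


# Hypothetical counts: corpus of 1024 bigrams, each word seen 16 times,
# the pair seen together 8 times.
N, C_X, C_Y, C_XY = 1024, 16, 16, 8
pmi_value = pmi(C_XY, C_X, C_Y, N)
t_value = t_score(C_XY, C_X, C_Y, N)
```

Because PMI divides by the product of the individual frequencies, it tends to favour rare pairs, whereas the t-score grows with the observed count; this difference in frequency bias is one reason a comparative evaluation of such metrics is informative.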
Collocation Extraction Using Parallel Corpus
Collocation Extraction Using Parallel Corpus Kavosh Asadi Atui1 Heshaam Faili1 Kaveh Assadi Atuie2 (1)NLP Laboratory, Electrical and Computer Engineering Dept., University of Tehran, Iran (2)Electrical Engineering Dept., Sharif University of Technology, Iran [email protected],[email protected],[email protected] ABSTRACT This paper presents a novel method to extract the collocations of the Persian language using a parallel corpus. The method is applicable having a parallel corpus between a target language and any other high-resource one. Without the need for an accurate parser for the target side, it aims to parse the sentences to capture long distance collocations and to generate more precise results. A training data built by bootstrapping is also used to rank the candidates with a log-linear model. The method improves the precision and recall of collocation extraction by 5 and 3 percent respectively in comparison with the window-based statistical method in terms of being a Persian multi-word expression. KEYWORDS: Information and Content Extraction, Parallel Corpus, Under-resourced Languages Proceedings of COLING 2012: Posters, pages 93–102, COLING 2012, Mumbai, December 2012. 1 Introduction Collocation is usually interpreted as the occurrence of two or more words within a short space in a text (Sinclair, 1987). This definition however is not precise, because it is not possible to define a short space. It also implies the strategy that all traditional models had. They were looking for co-occurrences rather than collocations (Seretan, 2011). Consider the following sentence and its Persian translation: "Lecturer issued a major and also impossible to solve problem." مدرس یک مشکل بزرگ و غیر قابل حل را عنوان کرد. -
Dish2vec: a Comparison of Word Embedding Methods in an Unsupervised Setting
dish2vec: A Comparison of Word Embedding Methods in an Unsupervised Setting Guus Verstegen Student Number: 481224 11th June 2019 Erasmus University Rotterdam Erasmus School of Economics Supervisor: Dr. M. van de Velden Second Assessor: Dr. F. Frasincar Master Thesis Econometrics and Management Science Business Analytics and Quantitative Marketing Abstract While the popularity of continuous word embeddings has increased over the past years, detailed derivations and extensive explanations of these methods are lacking in the academic literature. This paper elaborates on three popular word embedding methods: GloVe and two versions of word2vec, continuous skip-gram and continuous bag-of-words. This research aims to enhance our understanding of both the foundation and extensions of these methods. In addition, this research addresses instability of the methods with respect to the hyperparameters. An example in a food menu context is used to illustrate the application of these word embedding techniques in quantitative marketing. For this specific case, the GloVe method performs best according to external expert evaluation. Nevertheless, the fact that GloVe performs best is not generalizable to other domains or data sets due to the instability of the models and the evaluation procedure. Keywords: word embedding; word2vec; GloVe; quantitative marketing; food Contents: 1 Introduction; 2 Literature review; 2.1 Discrete distributional representation of text; 2.2 Continuous distributed representation of text; 2.3 word2vec and GloVe; 2.4 Food menu related literature; 3 Methodology: the base implementation; 3.1 Quantified word similarity: cosine similarity; 3.2 Model optimization: stochastic gradient descent; 3.3 word2vec -
Evaluating Language Models for the Retrieval and Categorization of Lexical Collocations
Evaluating language models for the retrieval and categorization of lexical collocations Luis Espinosa-Anke1, Joan Codina-Filbà2, Leo Wanner3,2 1School of Computer Science and Informatics, Cardiff University, UK 2TALN Research Group, Pompeu Fabra University, Barcelona, Spain 3Catalan Institute for Research and Advanced Studies (ICREA), Barcelona, Spain [email protected], [email protected] Abstract Lexical collocations are idiosyncratic combinations of two syntactically bound lexical items (e.g., “heavy rain”, “take a step” or “undergo surgery”). Understanding their degree of compositionality and idiosyncrasy, as well as their underlying semantics, is crucial for language learners, lexicographers and downstream NLP applications alike. In this paper we analyse a suite of language models for collocation understanding. We first construct a dataset of apparitions of lexical collocations in context, categorized into 16 representative semantic categories. Then, we perform two experiments: (1) unsupervised collocate retrieval, and (2) supervised collocation classification in context. [...] linguistic phenomena (Rogers et al., 2020). Recently, a great deal of research analyzed the degree to which they encode, e.g., morphological (Edmiston, 2020), syntactic (Hewitt and Manning, 2019), or lexico-semantic structures (Joshi et al., 2020). However, less work explored so far how LMs interpret phraseological units at various degrees of compositionality. This is crucial for understanding the suitability of different text representations (e.g., static vs contextualized word embeddings) for encoding different types of multiword expressions (Shwartz and Dagan, 2019), which, in turn, can be useful for extracting latent world or commonsense information (Zellers et al., 2018). One central type of phraseological units are -
Actes Des 2Èmes Journées Scientifiques Du Groupement De Recherche Linguistique Informatique Formelle Et De Terrain (LIFT)
Actes des 2èmes journées scientifiques du Groupement de Recherche Linguistique Informatique Formelle et de Terrain (LIFT). Thierry Poibeau, Yannick Parmentier, Emmanuel Schang (Eds.) To cite this version: Thierry Poibeau, Yannick Parmentier, Emmanuel Schang. Actes des 2èmes journées scientifiques du Groupement de Recherche Linguistique Informatique Formelle et de Terrain (LIFT). Dec 2020, Montrouge, France. CNRS, 2020. hal-03066031 HAL Id: hal-03066031 https://hal.archives-ouvertes.fr/hal-03066031 Submitted on 3 Jan 2021 LIFT 2020, 10-11 December 2020, Montrouge, France (virtual) Committees Organizing committee: Thierry Poibeau, LATTICE/CNRS; Yannick Parmentier, LORIA/Université de Lorraine; Emmanuel Schang, LLL/Université d’Orléans Program committee: Angélique Amelot, LPP/CNRS -
Projecto Em Engenharia Informatica
UNIVERSIDADE DE LISBOA Faculdade de Ciências Departamento de Informática SPEECH-TO-SPEECH TRANSLATION TO SUPPORT MEDICAL INTERVIEWS João António Santos Gomes Rodrigues Project, Master's in Informatics Engineering (Mestrado em Engenharia Informática), Specialization in Interaction and Knowledge. Project supervised by Prof. Doutor António Manuel Horta Branco 2013 Acknowledgments I thank my supervisor, António Horta Branco, for granting me the opportunity to carry out this work and for the insight and guidance he offered toward its completion. I thank my colleagues in the NLX group for having opened the way and created countless tools and knowledge for the natural language processing of Portuguese, which I make use of in this work, as well as for their indispensable sharing and companionship. I thank my family and Joana for their ineffable support. This work was supported by the European Commission under the METANET4U project (ECICTPSP:270893), by the Agência de Inovação under the S4S project (S4S / PLN), and by the Fundação para a Ciência e Tecnologia under the DP4LT project (PTDC / EEISII / 1940 / 2012). To all three I express my sincere gratitude. Abstract This report presents the creation of a speech-to-speech translation system. The system captures voice as an audio signal, which is then interpreted, translated, and synthesized into speech. It takes as input an utterance in a source language and produces as output an utterance in a target language. -
New Kazakh Parallel Text Corpora with On-Line Access
New Kazakh parallel text corpora with on-line access Zhandos Zhumanov1, Aigerim Madiyeva2 and Diana Rakhimova3 al-Farabi Kazakh National University, Laboratory of Intelligent Information Systems, Almaty, Kazakhstan [email protected], [email protected], [email protected] Abstract. This paper presents a new parallel resource – text corpora – for Kazakh language with on-line access. We describe 3 different approaches to collecting parallel text and how much data we managed to collect using them, parallel Kazakh-English text corpora collected from various sources and aligned on sentence level, and web accessible corpus management system that was set up using open source tools – corpus manager Manatee and web GUI KonText. As a result of our work we present a working web-accessible corpus management system to work with collected corpora. Keywords: parallel text corpora, Kazakh language, corpus management system 1 Introduction Linguistic text corpora are large collections of text used for different language studies. They are used in linguistics and other fields that deal with language studies as an object of study or as a resource. Text corpora are needed for almost any language study since they are basically a representation of the language itself. In computer linguistics text corpora are used for various parsing, machine translation, speech recognition, etc. As shown in [1] text corpora can be classified by many categories: Size: small (to 1 million words); medium (from 1 to 10 million words); large (more than 10 million words). Number of Text Languages: monolingual, bilingual, multilingual. Language of Texts: English, German, Ukrainian, etc. Mode: spoken; written; mixed. -
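The size categories quoted in this excerpt map onto a simple threshold function. The sketch below is only an illustration: the function name is hypothetical, and the handling of the exact boundaries (1 million and 10 million words) is an assumption, since the excerpt does not say which class a corpus of exactly that size belongs to.

```python
def corpus_size_class(num_words: int) -> str:
    """Classify a corpus as small, medium, or large by word count.

    Thresholds follow the excerpt: up to 1 million words is small,
    1 to 10 million is medium, more than 10 million is large.
    Boundary values are assigned to the lower class (an assumption).
    """
    if num_words < 0:
        raise ValueError("word count must be non-negative")
    if num_words <= 1_000_000:
        return "small"
    if num_words <= 10_000_000:
        return "medium"
    return "large"


# Hypothetical corpus sizes for illustration.
examples = {n: corpus_size_class(n) for n in (250_000, 4_000_000, 25_000_000)}
```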
Multilingual Semantic Role Labeling with Unified Labels
POLYGLOT: Multilingual Semantic Role Labeling with Unified Labels Alan Akbik Yunyao Li IBM Research Almaden Research Center 650 Harry Road, San Jose, CA 95120, USA fakbika,[email protected] Abstract Semantic role labeling (SRL) identifies the predicate-argument structure in text with semantic labels. It plays a key role in understanding natural language. In this paper, we present POLYGLOT, a multilingual semantic role labeling system capable of semantically parsing sentences in 9 different languages from 4 different language groups. The core of POLYGLOT are SRL models for individual languages trained with automatically generated Proposition Banks (Akbik et al., 2015). The key feature of the system is that it treats the semantic labels of the English Proposition Bank as “universal semantic labels”: Given a sentence in any of the supported languages, POLYGLOT applies the corresponding SRL and predicts English PropBank frame and role annotation. The results are then visualized to facilitate the understanding of multilingual SRL with Figure 1: Example of POLYGLOT predicting English PropBank labels for a simple German sentence: The verb “kaufen” is correctly identified to evoke the BUY.01 frame, while “ich” (I) is recognized as the buyer, “ein neues Auto” (a new car) as the thing bought, and “dir” (for you) as the benefactive. Not surprisingly, enabling SRL for languages other than English has received increasing attention. One relevant key effort is to create Proposition Bank-style resources for different languages, such as Chinese (Xue and Palmer, 2005) and Hindi (Bhatt et al., 2009). However, the conventional approach of manually generating such resources is costly in terms of time and experts required, hindering the expansion of SRL to new target languages.