Insight Into the Slovak and Czech Corpus Linguistics
Total Page:16
File Type:pdf, Size:1020Kb
Load more
Recommended publications
-
CLIB 2016 Proceedings
The Second International Conference Computational Linguistics in Bulgaria (CLIB 2016) is organised within the Operation for Support for International Scientific Conferences Held in Bulgaria of the National Science Fund Grant № ДПМНФ 01/9 of 11 Aug 2016. National Science Fund CLIB 2016 is organised by: The Department of Computational Linguistics, Institute for Bulgarian Language, Bulgarian Academy of Sciences PUBLICATION AND CATALOGUING INFORMATION Title: Proceedings of the Second International Conference Computational Linguistics in Bulgaria (CLIB 2016) ISSN: 23675675 (online) Published and The Institute for Bulgarian Language Prof. Lyubomir distributed by: Andreychin, Bulgarian Academy of Sciences Editorial address: Institute for Bulgarian Language Prof. Lyubomir Andreychin, Bulgarian Academy of Sciences 52 Shipchenski prohod blvd., bldg. 17 Sofia 1113, Bulgaria +359 2/ 872 23 02 Copyright of each paper stays with the respective authors. The works in the Proceedings are licensed under a Creative Commons Attribution 4.0 International Licence (CC BY 4.0). Licence details: http://creativecommons.org/licenses/by/4.0 Proceedings of the Second International Conference Computational Linguistics in Bulgaria 9 September 2016 Sofia, Bulgaria PREFACE We are excited to welcome you to the second edition of the International Conference Computational Linguistics in Bulgaria (CLIB 2016) in Sofia, Bulgaria! CLIB aspires to foster the NLP community in Bulgaria and further the cooperation among researchers working in NLP for Bulgarian around the world. The need for a conference dedicated to NLP research dealing with or applicable to Bulgarian has been felt for quite some time. We believe that building a strong community of researchers and teams who have chosen to work on Bulgarian is a key factor to meeting the challenges and requirements posed to computational linguistics and NLP in Bulgaria. -
Collection of Usage Information for Language Resources from Academic Articles
Collection of Usage Information for Language Resources from Academic Articles Shunsuke Kozaway, Hitomi Tohyamayy, Kiyotaka Uchimotoyyy, Shigeki Matsubaray yGraduate School of Information Science, Nagoya University yyInformation Technology Center, Nagoya University Furo-cho, Chikusa-ku, Nagoya, 464-8601, Japan yyyNational Institute of Information and Communications Technology 4-2-1 Nukui-Kitamachi, Koganei, Tokyo, 184-8795, Japan fkozawa,[email protected], [email protected], [email protected] Abstract Recently, language resources (LRs) are becoming indispensable for linguistic researches. However, existing LRs are often not fully utilized because their variety of usage is not well known, indicating that their intrinsic value is not recognized very well either. Regarding this issue, lists of usage information might improve LR searches and lead to their efficient use. In this research, therefore, we collect a list of usage information for each LR from academic articles to promote the efficient utilization of LRs. This paper proposes to construct a text corpus annotated with usage information (UI corpus). In particular, we automatically extract sentences containing LR names from academic articles. Then, the extracted sentences are annotated with usage information by two annotators in a cascaded manner. We show that the UI corpus contributes to efficient LR searches by combining the UI corpus with a metadata database of LRs and comparing the number of LRs retrieved with and without the UI corpus. 1. Introduction Thesaurus1. In recent years, such language resources (LRs) as corpora • He also employed Roget’s Thesaurus in 100 words of and dictionaries are being widely used for research in the window to implement WSD. -
Inflectional Suffix Priming in Czech Verbs and Nouns
Inflectional Suffix Priming in Czech Verbs and Nouns Filip Smol´ık ([email protected]) Institute of Psychology, Academy of Sciences of the Czech Republic Politickych´ vezˇ nˇu˚ 7, Praha 1, CZ-110 00, Czech Republic Abstract research suggested that only prefixes could be primed, but not suffixes (Giraudo & Graigner, 2003). Two experiments examined if processing of inflectional affixes Recently, Dunabeitia,˜ Perea, and Carreiras (2008) were is affected by morphological priming, and whether morpho- logical decomposition applies to inflectional morphemes in vi- able to show affix priming in suffixed Spanish words. Their sual word recognition. Target words with potentially ambigu- participants processed the suffixed words faster if they were ous suffixes were preceded by primes that contained identical preceded by words with the same suffixes. The effect was also suffixes, homophonous suffixes with different function, or dif- ferent suffixes. The results partially confirmed the observa- present when the primes contained isolated suffixes only, or tion that morphological decomposition initially ignores the af- suffixes attached to strings of non-letter characters. fix meaning. With verb targets and short stimulus-onset asyn- The literature thus indicates that affixes can be primed, chrony (SOA), homophonous suffixes had similar effects as identical suffixes. With noun targets, there was a tendency to even though there may be differences between prefixes and respond faster after homophonous targets. With longer SOA in suffixes in the susceptibility to priming. However, all re- verb targets, the primes with identical suffix resulted in shorter search sketched above worked with derivational affixes. It is responses than the primes with a homophonous suffix. -
Conference Abstracts
EIGHTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION Held under the Patronage of Ms Neelie Kroes, Vice-President of the European Commission, Digital Agenda Commissioner MAY 23-24-25, 2012 ISTANBUL LÜTFI KIRDAR CONVENTION & EXHIBITION CENTRE ISTANBUL, TURKEY CONFERENCE ABSTRACTS Editors: Nicoletta Calzolari (Conference Chair), Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis. Assistant Editors: Hélène Mazo, Sara Goggi, Olivier Hamon © ELRA – European Language Resources Association. All rights reserved. LREC 2012, EIGHTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION Title: LREC 2012 Conference Abstracts Distributed by: ELRA – European Language Resources Association 55-57, rue Brillat Savarin 75013 Paris France Tel.: +33 1 43 13 33 33 Fax: +33 1 43 13 33 30 www.elra.info and www.elda.org Email: [email protected] and [email protected] Copyright by the European Language Resources Association ISBN 978-2-9517408-7-7 EAN 9782951740877 All rights reserved. No part of this book may be reproduced in any form without the prior permission of the European Language Resources Association ii Introduction of the Conference Chair Nicoletta Calzolari I wish first to express to Ms Neelie Kroes, Vice-President of the European Commission, Digital agenda Commissioner, the gratitude of the Program Committee and of all LREC participants for her Distinguished Patronage of LREC 2012. Even if every time I feel we have reached the top, this 8th LREC is continuing the tradition of breaking previous records: this edition we received 1013 submissions and have accepted 697 papers, after reviewing by the impressive number of 715 colleagues. -
Morphological Doublets in Croatian: a Multi-Methodological Analysis
Morphological Doublets in Croatian: A multi-methodological analysis By: Dario Lečić A thesis submitted in partial fulfilment of the requirements for the degree of Doctor of Philosophy The University of Sheffield Faculty of Arts and Humanities Department of Russian and Slavonic Studies 20 January, 2017 Acknowledgments Many a PhD student past and present will agree that doing a PhD is a time-consuming process with lots of ups and downs, motivational issues and even a number of nervous breakdowns. Having experienced all of these, I can only say that they were right. However, having reached the end of the tunnel, I have to admit that the feeling is great. I would like to use this opportunity to thank all the people who made this possible and who have helped me during these four years spent researching the intricate world of morphological doublets in Croatian. First of all, I would like to express my immense gratitude to my primary supervisor, Professor Neil Bermel from the Department or Russian and Slavonic Studies at the University of Sheffield for offering his guidance from day one. Our regular supervisory meetings as well as numerous e-mail exchanges have been eye-opening and I would not have been able to do this without you. I hope this dissertation will justify all the effort you have put into me as your PhD student. Even though the jurisdiction of the second supervisor as defined by the University of Sheffield officially stretches mostly to matters of the Doctoral Development Programme, my second supervisor, Dr Dagmar Divjak, nevertheless played a major role in this research as well, primarily in matters of statistics. -
Berkeley Linguistics Society
PROCEEDINGS OF THE FORTY-FIRST ANNUAL MEETING OF THE BERKELEY LINGUISTICS SOCIETY February 7-8, 2015 General Session Special Session Fieldwork Methodology Editors Anna E. Jurgensen Hannah Sande Spencer Lamoureux Kenny Baclawski Alison Zerbe Berkeley Linguistics Society Berkeley, CA, USA Berkeley Linguistics Society University of California, Berkeley Department of Linguistics 1203 Dwinelle Hall Berkeley, CA 94720-2650 USA All papers copyright c 2015 by the Berkeley Linguistics Society, Inc. All rights reserved. ISSN: 0363-2946 LCCN: 76-640143 Contents Acknowledgments . v Foreword . vii The No Blur Principle Effects as an Emergent Property of Language Systems Farrell Ackerman, Robert Malouf . 1 Intensification and sociolinguistic variation: a corpus study Andrea Beltrama . 15 Tagalog Sluicing Revisited Lena Borise . 31 Phonological Opacity in Pendau: a Local Constraint Conjunction Analysis Yan Chen . 49 Proximal Demonstratives in Predicate NPs Ryan B . Doran, Gregory Ward . 61 Syntax of generic null objects revisited Vera Dvořák . 71 Non-canonical Noun Incorporation in Bzhedug Adyghe Ksenia Ershova . 99 Perceptual distribution of merging phonemes Valerie Freeman . 121 Second Position and “Floating” Clitics in Wakhi Zuzanna Fuchs . 133 Some causative alternations in K’iche’, and a unified syntactic derivation John Gluckman . 155 The ‘Whole’ Story of Partitive Quantification Kristen A . Greer . 175 A Field Method to Describe Spontaneous Motion Events in Japanese Miyuki Ishibashi . 197 i On the Derivation of Relative Clauses in Teotitlán del Valle Zapotec Nick Kalivoda, Erik Zyman . 219 Gradability and Mimetic Verbs in Japanese: A Frame-Semantic Account Naoki Kiyama, Kimi Akita . 245 Exhaustivity, Predication and the Semantics of Movement Peter Klecha, Martina Martinović . 267 Reevaluating the Diphthong Mergers in Japono-Ryukyuan Tyler Lau . -
Predicting Declension Class from Form and Meaning
This is a repository copy of Predicting declension class from form and meaning. White Rose Research Online URL for this paper: https://eprints.whiterose.ac.uk/161615/ Proceedings Paper: Williams, Adina, Pimentel, Tiago, McCarthy, Arya et al. (3 more authors) (2020) Predicting declension class from form and meaning. In: Proceedings of the 58th Annual Meeting for the Association of Computational Linguistics. Reuse Items deposited in White Rose Research Online are protected by copyright, with all rights reserved unless indicated otherwise. They may be downloaded and/or printed for private study, or other acts as permitted by national copyright laws. The publisher or other rights holders may allow further reproduction and re-use of the full text version. This is indicated by the licence information on the White Rose Research Online record for the item. Takedown If you consider content in White Rose Research Online to be in breach of UK law, please notify us by emailing [email protected] including the URL of the record and the reason for the withdrawal request. [email protected] https://eprints.whiterose.ac.uk/ Predicting Declension Class from Form and Meaning Adina Williams@ Tiago PimentelD Arya D. McCarthyZ Hagen BlixË Eleanor ChodroffY Ryan CotterellD,Q @Facebook AI Research DUniversity of Cambridge ZJohns Hopkins University ËNew York University YUniversity of York QETH Zurich¨ [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] Abstract The noun lexica of many natural languages are divided into several declension classes with characteristic morphological properties. -
Languages in the European Information Society Croatian
META-NET White Paper Series Languages in the European Information Society Croatian Early Release Edition META-FORUM 2011 27-28 June 2011 Budapest, Hungary The development of this white paper has been funded by the Seventh Framework Programme and the ICT Policy Support Programme of the European Commission under contracts T4ME (Grant Agreement 249119), CESAR (Grant Agreement 271022), METANET4U (Grant Agreement 270893) and META-NORD (Grant Agreement 270899). This white paper is for educators, journalists, politicians, language communities and others, who want to establish a truly multilingual Europe. This white paper is part of a series that promotes knowledge about language technology and its potential. The availability and use of language technology in Europe varies between languages. Conse- quently, the actions that are required to further support research and development of language technologies also differs for each language. The required actions depend on many factors, such as the complexity of a given language and the size of its community. META-NET, a European Commission Network of Excellence, has conducted an analysis of current language resources and technolo- gies. This analysis focused on the 23 official European languages as well as other important regional languages in Europe. The results of this analysis suggests that there are many significant research gaps for each language. A more detailed, expert analysis and as- sessment of the current situation will help maximise the impact of additional research and minimize any risks. META-NET consists of 44 research centres from 31 countries who are working with stakeholders from commercial businesses, gov- ernment agencies, industry, research organisations, software com- panies, technology providers and European universities. -
The Translation Equivalents Database (Treq) As a Lexicographer’S Aid
The Translation Equivalents Database (Treq) as a Lexicographer’s Aid Michal Škrabal, Martin Vavřín Institute of the Czech National Corpus, Charles University, Czech Republic E-mail: [email protected], [email protected] Abstract The aim of this paper is to introduce a tool that has recently been developed at the Institute of the Czech National Corpus, the Treq (Translation Equivalents) database, and to explore its possible uses, especially in the field of lexicography. Equivalent candidates offered by Treq can also be considered as potential equivalents in a bilingual dictionary (we will focus on the Latvian–Czech combination in this paper). Lexicographers instantly receive a list of candidates for target language counterparts and their frequencies (expressed both in absolute numbers and percentages) that suggest the probability that a given candidate is functionally equivalent. A significant advantage is the possibility to click on any one of these candidates and immediately verify their individual occurrences in a given context; and thus more easily distinguish the relevant translation candidates from the misleading ones. This utility, which is based on data stored in the InterCorp parallel corpus, is continually being upgraded and enriched with new functions (the recent integration of multi-word units, adding English as the primary language of the dictionaries, an improved interface, etc.), and the accuracy of the results is growing as the volume of data keeps increasing. Keywords: InterCorp; Treq; translation equivalents; alignment; Latvian–Czech dictionary 1. Introduction The aim of this paper is to introduce one of the tools that has been developed recently at the Institute of the Czech National Corpus (ICNC) and which could be especially helpful to lexicographers: namely, the Treq translation equivalents database1. -
Overabundance in Croatian Dual-Class Verbs FLUMINENSIA, God
Tomislava Bošnjak Botica, Gordana Hržica, Overabundance in Croatian dual-class verbs FLUMINENSIA, god. 28 (2016), br. 1 Tomislava Bošnjak Botica, Gordana Hržica OVERABUNDANCE IN CROATIAN DUAL-CLASS VERBS dr. sc. Tomislava Bošnjak Botica, Institut za hrvatski jezik i jezikoslovlje, [email protected], Zagreb dr. sc. Gordana Hržica, Edukacijsko-rehabilitacijski fakultet, [email protected], Zagreb izvorni znanstveni članak UDK 811.163.42’367.625 rukopis primljen: 5. 4. 2016.; prihvaćen za tisak: 21. 6. 2016. Croatian verbal inflection morphology is typically described using verb class distinctions. The number of classes differs among approaches, but the basic criterion for class division is the presence or absence and the type of suppletion in verb stems. Generally, one verb belongs to one inflectional class or paradigm only. However, some verbs belong to two classes, i.e. they have two parallel sets of stems. In such dual-class verbs, one infinitive form is realizable in two present forms in all cells within a class, i.e. there is an overabundance (Thorton 2011). Inevitably, one of the stem forming paradigms is a class with categorial suppletion. The present stem of a categorial suppletion class has a greater phonological distance from the infinitive stem than the present stem of the other class. Using a different terminology one class can be described as more transparent, while the other is less transparent (more opaque) in forming the present stem. This study attempts to present overabundance in dual-class verbs and to determine whether competition in such forms can be explained by their tendency to conform to one default class or by other factors, specifically, by the phonological distance between the two paradigms of dual-class verbs. -
Predicting Declension Class from Form and Meaning
Research Collection Conference Paper Predicting Declension Class from Form and Meaning Author(s): Williams, Adina; Pimentel, Tiago; Blix, Hagen; McCarthy, Arya D.; Chodroff, Eleanor; Cotterell, Ryan Publication Date: 2020-07 Permanent Link: https://doi.org/10.3929/ethz-b-000462306 Originally published in: http://doi.org/10.18653/v1/2020.acl-main.597 Rights / License: Creative Commons Attribution 4.0 International This page was generated automatically upon download from the ETH Zurich Research Collection. For more information please consult the Terms of use. ETH Library Predicting Declension Class from Form and Meaning Adina Williams@ Tiago PimentelD Arya D. McCarthyZ Hagen BlixË Eleanor ChodroffY Ryan CotterellD;Q @Facebook AI Research DUniversity of Cambridge ZJohns Hopkins University ËNew York University YUniversity of York QETH Zurich¨ [email protected], [email protected], [email protected], [email protected], [email protected], [email protected] Abstract The noun lexica of many natural languages are divided into several declension classes with characteristic morphological properties. Class membership is far from deterministic, but the phonological form of a noun and its + meaning can often provide imperfect clues. Here, we investigate the strength of those clues. More specifically, we operationalize “strength” as measuring how much informa- Figure 1: The conditional entropies (H) and mutual in- tion, in bits, we can glean about declension formation quantities (MI) of form (W ), meaning (V ), class from knowing the form and meaning and declension class (C), given gender (G) in German of nouns. We know that form and mean- and Czech. ing are often also indicative of grammatical gender—which, as we quantitatively verify, can itself share information with declension not are few in number. -
Better Web Corpora for Corpus Linguistics and NLP
Masaryk University Faculty of Informatics Better Web Corpora For Corpus Linguistics And NLP Doctoral Thesis Vít Suchomel Brno, Spring 2020 Masaryk University Faculty of Informatics Better Web Corpora For Corpus Linguistics And NLP Doctoral Thesis Vít Suchomel Brno, Spring 2020 Declaration Hereby I declare that this paper is my original authorial work, which I have worked out on my own. All sources, references, and literature used or excerpted during elaboration of this work are properly cited and listed in complete reference to the due source. Vít Suchomel Advisor: Pavel Rychlý i Acknowledgements I would like to thank my advisors, prof. Karel Pala and prof. Pavel Rychlý for their problem insight, help with software design and con- stant encouragement. I am also grateful to my colleagues from Natural Language Process- ing Centre at Masaryk University and Lexical Computing, especially Miloš Jakubíček, Pavel Rychlý and Aleš Horák, for their support of my work and invaluable advice. Furthermore, I would like to thank Adam Kilgarriff, who gave me a wonderful opportunity to work for a leading company in the field of lexicography and corpus driven NLP and Jan Pomikálek who helped me to start. I thank to my wife Kateřina who supported me a lot during writing this thesis. Of those who have always accepted me and loved me in spite of my failures, God is the greatest. ii Abstract The internet is used by computational linguists, lexicographers and social scientists as an immensely large source of text data for various NLP tasks and language studies. Web corpora can be built in sizes which would be virtually impossible to achieve using traditional corpus creation methods.