Insight Into the Slovak and Czech Corpus Linguistics

Total Page:16

File Type:pdf, Size:1020Kb

Insight Into the Slovak and Czech Corpus Linguistics INSIGHT INTO THE SLOVAK AND CZECH CORPUS LINGUISTICS Editor MÁRIA ŠIMKOVÁ Bratislava 2006 VEDA Publishing House of Slovak Academy of Sciences Ľudovít Štúr Institute of Linguistics of the SAS © Silvie Cinková, Radovan Garabík, Lucia Gianitsová-Ološtiaková, Markus Giger, Jan Hajič, Eva Hajičová, Veronika Kolářová-Řezníčková, Michal Křen, Markéta Lopatková, Karel Oliva, Jarmila Panevová, Vladimír Petkevič, Věra Schmiedtová, Miloslava Sokolová, Mária Šimková (ed.), Zdeňka Urešová Zborník Insight into Slovak and Czech Corpus Linguistics bol pôvodne koncipovaný ako súbor prednášok, ktoré odzneli od 4. 11. 2002 do 28. 6. 2004 na pôde oddelenia Slovenského národného korpusu v Jazykovednom ústave Ľ. Štúra Slovenskej akadémie vied v Bratislave (http://korpus. juls.savba.sk/activities/). Nie všetci prednášajúci však dodali príspevok na publikovanie, iní svoj príspevok poskytli v doplnenej a prepracovanej verzii. Viaceré prednášky technického charakteru a témy workshopov ani neboli plánované na vydanie v textovej podobe, niektoré časti z tých, ktoré sa v zborníku nachádzajú, sú už v čase vydania prekonané. Naopak, do zborníka pribudli štúdie o budovaní Slovenského národného korpusu a jeho lingvistickej (najmä morfologickej) anotácii. V tomto zložení predstavuje publikácia Insight into Slovak and Czech Corpus Linguistics vybraný súbor štúdií o stave korpusov a možnostiach i výsledkoch korpusovej lingvistiky v Slovenskej republike a v Českej republike v prvej polovici prvého desaťročia 21. storočia. Proceedings entitled Insight into Slovak and Czech Corpus Linguistics were to have been a collection of lectures that had been delivered since November 4th 2002 until June 28th 2004 at the department of the Slovak National Corpus of the Ľudovít Štúr Institute of Linguistics of the Slovak Academy of Sciences in Bratislava (http://korpus.juls.savba.sk/activities/). However, not every author had handed in his or her paper for publishing, others had sent them enlarged and rewritten. Several lectures being of more technical character as well as subject matter of workshops had not been even intended for publishing in print. Some parts of those, which are included into proceedings, are recently outdated. On the contrary, proceedings were enriched with papers on building of the Slovak National Corpus and on its linguistic (especially morphological) annotation. The content of the publication Insight into Slovak and Czech Corpus Linguistics represents a collection of selected papers reflecting the state and possibilities of corpora as well as the outcomes of Corpus Linguistics in the Slovak and Czech Republic in the first half of the 21st century. © Ľudovít Štúr Institute of Linguistics of the SAS, Slovak National Corpus, 2006 ISBN 80-224-0880-8 TABLE OF CONTENTS Věra Schmiedtová Corpora and Dictionaries 7 Michal Křen Frequency Dictionary of Czech: A Detailed Processing Description 16 Vladimír Petkevič Reliable Morphological Disambiguation of Czech: a Rule-Based Approach Is Necessary 26 Karel Oliva Discovering and Employing Ungrammaticality 45 Jan Hajič Complex Corpus Annotation: The Prague Dependency Treebank 54 Eva Hajičová Towards the Underlying Structure Annotation of a Large Corpus of Texts 74 Markéta Lopatková – Jarmila Panevová Recent Developments in the Theory of Valency in the Light of the Prague Dependency Treebank 83 Zdeňka Urešová Verbal Valency in the Prague Dependency Treebank from the Annotator’s Viewpoint 93 Silvie Cinková – Veronika Kolářová Nouns as Components of Support Verb Constructions in the Prague Dependency Treebank 113 Markus Giger On the Delimitation of Analytic Verbal Forms 140 Mária Šimková Slovak National Corpus – History and Current Situation 151 Radovan Garabík Processing XML Text with Python and ElementTree – a Practical Experience 160 Lucia Gianitsová Morphological Analysis of the Slovak National Corpus 166 Miloslava Sokolová Options for the Generation of a Corpus-Based Slovak Morphology (as Part of Corpus Morphosyntax) 179 Corpora and dictionaries VĚRA SCHMIEDTOVÁ This text was developed and has been used as a basic study material for courses on corpus linguistics taught at the Charles University in Prague over the last couple of years. In addition, it was presented at a workshop in 2003 at the Institute for the Slovak National Corpus in Bratislava. INTRODUCTION It is sometimes claimed that we are experiencing the age of dictionaries. In recent years, many new lexicographical organizations and institutes have organised specialized conferences where lexicologists from all over the world come together to evaluate and discuss theoretical as well as practical issues concerning their field of expertise. (There are such conferences as Euralex, Afrilex, Asialex, and DSNA – lexicological organizations of North America). Also, many books and papers whose focus is lexicology have been published. For example, a very important piece of work is the Encyclopaedia of Lexicography, which was edited by Franz Josef Hausmann and to which 349 collaborators contributed their work. New specialized lexicographical journals have started to appear (Dictionaries, Lexicographica, International Journal of Lexicography), etc. This boom has mainly affected the English-speaking countries, especially Great Britain. Naturally, other countries can learn from this turbulent development in documenting their languages, and can attempt to use the results already achieved. My contribution pays attention to monolingual dictionaries, and addresses the current language situation in the Czech Republic compared with the situation in Great Britain. After several years of improving conditions, modern lexicology is pausing for breath and all the data that have been collected and made ready by the Czech National Corpus are now ready to be processed and analysed. It is possible to say that, since the political changes in 1989, the Czech lexicology situation is still fragile; nevertheless, it is not as hopeless as previously was the case. THE TRADITIONAL WAY OF DEVELOPING A DICTIONARY Prior to the use of computers, it was necessary to collect a large amount of language material in order to compile a new dictionary. The basic prerequisite method was excerption. That is, recording individual words and collocations in their respective contexts of occurrence onto excerptions slips. For more detail, see below. HISTORY OF MODERN CZECH ACADEMIC DICTIONARIES We will not review all the dictionaries written in recent periods in the Czech language sphere, but instead focus on a few academic dictionaries, pertaining to Czech vocabulary from 1870 to 1978, whose origins goes back to the first years of the 20th century. 7 1 REFERENCE DICTIONARY OF THE CZECH LANGUAGE 19351957 PS 9 VOLUMES The lexicological archive where all the excerption slips are stored is located at the Institute for the Czech language in Prague. The archive has been operating since 1911 although preparations started in 1906. The main initiator of the entire project was Mr. František Pastrnek, who also became the president of the lexicological and dialectological committee of the Czech Academy of Science in 1911. In the same year, the Office of the Dictionary of the Czech language was established and, thereby, the first rules for scientific excerptions. The Office was situated in one room of the Hlávkův Palace in Jungmannova Street in Prague. The sole employee was an internal secretary by the name of František Trávníček. Soon afterwards, the Bureau expanded with new members from the ranks of high school teachers and university students. Interestingly, until the Bureau was changed into an academic institution, it was supported by the Ministry of Education by making high school teachers available for work on a new dictionary. The future dictionary was based on excerptions from Czech vocabulary from 1770. Initially, so-called “total excerption” was applied, i.e. every word in a given text was documented in all its contextual occurrences, collocations, and idioms. A word in its dictionary form was the header of an excerption slip. That means that declined or conjugated words were to be found in nominative singular or in infinitive in the header; other possible collocations, further sufficient context where the form of a given rdwo was underlined, and finally bibliographical information regarding the occurrence of the word were put in the commentary line. In 1917, a second improved edition of the excerption rules appeared. The total excerption style was enhanced by a partial excerption and by specialized excerption that was dedicated to capturing linguistic peculiarities. An example of an excerption slip: Authors of books selected for excerption were supposed to be “the good authors”. Thus, every year, between 10 and 15 books of were chosen. Subsequently, filters were set up. These were lists of words and their meanings that were already well documented. For these words, no additional excerption was required. In addition, for each period, a significant piece of work was singled out. Excerpts from such books formed the basis for a filter. For example, the novel Babička (Grandmother) by Božena Němcová was selected for the period 1854–1878. 8 The actual lexicographical work began in 1932. Originally, the idea was to publish a thesaurus of the Czech language, but in the end this plan was changed and it was decided to create the reference dictionary of the Czech language. The focus was on the current Czech language. For the description, excerpted data collected since 1870 were used. Older language material was used in
Recommended publications
  • CLIB 2016 Proceedings
    The Second International Conference Computational ​ Linguistics in Bulgaria (CLIB 2016) is organised within ​ the Operation for Support for International Scientific Conferences Held in Bulgaria of the National Science ​ Fund Grant № ДПМНФ 01/9 of 11 Aug 2016. National Science Fund ​ ​ CLIB 2016 is organised by: The Department of Computational Linguistics, Institute for Bulgarian Language, Bulgarian Academy of Sciences PUBLICATION AND CATALOGUING INFORMATION Title: Proceedings of the Second International Conference Computational Linguistics in Bulgaria (CLIB 2016) ISSN: 2367­5675 (online) Published and The Institute for Bulgarian Language Prof. Lyubomir ​ distributed by: Andreychin, Bulgarian Academy of Sciences ​ Editorial address: Institute for Bulgarian Language Prof. Lyubomir ​ Andreychin, Bulgarian Academy of Sciences ​ 52 Shipchenski prohod blvd., bldg. 17 Sofia 1113, Bulgaria +359 2/ 872 23 02 Copyright of each paper stays with the respective authors. The works in the Proceedings are licensed under a Creative Commons Attribution 4.0 International Licence (CC BY 4.0). Licence details: http://creativecommons.org/licenses/by/4.0 ​ Proceedings of the Second International Conference Computational Linguistics in Bulgaria 9 September 2016 Sofia, Bulgaria PREFACE We are excited to welcome you to the second edition of the International Conference Computational ​ Linguistics in Bulgaria (CLIB 2016) in Sofia, Bulgaria! ​ CLIB aspires to foster the NLP community in Bulgaria and further the cooperation among researchers working in NLP for Bulgarian around the world. The need for a conference dedicated to NLP research dealing with or applicable to Bulgarian has been felt for quite some time. We believe that building a strong community of researchers and teams who have chosen to work on Bulgarian is a key factor to meeting the challenges and requirements posed to computational linguistics and NLP in Bulgaria.
    [Show full text]
  • Collection of Usage Information for Language Resources from Academic Articles
    Collection of Usage Information for Language Resources from Academic Articles Shunsuke Kozaway, Hitomi Tohyamayy, Kiyotaka Uchimotoyyy, Shigeki Matsubaray yGraduate School of Information Science, Nagoya University yyInformation Technology Center, Nagoya University Furo-cho, Chikusa-ku, Nagoya, 464-8601, Japan yyyNational Institute of Information and Communications Technology 4-2-1 Nukui-Kitamachi, Koganei, Tokyo, 184-8795, Japan fkozawa,[email protected], [email protected], [email protected] Abstract Recently, language resources (LRs) are becoming indispensable for linguistic researches. However, existing LRs are often not fully utilized because their variety of usage is not well known, indicating that their intrinsic value is not recognized very well either. Regarding this issue, lists of usage information might improve LR searches and lead to their efficient use. In this research, therefore, we collect a list of usage information for each LR from academic articles to promote the efficient utilization of LRs. This paper proposes to construct a text corpus annotated with usage information (UI corpus). In particular, we automatically extract sentences containing LR names from academic articles. Then, the extracted sentences are annotated with usage information by two annotators in a cascaded manner. We show that the UI corpus contributes to efficient LR searches by combining the UI corpus with a metadata database of LRs and comparing the number of LRs retrieved with and without the UI corpus. 1. Introduction Thesaurus1. In recent years, such language resources (LRs) as corpora • He also employed Roget’s Thesaurus in 100 words of and dictionaries are being widely used for research in the window to implement WSD.
    [Show full text]
  • Inflectional Suffix Priming in Czech Verbs and Nouns
    Inflectional Suffix Priming in Czech Verbs and Nouns Filip Smol´ık ([email protected]) Institute of Psychology, Academy of Sciences of the Czech Republic Politickych´ vezˇ nˇu˚ 7, Praha 1, CZ-110 00, Czech Republic Abstract research suggested that only prefixes could be primed, but not suffixes (Giraudo & Graigner, 2003). Two experiments examined if processing of inflectional affixes Recently, Dunabeitia,˜ Perea, and Carreiras (2008) were is affected by morphological priming, and whether morpho- logical decomposition applies to inflectional morphemes in vi- able to show affix priming in suffixed Spanish words. Their sual word recognition. Target words with potentially ambigu- participants processed the suffixed words faster if they were ous suffixes were preceded by primes that contained identical preceded by words with the same suffixes. The effect was also suffixes, homophonous suffixes with different function, or dif- ferent suffixes. The results partially confirmed the observa- present when the primes contained isolated suffixes only, or tion that morphological decomposition initially ignores the af- suffixes attached to strings of non-letter characters. fix meaning. With verb targets and short stimulus-onset asyn- The literature thus indicates that affixes can be primed, chrony (SOA), homophonous suffixes had similar effects as identical suffixes. With noun targets, there was a tendency to even though there may be differences between prefixes and respond faster after homophonous targets. With longer SOA in suffixes in the susceptibility to priming. However, all re- verb targets, the primes with identical suffix resulted in shorter search sketched above worked with derivational affixes. It is responses than the primes with a homophonous suffix.
    [Show full text]
  • Conference Abstracts
    EIGHTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION Held under the Patronage of Ms Neelie Kroes, Vice-President of the European Commission, Digital Agenda Commissioner MAY 23-24-25, 2012 ISTANBUL LÜTFI KIRDAR CONVENTION & EXHIBITION CENTRE ISTANBUL, TURKEY CONFERENCE ABSTRACTS Editors: Nicoletta Calzolari (Conference Chair), Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis. Assistant Editors: Hélène Mazo, Sara Goggi, Olivier Hamon © ELRA – European Language Resources Association. All rights reserved. LREC 2012, EIGHTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION Title: LREC 2012 Conference Abstracts Distributed by: ELRA – European Language Resources Association 55-57, rue Brillat Savarin 75013 Paris France Tel.: +33 1 43 13 33 33 Fax: +33 1 43 13 33 30 www.elra.info and www.elda.org Email: [email protected] and [email protected] Copyright by the European Language Resources Association ISBN 978-2-9517408-7-7 EAN 9782951740877 All rights reserved. No part of this book may be reproduced in any form without the prior permission of the European Language Resources Association ii Introduction of the Conference Chair Nicoletta Calzolari I wish first to express to Ms Neelie Kroes, Vice-President of the European Commission, Digital agenda Commissioner, the gratitude of the Program Committee and of all LREC participants for her Distinguished Patronage of LREC 2012. Even if every time I feel we have reached the top, this 8th LREC is continuing the tradition of breaking previous records: this edition we received 1013 submissions and have accepted 697 papers, after reviewing by the impressive number of 715 colleagues.
    [Show full text]
  • Morphological Doublets in Croatian: a Multi-Methodological Analysis
    Morphological Doublets in Croatian: A multi-methodological analysis By: Dario Lečić A thesis submitted in partial fulfilment of the requirements for the degree of Doctor of Philosophy The University of Sheffield Faculty of Arts and Humanities Department of Russian and Slavonic Studies 20 January, 2017 Acknowledgments Many a PhD student past and present will agree that doing a PhD is a time-consuming process with lots of ups and downs, motivational issues and even a number of nervous breakdowns. Having experienced all of these, I can only say that they were right. However, having reached the end of the tunnel, I have to admit that the feeling is great. I would like to use this opportunity to thank all the people who made this possible and who have helped me during these four years spent researching the intricate world of morphological doublets in Croatian. First of all, I would like to express my immense gratitude to my primary supervisor, Professor Neil Bermel from the Department or Russian and Slavonic Studies at the University of Sheffield for offering his guidance from day one. Our regular supervisory meetings as well as numerous e-mail exchanges have been eye-opening and I would not have been able to do this without you. I hope this dissertation will justify all the effort you have put into me as your PhD student. Even though the jurisdiction of the second supervisor as defined by the University of Sheffield officially stretches mostly to matters of the Doctoral Development Programme, my second supervisor, Dr Dagmar Divjak, nevertheless played a major role in this research as well, primarily in matters of statistics.
    [Show full text]
  • Berkeley Linguistics Society
    PROCEEDINGS OF THE FORTY-FIRST ANNUAL MEETING OF THE BERKELEY LINGUISTICS SOCIETY February 7-8, 2015 General Session Special Session Fieldwork Methodology Editors Anna E. Jurgensen Hannah Sande Spencer Lamoureux Kenny Baclawski Alison Zerbe Berkeley Linguistics Society Berkeley, CA, USA Berkeley Linguistics Society University of California, Berkeley Department of Linguistics 1203 Dwinelle Hall Berkeley, CA 94720-2650 USA All papers copyright c 2015 by the Berkeley Linguistics Society, Inc. All rights reserved. ISSN: 0363-2946 LCCN: 76-640143 Contents Acknowledgments . v Foreword . vii The No Blur Principle Effects as an Emergent Property of Language Systems Farrell Ackerman, Robert Malouf . 1 Intensification and sociolinguistic variation: a corpus study Andrea Beltrama . 15 Tagalog Sluicing Revisited Lena Borise . 31 Phonological Opacity in Pendau: a Local Constraint Conjunction Analysis Yan Chen . 49 Proximal Demonstratives in Predicate NPs Ryan B . Doran, Gregory Ward . 61 Syntax of generic null objects revisited Vera Dvořák . 71 Non-canonical Noun Incorporation in Bzhedug Adyghe Ksenia Ershova . 99 Perceptual distribution of merging phonemes Valerie Freeman . 121 Second Position and “Floating” Clitics in Wakhi Zuzanna Fuchs . 133 Some causative alternations in K’iche’, and a unified syntactic derivation John Gluckman . 155 The ‘Whole’ Story of Partitive Quantification Kristen A . Greer . 175 A Field Method to Describe Spontaneous Motion Events in Japanese Miyuki Ishibashi . 197 i On the Derivation of Relative Clauses in Teotitlán del Valle Zapotec Nick Kalivoda, Erik Zyman . 219 Gradability and Mimetic Verbs in Japanese: A Frame-Semantic Account Naoki Kiyama, Kimi Akita . 245 Exhaustivity, Predication and the Semantics of Movement Peter Klecha, Martina Martinović . 267 Reevaluating the Diphthong Mergers in Japono-Ryukyuan Tyler Lau .
    [Show full text]
  • Predicting Declension Class from Form and Meaning
    This is a repository copy of Predicting declension class from form and meaning. White Rose Research Online URL for this paper: https://eprints.whiterose.ac.uk/161615/ Proceedings Paper: Williams, Adina, Pimentel, Tiago, McCarthy, Arya et al. (3 more authors) (2020) Predicting declension class from form and meaning. In: Proceedings of the 58th Annual Meeting for the Association of Computational Linguistics. Reuse Items deposited in White Rose Research Online are protected by copyright, with all rights reserved unless indicated otherwise. They may be downloaded and/or printed for private study, or other acts as permitted by national copyright laws. The publisher or other rights holders may allow further reproduction and re-use of the full text version. This is indicated by the licence information on the White Rose Research Online record for the item. Takedown If you consider content in White Rose Research Online to be in breach of UK law, please notify us by emailing [email protected] including the URL of the record and the reason for the withdrawal request. [email protected] https://eprints.whiterose.ac.uk/ Predicting Declension Class from Form and Meaning Adina Williams@ Tiago PimentelD Arya D. McCarthyZ Hagen BlixË Eleanor ChodroffY Ryan CotterellD,Q @Facebook AI Research DUniversity of Cambridge ZJohns Hopkins University ËNew York University YUniversity of York QETH Zurich¨ [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] Abstract The noun lexica of many natural languages are divided into several declension classes with characteristic morphological properties.
    [Show full text]
  • Languages in the European Information Society Croatian
    META-NET White Paper Series Languages in the European Information Society Croatian Early Release Edition META-FORUM 2011 27-28 June 2011 Budapest, Hungary The development of this white paper has been funded by the Seventh Framework Programme and the ICT Policy Support Programme of the European Commission under contracts T4ME (Grant Agreement 249119), CESAR (Grant Agreement 271022), METANET4U (Grant Agreement 270893) and META-NORD (Grant Agreement 270899). This white paper is for educators, journalists, politicians, language communities and others, who want to establish a truly multilingual Europe. This white paper is part of a series that promotes knowledge about language technology and its potential. The availability and use of language technology in Europe varies between languages. Conse- quently, the actions that are required to further support research and development of language technologies also differs for each language. The required actions depend on many factors, such as the complexity of a given language and the size of its community. META-NET, a European Commission Network of Excellence, has conducted an analysis of current language resources and technolo- gies. This analysis focused on the 23 official European languages as well as other important regional languages in Europe. The results of this analysis suggests that there are many significant research gaps for each language. A more detailed, expert analysis and as- sessment of the current situation will help maximise the impact of additional research and minimize any risks. META-NET consists of 44 research centres from 31 countries who are working with stakeholders from commercial businesses, gov- ernment agencies, industry, research organisations, software com- panies, technology providers and European universities.
    [Show full text]
  • The Translation Equivalents Database (Treq) As a Lexicographer’S Aid
    The Translation Equivalents Database (Treq) as a Lexicographer’s Aid Michal Škrabal, Martin Vavřín Institute of the Czech National Corpus, Charles University, Czech Republic E-mail: [email protected], [email protected] Abstract The aim of this paper is to introduce a tool that has recently been developed at the Institute of the Czech National Corpus, the Treq (Translation Equivalents) database, and to explore its possible uses, especially in the field of lexicography. Equivalent candidates offered by Treq can also be considered as potential equivalents in a bilingual dictionary (we will focus on the Latvian–Czech combination in this paper). Lexicographers instantly receive a list of candidates for target language counterparts and their frequencies (expressed both in absolute numbers and percentages) that suggest the probability that a given candidate is functionally equivalent. A significant advantage is the possibility to click on any one of these candidates and immediately verify their individual occurrences in a given context; and thus more easily distinguish the relevant translation candidates from the misleading ones. This utility, which is based on data stored in the InterCorp parallel corpus, is continually being upgraded and enriched with new functions (the recent integration of multi-word units, adding English as the primary language of the dictionaries, an improved interface, etc.), and the accuracy of the results is growing as the volume of data keeps increasing. Keywords: InterCorp; Treq; translation equivalents; alignment; Latvian–Czech dictionary 1. Introduction The aim of this paper is to introduce one of the tools that has been developed recently at the Institute of the Czech National Corpus (ICNC) and which could be especially helpful to lexicographers: namely, the Treq translation equivalents database1.
    [Show full text]
  • Overabundance in Croatian Dual-Class Verbs FLUMINENSIA, God
    Tomislava Bošnjak Botica, Gordana Hržica, Overabundance in Croatian dual-class verbs FLUMINENSIA, god. 28 (2016), br. 1 Tomislava Bošnjak Botica, Gordana Hržica OVERABUNDANCE IN CROATIAN DUAL-CLASS VERBS dr. sc. Tomislava Bošnjak Botica, Institut za hrvatski jezik i jezikoslovlje, [email protected], Zagreb dr. sc. Gordana Hržica, Edukacijsko-rehabilitacijski fakultet, [email protected], Zagreb izvorni znanstveni članak UDK 811.163.42’367.625 rukopis primljen: 5. 4. 2016.; prihvaćen za tisak: 21. 6. 2016. Croatian verbal inflection morphology is typically described using verb class distinctions. The number of classes differs among approaches, but the basic criterion for class division is the presence or absence and the type of suppletion in verb stems. Generally, one verb belongs to one inflectional class or paradigm only. However, some verbs belong to two classes, i.e. they have two parallel sets of stems. In such dual-class verbs, one infinitive form is realizable in two present forms in all cells within a class, i.e. there is an overabundance (Thorton 2011). Inevitably, one of the stem forming paradigms is a class with categorial suppletion. The present stem of a categorial suppletion class has a greater phonological distance from the infinitive stem than the present stem of the other class. Using a different terminology one class can be described as more transparent, while the other is less transparent (more opaque) in forming the present stem. This study attempts to present overabundance in dual-class verbs and to determine whether competition in such forms can be explained by their tendency to conform to one default class or by other factors, specifically, by the phonological distance between the two paradigms of dual-class verbs.
    [Show full text]
  • Predicting Declension Class from Form and Meaning
    Research Collection Conference Paper Predicting Declension Class from Form and Meaning Author(s): Williams, Adina; Pimentel, Tiago; Blix, Hagen; McCarthy, Arya D.; Chodroff, Eleanor; Cotterell, Ryan Publication Date: 2020-07 Permanent Link: https://doi.org/10.3929/ethz-b-000462306 Originally published in: http://doi.org/10.18653/v1/2020.acl-main.597 Rights / License: Creative Commons Attribution 4.0 International This page was generated automatically upon download from the ETH Zurich Research Collection. For more information please consult the Terms of use. ETH Library Predicting Declension Class from Form and Meaning Adina Williams@ Tiago PimentelD Arya D. McCarthyZ Hagen BlixË Eleanor ChodroffY Ryan CotterellD;Q @Facebook AI Research DUniversity of Cambridge ZJohns Hopkins University ËNew York University YUniversity of York QETH Zurich¨ [email protected], [email protected], [email protected], [email protected], [email protected], [email protected] Abstract The noun lexica of many natural languages are divided into several declension classes with characteristic morphological properties. Class membership is far from deterministic, but the phonological form of a noun and its + meaning can often provide imperfect clues. Here, we investigate the strength of those clues. More specifically, we operationalize “strength” as measuring how much informa- Figure 1: The conditional entropies (H) and mutual in- tion, in bits, we can glean about declension formation quantities (MI) of form (W ), meaning (V ), class from knowing the form and meaning and declension class (C), given gender (G) in German of nouns. We know that form and mean- and Czech. ing are often also indicative of grammatical gender—which, as we quantitatively verify, can itself share information with declension not are few in number.
    [Show full text]
  • Better Web Corpora for Corpus Linguistics and NLP
    Masaryk University Faculty of Informatics Better Web Corpora For Corpus Linguistics And NLP Doctoral Thesis Vít Suchomel Brno, Spring 2020 Masaryk University Faculty of Informatics Better Web Corpora For Corpus Linguistics And NLP Doctoral Thesis Vít Suchomel Brno, Spring 2020 Declaration Hereby I declare that this paper is my original authorial work, which I have worked out on my own. All sources, references, and literature used or excerpted during elaboration of this work are properly cited and listed in complete reference to the due source. Vít Suchomel Advisor: Pavel Rychlý i Acknowledgements I would like to thank my advisors, prof. Karel Pala and prof. Pavel Rychlý for their problem insight, help with software design and con- stant encouragement. I am also grateful to my colleagues from Natural Language Process- ing Centre at Masaryk University and Lexical Computing, especially Miloš Jakubíček, Pavel Rychlý and Aleš Horák, for their support of my work and invaluable advice. Furthermore, I would like to thank Adam Kilgarriff, who gave me a wonderful opportunity to work for a leading company in the field of lexicography and corpus driven NLP and Jan Pomikálek who helped me to start. I thank to my wife Kateřina who supported me a lot during writing this thesis. Of those who have always accepted me and loved me in spite of my failures, God is the greatest. ii Abstract The internet is used by computational linguists, lexicographers and social scientists as an immensely large source of text data for various NLP tasks and language studies. Web corpora can be built in sizes which would be virtually impossible to achieve using traditional corpus creation methods.
    [Show full text]