Insight Into the Slovak and Czech Corpus Linguistics

INSIGHT INTO THE SLOVAK AND CZECH CORPUS LINGUISTICS Editor MÁRIA ŠIMKOVÁ Bratislava 2006 VEDA Publishing House of Slovak Academy of Sciences Ľudovít Štúr Institute of Linguistics of the SAS © Silvie Cinková, Radovan Garabík, Lucia Gianitsová-Ološtiaková, Markus Giger, Jan Hajič, Eva Hajičová, Veronika Kolářová-Řezníčková, Michal Křen, Markéta Lopatková, Karel Oliva, Jarmila Panevová, Vladimír Petkevič, Věra Schmiedtová, Miloslava Sokolová, Mária Šimková (ed.), Zdeňka Urešová Zborník Insight into Slovak and Czech Corpus Linguistics bol pôvodne koncipovaný ako súbor prednášok, ktoré odzneli od 4. 11. 2002 do 28. 6. 2004 na pôde oddelenia Slovenského národného korpusu v Jazykovednom ústave Ľ. Štúra Slovenskej akadémie vied v Bratislave (http://korpus. juls.savba.sk/activities/). Nie všetci prednášajúci však dodali príspevok na publikovanie, iní svoj príspevok poskytli v doplnenej a prepracovanej verzii. Viaceré prednášky technického charakteru a témy workshopov ani neboli plánované na vydanie v textovej podobe, niektoré časti z tých, ktoré sa v zborníku nachádzajú, sú už v čase vydania prekonané. Naopak, do zborníka pribudli štúdie o budovaní Slovenského národného korpusu a jeho lingvistickej (najmä morfologickej) anotácii. V tomto zložení predstavuje publikácia Insight into Slovak and Czech Corpus Linguistics vybraný súbor štúdií o stave korpusov a možnostiach i výsledkoch korpusovej lingvistiky v Slovenskej republike a v Českej republike v prvej polovici prvého desaťročia 21. storočia. Proceedings entitled Insight into Slovak and Czech Corpus Linguistics were to have been a collection of lectures that had been delivered since November 4th 2002 until June 28th 2004 at the department of the Slovak National Corpus of the Ľudovít Štúr Institute of Linguistics of the Slovak Academy of Sciences in Bratislava (http://korpus.juls.savba.sk/activities/). However, not every author had handed in his or her paper for publishing, others had sent them enlarged and rewritten. Several lectures being of more technical character as well as subject matter of workshops had not been even intended for publishing in print. Some parts of those, which are included into proceedings, are recently outdated. On the contrary, proceedings were enriched with papers on building of the Slovak National Corpus and on its linguistic (especially morphological) annotation. The content of the publication Insight into Slovak and Czech Corpus Linguistics represents a collection of selected papers reflecting the state and possibilities of corpora as well as the outcomes of Corpus Linguistics in the Slovak and Czech Republic in the first half of the 21st century. © Ľudovít Štúr Institute of Linguistics of the SAS, Slovak National Corpus, 2006 ISBN 80-224-0880-8 TABLE OF CONTENTS Věra Schmiedtová Corpora and Dictionaries 7 Michal Křen Frequency Dictionary of Czech: A Detailed Processing Description 16 Vladimír Petkevič Reliable Morphological Disambiguation of Czech: a Rule-Based Approach Is Necessary 26 Karel Oliva Discovering and Employing Ungrammaticality 45 Jan Hajič Complex Corpus Annotation: The Prague Dependency Treebank 54 Eva Hajičová Towards the Underlying Structure Annotation of a Large Corpus of Texts 74 Markéta Lopatková – Jarmila Panevová Recent Developments in the Theory of Valency in the Light of the Prague Dependency Treebank 83 Zdeňka Urešová Verbal Valency in the Prague Dependency Treebank from the Annotator’s Viewpoint 93 Silvie Cinková – Veronika Kolářová Nouns as Components of Support Verb Constructions in the Prague Dependency Treebank 113 Markus Giger On the Delimitation of Analytic Verbal Forms 140 Mária Šimková Slovak National Corpus – History and Current Situation 151 Radovan Garabík Processing XML Text with Python and ElementTree – a Practical Experience 160 Lucia Gianitsová Morphological Analysis of the Slovak National Corpus 166 Miloslava Sokolová Options for the Generation of a Corpus-Based Slovak Morphology (as Part of Corpus Morphosyntax) 179 Corpora and dictionaries VĚRA SCHMIEDTOVÁ This text was developed and has been used as a basic study material for courses on corpus linguistics taught at the Charles University in Prague over the last couple of years. In addition, it was presented at a workshop in 2003 at the Institute for the Slovak National Corpus in Bratislava. INTRODUCTION It is sometimes claimed that we are experiencing the age of dictionaries. In recent years, many new lexicographical organizations and institutes have organised specialized conferences where lexicologists from all over the world come together to evaluate and discuss theoretical as well as practical issues concerning their field of expertise. (There are such conferences as Euralex, Afrilex, Asialex, and DSNA – lexicological organizations of North America). Also, many books and papers whose focus is lexicology have been published. For example, a very important piece of work is the Encyclopaedia of Lexicography, which was edited by Franz Josef Hausmann and to which 349 collaborators contributed their work. New specialized lexicographical journals have started to appear (Dictionaries, Lexicographica, International Journal of Lexicography), etc. This boom has mainly affected the English-speaking countries, especially Great Britain. Naturally, other countries can learn from this turbulent development in documenting their languages, and can attempt to use the results already achieved. My contribution pays attention to monolingual dictionaries, and addresses the current language situation in the Czech Republic compared with the situation in Great Britain. After several years of improving conditions, modern lexicology is pausing for breath and all the data that have been collected and made ready by the Czech National Corpus are now ready to be processed and analysed. It is possible to say that, since the political changes in 1989, the Czech lexicology situation is still fragile; nevertheless, it is not as hopeless as previously was the case. THE TRADITIONAL WAY OF DEVELOPING A DICTIONARY Prior to the use of computers, it was necessary to collect a large amount of language material in order to compile a new dictionary. The basic prerequisite method was excerption. That is, recording individual words and collocations in their respective contexts of occurrence onto excerptions slips. For more detail, see below. HISTORY OF MODERN CZECH ACADEMIC DICTIONARIES We will not review all the dictionaries written in recent periods in the Czech language sphere, but instead focus on a few academic dictionaries, pertaining to Czech vocabulary from 1870 to 1978, whose origins goes back to the first years of the 20th century. 7 1 REFERENCE DICTIONARY OF THE CZECH LANGUAGE 19351957 PS 9 VOLUMES The lexicological archive where all the excerption slips are stored is located at the Institute for the Czech language in Prague. The archive has been operating since 1911 although preparations started in 1906. The main initiator of the entire project was Mr. František Pastrnek, who also became the president of the lexicological and dialectological committee of the Czech Academy of Science in 1911. In the same year, the Office of the Dictionary of the Czech language was established and, thereby, the first rules for scientific excerptions. The Office was situated in one room of the Hlávkův Palace in Jungmannova Street in Prague. The sole employee was an internal secretary by the name of František Trávníček. Soon afterwards, the Bureau expanded with new members from the ranks of high school teachers and university students. Interestingly, until the Bureau was changed into an academic institution, it was supported by the Ministry of Education by making high school teachers available for work on a new dictionary. The future dictionary was based on excerptions from Czech vocabulary from 1770. Initially, so-called “total excerption” was applied, i.e. every word in a given text was documented in all its contextual occurrences, collocations, and idioms. A word in its dictionary form was the header of an excerption slip. That means that declined or conjugated words were to be found in nominative singular or in infinitive in the header; other possible collocations, further sufficient context where the form of a given rdwo was underlined, and finally bibliographical information regarding the occurrence of the word were put in the commentary line. In 1917, a second improved edition of the excerption rules appeared. The total excerption style was enhanced by a partial excerption and by specialized excerption that was dedicated to capturing linguistic peculiarities. An example of an excerption slip: Authors of books selected for excerption were supposed to be “the good authors”. Thus, every year, between 10 and 15 books of were chosen. Subsequently, filters were set up. These were lists of words and their meanings that were already well documented. For these words, no additional excerption was required. In addition, for each period, a significant piece of work was singled out. Excerpts from such books formed the basis for a filter. For example, the novel Babička (Grandmother) by Božena Němcová was selected for the period 1854–1878. 8 The actual lexicographical work began in 1932. Originally, the idea was to publish a thesaurus of the Czech language, but in the end this plan was changed and it was decided to create the reference dictionary of the Czech language. The focus was on the current Czech language. For the description, excerpted data collected since 1870 were used. Older language material was used in

Insight Into the Slovak and Czech Corpus Linguistics

CLIB 2016 Proceedings

Collection of Usage Information for Language Resources from Academic Articles

Inflectional Suffix Priming in Czech Verbs and Nouns

Conference Abstracts

Morphological Doublets in Croatian: a Multi-Methodological Analysis

Berkeley Linguistics Society

Predicting Declension Class from Form and Meaning

Languages in the European Information Society Croatian

The Translation Equivalents Database (Treq) As a Lexicographer’S Aid

Overabundance in Croatian Dual-Class Verbs FLUMINENSIA, God

Predicting Declension Class from Form and Meaning

Better Web Corpora for Corpus Linguistics and NLP