
Russ Linguist (2012) 36:91–119 DOI 10.1007/s11185-011-9083-x

Morphosyntactic variation and syntactic constructions in Czech nominal declension: corpus frequency and native-speaker judgments

Морфосинтаксическая вариативность и синтаксические конструкции в склонении чешских существительных: частотность в корпусе и оценки носителей языка

Neil Bermel · Luděk Knittl

Published online: 3 January 2012 © Springer Science+Business Media B.V. 2012

Abstract Data from the Czech National Corpus and a large-scale survey of acceptability judgments are used to investigate the scope of morphosyntactic variation in two cases (genitive singular and locative singular) of a Czech nominal declension pattern. The syntactic construction in which a form is found is shown to have a significant interaction with its frequency in the corpus and with its acceptability rating. We conclude that the pattern of acceptability preferences lends support to the entrenchment hypothesis and in general to emergentist approaches to language.

Аннотация В настоящей статье рассматриваются отношения между данными из Национального Корпуса чешского языка и широким опросом оценки языковой приемлемости. Целью работы является рассмотрение масштабов морфосинтаксической вариативности в двух чешских падежах (в родительном и локативном падежах единственного числа). Согласно результатам нашего анализа, синтаксическая конструкция, в которой имеется данная форма, состоит в тесном взаимодействии с ее частотностью в корпусе и с оценкой ее приемлемости. Таким образом, общая модель оценок приемлемости подтверждает гипотезу об «усилении» употребляемости более частых форм и в целом сходится с так называемыми «эмергентными» подходами к языку, т.е. с такими подходами, согласно которым созидание языковых структур происходит в ходе освоения языка.

N. Bermel · L. Knittl
Department of Russian and Slavonic Studies, School of Modern Languages and Linguistics, University of Sheffield, Sheffield, UK
e-mail: n.bermel@sheffield.ac.uk

1 Introduction1

Czech nominal declension patterns present a formidable amount of variation. Not only do some nouns exhibit variation between patterns, but within patterns there is often a choice of desinential morphs. The entire system is thus a fertile area for considering how we study variation, and how frequency information from large-scale annotated text databases (corpora) relates to the acceptability of forms for native speakers.2

The question is an important one, because we now possess a wealth of corpus data. In Czech, which was the first Slavic language to have a large tagged corpus, scholars have been mining corpora for information about the relative proportions of competing forms.

In an earlier paper (Bermel and Knittl forthcoming), we used the results of an acceptability study to examine the overall relationship between corpus data and acceptability judgments, arguing in the process that there is a significant relationship between them, but that only certain kinds of corpus data allow us to generalize about the ratings native speakers are likely to give one or another form.

The current contribution attempts to answer several questions about the influence of syntactic and non-linguistic factors on acceptability judgments:

1. What do corpus data tell us about the scope of morphosyntactic variation in two Czech cases (genitive singular and locative singular), where this variation seems to be persistent and is not predictable from simple rules?
2. Does the syntactic construction in which a form appears play a role in influencing people’s evaluation of variant forms?
3. If so, what model best explains the distribution of forms in the syntactic environments studied?

2 Czech morphosyntax

Modern Czech inherits the full complement of Slavic nominal paradigms. A degree of opacity in the assignment of nouns to particular paradigms, at least from a synchronic perspective (due to phonological changes over the last thousand years), is characteristic of Czech, as is significant variation within paradigms, where we find that the case endings in use depend on the word chosen. The variation we are examining falls within the so-called masculine hard inanimate pattern. This paradigm arises, as elsewhere in Slavic, through the merger of the early Slavic u-stem and o-stem paradigms, which had distinct endings in a number of cases. The merger meant that successor classes had variant endings at their disposal in certain cases, as detailed in Brown (2007); in Czech the endings from the smaller, more peripheral u-stem class have in some instances spread and become productive markers of the successor classes (masculine hard inanimate and masculine hard animate). The shape of the paradigm can be seen in Table 1:

1 Data collection and analysis for this article was funded under British Academy research grant SG-50275. The authors are grateful to Ewa Dąbrowska, Dagmar Divjak, Jean Russell and Marcin Szczerbiński for their assistance and advice at various stages during this project.

2 Both terms, ‘acceptability’ and ‘grammaticality’, are used in the literature, as Schütze (1996) notes. ‘Grammaticality’ implies that the judgment is grounded in syntactic wellformedness (Featherston 2005, 673–674; Bader and Häussler 2009, 4–6), and it tends to imply a binary state of affairs (grammatical/ungrammatical), although some scholars take pains to stress that there is a gradient from grammaticality to ungrammaticality (e.g. Kempen and Harbusch 2008). Because in this particular study we wish to avoid the presumption that speakers are necessarily judging a matter of grammar as opposed to one of usage, we follow studies such as McKoon and Macfarland (2000) in preferring ‘acceptability’.

Table 1 The masculine hard inanimate paradigm in Czech (pattern hrad ‘castle’)

Case       Endings  Primary pattern  Alternate pattern 1    Alternate pattern 2
Nom. sg.   Ø        hotel-Ø
Gen. sg.   -u/-a    hotel-u          svět-a ‘world’         jazyk-u, jazyk-a ‘language’
Dat. sg.   -u       hotel-u
Acc. sg.   Ø/-a     hotel-Ø          šlofík-a ‘nap’         buřt-Ø, buřt-a ‘wiener’
Voc. sg.   -e/-u    hotel-e          zámk-u ‘stately home’
Loc. sg.   -u/-ě    hotel-u          ovčín-ě ‘sheepfold’    hrad-u, hrad-ě ‘castle’
Inst. sg.  -em      hotel-em

In the contemporary language, assignment of certain case endings is conditioned by phonemic environment or membership in a lexical subclass. For example, use of the -u ending in the vocative is conditioned by a stem ending in a velar (hotel-e but zámk-u). In the accusative, the appearance of a so-called ‘facultative animate’ form has been linked to groups of semantically similar nouns (foods, dances, card games, drinks, cars), to expressivity or to foreignness (see e.g. Šulc 2001).

In the genitive and locative singular, however, we find a different sort of distribution. In both cases, the ending found with the vast majority of nouns is -u. There is a much smaller group of nouns that exclusively use the old o-stem endings -a and -ě respectively. A third group of nouns show variation between the two endings available for the paradigm, cf. (1):

(1) Genitive and locative singular endings in the hrad paradigm in the SYN2005 corpus
a) Genitive singular in -u only: hrad-u
b) Genitive singular in -a only: svět-a
c) Genitive singular in both -a and -u: jazyk-a/jazyk-u
d) Locative singular in -u only: hotel-u
e) Locative singular in -ě only: ovčín-ě
f) Locative singular in both -ě and -u: hrad-ě/hrad-u3

Descriptions of Czech grammar rarely dwell on this sort of variation. It is mentioned in prescriptive manuals, but most of these are, at best, based on previous handbooks and dictionaries.4 These in turn are based either on intuition or on material from excerpt files, which contain a mixture of citations from newspapers and literary sources over the past century. Three articles summarize the explanations found in such manuals: Cummins (1995), Rusínová (1992) and Sedláček (1982).

As for attempts to ground these findings in competence or performance, three further pre-corpus studies stand out. Klimeš (1953) used excerption from books, while Kasal (1992) took soundings into native speakers’ usage and Bermel (1993) asked native speakers to evaluate variant forms. Later studies (Bermel 2004; Štícha 2009) have looked at the extent of variation in the Czech National Corpus.5

Grammars of Czech frequently suggest that there are syntactic and semantic factors conditioning the choice of forms in the genitive and locative singular. The data for this are not cited; assertions tend to be handed down from one manual to the next. The 1986 Academy Grammar Mluvnice češtiny (Petr 1986, 305) gives the following descriptions:

In the gen. sg. the declension formant is -u or -a, in places the doublet -u/-a. In the contemporary standard language the majority of inanimate masculine nouns have the formant -u. Disproportionately fewer inanimate masculine nouns have the ending -a (characteristic of the hard subtype of animate nouns) or the doublet -u/-a. The distribution of forms depends on various factors: word-formational, syntactic, semantic, sometimes on the phonemic composition of the word as well. In morphophonological terms the formants -u and -a are restricted to the environment following a non-soft consonant (after soft consonants, as a rule, only -e appears).

In a further case, the loc. sg., the polymorphous formants are -u and -ě(e), but their distribution is not firmly delineated, as distinct from the distribution of gen. sg. formants. For the same words and in the same conditions both forms are used; morphological variation is found here to a significant degree. The syntactic angle can provide a certain motivation for the more frequent use of one of the variants, i.e. whether it is an object [of the verb –NB/LK] or an adverbial function; sometimes the provenance of the word has an effect (e.g. borrowed words as a rule take the ending -u). These factors aside, differing usage in daily speech in Bohemia (-u) and Moravia (-ě(e)) also has an effect, and this appears in standard discourse in various ways. (Translation NB/LK)

The 1995 grammar Příruční mluvnice češtiny (Karlík, Nekula and Rusínová 1995, 252f.) describes the situation in similar terms, although it asserts the influence of syntactic and dialectal factors more strongly than its predecessor did:

3 This problem is arguably an artifact of convention, i.e. of the fact that we accept the traditional assignment of all these nouns to a single paradigm, and perhaps the problem would disappear if this conventional unit were abolished in favor of smaller units within which the availability of endings was clearly marked. Bermel (2010, 138–139) used corpus data to explore one alternative: designating a series of subparadigms, based on which ending or endings are found with nouns in the genitive, accusative and locative singular. Such an approach yields a confusing 15 different subparadigms. For any subparadigms where multiple endings are found, the proportions of one ending to another can run from 1:99 all the way to 99:1, casting doubt on the uniform character of the subparadigm. The alternative explanation thus does not yield any further clarity of description.

4 A happy exception here is Cvrček et al. (2010), which cites corpus data. It does not address questions of how these forms are used, except to say that the locative forms in -ě occur in ‘frequent set phrases’.

5 All of these studies have methodological aspects that require consideration. Klimeš (1953) is based on a survey of 40 works of literature written after 1920, and it provides a first glimpse into literary usage in the interwar and immediate postwar period. The study found examples of 400 nouns from this paradigm and primarily records where variation was found and where it was not. In some instances of variation it suggests possible rationales for the choice of form, but it is unclear what sort of data this is based on; only single examples are cited. Kasal (1992) conducted a gap-filling study on philology students and teachers. His data are valuable, but the particular set of respondents and the structure of the study limit its generalizability. Bermel (1993) is a small-scale study of a limited group of nouns, focusing on their immediate syntactic context; the questionnaire structure could have been more varied, with more attention paid to randomization. Of the corpus studies, Bermel (2004) again covers only a small subset of nouns, while Štícha (2009) is far more comprehensive, but combines data from two representative corpora and one much larger non-representative corpus.

The endings -u/-a alternate in the gen. sg. of an indeterminate number of masculine nouns of the hrad type, primarily according to the syntactic function that the word fulfils in the sentence. Nouns with the adverbial meaning of place (or, less frequently, of time or manner) are often connected with the ending -a. As a genitive object (or in other meanings) the same nouns tend to have the ending -u: do kalicha/u ‘into the chalice’ [place] – dotkl se kalichu ‘he touched the chalice’ [object], do roka ‘within a year’ [time] – dožil se jednoho roku ‘he lived a year’ [object], do kouta ‘into the corner’ [place] – nebylo jediného koutu ‘there was not even a corner’ [negation], do rybníka ‘into the pond’ [place] – nebylo jednoho rybníku ‘there was not a single pond’ [negation].

In the loc. sg. there exists competition between the endings -u/-ě. The ending -ě is the rule with nouns in -ov, -ín (v Kyjově ‘in Kyjov’, v Kojetíně ‘in Kojetín’) and with words ending similarly: na ostrově, na venkově, v komíně, Libušíně, Londýně ‘on the island, in the country, in the chimney, in Libušín, in London’. The ending -ě requires alternations for the consonants g, h, ch, k, r, d, t, n; therefore with nouns ending in them the ending -u is more frequent: v Duisburgu, v Zábřehu (v Zábřeze is found only locally), v kožichu, v prachu, na trhu, v tichu, v jazyku//v jazyce, na potoku//na potoce, v Hamburku, na Mělníku//na Mělníce, ve Žďáru//ve Žďáře, v papíru//na papíře, v Berouně ‘in Duisburg, in Zábřeh, in a fur coat, in the dust, at the market, in silence, in the language, on the stream, in Hamburg, in Mělník, in Žďár, in paper/on paper, in Beroun’, etc. The prevailing ending is -u.

Variability in usage is influenced by meaning, just as in the gen. sg.: in adverbial meanings of place the ending -ě often appears; in objective meaning the ending -u appears: na rybníce – o rybníku ‘on the pond – about the pond’, v Mnichově – o Mnichovu ‘in Munich – about Munich (more frequently in reference to the historical event)’, v Betlémě – o betlému ‘in Bethlehem – about the nativity crèche’, v chrámě – o chrámu ‘in the cathedral – about the cathedral’, na hradě – o hradu ‘in the castle – about the castle’, v regále – o regálu ‘on the shelf – about the shelf’, etc. Through endings different meanings can be differentiated, e.g. the meaning of place from the meaning of time, concrete nouns from abstract ones: v průjezdě (domu) × při průjezdu kolony městem ‘in the passage (through the building) × during the convoy’s passage through the city’, na západě × po západu slunce ‘in the west × after sunset’, na východě × po východu slunce ‘in the east × after sunrise’, ve výkladě × při výkladu ‘in the shop window × during the explanation’ etc. The variants with -ě are more frequent in Moravia than in Bohemia. (Emphasis original; translation and clarifying notes in square brackets – NB/LK)

Two studies look at the details of this variation. Bermel (1993, 2004) examined variation in the locative with respect to syntactic context. The findings brought to light evidence from both corpus data and native-speaker acceptability judgments that, even in the absence of polysemy as above (e.g. západ ‘west/sunset’), syntactic contexts that are more ‘canonical’ for the locative case (prepositions of location, simple PPs consisting of preposition + noun) show more frequent use of and higher acceptability for the -ě ending, while ‘non-canonical’ contexts (prepositions showing subject matter or duration, ‘empty’ locative prepositions dictated by the verb, and PPs consisting of preposition + one or more adjectives + noun) show more frequent use of and higher acceptability for the -u ending.

3 Corpus data and the Czech National Corpus

Over the past fifteen years a series of corpora have been authored by the Czech National Corpus Institute (www.korpus.cz) and affiliated bodies. The first was released in 2000, making Czech the first Slavonic language to have a large-scale corpus publicly available.

The Czech National Corpus (CNC) is the common name given to this series of annotated, tagged databases of various sizes and shapes. The core consists of three mixed-genre corpora of contemporary published Czech texts, named SYN2000, SYN2005 and SYN2010 for their release dates. Each has 100 million tokens (word forms or symbols). There are also smaller corpora of spoken Czech (PMK, BMK, ORAL2006, ORAL2008, with 300k–1m tokens), a diachronic corpus (DIAKORP), large-scale corpora of journalistic texts (SYN2006PUB, SYN2009PUB, with 300m–700m tokens), a parallel-language corpus (INTERCORP) and a number of single-author corpora.

In creating a searchable text database, corpus authors typically try to ensure that the data they choose for inclusion give as full and accurate coverage as possible, whether of a single text, a single author, a single period, a genre or the production of an entire language. The simplest way to ensure accuracy is to provide 100% coverage of the text type in question, i.e. to include all texts fitting the description. This approach is feasible when the set of extant texts is closed and relatively small, for example the works of a single author, or a period in the (usually distant) past. It is impractical for any broader overview of language use, and so the question arises of how in those cases to select texts for inclusion.

Certain corpora are built to be balanced or representative. A balanced corpus includes texts chosen according to a certain rationale, usually so as to show a distribution across a range of types, periods, genres, authors, or speaker characteristics. A representative corpus similarly offers a range of texts whose distribution and selection has been guided by information about language use, making the corpus a microcosm of that segment of language production.

The ORAL2008 corpus within the CNC is a good example of a balanced (vyvážený) corpus. It contains excerpts from the speech of a variety of native speakers of Czech, taking in different regions, ages, educational attainment and genders. It has been balanced to ensure that there is sufficient data from each ‘cell’, but it does not claim that the proportions in which these texts appear bear any attested relation to the proportions of these speakers in the speech community.

The SYN2000, SYN2005 and SYN2010 corpora within the CNC are examples of representative corpora. Their creation was preceded by soundings into the reading habits of the Czech public in the 1990s and early in the new millennium (on the earlier soundings see e.g. Čermák et al. 1997, 121f.; on the later ones see Králík and Šulc 2005). The types of texts included and the weightings of the various genres are said to reflect the findings of this research. In this way, the corpora are said to offer a snapshot of the world of published texts surrounding the average Czech citizen.

Data from these representative corpora clearly tell us something about the sort of language produced across the domains of published Czech texts. It is less immediately clear how much they can contribute to our knowledge of how Czechs assess and evaluate the linguistic options open to them (this is the local instantiation of the performance vs. competence debate). We decided to investigate this, using data from the SYN2005 corpus as our starting point.

The choice of SYN2005 was dictated by a number of factors. Its composition is significantly different from that of the earlier SYN2000 corpus, containing a more even spread of text types: 40% fiction and other literature (i.e. biography, memoirs and popular history), 33% journalistic texts, and 27% technical and specialist writing.6 It represents a more compact time span, with fewer texts from before 1990 that might not reflect contemporary usage, and as compared to SYN2000, its journalistic texts are more evenly balanced by source and date (see www.korpus.cz). The available corpora of spoken Czech (PMK, BMK, ORAL2006, ORAL2008) do not make claims to representativity, and proved to be too small (<1m tokens) to give the kind of fine-grained data on variation in individual forms that we needed. We therefore confined our study to the use of SYN2005.7

The three SYN corpora have a further advantage for linguistic research: they are tagged and lemmatized. Through the use of automatic tagging and lemmatizing software, each word form in the corpus is aligned to a head word or lemma, equivalent to its “dictionary entry” (e.g. the infinitive for verbs, the nominative singular for nouns), and is assigned a label or tag describing its grammatical characteristics. The form knihou, for example, has the associated lemma and tag kniha/NNFS7------, meaning: ‘form of the word kniha/noun, non-proper, feminine, singular, instrumental’ (the remaining tag spaces are empty because they are not relevant to this lexeme). Because both tags are attached to the word form, the form is retrieved by searches for the exact form (knihou), the lemma (kniha), or the tag (NNFS7), or any part thereof. Researchers can search and categorize by these attributes, e.g. all examples of a word in whatever form, or examples of a particular case form across a large section of lexemes.
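As an illustration of how a positional tag of this kind can be read, the following minimal Python sketch decodes the first five positions of a lemma/tag pair such as kniha/NNFS7------. The value tables are a partial, assumed subset of the tagset chosen to cover the tags quoted in this article, and the function names decode_tag and parse_annotation are hypothetical; this is not the corpus software itself.

```python
# Illustrative decoder for the first five positions of a positional tag.
# The value tables are an assumed, partial subset, not the full tagset.
POS = {"N": "noun"}
SUBPOS = {"N": "common (non-proper)"}
GENDER = {"F": "feminine", "I": "masculine inanimate",
          "M": "masculine animate", "N": "neuter"}
NUMBER = {"S": "singular", "P": "plural"}
CASE = {"1": "nominative", "2": "genitive", "3": "dative", "4": "accusative",
        "5": "vocative", "6": "locative", "7": "instrumental"}

FIELDS = [("pos", POS), ("subpos", SUBPOS), ("gender", GENDER),
          ("number", NUMBER), ("case", CASE)]


def decode_tag(tag):
    """Describe positions 1-5 of a tag such as 'NNFS7------'."""
    padded = tag.ljust(len(FIELDS), "-")  # tolerate truncated tags
    return {name: table.get(ch, "unknown")
            for (name, table), ch in zip(FIELDS, padded)}


def parse_annotation(annotation):
    """Split a 'lemma/tag' string into the lemma and its decoded tag."""
    lemma, _, tag = annotation.partition("/")
    return lemma, decode_tag(tag)


lemma, features = parse_annotation("kniha/NNFS7------")
print(lemma, features)
```

The same decoder can be applied to the tag prefix NNIS2 used in the corpus queries, which on these assumptions reads as a common noun, masculine inanimate, singular, genitive.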

4 Using the SYN2005 corpus

In the current study, we investigated the variation between two endings in each of two cases: the genitive singular and locative singular of masculine hard inanimate nouns. Our work began with identifying genitive and locative singular forms of inanimate masculine nouns in the SYN2005 corpus that belong to this paradigm. We searched for all tokens with a particular ending whose lemma was a common noun (i.e. the lemma did not begin with a capital letter) and whose tag indicated that it was of the requisite gender, number and case. We then sifted the results manually to remove errors.8

We ‘cleaned up’ the results in the following fashion: For the locative, we examined all examples of all forms. For the genitive, we examined all examples of lexemes found exclusively with -a and of those where -u is in variation with -a. For the vast number of lexemes (over 13,000) that appeared in the genitive with the -u ending only, we examined all lexemes that presented potential anomalies due to their meaning or to what appeared to be errors in the lemmatized forms.9 We also looked at lexemes where overlap with a similar noun of a different declension pattern may cause tagging and lemmatizing errors, e.g. van (m.) ‘van, gust’ vs. vana (f.) ‘bathtub’. We eliminated lexemes erroneously assigned to this group, such as proper nouns (e.g. Vimperk, a town in south Bohemia, frequently gets assigned to the lemma vimperk ‘decorative shield’) and removed animate nouns mistagged as inanimates. In addition, we arranged nouns under one single lemma that had been mistakenly grouped under two or more lemmas, attached misspelled forms to their correctly spelled head words, and reconciled examples with differing hyphenation practice (sex-shop vs. sexshop). Roughly a third of the lexemes were examined in this way.

Certain judgment calls were nonetheless needed, and so we followed these principles:

(1) To decide whether a foreign word should be retained in our counts, we looked at whether it appeared in multiple texts; whether it was used with or without explanatory devices, such as quotation marks or explanations of its meaning, and where and how often it appeared. We gave more weight to appearance in original Czech works, as opposed to translations, and less weight to works that are deliberately stylized to have eccentric language, such as Pan Kaplan má svou třídu stále rád, a well-known translation of Leo Rosten’s The Education of H*Y*M*A*N K*A*P*L*A*N.
(2) We left dialectal forms separate from their standard equivalents, unless the difference was merely one of vowel length.
(3) Because the corpus covers a period of change in the official spelling rules and the spelling of some recent loan words is not stable, some nouns have two or more distinct spellings; the corpus sometimes aligns these under a single lemma, and other times fails to.10 We reconciled all such examples where possible and brought together alternate spellings that represent the same pronunciation for the lemma form.11
(4) We treated certain names, such as car brands, as common nouns, but did not extend this down to the level of individual car models. Much depended on whether the examples were regularly found in lower case, or whether they only appeared in stylized texts.

One frequent problem was that masculine inanimate nouns of very low frequency may be mislemmatized as neuters or feminines. These were picked up to the extent possible, when some other clue (such as a lemma with an improperly inserted fleeting vowel) led us to suspect the word was not in the lemmatizer’s dictionary and thus we needed to look up the target form manually. However, we did not actively pursue this group in other ways, such as trawling all neuter or feminine lemmas to find examples of mistagged masculines.

Targeted searches combining features of the token, lemma and tag thus provide a largely accurate ‘ballpark’ view of the situation. Our manual error correction altered the final numbers, but left the overall proportional picture of the situation unchanged.

6 This difference signals that data on reading habits can evidently be interpreted in widely varying ways.

7 SYN2010 has the same structure but was not available at the time we conducted our research.

8 Sometimes very-low-frequency animate nouns that are not in the automatic tagger’s word list are mistakenly assigned to the inanimate category; this inflates the number of nouns in this group found with the genitive ending -a. There are other sporadic tagging errors, such as incorrect case assignment; these are uncommon in the locative, but somewhat more frequent in the genitive.

9 Such errors are important because they signal that the word is not in the dictionary of lemmas or that a deficiency in the program prevented the token’s correct lemmatization. The possibility of erroneous assignment and tagging rises dramatically in this group.
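The initial search step described in this section (a given ending, a lemma that does not begin with a capital letter, and a tag of the requisite gender, number and case) can be sketched in Python roughly as follows. The Token type, the find_candidates function and the toy sample are hypothetical and serve only to illustrate the filtering logic; the actual searches were run in the corpus interface and followed by manual checking.

```python
import re
from typing import NamedTuple


class Token(NamedTuple):
    form: str
    lemma: str
    tag: str


def find_candidates(tokens, ending, tag_pattern):
    """Keep tokens with the given ending, a lower-case-initial lemma,
    and a tag matching the pattern (anchored at the start of the tag)."""
    tag_re = re.compile(tag_pattern)
    return [
        t for t in tokens
        if t.form.lower().endswith(ending)
        and t.lemma[:1].islower()   # common noun: lemma not capitalised
        and tag_re.match(t.tag)
    ]


# Hypothetical toy sample; the tags are illustrative, not corpus output.
SAMPLE = [
    Token("světa", "svět", "NNIS2"),
    Token("jazyka", "jazyk", "NNIS2"),
    Token("jazyku", "jazyk", "NNIS2"),
    Token("Vimperka", "Vimperk", "NNIS2"),  # proper noun: excluded
    Token("hradě", "hrad", "NNIS6"),
    Token("kniha", "kniha", "NNFS1"),       # wrong gender and case
]

genitive_a = find_candidates(SAMPLE, "a", r"NNIS2")
print([t.form for t in genitive_a])
```

A filter of this kind only narrows the field: a proper noun that has been wrongly assigned a lower-case lemma would still pass it, which is why the manual sifting described above remains necessary.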

5 The scope of variation in the SYN2005 corpus

The -u ending was found with the vast majority of nouns in both cases (upwards of 7k each). Four sets of nouns are found in SYN2005 with the less-common -a and -ě endings:

Genitive -a only: andělík, antisvět, apríl, (k)belík, betl, bolševík, březen, buick, búr, červen, chléb, chlév, chlív, cybersvět, dřívějšek, duben, dvanáctiúhelník, exot, ferbl, frťan, hřbitov, hřebčín, interhelp, jinosvět, joint, kasárník, korovník, kozáček, kravín, krchov, křivisk, kulich, květen, leden, les, lesíček, makrojazyk, makrosvět, malinovník, malosvět, matjes, meďour, metajazyk, mikrosvět, minisvět, neživot, nynějšek, ovčín, oves, pérák, podvečer, poloostrov, posledek, prales, prasvět, předvčerejšek, pseudoživot, rozpadlík, říjen, sejr, srpen, svatvečer, svět, technojazyk, tripl, tupolev, turek, tuzér, valiant, venkov, vizour, všehomír, všesvět, všeživot, vystrkov, zapadákov, zejtřek, žigul, život (N = 79)

Genitive variation -a ≈ -u: see Table 4 below (N = 112)

Locative -ě only: balíkov, bezčas, blbákov, blivajz, dobromysl, dvojzápas, dvojzávod, ekodům, fajrunt, forhont, fous, holobyt, hřebčín, hunt, infokanál, infosvět, jinosvět, kocourkov, masozávod, mezisvět, mikrosvět, minibyt, minisvět, minizápas, monstrkoncert, ordinát, ovčín, pasát, pasport, pér, polosvět, prales, prdelákov, předzápas, pronárod, protikus, pseudosvět, ribstol, rodopis, sólozávod, talón, trojzápas, úbyt, velehrad, vepřín, videosvět, vikariát, vodopis, všesvět, vystrkov, zapadákov, zastrkov (N = 53)

Locative variation -ě ≈ -u: see Table 5 below (N = 392)

The descriptive statistics for these can be seen in Table 2.

10 The record was four such permissible spellings for the chemical compound trichlorethylene: trichloretylen, trichlorethylen, trichloretylén, trichlóretylén. The vowels in question are said to represent a pronunciation intermediate between long and short, regardless of how they are marked, and the th does not differ in pronunciation from t.

11 For example, trekking and treking were counted as a single entry, as were trauler and trawler, trénink and tréning, but batoh and baťoh ‘backpack’ were kept separate.
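The grouping of lexemes into sets with one ending only or with variation between two endings can be sketched as follows. This is a minimal illustration under the assumption that each cleaned token has been reduced to a (lemma, ending) pair; the function name classify_lexemes and the toy observations are hypothetical.

```python
from collections import Counter, defaultdict


def classify_lexemes(observations, minority, majority="u"):
    """Group lemmas by which of two endings they are attested with.

    observations: iterable of (lemma, ending) pairs, one per token.
    Returns a dict mapping each lemma to '-<minority> only',
    '-<majority> only' or 'variation'.
    """
    counts = defaultdict(Counter)
    for lemma, ending in observations:
        counts[lemma][ending] += 1

    result = {}
    for lemma, c in counts.items():
        has_min, has_maj = c[minority] > 0, c[majority] > 0
        if has_min and has_maj:
            result[lemma] = "variation"
        elif has_min:
            result[lemma] = f"-{minority} only"
        elif has_maj:
            result[lemma] = f"-{majority} only"
    return result


# Hypothetical token observations for the genitive singular.
GENITIVE = [("svět", "a"), ("svět", "a"), ("hrad", "u"),
            ("jazyk", "a"), ("jazyk", "u"), ("jazyk", "u")]
print(classify_lexemes(GENITIVE, minority="a"))
```

Note that a single attested token with the other ending is enough to move a lexeme into the variation set under this sketch, so mistagged tokens matter a great deal for low-frequency lexemes.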

Table 2 Lexemes in the genitive and locative singular

Lexeme has     Tokens     Lexemes  Mean frequency  Median frequency
Gen. -u only   1,222,637  11,701   104.50          4
Gen. -a only   116,027    79       1468.70         10
Gen. -a ≈ -u   153,245    112      1368.26         120.5
Loc. -u only   503,750    6,803    74.05           3
Loc. -ě only   494        53       9.32            1
Loc. -ě ≈ -u   316,462    392      807.30          110
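The mean frequencies in Table 2 follow from the token and lexeme counts (tokens divided by lexemes), which the short Python sketch below recomputes; the recomputed values agree with the reported ones to within 0.02. The final lines use invented frequencies, labeled as such, to show why a single very frequent lexeme raises the mean far more than the median.

```python
from statistics import mean, median

# (tokens, lexemes, reported mean) copied from Table 2.
TABLE_2 = {
    "Gen. -u only": (1_222_637, 11_701, 104.50),
    "Gen. -a only": (116_027, 79, 1468.70),
    "Gen. -a ≈ -u": (153_245, 112, 1368.26),
    "Loc. -u only": (503_750, 6_803, 74.05),
    "Loc. -ě only": (494, 53, 9.32),
    "Loc. -ě ≈ -u": (316_462, 392, 807.30),
}


def mean_frequency(tokens, lexemes):
    """Mean number of tokens per lexeme."""
    return tokens / lexemes


for label, (tokens, lexemes, reported) in TABLE_2.items():
    computed = mean_frequency(tokens, lexemes)
    print(f"{label}: computed {computed:.2f}, reported {reported:.2f}")

# Hypothetical frequencies: one outlier dominates the mean,
# while the median stays at the typical value.
toy = [1, 2, 3, 4, 20_000]
print(mean(toy), median(toy))
```

The medians in the table cannot be recomputed in this way, since they depend on the full per-lexeme frequency list rather than on the totals.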

From the descriptive statistics the following facts stand out:

• The -u ending is employed in both cases with the vast majority of nouns.
• Only a few nouns of low frequency exclusively take -ě in the locative case.
• The number of nouns that exclusively take -a in the genitive is small, but they have a very high mean frequency and a higher median than nouns in -u, indicating that many of them form part of the language’s core vocabulary.12
• The number of lexemes exhibiting variation between the two endings in either case is not high, but they have a very high mean frequency and median, indicating that they are among the most frequently used nouns in the language.
• There is considerably greater variation in the locative singular than in the genitive singular.

The distribution of forms in lexemes with variation is shown in Table 3. Genitive forms in -a thus constitute less than 12% of all genitive forms, whereas locative forms in -ě constitute almost a third of all forms, despite occurring in only a limited portion of the lexical stock.13

12 We report both mean and median because mean is the most commonly known measure, but it can be distorted by a few ‘outliers’ with frequencies in the tens of thousands; the median is less susceptible to this, as it simply reports the point at which half the lemmas retrieved had higher frequencies and half had lower frequencies.

13 As mentioned in Sect. 4 above, the figures are as close as we could get using a combination of highly targeted global searches and manual correction, but should not be regarded as definitive; we acknowledge that there will be mistagged and/or mislemmatized items that were not uncovered during our manual checks.

Table 3 Genitive and locative singular endings in the masculine hard inanimate paradigm

Genitive -a          Tokens     %       Locative -ě          Tokens   %
-a only              116,027    7.78    -ě only              494      0.06
-a in variation      61,709     4.14    -ě in variation      254,704  31.03
Total -a             177,736    11.91   Total -ě             255,198  31.09

Genitive -u                             Locative -u
-u only              1,222,637  81.95   -u only              503,750  61.38
-u in variation      91,536     6.14    -u in variation      61,758   7.52
Total -u             1,314,173  88.09   Total -u             565,508  68.91

All genitive forms   1,491,909  100     All locative forms   820,706  100
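The percentages in Table 3 are each category's share of all forms in the respective case. The following Python sketch recomputes them from the token counts in the table; the function name shares is hypothetical and the calculation is the only thing it is meant to show.

```python
# Token counts copied from Table 3.
GENITIVE = {"-a only": 116_027, "-a in variation": 61_709,
            "-u only": 1_222_637, "-u in variation": 91_536}
LOCATIVE = {"-ě only": 494, "-ě in variation": 254_704,
            "-u only": 503_750, "-u in variation": 61_758}


def shares(counts):
    """Return the total and each category's percentage of that total."""
    total = sum(counts.values())
    return total, {label: 100 * n / total for label, n in counts.items()}


for case, counts in (("genitive", GENITIVE), ("locative", LOCATIVE)):
    total, pct = shares(counts)
    print(f"All {case} forms: {total:,}")
    for label, value in pct.items():
        print(f"  {label}: {value:.2f}%")

# Combined share of the less common ending in each case.
gen_total, gen_pct = shares(GENITIVE)
loc_total, loc_pct = shares(LOCATIVE)
print(f"Total -a: {gen_pct['-a only'] + gen_pct['-a in variation']:.2f}%")
print(f"Total -ě: {loc_pct['-ě only'] + loc_pct['-ě in variation']:.2f}%")
```

The two combined shares correspond to the figures cited in the text: just under 12% for genitive -a and almost a third for locative -ě.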

6 Syntactic constructions in the SYN2005 corpus

The genitive and locative cases in Czech serve a variety of functions, and, as seen earlier in Sect. 2, the handbooks assert that the syntactic construction can influence the choice of endings. For each case, we focused on four common syntactic environments in which each form can appear; these are discussed below:14

Adnominal genitive. In one common construction, the genitive indicates possession or characteristics inherent to an object (N = 2,138,562):15

(2) Hladina rybníka/rybníku[gen.sg] se za svitu měsíce leskla jako kovové zrcadlo.
‘The surface of the pond shone in the moonlight like a metal mirror.’

Genitive of motion. In a second common environment, the genitive indicates motion to or away from a destination (following the prepositions do ‘to’, od ‘away from’, z ‘out/off’, N = 732,084):16

(3) Otevřela dveře a vstoupila do[prep] pokojíka/pokojíku[gen.sg].
‘She opened the door and walked into the little room.’

Prepositional genitive. There is a broad variety of non-motion prepositions that require a genitive object, but for the sake of comparability we focused on several common ones that express physical or metaphorical location (uprostřed ‘amidst’, vedle ‘next to’, kromě ‘besides’, u ‘next to, at’) and various aspects of the path (kolem ‘past’, podél ‘along’). If we count the highly frequent preposition bez ‘without’ among this group, we get a total of N = 51,433:17

(4) Kolem[prep] lesíka/lesíku[gen.sg] šla nějaká babka a trhala kopřivy pro husy.
‘An old lady strolled past the copse, picking nettles for her geese.’

Genitive object. A limited number of reasonably frequent verbs in modern Czech require or accept a genitive object, among them, in order of declining frequency: dosahovat/dosáhnout ‘to reach’, využívat/využít ‘to make use of’, bát se ‘to be afraid’, všímat si/všimnout si ‘to notice’, účastnit se/zúčastnit se ‘to participate’, týkat se ‘to concern’, zbavovat se/zbavit se ‘to get rid of’, ujímat se/ujmout se ‘to take hold of’, and certain other verbs with the prefix do-:18

(5) Nemohl bych být vegetarián, sýra/sýru[gen.sg] se nenajím[v].
‘I couldn’t be a vegetarian; I can’t fill up on cheese.’

The locative case is used in a more limited number of environments and must occur as the object of one of five prepositions (v ‘in’, na ‘on’, o ‘about’, při ‘during/in the presence of’, po ‘across/after’). We selected the following contexts:

Canonical locational phrase. The locative case appears indicating location as the object of the prepositions v ‘in’ or na ‘on’ (N = 604,123):19

(6) Neměla ani zdání, co je na[prep] papíře/papíru[loc.sg] napsáno.
‘She had no idea what was written on the paper.’

Locational phrase with adjective. The locative case appears indicating location, with an adjective interposed between the preposition and the noun (N = 154,157):20

(7) Podle legendy je Grál uložen na[prep] neznámém[adj] hradě/hradu[loc.sg], kde je střežen rytíři.
‘According to legend, the Grail is preserved in an unknown castle, where it is guarded by knights.’

Non-canonical locative phrase. The locative case appears with the prepositions o ‘about’, při ‘during/in the presence of’, po ‘across/after’ (N = 121,095):21

(8) Při[prep] mraze/mrazu[loc.sg] pod čtyřicet stupňů se nebezpečí omrzlin několikanásobně zvyšuje.
‘During frosts of forty below or worse the danger of frostbite rises exponentially.’

14 For ease of reference, the forms in the examples are given here consistently in the order -a/-u and -ě/-u, but in half the questionnaires the order was, of course, reversed. The examples cited are those used in the surveys, which (as discussed elsewhere) are sometimes slightly shortened, simplified or clarified to avoid unwanted side effects.
15 The figures here and elsewhere in Sect. 6 are purely for orientation; they represent the result of corpus searches that have not been corrected by manually trawling the outputs. The query was for all adnominal masc. inanimate nouns in the genitive singular: ([tag = "NN.*"] [tag = "NNIS2.*"]).
16 The query was for a phrase consisting of these prepositions followed immediately by a masc. inanimate noun in the genitive singular: ([(lc = "od") | (lc = "z") | (lc = "do")] [tag = "NNIS2.*"]).

17 The query was for a phrase consisting of these prepositions followed immediately by a masc. inanimate noun in the genitive singular: ([(lc = "uprostřed") | (lc = "vedle") | (lc = "kolem") | (lc = "podél") | (lc = "kromě") | (lc = "bez") | (lc = "u")] [tag = "NNIS2.*"]).
18 Here, there is no single query that can reliably be made and so the figures are not easily comparable to those for other constructions. This list is drawn from the lexemes used in the survey and frequency is given as the total number of forms linked to the lemma (e.g. retrieved via queries like [lemma = "bát"]); those figures have not been verified by manual checking of examples. A different corpus statistic, to put some perspective on this, is the number of occurrences of masc. inanimate nouns in the genitive case that are within two words of one of the above verbs: N = 8,923, based on the query [tag = "NNIS2.*"] with a positive filter applied in the interval {−3, 3} from the key word in context (KWIC): [(lemma = "ujímat") | (lemma = "ujmout") | (lemma = "zbavovat") | (lemma = "zbavit") | (lemma = "z?účastnit") | (lemma = "týkat") | (word = "všímat") | (word = "všimnout") | (word = "bát") | (word = "využív?a?t") | (word = "dosahovat") | (word = "dosáhnout")].
19 Query string ([(lc = "ve?") | (lc = "na")] [tag = "NNIS6.*"]).
20 Query string ([(lc = "ve?") | (lc = "na")] [tag = "A..S6.*"] [tag = "NNIS6.*"]).
21 Query string ([(lc = "po") | (lc = "při") | (lc = "o")] [tag = "A..S6.*"]? [tag = "NNIS6.*"]).

Locative object of the verb. In certain circumstances, prepositions requiring the locative are essentially devoid of their customary meaning; their use is dictated by the particular verb chosen, e.g. záležet na + loc ‘to depend on’, ptát se/zeptat se po + loc ‘to ask after’, pátrat po + loc ‘to investigate’, stýskat se po + loc ‘to pine for’:22

(9) Nové sportoviště pojmenovali[v] po[prep] areále/areálu[loc.sg], který stával na stejném místě v době před druhou světovou válkou.
‘The new sports hall was named after the complex that stood on the same spot before the Second World War.’

The effect of the use of these contexts will be discussed in Sects. 12 and 13.
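The corpus queries cited in the footnotes to this section combine a test on the lowercased word form (lc) with a regular expression over the morphological tag. The Python sketch below imitates the logic of the "preposition followed immediately by a genitive noun" query on a handful of tokens. It is an illustration only: the token list and its tags are invented, the function name is hypothetical, and we assume that a tag pattern has to match the whole tag.

```python
import re

# Hypothetical tagged tokens: (word form, positional tag). The tags are
# invented for illustration and only imitate the style used in the queries.
TOKENS = [
    ("Vstoupila", "VpQW---XR-AA---"),
    ("do", "RR--2----------"),
    ("pokojíku", "NNIS2-----A----"),
    ("a", "J^-------------"),
    ("šla", "VpQW---XR-AA---"),
    ("od", "RR--2----------"),
    ("starého", "AAIS2----1A----"),
    ("mostu", "NNIS2-----A----"),
]


def prep_plus_noun(tokens, prepositions, tag_pattern):
    """Find a listed preposition followed immediately by a token whose tag matches."""
    tag_re = re.compile(tag_pattern)
    hits = []
    for (word, _tag), (next_word, next_tag) in zip(tokens, tokens[1:]):
        # The word test is case-insensitive (like lc); the tag pattern is
        # assumed to have to cover the whole tag, hence fullmatch.
        if word.lower() in prepositions and tag_re.fullmatch(next_tag):
            hits.append((word, next_word))
    return hits


hits = prep_plus_noun(TOKENS, {"do", "od", "z"}, r"NNIS2.*")
print(hits)
```

In this toy data only do pokojíku is retrieved; od starého mostu is missed because an adjective intervenes. This is the kind of undercounting that makes the construction figures suitable for orientation only.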

7 Investigating acceptability

Our acceptability experiment focused on nouns displaying variation in the corpus between endings in either case. As it turned out, the distribution of endings in the corpus proved unpredictable. For some of the lexemes, roughly equal proportions of each form appeared in the corpus, while for others, one ending or another appeared somewhat more frequently or much more frequently.

Because we were interested in how the relative frequencies of competing forms interacted, we found it convenient to assign the words to one of seven frequency bands, so that we would be able to choose a variety of lexemes for our survey that show differing behaviours in the corpus. However, neither the exact division of these bands nor the names assigned to them for convenience’s sake should a priori be taken as evidence that these divisions are meaningful.23 The proportions are shown in Tables 4 and 5 below. We were curious whether this range of relative proportions in the corpus would tell us anything about the acceptability of these forms to native speakers, so we chose a selection of lexemes showing greater and lesser proportions of forms with one or the other ending.24

22 The result here again is not exactly comparable to the first three categories, but the number of occurrences of masculine inanimate locative singular forms within two places of a particular list of verbs gives N = 9,657, with the query being [tag = "NNIS6.*"] and a positive filter applied in the range {−3, 3} from KWIC: [(lemma = "podělit") | (lemma = "lpět") | (lemma = "záviset") | (lemma = "záležet") | (lemma = "dohodnout") | (lemma = "dohodovat") | (lemma = "pracovat") | (lemma = "zakládat") | (lemma = "z?účastnit") | (lemma = "sahat") | (lemma = "pátrat") | (lemma = "ptát") | (lemma = "toužit") | (lemma = "stýskat") | (lemma = "pojmenová?v?at") | (lemma = "prahnout") | (lemma = "dychtit") | (lemma = "zklamat") | (lemma = "bránit") | (lemma = "vyznat") | (lemma = "libovat") | (lemma = "z?mýlit") | (lemma = "shodovat") | (lemma = "shodnout")].
23 The scale we used is, by coincidence, remarkably similar to that used in the new popular-scholarly grammar by Cvrček et al. (2010) to categorize relative proportions of forms in the corpus. Their scale posits seven bands: nikdy/skoro nikdy ‘never/almost never’ (0–1%), zřídka ‘rarely’ (1–10%), někdy ‘sometimes’ (10–35%), stejně ‘the same’ (35–65%), často ‘frequently’ (65–90%), zpravidla ‘as a rule’ (90–99%), vždycky/skoro vždycky ‘always/almost always’ (over 99%). Their scale differs from ours only in the range of the middle band, 35–65% instead of 30–70%, with the two adjacent bands adjusted accordingly.
24 In calculating relative proportions, we did not include examples of case form syncretism. For example, the form hradě appears only in the locative singular and is compared only to examples of the competing form hradu when those occur in a recognized locative context; we did not include in our calculations examples of hradu where it represents the genitive singular or the dative singular. We thus fail to capture one generalization about the ending -u (that it is expansive and its use possibly supported by its appearance in contexts external to the case under consideration), but we were not willing to let this speculative factor determine the set-up of the experiment.

Table 4 Proportions of forms with genitive singular variation -a ≈ -u

| Band | Relative proportion of -a | Lexemes (highest > lowest proportion of -a in each group) |
|---|---|---|
| 1 | Under 1% (‘isolated’) | perník, šuplík, kotník, zákoník, podzim, slovník, chodník, pátek, objekt, vesmír, krk |
| 2 | 1% up to 10% (‘marked’) | meloun, taxík, pokojík, chlup, obratník, větřík, prostředek, sedan, honzík, protějšek, kurník, živel, bochník, kožich, sklípek, kotlík, rok |
| 3 | 10% up to 30% (‘minority’) | bavorák, kousek, dvorek, chlívek, kombík, filament, rovnoběžník, sen, destructor, bedekr, opel, blackjack, osmiúhelník, trojúhelník, renault, cadillac, hřib, budíček, záchod, obdélník, javor |
| 4 | 30% up to 70% (‘equipollent’) | kalich, ječmen, týl, františek, anton, kopeček, komunikátor, buřt, gaučík, pronárod, velín, budík, regulátor, šestiúhelník, fiátek, vuřt, wartburg, podzimek |
| 5 | 70% up to 90% (‘majority’) | trabant, rybník, kout, kostelík, panák, lesík, komín, šajn, prajazyk, betlém, čůrák, předvečer, kostelíček |
| 6 | 90% up to 99% (‘unmarked’) | dnešek, mlejn, potok, zítřek, vepřín, jazyk, flok, mlýn, polosvět, národ, vajgl, včelín, čtvrtek, popel, klín, galavečer, walkman, sýr |
| 7 | 99% and over (‘dominant’) | ostrov, domov, včerejšek, večer, tábor, sklep, pondělek, dvůr, zákon, oběd, ocet, klášter, kostel, letošek |
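The banding in Table 4 amounts to a simple threshold rule applied to the share of one form among all attested forms of a lexeme. The Python sketch below is an illustration of that rule, not the procedure actually used to build the tables; the function name `band` and the treatment of the boundaries ("up to" read as strictly below the upper limit) are our assumptions. The counts in the usage lines are the genitive -a and -u counts given for rybník, taxík and podzimek in the base questionnaire tables.

```python
# Lower bounds (in percent), band numbers and labels, following Table 4.
BANDS = [
    (99.0, 7, "dominant"),
    (90.0, 6, "unmarked"),
    (70.0, 5, "majority"),
    (30.0, 4, "equipollent"),
    (10.0, 3, "minority"),
    (1.0, 2, "marked"),
    (0.0, 1, "isolated"),
]


def band(form_count, rival_count):
    """Return (band number, label) for the form's share of all attested forms."""
    share = 100 * form_count / (form_count + rival_count)
    for lower_bound, number, label in BANDS:
        if share >= lower_bound:
            return number, label


# Genitive -a vs. -u counts for three survey lexemes.
print("rybník  ", band(929, 125))
print("taxík   ", band(3, 296))
print("podzimek", band(6, 12))
```

Working from raw counts rather than rounded percentages matters at the boundaries: taxík has three -a forms out of 299, which is just over 1% and therefore falls in the ‘marked’ band rather than the ‘isolated’ one.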

For a more detailed discussion of the usefulness of acceptability judgments and how they are evaluated, see Bermel and Knittl (forthcoming).

8 Survey structure: frequency of lexemes

Our survey of native speakers tested the acceptability of variant genitive singular and locative singular forms found in the corpus. The structure of the survey was as follows:
• For each case, two lexemes were drawn from each of the seven bands described above, for a total of 28 lexemes under test.
• Two base questionnaires were created, each using one lexeme per frequency band and per case (14 lexemes per questionnaire) (see Tables 6a, 6b, 6c and 6d).25

The choice of lexemes needs more detailed comment. We operated with principles that sometimes clashed with each other or with the reality of a limited corpus data set, meaning that compromises were sometimes required:

25 We thus had two surveys of c. 145 respondents each. This structure allowed us to reliably compare answers in each series to each other, knowing that the population has been held constant.

Table 5 Proportions of forms with locative singular variation -ě ≈ -u

| Band | Relative proportion of -ě | Lexemes (highest > lowest proportion of -ě in each group) |
|---|---|---|
| 1 | Under 1% (‘isolated’) | provoz, labyrint, kurs, výbor, útes, transport, překlad, obřad, beton, popis, předpoklad, horizont, byznys, festival, zákaz, úvod, náraz, šampionát, přechod, proces, způsob, pořad, areál |
| 2 | 1% up to 10% (‘marked’) | advent, kožich, dluhopis, traktát, ohlas, odraz, chrám, parlament, sud, nesouhlas, salát, šál, konvent, diktát, důkaz, půdorys, šarlat, žlab, javor, pažit, vous, senát, protokol, ochoz, úpis, pás, pavilón, terminál, špenát, obrys, lazaret, karneval, slet, úžas, výkaz, průkaz, pergamen, divan, obal, výraz, dotaz, volant, inzerát, průvod, hrot, výklad, automat, pokus, předmět, odkaz, showbyznys, zákop, sloup, příkaz, post, důraz, pád, stadión, manifest, přepis, přístav, bufet, návod, doklad, bazén, servis, stadion, parapet, důchod, formát, klavír, kongres, ústav, převrat, kód, háv, asfalt |
| 3 | 10% up to 30% (‘minority’) | souhlas, sekretariát, opis, kompas, kancionál, knot, sovchoz, plakát, orient, klozet, kombinát, sklad, výkres, záhon, kurt, antikvariát, basketbal, penzionát, chleb, místopis, akát, apoštolát, loket, městys, národopisu, pankejt, úsvit, křišťál, vinohrad, salón, portál, příklad, komunál, kvadrát, vokál, aparát, řetěz, džbán, nadpis, ornát, vlas, obvod, jarmark, noviciát, apelplac, rajón, batalión, žurnál, omnibus, prazáklad, kabinet, zájezd, magistrát, účet, mraz, referát, újezd, lis, průjezd, salon, galakoncert, soupis, úraz, subkontinent, inspektorát, komisariát, doktorát, syndikát, sandál, rákos, román, soud, altán, otoman, rozpis, kompost, granát, misál, oceán, hvozd, pedál, chvost, tribunál, prut, náhon |
| 4 | 30% up to 70% (‘equipollent’) | obchod, výlet, internát, cestopis, šos, kokrhel, prvopis, cirkus, spis, ubrus, autobus, úvoz, jazyk, brus, letohrad, letopis, rektorát, topol, privát, úřad, podnos, rukopis, šat, grunt, tenis, velín, kolchoz, chorobopis, kumbál, kvartál, rod, díl, luft, parkán, předobraz, astrachán, buřt, dativ, dvojkoncert, kondicionál, lékopis, penál, pornočasopis, průhon, štus, pult, koncert, vagón, oddíl, prst, kšeft, rozkaz, trůn, tiskopis, velkoobchod, kontinent, schod, močál, povoz, dějepis, týl, výpis, sešit, příkop, futrál, chobot, obvaz, kastrol, pravopis, kanál, volejbal, rybník, masopust, trolejbus, regál, nápis, konvikt, protektorát, nohejbal, špýchar, konzulát, balón, děkanát, superstát, dvojhlas, narcis, prostoročas, výřad, rukáv, zápis, atlas, list, předpis, globál, přípis, pangejt |
| 5 | 70% up to 90% (‘majority’) | hlas, vál, balkon, kriminál, rozhlas, balkón, dopis, provaz, západ, nečas, stát, komín, včelín, kabát, pododdíl, verštat, strom, lokál, plot, špitál, betlém, potok, strojopis, kočár, led, kus, fotbal, časopis, přírodopis, drát, den, krám, funus, ksicht, zákon, převaz, maloobchod, závod, životopis, obraz, hřbet, bál, ocas, špagát, originál, národ, podklad, bod, kmín, paškál, perón, papír, přívoz, moment, stejnopis, zeměpis |
| 6 | 90% up to 99% (‘unmarked’) | jihozápad, domov, severovýchod, severozápad, základ, jihovýchod, voz, chlév, poločas, nos, kravín, plac, půlrok, sál, klášter, případ, kout, úvaz, středozápad, brod, východ, okres, klín, mlýn, chlív, most, zápas, hrad, mezičas, důl, týn, hrob, pas, strop, kinosál, ples, ocet, krchov |
| 7 | 99% and over (‘dominant’) | svět, les, poloostrov, záchod, rok, život, dům, stůl, oběd, dvůr, sklep, venkov, tábor, kostel, hřbitov, byt, ostrov |

Table 6a Base questionnaire A—genitive (word series 1)

| Frequency band | Lexeme | % -u | % -a | -u (n =) | -a (n =) |
|---|---|---|---|---|---|
| 1 | les ‘forest’ | 0.00 | 100.00 | 0 | 4339 |
| 2 | sýr ‘cheese’ | 9.06 | 90.94 | 118 | 1185 |
| 3 | rybník ‘pond’ | 11.86 | 88.14 | 125 | 929 |
| 4 | týl ‘back of head/army’ | 34.64 | 65.36 | 115 | 217 |
| 5 | dvorek ‘small courtyard’ | 74.63 | 25.37 | 150 | 51 |
| 6 | pokojík ‘little room’ | 92.99 | 7.01 | 199 | 15 |
| 7 | most ‘bridge’ | 99.93 | 0.07 | 3014 | 2 |

Table 6b Base questionnaire A—locative (word series 2)

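| Frequency band | Lexeme | % -u | % -ě | -u (n =) | -ě (n =) |
|---|---|---|---|---|---|
| 1 | svět ‘world’ | 0.10 | 99.90 | 25 | 24189 |
| 2 | hrad ‘castle’ | 7.44 | 92.56 | 122 | 1518 |
| 3 | papír ‘paper’ | 27.19 | 72.81 | 348 | 932 |
| 4 | autobus ‘bus’ | 35.95 | 64.05 | 229 | 408 |
| 5 | mráz ‘frost’ | 84.65 | 15.35 | 204 | 37 |
| 6 | obal ‘wrapper’ | 96.39 | 3.61 | 481 | 18 |
| 7 | areál ‘territory’ | 99.95 | 0.05 | 2014 | 1 |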

Table 6c Base questionnaire B—genitive (word series 3)

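| Frequency band | Lexeme | % -u | % -a | -u (n =) | -a (n =) |
|---|---|---|---|---|---|
| 1 | venkov ‘countryside’ | 0.00 | 100.00 | 0 | 1041 |
| 2 | potok ‘stream’ | 1.22 | 98.78 | 14 | 1129 |
| 3 | lesík ‘grove’ | 17.89 | 82.11 | 22 | 101 |
| 4 | podzimek ‘autumn’ | 66.67 | 33.33 | 12 | 6 |
| 5 | záchod ‘WC’ | 87.71 | 12.29 | 257 | 36 |
| 6 | taxík ‘taxi’ | 99.00 | 1.00 | 296 | 3 |
| 7 | vůz ‘vehicle’ | 99.98 | 0.02 | 5578 | 1 |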

Table 6d Base questionnaire B—locative (word series 4)

| Frequency band | Lexeme | % -u | % -ě | -u (n =) | -ě (n =) |
|---|---|---|---|---|---|
| 1 | poloostrov ‘peninsula’ | 0.22 | 99.78 | 1 | 451 |
| 2 | okres ‘district’ | 5.46 | 94.54 | 93 | 1609 |
| 3 | strom ‘tree’ | 15.11 | 84.89 | 81 | 455 |
| 4 | koncert ‘concert’ | 50.52 | 49.48 | 584 | 572 |
| 5 | účet ‘account’ | 84.44 | 15.56 | 548 | 101 |
| 6 | parapet ‘parapet’ | 98.57 | 1.43 | 138 | 2 |
| 7 | výbor ‘council’ | 99.25 | 0.75 | 528 | 4 |

• To ensure the reliability of our choice of items, we used (with one exception) higher-frequency forms (locative case forms in the corpus at N > 100) with good dispersal rates throughout the corpus. Dispersal rates can be checked using the Average Reduced Frequency, which is easily available using the corpus query processor; they are important to ensure that the data is not overly dependent on input from one text type or author.
• Lexemes where polysemy dictated the choice of ending (e.g. kopeček, where the choice of -a or -u in the genitive singular falls out from whether it means ‘little hill’ or ‘scoop’), or where accidental frequency effects were found (e.g. use in advertisements appearing repeatedly in the corpus), were avoided wherever possible.
• We avoided lexemes from non-standard language registers, as these can evoke reactions based on stylistic objections to the word’s use.
• In the broad ‘equipollent’ band 4 above, there were no nouns that met all our criteria for the genitive. One lexeme eventually chosen was týl ‘back of the head/rear of an army’, which displays polysemy; the corpus shows a preference for týlu ‘rear of an army’ vs. týla ‘back of the head’, although the distribution between meanings is not consistent. A second was podzimek ‘autumn’, which has no polysemy or immediately obvious rationale in the distribution of genitive singular forms, but is represented at a low frequency in this corpus. A check of other corpora (SYN2000 and SYN2006PUB) revealed a similar distribution of forms, so we included it. We did receive a few comments to the effect that some speakers would not use this word themselves; although these clearly affected individual assessments of sentences with this lexeme, they do not seem to have had a significant overall effect on the survey results.
• To increase comparability across the tasks, we primarily chose lexemes that can be used with prepositions of physical location and motion. However, a certain number are not usually susceptible to such uses (e.g. účet ‘account’, výbor ‘council’, koncert ‘concert’, podzimek ‘autumn’, where location or motion is figurative rather than physical, or mráz ‘frost’, sýr ‘cheese’, where the possibility of physical location or motion is secondary to other uses); this helped broaden coverage of the cases involved.
• Handbooks suggest that word-formation principles and lexical provenance have an effect on the choice of desinence (see Sect. 2 above). It was not our intent to investigate this contention in great detail, nor, given the various other restrictions we had already placed on our selection of lexemes, was it possible to eliminate these factors entirely by choosing e.g. only native words with single-morpheme stems. We nonetheless ensured that the survey lexemes represent different word-formation types and are of both domestic and foreign origin; these differing types are distributed through the frequency bands so as not to create a confounding factor. For example, nouns borrowed into Czech in the modern period and deverbal nouns, which are said in general to favour the use of -u in both cases, are selected from various frequency bands, not only from those that show a predominance of -u. The final selection includes lexemes whose stem contains only a single morpheme (sýr ‘cheese’, les ‘forest’, týl ‘back (of head/army)’, strom ‘tree’, vůz ‘vehicle’, mráz ‘frost’), prefixed deverbal nouns (výbor ‘council’, účet ‘account’, záchod ‘WC’, potok ‘stream’, obal ‘envelope’, okres ‘district’), lexemes with the common nominal suffixes -ík, -ek, -ník (taxík ‘taxi’, lesík ‘grove’, pokojík ‘little room’, dvorek ‘little courtyard’, podzimek ‘autumn’, rybník ‘pond’) and recent borrowings (koncert ‘concert’, parapet ‘parapet’, areál ‘area, territory’, autobus ‘bus’).
• At either end of the proportionality scale (isolated/dominant), we looked first for nouns that showed no occurrence in the corpus of one of the variant forms, or only occurrences that were clearly examples of linguistic playfulness. This was unproblematic for the genitive singular, where there are high-frequency nouns that have no attestations of the -u morph, and others with no attestations of the -a morph.26 For the locative singular, we found nouns with only the -u morph, but virtually none with only the -ě morph; those with only -ě were of such low absolute frequency that the absence of -u in the corpus could be a function of small sample size. For the locative, we therefore chose nouns that had isolated attestations of -u and isolated attestations of -ě in non-marked contexts.
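The dispersal check mentioned in the first point above can be illustrated with a short Python sketch. It follows the commonly cited definition of Average Reduced Frequency, in which the corpus is treated as circular and each gap between successive occurrences is capped at the average gap v = N/f; the corpus tool’s own implementation may differ in detail. The function name, the token positions and the corpus size below are invented for illustration.

```python
def average_reduced_frequency(positions, corpus_size):
    """ARF sketch: (1 / v) * sum(min(gap, v)) over cyclic gaps, with v = N / f."""
    positions = sorted(positions)
    f = len(positions)
    if f == 0:
        return 0.0
    v = corpus_size / f
    # Gap that wraps around from the last occurrence back to the first.
    gaps = [positions[0] + corpus_size - positions[-1]]
    # Gaps between neighbouring occurrences.
    gaps += [later - earlier for earlier, later in zip(positions, positions[1:])]
    return sum(min(gap, v) for gap in gaps) / v


# Four occurrences in a hypothetical 100-token corpus.
evenly_spread = average_reduced_frequency([0, 25, 50, 75], 100)
clumped = average_reduced_frequency([0, 1, 2, 3], 100)
print(evenly_spread, clumped)
```

A form whose occurrences are spread evenly keeps an ARF close to its raw frequency, whereas a form concentrated in a single stretch of text (for instance one author or one repeated advertisement) is reduced towards 1.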

9 Survey structure: practical details

We have treated the survey structure in greater detail in Bermel and Knittl (forthcoming), but the main points were:
• We requested judgments (i.e. evaluate how acceptable the form is), rather than forced choices (i.e. fill in the ending).
• Scalar judgments were used, with participants rating each variant on a Likert scale of 1 (perfectly normal/best form) to 7 (unacceptable/not normal). The midpoints were undefined, which (along with the sample size of almost 300) makes it possible to run parametric tests, which are normally only possible with interval data.
• To increase the similarity between our experimental data and the corpus data, the forms were set in cue sentences drawn where possible from the corpus. In many instances we modified the cues by shortening, simplifying or removing distracting elements. In some instances we were unable to find a suitable sentence in SYN2005 and had to identify one elsewhere.
• Respondents explicitly contrasted two versions of each cue sentence, so it was clear in each instance what they were evaluating. This approach is not unknown in studies of morphosyntax (see e.g. Marcus et al. 1995); it reduces the chance of irrelevant responses without compromising the study:27

Fig. 1 A survey example (‘The car ended up in the middle of the pond -a/-u’)

We have already remarked that each of the two base surveys was focused on seven lexemes in each case, chosen to represent differing frequency bands in the corpus. Furthermore:
• Each lexeme appeared in four sentences, reflecting the variety of different syntactic environments in which that case (locative singular or genitive singular) is found (see below).

26 The appearances of voza[gen.sg] ‘vehicle’, mosta[gen.sg] ‘bridge’ are good examples of this. The form mosta occurs once in the linguistically eccentric translation Pan Kaplan má svou třídu stále rád and twice as a quote from Slovak, where it occurs in the expression z mosta do prosta ‘out of nowhere’. The form voza appears once in the corpus as an example of Silesian dialect in Ivan Landsmann’s autobiographical work Pestré vrstvy.
27 In studies of syntax, the worry is that respondents will second-guess the researchers’ intent and construct bogus rules. This is less of an issue in morphology, as every speaker of Czech with basic education has an explicit knowledge of the existence of cases.

• Each feature (locative singular, genitive singular) constituted a third of the cues, with the remainder being taken up by filler questions on verbal morphology.
• Each base questionnaire was distributed in four versions, with the order of cues randomized and then manually adjusted as per Cowart (1997). The first few cues in each version were designed to encourage use of the full scale from 1–7, cf. Fig. 1.
• All versions were distributed randomly at each venue (see the following Sect. 10).

10 The cohort of respondents

We supervised sittings at three universities in Brno and Prague and at an academic high school with students in their final year. The venues and courses (management, computer science, civics, and natural sciences) were chosen to minimize the number of students with a degree-level education in language and linguistics.28 A smaller number of questionnaires were also distributed in workplaces in Olomouc, Prague and Přerov. We collected over 300 questionnaire responses. Once spoiled questionnaires and those from non-native speakers were removed, the final sample consisted of 289 responses. There were 34–38 examples of each of our eight questionnaire versions.

Age-wise, the vast bulk of our respondents (N = 194) were between 18 and 25, which is to be expected, given where we collected our data. The remainder were spread throughout the older age groups in steadily decreasing numbers, so overall this survey can be said to be concerned with the language of younger Czechs (89% of respondents were between the ages of 18 and 45). Most (N = 182) of our respondents reported having completed secondary education only, with a smaller number (N = 96) having completed higher education (although some of the first group were at university; see Bermel and Knittl forthcoming). There was a gender imbalance that surprised us, as a large number of respondents were taking a course in either computer science or engineering management (185 women to 102 men, with two non-responses).

Respondents were asked to indicate which region of the Czech Republic they come from and other places they had lived for a year or more. We regarded the primary location as their ‘home’ and used this to test the hypothesis that one’s home region could have an effect on the choice of forms.
Respondents were asked to select one of the fourteen administrative regions (kraje) of the Czech Republic, which (with certain exceptions) can be mapped according to the old historical/linguistic territories of Bohemia, Moravia and Silesia,29 cf. Fig. 2. Respondents’ home regions were predictably concentrated near where the surveys were conducted: Prague (48 respondents), south Moravia, which includes Brno (73 respondents), and Olomouc, which includes Přerov (51 respondents). Moravians formed 48% of all respondents but comprise only a third of the country’s population, so they are somewhat overrepresented in our sample. Bohemians and Silesians were underrepresented, forming respectively 49% and 3% of all respondents, but 60% and 7% of the population.

Fig. 2 Geographical origin of respondents by region

28 For a more detailed account of the survey locations and protocols, as well as a treatment of the issue of linguists as linguistic subjects, see Bermel and Knittl (forthcoming).
29 The historical boundary between the Kingdom of Bohemia and the Margravate of Moravia runs through Pardubický kraj and Kraj Vysočina. Most of the former lies in Bohemia, and so the 10 respondents were considered “Bohemian” for the purpose of our analysis. The boundary bisects Kraj Vysočina; as linguistically Bohemian features are nowadays said to predominate in this region, we counted its 26 respondents as Bohemian speakers in our descriptive statistics, although we kept their data separate in our analyses of variation and correlation. Moravia and Silesia were considered as a unit, as there is no mention in the literature of a specifically Silesian dimension to the choice of endings in these cases, and we did not have enough data on Silesians to consider them separately; in any event, the region including Czech Silesia also takes in some historically Moravian territory.

11 Do native-speaker acceptability judgments correlate with corpus frequency?

Our first issue was to ascertain what the judgments we received could be related to. For our current purposes, we will simply report the highlights of several tests we ran; for a more in-depth exploration of this question, see Bermel and Knittl (forthcoming).30 The within-subject variable Frequency (proportional frequency in the corpus) proved to have a statistically significant effect on the choice of ending (p < 0.001 for all four data sets).31 The measure of this effect can be taken by the partial eta squared value, which was relatively high for proportional frequency in the corpus (partial η² = 0.39 to 0.58). This suggests that a considerable proportion of the variance in people’s acceptability judgments can be explained by the relative frequency with which they encounter this form in everyday written language.

30 There were four different versions of each questionnaire, each with the stimuli in a different order, so we needed to explore the possibility that participants’ responses were influenced by the particular order of stimuli in their version. We thus ran a repeated-measures ANOVA that explored the effects of the between-subjects variable Version (questionnaire version) and the within-subjects variables Frequency (proportional frequency in the corpus) and Context (syntactic environment). The results showed a low possibility of significance (from p = 0.17 to p = 0.46). This means that respondents’ responses were apparently not linked in any significant way to the order in which the stimuli were presented, lending support to the validity of our data.
31 For our analyses of within-subject variables, some data did not pass the sphericity test, so in those instances, the results with the conservative Greenhouse-Geisser correction were used. For example, for base questionnaire 1, locative series in -ě, the test result is F(4.39, 557.88) = 136.48, p < 0.001 for the variable Frequency and F(2.92, 370.59) = 66.02, p < 0.001 for the variable Context.

Correlation tests can also shed light on how the relative proportions of forms in the corpus match with speakers’ acceptability judgments. We computed Pearson product-moment correlation coefficients to see how the relative proportion of a form in the corpus (0–100%) correlated with speakers’ judgments (1–7, with the lowest number 1 as the best). We found that in all instances the correlation between proportions in a corpus and acceptability judgments is very high. Our eight results were all in the range from r = −0.82 to r = −0.99, showing highly significant correlations between acceptability and corpus data.32

Each construction type was represented in equal proportions in our survey (25% for each type), while in the corpus data, constructions were represented unevenly, with some being more frequent and central than others (see Sect. 6 above). We therefore also ran an adjusted analysis. The figures in Sect. 6 show a difference between the most frequent and least frequent syntactic constructions of 50:1 or even >100:1, but they undercount the less-common syntactic environments, where precise searches are difficult and heuristic ones tend to retrieve only a portion of examples fitting the description. In the adjusted analysis, we assigned conservative weights to the different contexts in the proportions 10:5:2:1, assuming that some undercounting of the less frequent contexts had taken place. The most common or central construction types (adnominal position for the genitive, and location for the locative) received the 10 weighting, while the least common or least central contexts (direct governance by the verb) received the 1 weighting. The results were not markedly different, with the same range of values and only minor shifts. There is thus a strong correlation between the relative proportions of forms in a corpus and their acceptability to native-speaker informants.33
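The two computations described in this section can be sketched in Python as follows. This is an illustration rather than the analysis actually run: the proportions and ratings fed to the correlation are the purely illustrative figures used in the footnote on the direction of the correlation (30% > 60% > 90% against ratings of 3.2 > 2.3 > 1.7), the per-context ratings are invented, and how exactly the 10:5:2:1 weights entered the adjusted analysis is our assumption (here, a weighted mean rating across the four contexts).

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariation / sqrt(spread_x * spread_y)


def weighted_mean(values, weights):
    """Mean of the values, with each value counted in proportion to its weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Illustrative corpus proportions (%) and mean ratings (1 = best, 7 = worst).
proportions = [30, 60, 90]
mean_ratings = [3.2, 2.3, 1.7]
r = pearson_r(proportions, mean_ratings)

# Hypothetical mean ratings of one form in four contexts, ordered from the
# most central context to the least central one.
ratings_by_context = [1.5, 2.0, 2.5, 3.0]
unweighted = weighted_mean(ratings_by_context, [1, 1, 1, 1])
adjusted = weighted_mean(ratings_by_context, [10, 5, 2, 1])

print(round(r, 3), unweighted, round(adjusted, 3))
```

Because a rating of 1 is the best, higher corpus proportions go with lower ratings and the coefficient comes out negative. The weighting pulls the overall rating towards the rating obtained in the most central context.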

12 What influence does syntactic environment have?

As mentioned above, standard reference works assert that syntactic environment influences the choice of ending and some studies have shown evidence of this. We therefore wished to test whether this was in fact the case on this large sample, and if it was, how large an effect it might be vis-à-vis other factors influencing the choice of ending. A repeated-measures ANOVA was performed on each of our four sets of data (one for each word series). This analysis examines multiple data points from each respondent. A key assumption for repeated-measures ANOVA is equal variances of difference (sphericity) between the data from each participant (Field 2009, 458–461). For our analyses of the within-subject variables Construction and Frequency, the locative data did not pass the sphericity test, and so the results of the conservative Greenhouse-Geisser corrected test are reported for them. The ANOVA shows significant effects for the variable Construction in relation to the ending used for all analyses (see Table 7).

32 Correlation measures run from r = 0.0 (no correlation) to r = 1.0 (perfect positive correlation) or r = −1.0 (perfect negative correlation). The correlation is negative because as the proportion of forms with one ending rises (e.g. 30% → 60% → 90%), the average rating on the traditional Czech scale goes downward from an ‘unacceptable’ 7.0 towards a ‘completely acceptable’ 1.0 (e.g. 3.2 → 2.3 → 1.7).

33 Despite the strong correlation shown in the Pearson test and the highly significant interaction between Frequency and Ending shown by the ANOVA, we cannot assume that percentages from the corpus will map proportionally onto degrees of acceptability. The details of this are outlined in Bermel and Knittl (forthcoming); they chime with results originally found by Divjak (2008) for syntactic constructions and Kempen and Harbusch (2005, 2008) for word order, and confirmed by Bader and Häussler (2009) for a variety of features related to word order.

Table 7 The variables Construction (syntactic structure where the form appears) and Ending

Data set                           Result                                  Effect size
Word series 1; genitive -a/-u*     F(3, 369) = 10.72, p < 0.001            partial η² = 0.08
Word series 2; locative -ě/-u      F(2.77, 337.28) = 28.78, p < 0.001      partial η² = 0.19
Word series 3; genitive -a/-u*     F(3, 354) = 6.33, p < 0.001             partial η² = 0.05
Word series 4; locative -ě/-u      F(2.62, 322.15) = 35.88, p < 0.001      partial η² = 0.23

*Reported without Greenhouse-Geisser correction
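The effect sizes in Table 7 can be cross-checked against the reported F ratios. The sketch below relies on the standard identity partial η² = (F × df_effect) / (F × df_effect + df_error); the function name is ours, and the numbers are copied from Table 7. It is offered as a consistency check, not as a description of how the original figures were produced.

```python
def partial_eta_squared(f_ratio, df_effect, df_error):
    """Recover partial eta squared from an F ratio and its degrees of freedom."""
    return (f_ratio * df_effect) / (f_ratio * df_effect + df_error)


# (F, df_effect, df_error, partial eta squared as printed in Table 7)
table7 = {
    "Word series 1; genitive": (10.72, 3, 369, 0.08),
    "Word series 2; locative": (28.78, 2.77, 337.28, 0.19),
    "Word series 3; genitive": (6.33, 3, 354, 0.05),
    "Word series 4; locative": (35.88, 2.62, 322.15, 0.23),
}

for label, (f_ratio, df_effect, df_error, reported) in table7.items():
    recovered = partial_eta_squared(f_ratio, df_effect, df_error)
    print(f"{label}: recovered {recovered:.3f}, reported {reported:.2f}")
```

Each recovered value rounds to the figure printed in the table, including the two rows with Greenhouse-Geisser corrected degrees of freedom, since the correction scales both degrees of freedom by the same factor.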

A statistically significant interaction can still be minor compared to other interactions, as is often the case with larger sets of data. We can judge the size of the effect of Construction by reference to the F ratio, which determines whether our model improves the fit of the data over a default model, and if so, to what extent (see Field 2009, 201f., 785–786). The F ratio shows the effect of Construction to be relatively small for the genitive and somewhat higher for the locative.

Another method for measuring this effect is the partial eta squared value, which can show what proportion of the variance is explained by a variable once other variables are excluded from the analysis (Field 2009, 415–416). A partial eta squared value of 0.00 means that none of the variance can be attributed to this variable (no effect); a value of 1.00 means that all the variance can be attributed to it. The values here concur with the F ratios, showing that the effect of Construction is relatively small for the genitive and somewhat larger for the locative.

If we compare this to a variable like Frequency (the proportions of forms found in the corpus; details in Bermel and Knittl forthcoming), we find that syntactic environment (Construction) is a less influential factor in determining the rating of an ending. The F values and partial eta squared values for Frequency are considerably higher: F ratios range from 57.45 to 161.98, and partial η² values range from 0.39 to 0.58. The syntactic environment thus has a statistically significant effect on the acceptability of the ending, but that effect is nonetheless smaller than that of the particular word and the proportions in which its forms are found in the corpus.

We then looked at the overall acceptability of each ending in different constructions.
Here we used both charts and pairwise analyses, which compare the statistical significance of differences between individual values per variable (in this instance, between data on the four syntactic construction types tested). The results of the first analysis are illustrated in Figs. 3a and 3b, which concern the use of -a and -u in the four genitive constructions included.³⁴

For word series 1, there are statistically significant differences involved in five of the six pairwise comparisons for the -a ending (p < 0.05). The difference between adnominal and non-motion constructions proved to be just above the customary cut-off point for significance (p = 0.054). For the -u ending, all pairwise comparisons are significant (p < 0.001) except two: between adnominal constructions and motion constructions, and between non-motion constructions and directly governed constructions (p > 0.05).

For word series 3, there are significant differences involved in four of the six pairwise comparisons for the -a ending (p < 0.001). Two comparisons are not significant (p > 0.05): between the adnominal genitive and the directly governed genitive, and between constructions with motion and non-motion prepositions. For the -u ending, the same pattern holds.

Fig. 3a Genitive -a vs. -u by syntactic environment (word series 1)

Fig. 3b Genitive -a vs. -u by syntactic environment (word series 3)

In both sets of questionnaires, speakers were more likely to rate the -a ending highly and downgrade the -u ending for environments following a preposition, and this was especially the case for prepositions of motion. However, the overall effect is not large. It is worth noticing that -u is favoured in both the most frequent context (the adnominal genitive) and the least frequent one (the directly governed genitive). When separated out on a lexeme-by-lexeme basis, we notice an occasional elevated rating for the -a ending across a number of lexemes, and a correspondingly lower rating in those instances for -u.

Figures 3c and 3d concern locative constructions. For both word series (2 and 4), there are significant differences between the ratings of all four constructions for the -ě ending (with one exception, all p < 0.001; the single example of p < 0.05 concerns the pairwise comparison of ‘location’ and ‘location with adjective’ in word series 2). For the -u ending, all differences are significant except the one between constructions with location prepositions + adjectives and those with non-location prepositions (p > 0.05). This confirms what we saw earlier: that the syntactic environment for the locative does, overall, seem to contribute to the respondents’ evaluation of an ending.

Fig. 3c Locative -ě vs. -u by syntactic environment (word series 2)

Fig. 3d Locative -ě vs. -u by syntactic environment (word series 4)

For the locative, as can be seen from Figs. 3c and 3d, the distinction between construction types is more marked than it was for the genitive. Location in PPs consisting of a preposition + noun, which is far and away the most widely attested syntactic environment for the locative case, favours the use of -ě, while ‘empty’ locative prepositions favour the use of -u, and other constructions (location PPs with interposed adjectives and phrases with non-location prepositions) are in between. When separated out on a lexeme-by-lexeme basis, we notice a consistently lower rating for -ě with semantically ‘empty’ prepositions and a correspondingly higher rating for -u.

34 Data on the variable Context sometimes passed the sphericity tests and sometimes did not. Where the data was judged to have sphericity, the values cited are for uncorrected tests; where the data was judged not to have sphericity, the values are reported after the Bonferroni correction.

The overall picture in Figs. 3a–3d is supported by the consistent high negative correlation (Pearson’s r, where r = −0.93 to r = −0.98, 0.00

13 Survey data matched to corpus data

A final task was to match the survey data regarding syntactic constructions to findings from the corpus. For the lexemes that appeared in the survey we retrieved all tokens from the corpus and, using a mixture of automatic sorting and manual trawling, separated them into the four syntactic constructions. Our survey had equal numbers of examples from each lexeme and context, whereas the corpus data reflected widely varying absolute frequencies in the corpus. For example, the genitive case of les ‘forest’ occurs in the four contexts studied 3491 times in the corpus, while the genitive case of sýr ‘cheese’ occurs only 466 times. Corpus findings, if unadjusted, would thus give les roughly seven and a half times the weight of sýr, whereas in the survey respondents saw forms of each word exactly eight times. Instead of totalling raw frequencies for each context in the corpus, we decided to work with proportions, meaning that each word was represented equally. We therefore calculated the proportion of each lexeme’s forms that occurred in the corpus in the relevant constructions with each ending. A sample set of results for these two words is shown in Table 8.

Table 8 Using proportions to represent contexts

Ending   Lexeme           Adnominal   Motion    Non-motion   Governed
-a       les ‘forest’     40.62%      53.17%    6.07%        0.14%
-a       sýr ‘cheese’     91.78%      7.04%     0.94%        0.23%
         ...etc.
-u       les ‘forest’     0.00%       0.00%     0.00%        0.00%
-u       sýr ‘cheese’     90.00%      7.50%     2.50%        0.00%
         ...etc.
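The normalisation behind Table 8 can be sketched as follows: raw token counts per construction are converted to percentages of a lexeme’s total for a given ending, and the percentages (not the raw counts) are then averaged so that each lexeme carries equal weight. The counts below are hypothetical and chosen only for illustration; they are not the corpus figures, and the function names are ours.

```python
CONSTRUCTIONS = ("adnominal", "motion", "non-motion", "governed")


def construction_proportions(counts):
    """Convert raw token counts per construction into percentages of the total."""
    total = sum(counts)
    if total == 0:
        # No attested tokens with this ending: report 0% in every construction.
        return [0.0] * len(counts)
    return [100.0 * count / total for count in counts]


def mean_proportions(rows):
    """Average per-lexeme percentages so that every lexeme counts equally."""
    return [sum(column) / len(rows) for column in zip(*rows)]


# Hypothetical token counts (NOT taken from the corpus), one list per lexeme,
# ordered as in CONSTRUCTIONS.
hypothetical_counts = {
    "frequent lexeme": [400, 500, 90, 10],
    "rare lexeme": [36, 3, 1, 0],
}

per_lexeme = {
    lexeme: construction_proportions(counts)
    for lexeme, counts in hypothetical_counts.items()
}
averaged = mean_proportions(list(per_lexeme.values()))

for name, value in zip(CONSTRUCTIONS, averaged):
    print(f"{name}: {value:.2f}%")
```

Summing raw counts instead would let the frequent lexeme dominate the result, which is the distortion the proportional approach is meant to avoid.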

We took the average of those proportions for all seven words in each series to arrive at the results in Figs. 4a and 4b. Because of the small numbers of corpus examples involved in the ‘governed’ categories (N = 3 to N = 57, with the vast bulk of examples coming from only four lexemes), we are best off excluding these and looking at the remaining three construction types.

Here we see that corpus data show a very similar distribution to our survey data. For the genitive singular, we find that -u has a higher occurrence for adnominal constructions, whereas the proportion of -a forms increases for constructions with motion and non-motion prepositions. For the locative singular, we find that -ě occurs more frequently in immediate conjunction with a preceding preposition of location, whereas its frequency is lower with an interposed adjective, and lower still when the preposition does not indicate location.

We ran Spearman’s rho tests to check whether there was any significant correlation between the syntactic judgments and syntactic data from the corpus. None of the statistics reached the level of significance (in all instances we found p > 0.05), but three of the four approached this level (0.10

Fig. 4a Proportions of corpus forms with each ending, by syntactic construction—genitive singular

Fig. 4b Proportions of corpus forms with each ending, by syntactic construction—locative singular

This tallies with our observation that the main effect on ratings is from the particular lexeme, rather than the syntactic construction in which it is found. A data set more closely tailored to the impact of syntactic constructions might have yielded more information in this regard.

14 Conclusions

In this article, we have examined two examples of overdifferentiation: a part of the language system where language users have multiple methods for signalling a particular grammatical/syntactic relationship at their disposal. Brown (2007, 64) analyzed this same effect with the Russian masculine inanimate locative case (which, incidentally, employs the same two historical morphs as the Czech locative examined in this article). The relatively well-defined set of word-formational, syntactic and semantic environments that allowed Brown to map contexts and situations where -u forms are used in Russian is absent in the Czech data. Instead, in Czech we are presented with an example of a more or less stable variation that is not clearly and decisively linked to the features that are often tentatively put forward in the scholarly literature and especially in handbooks as influencing factors. The latter are, as a reminder:

• Geographically, variants with -a and -ě are said to be more frequent in Moravia than in Bohemia.
• The distribution of locative singular formants is less clearly delineated than that of genitive singular formants.
• Frequency of one or another formant can be influenced by the syntactic construction, and this is said to depend on whether we are dealing with an object of the verb or an adverbial construction. The adverbial meaning of place and to a lesser extent time or manner is often connected with the formant -a (gen.) or -ě (loc.), and other functions are connected with the formant -u.

We found little evidence to support the contention that region of origin, age or gender plays a significant role. The few regional indicators we found did not indicate a consistent and significant difference in preference for particular endings; rather, they differed on a word-by-word basis, contradicting information found in the handbooks.
There is some tentative evidence of age-related change, which might explain this contradiction. Our survey data comes primarily from younger speakers, while handbook information is presumably based on the most recent large-scale dialect surveys, which were undertaken a generation or more ago.

As predicted, the locative singular does exhibit more variation than the genitive singular, with a much larger group of nouns showing considerable variation, and more of a balance in the total number of forms found with each formant. This matches the much higher overall acceptability ratings found for locative forms of both stripes.

Although the primary determinant of an ending’s acceptability does simply vary widely word by word, we found evidence that the syntactic environment also influences acceptability ratings. These patterns replicate those found in the corpus and bolster the argument that corpus data and acceptability judgments are linked. They also support the general tendencies put forth in grammars of Czech, although they suggest a less mechanistic and more cognitively plausible explanation for why the less common ending has been maintained in certain syntactic constructions.

Our findings support emergentist explanations for this phenomenon: that the minority endings, which are closely associated with individual cases, receive higher ratings when the context is highly typical for that case and the construction used is fixed and unchanging. This observation fits in with the phenomenon that Langacker terms abstraction: “the emergence of a structure through reinforcement of the commonality inherent in multiple experiences” (2000, 4).
He goes on to say that “[b]y its very nature, this abstractive process ‘filters out’ those facets of the individual experiences which do not recur”; the schema that can result from this process is a “commonality that emerges from distinct structures when one abstracts away from their points of difference by portraying them with lesser precision and specificity”.

As the emergentist position tries to find explanations that are psychologically plausible, we need not abandon our finding that relative frequency of a form has the greatest effect on acceptability; instead, we can posit that language users have reference to a higher-level set of schemas in addition to their experience of frequency. These schemas may tip their acceptability ratings one way or another, although they do not seem to override the frequency data entirely. Speakers would thus have in the first instance recourse to their sense of frequency and entrenchment, but would also as a fallback have recourse to coarser-grained frequency data on syntactic structures. The linkage of minority endings with easily abstractable structures seems to bolster their acceptability.

Langacker describes a scenario he calls “interactive activation” in which there is an “activation set” of which only one member will emerge “as the active structure” (Langacker 2000, 16). Frequency is, he says, the main determinant as to which member emerges (here entrenchment is evoked), but “contextual priming [...] can override the effects of familiarity”. If in the contemporary category of Czech these minority formants are more easily abstracted, then higher acceptability could be indirectly demonstrating that they are susceptible to contextual priming, making their appearance more likely in those syntactic constructions and explaining the corpus data as well. That development could be decisive in preventing the forms from being preempted and contributing to the maintenance of a system with pervasive variation.

Source

Czech National Corpus [Český národní korpus, including SYN2000 and SYN2005], http://www.korpus.cz.

References

Bader, M., & Häussler, J. (2009). Toward a model of grammaticality judgments. Journal of Linguistics, 46(2), 273–330.
Bermel, N. (1993). Sémantické rozdíly v tvarech českého lokálu. Naše řeč, 76, 192–198.
Bermel, N. (2004). V korpuse nebo v korpusu? Co nám řekne (a neřekne) ČNK o morfologické variaci v tvarech lokálu. In Z. Hladká & P. Karlík (Eds.), Čeština – univerzália a specifika, 5. Sborník 5. mezinárodního setkání bohemistů v Brně 13.–15.11.2003 (pp. 163–171). Praha.
Bermel, N. (2010). Variace a frekvence variant na příkladu tvrdých neživotných maskulin. In S. Čmejrková, J. Hoffmannová, & E. Havlová (Eds.), Užívání a prožívání jazyka. K 90. narozeninám Františka Daneše (pp. 135–140). Praha.
Bermel, N., & Knittl, L. (forthcoming). Corpus frequency and acceptability judgments: a study of morphosyntactic variants in Czech. Corpus Linguistics and Linguistic Theory.
Brooks, P. J., & Tomasello, M. (1999). How children constrain their argument structure constructions. Language, 75(4), 720–738.
Brown, D. (2007). Peripheral functions and overdifferentiation: the Russian second locative. Russian Linguistics, 31(1), 61–76.
Čermák, F. et al. (1997). Recepce současné češtiny a reprezentativnost korpusu (Výsledky a některé souvislosti jedné orientační sondy na pozadí budování Českého národního korpusu). Slovo a slovesnost, 58(2), 117–124.
Cowart, W. (1997). Experimental syntax. Applying objective methods to sentence judgments. Thousand Oaks.
Cummins, G. (1995). Locative in Czech: -u or -ě? Choosing locative singular endings in Czech nouns. Slavic and East European Journal, 39(2), 241–260.
Cvrček, V. et al. (2010). Mluvnice současné češtiny. Praha.
Divjak, D. (2008). On (in)frequency and (un)acceptability. In B. Lewandowska-Tomaszczyk (Ed.), Corpus linguistics, computer tools, and applications – state of the art. PALC 2007 (Lodz Studies in Language, 17) (pp. 213–233). Frankfurt.
Featherston, S. (2005). The decathlon model of empirical syntax. In S. Kepser & M. Reis (Eds.), Linguistic evidence. Empirical, theoretical, and computational perspectives (Studies in Generative Grammar, 85) (pp. 187–208). Berlin.
Field, A. (2009). Discovering statistics using SPSS. London.
Goldberg, A. E. (2003). Constructions: a new theoretical approach to language. Trends in Cognitive Science, 7(5), 219–224.
Goldberg, A. E. (2009). The nature of generalization in language. Cognitive Linguistics, 20(1), 93–127.
Karlík, P., Nekula, M., & Rusínová, Z. (Eds.) (1995). Příruční mluvnice češtiny. Praha.
Kasal, J. (1992). Dublety a jejich využití. Philologica, 65 (Studia Bohemica, 6) (pp. 107–114).
Kempen, G., & Harbusch, K. (2005). The relationship between grammaticality ratings and corpus frequencies: a case study into word order variability in the midfield of German clauses. In S. Kepser & M. Reis (Eds.), Linguistic evidence. Empirical, theoretical, and computational perspectives (Studies in Generative Grammar, 85) (pp. 329–349). Berlin.
Kempen, G., & Harbusch, K. (2008). Comparing linguistic judgments and corpus frequencies as windows on grammatical competence: a study of argument linearization in German clauses. In A. Steube (Ed.), The discourse potential of underspecified structures (Language, Context, and Cognition, 8) (pp. 179–192). Berlin.
Klimeš, L. (1953). Lokál singuláru a plurálu vzoru ‘hrad’ a ‘město’. Naše řeč, 36, 212–219.
Králík, J., & Šulc, M. (2005). The representativeness of Czech corpora. International Journal of Corpus Linguistics, 10(3), 357–366.
Langacker, R. W. (2000). A dynamic usage-based model. In M. Barlow & S. Kemmer (Eds.), Usage-based models of language (pp. 1–63). Stanford.
Marcus, G. F. et al. (1995). German inflection: the exception that proves the rule. Cognitive Psychology, 29(3), 189–256.
McKoon, G., & Macfarland, T. (2000). Externally and internally caused change of state verbs. Language, 76(4), 833–858.
Petr, J. (Ed.) (1986). Mluvnice češtiny. Volume 2: Tvarosloví. Praha.
Rusínová, Z. (1992). Některé aspekty distribuce alomorfů (genitiv a lokál sg. maskulin). Sborník prací filozofické fakulty brněnské univerzity, A, 40, 23–31.
Schütze, C. T. (1996). The empirical base of linguistics. Grammaticality judgments and linguistic methodology. Chicago.
Sedláček, M. (1982). V ‘Záhřebě’ i v ‘Záhřebu’. Naše řeč, 65, 11–15.
Štícha, F. (2009). Lokál singuláru tvrdých neživotných maskulin (ve vlaku vs. v potoce): úzus a gramatičnost. Slovo a slovesnost, 70(3), 193–220.
Šulc, M. (2001). Životná koncovka -a v akuzativu singuláru neživotných maskulin. Jazykovědné aktuality, 38(3), 117–128.