Corpus evidence for prototype-driven alternations: the case of German weak nouns (Draft of August 20, 2014)

Roland Schäfer, Freie Universität Berlin1

In this paper, I present a large study of so-called “weak” nouns in German based on the 9.1 billion token DECOW2012 web corpus. The weak nouns form a small class of masculine nouns with prototypical semantic and phonotactic features as well as case and number affixes which are unusual within the German system of inflection. These weak nouns have a certain tendency to be assimilated to the dominant “strong” inflectional pattern. I quantify the strength of this tendency and model it as an effect of the presence or absence of the prototypical features in the nouns as well as paradigmatic (case) features and token frequency. While staying neutral with respect to specific theories, this analysis provides overall evidence for usage-based theories, and it shows that even low-frequency alternations which are typical of non-standard language can be examined in corpus studies, provided that very large (and therefore necessarily web-derived) corpora are available.

Keywords: prototype theory, alternations, German, noun inflection, web corpora

1 I would like to thank the following people (in alphabetical order) for comments, discussion, and suggestions: Götz Keydana, Ulrike Sayatz, Christian Zimmer. I also thank the participants of the workshop “Usage-Based Approaches to Morphology” at the Annual Conference of the German Linguistic Society (DGfS) 2013. All remaining errors and inadequacies are mine, and I intend to keep them.

1 Overview

On July 23, 2012, the German online newspaper Spiegel Online published an article about a controversial statement made by Philipp Rösler, the former leader of the German liberal democratic party. In that article, there is a comment from a fellow party member quoted in a headline as (1a), but repeated in the text body as (1b).2

(1) a. Auf welchem Planeten lebt er?

on which planet lives he

On which planet does he live?

b. Auf welchem Planet lebt er?

on which planet lives he

In (1a), the “weak” masculine noun Planet (‘planet’) takes the inflectional marker -en in the dative, but it does not in (1b). The form in (1b) represents a non-standard alternation, because dative singular forms without a suffix are characteristic of the dominant “strong” masculine inflection pattern. While it is impossible to find out which variant was originally uttered in this case, many native speakers would agree that dropping the -en in this case might be stylistically dispreferred in written standard German, but that it is not unusual in colloquial spoken German, and that it is far from full ungrammaticality. Examples of an accusative and a genitive of a weak noun (Mensch, ‘man, human’) which inflect according to the strong pattern (accusative Mensch instead of Mensch-en and genitive Mensch-es instead of Mensch-en) can be found in (2) and (3).3,4

2 http://www.spiegel.de/politik/deutschland/philipp-roesler-empoert-fdp-freunde-mit-griechenland-aeusserung-a-845980.html

(2) Gibt es einen Mensch, der stetig wächst?

gives it a man who constantly grows

Is there any man who grows constantly?

(3) Das Leben eines Mensches wird zu politischen Zwecken

the life of.a man becomes for political reasons

aufs Spiel gesetzt.

on.the game put

The life of a man is put at risk for political reasons.

The weak nouns form a small class of just over 450 masculine nouns (compared to thousands of strong masculine nouns), and they have remarkable prototypical semantic and phonotactic properties (Köpcke [1995]), for example human denotation or non-final accent (cf. Section 2). In addition, their case and number forms are quite remarkable within the German system in that they use -en as a non-nominative-singular marker (Thieroff [2003]). All forms except for the nominative singular take the suffix -en (see Section 2). Not surprisingly, given the low type frequency of the weak nouns and their non-canonical inflection, some of them have been fully assimilated to the strong inflectional pattern (Wurzel [1985], Joeres [1996]). Additionally, nouns which are still predominantly weak have alternative strong forms as exemplified in (1) – (3).

No extensive corpus study has been presented which shows whether the presence or the absence of the prototypical semantic and phonotactic features of weak nouns influences the alternation strength. By “alternation strength”, I mean the probability that a weak noun occurs in a strong form. The lack of corpus evidence was, in my view, mostly due to the lack of adequate corpora, and the present paper aims to remedy this situation by using a very large web corpus of German for a data-driven study.

3 http://www.flegel-g.de/wachstum-wachstum.html

4 http://www.mumia.de/doc/aktuell/991201ai00.html

I pursue two major goals in this paper: a theoretical and a methodological one. On the theoretical side, I consider this study to contribute evidence for the adequacy of usage-based models of inflectional morphology, without making strong claims in favor of a specific version of, for example, Construction Morphology such as advocated in Booij (2010), who is not very much concerned with inflection anyway. A prototype or schema approach as proposed by Köpcke (1988, 1995, 1998) suggests itself, not only because it was applied to German nominal inflection within those publications. The advantage of a schema approach is that it specifically allows for central and non-central membership. Lexical items less strongly associated with a certain schema can be assumed to disassociate from that schema and associate with another one more readily than strongly associated items. In the case at hand, I suggest that such a theory predicts that the alternation strength of highly prototypical weak nouns should be lower than that of less prototypical weak nouns. My corpus study not only confirms this hypothesis but also quantifies the strength of the influence of the different semantic, phonotactic, and morphological properties.

It is a difficult question whether and how such corpus data can be interpreted in “cognitive” terms. I cautiously subscribe to a view of corpus linguistics as “psychologically informed (cognitively-inspired) usage-based linguistics” (Gries [2010: 334]). In this vein, the inspiration for the present study comes from the cognitive schema-based theories by Köpcke. Also, by producing results from corpus data which are highly compatible with those theories, the study contributes evidence to support the assumption that reflexes of cognitive representations can indeed be measured in corpus data.

As for the second (methodological) goal, I demonstrate that data-driven corpus studies of rare event alternations are possible, even if the alternations predominantly occur in unedited colloquial written language, provided that corpora of the appropriate size and nature are available. I suggest web corpora in the region of 10 billion tokens as the appropriate source of data. I consider this methodological goal to be as important as the theoretical one.

I now briefly describe the system of noun inflection in German as well as the position of the weak nouns within that system in Section 2. Section 3 presents the corpus study including the statistical results. Section 4 summarizes the findings.

2 German weak nouns

2.1 German noun inflection

In this section, I briefly describe noun inflection in current German, mostly for readers who are unfamiliar with German. I follow the overviews in Augst (1979), Eisenberg (2000, 2013), and Schäfer (2014, submitted). German has a three-way gender distinction (neuter, masculine, feminine), with four cases (nominative, accusative, dative, genitive) and a two-way number distinction (singular, plural).5 Leaving only the weak nouns aside (cf. Section 2.2), case inflection in nouns reduces to two simple rules: The dative plural takes -(e)n (after the plural marker) whenever it is phonotactically possible, and the genitive singular of masculine and neuter nouns always takes -(e)s.6

5 Many syncretisms can be observed in the case system, but if we take the whole range of case-marked parts-of-speech (nouns, articles, pronouns, and adjectives), non-ambiguous NP forms can be found for each of the four cases.

Furthermore, gender and plural inflection are tightly coupled inasmuch as a noun’s gender largely predetermines its plural inflection. Leaving aside the weak nouns and some other small classes once again, the general masculine and neuter (strong) plural marker is -e (with or without stem umlaut) as in Stuhl/Stühl-e (‘chair/s’) and Gurt/Gurt-e (‘belt/s’). The general feminine plural marker is -en as in Nadel/Nadel-n (‘needle/s’). Sometimes, the prototypical masculine and neuter plural suffixes occur with feminine nouns and vice versa. Examples are masculine Staat-en (‘states’) or feminine Wänd-e (‘walls’). The class of weak nouns also takes the atypical masculine plural marker -en. I now turn to the additional peculiarities of weak nouns in Section 2.2.

2.2 Weak nouns

In addition to taking the rare masculine plural -en, weak nouns deviate from the otherwise strict pattern of case marking in that they take the suffix -en in all singular forms except for the nominative. In other words, -en becomes a non-nominative-singular marker instead of a simple plural marker. In this section, I briefly summarize the relevant previous work on weak nouns, especially regarding their special status within the German declension system.

6 In inflectional affixes, orthographic <e> always corresponds phonologically to schwa. For -en and -es, there are actually two variants (with and without schwa), and their selection is conditioned by phonotactic properties of the stem. The exact conditions are irrelevant for the given purpose, and I refer to the suffixes as -en and -es throughout the remainder of this paper.

For the present corpus study, Köpcke (1995) is most pertinent. In Köpcke’s paper, prototypes for weak nouns are established. The two prototypes are defined using lexical semantic and phonotactic features (cf. the overview in Köpcke [1995: 168, 178]). Both prototypes are characterized by denoting humans and having masculine grammatical gender. The semantic property “human denotation” is, of course, not exclusive to weak nouns. Prototype I is additionally defined by final schwa, polysyllabicity, and penultimate accent (Matrose [ma'tʁoːzə] ‘sailor’). Prototype II has ultimate accent and is also polysyllabic (Artist [ʔaʁ'tɪst] ‘artiste’).7 According to Köpcke himself, but also Thieroff (2003: 111–113), the prototype groups contain marginal (i.e., not very prototypical) members. This is expected under a prototype view and does not refute the general idea. Köpcke (2000: 119) then goes so far as to call the final schwa (cf. Prototype I) a marker of humanness or even agentivity for masculine nouns. Indeed, there are only a few masculine (weak) nouns ending in schwa which do not denote humans (Affe ‘ape’, Löwe ‘lion’, etc.). This quasi-causal relationship between form and lexical meaning is a far-reaching interpretation which cannot be tested in a corpus study and is therefore left aside here.

More importantly, Köpcke argues based on diachronic data that those weak nouns which have already become strong have done so based on semantic properties (non-relatedness to humans).

Much more has been said about weak nouns, often from a diachronic perspective. However, I cannot test hypotheses from most of this body of work using a corpus containing exclusively present-day German. Most pertinent would possibly be Eisenberg (2000), who tries to establish the weak nouns as a “fourth grammatical gender” based on their morphophonological and lexical semantic properties. This, however, is problematic inasmuch as it obviously applies a new definition of the term “grammatical gender”. After all, the weak nouns group perfectly with other masculine nouns in terms of agreement with articles, pronouns, and adjectives and are otherwise unanimously classified as masculine. Clearly, Eisenberg’s analysis defies empirical testing.

7 All transcriptions represent the German standard pronunciation according to Krech et al. (2009).

Number    Case               Weak (canonical)   Strong (alternate)
Singular  Nominative         der Linguist       der Linguist
          Accusative         den Linguist-en    den Linguist
          Dative             dem Linguist-en    dem Linguist
          Genitive           des Linguist-en    des Linguist-s
Plural    Nominative         die Linguist-en
          Accusative/Dative  den Linguist-en
          Genitive           der Linguist-en

Table 2.1 The canonical weak and the alternative strong forms of weak nouns with the appropriate form of the definite article der (‘the’). Gray cells highlight a syncretism between full NPs in the canonical singular and the plural in the accusative (Section 3.1).

As explained in Section 1, weak nouns have a certain (yet unquantified) tendency to occur in the strong inflection, cf. Thieroff (2003). Table 2.1 shows the canonical weak and the alternate strong forms.8 Obviously, only the non-nominative singular is affected.

Thieroff (2003: 113–115) convincingly points out that there are certain paradigmatic forces that make it more or less natural for weak nouns to occur in strong inflectional forms (cf. also Section 2.1 above). The weak nouns are unusual because, firstly, the accusative and dative singular of all other nouns never take suffixes.9 Secondly, the genitive singular of masculine nouns is otherwise always distinguishable from the accusative and the dative. Thirdly, the genitive marker (if there is one) is otherwise always -es.

8 Strong genitive singular forms (mentioned by Thieroff [2003: 114–115]) with a re-analyzed stem such as Menschen-s (alternatively Mensch-ens) turned out to be so rare in terms of token frequency that I ignore them completely. Working with frequencies near zero causes numerous problems with most methods in inferential statistics.

I want to add that the weak nouns also do not follow the regularity that the nominative and the accusative of nouns are always formally identical. Also, the affix pattern of weak nouns is exactly the same as that of the so-called weak adjectives in the masculine singular, including cases where the adjective is used as the head of the NP as in den Groß-en (accusative, ‘the big one’), dem Groß-en (dative), des Groß-en (genitive). The weak pattern is thus not just atypical of nouns, but at the same time typical of another part-of-speech.10 This might contribute to the tendency of weak nouns to leave their original paradigm, and it is a fact which, to my knowledge, has not been mentioned so far.

To summarize, there are reasons of paradigmatic unity (prototypicality of case marking) which could cause weak nouns to become strong. On the other hand, the weak nouns constitute a very homogeneous group based on their prototypical semantic and phonotactic properties. I suggest that the presence of those properties should exert a measurable force in favor of the preservation of the unusual pattern of case and number marking. To substantiate this hypothesis, I will quantify the alternation effect, and I will show how strongly the different forces influence the strength of the alternation. Since the nominative is never affected, I now look exclusively at the accusative, dative, and genitive forms of weak nouns in Section 3.11

9 The archaic dative singular -e is virtually non-existent in current German.

10 Weak forms of adjectives occur after inflected determiners. When there is no determiner or a determiner without case marker, then the adjective takes over the inflection of the articles and pronouns. An overview of the phenomena and a careful critical discussion of previous approaches can be found in Demske (2001: Chapter 2).

3 Corpus study

3.1 Challenges and corpus choice

For a data-driven corpus-based account of the factors that influence the strength of the alternation of weak nouns toward the strong pattern, the source of data has to satisfy two requirements which are not satisfied by most available corpora. First, the corpus has to contain more or less colloquial spontaneous writing, because the strong forms of weak nouns are very rare in standard written German. Occurrences in professionally written media like (1) in Section 1 are quite exceptional. Secondly, because the phenomenon is rare, an appropriate corpus should be as large as possible. The size requirement is even more vital because I wanted to query for unambiguous case forms automatically. This means that only co-occurrences of weak nouns with the singular indefinite determiner can be used, as I am going to argue now.

Consider the forms of which the frequencies are to be compared, as shown in Table 2.1. German masculine determiners have unambiguous case endings in the singular, so it appears that we could just search for sequences of a determiner, optionally followed by one or more (appropriately inflected) adjectives, followed by a weak noun. However, NPs containing the canonical accusative singular are homonymous with those containing the accusative or dative plural. This makes it impossible to search for canonical forms with determiners that have a plural without manual inspection of large numbers of sentences. The only determiner without a plural is ein (‘a’), because indefinite plural NPs do not take a determiner. As a quantifier, ein (‘one’) is semantically blocked from occurring in the plural. This reduces the unambiguous configurations suitable for a data-driven corpus study to those with the determiner ein, which means that I can only take into account a small fraction of all occurrences of weak nouns.12

11 There are a few exceptions where the nominative singular is affected as the result of idiosyncratic historic developments. Most of them have a weak variant like Friede (‘peace’) and a strong variant with a re-analyzed nominative like Frieden. These nouns were excluded from the samples, cf. Section 3.

Because of such problems of data sparseness, the DECOW2012 web corpus was chosen.13 It is part of the COW collection of web corpora and contains 9.1 billion tokens, which makes it the largest linguistically annotated corpus of German available at the time of this writing.14 Since not much is known about the properties of large web corpora from a linguistic perspective, I briefly justify the use of this resource here. The DECOW corpus was described by its creators in Schäfer and Bildhauer (2012, 2013). Schäfer and Bildhauer (2012: 492–493) and Biemann et al. (2013: 48) present an assessment of the text types and registers contained in the corpus based on manually annotated samples, showing that an estimated 22.5% of the documents in the corpus are written in a “quasi-spontaneous” mode, i.e., mostly forum discussions. This is the kind of text where we can hope to find most non-standard strong forms of weak nouns. Biemann et al. (pp. 37–46) also show that the COW corpora compare favorably to traditional and web corpora in extrinsic evaluations (collocation extraction tasks). In Schäfer and Sayatz (2014), DECOW2012 was used for similar reasons as in this study, and Van Goethem and Hiligsmann (2014) base part of their corpus study on another COW corpus (NLCOW2012) for comparable reasons. The use of earlier web corpora like the WaCky corpora (Baroni et al., 2009) has also been justified convincingly by, for example, Zeldes (2012: 96–98). The corpus which I have chosen here can therefore be considered a recent but adequately tested source of data.

12 Searching only for NPs with the indefinite article has another advantage. It excludes configurations for which there are claims in the literature (e.g., Müller [2002] or Sternefeld [2004]) that inflectional markers can be dropped for independent syntactic or functional reasons. These configurations include determinerless datives of weak nouns in phrases such as ohne Dirigent (‘without a conductor’) instead of *ohne Dirigent-en (singular reading). Further complications caused by such effects are elegantly avoided because they do not affect NPs with the determiner ein.

13 The corpus was built and is hosted by the Group at Freie Universität, Berlin.

14 There are other COW corpora available in Dutch, English, French, Spanish and Swedish.

In the next section, I describe my method of obtaining reliable counts from such sources of data by formulating precise queries in a spirit comparable to that of Zeldes (2012: 97), who states (about his work with the WaCky corpora) that “[t]his type of automatically retrieved data is rather heterogeneous and possibly error-prone, so a main priority in searching through such corpora is to ensure high accuracy of results by formulating precise queries and manually evaluating error rates as required.” I want to add and clarify that heterogeneity in the data is also a blessing, because it increases the external validity of a study, i.e., the probability that the findings generalize well (Maxwell and Delaney 2004: 30). Also, considering that corpora like DECOW allow us to take very large samples, the “error-prone” nature of the data is less of a problem, because the error is distributed randomly, at least if there is no systematic sampling error.

3.2 Descriptive statistics

Noun       Meaning                 Reason for removal
Artist     ‘artiste’               Contamination with an English loan (‘artist’).
Bauer      ‘peasant’               Frequent proper name.
Buchstabe  ‘letter’ (alphabet)     Two stems.
Friede     ‘peace’                 Two stems.
Gedanke    ‘thought’               Two stems.
Glaube     ‘belief’                Two stems.
Götze      ‘idol’                  Frequent proper name.
Herr       ‘Sir’, ‘Mister’         Special appositional uses.
Junge      ‘boy’ (masc)            Homonymous neuter noun ‘young’ (always strong).
Mensch     ‘human, man’            Extremely frequent.
Name       ‘name’                  Two stems.
Page       ‘footboy’               Sample contaminated with English loans such as einen Page Fault (‘a page fault’ [Acc]).
Resident   ‘resident’              Contamination with an English loan (‘resident DJ’).
Same       ‘Lapp’; ‘seed, semen’   Two stems, and two mostly indistinguishable near-homonyms.
Steinmetz  ‘mason’                 Almost 90% strong.
Titan      ‘titan’                 Proper name (Saturn moon).
Wille      ‘will’                  Two stems.

Table 3.1. Lemmas which were removed from the sample.

To derive a sample for the analysis of the counts of weak and non-canonical strong uses of weak nouns, I proceeded as follows. First, a list of weak nouns was bootstrapped from DECOW2012 by looking for word forms that end in -en, preceded by a determiner in the genitive. Using this bootstrapping method requires each selected noun to occur at least once in a canonically inflected genitive form with a determiner and without an adjective. This is a desirable side effect because it ensures that no extremely infrequent nouns make it into the final sample. The list was sifted by hand in order to eliminate erroneous matches, including strong nouns taking on weak forms (see discussion in Section 3.4). In total, 451 noun lemmas were found, which is roughly in line with Bittner (2003: 98), who states that forty native and more than four hundred non-native lemmas fall into the class of weak nouns.

Then, the final sample was taken by querying inflected non-nominative singular forms of ein (accusative einen, dative einem, genitive eines), optionally followed by properly inflected adjectives, followed by either the weak or the strong form of each of the lemmas. In total, 451 lemmas × 3 cases × 2 alternative forms = 2,706 queries were executed.15 In the resulting sample (n = 953,117), there were 26,667 token occurrences of the strong inflectional pattern and 926,450 token occurrences of the weak inflectional pattern. Accounting for a mere 2.8% of all occurrences, the use of strong inflection can thus be considered a rare event. In Appendix 1, a list of the noun lemmas and their frequencies in the final sample can be found. The confidence intervals for the proportion of strong forms specified there (per lemma) are large, which indicates that analyzing the alternation strengths of single lemmas is not advisable from a statistical point of view.
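The query scheme just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure actually used for the study: the function name build_queries and the dictionary layout are hypothetical, the word forms are supplied by hand rather than derived, and the query template simply generalizes the single sample query given in footnote 15. The forms for Linguist are taken from Table 2.1.

```python
from itertools import product

# Inflected non-nominative singular forms of the indefinite determiner "ein".
DETERMINERS = {"acc": "einen", "dat": "einem", "gen": "eines"}

# Optional run of attributive adjectives ending in -en, as in footnote 15.
ADJECTIVES = '[pos="ADJA"&word=".+en"]*'


def build_queries(forms):
    """Return {(lemma, case, pattern): CQP query string}.

    `forms` maps each lemma to hand-supplied word forms (hypothetical layout):
    {"weak": ..., "strong": ..., "strong_gen": ...}.
    """
    queries = {}
    for (lemma, f), case in product(forms.items(), DETERMINERS):
        # The strong pattern has a suffix only in the genitive (Table 2.1).
        strong = f["strong_gen"] if case == "gen" else f["strong"]
        for pattern, noun in (("weak", f["weak"]), ("strong", strong)):
            queries[(lemma, case, pattern)] = (
                f'[word="{DETERMINERS[case]}"%c]{ADJECTIVES}[word="{noun}"%c]'
            )
    return queries


if __name__ == "__main__":
    forms = {
        "Linguist": {
            "weak": "Linguisten",
            "strong": "Linguist",
            "strong_gen": "Linguists",
        }
    }
    for key, query in build_queries(forms).items():
        print(key, query)
```

Each lemma yields 3 cases × 2 patterns = 6 queries, so a list of 451 lemmas yields the 2,706 queries mentioned above.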

Some of the lemmas turned out to be problematic and were removed. Some have homonyms which are frequent proper names, which always inflect according to the strong pattern. German personal names also occur with determiners, not just in dialects, but also in certain styles of writing, cf. (4).16

15 A sample query as performed in the IMS Open Corpus Workbench (Evert and Hardie, 2011):

[word="einen"%c][pos="ADJA"&word=".+en"]*[word="Abiturient"%c]

16 http://www.breitnigge.de/2011/01/14/so-schnell-sind-23-tage-winterpause-rum/

(4) Ich werde lieber mit einem Götze Zweiter

I become rather with a Götze second

als mit einer Amoroso-Diva Erster!

than with a Amoroso diva first

I would rather finish second with Götze than first with a diva like Amoroso!

Other problematic nouns are those for which it has been known for a long time that they alternate strongly between weak and strong use for individual historic reasons, some even having re-analyzed alternative stems ending in -en (e.g., strong Frieden [‘peace’] with the genitive Friedens versus the older but co-existing weak Friede with the genitive Frieden).17 Sometimes, the differentiation of stems correlates with a semantic differentiation (Wurzel [1985], Joeres [1996], also Köpcke [1995: 173]). For the corpus search, this means that these nouns introduce an ambiguity between accusatives and datives of the reanalyzed stem without a suffix (e.g., dem Frieden) and accusatives and datives of the older stem with a suffix (e.g., dem Friede-n). This ambiguity cannot be resolved automatically. Also, they have a highly increased alternation strength compared to the other weak nouns, for example 13% strong forms for Friede. These facts indicate that they do not belong to the otherwise homogeneous group of weak nouns without re-analyzed stems and with a low alternation strength. Therefore, I decided to remove them from the sample. The noun Steinmetz (‘mason’) was also removed because it has almost completed its development into a strong noun already (without stem reanalysis). The noun Mensch (‘man’, ‘human’; 221,210 occurrences) is extremely frequent and occurs over eight times more often than the next most frequent weak noun Kunde (‘customer’; 26,576 occurrences). I removed it because otherwise the sample would have contained an extremely large proportion of token occurrences of this single noun. The nouns which were removed are summarized in Table 3.1. They are still included in the list in Appendix 1.

17 Some nouns like Weizen (‘wheat’) have already completed a similar development.

The reduced sample contains 433 lemmas and 466,922 token occurrences of which 10,488 are strong and 456,434 are weak. The proportion of strong forms drops by 0.6 percentage points to 2.2%, which is mostly due to the removal of the historic exceptions with two stems. However, I still have a very large sample of nouns which show the low-frequency alternation effect in which I am interested.
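The proportions reported for the full and the reduced sample follow directly from the token counts given in the text. The short Python sketch below only redoes that arithmetic; the variable and function names are chosen for illustration and are not part of the original analysis.

```python
# Token counts as reported in the text for the full and the reduced sample.
full = {"strong": 26_667, "weak": 926_450}
reduced = {"strong": 10_488, "weak": 456_434}


def share_strong(counts):
    """Proportion of strong forms among all token occurrences."""
    return counts["strong"] / (counts["strong"] + counts["weak"])


share_full = 100 * share_strong(full)        # in percent
share_reduced = 100 * share_strong(reduced)  # in percent
drop = share_full - share_reduced            # in percentage points

print(f"full sample:    {share_full:.1f}% strong")
print(f"reduced sample: {share_reduced:.1f}% strong")
print(f"difference:     {drop:.1f} percentage points")
```

Rounded to one decimal place, the values agree with the 2.8%, 2.2%, and 0.6 given in the text.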

Figure 3.1. Bagplot of the distribution of the weak and the strong forms. Each dot represents one lemma. The x axis shows the raw frequency of the strong form (non-canonical), and the y axis shows the raw frequency of the weak form. Ten outliers are not shown.

Figure 3.1 shows a bagplot of the reduced sample, where each dot represents a weak noun, and the coordinates correspond to the total number of occurrences in the corpus (x axis for strong, y axis for weak occurrences). Bagplots (Rousseeuw et al., 1999) are bivariate boxplots. The inner polygon (“bag”) contains half of the points. The outer polygon (“fence”) is the bag grown by a factor of three. Data points outside the fence are considered outliers. This bagplot shows that there are many infrequent and few frequent nouns in the sample, which is trivially expected given the power law distribution of word frequencies. However, we can also tell from the scales of the axes that the strong forms (x axis) are very rare compared to the weak forms (y axis). While there are a number of outliers, the majority of data points lies within the bag. All in all, 43 of 433 data points (ten not shown in the plot), i.e., 10%, fall outside the fence and can be considered outliers.

Case        Proportion (strong)  99% CI    n
accusative  0.0247               ±0.0009   218,740
dative      0.0278               ±0.0011   154,623
genitive    0.0112               ±0.0009   93,559

Table 3.2. Proportions of strong occurrences of weak nouns by case with 99% confidence intervals and sample size.
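The interval widths in Table 3.2 can be reproduced with the usual normal-approximation (Wald) interval for a proportion. The text does not state which interval was actually computed, so the choice of the Wald interval is an assumption made here for illustration, and the function name wald_half_width is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def wald_half_width(p, n, level=0.99):
    """Half-width of the normal-approximation (Wald) interval for a proportion."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # two-sided critical value
    return z * sqrt(p * (1 - p) / n)


# Proportions of strong forms and sample sizes as given in Table 3.2.
table_3_2 = {
    "accusative": (0.0247, 218_740),
    "dative": (0.0278, 154_623),
    "genitive": (0.0112, 93_559),
}

for case, (p, n) in table_3_2.items():
    print(f"{case:<10} {p:.4f} ±{wald_half_width(p, n):.4f}  n={n:,}")
```

Under this assumption, the half-widths round to the values shown in the table, and the three sample sizes add up to the 466,922 tokens of the reduced sample.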

Descriptively, it is obvious that there are remarkable distributional differences between strong and weak forms in the three grammatical cases (accusative, dative, and genitive). See Table 3.2 for an overview of the proportions in the three cases. In the accusative and dative, the strong inflection is much more frequent than in the genitive.18 This distribution is interpretable in line with the paradigmatic facts reported in Section 2.2. The genitive singular is always marked, be it in the strong or the weak pattern. While the weak genitive marked with -en is a less prototypical genitive than one marked with -es, at least it has a suffix. I assume that the weak genitive is thus less non-prototypical than the weak accusative and dative, and that it therefore resists the alternation most strongly. In addition to the semantic and phonotactic factors which I use in the Generalized Linear Model reported in Section 3.3, I will include case as a regressor to test how this effect of case performs in a multifactorial analysis.

18 Notice that the large sample sizes allow estimation of the population proportions with very high confidence.

3.3 Factors that influence the alternation strength in a logistic regression

Variable  Level              Description
Case                         grammatical case:
          Nom                nominative
          Acc                accusative
          Dat                dative
Sem                          semantic class of noun:
          Hum                human
          Ani                animate non-human
          Ina                inanimate
Pt                           phonotactics:
          PolySchwa          polysyllabic with final schwa
          PolyUlt            polysyllabic with final accent
          PolyNult           polysyllabic with non-final accent
          Mono               monosyllabic
LogFreq   [−2.1390, 2.2530]  lemma log-frequency per million tokens

Table 3.3. Variables used in the GLM reported in Section 3.3.
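The four levels of the phonotactic variable Pt can be derived from three atomic properties of a noun. The following Python sketch shows one way to do this coding; the function name and its parameters are hypothetical, and the annotation of the actual sample may have been done differently. The examples Matrose and Artist are taken from Section 2.2, while Mensch is added here as an illustration of a monosyllabic noun.

```python
def code_phonotactics(syllables, final_schwa, final_accent):
    """Map atomic phonotactic properties of a noun to one level of Pt."""
    if syllables < 1:
        raise ValueError("a noun has at least one syllable")
    if syllables == 1:
        # Monosyllables contain no schwa syllable and leave no choice of accent.
        return "Mono"
    if final_schwa:
        if final_accent:
            raise ValueError("schwa syllables never take the accent")
        return "PolySchwa"
    return "PolyUlt" if final_accent else "PolyNult"


if __name__ == "__main__":
    examples = {
        "Matrose": (3, True, False),  # final schwa, penultimate accent
        "Artist": (2, False, True),   # ultimate accent
        "Mensch": (1, False, True),   # monosyllabic
    }
    for noun, features in examples.items():
        print(noun, code_phonotactics(*features))
```

Collapsing the atomic properties into one variable keeps impossible combinations, such as an accented final schwa, out of the model altogether.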

In this section, I present the inferential statistics which show that the strength of the alternation of weak nouns toward the strong inflection is influenced both by prototypical features of weak nouns and by preferences for case marking in the three grammatical cases under examination. I perform logistic regression, i.e., I fit a binomial Generalized Linear Model (GLM), in order to test for the significance of factors and quantify the strength of the significant factors.19

19 All calculations were executed using the R statistics software (R Core Team [2014]), specifically the glm() function to fit the GLM. My approach to regression is guided by Fahrmeir et al. (2013) and Zuur et al. (2009).

Since the factors which play a role by hypothesis (cf. Section 2) are at least (1) the semantic class of the noun (human, animate non-human, inanimate), (2) final schwa (yes, no), (3) accent (final, non-final), and (4) case (accusative, dative, genitive), multifactorial modeling is required. A logistic regression on token occurrences of weak nouns with the aforementioned factors as regressors and weak/strong as the response variable is a reasonable choice. Note that schwa syllables never take the accent, that monosyllabic words never contain schwa syllables, and that monosyllabic words offer no choice as to where the accent is placed. Consequently, atomic phonotactic variables would not be independent. I therefore coded the nouns for a single phonotactic regressor variable (“Pt”) with the levels “PolyUlt” for polysyllabic words with final accent, “PolyNult” for polysyllabic words with non-final accent, “PolySchwa” for polysyllabic words ending in schwa (and consequently with non-final accent), and “Mono” for monosyllabic words. Table 3.3 summarizes the response variable, the regressors and their levels.

As shown in Section 3.2, strong forms of weak nouns are quite rare (2.2% of all forms). This makes logistic regression problematic in many ways, and special procedures exist for so-called “rare events regression”. They are described, for example, in King and Zeng (2001) and implemented by Gary King in the Zelig package for R (Owen et al. [2013]). Most readers would not be familiar with the specific output of those procedures, and I used a much simpler approach to rare events regression. I took a stratified sample where both outcomes (weak/strong) were represented in equal proportions. Models estimated based on such artificial samples are not useful for making predictions about actual occurrences of events. Since I use the regression model not for prediction (as in actuarial sciences, for example), but just in order to make an inference about the distribution of the levels of the regressor variables within the two strata defined by the response variable, this weakness is not relevant here. The sample used to estimate the GLM contained 10,000 events: 5,000 weak nouns inflected according to the canonical weak pattern, and 5,000 weak nouns inflected according to the strong pattern. Since reaching α levels is easy with samples of this size, the magnitudes of the effects are central to the interpretation.
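The balanced sampling step described above can be sketched as follows. This is only an illustration in Python (the study itself used R), and it assumes sampling without replacement within each stratum, which the text does not specify. The function name `balanced_sample`, the seed, and the toy population are hypothetical.

```python
import random


def balanced_sample(events, n_per_outcome, seed=0):
    """Draw the same number of events from each outcome stratum.

    `events` is a list of (outcome, record) pairs, where outcome is
    either "weak" or "strong". Sampling is done without replacement.
    """
    rng = random.Random(seed)
    sample = []
    for outcome in ("weak", "strong"):
        stratum = [event for event in events if event[0] == outcome]
        if len(stratum) < n_per_outcome:
            raise ValueError(f"not enough '{outcome}' events to sample from")
        sample.extend(rng.sample(stratum, n_per_outcome))
    rng.shuffle(sample)
    return sample


# Toy population with 2.2% strong forms, mirroring the proportion
# reported in Section 3.2 (the records themselves are placeholders).
population = [("weak", i) for i in range(9780)] + [("strong", i) for i in range(220)]
sample = balanced_sample(population, n_per_outcome=200)
print(sum(1 for outcome, _ in sample if outcome == "strong"), "strong of", len(sample))
```

Because the outcome proportions in such a sample are set by design, the intercept of a model fitted to it says nothing about how often strong forms occur in the corpus. Only the effects of the regressors relative to each other remain interpretable.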

I ran the standard model diagnostics (analysis of deviance, dispersion, pseudo-R²), and I made sure that there was no severe collinearity by calculating the generalized variance inflation factors (VIF) according to Fox and Monette (1992). Table 3.4 summarizes the model. All regressors were determined to have a significant overall effect using a log-likelihood ratio test at α = 0.05. The proportion of variance explained by the model is decent but not very high at R² = 0.220. From a linguistic point of view, I see no hypotheses about any additional fixed effects (i.e., explanatory variables that could be added) and hence conclude that the unexplained variance is simply due to a partially free nature of the alternation.20

20 Readers might speculate that a mixed model (GLMM) could attribute some of this unexplained variance to either lemma-specific or author-specific random effects. I tried “lemma” as a random effect, but counts for strong occurrences per lemma are too low to produce a high quality model. Speaker/writer as a random effect cannot be tested, because authorship is usually unknown in web corpora.

Regressor     β        R        std. err.   z         p
Intercept     −1.321   —        0.080       −17.767   < 0.01
  (SemHum, PtPolySchwa, CaseGen)
CaseAcc       0.557    1.746    0.068       9.044     < 0.01
CaseDat       0.698    2.010    0.070       10.721    < 0.01
SemAni        0.872    2.392    0.074       12.459    < 0.01
SemIna        0.708    2.029    0.067       9.971     < 0.01
PtPolyUlt     1.059    2.882    0.056       18.045    < 0.01
PtMono        2.066    7.891    0.077       25.980    < 0.01
PtPolyNult    2.502    12.211   0.154       17.631    < 0.01
LogFreq       −0.373   0.689    0.033       −9.883    < 0.01

Table 3.4. Summary of the model reported in Section 3.3. Binomial GLM. Response: strong inflection (non-canonical, positive coefficients) vs. weak variants (canonical, negative coefficients). β is the estimated coefficient and R the odds ratio. n = 10,000. VIF(Case) = 1.022, VIF(Sem) = 1.170, VIF(Pt) = 1.163, VIF(LogFreq) = 1.065. Analysis of deviance (residual deviance = 12,070, intercept-only model deviance = 13,862) with χ²: p < 0.001. Dispersion: φ = 1.025. Nagelkerke R² = 0.220.

The intercept was made to comprise those factor levels which by hypothesis most strongly favor the weak forms over the non-canonical strong forms (SemHum, PtPolySchwa, CaseGen). As expected, it models a preference (β = −1.321) for the weak inflection. The accusative (R = 1.746) and the dative (R = 2.010) both cause a significant tendency toward the strong inflection compared to the genitive, which is modeled by the intercept.21 The same is true if the noun denotes inanimate (R = 2.029) or non-human animate objects (R = 2.392), compared to humans. In general, the phonotactic effects are stronger than case and semantics, with non-final accent in polysyllabic words (R = 12.211) favoring the strong pattern most strongly, followed by monosyllabicity (R = 7.891) and final accent in polysyllabic words (R = 2.882). The stronger influence of the phonotactic properties is expected because they are much closer to being exclusive to weak nouns than humanness of denotation is. The strongest tendency toward the non-canonical strong pattern would thus be predicted for polysyllabic nouns with non-final accent denoting non-human objects in the dative. These results are exactly as expected under the prototype approach by Köpcke (1995) and the paradigmatic analysis by Thieroff (2003), combined with my hypothesis that prototypicality influences the alternation strength.22

21 I prefer to interpret odds ratios instead of coefficients for all effects except the intercept. The odds ratios model the change in odds for each effect, compared to the intercept. As opposed to the coefficients, odds ratios can be interpreted linearly. An odds ratio of 1 corresponds to a coefficient which is 0, i.e., no effect.

22 Notice that I also tried to calculate a GLM with an interaction between Pt and Sem. However, while animateness and non-ultimate accent showed a significant interaction, the Nagelkerke pseudo-R² did not increase, while the standard errors for some other factor levels increased dramatically. I therefore decided to work with the simpler model as reported above.

Finally, a higher token frequency of a weak noun (measured as log-transformed token frequency per million tokens) slightly disfavors its occurrence in strong forms (R = 0.689). I assume that less frequently seen items are less clearly categorized as belonging to the non-canonical weak pattern and tend to be formed more productively according to the strong pattern. In other words, the more frequent a noun is, the less likely it is to be affected by the leveling effects predicted by Thieroff (2003). Again, this effect is a good indicator that we are dealing with a phenomenon that is best explained by usage-based approaches. In Section 4, I will summarize the results, but in Section 3.4, I first provide evidence that the class of weak nouns is open to new members, strongly suggesting a productive prototype.
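The relation between the coefficients and the odds ratios in Table 3.4 (cf. footnote 21) is R = exp(β). The following Python sketch recomputes the odds ratios from the coefficients as printed in the table. Small deviations in the last digits are to be expected because the printed coefficients are rounded to three decimals. The dictionary and function names are chosen for illustration only.

```python
import math

# Coefficients (β) and odds ratios (R) as printed in Table 3.4.
COEFFICIENTS = {
    "CaseAcc": 0.557, "CaseDat": 0.698,
    "SemAni": 0.872, "SemIna": 0.708,
    "PtPolyUlt": 1.059, "PtMono": 2.066, "PtPolyNult": 2.502,
    "LogFreq": -0.373,
}
REPORTED_ODDS_RATIOS = {
    "CaseAcc": 1.746, "CaseDat": 2.010,
    "SemAni": 2.392, "SemIna": 2.029,
    "PtPolyUlt": 2.882, "PtMono": 7.891, "PtPolyNult": 12.211,
    "LogFreq": 0.689,
}


def odds_ratio(beta: float) -> float:
    """Odds ratio corresponding to a logit coefficient."""
    return math.exp(beta)


def combined_odds_ratio(levels) -> float:
    """Odds ratio of a combination of levels relative to the baseline.

    In a model without interactions the coefficients add up on the
    logit scale, so the odds ratios multiply.
    """
    return math.exp(sum(COEFFICIENTS[level] for level in levels))


for name, beta in COEFFICIENTS.items():
    print(f"{name}: exp({beta}) = {odds_ratio(beta):.3f} "
          f"(reported: {REPORTED_ODDS_RATIOS[name]})")

# Dative, animate non-human, polysyllabic with non-final accent,
# compared to the baseline (genitive, human, final schwa).
print(f"{combined_odds_ratio(['CaseDat', 'SemAni', 'PtPolyNult']):.1f}")
```

The combined value (about 58.7) is an odds ratio relative to the baseline within the balanced sample. It is not an estimate of how often strong forms of such nouns actually occur.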

3.4 Further evidence: strong nouns in weak forms

A phenomenon which is perhaps as important as the alternation of weak nouns is the opposite: the alternation of strong nouns toward the weak pattern. If predominantly strong nouns which have properties typical of weak nouns can be found in weak forms, it would be further evidence for a weak noun prototype deeply rooted in the productive (most likely cognitive) system.

             strong    weak
human        53,596    2,274
non-human    58,482      134

Table 3.5. Counts of strong and weak occurrences of 62 strong nouns ending in -or by humanness of denotation. n = 114,486. Considering the size of the sample, significance testing is dispensable. Odds ratio R(weak|human,weak|non-human) = 18.868.

To show that this is indeed the case, I looked at a class of strong loan nouns which occurred very often in weak forms in the bootstrap sample described in Section 3.2, namely masculine Latin loans ending in -or. They are all polysyllabic and have penultimate accent, thus qualifying as good weak nouns phonotactically.23 Interestingly, some of them like Autor (‘author’) denote humans, others like Transistor (‘transistor’) do not. I took 32 human-denoting and 30 non-human-denoting ones and counted their occurrences in strong and weak forms just as I did for the samples used in Sections 3.1 and 3.2. As for humanness, Table 3.5 sums up the results.24 Clearly, results from the earlier sections are confirmed, as human-denoting -or nouns have a much stronger tendency to occur in weak forms. The odds ratio means that the odds of seeing a weak form of a human-denoting noun are 18.868 times higher than the odds of seeing a weak form of a non-human-denoting noun.

23 Note that well integrated (typically Germanic) strong nouns such as Tisch (‘table’) simply have no chance of ever occurring in weak forms (*dem Tisch-en etc.), as any native speaker will confirm.

              strong    weak
accusative    54,556    1,053
dative        36,755      731
genitive      20,767      624

Table 3.6. Counts of strong and weak occurrences of 62 strong nouns ending in -or by case. n = 114,486. Considering the size of the sample, significance testing is dispensable. Pairwise odds ratios (R) are as follows: R(weak|dat, weak|acc) = 1.030, R(weak|gen, weak|acc) = 1.557, R(weak|gen, weak|dat) = 1.511.

The results for strong and weak forms in the three grammatical cases also confirm the results from earlier sections, cf. Table 3.6. The genitive slightly favors the weak pattern compared to the accusative (R = 1.557) and the dative (R = 1.511). A higher alternation strength of original strong nouns toward the weak pattern corresponds perfectly to a stronger resistance against the alternation toward the strong pattern for weak nouns.

24 For this auxiliary study, I chose to use simple monofactorial statistics, mainly because I only look at a small subset of one type of candidate strong nouns, and the phonotactic factors are consequently not available.
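The pairwise odds ratios given with Table 3.6 follow directly from the counts. As a worked illustration, the short Python sketch below recomputes them. The odds of a weak form in a given case are the weak count divided by the strong count, and the odds ratio is the quotient of two such odds. The names `COUNTS`, `odds_weak` and `odds_ratio` are chosen for illustration only.

```python
# Counts of strong and weak forms by case, as given in Table 3.6.
COUNTS = {
    "acc": {"strong": 54556, "weak": 1053},
    "dat": {"strong": 36755, "weak": 731},
    "gen": {"strong": 20767, "weak": 624},
}


def odds_weak(case: str) -> float:
    """Odds of a weak form (weak count / strong count) in one case."""
    counts = COUNTS[case]
    return counts["weak"] / counts["strong"]


def odds_ratio(case_a: str, case_b: str) -> float:
    """Odds of a weak form in case_a relative to case_b."""
    return odds_weak(case_a) / odds_weak(case_b)


for case_a, case_b in (("dat", "acc"), ("gen", "acc"), ("gen", "dat")):
    print(f"R(weak|{case_a}, weak|{case_b}) = {odds_ratio(case_a, case_b):.3f}")
# R(weak|dat, weak|acc) = 1.030
# R(weak|gen, weak|acc) = 1.557
# R(weak|gen, weak|dat) = 1.511
```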

4 Conclusions

With the study presented here, I have achieved both goals set in Section 1. First of all, a web-derived corpus of German was demonstrated to be large enough and to contain enough non-standard material such that even low-frequency alternations can be examined using inferential statistics.

Furthermore, I have shown that the prototypical semantic and phonotactic properties of weak nouns as described by Köpcke (1995) influence the alternation strength of those nouns in exactly the direction suggested by the theory: the more prototypical a noun, the lower the probability that it occurs in the strong pattern. The phonotactic properties, which are more exclusive to weak nouns than the semantic properties, proved to have a stronger effect than the semantic properties. By showing that higher frequencies of nouns tend to prevent the alternation, I produced even more evidence for the plausibility of a usage-based interpretation. Moreover, the significant influence of case on the alternation strength is entirely expected under a usage-based view because the masculine genitive is the only case which is always marked (dominantly by -es, in the weak pattern by -en). Hence, the weak inflection pattern is more compatible with the strong genitive than with the strong accusative and dative (strong: no marking at all, weak: -en). This accounts for the weak genitive’s reluctance to become strong and the strong genitive’s increased affinity to the weak pattern compared to the other two cases.

Finally, I showed that strong nouns with properties which are prototypical of weak nouns do in fact occur in weak forms. Thus, I provided evidence which shows that the weak noun prototype is rooted in the productive system. Further work should include a more detailed look at those nouns. Also, if we want to interpret corpus findings in cognitive terms, an experimental cross-examination of the effects observed in the corpus data suggests itself.

Roland Schäfer German Grammar, Freie Universität Berlin Habelschwerdter Allee 45 14195 Berlin, Germany


Appendix 1: Weak nouns and their absolute frequencies in the full sample25

Abiturient (398/0.013±0.011), Abonnent (706/0.004±0.005), Absolvent

(479/0.013±0.01), Adept (305/0.039±0.022), Adjutant (140/0.021±0.024), Adressat

(585/0.024±0.012), Adventist (66/0±0), Advokat (262/0.038±0.023), Affe

(3579/0.015±0.004), Afghane (254/0±0), Agent (3214/0.059±0.008), Ahn

(315/0.162±0.041), Akrobat (68/0.015±0.029), Aktivist (613/0.007±0.006), Alchimist

(193/0.005±0.01), Alemanne (83/0.024±0.033), Ammonit (77/0.013±0.025), Analphabet

(196/0.005±0.01), Analyst (310/0.032±0.02), Anarchist (326/0.009±0.01), Anästhesist

(351/0.011±0.011), Anatom (45/0.267±0.129), Anglist (9/0±0), Antagonist

(184/0.016±0.018), Anthropologe (117/0.017±0.024), Antichrist (60/0.217±0.104), Apologet (37/0.027±0.052), Archäologe (598/0±0), Architekt (5168/0.023±0.004), Archont

(14/0.071±0.135), Argonaut (7/0±0), Aristokrat (198/0±0), Artist (297/0.444±0.056),

Asiate (419/0.002±0.005), Asket (230/0.004±0.008), Aspirant (62/0±0), Assistent

(2489/0.016±0.005), Asteroid (1543/0.029±0.008), Ästhet (60/0.017±0.032), Astronaut

(556/0.016±0.01), Astronom (359/0.081±0.028), Aszendent (69/0.058±0.055), Atheist

(921/0.014±0.008), Athlet (1070/0.014±0.007), Autist (331/0.003±0.006), Autokrat (59/0.034±0.046), Automat (3674/0.05±0.007), Baptist (34/0.029±0.057), Bär (5551/0.089±0.008), Barbare (571/0.012±0.009), Barde (573/0.01±0.008), Baske (23/0±0), Bassist (1024/0.018±0.008), Beduine (214/0±0), Biograph (72/0±0), Biologe

(581/0.003±0.005), Böhme (24/0.292±0.182), Bolide (337/0±0), Borusse (77/0±0), Bote

(2824/0.025±0.006), Brahmane (283/0±0), Brillant (1378/0.065±0.013), Brite

25 For each noun, the total count in the full sample and the proportion of strong forms with half a 95% confidence interval (rounded to three decimal digits) are specified.

(628/0.005±0.005), Bube (724/0.007±0.006), Buchstabe (5818/0.1±0.008), Bulgare

(93/0±0), Bulle (1261/0.016±0.007), Bürge (660/0.015±0.009), Bürokrat

(150/0.013±0.018), Bursche (1159/0.004±0.004), Cellist (138/0±0), Chaot

(189/0.074±0.037), Chilene (83/0±0), Chinese (1783/0.022±0.007), Chirurg

(2015/0.051±0.01), Choreograph (107/0.056±0.044), Christ (5224/0.025±0.004),

Chronist (249/0.008±0.011), Chronograph (185/0.103±0.044), Dadaist (9/0.111±0.205),

Däne (346/0±0), Delinquent (137/0.007±0.014), Demagoge (163/0±0), Demiurg

(65/0.046±0.051), Demokrat (491/0.006±0.007), Demonstrant (593/0.017±0.01),

Dendrit (35/0.171±0.125), Denunziant (95/0.011±0.02), Depp (722/0.191±0.029),

Despot (446/0.018±0.012), Dilettant (132/0±0), Diözesan (45/0±0), Dirigent

(1298/0.012±0.006), Disponent (146/0.041±0.032), Dissident (192/0±0), Doge

(34/0.147±0.119), Dozent (1646/0.021±0.007), Dramaturg (88/0.08±0.056), Drogist

(57/0.035±0.048), Druide (868/0.005±0.004), Egoist (231/0.026±0.02), Egomane

(126/0.04±0.034), Elefant (6214/0.028±0.004), Elektrostat (59/0.102±0.077), Emigrant

(238/0.004±0.008), Epigone (29/0±0), Eremit (235/0.051±0.028), Erotomane (10/0±0),

Essayist (19/0±0), Este (35/0.029±0.055), Ethnologe (87/0±0), Eunuch

(122/0.025±0.028), Evangelist (190/0.005±0.01), Exeget (21/0±0), Exorzist (252/0.016±0.015), Exot (530/0.051±0.019), Experte (9299/0.002±0.001), Exponent (216/0.037±0.025), Extremist (180/0.006±0.011), Fabrikant (331/0.018±0.014), Falke

(920/0.021±0.009), Faschist (251/0.016±0.016), Favorit (2559/0.036±0.007),

Feuilletonist (31/0±0), Filialist (57/0.018±0.034), Finalist (100/0.06±0.046), Fink

(51/0.373±0.133), Finne (227/0.009±0.012), Florist (145/0.014±0.019), Flötist (53/0±0),

Fotograf (4797/0.039±0.006), Franke (478/0.027±0.015), Franzose (2483/0.005±0.003),

Friede (8063/0.13±0.007), Friese (435/0.08±0.026), Fundamentalist (263/0.004±0.007),

Fürst (2333/0.029±0.007), Galerist (184/0.005±0.011), Ganove (230/0.013±0.015),

Garant (342/0.345±0.05), Gardist (209/0.019±0.019), Gatte (374/0.011±0.01), Geck

(59/0.441±0.127), Gedanke (19068/0.065±0.004), Gehilfe (573/0.002±0.003), Gendarm

(171/0.193±0.059), Generalist (89/0.034±0.038), Genosse (779/0.001±0.002), Geograph

(82/0.085±0.06), Geologe (306/0±0), Gepard (322/0.289±0.05), Germane (154/0±0),

Germanist (185/0±0), Gigant (403/0.037±0.018), Gitarrist (2067/0.014±0.005), Glaube

(11857/0.149±0.006), Götze (635/0.09±0.022), Graf (1589/0.056±0.011), Graph

(1117/0.132±0.02), Gratulant (18/0±0), Grieche (1091/0.003±0.003), Grossist (31/0±0),

Gymnasiast (303/0.023±0.017), Gynäkologe (571/0.007±0.007), Hanseat

(36/0.028±0.054), Hase (4683/0.019±0.004), Havarist (39/0±0), Heide

(1400/0.008±0.005), Held (13473/0.042±0.003), Hermaphrodit (50/0.06±0.066), Herr

(3461/0.146±0.012), Hesse (121/0.091±0.051), Hirte (3067/0.002±0.001), Humanist

(172/0.006±0.011), Humorist (73/0±0), Hüne (125/0±0), Husar (74/0.068±0.057),

Hydrant (510/0.057±0.02), Idealist (280/0.018±0.016), Ideologe (83/0±0), Idiot

(2180/0.027±0.007), Ignorant (249/0.004±0.008), Illuminat (51/0.275±0.122),

Immigrant (170/0.047±0.032), Imperialist (23/0.043±0.083), Individualist (110/0.045±0.039), Informant (1095/0.016±0.007), Insasse (341/0.006±0.008), Inserent (46/0±0), Instrumentalist (60/0±0), Intendant (220/0.009±0.012), Interessent

(2772/0.006±0.003), Internist (795/0.016±0.009), Interpret (846/0.025±0.01), Ire

(362/0.011±0.011), Israelit (126/0.008±0.016), Jesuit (252/0.008±0.011), Journalist

(6904/0.01±0.002), Jude (4558/0.005±0.002), Junge (79464/0.006±0), Jurist

(2489/0.01±0.004), Kabarettist (1549/0.001±0.002), Kalif (158/0.032±0.027),

Kalligraph (27/0.037±0.071), Kamerad (3704/0.017±0.004), Kandidat

(8728/0.012±0.002), Kannibale (148/0.014±0.019), Kapitalist (414/0.01±0.009),

Karikaturist (107/0.009±0.018), Kartograph (54/0.037±0.05), Kasache (43/0±0),

Katalane (28/0±0), Katholik (738/0.015±0.009), Kaukase (49/0±0), Kelte (65/0±0),

Kentaur (45/0.244±0.126), Kirgise (22/0±0), Kleptomane (15/0±0), Klient

(1700/0.011±0.005), Klingone (655/0.002±0.003), Knabe (2249/0.008±0.004), Knappe

(15073/0.002±0.001), Koeffizient (224/0.031±0.023), Kollege (25536/0.009±0.001),

Kolonist (109/0.018±0.025), Kolumnist (86/0.035±0.039), Komet (1669/0.023±0.007),

Kommunist (808/0.01±0.007), Komponist (2636/0.004±0.002), Kongolese (38/0±0),

Konkurrent (6921/0.01±0.002), Konsonant (427/0.089±0.027), Konsument

(669/0.012±0.008), Kontrahent (890/0.003±0.004), Konvertit (116/0.009±0.017),

Kopist (66/0±0), Korrespondent (323/0.009±0.01), Korse (17/0±0), Kosmonaut

(61/0.049±0.054), Krake (229/0.048±0.028), Kroate (182/0±0), Kryostat

(41/0.146±0.108), Kunde (26576/0.007±0.001), Kurde (241/0±0), Laie

(5673/0.006±0.002), Leopard (929/0.238±0.027), Lette (30/0.033±0.064), Libanese

(147/0±0), Ligist (142/0.042±0.033), Linguist (114/0±0), Literat (284/0.004±0.007),

Lithograph (21/0±0), Liturg (7/0.143±0.259), Lobbyist (220/0.023±0.02), Lombarde (5/0.2±0.351), Lotse (399/0±0), Löwe (4823/0.017±0.004), Lymphozyt (10/0.3±0.284), Magnet (3761/0.147±0.011), Marxist (186/0±0), Maschinist (149/0.054±0.036),

Materialist (85/0.012±0.023), Matrose (649/0.005±0.005), Mensch (221210/0.017±0),

Meteorit (1026/0.042±0.012), Mime (54/0.13±0.09), Ministrant (112/0.027±0.03),

Minotaur (57/0.263±0.114), Misanthrop (87/0.08±0.057), Mohr (123/0.203±0.071),

Monarch (687/0.051±0.016), Monolith (379/0.185±0.039), Monopolist

(344/0.012±0.011), Moralist (178/0.006±0.011), Mutant (364/0.049±0.022), Name

(107348/0.028±0.001), Narr (2235/0.029±0.007), Nationalist (178/0.006±0.011), Neffe

(795/0.013±0.008), Nekromant (350/0.037±0.02), Neophyt (33/0.182±0.132),

Neurologe (1725/0.005±0.003), Nomade (157/0.013±0.018), Obelisk (640/0.188±0.03),

Obrist (42/0.071±0.078), Ochse (1358/0.012±0.006), Ökonom (287/0.101±0.035),

Oligarch (69/0.014±0.028), Opponent (74/0.081±0.062), Opportunist (136/0±0),

Optimist (391/0.02±0.014), Organist (390/0.008±0.009), Orientale (131/0.008±0.015),

Ostheopath (51/0.039±0.053), Pädagoge (629/0±0), Päderast (43/0±0), Page

(350/0.417±0.052), Paragraf (321/0.019±0.015), Parasit (679/0.047±0.016), Partisan

(180/0.089±0.042), Pate (1175/0.004±0.004), Pathologe (408/0±0), Patient

(18232/0.01±0.002), Patriot (297/0.114±0.036), Pazifist (169/0.024±0.023),

Perfektionist (150/0.013±0.018), Pessimist (211/0.047±0.029), Pfaffe

(146/0.007±0.013), Phantast (101/0.02±0.027), Pharmazeut (49/0±0), Philanthrop

(28/0.036±0.069), Philologe (116/0±0), Philosoph (2330/0.011±0.004), Pianist

(1124/0.017±0.008), Pilot (4226/0.048±0.006), Planet (19683/0.018±0.002), Planetoid

(166/0.012±0.017), Poet (428/0.023±0.014), Pole (1013/0.139±0.021), Politologe

(88/0±0), Polizist (10166/0.018±0.003), Polyp (219/0.183±0.051), Populist (117/0.009±0.017), Portugiese (189/0.005±0.01), Posaunist (56/0±0), Potentat (91/0.022±0.03), Präfekt (107/0.028±0.031), Praktikant (1542/0.034±0.009), Prälat

(117/0.009±0.017), Präsident (6013/0.016±0.003), Prinz (3152/0.085±0.01), Proband

(491/0.018±0.012), Produzent (2426/0.012±0.004), Prokurist (341/0.003±0.006), Prolet

(99/0.01±0.02), Propagandist (50/0.02±0.039), Prophet (3291/0.022±0.005), Proselyt

(29/0±0), Protagonist (1103/0.011±0.006), Protestant (267/0.004±0.007), Protokollant

(80/0.012±0.024), Psychopath (873/0.016±0.008), Publizist (156/0.006±0.012), Purist

(42/0.024±0.046), Pyromane (51/0.02±0.038), Quadrant (285/0.021±0.017), Querulant

(142/0.007±0.014), Quotient (247/0.077±0.033), Rabe (1169/0.022±0.008), Radiologe

(273/0.007±0.01), Rassist (378/0.042±0.02), Realist (210/0.024±0.021), Rebell

(587/0.143±0.028), Referent (1608/0.021±0.007), Regent (335/0.024±0.016), Rekrut

(302/0.013±0.013), Renegat (51/0.02±0.038), Reservist (93/0.011±0.021), Revisionist

(30/0±0), Rezensent (325/0.009±0.01), Rezipient (128/0.008±0.015), Riese

(15717/0.002±0.001), Rivale (1007/0.001±0.002), Romanist (18/0±0), Rüde

(4923/0.009±0.003), Rumäne (224/0.004±0.009), Russe (1244/0.006±0.004), Sachse

(240/0.042±0.025), Sadist (178/0.011±0.016), Same (998/0.063±0.015), Sarde

(15/0.067±0.126), Satellit (2631/0.037±0.007), Schamane (1589/0.01±0.005), Schenk

(247/0.065±0.031), Scherge (100/0±0), Schiit (36/0±0), Schimpanse (603/0.007±0.006),

Schöffe (191/0±0), Schotte (387/0.034±0.018), Schultheiß (120/0.367±0.086), Schurke

(1202/0.015±0.007), Schwabe (408/0±0), Schwede (635/0.006±0.006), Seismograph

(156/0.064±0.038), Semit (7/0±0), Senegalese (37/0±0), Separatist (22/0.136±0.143),

Serbe (215/0.005±0.009), Sklave (2664/0.005±0.003), Slawe (21/0±0), Slowake

(69/0±0), Slowene (58/0±0), Solipsist (18/0±0), Solist (336/0.015±0.013), Sophist (52/0.019±0.037), Sorbe (12/0±0), Sozialist (263/0.027±0.02), Soziologe (303/0±0), Spekulant (152/0.02±0.022), Spezialist (8607/0.011±0.002), Statist (165/0.055±0.035),

Stipendiat (115/0.139±0.063), Stratege (222/0±0), Student (6399/0.03±0.004), Stylist

(153/0.033±0.028), Sudanese (38/0±0), Sunnit (31/0.032±0.062), Sympathisant

(113/0.053±0.041), Szenarist (11/0±0), Technokrat (38/0±0), Telefonist (20/0.1±0.132),

Telegraph (83/0.024±0.033), Terrorist (1899/0.034±0.008), Theologe

(1414/0.003±0.003), Therapeut (6623/0.013±0.003), Tomograph (31/0±0), Tourist

(1868/0.029±0.008), Transvestit (255/0.043±0.025), Trilith (22/0.136±0.143), Trotzkist

(27/0±0), Tscheche (263/0.008±0.01), Tscherkesse (8/0±0), Türke (2235/0.007±0.004),

Tyrann (1165/0.078±0.015), Untertan (291/0.474±0.057), Urologe (1206/0.009±0.005),

Vasall (280/0.107±0.036), Veteran (731/0.248±0.031), Virtuose (520/0±0), Waise

(74/0.014±0.026), Welsche (26/0±0), Westfale (100/0.04±0.038), Wille

(11581/0.132±0.006), Zar (125/0.224±0.073), Zentaur (139/0.266±0.074), Zeuge

(11613/0.005±0.001), Zionist (57/0±0), Zivilist (604/0.008±0.007), Zoologe

(87/0.011±0.022), Zyklop (229/0.031±0.022), Aeronaut (17/0.824±0.181), Bauer

(13842/0.068±0.004), Bayer (1232/0.119±0.018), Centaur (50/0.6±0.136), Doktorand

(319/0.031±0.019), Infant (8/0.375±0.336), Infanterist (234/0.021±0.018), Resident

(171/0.567±0.074), Steinmetz (491/0.839±0.032), Titan (528/0.57±0.042), Zeolith

(30/0.6±0.175)
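The entries in this appendix give a proportion of strong forms together with half the width of a 95% confidence interval. The method used to compute the interval is not stated. The sketch below assumes the normal approximation (Wald interval), which is at least consistent with the listed values, including the entries of the form 0±0. The function name is hypothetical, and the strong count of 12 for Anatom is inferred from the listed proportion (0.267 of 45), not taken from the source.

```python
import math


def strong_proportion_with_half_ci(n_strong: int, n_total: int, z: float = 1.96):
    """Proportion of strong forms and half-width of a Wald interval.

    z = 1.96 corresponds to a 95% interval under the normal
    approximation (an assumption, see the text above).
    """
    if n_total <= 0 or not 0 <= n_strong <= n_total:
        raise ValueError("counts must satisfy 0 <= n_strong <= n_total, n_total > 0")
    p = n_strong / n_total
    half_width = z * math.sqrt(p * (1 - p) / n_total)
    return p, half_width


# Anatom is listed as (45/0.267±0.129); 12 strong forms is an inferred count.
p, half = strong_proportion_with_half_ci(12, 45)
print(f"{p:.3f} ± {half:.3f}")

# Nouns without any strong forms get a zero-width interval, e.g. Adventist (66/0±0).
print(strong_proportion_with_half_ci(0, 66))
```

Under this approximation the interval collapses to zero width whenever no strong forms are observed, so entries of the form 0±0 should not be read as evidence that strong forms of the noun are impossible.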

References

Augst, Gerhard. 1979. Neuere Forschungen zur Substantivflexion. Zeitschrift für germanistische Linguistik 7(3). 220–232.

Baroni, Marco, Silvia Bernardini, Adriano Ferraresi & Eros Zanchetta. 2009. The WaCky Wide Web: A Collection of Very Large Linguistically Processed Web-Crawled Corpora. Language Resources and Evaluation 43(3). 209–226.

Biemann, Chris, Felix Bildhauer, Stefan Evert, Dirk Goldhahn, Uwe Quasthoff, Roland Schäfer, Johannes Simon, Leonard Swiezinski & Torsten Zesch. 2013. Scalable Construction of High-Quality Web Corpora. Journal for Language Technology and Computational Linguistics 28(2). 23–60.

Bittner, Dagmar. 2003. Von starken Feminina und schwachen Maskulina: Die neuhochdeutsche Substantivflexion – Eine Systemanalyse im Rahmen der natürlichen Morphologie. Berlin: ZAS.

Booij, Geert. 2010. Construction Morphology. Oxford: Oxford University Press.

Demske, Ulrike. 2001. Merkmale und Relationen – Diachrone Studien zur Nominalphrase des Deutschen. Berlin: De Gruyter.

Eisenberg, Peter. 2000. Das vierte Genus? Über die Kategorisation der deutschen Substantive. In Andreas Bittner, Dagmar Bittner & Klaus-Michael Köpcke (eds.), Angemessene Strukturen: Systemorganisation in Phonologie, Morphologie und Syntax, 91–105. Hildesheim, Zürich, New York: Olms.

Eisenberg, Peter. 2013. Grundriss der deutschen Grammatik: Das Wort, edited by Nanna Fuhrhop. Stuttgart: Metzler.

Evert, Stefan & Andrew Hardie. 2011. Twenty-first century Corpus Workbench: Updating a query architecture for the new millennium. Proceedings of the Corpus Linguistics 2011 Conference. Birmingham: University of Birmingham.

Fahrmeir, Ludwig, Thomas Kneib, Stefan Lang & Brian Marx. 2013. Regression – Models, Methods, and Application. Berlin: Springer.

Fox, John & Georges Monette. 1992. Generalized collinearity diagnostics. Journal of the American Statistical Association 87. 178–183.

Gries, Stefan T. 2010. Corpus Linguistics and Theoretical Linguistics – A Love-Hate Relationship? Not Necessarily. International Journal of Corpus Linguistics 15(3). 327–343.

Joeres, Rolf. 1996. "Der Friede" oder "der Frieden": ein Normproblem der Substantivflexion. Sprachwissenschaft 21. 301–336.

King, Gary & Langche Zeng. 2001. Logistic Regression in Rare Events Data. Political Analysis 9(2). 137–163.

Köpcke, Klaus-Michael. 1988. Schemas in German Plural Formation. Lingua 74. 303–335.

Köpcke, Klaus-Michael. 1995. Die Klassifikation der schwachen Maskulina in der deutschen Gegenwartssprache – Ein Beispiel für die Leistungsfähigkeit der Prototypentheorie. Zeitschrift für Sprachwissenschaft 14(2). 159–180.

Köpcke, Klaus-Michael. 1998. The acquisition of plural marking in English and German revisited: schemata versus rules. Journal of Child Language 25. 293–319.

Köpcke, Klaus-Michael. 2000. Chaos und Ordnung – zur semantischen Remotivierung einer Deklinationsklasse im Übergang vom Mhd. zum Nhd. In Andreas Bittner, Dagmar Bittner & Klaus-Michael Köpcke (eds.), Angemessene Strukturen: Systemorganisation in Phonologie, Morphologie und Syntax, 107–122. Hildesheim, Zürich, New York: Olms.

Krech, Eva-Maria, Eberhard Stock, Ursula Hirschfeld & Lutz Christian Anders (eds.). 2009. Deutsches Aussprachewörterbuch. Berlin: De Gruyter.

Maxwell, Scott E. & Harold D. Delaney. 2004. Designing experiments and analyzing data: a model comparison perspective. Mahwah, New Jersey, London: Lawrence Erlbaum Associates.

Müller, Gereon. 2002. Syntaktisch determinierter Kasuswegfall in der deutschen DP. Linguistische Berichte 189. 89–114.

Owen, Matt, Kosuke Imai, Gary King & Olivia Lau. 2013. Zelig: Everyone's Statistical Software. R package version 4.2-1.

R Core Team. 2014. R: A Language and Environment for Statistical Computing. Vienna: R Foundation for Statistical Computing.

Rousseeuw, Peter J., Ida Ruts & John W. Tukey. 1999. The Bagplot: A Bivariate Boxplot. The American Statistician 53(4). 382–387.

Schäfer, Roland & Ulrike Sayatz. 2014, in print. Die Kurzformen des Indefinitartikels im Deutschen. Zeitschrift für Sprachwissenschaft 33(2).

Schäfer, Roland & Felix Bildhauer. 2012. Building large corpora from the web using a new efficient tool chain. In Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Jan Odijk & Stelios Piperidis (eds.), Eighth International Conference on Language Resources and Evaluation, 486–493. Istanbul: ELRA.

Schäfer, Roland & Felix Bildhauer. 2013. Web Corpus Construction. San Francisco: Morgan & Claypool.

Schäfer, Roland. 2014, submitted. Einführung in die grammatische Beschreibung des Deutschen.

Sternefeld, Wolfgang. 2004. Feature Checking, Case, and Agreement in German DPs. In Gereon Müller, Lutz Gunkel & Gisela Zifonun (eds.), Explorations in Nominal Inflection, 269–299. Berlin: Mouton de Gruyter.

Thieroff, Rolf. 2003. Die Bedienung des Automatens durch den Mensch. Deklination der schwachen Maskulina als Zweifelsfall. Linguistik Online 16(4). 105–117.

Van Goethem, Kristel & Philippe Hiligsmann. 2014. When Two Paths Converge: Debonding and Clipping of Dutch “Reuze”. Journal of Germanic Linguistics 26(1). 31–64.

Wurzel, Wolfgang U. 1985. Deutsch “der Funke” zu “der Funken”: Ein Fall für die natürliche Morphologie. Linguistische Studien des Zentralinstituts für Sprachwissenschaft der Akademie der Wissenschaften der DDR A(127). 129–145.

Zeldes, Amir. 2002. Productivity in Argument Selection. Berlin, Boston: De Gruyter.

Zuur, Alain F., Elena N. Ieno, Neil Walker, Anatoly A. Saveliev & Graham M. Smith. 2009. Mixed Effects Models and Extensions in Ecology with R. Berlin: Springer.
