
Coding practices used in the project Optimal Typology of Determiner Phrases

Gregory Garretson

This document is a section of the coding manual used by the project Optimal Typology of Determiner Phrases, based at Boston University. It is being made available due to popular demand. If you have questions about the rest of the manual, or the contents of this section, please contact Gregory Garretson at [email protected] or [email protected].

Please cite this document as follows:

Garretson, Gregory. 2004. Coding practices used in the project Optimal Typology of Determiner Phrases. Unpublished manuscript, Boston University, Boston, MA. http://npcorpus.bu.edu/html/documentation/index.html.

Contents

5.1 General practices
5.2 Construction type
5.3 Which examples to count?
5.4 Expression type
5.5
5.6
5.7 Weight

5.0 Applying tags to the examples

This section of the coding manual details the policies we have adopted in adding tags to the examples, once they have been bracketed. Because of the wondrous variety of language, coding a corpus is not nearly as straightforward as we might like. This section will help you to make the often difficult decisions about which tags to apply when.

Section 5.1 gives general guidance for coding; Sections 5.2 and on discuss various classes of tags in detail.

Starting in Section 5.2, each section will begin with a list of the tags discussed in that section, so you can easily find the discussion of a given tag by looking at these lines. They look like this:

[Tags discussed in this section: INCL, EXCL, PND, PWD, CMPD, IDIOM, NAME, PART, PART2, DINS, SORT]

Please remember that in addition to the discussion in this section, there is a brief description of each tag given in Section 2.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 5.1 General practices

This section gives some general advice for coding.

------5.1.1 When you are not sure

It will frequently be the case that you come across an example you feel you will need to discuss with others. In such cases, there are two procedures you can follow.

If you come across a very good example of something or other, or an example that you would like to discuss with others, or that you want to mark for any other reason, use the tag INTERESTING. This flags the example in a harmless way as something to look at again. Then, you can search for all examples with that tag to create a list of examples to review.

If you are trying to code an example for, say, animacy, and you are not sure which tag to apply, rather than choosing one without confidence, use the "other" tag; in this case, it would be OANIM_H or OANIM_M. To date, there are "other" tags for expression type, definiteness, and animacy:

OET_H, OET_M
ODEF_H, ODEF_M
OANIM_H, OANIM_M

To review those examples you were uncertain about, you can simply search for all examples with "other" tags.

The rationale for using "other" tags liberally is simple: It is much easier to reexamine all examples coded with OANIM_H and OANIM_M than to reexamine all the examples! Remember: When in doubt, choose "other"!
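As a rough illustration of such a search, here is a minimal Python sketch. It is not one of the project's programs. It assumes that tags are stored in blocks of the form <#TAG TAG ...#>, as in the illustration in Section 5.3.1, and that the coded file can be read as a list of text lines; the function names and the sample lines are hypothetical.

```python
import re

# The six "other" tags listed above.
OTHER_TAGS = {"OET_H", "OET_M", "ODEF_H", "ODEF_M", "OANIM_H", "OANIM_M"}

# Assumption: tags sit in blocks of the form <#TAG TAG ...#>.
TAG_BLOCK = re.compile(r"<#(.*?)#>")


def tags_in(line):
    """Return the set of all tags found in the tag blocks of one line."""
    found = set()
    for block in TAG_BLOCK.findall(line):
        found.update(block.split())
    return found


def lines_to_review(lines, wanted=OTHER_TAGS | {"INTERESTING"}):
    """Return (line number, line) pairs for lines carrying any wanted tag."""
    return [(number, line)
            for number, line in enumerate(lines, start=1)
            if tags_in(line) & wanted]


# Hypothetical sample lines, invented for this sketch.
sample = [
    '<"the roof of the house"><#OFX INCL OANIM_H#>',
    '<"her### book"><#XS INCL#>',
    '<"the end of the day"><#OFX INTERESTING#>',
]
for number, line in lines_to_review(sample):
    print(number, line)
```

By default the sketch collects both the "other" tags and INTERESTING, so one pass yields the full list of examples to look at again.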

------5.1.2 The importance of saving your work

Because the coder does not make backups *while* you work, you should get into the habit of quitting the program fairly frequently (every half-hour or so is good), and then opening it again on the resulting file. This will make sure that you don't lose too much data in case your computer crashes, you lose your Internet connection, etc. See Sections 3.2 and 3.3 for more information about working with files.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 5.2 Construction type

[Tags discussed in this section: XS, OFX, BOAD]

The easiest distinction to make is that between examples that receive the OFX tag and examples that receive the XS tag. Examples with "of_PREP" are coded as OFX, and examples with a prenominal possessor (be it a noun with "'s", a possessive pronoun such as "his", or another pronoun such as "whose") are coded as XS. This is done automatically by the program autocoder_ct, so you will not need to do it.

Once this is done, there are several subordinate "construction type" tags that may need to be applied. Some of these can be added by autocoders, and others cannot. Section 5.3, "Which examples to count?" describes most of these tags.

Of the subclasses of construction types identified so far, all result in the example being excluded from further coding, except for two: PWD and BOAD. The former is discussed in the next section, and the latter is discussed here.

------5.2.1 "Boss of All Dogs" examples

The tag BOAD stands for "Boss Of All Dogs", which, although cryptic, is our favorite example from this class and so, for lack of a better name, was chosen to metonymically represent the class. The BOAD examples have a very specific form: They are all OFX examples, and they all have in the Y position a singular count noun that is missing a (normally obligatory) determiner. They are found only in predicative and appositive positions. Here are some examples:

the Rev. J. D. Wickham , <"headmaster of Burr and Burton Seminary">
L. C. Orvis , <"manager of the Western Union Telegraph Company">
Amos C. Barstow , <"ex-mayor of Providence">
was chosen <"president of the meeting">
William Hartman Woodin , was <"Secretary of the Treasury">
Oliver Herford , artist , author , and <"foe of stupidity">
Rob Roy remained <"boss of all the dogs">

There are also tendencies within the class that are not prerequisites for inclusion. First, the Y tends to be a noun representing a role in some type of institution, and the X tends to represent that institution. Note, however, that this is not always the case. Second, the BOAD example usually occurs as an appositive, as seen in several of the examples above.

BOAD examples are odd in that they are one of a few classes of noun phrases in which singular count nouns, which normally are required to have a determiner, are able to escape this restriction. We wish to keep track of these examples and see how they alternate with the XS form. Therefore, we do not exclude them from the set of counted examples.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 5.3 Which examples to count?

[Tags discussed in this section: INCL, EXCL, PND, PWD, CMPD, IDIOM, NAME, PART, PART2, DINS, SORT]

All of the following assumes that you are familiar with the discussion in Section 4 of which examples to include and which ones to exclude in our corpus of examples. As explained there, only of-examples of the form "NP of NP" are included. "Non-coded examples" consisting of an of-phrase not following a noun (e.g., "sing of", "careful of") are excluded outright from our corpus of examples.

However, even among the examples that are "NP of NP", there are some that we do not wish to include in our statistical analysis of linguistic variation. This is because they are not tokens that could have been realized in a different way--that is, they were not potential loci of variation at the time of speaking. Thus "The Man of La Mancha" could not be said "La Mancha's Man", if referring to a proper name.

We include these examples in the corpus because while it is true that these "frozen" nominals could not have been produced differently by the speaker or writer, it is also true that, viewed diachronically, they could have been "frozen" in a different form. Thus, while we have proper names like "Schindler's List" and "The Double Life of Veronique", such names could frequently have been rendered differently (cf. "Veronique's Double Life", which, incidentally, is how the title comes out in Swedish). The same may apply to expressions that are "frozen" grammatically, or idioms ("for the love of God" vs. "for Pete's sake", etc.).

Because we may wish to look at these examples in the future, we are not excluding them from our corpus. Rather, we are making a distinction between the examples we wish to count at this stage and the ones we do not by applying the tags "INCL" and "EXCL". The program "counter" counts only those examples marked with "INCL". As it happens, we do not have to apply these tags ourselves; they are automatically added on the basis of prior tags added by a human coder, which mark examples as belonging to certain classes of constructions.

ALL EXAMPLES TO WHICH "EXCL" HAS BEEN APPLIED DO NOT NEED TO BE CODED FOR ANYTHING FURTHER! Because these constructions do not allow variation, we will not make further use of these examples in the present study. Therefore, we don't need to code them for definiteness, animacy, etc. Note that the autocoder will still do this, and we should leave those tags; however, you do not need to check them.

To reiterate: you should not apply the tags EXCL and INCL; this is done automatically after you apply one of the tags discussed below.

As mentioned in Section 5.2, all examples are automatically coded with the tags OFX or XS. Some examples will receive no further "construction type" tags. However, some examples will need to be tagged further with one of the following:

PART ("partitive")
PART2 ("partitive 2")
DINS ("described instance")
SORT ("sort")
PWD ("preposition with determiner")
PND ("preposition, no determiner")
CMPD ("compound")
IDIOM ("idiom")
NAME ("name")

The tag "EXCL" is applied by autocoder_rev when one of the following is found on an example: PND, CMPD, IDIOM, NAME, PART, PART2, DINS, SORT, or NREV.

The tag "INCL" is applied by the autocoder when "PWD" is found, or when none of these tags is found.
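The rule just stated can be sketched in a few lines of Python. This is only an illustration of the logic, not the code of autocoder_rev; the function name is hypothetical, and an example's tags are assumed to be available as a list of strings.

```python
# Tags whose presence leads to EXCL, as listed above.
EXCLUDING_TAGS = {"PND", "CMPD", "IDIOM", "NAME",
                  "PART", "PART2", "DINS", "SORT", "NREV"}


def inclusion_tag(tags):
    """Return "EXCL" if any excluding tag is present, otherwise "INCL".

    PWD is not an excluding tag, so an example tagged PWD (or carrying
    none of the tags above) comes out as "INCL".
    """
    return "EXCL" if EXCLUDING_TAGS & set(tags) else "INCL"


print(inclusion_tag(["OFX", "SORT"]))  # an excluded class
print(inclusion_tag(["OFX", "PWD"]))   # PWD examples are counted
print(inclusion_tag(["XS"]))           # no further construction type tag
```

Because the outcome depends only on the tags already present, the human coder never needs to choose between INCL and EXCL directly.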

------5.3.1 PART, PART2, DINS, SORT

The reason we apply the tags PART, PART2, DINS, and SORT is that we have identified these as syntactic or semantic classes of OFX examples that allow no variation--they are always NREV (non-reversible). Therefore, once we have identified examples as belonging to one of these classes, we can automatically tag them as EXCL. Then (as explained above), we do not need to add any further tags to these examples.

The good news is that the majority of PART and PART2 examples can be identified and tagged by the autocoder_part. The bad news is that the DINS and SORT examples are not yet autocoded, and must be identified and tagged by a human.

Here is a definition of each of these four classes:

PART, standing for "partitive", is used to tag any example of the form "Y of X" in which Y is actually a headless quantifier (such as "much") and X is a *definite* NP expressing the set on which the quantifier operates:

most of them
none of this money
Some of the defendants
much of the week-end
eight of the 10 cases
one of the rescue trucks
One of the first things he would do
Fifty-three of the 150 representatives
each of the workers now paying such taxes
about half of the people in the country
one-fourth of the first year students

PART2 is used to tag partitive-like examples that do not fit the definition of "true" partitives, in that X is *indefinite*, rather than definite. Otherwise, these are identical to examples coded as PART:

one of a number of recommendations
one of two alternative courses
more of such care

DINS stands for "described instance", and is used to code examples that fall into the semantic categories of CONTAINER, MEASURE, CLASSIFIER, CONFIGURATIVE, and, possibly, CONSTITUTIVE. All of these examples are nonreversible, and therefore we want them all to be excluded from counts.

CONTAINER: a cup of coffee
MEASURE: a couple of clicks
CLASSIFIER: a pair of trousers
CONFIGURATION: pillows of flesh
CONSTITUTIVE: woods of Scots pine

SORT represents a well-defined category of examples in which Y is a noun like "sort", "kind", or "type", and X is the set from which the "sort" is drawn.

the same type of object
this type of decoration
a sort of Emperor Maximilian beard and mustache
this sort of philosophical study of sociology
the crassest kind of materialism
some kind of trick Budd had thought up

Note that you will still find occasional examples such as "he's not my sort", which are *not* SORT examples, and so are included. Note that in a double example such as

He's not <"<"my### sort of### person"><#XS INCL#>"><#OFX SORT EXCL#> .

the XS example would be retained, while the OFX example would be tagged as SORT.

Concluding remark on this section: Remember that all of the above classes are non-reversible: you will never get an "X's Y" form with the same meaning.

------5.3.2 PWD, PND

Two closely related classes, PWD and PND, have as yet only been observed occurring with OFX examples, although in principle they should be able to occur with XS examples, too.

The tags PWD and PND are applied when the example is the complement of a preposition and is *only* interpretable vis-a-vis this preposition. PWD stands for "preposition with determiner", and is applied when Y has a determiner (usually "the", but not always), or at least is *not lacking* a determiner (as is the case when Y is a plural noun).

In this example,

in <"the face of their opposition">

we see that no one is actually talking about faces; rather, the expression is "in the face of". Since there is a determiner after the preposition ("in the face"), we code this as PWD.

Similar examples that lack a determiner where one would normally be expected (i.e., before a singular count noun) are coded as PND, for "preposition, no determiner":

by <"dint of his stubborness">

The reason we draw this distinction is that we have decided to include PWDs and exclude PNDs from our counts. The reasoning behind this is that in the case of PWDs, there are two full NPs in the example, complete with determiner. However, in PNDs, the Y is not a full NP, and this is taken to signal something that is frozen, and thus nonreversible. Therefore, PNDs receive the EXCL tag and are not coded further. This is not to say that PWDs, by contrast, are always reversible. But because they are only sometimes reversible, they need to be coded as REV or NREV by hand, if at all.

Here are some more PWD examples:

patience on <"the part of Americans">
rigged to <"the advantage of a private contracting company">
in <"the face of modern trends away from the Bible">
in <"the interest of lending their support to the new project">
at <"the start of the ninth">
in <"the middle of Oriole plans for a drive on the 1961 American League pennant">

Note that not *all* examples preceded by a preposition are PWD or PND. PWD examples are those in which the preposition is crucially important to the meaning. Thus, this would not be a PWD example:

We returned home after <"the end of the show"> .

Sometimes it is hard to say at first whether an example is PWD or not. Consider:

(a) He addressed me in <"terms of endearment">.
(b) In <"terms of endearment">, he rates fairly high.
(c) It's like that scene in <"Terms of Endearment">.

Of the above examples, only (b) would count as PWD (being plural, it is not PND), while (c) would count as NAME (see below) and (a) would not receive any construction type tag other than OFX.

One tricky question is whether singular nouns that can be both count and non-count nouns can ever be coded as PND, as in

in <"light of this discovery">

Because "light" can be either a count noun or a non-count noun, this might be coded as PND or as PWD.

------5.3.3 CMPD

CMPD, standing for "compound", is applied only to XS examples that syntactically are compounds. These are coded with EXCL and are not coded further.

Recognizing CMPD examples can be tricky, because they are written as two words, the first bearing the possessive "'s". However, their syntactic behavior is that of compounds, not that of normal "X's Y" examples.

Consider the term "men's magazine". It looks like a possessive construction, with a possessor and a possessee. (In fact, it could be, but more on that in a moment.) However, note that while we can say "two men's magazines", the singular is "one men's magazine", not "one man's magazine". Similarly, we can talk about "one children's story" and "one women's room".

Note that when spoken, these compounds act prosodically like any other compounds, with the first element bearing the main stress: "one WOMEN's room", not "*one women's ROOM". This would not be the case if they were ordinary possessives. Being compounds, they can receive modification like ordinary nouns; note that such modification applies to the compound as a whole, not to one part. Thus, "dog-eared men's magazines" does not refer to "dog-eared men", but rather to dog-eared (men's) magazines.

That said, sometimes these compounds can look indistinguishable from ordinary XS examples. Note that "She took the men's magazines" can be read aloud in two different ways (with capitals representing sentence accent):

(a) She took the (dirty) MEN'S magazines.
(b) She took the men's (dirty) MAGAZINES.

In (a), we understand that "men's magazines" is a compound, while in (b), there must be some men who have magazines, and it is not a compound.

Usually, we can tell from the context what is meant by an example. Unfortunately, however, sometimes we cannot. Take for instance the sentence "I often look in the bird's nest." Although this can be read prosodically in two different ways, and thus has two different syntactic interpretations, in context the two interpretations would probably both be acceptable. Thus, we cannot say with certainty which was intended. In such cases, assume that the example is *not* a compound, and treat it as an ordinary XS example.

To summarize, apply CMPD to an XS example that is clearly a compound by one of the tests above: the number test, the modification test, or the prosodic test.

It is an open question whether these compounds might be able to be expressed in OFX form. However, since we are not considering any compounds in this study, we are not considering XS compounds, either.

------5.3.4 NAME

The tag NAME is applied to those XS or OFX examples that are names, and as such, invariant. (Following Quirk et al. 1985, section @?@, we will say that while a "proper noun" is a single word, a "name" can be one or more words.) Because these are fixed, we do not include them in our counts. They are coded with EXCL and not coded further.

NAME examples are fixed by "arbitrary" decision, either by a group or by a single person. This includes names of people, names of places, and titles of books, institutions, etc. Because they are proper names, they can't be reversed at the time of usage (though they could well have been fixed in a different form).

The following are all examples of NAMEs:

the City of Atlanta
the State Board of Education
the Texas House of Representatives
Doctor of Education
the University of Oklahoma
Master of Science
Massachusetts Institute of Technology
the Signal Corps of the U.S. Army
Secretary of State
the Federal Bureau of Investigation
the Rotary Club of Providence
the State Federation of Women's Clubs
Port of New York
the State Board of Tax Appeals
the District of Columbia
the Battle of New Orleans
the Home Builders Association of Philadelphia
the International Association of Fire Fighters
assistant secretary of state for economic affairs
the Democratic Party of Oregon
The Assemblies of God
St. Patrick's Day
Knights of Columbus

Note that while we do not code these further *internally*, they often occur as the X or the Y of another example:

<"the end of <"A Tale of Two Cities">">

In such cases, they are coded as PROP and DEF, because they are proper names. Other features will have to be inferred from the nature of the example. Thus, in the example above, while "A Tale of Two Cities" is coded as NAME (and will be autocoded for other things), "the end of A Tale of Two Cities" will be coded as PROP_M DEF_M INANIM_M REV, etc.

------5.3.5 IDIOM

The tag IDIOM marks examples that are frozen but not proper names. Many of them are invariant grammatical constructions that have a meaning that is not predictable from the component words alone--that is, they are idioms. Such examples are tagged with EXCL and are not coded further.

Idioms are a mixed bag. Sometimes, all the words in the idiom are invariant. Sometimes, it constitutes a "formula", in which various words may be inserted. Sometimes, the noun phrase is part of a larger idiom, involving a verb or other elements. The following are all examples of idioms:

of course
sort of stupid
not much of a meal
(I've had) enough of you
top of the morning

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 5.4 Expression type

[Tags discussed in this section: PRON_H, PRON_M, PROP_H, PROP_M, KIN_H, KIN_M, GER_H, GER_M, WHPRON_H, WHPRON_M, COMM_H, COMM_M, NOHEAD_H, NOHEAD_M, OET_H, OET_M, MIX_ET_H, MIX_ET_M, ELLIP, DISLOC]

One of the factors for which we code the examples in our corpus is expression type. This refers to the lexico-syntactic class to which the X and the Y of each example belong. Every example gets two of these tags. Note that there is always one tag for the X--with the extension "_M", for "modifier"--and one tag for the Y--with the extension "_H", for "head". For ease of exposition, we will talk about "the nominal" when we mean "either X or Y".

The easiest distinction to see is that between pronouns and nouns; however, we also distinguish proper nouns from common nouns, and kinship nouns and gerunds from all of these. These are described in Sections 5.4.1 to 5.4.7. If you are unable to decide what the expression type of X or Y is, use the appropriate "other" tag: OET_M or OET_H, respectively.

It also happens that some constructions have an X or a Y that is "headless", either because of the particular syntactic construction involved, or because the head has been dislocated. Sections 5.4.8 and 5.4.9 cover these cases.
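The requirement that every example carry one "_M" tag and one "_H" tag for expression type lends itself to a simple consistency check. The Python sketch below is only an illustration and not part of the project's software; the function name is hypothetical, the tag inventory is taken from the list at the head of this section, and an example's tags are assumed to be available as a list of strings.

```python
# Base names of the expression type tags listed at the head of Section 5.4.
# ELLIP and DISLOC have no _H/_M variants and are left out of the check.
ET_BASES = ["PRON", "PROP", "KIN", "GER", "WHPRON",
            "COMM", "NOHEAD", "OET", "MIX_ET"]


def expression_type_problems(tags):
    """Return a list of problems; empty if there is one _M and one _H tag."""
    problems = []
    for suffix, role in (("_M", "X (modifier)"), ("_H", "Y (head)")):
        valid = {base + suffix for base in ET_BASES}
        found = [tag for tag in tags if tag in valid]
        if len(found) != 1:
            problems.append(
                f"expected one expression type tag for {role}, "
                f"found {len(found)}: {found}")
    return problems


print(expression_type_problems(["OFX", "INCL", "COMM_H", "PROP_M"]))
print(expression_type_problems(["XS", "INCL", "PRON_M"]))
```

A check of this kind only confirms that the two tags are present; whether they are the right tags remains a matter for the coder's judgment.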

------5.4.1 PRON_H, PRON_M

Use the PRON tag when the nominal is a pronoun. In the case of X, this will usually be a possessive personal pronoun, such as "my" or "her". In the case of Y, it will usually be an objective (accusative) personal pronoun, such as "him" or "us". However, you will also encounter other types of pronouns, such as the following:

mine
what
whose
someone('s)
anything
etc.

Beware the results of the autocoder_et! Although it tends to work quite well, the autocoder is often fooled into thinking that things are pronouns which are not. This is mostly due to the fact that Karlsson's ENGCG part-of-speech tags often label things like quantifiers as pronouns:

all_P
much_P
some_P
etc.

Do not hesitate to change the expression type tags on these examples (not the POS tags on the words!) when necessary.

For what to do with wh-pronouns, see Section 5.4.5 below.

------5.4.2 PROP_H, PROP_M

Use the PROP tag when the nominal is a proper noun (only one word) or a name (which can be one word or several). It can be very tricky to decide whether a nominal is proper or not. Here is our working test:

* An NP without a determiner (e.g., "Texas") is proper if it cannot be changed in number or take a determiner ("*Texases", "*a Texas"). An NP with a determiner (e.g., "the West Indies") is proper if it cannot be changed in number or lose its determiner ("*a West Indy", "*go to West Indies"). If number or determiner alternation is possible, it is not PROP.

It is important to understand that nouns that are usually proper can be "coerced" into behaving like common nouns, as in "Do you mean the Washington on the Pacific or the Washington on the Potomac?" or "She wants to be a Shakespeare". In these sentences, the names "Washington" and "Shakespeare" uncharacteristically occur with a determiner, and we therefore say that in this case, they are being *used* like common nouns, not proper nouns. We code such proper nouns used like common nouns as common, since it is actual instances of usage that we are concerned with. It should be clear, then, that when you apply the test above, you need to keep in mind that some proper nouns *can* plausibly occur with determiners and modification of some sort--but in such cases, they are being coerced into commonness and will be tagged as common. In other words, when applying the test, make sure you are not coercing the nominal yourself.

Proper nouns have unique reference. As such, they never take indefinite determiners such as "a" or "some", except in the case of titles like "A Separate Peace". Therefore, we can say that any nominal with an indefinite determiner, unless it is a title, is not coded as PROP.

However, there is one class of examples that complicates the picture painted so far: nouns with non-restrictive modification, which are still considered proper. In expressions such as "the beautiful Princess Diana", the determiner is licensed by the (usually emotionally colored) non-restrictive modification.

Quirk, et al. (1985, section 5.64) have the following to say about non-restrictive modification:

Nonrestrictive premodifiers are limited to adjectives with emotive colouring, such as:

old Mrs Fletcher
beautiful Spain
dear little Eric
historic New York
poor Charles
sunny July

In a more formal and rather stereotyped style, the adjective is placed between 'the' and a personal name:

the beautiful Princess Diana
the inimitable Henry Higgins

Proper nouns show that they can temporarily take on features of common nouns and accept restrictive modification of various kinds:

The Dr. Brown I know comes from Australia.

The flower arrangement was done by a Miss Phillips in Park Road.

Do you mean the Memphis which used to be the capital of Egypt, or the Memphis in Tennessee?

I spoke to the younger Mr. Hamilton, not Mr. Hamilton the manager.

In the last set of examples given by Quirk et al.--those with restrictive modification--we see examples of proper nouns that we would code as COMM, because of the coercion mentioned by Quirk et al.

In most of the cases you encounter, modification will be restrictive--and therefore indicate coercion--so while you should code things such as "the Alps" as PROP, you should be ready to code things such as "the Alps of my boyhood" as COMM.

Of course, we must beware names that include modification as part of the proper name, as in "North Dakota" or "Manchester by the Sea". Let us summarize by stating a simple rule:

* If a noun occurs with restrictive modification (that is not *part of* a name), it is not PROP.

We must say something about capitalization: Note that in English, names are often signaled in writing with capital letters. This is a reasonable flag to draw our attention to the possibility of a noun being proper. However, while it seems that in English all proper nouns are written with capital letters (except possibly "e.e. cummings"), it is not the case that all nouns written with capital letters are proper. To pick an obvious example, the first word of each sentence is written with a capital letter, regardless of its expression type. But there are trickier cases, too. Below are two lists, one of nouns we consider proper, and one of nouns we consider common.

PROPER NOUNS            COMMON NOUNS
January                 last January
Sunday                  today
2002 (the year)         now
the United States       the Government
Russia                  Russians
the Republican Party    Republicans
Siam                    a Siamese
Stetson (a person)      my Stetson (a hat)

Sometimes, it will be difficult to decide whether a noun is proper or not. If you are still uncertain after applying the tests above, here is what to do. Consider the three most common categories of proper nouns:

PERSONAL NAMES (e.g., Bill Gates)
PLACE NAMES (e.g., Bellvue)
INSTITUTIONS (e.g., Microsoft, Harvard, the Supreme Court, etc.)

If the word in question is capitalized and fits into one of these three categories, go ahead and code it as PROP. Otherwise, be conservative and code it as COMM.

That said, if you are really unsure, use OET_H or OET_M (for "other expression type"). As always, it is better to code something as "other" than to give it the wrong tag!

------5.4.3 KIN_H, KIN_M

Nominals should be coded as KIN if they belong to the relatively limited class of kinship terms. Here is a sample of kinship terms:

mother
grandmother
great-grandmother
sister
cousin
niece
sibling
aunt
great-aunt
godmother
goddaughter
daughter
great-granddaughter
grandchild
parent
wife
mother-in-law
step-mother
etc.

Although kinship nouns are arguably a type of common noun, we code them specially because cross-linguistically, they behave in special ways that differentiate them from other common nouns. We want to see to what extent this is also true in English.

Please note that a word like "child" has two meanings: "son or daughter" and "young person". We code instances with the relational meaning (son or daughter) as KIN, but instances of the non-relational meaning as COMM. We consider these to be two separate lexical entries. Similarly, when we talk about "syntactic sisters", we are using the term "sister" as a COMM noun.

------5.4.4 GER_H, GER_M

Gerunds are a special syntactic class of "verbal nouns". They always end in -ing, but of course, not all words ending in -ing are gerunds. Leaving aside the obvious examples such as "sing" and "ding-a-ling", we can point out that there is a difference between pairs of examples such as the one below. (Note: In all examples below, (a) will contain a gerund, and (b) a normal noun.)

(a) the importance of writing his dissertation
(b) the importance of the writing of his dissertation

Syntactically, a gerund is a verb taking the place of a nominal. While it occupies this slot, it continues to act like a verb in many other ways. By contrast, there are many deverbal nouns ending in -ing; although they have been derived from verbs, they are nouns in every way, and therefore behave differently from gerunds. We can tell the difference by applying some tests.

A gerund, being essentially a verb, can take a direct object without any preposition intervening. Nouns cannot do this. Observe:

(a) the importance of writing (*of) his dissertation
(b) the importance of the writing *(of) his dissertation

A gerund can also take a subject; however, it requires a special form for its subject: that of a possessive pronoun or a noun with the possessive 's ending:

(a) his/John's writing his dissertation
(b) the writing of his dissertation

Now this is tricky: Note that a noun can also take a possessive pronoun--in this case, we call it a possessor, not a subject. This means that you can get a gerund-noun pair that differs only minimally:

(a) his writing his dissertation
(b) his writing of his dissertation

The important fact is that while gerunds can take possessive pronouns, they cannot take determiners:

(a) (his/*the) writing his dissertation
(b) the writing *(of) his dissertation

So, thus far, we have identified two tests: (1) gerunds can take a direct object (assuming they are transitive), and (2) gerunds cannot take a determiner.

There is another test, also stemming from the fact that gerunds are really verbs: instead of adjectival modifiers, they take adverbial modifiers:

(a) his quickly writing his dissertation
(b) his quick writing of his dissertation

Therefore, we can offer three tests for gerunds, stated in this rule:

* If you encounter a nominal ending in -ing, it is a gerund if:
  (1) it can take a direct object without a preposition, or
  (2) it cannot take a determiner, or
  (3) it can take adverbial, but not adjectival, modification
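Since the rule joins the three tests with "or", a single clearly positive result is enough. The following Python sketch shows that logic only; the function and argument names are hypothetical, and each argument is the coder's own judgment about the example, since none of the tests can be run mechanically.

```python
def is_gerund(takes_bare_object=None, rejects_determiner=None,
              adverb_not_adjective=None):
    """Apply the three tests; any test that clearly succeeds marks a gerund.

    Each argument is True, False, or None when the test cannot be
    applied (for instance, test 1 when the verb is intransitive).
    """
    results = (takes_bare_object, rejects_determiner, adverb_not_adjective)
    return any(result is True for result in results)


# "writing" in (a) "his quickly writing his dissertation"
print(is_gerund(True, True, True))
# "writing" in (b) "his quick writing of his dissertation"
print(is_gerund(False, False, False))
```

Treating an inapplicable test as None rather than False keeps it from being mistaken for evidence against gerund status.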

------5.4.5 WHPRON_H, WHPRON_M

When the X or Y of an example begins with a "wh-word", the category of the whole nominal can be of various types; it can be a pronoun, a free relative, an embedded question, etc. Although these are very different from one another syntactically, because we are not using most of these in our analysis, we are (provisionally) drawing only a distinction between what we call PRON and what we call WHPRON. Several heterogeneous things are lumped into the latter category.

The only wh-words to be coded as PRON are "which", "who", "whom", and "whose"; the others should be coded as WHPRON. But *not all instances* of these will be PRON, as will be explained below.

When you encounter interrogative noun phrases that begin with "what", "which", or "whose" as *determiners* and not pronouns, you should ignore the determiner and code for the nominal head, which is usually COMM, as in the following examples:

By <"what right of superior virtue"> ... do <"the people of the North"> do this ?

did not even know <"what sort of clothes"> I ought+to be wearing

we go on endlessly trying to draw the line ... as+to <"which kind of man"> we wish to see dominate

If the nominal is an entire embedded question, code it as WHPRON. This is true even if it begins with "when", "where", "why", or "how", despite the fact that these are not pronouns.

Similarly, all free (or headless) relatives will be coded as WHPRON, despite the fact that their syntactic category actually varies. The following are examples of free relatives:

<"<"your understanding of what I am saying">">

<"part of what Gilbert Seldes implies when he says ...">

<"a knowledge of what the evidence at <"his disposal"> proves">

So when are wh-pronouns coded as PRON? When they constitute the entire X or Y and are relative, not interrogative, pronouns. This is usually true of "whose", and is often true of "which":

In homely terms <"whose timeliness"> is startling today

Note this logic: If we code "whose timeliness" as PRON_M, we should also code "the timeliness of which" as PRON_M, and this is what we do.

Similarly, the following example would be coded PRON_M:

The reporters were questioning the Interior man and the French officer , <"both of whom"> remained noncommittal

However, note that sluicing, which can also result in an X that is one word long, involves interrogatives, so examples such as

We know we want to buy one, but we have yet to settle <"the question of which">.

should be coded as WHPRON_M. These, however, are very rare.

Also rare are exclamations beginning with "what"; if you encounter one, the "what" can safely be ignored. Thus,

And <"what a galaxy of those"> adorns that fair land !

would be coded as COMM_H.

Finally, please note that "whether" is not treated as a wh-pronoun. If you encounter an X that begins with "whether", code it as OET_M.

------5.4.6 COMM_H, COMM_M

Nominals should be coded as COMM if they do not fit into any of the other classes described in this section. That is, they are not pronouns, proper nouns, kinship nouns, gerunds, or wh-pronouns. The majority of nominals you encounter will be in this class.

Generally, the most difficult distinction to draw is between common nouns and proper nouns. See above for a discussion of this distinction.

------5.4.7 MIX_ET_H, MIX_ET_M

As discussed in Section 4.3, many examples contain either a coordinate head or a coordinate modifier. In such cases, you will need to evaluate both conjuncts for expression type. They may be of the same type, or they may be of different types, as shown in the examples below. If they are the same, simply give them the tag you would have given a single nominal, as in examples (a)-(e). However, if they are different, use the tag MIX_ET_H or MIX_ET_M, as in examples (f) and (g).

(a) COOR_M COMM_M "the appointment of appraisers , guardians and administrators"

(b) COOR_M PROP_M "the five counties of Dallas , Harris , Bexar , Tarrant and El Paso"

(c) COOR_H COMM_H "the guilt or innocence of the 50 persons"

(d) COOR_H COMM_H "all his time and attention"

(e) COOR_H COOR_M COMM_H COMM_M "the causes and prevention of dependency and illegitimacy"

But:

(f) COOR_M MIX_ET_M "the approval of their parents and teachers"

(g) COOR_H MIX_ET_H "none or only part of the hospital-care credit"

We use the tag MIX_ET because it would require too many tags to have an option for every possible pairing: "one common noun and one kinship noun", etc. Applying MIX_ET allows us to mark these as complex, so that we can go back and examine them more carefully later.
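The choice between a plain expression-type tag and MIX_ET can be sketched as follows. This is a minimal illustration, not part of the project's tools: the function name is hypothetical, and it assumes the coder has already decided the expression type of each conjunct (only COMM, PROP, and PRON are used here as sample values).

```python
def expression_type_tag(conjunct_types, role):
    """Return the expression-type tag for a head or modifier.

    conjunct_types -- expression types of the conjuncts, e.g. ["COMM", "PROP"];
                      a non-coordinate nominal is a list of length one
    role           -- "H" for the head (Y) or "M" for the modifier (X)
    """
    if role not in ("H", "M"):
        raise ValueError("role must be 'H' or 'M'")
    if not conjunct_types:
        raise ValueError("at least one conjunct is required")
    # Same type throughout: use the tag a single nominal would get.
    if len(set(conjunct_types)) == 1:
        return f"{conjunct_types[0]}_{role}"
    # Different types: mark the nominal as mixed for later review.
    return f"MIX_ET_{role}"


# Example (a): three common-noun conjuncts in the modifier.
print(expression_type_tag(["COMM", "COMM", "COMM"], "M"))  # COMM_M
# A head coordinating a common noun with a proper noun.
print(expression_type_tag(["COMM", "PROP"], "H"))          # MIX_ET_H
```

Note that the COOR_H or COOR_M tag is applied separately; the sketch covers only the expression-type tag.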

------5.4.8 NOHEAD_H, ELLIP, DISLOC

The sections above describe the tags to apply to an example when both the X and the Y contain a lexically realized NP. It often happens, however, that examples do not have both an X and a Y that are lexically realized NPs. We call these "headless" Xs or Ys. Note that being headless is not always the same as being null; that is, a headless X or Y can have *some* content, just not an NP, as we will see below.

Let us first consider examples with a headless Y, or NOHEAD_H examples. There are three major classes of these (with "___" marking the place from which the head is absent):

HEADLESS CONSTRUCTIONS, such as partitives: "most ___ of the bananas" "two ___ of them"

EXAMPLES WITH A DISLOCATED HEAD: "___ Of the house, only the chimney remained." "___ Of the class, two students failed."

EXAMPLES WITH AN ELLIPTICAL HEAD: "both Roger's work and Stephen's ___" "the works of Shakespeare and ___ of Bacon"

The reason we call the partitive a "headless construction" is that we adopt the analysis of partitives in which Y consists of a quantifier with a null NP complement, while X consists of a definite full NP. We assume that "two of the boys" and "most of us" actually correspond to "two boys of the boys" and "most ones of us", but the construction has evolved such that the first appearance of the NP is eliminated.

If we make this assumption, it is a simple matter to tag all partitive examples as NOHEAD, and in fact, this is done by the program autocoder_part. And since partitives are also tagged with EXCL, you will not have to code them further. Therefore, the headless examples you will have to tag are of the second and third classes listed above. In each case, you will have to add two tags to each example: NOHEAD and either DISLOC or ELLIP. The NOHEAD tag is added to show that the example is one of the categories of headless examples, and the second tag is added to show which category. This system affords us the maximal ability to draw distinctions in performing searches of the corpus later on.

The DISLOC tag should be added to a headless example in which the Y is present somewhere in the sentence but is not in the usual position (the usual position being before "of", or after "'s", for example). It may actually be the case that in these examples the head stays in the right place in the sentence, while the complement moves. In any case, the result is the same: the construction is split into two parts. We choose to bracket the complement/modifier portion and call the head "dislocated".

Such examples are usually partitives, but may be of other types.

<"Of the two"> , The Life Of Bright is incomparably the better biography.

<"of Costaggini"> , only some foliage has been washed , at the point where his work stopped

Later the cleaning and restoration were ordered , first <"of the older part of the frieze"> , finally <"of the canopy">.

Note that in DISLOC examples, the dislocated head is not included within the example brackets. The DISLOC tag serves to alert us to the fact that, even though it isn't to be seen in the example, it does occur in the text. This is important, because we will code the (dislocated) head for other things (like animacy).

Contrast these with ELLIP examples. In these headless examples, the Y does not occur *anywhere* in the text. It is, nonetheless, recoverable from the context, usually because it is an elliptical NP identical to some antecedent in the text:

in connection with President Eisenhower's cabinet selections in 1953 and <"President Kennedy's in 1961">

In many cases, there is coordination between two noun phrases, and the parallelism between them allows the Y of the second to go unexpressed, as in the example above.

Please note the distinction between such examples and DISLOC ones: in this example, the NP "cabinet selections" occurs in the utterance, but it is *not* the head of "President Kennedy's ___ in 1961"; rather, it is the head of "President Eisenhower's cabinet selections in 1953". It has not been moved from anywhere. The head of "President Kennedy's ___ in 1961" is recoverable only because we are able to interpret the null head as an ellipsis and identify the previous head as the probable antecedent.

The way we deal with ELLIP examples of the form "Y of X" is special. As discussed in Section 4.3, because the second (and third, etc.) "of" in such examples appears to be roughly optional, we choose to treat all such examples the same: as though all "ofs" after the first did not occur. Therefore, in examples such as

"the works of Shakespeare and ___ of Bacon"

the example associated with the first "of" will remain, but the second "of" will be "killed", and the brackets of the first example extended to cover all coordinate material. Thus, instead of two examples, as in

<"the works of### Shakespeare"> and <"___ of### Bacon">

we have only one example, like so:

<"the works of### Shakespeare and ___ ofxxx Bacon">

The resulting example would then be tagged with COOR_M, and neither NOHEAD nor ELLIP. The reason why should be clear: The example *has* a head--"the works"-- which counts for both modifiers.

The negative result of this is that the elliptical nature of "___ of Bacon" goes untagged, except insofar as all examples that are COOR_M can be considered elliptical.

To review the relationship among the tags NOHEAD, DISLOC, ELLIP, and PART, we consider the five examples below. In each case, the example in question is within quotes, while material not in the example is in parentheses. The tags that apply to each example are given at left.

(a) PART NOHEAD_H "some of the students" (are sick)
(b) PART NOHEAD_H DISLOC "of the students" (, some are sick)
(c) NOHEAD_H DISLOC "of the class" (, two students are sick)
(d) NOHEAD_H ELLIP (Roger's work and) "Stephen's"
(e) also NOHEAD_H ELLIP (Roger's work and) "that of Stephen"

Example (a) is a standard partitive expression. Example (b) is also a partitive, but has its head moved out, and therefore would get the tag DISLOC. In practice, however, we wouldn't bother adding this tag, since we are not coding partitives any further. Note that while both (a) and (b) are partitive examples, (c) is not; the undislocated example would be "two students of the class", which is not a partitive by our definition, precisely because it is not headless! It has the nominal head "students", and is therefore tagged with NOHEAD_H and DISLOC only if it occurs as in (c).

Example (d) is like the Kennedy example discussed above; "Stephen's" is headless and elliptical, though we know what the noun would have been had it recurred. Example (e) is also coded the same way; the difference is that the Y here, while it is headless, is not empty. It is considered headless because we analyze "that" as a determiner, not a pronoun; the full example would be "that work of Stephen". See the discussion above of "headless constructions". Such examples with "that" or "those" are quite common:

the drain on their savings -- or <"those of their children"> -- caused by an extended hospital stay

all Presidential appointments , including <"those of cabinet rank">

another vexing issue -- <"that of finances in state government">

Note that the tags DISLOC and ELLIP do not have "_H" and "_M" versions; this is because, to the best of our knowledge, Xs are never left out, except in the sense described in the next section. All DISLOC and ELLIP tags mean, essentially, DISLOC_H and ELLIP_H.

------5.4.9 NOHEAD_M

Finally, let us consider the tag NOHEAD_M. To the best of our knowledge, neither XS nor OFX examples can have a truly null modifier:

*I don't know the name of (___).
*I don't know (___) name.

When, then, is the tag NOHEAD_M to be used? Precisely in those cases in which an example that is NOHEAD_H occurs *inside* another example, as the X. Such examples are not common, but they do occur.

Moreover , all three representations must be squeezed comfortably into little more than the length Brumidi allowed for <"each one of### <"his###">"> .

In this example, <"his"> has a null Y (NOHEAD_H ELLIP, to be precise). However, <"each one of his"> then takes this same "his" as its X. What is the expression type of this X, then? We can't say that it's PRON_M, because the pronoun is not the head of this DP, but rather only the modifier of a null head. So, to be absolutely correct, we apply the tag NOHEAD_M, resulting in this coding:

<"each one of### <"his###"><#XS NOHEAD_H PRON_M ELLIP#> "><#OFX COMM_H NOHEAD_M#>

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 5.5 Definiteness

[Tags discussed in this section: DEF_H, DEF_M, INDEF_H, INDEF_M, ODEF_H, ODEF_M, MIX_DEF_H, MIX_DEF_M]

Definiteness is difficult to define. There is considerable disagreement in the literature about this, and we do not feel that we have solved the problem either. Some researchers believe that definiteness is a semantic feature, while others believe that the term should be used only for certain types of syntactic marking (definite determiners, etc.). While in languages like English, there is often syntactic marking of definiteness, in languages like Czech or Mandarin, nouns usually lack any such marking.

Lyons (1999) explains that definitions of definiteness usually fall into two groups, (a) definitions based on familiarity or, better, identifiability, and (b) definitions based on uniqueness or, better, inclusiveness. Whichever type of definition is adopted, there seems to be general agreement in the majority of cases about whether a given noun phrase is definite or indefinite.

What we need is a test of definiteness which can easily be applied to corpus data. Therefore, we adopt a test that--while not universally approved--has been used by numerous researchers to distinguish between definite and indefinite noun phrases. This is the "existential there" test.

This test is generally not considered a test of definiteness per se, but rather of distinctions such as that between "strong" and "weak" determiners (terms coined by Milsark (1974)). We use the test as defined by Keenan (1989); in his terms, it distinguishes between (semantic) determiners that are "existential" and those that are not. We choose to use the test simply to distinguish two groups, which we call "indefinite noun phrases" and "definite noun phrases". The assumption behind the test is that, in English, only indefinite noun phrases can go in the slot in an "existential there" context such as this:

There is/are ______ outside.

If a noun phrase can go in the slot *and give an existential reading*, the determiner is "existential" and the noun phrase is (by our definition) indefinite. Note that a sentence like

There is that new restaurant across the street.

is only acceptable on two readings, neither of which is existential. The first is the "list" reading ("Fast food? Well, there's McDonald's, there's Wendy's..."), and the second is the locative reading ("Look! There's that restaurant we were looking for!").

Keenan gives us another test for existential determiners, which we can use as a test for indefinite noun phrases. It goes like this: If you can say "A is a B", and also "A that is B exists", and they always have the same truth value, then A is an indefinite noun phrase. Thus:

"Some student" is indefinite, because you can say
    Some student is a vegetarian.
and
    Some student who is a vegetarian exists.

"At least n students" is indefinite, because you can say
    At least n students are vegetarians.
and
    At least n students who are vegetarians exist.

"Every student" is *definite*, because you can *not* say
    Every student is a vegetarian.
and
    Every student who is a vegetarian exists.

In this last case, the first sentence is false (or can be), while the second is necessarily true.

The syntactic means to signal definiteness--what Keenan calls "determiners"--can actually be a number of things:

articles
quantifiers
pronouns
wh-pronouns
lack of a determiner
etc.

In what follows, we will simply list the lexical items in these categories that we have determined to be definite and indefinite.

------5.5.1 DEF_H, DEF_M

Here is a list of those lexical items we deem to be definite by the tests described above. They all should receive the tag DEF_H or DEF_M.

ARTICLES: the

DEMONSTRATIVES: this that these those

PRONOUNS (PERSONAL): 'em he her hers herself him himself his hisself I it its itself me mine my myself our ours ourselves she thee their theirs them themselves they thine thou thy us we ya ye you your yours yourself yourselves

PRONOUNS (IMPERSONAL): each other everybody everybody else everyone everyone else everything everything else one another

WH PRONOUNS: which (when relative) who (when relative) whatever whatsoever whichever whoever whosever whosoever

QUANTIFIERS: all all such both each either every most neither

OTHER: today to-day tomorrow yesterday

<"last" or "next" plus one of the following> night, week, month, year, season, winter, spring, summer, fall, autumn, Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday, January, February, March, April, May, June, July, August, September, October, November, December, Christmas, Easter, Thanksgiving, term, semester

------5.5.2 INDEF_H, INDEF_M

Here is a list of those lexical items we deem to be indefinite by the tests described above. They all should receive the tag INDEF_H or INDEF_M.

ARTICLES: a an

PRONOUNS (IMPERSONAL): any one else anybody anybody else anyone anyone else anything anything else no-one nobody nobody else nothin' nothing nothing else one's oneself somebody somebody else someone someone else somethin' something something else

QUANTIFIERS: a bit of a few a great many a little a little more a lot more a lot of another any enough few fewer half less little little or no lots of many, many many more much no no more no such none one-half one-third one one or more ones plenty of several some twice

OTHER: -ment, -tion, -sion, -ism

------5.5.3 ODEF_H, ODEF_M

The tags ODEF_H and ODEF_M actually have two uses. The first is the use common to all "other" tags: to mark examples the researcher or the coder is unsure of.

The second is to mark those nominals that we judge *not to be marked for definiteness at all*. There are a number of constructions for which we judge it impossible to say whether a nominal is definite or indefinite. In fact, in some, we hypothesize that the nominal is an NP and not a DP, and therefore--if we assume that DP is what has the definiteness feature--has no definiteness at all.

These include:

the heads of BOAD examples

"manager of the operation"

the modifiers of SORT examples

"a sort of car"

and probably certain other classes of expressions, as

"my point of view"

The autocoder attempts to code gerunds as DEF, but in cases where it can't be sure whether a nominal is a gerund or a common noun, it codes it as ODEF. In some cases, even we cannot be sure, so we will continue to code such cases as ODEF:

"a high standard of living"

Finally, all clauses and embedded questions are marked as ODEF.

------5.5.4 MIX_DEF_H, MIX_DEF_M

As with other classes of tags, MIX_DEF_H and MIX_DEF_M are used to mark coordinate nominals in which the two conjuncts differ in definiteness.
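The lexical lists in Sections 5.5.1 and 5.5.2, together with the MIX_DEF rule, lend themselves to a simple lookup. The sketch below is an illustration only and does not describe how the project's autocoder works: the function names are hypothetical, the two sets contain just a small sample of the listed items, falling back to ODEF for unlisted words is an assumption based on the "unsure" use of that tag, and treating a DEF/ODEF pair as mixed is likewise an assumption.

```python
# Small samples from the DEF and INDEF lists above (not the full lists).
DEFINITE = {"the", "this", "that", "these", "those",
            "all", "both", "each", "every", "most"}
INDEFINITE = {"a", "an", "any", "few", "many", "no", "several", "some"}


def definiteness(first_word):
    """Classify one nominal by the lexical item that introduces it."""
    word = first_word.lower()
    if word in DEFINITE:
        return "DEF"
    if word in INDEFINITE:
        return "INDEF"
    # Assumption: anything not in the sample lists is left for review.
    return "ODEF"


def definiteness_tag(first_words, role):
    """Return the definiteness tag for a (possibly coordinate) nominal.

    first_words -- the introducing word of each conjunct
    role        -- "H" for the head or "M" for the modifier
    """
    if role not in ("H", "M"):
        raise ValueError("role must be 'H' or 'M'")
    if not first_words:
        raise ValueError("at least one conjunct is required")
    values = {definiteness(word) for word in first_words}
    if len(values) == 1:
        return f"{values.pop()}_{role}"
    # Conjuncts that differ in definiteness get the mixed tag.
    return f"MIX_DEF_{role}"


print(definiteness_tag(["the"], "H"))       # DEF_H
print(definiteness_tag(["the", "a"], "M"))  # MIX_DEF_M
```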

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 5.6 Animacy

[Tags discussed in this section: HUMAN_H, HUMAN_M, ORG_H, ORG_M, ANIMAL_H, ANIMAL_M, PLACE_H, PLACE_M, TIME_H, TIME_M, CONCRETE_H, CONCRETE_M, NONCONC_H, NONCONC_M, OANIM_H, OANIM_M, MIX_AN_H, MIX_AN_M, VANIM_H, VANIM_M, ANTHRO_H, ANTHRO_M]

Animacy is a semantic property of entities; these entities can be referred to by means of various types of referring expressions. Animacy is usually conceived of as a hierarchy or scale; the number of points on the scale appears to vary from language to language.

The simplest animacy hierarchy would look like this:

human > non-human

A three-way distinction could be made this way:

human > animal > inanimate

Often, we see animacy combined with expression type into a hierarchy, as in Silverstein's (1976) version of the "animacy hierarchy", which includes local person, third person, pronouns, etc. In our study, we are attempting to keep these two dimensions separate. For discussion of expression type, see Section 5.4 above.

We have chosen to use a version of the animacy hierarchy that has essentially three tiers with seven categories:

TOP          MIDDLE            BOTTOM

human    >   animal        >   concrete inanimate
         >   organization  >   non-concrete inanimate
                           >   place
                           >   time

That is, the category "inanimate" has been divided into four sub-categories. We are very non-committal about the placement of "organization"; where this falls on the scale is an empirical question, as is the arrangement on the scale of the inanimate categories.

The absolutely crucial thing to bear in mind when coding for animacy is that when we do so, we are not considering *the nominal* per se (e.g., the word "church"), but rather *the entity that is the referent of that nominal* (e.g., some particular thing in the real world, which could be a building, a group of people, an institution, etc.). This is harder to do than it may sound. It is very tempting to always code the word "chair" as CONCRETE, but sometimes it is used as in "the chair of the department", referring to a human. Similarly, if we see the expression "the Big Cheese" used to refer to a person, we code it as HUMAN, even though in general the word "cheese" is used to refer to something inanimate.

The cases in which the referent is a single human are actually the easy ones. There are a great many cases in which it is hard to say what the referent of a nominal is; for the very hardest cases, we have the tag VANIM, which is discussed in the next section.

Please remember that when you code for the animacy of the head, you must consider the entire DP, not just the Y. Otherwise, you might be tempted to code "the House of Representatives" as CONCRETE because of "house", rather than as ORG, which is correct.

------5.6.1 VANIM_H, VANIM_M

All of the animacy tags are mutually exclusive except for VANIM_H and VANIM_M, which are represented by checkboxes in the coder (not radio buttons). These tags are to be applied to mark those nominals for which it is truly hard to say what the referent is, between two or more choices. VANIM stands for "variable animacy"; this is a clue to its use. There are many classes of words whose reference can vary systematically. At times, it will be impossible to say exactly what is being referred to in a given instance. This is precisely when you should apply the tag VANIM.

Note that if you are totally baffled by an example, you should use the tag OANIM, not VANIM. OANIM means "come back to this later"; no other animacy tags are applied. In the case of VANIM, you should apply the tag when you are between 50% and 90% sure that a given referent is intended, but are aware of another possible reading.

To summarize this,

When you are:       You should apply:

0%-50% sure         OANIM
50%-90% sure        VANIM + best tag
90%-100% sure       best tag
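The table above amounts to a simple threshold rule, which can be sketched as follows. The function name is hypothetical, and because the ranges in the table overlap at their end points, the sketch assumes that exactly 50% counts as the middle band and exactly 90% as the top band.

```python
def animacy_tags(best_tag, confidence, role):
    """Return the animacy tags to apply, given the coder's confidence.

    best_tag   -- the most likely animacy category, e.g. "ORG"
    confidence -- how sure the coder is of best_tag, from 0.0 to 1.0
    role       -- "H" for the head or "M" for the modifier
    """
    if role not in ("H", "M"):
        raise ValueError("role must be 'H' or 'M'")
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0.0 and 1.0")
    if confidence < 0.5:
        # Baffled: mark for later review; no other animacy tag is applied.
        return [f"OANIM_{role}"]
    if confidence < 0.9:
        # Fairly sure, but aware of another possible reading.
        return [f"VANIM_{role}", f"{best_tag}_{role}"]
    return [f"{best_tag}_{role}"]


print(animacy_tags("ORG", 0.7, "M"))     # ['VANIM_M', 'ORG_M']
print(animacy_tags("HUMAN", 0.95, "H"))  # ['HUMAN_H']
```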

As an illustration, consider the word "Washington":

(a) Washington had wooden teeth.
(b) Washington is thirty miles from Baltimore.
(c) Washington is on the phone.
(d) Washington hardly operates like a well-oiled machine.
(e) Washington doesn't send any senators to Congress.

In (a), "Washington" refers to a person. Such an instance would be tagged as HUMAN.

In (b), "Washington" refers simply to the city qua a place in the world. Such an instance would be tagged as PLACE.

In (c), "Washington" obviously refers to some person who is considered to represent Washington or the government (whether or not he or she is actually in Washington!). This is a simple case of metonymy. Such an instance would be tagged as HUMAN.

In (d), we see another case of metonymy. However, in this case, "Washington" does not refer to the people in Washington, but rather to the political system that is centered in Washington. Different coders might consider the referent here to be an organization of humans, or an abstract entity (cf. "the bicameral legislative system"); this could therefore be tagged either as ORG or as NONCONC, and would probably have the tag VANIM applied as well.

In (e), it is even more difficult to say what the referent is. It is not exactly the geographical place. It might be seen as referring to the human inhabitants of the city, which is who senators would represent. Or it might be seen as referring to the city as a complex political entity (cf. The Commonwealth of Massachusetts). Such an instance could conceivably get coded as HUMAN, ORG, PLACE, or NONCONC; in any case, it would almost certainly also get the tag VANIM.

The usefulness of the tag VANIM is twofold: First, if we need to, we can exclude these extremely difficult examples from consideration when we do our frequency counts, to try to get clearer results. Second, we can review all VANIM examples as a group in an attempt to clarify matters and refine the coding system.

------5.6.2 HUMAN_H, HUMAN_M

The tags HUMAN_H and HUMAN_M are used to tag nominals that refer to one or more humans (who are not an ORG--see below). These can be proper names (such as "Fritz Newmeyer"), kinship terms (such as "father"), common nouns (such as "linguist", "boss", or "nerd"), or pronouns (such as "him" or "everybody"). What matters, again, is not the form of the referring expression, but rather whether the referent is a human or not. Note that a referent doesn't have to be alive, or even real, to be classified as HUMAN. We can say that "Paul Bunyan" is human, just as "Paul Revere" is. The point is that when we think of these entities, we consider them as humans, not as animals or inanimate things. The same would go for "my future children" or "nobody".

Included under HUMAN are humanoid entities like gods, elves, ghosts, and androids. In most cases, the degree of autonomy and animation ascribed to these entities is similar to that of normal humans.

------5.6.3 ORG_H, ORG_M

The tags ORG_H and ORG_M are used to tag nominals that refer to "organizations", "institutions", or "collective entities" comprising groups of humans (not animals). For example, "Microsoft" often refers to a collection of people who together make up the company. We can talk about those people as a collective body by saying "Microsoft" or "the company" or "the Evil Empire".

There are two problems with the category ORG. The first is that it is difficult to define. The second is that nominals that can refer to organizations can almost always refer to other things, too. (See the discussion of VANIM above.)

To address the first problem, the first criterion for ORG is that humans be involved. Thus, a flock of goats is not an ORG, and neither is a suite of computer programs.

The second criterion is that more than one human be involved. But the referent "two boys" is still HUMAN. In order to be an ORG, there must be some degree of group identity. Here is a set of properties that (seem to) form an implicational hierarchy:

+/- chartered/official
+/- temporally stable
+/- collective voice/purpose
+/- collective action
+/- collective

Thus, if a group is + chartered, it will also be + temporally stable, and so on. So, "the US Senate" is + all of these, while "my neighbors" is only + collective.

The way we have been coding so far, any group of humans at + collective voice or higher has generally been coded as ORG. Thus, while "the posse" would be an ORG, "the mob" might not be, depending on whether you see the mob as having a collective purpose. "The crowd" would not be considered ORG, but rather simply HUMAN.
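The implicational hierarchy and the current cut-off for ORG can be sketched as follows. This is only an illustration with hypothetical function names: it assumes the coder has already judged the highest property a group has, and it codes any group of humans below the cut-off as HUMAN, as with "the crowd" above.

```python
# The properties from the hierarchy above, ordered from weakest to strongest.
ORG_PROPERTIES = [
    "collective",
    "collective action",
    "collective voice/purpose",
    "temporally stable",
    "chartered/official",
]


def implied_properties(highest):
    """List every property a group has, given the highest one it has."""
    # In an implicational hierarchy, a property implies all weaker ones.
    return ORG_PROPERTIES[:ORG_PROPERTIES.index(highest) + 1]


def group_animacy(highest):
    """Return "ORG" or "HUMAN" for a group of humans."""
    cutoff = ORG_PROPERTIES.index("collective voice/purpose")
    if ORG_PROPERTIES.index(highest) >= cutoff:
        return "ORG"
    return "HUMAN"


# "the US Senate" is chartered, and therefore has all five properties.
print(group_animacy("chartered/official"))  # ORG
# "my neighbors" is only + collective.
print(group_animacy("collective"))          # HUMAN
```

The cut-off is kept in one place so that it can be moved if the coding policy changes.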

To address the second problem with ORGs, once we have identified a nominal as potentially referring to an ORG, there is still the (very good) chance that a given instance of that nominal will refer to something abstract (hence NONCONC), to a place, or even to a non-ORG group of humans, as the following cases show.

As mentioned above, "Microsoft" is a noun that commonly refers to an organized group of people. However, it is also possible to use the word "Microsoft" to refer to an abstract entity, a corporation. When we say "Microsoft was founded in 1975", we are not referring so much to the people in the company as to the company as an inanimate entity or legal body. Such a case would be NONCONC. Similarly, when we say "the United States", we can be referring either to some of the people who make up the country/its government (as in "The United States has decided not to sign the treaty"), or to the place between the Atlantic and the Pacific (as in "That could never happen in the United States"). While the former would be ORG, the latter would be PLACE.

When it is difficult to resolve what the referent is, use VANIM (see above).

Note that most nominals referring to ORGs are singular nouns, as in "a family", "our team", or "the nation". If you encounter a singular nominal that refers to a group of people, it is likely to refer to an ORG. However, not all ORG-referring nominals are singular: "the Bahamas" and "the Beatles" could also be ORG.

One test that you can try in case of uncertainty is to try using the pronouns "they" and "it" in the same context. If you can say only "it", you're not dealing with an ORG. If you can say "it" or "they", it probably is an ORG. Thus:

The city is large. (I've visited it. / *I've visited them. -> PLACE)
The city elected a woman. (It made a good decision. / They made a good decision. -> ORG)

But this test is far from foolproof, so use it with care.

------5.6.4 ANIMAL_H, ANIMAL_M

The tags ANIMAL_H and ANIMAL_M are used to tag nominals that refer to non-human animals. These can be proper names ("Fido"), pronouns ("it"), kinship terms ("sire"), or common nouns ("critter").

To make it easy, we have decided that anything that is counted as in the animal kingdom in the Linnean taxonomy is coded as ANIMAL.

We have decided to include viruses and bacteria in this category, although we haven't encountered any references to them yet.

------5.6.5 PLACE_H, PLACE_M

The tags PLACE_H and PLACE_M are used for nominals that refer to a place *as a place*. "Florida" does not automatically get coded as PLACE, because it may refer to the citizens in Florida, the government of Florida, an electoral fiasco that occurred in Florida, etc.

Specifically, we code as PLACE something that is a stationary area on the surface of the planet (or above it) that is the potential location of a human. We believe that the human perspective is the only one likely to be relevant to nominal variation of the sort we are studying. Thus "a phonebooth" is a place, but "the drawer" is not.

We are trying to be fairly stringent about applying PLACE only when the nominal actually seems to refer to the place qua place. "My house" would be coded PLACE in the context "I was at my house" or "on my way to my house", but it would be coded as CONCRETE in the context "my house was built in 1960" or "my house is made of brick".

------5.6.6 TIME_H, TIME_M

The tags TIME_H and TIME_M mark nominals that refer to periods of time. These can be brief or long periods: "a moment", or "the eighteenth century". For now, we are being fairly liberal in applying this tag; we are aware that there may be a difference between specific periods of time ("last May") and non-specific periods ("it will take an hour"), but we want first to collect all the different sorts and look at them together.

The one thing that you should be careful not to code as TIME is references that use temporal terms to refer to events or abstractions, as in "his past", or "the future of the company". It may help to think of a timeline, and whether the nominal refers to something that could be located on the timeline. In this example,

that the new administration would react even tougher than the Eisenhower administration would during <"the formative period of the administration">

each instance of "administration" would be considered an ORG, but the phrase "the formative period of the administration" could be coded as TIME, since it could conceivably be placed on a timeline.

------5.6.7 CONCRETE_H, CONCRETE_M

The tags CONCRETE_H and CONCRETE_M are used for nominals with concrete referents, of course, but there are strict conditions placed on this. "Concrete" here means "prototypical concrete inanimate object or substance", so "stone", "milk", and "tree" would all qualify (plants are considered inanimate for our purposes). We have decided to restrict this category to *prototypical* concrete objects, because we reason that if any difference is seen between concrete and abstract referents (and there is no guarantee that there will be), it will be between those referents that are prototypically concrete and those that are fully abstract.

That said, we decided not to form a category "prototypically abstract", because we found it hard to define. So the distinction we make is between "prototypically concrete" and "not prototypically concrete", with the latter including all abstract referents, and even things that are not particularly abstract, but not particularly concrete, either.

The rule of thumb to apply is that we want to code as CONCRETE only *good* examples of concrete things. Here is a list of nominals we think are likely to refer to prototypically concrete objects (provided that they aren't being used in another sense, as "house" could be--see above):

apple, bed, car, cardboard, computer, cup, dirt, door, gasoline, gun, house, knee, knife, ladder, meat, moon, pen, river, rock, scissors, shirt, snow, spoon, stars, stick, sugar, sun, table, tree, water

And here are some nominals that we think are unlikely to do so:

air, atom, belch, chromosome, current, fog, gases, kink, lucky stars, molecule, pocket, protein, smoke, tide, voice, wind

This is not to say that chromosomes, for example, are not concrete. It simply means that they are not *prototypically* concrete.

Here is a test to apply: If you can *not* perceive something with one of the five senses, it is *not* CONCRETE. So radiation is not CONCRETE. Note that body parts, and all other parts of people or animals, will be coded as CONCRETE. Animacy is not "inherited" by things just by virtue of their being a part or a product of something animate. Thus, while "cow" is ANIMAL, "beef" is CONCRETE.

------5.6.8 NONCONC_H, NONCONC_M

The tags NONCONC_H and NONCONC_M are used for a large and mixed group of nominals. This is the default category for animacy, and we find that as many as 40% of Xs and 60% of Ys end up in this category.

All events are categorically NONCONC, so "laugh", "trip", "thought", etc. would be NONCONC.

Essentially, anything that is inanimate and not a good example of a concrete object is NONCONC.

Of course, if you are simply confused about an example and do not know how to label it, use OANIM.

------5.6.9 OANIM_H, OANIM_M, MIX_AN_H, MIX_AN_M

The tags OANIM_H and OANIM_M are to be used when you are really not sure about the animacy of the particular referent; see the discussion above in Section 5.6.1. As always, it is better to apply "other" tags than to apply tags with little confidence.

The tags MIX_AN_H and MIX_AN_M are used to tag coordinate Ys or Xs, respectively, when the animacy of the two conjuncts is not the same. This is exactly parallel to the case of MIX_ET_H and MIX_ET_M discussed in Section 5.4.7.

------5.6.10 Metaphor and anthropomorphization

Finally, we consider two of the more difficult general processes in reference: metaphor and anthropomorphization.

In metaphor, an expression is used to refer not to the literal (typical) referent, but to something else that is considered to be comparable in some way to the literal referent.

Here is how we deal with metaphor:

If you can figure out what the actual/current (not literal) referent of the nominal is, code for that. If you aren't sure, but you can narrow it down to a set of options that would all get the same tag, apply that tag. Thus, in

"The engine of our marriage has blown a head gasket."

it is not exactly clear what the actual referent of "the engine of our marriage" is, but whatever it is, it must be something abstract. Therefore, you could apply the tag NONCONC.

If, however, you have no idea what the actual referent of the nominal is, then code for the literal referent. So, if you encounter "This restaurant is the cat's pyjamas!" you may have little idea what "the cat's pyjamas" is intended to refer to, and even less what "the cat" is. Therefore, you could (if you were not to code this as an idiom) code this as CONCRETE_H ANIMAL_M. Admittedly, this does seem less than ideal, but in practice we have found very few such cases.

Now we turn to anthropomorphization, which is an ascribing of human traits to something that is not human (sometimes called personification). The most common case is treating pet animals like humans, but there are also many literary passages in which, say, the forest is referred to as hearing, looking, or even thinking. For such cases, we have the tags ANTHRO_H and ANTHRO_M.

In a case of anthropomorphization, simply code the nominal for what the referent *really* is (e.g., a forest or a dog), and add the corresponding ANTHRO tag (ANTHRO_H or ANTHRO_M), which is not mutually exclusive with any other tags. This enables us to keep track of these somewhat problematic examples without necessarily excluding them from the database.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 5.7 Weight

[Tags discussed in this section: XW0, XW1, (...) XW20, XWover20, YW0, YW1, (...) YW20, YWover20]

One factor that doubtless conditions a speaker's decision to use the "X's Y" or the "Y of X" form is the length--also called the weight--of the particular X and the particular Y. As discussed in Section 4.5.2, we are interested in studying the respective weights of the X and Y of each example, to see how large a role this factor plays in the variation between the two forms.

Deciding how to measure "syntactic weight" is not a trivial matter. There are many possible measures of weight discussed in the literature: number of maximal projections, number of nodes, number of words, etc. It has been found that in general, all these different measures give roughly equivalent results (see Wasow 2002). Therefore, we adopt a measure that allows us to code for weight *automatically*: number of words.

The program autocoder_weight tags each example in a file with two tags: one from the first column below, and one from the second column.

MODIFIER    HEAD
--------    --------
XW0         YW0
XW1         YW1
XW2         YW2
...         ...
XW19        YW19
XW20        YW20
XWover20    YWover20

That is, each example is tagged with the number of words in X and the number of words in Y, whether the construction is XS or OFX. The tag YW0 refers to a null head, which is a subset of headless examples (see Section 5.4.8) in which there is literally *no content*, not just no NP. Thus, (a) below would be YW0 (and XW1), while (b) would be YW1 (and XW3):

(a) an absolutely unalterable right to what is <"his"> .

(b) substitute the standards of the peer group for <"those of parents and teachers">

The tag XW0 would represent a truly null modifier; so far, we have not encountered one (see the discussion in Section 5.4.9).
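The mapping from a word count to a weight tag can be sketched as follows. This is only an illustration of the tag scheme in the table above, not the code of autocoder_weight itself; the function name weight_tag and its arguments are hypothetical.

```python
def weight_tag(role: str, n_words: int) -> str:
    """Return the weight tag for a modifier ("X") or a head ("Y").

    Hypothetical helper: counts from 0 to 20 get their own tag
    (e.g. XW3, YW0); anything longer falls into the "over20" tag.
    """
    if role not in ("X", "Y"):
        raise ValueError("role must be 'X' (modifier) or 'Y' (head)")
    if n_words < 0:
        raise ValueError("a word count cannot be negative")
    if n_words > 20:
        return f"{role}Wover20"
    return f"{role}W{n_words}"


# Example (a) above: modifier "his" (1 word), null head (0 words).
print(weight_tag("X", 1), weight_tag("Y", 0))
# Example (b) above: modifier "parents and teachers", head "those".
print(weight_tag("X", 3), weight_tag("Y", 1))
```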

The autocoder_weight works very reliably, but its sophistication is limited: All it does is count the orthographic words within the example brackets (<" ">) before and after the target word "of_PREP###", in the case of OFX examples. In the case of XS examples, it includes all words up to and including the target word (e.g., "John's_Ngen###") in the modifier, and everything after that in the head.
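The counting procedure just described can be sketched roughly as below. This is our own illustration under stated assumptions, not the actual autocoder_weight program: we assume the material inside the brackets has already been split into a list of word tokens (punctuation removed), that the target word is the single token ending in "###", and that the construction type (OFX or XS) is supplied by the caller. The function name split_weights is hypothetical.

```python
def split_weights(tokens, construction):
    """Return (x_weight, y_weight) for one bracketed example.

    tokens       -- list of word tokens inside the example brackets;
                    exactly one token (the target) must end in "###"
    construction -- "OFX" ("Y of X") or "XS" ("X's Y")
    """
    targets = [i for i, tok in enumerate(tokens) if tok.endswith("###")]
    if len(targets) != 1:
        raise ValueError("expected exactly one target token ending in '###'")
    i = targets[0]

    if construction == "OFX":
        # Head precedes the target "of"; modifier follows it.
        # The target "of" itself is counted in neither.
        y_weight = len(tokens[:i])
        x_weight = len(tokens[i + 1:])
    elif construction == "XS":
        # Modifier runs up to and including the target word;
        # everything after it is the head.
        x_weight = len(tokens[:i + 1])
        y_weight = len(tokens[i + 1:])
    else:
        raise ValueError("construction must be 'OFX' or 'XS'")
    return x_weight, y_weight


# "those of parents and teachers" as an OFX example
print(split_weights(["those", "of_PREP###", "parents", "and", "teachers"], "OFX"))
```

In this sketch only the target token is shown with its POS tag, for readability; the other tokens are given as bare words.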

Note that it is for precisely this reason that we exclude material extraposed from the head from the brackets (though we tag it with EXTRAP_H, as discussed in Section 4). In this example,

[...] there are <"some members of <"our congressional delegation in Washington">"> who would like to see it ( the resolution ) passed .

we put the final bracket associated with "of" before the "who would like...", since the noun phrase is not "our congressional delegation in Washington who would like to see it (the resolution) passed", but rather (reconstructed) "some members who would like to see it (the resolution) passed". This ensures that the relative clause is not erroneously counted as part of X.

However, please note that as things stand, the weight of the extraposed material is not counted as part of the weight of Y, either. We may develop a more sophisticated view of this later.

The autocoder considers to be a "word" anything that has a POS tag. It excludes punctuation, but not letters (so "P. T. Barnum" would be three words, as would "Mr. F. Zappa"). Again, the example's target "of" does not count as either head or modifier, although any other "of"s in the example will be counted:

<"the experiences of <"the first few years of <"the child's life"><#XW2 YW1#>"><#XW3 YW4#>"><#XW8 YW2#>

Other things that are counted as a word are:

copular "'s" ("cries of he 's right" = 5 words)

contracted "have" ("I 've tried" = 3 words)

contracted "not" ("he has n't yet learned" = 5 words)

any string of numerals occurring together ("1923" = 1 word)

strings of words that Karlsson's ENGCG tagger considers to be a single word ("out+of_prep", "at+least_adv")

This final group is the only one that is potentially problematic. We may decide in the future to count each of the elements of such "words", but for now we are sticking with the "one POS tag, one word" rule.
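The word-counting rule can be approximated with a short sketch. Note the assumptions: the real autocoder decides what is a word by looking for a POS tag, whereas this illustration works on untagged, whitespace-separated text as it appears in the examples above and simply treats any token containing a letter or digit as a word. The function name count_words is hypothetical.

```python
def count_words(text: str) -> int:
    """Count whitespace-separated tokens that contain a letter or digit.

    Approximation only: stand-alone punctuation tokens such as "." or "("
    are skipped, while initials, contracted forms, numerals, and joined
    forms such as "out+of" each count as one word.
    """
    return sum(
        1
        for token in text.split()
        if any(ch.isalnum() for ch in token)
    )


for sample in [
    "cries of he 's right",
    "I 've tried",
    "he has n't yet learned",
    "1923",
    "P. T. Barnum",
    "out+of",
]:
    print(sample, "=", count_words(sample))
```

Under this approximation, "out+of" counts as one word simply because it is written as a single token, which mirrors the "one POS tag, one word" rule without actually consulting the tagger's output.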