Detecting the Countability of English Compound Nouns Using Web-Based Models
Jing Peng and Kenji Araki
Language Media Laboratory, Hokkaido University
Kita-14, Nishi-9, Kita-ku, Sapporo, JAPAN

Abstract

In this paper, we propose an approach for detecting the countability of English compound nouns that treats the web as a large corpus of words. We classify compound nouns into three classes: countable, uncountable, and plural only. Our detection algorithm is based on simple, viable n-gram models whose parameters can be obtained using the WWW search engine Google. The detection thresholds are optimized on a small training set. Finally, we show experimentally that our algorithm based on these simple models achieves promising results, with a precision of 89.2% on the total test set.

1 Introduction

In English, a noun can be countable or uncountable. Countable nouns can be "counted"; they have a singular and a plural form. For example: an apple, two apples, three apples. Uncountable nouns cannot be counted. This means they have only a singular form, such as water, rice, wine. Countability is the semantic property that determines whether a noun can occur in singular and plural forms. Information about the countability of individual nouns is easy to obtain from grammar books or dictionaries. Several researchers have explored automatically learning the countability of English nouns (Bond and Vatikiotis-Bateson, 2002; Schwartz, 2002; Baldwin and Bond, 2003). However, all the proposed approaches focused on learning the countability of individual nouns.

A compound noun is a noun that is made up of two or more words. Most compound nouns in English are formed by nouns modified by other nouns or adjectives. In this paper, we concentrate solely on compound nouns made up of only two words, as they account for the vast majority of compound nouns. There are three forms of compound words: the closed form, in which the words are melded together, such as “songwriter”, “softball”, “scoreboard”; the hyphenated form, such as “daughter-in-law”, “master-at-arms”; and the open form, such as “post office”, “real estate”, “middle class”.

Compound words create special problems when we need to know their countability. According to the “Guide to English Grammar and Writing”, the base element within a compound noun generally functions as a regular noun with respect to countability, as in “bedrooms”. However, this rule is highly irregular. Some uncountable nouns occur in their plural forms within compound nouns, such as “mineral waters” (water is usually considered an uncountable noun). The countability of some words changes when they occur in different compound nouns. “Rag” is a countable noun, while “kentish rag” is uncountable and “glad rags” is plural only. “Wages” is plural only, but “absolute wage” and “standard wage” are countable. It is therefore clear that determining the countability of a compound noun should take all its elements into account, rather than considering the base word alone.

The number of compound nouns is so large that it is impossible to collect all of them in one dictionary, which would also need to be updated frequently, because newly coined words are being created continuously, and most of them are compound nouns, such as “leisure sickness” and “Green famine”.

Knowledge of the countability of compound nouns is very important in English text generation. This research is motivated by our project on post-editing translation candidates in machine translation. Baldwin and Bond (2003) also mention that many languages, such as Chinese and Japanese, do not mark countability, so determining the appropriate form of a translation candidate depends on knowledge of countability. For example, the correct translation of “发育性痛”¹ is “growing pains”, not “growing pain”.

¹ “发育性痛” (fa yu xing tong) is a Chinese compound noun meaning “growing pains”.

In this paper, we learn the countability of English compound nouns using the WWW as a large corpus. Many compound nouns, especially relatively new ones such as “genetic pollution”, have not yet reached any dictionaries. We believe that using web-scale data can be a viable alternative that avoids the sparseness problem of smaller corpora. We classify compound nouns into three classes: countable (e.g., bedroom), uncountable (e.g., cash money), and plural only (e.g., crocodile tears). To detect which class a compound noun belongs to, we propose some simple, viable n-gram models, such as freq(N) (the frequency of the singular form of the noun), whose parameter values (web hits of literal queries) can be obtained with the help of the WWW search engine Google. The detection thresholds (for example, a noun whose parameter value is above the threshold is considered plural only) are estimated on a small countability-tagged training set. Finally, we evaluate our detection approach on a test set and show that our algorithm based on these simple models achieves promising results.

Querying the WWW adds noise to the data, and we certainly lose some precision compared to supervised statistical models, but we assume that the size of the WWW will compensate for the rough queries. Keller and Lapata (2003) also showed evidence of the reliability of web counts for natural language processing. Lapata and Keller (2005) also investigated the countability learning task for nouns. However, they only distinguish between countable and uncountable individual nouns. Their best model is the determiner-noun model, which achieves 88.62% on countable and 91.53% on uncountable nouns.

In section 2 of the paper, we describe the main approach used in the paper. The preparation of the training and test data is introduced in section 3. The details of the experiments and results are presented in section 4. Finally, in section 5 we list our conclusions.

2 Our approach

We classify compound nouns into three classes: countable, uncountable, and plural only. Baldwin and Bond (2003) classified individual nouns into four possible classes. Besides the classes mentioned above, they also considered bipartite nouns. These words can only be plural when they head a noun phrase (trousers), but are singular when used as a modifier (trouser leg). We did not take this class into account in this paper, because bipartite words are very rare among compound nouns.

[Figure 1 (flowchart): C-noun → is F(Ns) >> F(N)? Y: plural only; N → is F(N) >> F(Ns)? Y: uncountable; N: countable.]
Figure 1. Detecting processing flow

For a plural-only compound noun, we assume that the frequency of the word in the plural form is much larger than that in the singular form, while for an uncountable noun, the frequency in the singular form is much larger than that in the plural form. The main processing flow is shown in Figure 1. In the figure, “C-noun” and “Ns” denote the compound noun and the plural form of the word, respectively. “F(Ns)>>F(N)” means that the frequency of the plural form of the noun is much larger than that of the singular form.

Our approach for detecting countability is based on some simple unsupervised models.

\[ \frac{f(\mathrm{Ns})}{f(\mathrm{N})} \ge \theta \qquad (1) \]

In (1), we compare the frequency of a word in the plural form with that in the singular form. θ is the detection threshold above which the word can be considered plural only.

\[ \frac{f(\text{much}, \mathrm{N})}{f(\text{many}, \mathrm{Ns})} \ge \theta \qquad (2) \]

In (2), we compare the frequency of a word in the singular form co-occurring with the determiner “much” with the frequency of the word in the plural form co-occurring with “many”. If the ratio is above θ, the word can be considered uncountable. Model (2) is used to distinguish between countable and uncountable compound nouns.

\[ \frac{f(\mathrm{Ns}, \text{are})}{f(\mathrm{N}, \text{is})} \ge \theta \qquad (3) \]

Model (3), which compares the frequencies of noun-be pairs (e.g., f(“account books are”) and f(“account book is”)), is used to distinguish between plural-only and countable compound nouns.

With the help of the WWW search engine Google, the frequencies (web hits) in the models can be obtained using quoted n-gram queries (“soft surroundings”). Although Keller and Lapata (2002) showed experimentally that a web-based approach can overcome data sparseness for bigrams, the problem still exists in our experiments.

...ralize the last word, e.g., grown-ups, stand-bys.

Although the rules do not fit every compound noun, all the countable compound nouns in our experimental data follow them. We are confident that the rules are viable for most countable compound nouns.

Although we used Google as our search engine, we did not use the Google Web API service in our implementation, because Google limits it to 1000 automated queries per day. As we only need the number of web hits returned for each search query, we extracted the hit counts directly from the result pages.

3 Experimental Data

The main experimental data is from Webster’s New International Dictionary (Second Edition). The list of compound words in the dictionary is available on the Internet.² We selected compound words randomly from the list and kept the nouns, because the word list mixes compound verbs and adjectives with nouns. We repeated the process several times until we had obtained our experimental data. We collected 3000 words for the training set, which is used to optimize the detection thresholds, and 500 words for the test set, which is used to evaluate our approach. In the sets we added 180 newly coined compound nouns (150 for training; 30 for test). These relatively new words that were created over the past seven years have not yet reached any dictionari-