
Detecting the Countability of English Compound Nouns Using Web-Based Models


Jing Peng and Kenji Araki
Language Media Laboratory, Hokkaido University
Kita-14, Nishi-9, Kita-ku, Sapporo, JAPAN
[email protected], [email protected]

Abstract

In this paper, we propose an approach for detecting the countability of English compound nouns, treating the web as a large corpus. We classify compound nouns into three classes: countable, uncountable, and plural only. Our detection algorithm is based on simple, viable n-gram models whose parameters can be obtained through the WWW search engine Google. The detection thresholds are optimized on a small training set. Finally, we show experimentally that our algorithm based on these simple models yields promising results, with a precision of 89.2% on the total test set.

1 Introduction

In English, a noun can be countable or uncountable. Countable nouns can be "counted"; they have a singular and a plural form, for example: an apple, two apples, three apples. Uncountable nouns cannot be counted; they have only a singular form, such as water, rice, wine. Countability is the semantic property that determines whether a noun can occur in singular and plural forms. We can obtain information about the countability of individual nouns easily from grammar books or dictionaries. Several researchers have explored automatically learning the countability of nouns (Bond and Vatikiotis-Bateson, 2002; Schwartz, 2002; Baldwin and Bond, 2003). However, all the proposed approaches focused on learning the countability of individual nouns.

A compound is a noun that is made up of two or more words. Most compound nouns in English are formed by nouns modified by other nouns or adjectives. In this paper, we concentrate solely on compound nouns made up of only two words, as they account for the vast majority of compound nouns. There are three forms of compound words: the closed form, in which the words are melded together, such as "songwriter", "softball", "scoreboard"; the hyphenated form, such as "daughter-in-law", "master-at-arms"; and the open form, such as "post office", "real estate", "middle class".

Compound words create special problems when we need to know their countability. According to the "Guide to English Grammar and Writing", the base element within a compound noun generally functions as a regular noun with respect to countability, as in "bedrooms". However, this rule is highly irregular. Some uncountable nouns occur in their plural forms within compound nouns, such as "mineral waters" ("water" is usually considered an uncountable noun). The countability of some words also changes when they occur in different compound nouns: "rag" is a countable noun, while "kentish rag" is uncountable and "glad rags" is plural only; "wages" is plural only, but "absolute wage" and "standard wage" are countable. So it is obvious that determining the countability of a compound noun should take all its elements into account, not rely solely on the base word.

The number of compound nouns is so large that it is impossible to collect all of them in one dictionary.

Such a dictionary would also need to be updated frequently, because newly coined words are created continuously and most of them are compound nouns, such as "leisure sickness" and "Green famine".

Knowledge of the countability of compound nouns is very important in English text generation. This research is motivated by our project on post-editing translation candidates in machine translation. Baldwin and Bond (2003) also mention that many languages, such as Chinese and Japanese, do not mark countability, so determining the appropriate form of a translation candidate depends on knowledge of countability. For example, the correct translation of "发育性痛"¹ is "growing pains", not "growing pain".

¹ "发育性痛" (fa yu xing tong) is a Chinese compound noun meaning "growing pains".

In this paper, we learn the countability of English compound nouns using the WWW as a large corpus. Many compound nouns, especially relatively new words such as "genetic pollution", have not yet reached any dictionary. We believe that using web-scale data can be a viable alternative that avoids the sparseness problem of smaller corpora. We classify compound nouns into three classes: countable (e.g., bedroom), uncountable (e.g., cash money), and plural only (e.g., crocodile tears). To detect which class a compound noun belongs to, we propose some simple, viable n-gram models, such as freq(N) (the frequency of the singular form of the noun), whose parameter values (web hits of literal queries) can be obtained with the help of the WWW search engine Google. The detection thresholds (a noun whose parameter value is above the threshold is considered plural only) are estimated on a small countability-tagged training set. Finally, we evaluated our detection approach on a test set and showed that the algorithm based on these simple models performs promisingly.

Querying the WWW adds noise to the data, so we certainly lose some precision compared to supervised statistical models, but we assume that the size of the WWW will compensate for the rough queries. Keller and Lapata (2003) also showed evidence of the reliability of web counts for natural language processing. Lapata and Keller (2005) also investigated the countability learning task, but they only distinguish between countable and uncountable for individual nouns. Their best model is the determiner-noun model, which achieves 88.62% on countable and 91.53% on uncountable nouns.

In section 2 of the paper, we describe the main approach used in the paper. The preparation of the training and test data is introduced in section 3. The details of the experiments and results are presented in section 4. Finally, in section 5 we list our conclusions.

2 Our approach

We classify compound nouns into three classes: countable, uncountable and plural only. Baldwin and Bond (2003) classified individual nouns into four possible classes. Besides the classes mentioned above, they also considered bipartite nouns, which can only be plural when they head a noun phrase (trousers) but singular when used as a modifier (trouser leg). We did not take this class into account in this paper, because bipartite words are very rare among compound nouns.

[Figure 1. Detecting processing flow: a compound noun (C-noun) is first tested with F(Ns)>>F(N), which indicates plural only; otherwise it is tested with F(N)>>F(Ns), which indicates uncountable; otherwise it is countable.]

For a plural only compound noun, we assume that the frequency of the word's occurrence in the plural form is much larger than that in the singular form, while for an uncountable noun, the frequency in the singular form is much larger than that in the plural form. The main processing flow is shown in Figure 1. In the figure, "C-noun" and "Ns" mean the compound noun and the plural form of the word respectively. "F(Ns)>>F(N)" means that the frequency of the plural form of the noun is much larger than that of the singular form.
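To make the flow of Figure 1 concrete, the small Python sketch below casts it as a function. This is an illustration only, not the authors' code: get_hits(query) is a hypothetical wrapper around a web search that returns the number of pages found for a quoted query, the "much larger" tests are treated as ratio thresholds, and the default threshold values simply echo the values reported later in section 4.

def classify_compound(noun, noun_plural, get_hits,
                      theta_plural=8.0, theta_uncount=24.0):
    # Sketch of the Figure 1 flow. Zero hit counts are smoothed to 0.01,
    # as described for the models in section 2.
    f_n = get_hits('"%s"' % noun) or 0.01          # F(N), singular form
    f_ns = get_hits('"%s"' % noun_plural) or 0.01  # F(Ns), plural form
    if f_ns / f_n >= theta_plural:     # F(Ns) >> F(N)
        return "plural only"
    if f_n / f_ns >= theta_uncount:    # F(N) >> F(Ns)
        return "uncountable"
    return "countable"

For example, classify_compound("crocodile tear", "crocodile tears", get_hits) should return "plural only" whenever the plural form dominates the hit counts.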

Our approach for detecting countability is based on some simple unsupervised models.

f(Ns) / f(N) ≥ θ    (1)

In (1), we use the frequency of a word in the plural form against that in the singular form. θ is the detection threshold above which the word is considered plural only.

f(much, N) / f(many, Ns) ≥ θ    (2)

In (2), we use the frequency of the word in the singular form co-occurring with the determiner "much" against the frequency of the word in the plural form with "many"; if the ratio is above θ, the word is considered uncountable. Model (2) is used to distinguish between countable and uncountable compound nouns.

f(Ns, are) / f(N, is) ≥ θ    (3)

Model (3), which compares the frequencies of noun-be pairs (e.g., f("account books are") and f("account book is")), is used to distinguish plural only and countable compound nouns.

With the help of the WWW search engine Google, the frequencies (web hits) in the models can be obtained using quoted n-gram queries ("soft surroundings"). Although Keller and Lapata (2002) showed experimentally that a web-based approach can overcome data sparseness for bigrams, the problem still exists in our experiments. When the number of pages found is zero, we smooth the zero hits to 0.01.

Countable compound nouns create some problems when we need to pluralize them. Since no real rules exist for how to pluralize all such words, we summarized some tendencies from the "Guide to English Grammar and Writing". We processed our experimental data following the rules below.

1. Pluralize the last word of the compound noun, e.g., bedrooms, film stars.
2. When "woman" or "man" is a modifier in the compound noun, pluralize both words, e.g., women drivers.
3. When the compound noun has the form "noun + preposition (or prepositional phrase)", pluralize the noun, e.g., fathers-in-law.
4. When the compound noun has the form "verb (or past participle) + adverb", pluralize the last word, e.g., grown-ups, stand-bys.
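For illustration, these rules can be sketched as a small helper. This is a rough approximation under stated assumptions: a naive "add -s" pluralizer with a tiny irregular table, a short assumed preposition list, and no part-of-speech information, so rules 3 and 4 are only separated heuristically (a real implementation would need to know whether the first element is a noun or a verb).

def pluralize_word(w):
    # Naive pluralizer for this sketch; handles the irregular forms
    # needed by the examples above (woman -> women).
    irregular = {"woman": "women", "man": "men"}
    return irregular.get(w, w + "s")

PREPOSITIONS = {"in", "at", "of"}  # assumed small list for this sketch

def pluralize_compound(compound):
    # Apply rules 1-4 to a compound written with spaces or hyphens.
    sep = "-" if "-" in compound else " "
    words = compound.split(sep)
    if any(w in ("woman", "man") for w in words[:-1]):        # rule 2
        words = [pluralize_word(w) if w in ("woman", "man") else w
                 for w in words[:-1]] + [pluralize_word(words[-1])]
    elif len(words) > 1 and words[1] in PREPOSITIONS:         # rule 3
        words = [pluralize_word(words[0])] + words[1:]
    else:                                                     # rules 1 and 4
        words = words[:-1] + [pluralize_word(words[-1])]
    return sep.join(words)

Under these assumptions, pluralize_compound("film star") gives "film stars", pluralize_compound("father-in-law") gives "fathers-in-law", and pluralize_compound("stand-by") gives "stand-bys".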
Although the rules do not fit every compound noun, in our experimental data all the countable compound nouns follow them, so we are confident that the rules are viable for most countable compound nouns.

Although we used Google as our search engine, we did not use the Google Web API service for the implementation, because Google limits it to 1000 automated queries per day. As we only need the number of web hits returned for each search query, we extracted the hit counts directly from the result pages.

3 Experimental Data

The main experimental data is from Webster's New International Dictionary (Second Edition); the list of compound words in the dictionary is available on the Internet². We selected compound words randomly from the list and kept only the nouns, since the word list also mixes compound adjectives and other word classes with the nouns. We repeated this process several times until we had our experimental data. We collected 3000 words for the training set, which is used for optimizing the detection thresholds, and 500 words for the test set, which is used to evaluate our approach. To these sets we added 180 newly coined compound nouns (150 for training; 30 for test). These relatively new words, created over the past seven years, have not yet reached any dictionary³.

² The compound word list is available from http://www.puzzlers.org/wordlists/dictinfo.php.
³ The new words used in the paper can be found at http://www.worldwidewords.org/genindex-pz.htm

Countability    Training set    Test set
Plural only     80              21
Countable       2154            352
Uncountable     766             127
Total           3000            500

Table 1. The make-up of the experimental data

We manually annotated the countability of these compound nouns as plural only, countable, or uncountable. An English teacher who is a native speaker checked and corrected the annotations. The make-up of the experimental data is listed in Table 1.

4 Experiments and Results

4.1 Detecting plural only compound nouns

Plural only compound nouns, which have no singular form, always occur in the plural, so the frequency of their singular forms should be zero. Considering the noise introduced by the search engine, we use models (1) and (3) in turn to detect plural only nouns. We detect plural only compound nouns with the following algorithm (Figure 2), which distinguishes between plural only and non-plural-only compounds.

    if (f(Ns) / f(N) ≥ θ1)
        then plural only;
    else if (f(Ns, are) / f(N, is) ≥ θ2)
        then plural only;
    else
        countable or uncountable;

Figure 2. Detecting algorithm for plural only

The problem is how to decide the two thresholds. We performed an exhaustive search to adjust θ1 and θ2, optimizing on the training set: with 0 ≤ θ1, θ2 ≤ 20, all possible pairs of values are tried with a stepsize of 1.

Recall = A / AB    (4)
Precision = A / AC    (5)
F-score = (2 × Precision × Recall) / (Precision + Recall)    (6)

We use Recall and Precision to evaluate the performance with the different threshold pairs. The fundamental Recall/Precision definitions are adapted from IE system evaluation; we borrowed these measures, using the following definitions for our evaluation. For one experiment with a certain threshold pair, A stands for the number of plural only compound nouns found correctly; AB stands for the total number of plural only compound nouns in the training set (80 words); AC stands for the total number of compound nouns found as plural only. Recall and Precision are defined in (4) and (5). We also introduce the F-score when Recall and Precision need to be considered at the same time; in this paper, the F-score is calculated according to (6).

Figure 3 shows the performance evaluated by the three measures when θ1 = 8 and 0 ≤ θ2 ≤ 10 with a stepsize of 1. We set θ2 to 5 for the later tests, and accordingly the values of Recall, Precision and F-score are 91.25%, 82.95% and 87.40% respectively.

[Figure 3. The Recall/Precision/F-score graph (θ1 = 8 and 0 ≤ θ2 ≤ 10); threshold θ2 on the x-axis.]

4.2 Detecting uncountable compound nouns

Uncountable compound nouns, which have no plural form, always occur in the singular.

    if (N is not plural only)
        then if (f(N) / f(Ns) ≥ θ3)
            then uncountable;
        else if (f(much, N) / f(many, Ns) ≥ θ4)
            then uncountable;
        else
            countable;

Figure 4. Detecting algorithm for uncountable compound nouns

The algorithm for detecting uncountable compound nouns is shown in Figure 4. Using models (1) and (2), we attempted to make full use of the characteristic of uncountable compound nouns, namely that the frequency of their occurrence in the singular form is much larger than that in the plural form.
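Putting Figures 2 and 4 together, the sketch below shows how the models and thresholds might be combined into a single classifier. It is an illustration under the same assumptions as before: get_hits(query) is a hypothetical page-count function, zero counts are smoothed to 0.01, and the default thresholds are the values reported in section 4 (θ1 = 8, θ2 = 5, θ3 = 24, θ4 = 2).

def hit_count(query, get_hits):
    # Number of pages found for a quoted query; zero hits smoothed to 0.01.
    return get_hits('"%s"' % query) or 0.01

def detect_countability(n, ns, get_hits, t1=8, t2=5, t3=24, t4=2):
    # n / ns: singular and plural forms, e.g. "account book" / "account books".
    f_n = hit_count(n, get_hits)
    f_ns = hit_count(ns, get_hits)

    # Figure 2: plural only detection with models (1) and (3).
    if f_ns / f_n >= t1:
        return "plural only"
    if hit_count(ns + " are", get_hits) / hit_count(n + " is", get_hits) >= t2:
        return "plural only"

    # Figure 4: uncountable detection with models (1) and (2).
    if f_n / f_ns >= t3:
        return "uncountable"
    if hit_count("much " + n, get_hits) / hit_count("many " + ns, get_hits) >= t4:
        return "uncountable"

    return "countable"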

The method used to obtain the optimal thresholds θ3 and θ4 is the same as in section 4.1. We set θ3 to 24 and θ4 to 2, and the resulting values of Recall, Precision and F-score are 88.38%, 80.27% and 84.13% respectively.

4.3 Performance on the test suite

We evaluated our complete algorithm with the four thresholds (θ1 = 8, θ2 = 5, θ3 = 24, θ4 = 2) on the test set; the detection results are summarized in Table 2. There are 352 countable compound nouns in our test set, so classifying all test words as countable would already give an accuracy of 70.4%; we used this as our baseline. The accuracy on the total test data is 89.2%, which significantly outperforms the baseline. For the 30 newly coined compound nouns, the detection accuracy is 100%. This can be explained by their infrequency: newly coined words are less prone to producing noisy data than other words simply because they do not occur as regularly.

             Correct   Incorrect   Recall    Precision   F-score
Plural only  18        4           85.71%    81.81%      83.71%
Countable    320       22          90.90%    93.57%      92.22%
Uncountable  108       28          85.04%    79.41%      82.15%
Total        446       54          89.2%     89.2%       89.2%

Table 2. The accuracy on the test suite
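As a closing sketch of how the thresholds in section 4 were tuned, the code below performs the exhaustive search described in section 4.1: it tries all (θ1, θ2) pairs in [0, 20] with a stepsize of 1, scores each pair with Recall, Precision and F-score as in (4)-(6), and keeps the best pair. The detector function and the training lists are assumed inputs (e.g., the Figure 2 test run over pre-fetched hit counts); this is not the authors' tuning script.

def prf(found, gold):
    # Equations (4)-(6): found = set of nouns flagged plural only (AC),
    # gold = set of true plural only nouns in the training set (AB).
    correct = len(found & gold)                       # A
    recall = correct / len(gold) if gold else 0.0
    precision = correct / len(found) if found else 0.0
    if precision + recall == 0:
        return recall, precision, 0.0
    return recall, precision, 2 * precision * recall / (precision + recall)

def tune_thresholds(training_words, gold_plural_only, detect_plural_only):
    # Exhaustive search over 0 <= theta1, theta2 <= 20 with stepsize 1,
    # keeping the pair with the best F-score on the training set.
    best_f, best_pair = 0.0, None
    for t1 in range(0, 21):
        for t2 in range(0, 21):
            found = {w for w in training_words if detect_plural_only(w, t1, t2)}
            _, _, f = prf(found, gold_plural_only)
            if f > best_f:
                best_f, best_pair = f, (t1, t2)
    return best_pair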

5 Conclusion

From the results, we have shown that simple unsupervised web-based models can achieve promising results on the test data. Since we only roughly adjusted the thresholds with a stepsize of 1, better performance can be expected with a finer stepsize, such as 0.1.

It is unreasonable to compare the detection results for individual and compound nouns directly, since with web-based models compound nouns made up of two or more words are more likely to be affected by data sparseness, while individual nouns are more prone to producing noisy data because of their high occurrence frequencies.

In any case, using the WWW is an exciting direction for NLP, and eliminating noisy data is the key to improving web-based methods. Our next step is to evaluate the Internet resource and distinguish useful data from noise.

References

Baldwin, Timothy and Francis Bond. 2003. Learning the countability of English nouns from corpus data. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics. Sapporo, Japan, 463-470.

Bond, Francis and Caitlin Vatikiotis-Bateson. 2002. Using an ontology to determine English countability. In Proceedings of the 19th International Conference on Computational Linguistics (COLING 2002), Taipei, Taiwan.

Keller, F., Lapata, M. and Ourioupina, O. 2002. Using the web to overcome data sparseness. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Philadelphia, 230-237.

Keller, F. and Lapata, M. 2003. Using the web to obtain frequencies for unseen bigrams. Computational Linguistics 29(3), 459-484.

Lapata, M. and Keller, F. 2004. The web as a baseline: Evaluating the performance of unsupervised web-based models for a range of NLP tasks. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics. Boston.

Lapata, M. and Keller, F. 2005. Web-based models for natural language processing. To appear in ACM Transactions on Speech and Language Processing.

Schwartz, Lane O.B. 2002. Corpus-based acquisition of head noun countability features. Master's thesis, Cambridge University, Cambridge, UK.

Guide to English Grammar and Writing. http://cctc.commnet.edu/grammar/
