
Corpus-Based Learning of Compound Noun Indexing*

Byung-Kwan Kwak, Jee-Hyub Kim, and Geunbae Lee†
NLP Lab., Dept. of CSE, Pohang University of Science & Technology (POSTECH)
{nerguri,gblee}@postech.ac.kr

Jung Yun Seo
NLP Lab., Dept. of Computer Science, Sogang University
[email protected]

Abstract

In this paper, we present a corpus-based learning method that can index diverse types of compound nouns using rules automatically extracted from a large tagged corpus. We develop an efficient way of extracting the compound noun indexing rules automatically and perform extensive experiments to evaluate our indexing rules. The automatic learning method shows about the same performance as the manual linguistic approach, but is more portable and requires no human effort. We also evaluate seven different filtering methods in terms of both effectiveness and efficiency, and present a new method that solves the problems of compound noun over-generation and data sparseness in statistical compound noun processing.

1 Introduction

Compound nouns are more specific and expressive than simple nouns, so they are more valuable as index terms and can increase the precision in search experiments. There are many definitions of the compound noun, and they cause ambiguities as to whether a given continuous noun sequence is a compound noun or not. We therefore need a clean definition of compound nouns in terms of information retrieval, so we define a compound noun as "any continuous noun sequence that appears frequently in documents."¹

In Korean documents, compound nouns are represented in various forms (shown in Table 1), which makes it difficult to index all types of compound nouns. There has been much work on compound noun indexing, but existing methods are still limited in their coverage of compound noun types and require much linguistic knowledge to accomplish this goal. In this paper, we propose a corpus-based learning method for compound noun indexing which can extract the rules automatically with little linguistic knowledge.

Table 1: Various forms of the Korean compound noun, with regard to "jeong-bo geom-saeg (information retrieval)"

  jeong-bo-geom-saeg                 (information-retrieval)
  jeong-bo geom-saeg                 (information retrieval)
  jeong-bo-eui geom-saeg             (retrieval of information)
  jeong-bo-leul geom-saeg-ha-neun    (retrieving information)
  jeong-bo-geom-saeg si-seu-tem      (information-retrieval system)

As the number of documents grows, retrieval efficiency becomes as important as effectiveness. To increase efficiency, we focus on reducing the number of indexed spurious compound nouns. We perform experiments on several filtering methods to find the algorithm that can reduce spurious compound nouns most efficiently.

* This research was supported by KOSEF special purpose basic research (1997.9 - 2000.8 #970-1020-301-3).
† Corresponding author.
¹ The frequency threshold can be adjusted according to application systems.
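The operational definition above can be sketched directly: given POS-tagged documents, collect continuous noun sequences and keep those whose frequency clears a threshold. This is an illustrative sketch, not the paper's implementation; the `(word, tag)` representation and the noun-tag names are assumptions.

```python
from collections import Counter

def frequent_noun_sequences(tagged_docs, noun_tags={"MC", "MP"}, threshold=2):
    """Collect compound noun candidates under the paper's definition:
    'any continuous noun sequence that appears frequently in documents'.
    tagged_docs: list of documents, each a list of (word, POS-tag) pairs."""
    counts = Counter()
    for doc in tagged_docs:
        run = []
        for word, tag in doc + [("", "EOS")]:  # sentinel closes the final run
            if tag in noun_tags:
                run.append(word)
            else:
                if len(run) >= 2:  # a compound needs at least two nouns
                    counts[tuple(run)] += 1
                run = []
    return {seq: n for seq, n in counts.items() if n >= threshold}
```

The threshold is a free parameter, matching the paper's note that it can be adjusted per application system.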

The remainder of this paper is organized as follows. Section 2 describes previous compound noun indexing methods for Korean and compound noun filtering methods. We show the overall compound noun indexing system architecture in Section 3, and explain each module of the system in detail in Sections 4 and 5. We evaluate our method with standard Korean test collections in Section 6. Finally, concluding remarks are given in Section 7.

2 Previous Research

2.1 Compound Noun Indexing

There have been two different approaches to compound noun indexing: statistical and linguistic. In one statistical method, (Fagan, 1989) indexed phrases using six different parameters, including information on the co-occurrence of phrase elements, the relative location of phrase elements, etc., and achieved reasonable performance. However, his method could not show consistent substantial improvements in effectiveness on five experimental document collections. (Strzalkowski et al., 1996; Evans and Zhai, 1996) indexed subcompounds from complex noun phrases using noun-phrase analysis. These methods need to find the head-modifier relations in noun phrases and therefore require difficult syntactic analysis in Korean.

For Korean, in one statistical method, (Lee and Ahn, 1996) indexed general Korean nouns using n-grams without linguistic knowledge, and the experimental results showed that the proposed method might be almost as effective as linguistic noun indexing. However, this method can generate many spurious n-grams, which decrease the precision in search performance. Among linguistic methods, (Kim, 1994) used five manually chosen compound noun indexing rule patterns based on linguistic knowledge. However, this method cannot index the diverse types of compound nouns. (Won et al., 2000) used a full parser and increased the precision in search experiments. However, this linguistic method cannot be applied robustly to unrestricted texts.

In summary, the previous methods, whether statistical or linguistic, have their own shortcomings. Statistical methods require significant amounts of co-occurrence information for reasonable performance and cannot index the diverse types of compound nouns. Linguistic methods need compound noun indexing rules described by humans and sometimes produce meaningless compound nouns, which decreases the performance of information retrieval systems. They also cannot cover the various types of compound nouns because of the limitations of human linguistic knowledge.

In this paper, we present a hybrid method that uses linguistic rules, but these rules are automatically acquired from a large corpus through statistical learning. Our method generates more diverse compound noun indexing rule patterns than the previous standard methods (Kim, 1994; Lee et al., 1997), because the previous methods use only the most general rule patterns (shown in Table 2) and are based solely on human linguistic knowledge.

Table 2: Typical hand-written compound noun indexing rule patterns for Korean

  Noun without case markers / Noun
  Noun with a marker / Noun
  Noun with a nominal case marker or an accusative case marker / Verbal common noun or adjectival common noun
  Noun with an adnominal ending / Noun
  Noun within predicate particle phrase / Noun

  (The two nouns before and after a slash in a pattern can form a single compound noun.)

2.2 Compound Noun Filtering

Compound noun indexing methods, whether statistical or linguistic, tend to generate spurious compound nouns when they are actually applied. Since an information retrieval system can be evaluated by its effectiveness and also by its efficiency (van Rijsbergen, 1979), the spurious compound nouns should be efficiently filtered. (Kando et al., 1998) insisted that, for Japanese, the smaller the number of index terms, the better the performance of the information retrieval system should be.

For Korean, (Won et al., 2000) showed that segmentation of compound nouns is more efficient than compound noun synthesis in search performance. There have been many works on compound noun filtering methods: (Kim, 1994) used mutual information only, and (Yun et al., 1997) used mutual information and the relative frequency of POS (part-of-speech) pairs together. (Lee et al., 1997) used stop dictionaries which were constructed manually. Most of the previous methods for compound noun filtering applied a single uniform method to the generated compound nouns, irrespective of the different origins of the compound noun indexing rules, and these methods cause many problems due to data sparseness in the dictionary and the training data. Our approach solves the data sparseness problem by using co-occurrence information on automatically extracted compound noun elements, together with a statistical precision measure that best fits each rule.

3 Overall System Architecture

The compound noun indexing system proposed in this paper consists of two major modules: one for automatically extracting compound noun indexing rules (Figure 1) and the other for indexing documents, filtering the automatically generated compound nouns, and weighting the indexed compound nouns (Figure 2).

[Figure 2: Compound noun indexing, filtering, and weighting module (control flow =>, data flow ->)]

4 Automatic Extraction of Compound Noun Indexing Rules

There are three major steps in automatically extracting compound noun indexing rules. The first step is to collect compound noun statistical information, the second is to extract the rules from a large tagged corpus using the collected statistical information, and the final step is to learn each rule's precision.

4.1 Collecting Compound Noun Statistics

We collected initial compound noun seeds gathered from various types of well-balanced documents, such as the ETRI Kemong encyclopaedia² and many dictionaries on the Internet; in all, we collected 10,368 seeds, as shown in Table 3. This small number of seeds is bootstrapped to extract the compound noun indexing rules for various corpora.

Table 3: Collected compound noun seeds

  No. of component elements      2      3    Total
  ETRI Kemong encyclopaedia    5,100  2,088  7,188
  Internet dictionaries        2,071  1,109  3,180

[Figure 1: Compound noun indexing-rule extraction module (control flow =>, data flow ->)]

2 Courteously provided by ETRI, Korea.

To collect more practical statistics on the compound nouns, we made a 1,000,000-eojeol (the Korean spacing unit, which corresponds to an English word or phrase) tagged corpus for the compound noun indexing experiment from a large document set (the Korean Information Base). From the tagged training corpus, we collected complete compound nouns, i.e., continuous noun sequences composed of 1-3 nouns on the condition that both the preceding and the following POS of the sequence are not nouns (Yoon et al., 1998) (Table 4).

Table 4: Statistics for complete compound nouns

  No. of component elements      1        2       3
  Vocabulary                  264,359  200,455  63,790

4.2 Extracting Indexing Rules

We define a template (Table 5) to extract the compound noun indexing rules from a POS-tagged corpus. The template means that if a front-condition-tag, a rear-condition-tag, and sub-string-tags coincide with the input sentence tags, the lexical items in the synthesis positions of the sentence can be indexed as the compound noun "x / y" (for 3-noun compounds, "x / y / z"). The tags used in the template are POS (part-of-speech) tags, and we use the POSTAG set (Table 17).

The following algorithm extracts compound noun indexing rules from a large tagged corpus using the two-noun compound seeds and the template defined above. The rule extraction scope is limited to the end of a sentence or, if there is a conjunctive ending (eCC) in the sentence, to the conjunctive ending. A rule extraction example is shown in Figure 3.

Algorithm 1: Extracting compound noun indexing rules (for 2-noun compounds)

  Read Template
  Read Seed (consisting of Constituent 1 / Constituent 2)
  Tokenize Seed into Constituents
  Put Constituent 1 into Key1 and Constituent 2 into Key2
  While (Not(End of Documents)) {
      Read Initial Tag of Sentence
      While (Not(End of Sentence or eCC)) {
          Read Next Tag of Sentence
          If (Read Tag == Key1) {
              While (Not(End of Sentence or eCC)) {
                  Read Next Tag of Sentence
                  If (Current Tag == Key2)
                      Write Rule according to the Template
              }
          }
      }
  }

The next step is to refine the extracted rules and select the proper ones. We use a rule filtering algorithm (Algorithm 2) based on frequency, together with the heuristic that rules containing negative lexical items (shown in Table 6) will make spurious compound nouns.

Algorithm 2: Filtering extracted rules using frequency and heuristics

  1. For each compound noun seed, select the rules whose frequency is greater than 2.
  2. Among the rules selected by step 1, select only the rules that are extracted by at least 2 seeds.
  3. Discard rules which contain negative lexical items.
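Algorithms 1 and 2 can be sketched as follows. This is a simplified illustration, not the paper's implementation: sentences are assumed to be lists of `(word, tag)` pairs, a rule is represented as a `(front-condition-tag, sub-string-tags, rear-condition-tag)` triple, the per-rule bookkeeping is collapsed into plain counts, and the negative-item list holds just the three examples from Table 6.

```python
def extract_rules(sentence, seed):
    """Algorithm 1 (sketch): for a 2-noun seed (c1, c2), emit a candidate rule
    wherever c1 ... c2 occur in order in the POS-tagged sentence. The rule
    records the front-condition tag, the tag sub-string spanning c1..c2, and
    the rear-condition tag; a conjunctive ending 'eCC' limits the scope."""
    c1, c2 = seed
    rules = []
    for i, (word, tag) in enumerate(sentence):
        if tag == "eCC":
            break
        if word != c1:
            continue
        for j in range(i + 1, len(sentence)):
            if sentence[j][1] == "eCC":
                break
            if sentence[j][0] == c2:
                front = sentence[i - 1][1] if i > 0 else "BOS"
                rear = sentence[j + 1][1] if j + 1 < len(sentence) else "EOS"
                sub = tuple(t for _, t in sentence[i:j + 1])
                rules.append((front, sub, rear))
    return rules

NEGATIVE_ITEMS = {"je-oe", "eobs", "mos-ha"}  # examples from Table 6

def filter_rules(rule_stats):
    """Algorithm 2 (sketch): rule_stats maps each rule to its total frequency,
    the set of seeds that produced it, and the lexical items it mentions."""
    kept = []
    for rule, info in rule_stats.items():
        if info["freq"] <= 2:                      # step 1: frequency > 2
            continue
        if len(info["seeds"]) < 2:                 # step 2: at least 2 seeds
            continue
        if NEGATIVE_ITEMS & set(info.get("lexicon", ())):
            continue                               # step 3: negative items
        kept.append(rule)
    return kept
```

For example, the tagged sentence for "jeong-bo-eui geom-saeg" with the seed (jeong-bo, geom-saeg) yields a rule whose sub-string-tags are `MC jO MC`, matching the 3-tag pattern of Table 7.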

Table 5: The template to extract the compound noun indexing rules

  front-condition-tag
  sub-string-tags (tag 1  tag 2  ...  tag n-1  tag n)
  rear-condition-tag
  synthesis locations (x y)  =>  lexicon x / lexicon y
  (for 3-noun compounds, synthesis locations (x y z)  =>  lexicon x / lexicon y / lexicon z)

Table 6: Negative lexical item examples

  negative items (tags)     example phrases
  je-oe (MC) (exclude)      no-jo-leul je-oe-han hoe-eui (meeting excluding the union)
  eobs (E) (not exist)      sa-gwa-ga eobs-neun na-mu (tree without apples)
  mos-ha (D) (cannot)       dog-lib-eul mos-han gug-ga (country that cannot be liberated)

[Figure 3: Rule extraction process example]

We automatically extracted and filtered 2,036 rules from the large tagged corpus (the Korean Information Base, 1,000,000 eojeol) using Algorithm 2 above. Among the filtered rules, there were 19 rules with negative lexical items, so we finally selected 2,017 rules. Table 7 shows the distribution of the final rules according to the number of elements in their sub-string-tags.

Table 7: Distribution of extracted rules by the number of elements in sub-string-tags

  No. of tags   Distribution   Example
  2 tags        79.6 %         MC MC
  3 tags        12.6 %         MC jO(eui) MC
  4 tags         4.7 %         MC y eCNMG MC
  5 tags         1.5 %         MC MC jO(e) DI

Table 8: Comparison between the automatically extracted rules and the manual rules

                             No. of rule patterns used   No. of rule patterns
  Manual linguistic method   16                          -
  Our method                 23                          78

Table 9: Examples of newly added rule patterns

  Noun + bound noun / Noun
  Noun + / Noun
  Noun + suffix + assignment verb + adnominal ending / Noun

4.3 Learning the Precision of Rules

The extracted rules differ in how reliably they produce real compound nouns, so we learn a precision for each rule by counting how many of the indexed compound noun candidates generated by the rule are actual compound nouns:

  Prec(rule) = N_actual / N_candidate

where Prec(rule) is the precision of a rule, N_actual is the number of actual compound nouns, and N_candidate is the number of compound noun candidates generated by the automatic indexing rules.
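Given some way to decide which candidates are actual compound nouns (a gold set in this sketch; the paper judges them with the mutual information of the elements), the precision measure is a one-liner:

```python
def rule_precision(candidates, actual_compounds):
    """Prec(rule) = N_actual / N_candidate for one rule: the fraction of
    compound noun candidates generated by the rule that are actual
    compound nouns. Returns 0.0 for a rule that generated nothing."""
    if not candidates:
        return 0.0
    n_actual = sum(1 for c in candidates if c in actual_compounds)
    return n_actual / len(candidates)
```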

To judge whether a candidate is an actual compound noun, we use the mutual information of its elements (Church and Hanks, 1990):

  I(x; y) = log2 ( P(x, y) / ( P(x) P(y) ) )

where x and y are two words in the corpus and I(x; y) is the mutual information of these two words (in this order). The MI was calculated from the statistics of the complete compound nouns collected from the tagged training corpus (see Section 4.1). Table 10 shows the average MI values of the two-element and three-element seeds.

Table 10: Average value of the mutual information (MI) of the compound noun seeds

  Number of elements    2      3
  Average MI          3.56   3.62

However, complete compound nouns are continuous noun sequences and cause the data sparseness problem, so we need to expand the statistics. Figure 4 shows the architecture of the precision learning module, which expands the statistics of the complete compound nouns, along with an algorithmic explanation of the process (Algorithm 3). Table 11 shows the improvement in the average precision during the repetitive execution of this learning process.

[Figure 4: Learning the precision of the compound noun indexing rules (the steps are shown in Algorithm 3)]

Algorithm 3: Learning the precision of the rules

  1. Calculate all rules' initial precision using the initial complete compound noun statistical information.
  2. Calculate the average precision of the rules.
  3. Multiply a rule's precision by the frequency of each compound noun made by the rule. We call this value the modified frequency (MF).
  4. Collect the same compound nouns, and sum all the modified frequencies for each compound noun.
  5. If the summed modified frequency is greater than a threshold, add this compound noun to the complete compound noun statistical information.
  6. Calculate all rules' precision again using the changed complete compound noun statistical information.
  7. Calculate the average precision of the rules.
  8. If the average precision of the rules is equal to the previous average precision, stop. Otherwise, go to step 2.

Table 11: Improvement in the average precision of the rules

  Learning cycle        1     2     3     4     5     6
  Avg. prec. of rules  0.19  0.23  0.39  0.44  0.45  0.45

5 Compound Noun Indexing, Filtering, and Weighting

In this section, we explain how to use the automatically extracted rules to actually index the compound nouns, and describe how to filter and weight the indexed compound nouns.

5.1 Compound Noun Indexing

To index compound nouns from documents, we use a natural language processing engine, SKOPE (Standard KOrean Processing Engine) (Cha et al., 1998), which processes documents by analyzing words into morphemes and tagging parts of speech. The tagging results are compared with the automatically learned compound noun indexing rules and, where they coincide, we index the matched sequences as compound nouns. Figure 5 shows the process of compound noun indexing with an example.

5.2 Compound Noun Filtering

Among the compound nouns indexed above, there can still be meaningless compound nouns, which increase the number of index terms and the search time. To solve this compound noun over-generation problem, we experiment with seven different filtering methods (shown in Table 12).
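The MI-based filtering methods (A and B in Table 12) can be sketched as follows. The count-based probability estimates and the smoothing constant for unseen pairs are assumptions of this sketch, not details from the paper.

```python
import math

def mutual_information(pair_freq, x_freq, y_freq, n_pairs, n_words):
    """I(x; y) = log2( P(x, y) / (P(x) P(y)) ), with the probabilities
    estimated from corpus frequencies (Church and Hanks, 1990)."""
    return math.log2((pair_freq / n_pairs) /
                     ((x_freq / n_words) * (y_freq / n_words)))

def filter_by_mi(candidates, stats, threshold=0.0):
    """Method A keeps a two-element candidate when the MI of its elements
    exceeds 0; method B instead uses the average MI of the seeds (about
    3.56 for two-element seeds, Table 10) as the threshold."""
    kept = []
    for x, y in candidates:
        pair_freq = stats["pairs"].get((x, y), 0.5)  # crude smoothing for unseen pairs
        mi = mutual_information(pair_freq, stats["words"][x], stats["words"][y],
                                stats["n_pairs"], stats["n_words"])
        if mi > threshold:
            kept.append((x, y))
    return kept
```

A frequently co-occurring pair such as (jeong-bo, geom-saeg) gets a high positive MI and survives, while an accidental pairing of two common nouns falls below the threshold.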

[Figure 5: Compound noun indexing process for "bbal-li jeong-bo-leul geom-saeg-ha-neun (retrieving information quickly)"]

We compare the methods' relative effectiveness and efficiency, as shown in Table 16. The methods can be divided into three categories: the first uses mutual information (MI), the second uses the frequency of the compound nouns (FC), and the last uses the frequency of the compound noun elements (FE). MI is a measure of word association, used under the assumption that a highly associated word n-gram is more likely to be a compound noun. FC is used under the assumption that a frequently encountered word n-gram is more likely to be a compound than a rarely encountered one. FE is used under the assumption that a word n-gram with a frequently encountered specific element is more likely to be a compound. In methods C, D, E, and F, each threshold was decided by calculating the average number of compound nouns for each method.

Table 12: Seven different filtering methods (the value in parentheses is the threshold)

  (MI) A. Mutual information of compound noun elements (0)
  (MI) B. Mutual information of compound noun elements (average MI of the compound noun seeds)
  (FC) C. Frequency of compound nouns in the training corpus (4)
  (FC) D. Frequency of compound nouns in the test corpus (2)
  (FE) E. Frequency of compound noun heads in the training corpus (5)
  (FE) F. Frequency of compound noun modifiers in the training corpus (5)
       G. No filtering

Among these methods, method B generated the smallest number of compound nouns (best efficiency) and showed reasonable effectiveness (Table 16). On the basis of this filtering method, we developed a smoothing method by combining the precision of the rules with the mutual information of the compound noun elements, and propose our final filtering method (H) as follows:

  T(x, y) = log2 ( P(x, y) / ( P(x) P(y) ) ) + α × Precision

where α is a weighting coefficient and Precision is the precision of the applied rule, learned in Section 4.3. For three-element compound nouns, the MI part is replaced with the three-element MI equation³ (Su et al., 1994).

6 Experiment Results

To calculate the similarity between a document and a query, we use the p-norm retrieval model (Fox, 1983) with a p-value of 2.0. We also use the component nouns of a compound as indexing terms. We follow the standard TREC evaluation schemes (Salton and Buckley, 1991). For single index terms, we use the weighting method atn.ntc (Lee, 1995).

6.1 Compound Noun Indexing Experiments

This experiment shows how much more diverse the types of compound nouns indexed by the proposed method are, compared with the previous popular methods that use human-generated compound noun indexing rules (Kim, 1994; Lee et al., 1997). For simplicity, we filtered the generated compound nouns using the mutual information of the compound noun elements with a threshold of zero (method A in Table 12).

Table 13 shows that the terms indexed by the previous linguistic approach are a subset of the ones made by our statistical approach. This means that the proposed method can cover more diverse compound nouns than the manual linguistic rule method. We also performed a retrieval experiment to evaluate the automatically extracted rules. Table 14⁴ and Table 15⁵ show that our method has slightly better recall and 11-point average precision than the manual linguistic rule method.

Table 13: Compound noun indexing coverage experiment (with a 200,000-eojeol Korean Information Base)

                                        Manual linguistic   Our automatic
                                        rule patterns       rule patterns
  No. of generated actual
  compound nouns                        22,276              30,168 (+35.4 %)
  No. of generated actual compound
  nouns without overlap                 -                   7,892

Table 14: Compound noun indexing effectiveness experiment I

                          Manual linguistic   Our automatic
                          rule patterns       rule patterns
  Avg. recall             82.66               83.62 (+1.16 %)
  11-pt. avg. precision   42.24               42.33 (+0.21 %)
  No. of index terms      504,040             515,801 (+2.33 %)

6.2 Retrieval Experiments Using Various Filtering Methods

In this experiment, we compare the seven filtering methods to find out which one is the best in terms of effectiveness and efficiency. We used our automatic rules for the compound noun indexing and the test collection KTSET2.0. To check effectiveness, we used recall and 11-point average precision; to check efficiency, we used the number of index terms. Table 16 shows the results of the various filtering experiments.

From Table 16, the methods using mutual information reduce the number of index terms, but they have lower precision. The reason for this lower precision is that MI has a bias, i.e., it scores in favor of rare terms over common terms, so MI seems to have a problem in its sensitivity to probability estimation error (Yang and Pedersen, 1997). In this experiment⁶, we see that method B generates the smallest number of compound nouns (best efficiency) and that our final proposed method H has the best recall and precision (effectiveness) with a reasonable number of compound nouns (efficiency). We can conclude that filtering method H is the best, considering the effectiveness and the efficiency at the same time.

³ I(x; y; z) = log2 ( P_D(x, y, z) / P_I(x, y, z) ), where P_D is the probability of x, y, and z occurring together and P_I(x, y, z) = P(x)P(y)P(z).
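Method H's score simply adds the rule's learned precision, scaled by α, to the MI term. In the sketch below, the α value and the probability estimates attached to each candidate are placeholders for illustration; the paper does not report a tuned α.

```python
import math

def score_h(p_xy, p_x, p_y, rule_precision, alpha=1.0):
    """T(x, y) = log2( P(x, y) / (P(x) P(y)) ) + alpha * Precision:
    the MI of the compound's two elements plus the weighted precision of
    the indexing rule that generated it (Section 4.3)."""
    return math.log2(p_xy / (p_x * p_y)) + alpha * rule_precision

def filter_h(candidates, threshold=0.0, alpha=1.0):
    """Keep candidates whose combined score clears the threshold. Each
    candidate carries its probability estimates and its rule's precision."""
    return [c for c in candidates
            if score_h(c["p_xy"], c["p_x"], c["p_y"], c["prec"], alpha) > threshold]
```

Because the rule precision enters the score, a candidate produced by a reliable rule can survive even when its MI alone would fall below the threshold, which is how the method smooths over data sparseness.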

Table 15: Compound noun indexing effectiveness experiment II

                          Manual linguistic   Our automatic
                          rule patterns       rule patterns
  Avg. recall             86.32               87.50 (+1.35 %)
  11-pt. avg. precision   34.33               34.54 (+0.61 %)
  No. of index terms      1,242,458           1,282,818 (+3.15 %)

7 Conclusion

In this paper, we presented a method to extract compound noun indexing rules automatically from a large tagged corpus, and showed that this method can index the compound nouns appearing in diverse types of documents.

In terms of effectiveness, this method is slightly better than the previous linguistic approaches but requires no human effort. The proposed method also uses no parser and no rules described by humans; therefore, it can be applied to unrestricted texts robustly and has high domain portability.

⁴ With the KTSET2.0 test collection (courteously provided by KT, Korea; 4,410 documents and 50 queries).
⁵ With the KRIST2.0 test collection (courteously provided by KORDIC, Korea; 13,514 documents and 30 queries).
⁶ Our Korean NLQ (Natural Language Querying) demo system (located at 'http://nlp.postech.ac.kr/Resarch/POSNLQ/') can be tested.

Table 16: Retrieval experiment results of the various filtering methods

                   A        B        C        D        E        F        G        H
  Avg.           83.62    83.62    83.62    83.62    83.62    83.62    84.32    84.32
  recall                  (+0.00)  (+0.00)  (+0.00)  (+0.00)  (+0.00)  (+0.84)  (+0.84)
  11-pt. avg.    42.45    42.42    42.49    42.55    42.72    42.48    42.48    42.75
  precision               (-0.07)  (+0.09)  (+0.24)  (+0.64)  (+0.07)  (+0.07)  (+0.71)
  Precision      52.11    52.44    52.07    52.80    52.26    51.89    52.81    52.98
  at 10 docs
  No. of index   515,801  508,20   514,54   547,27   572,36   574,04   705,98   509,90
  terms                   (-1.47)  (-0.24)  (+6.10)  (+10.97) (+11.29) (+36.87) (-1.14)

We also presented a filtering method to solve the compound noun over-generation problem. Our proposed filtering method (H) shows good retrieval performance in terms of both effectiveness and efficiency. In the future, we need to perform experiments on much larger commercial databases to test the practicality of our method. Finally, our method does not require language-dependent knowledge, so it remains to be verified whether it can be easily applied to other languages.

Table 17: The POS (part-of-speech) set of POSTAG

  Tag     Description            Tag     Description
  MC      common noun            MP      proper noun
  MD      bound noun             T       pronoun
  G       adnoun                 S       -
  B       adverb                 K       interjection
  DR      regular verb           DI      irregular verb
  HR      regular adjective      HI      irregular adjective
  I       assignment verb        E       existential predicate
  jc      case particle          js      auxiliary particle
  jO      other particle         eGE     final ending
  eGS     prefinal ending        eCNDI   aux conj ending
  eCNDC   quote conj ending      eCNMM   nominal ending
  eCNMG   adnominal ending       eCNB    adverbial ending
  eCC     conjunctive ending     y       predicative particle
  b       auxiliary verb         +       suffix
  su      unit symbol            so      other symbol
  s(      left parenthesis       s)      right parenthesis
  s.      sentence closer        s-      sentence connection
  s,      sentence comma         sf      foreign word
  sh      Chinese character

References

Jeongwon Cha, Geunbae Lee, and Jong-Hyeok Lee. 1998. Generalized unknown morpheme guessing for hybrid POS tagging of Korean. In Proceedings of the Sixth Workshop on Very Large Corpora in Coling-ACL 98.

K. W. Church and P. Hanks. 1990. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22-29.

David A. Evans and Chengxiang Zhai. 1996. Noun-phrase analysis in unrestricted text for information retrieval. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, Santa Cruz, CA, pages 17-24.

Joel L. Fagan. 1989. The effectiveness of a nonsyntactic approach to automatic phrase indexing for document retrieval. JASIS, 40(2):115-132.

E. A. Fox. 1983. Extending the Boolean and Vector Space Models of Information Retrieval with P-norm Queries and Multiple Concept Types. Ph.D. thesis, Cornell University.

Noriko Kando, Kyo Kageura, Masaharu Yoshoka, and Keizo Oyama. 1998. Phrase processing methods for Japanese text retrieval. SIGIR Forum, 32(2):23-28.

Pan Koo Kim. 1994. The automatic indexing of compound words from Korean text based on mutual information. Journal of KISS (in Korean), 21(7):1333-1340.

Joon Ho Lee and Jeong Soo Ahn. 1996. Using n-grams for Korean text retrieval. In SIGIR'96, pages 216-224.

Hyun-A Lee, Jong-Hyeok Lee, and Geunbae Lee. 1997. Indexing using clausal segmentation. Journal of KISS (in Korean), 24(3):302-311.

Joon Ho Lee. 1995. Combining multiple evidence from different properties of weighting schemes. In SIGIR'95, pages 180-188.

Gerard Salton and Chris Buckley. 1991. Text retrieval conferences evaluation program. ftp://ftp.cs.cornell.edu/pub/smart/, trec_eval.7.0beta.tar.gz.

Tomek Strzalkowski, Louise Guthrie, Jussi Karlgren, Jim Leistensnider, Fang Lin, Jose Perez-Carballo, Troy Straszheim, Jin Wang, and Jon Wilding. 1996. Natural language information retrieval: TREC-5 report. In The Fifth Text REtrieval Conference (TREC-5), NIST Special Publication 500-238.

Keh-Yih Su, Ming-Wen Wu, and Jing-Shin Chang. 1994. A corpus-based approach to automatic compound extraction. In Proceedings of ACL 94, pages 242-247.

C. J. van Rijsbergen. 1979. Information Retrieval. Butterworths, London.

Hyungsuk Won, Mihwa Park, and Geunbae Lee. 2000. Integrated multi-level indexing method for compound noun processing. Journal of KISS (in Korean), 27(1):84-95.

Yiming Yang and Jan O. Pedersen. 1997. A comparative study on feature selection in text categorization. In Proceedings of ICML-97, 14th International Conference on Machine Learning, pages 412-420, Nashville, US. Morgan Kaufmann, San Francisco, US.

Jun-Tae Yoon, Eui-Seok Jong, and Mansuk Song. 1998. Analysis of Korean compound noun indexing using lexical information between nouns. Journal of KISS (in Korean), 25(11):1716-1725.

Bo-Hyun Yun, Yong-Jae Kwak, and Hae-Chang Rim. 1997. A Korean information retrieval model alleviating syntactic term mismatches. In Proceedings of the Natural Language Processing Pacific Rim Symposium, pages 107-112.
