
Processing Homonyms in the Kana-to-Kanji Conversion

Masahito Takahashi    Tsuyoshi Shinchu    Kenji Yoshimura    Kosho Shudo
Fukuoka University
8-19-1, Nanakuma, Jonan-ku, Fukuoka, 814-01, Japan
[email protected]  shinchu@helio.tl.fukuoka-u.ac.jp  yosimu@tlsun.tl.fukuoka-u.ac.jp  shudo@tlsun.tl.fukuoka-u.ac.jp

Abstract

This paper proposes two new methods to identify the correct meaning of Japanese homonyms in text based on the noun:verb co-occurrence in a sentence, which can be obtained easily from corpora. The first method uses the near co-occurrence data sets, which are constructed from the above co-occurrence relation, to select the most feasible word among homonyms in the scope of a sentence. The second uses the far co-occurrence data sets, which are constructed dynamically from the near co-occurrence data sets in the course of processing input sentences, to select the most feasible word among homonyms in the scope of a sequence of sentences. An experiment on kana-to-kanji (phonogram-to-ideograph) conversion has shown that the conversion is carried out at an accuracy rate of 79.6% per word by the first method. This accuracy rate is 7.4% higher than that of the ordinary method based on the word occurrence frequency.

1 Introduction

Processing homonyms, i.e. identifying the correct meaning of homonyms in text, is one of the most important phases of kana-to-kanji conversion, currently the most popular method for inputting Japanese characters into a computer. Recently, several new methods for processing homonyms, based on neural networks (Kobayashi, 1992) or the co-occurrence relation of words (Yamamoto, 1992), have been proposed. These methods apply the co-occurrence relation of words not only within a sentence but also across a sequence of sentences. It appears impracticable, however, to prepare a neural network for co-occurrence data large enough to handle 50,000 to 100,000 Japanese words.

In this paper, we propose two new methods for processing Japanese homonyms based on the co-occurrence relation between a noun and a verb in a sentence. We have defined two co-occurrence data sets. One is a set of nouns accompanied by a case marking particle, each element of which has a set of co-occurring verbs in a sentence. The other is a set of verbs accompanied by a case marking particle, each element of which has a set of co-occurring nouns in a sentence. We call these two co-occurrence data sets near co-occurrence data sets. We then apply the data sets to the processing of homonyms. Two strategies are used to approach the problem. The first uses the near co-occurrence data sets to select the most feasible word among homonyms in the scope of a sentence. The aim is to evaluate the possible existence of a near co-occurrence relation, i.e. a co-occurrence relation between a noun and a verb within a sentence. The second evaluates the possible existence of a far co-occurrence relation, referring to a co-occurrence relation among words in different sentences. This is achieved by constructing far co-occurrence data sets from the near co-occurrence data sets in the course of processing input sentences.

2 Co-occurrence data sets

The near co-occurrence data sets are defined as follows. The first near co-occurrence data set is the set E_N,near, each element of which (n) is a triplet consisting of a noun, a case marking particle, and a set of verbs which co-occur with that noun and particle pair in a sentence:

    n = (noun, particle, {(v1, k1), (v2, k2), ...})

In this description, particle is a Japanese case marking particle, such as が (nominative case), を (accusative case), or に (dative case); vi (i = 1, 2, ...) is a verb; and ki (i = 1, 2, ...) is the frequency of occurrence of the combination of noun, particle and vi, which is determined in the course of constructing E_N,near from corpora. The following are examples of elements of E_N,near:

    (雨 (rain), が (nominative case), {(降る (fall), 10), (止む (stop), 3), ...})
    (雨 (rain), を (accusative case), {(警戒する (take precautions), 3), ...})
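In implementation terms, E_N,near is simply a mapping from (noun, particle) pairs to weighted verb sets. The following is a minimal sketch of such a structure in Python, not the authors' implementation; the entries correspond to the rain examples above, transliterated into romaji identifiers of our own choosing:

```python
# Sketch (our assumption) of the near co-occurrence data set E_N,near:
# each (noun, case-marking-particle) pair maps to the verbs that
# co-occur with it in a sentence, with their corpus frequencies k_i.
# The entries mirror the paper's "rain" examples, in romaji.
E_N_near = {
    ("ame(rain)", "ga(nom)"): {"furu(fall)": 10, "yamu(stop)": 3},
    ("ame(rain)", "wo(acc)"): {"keikai_suru(take precautions)": 3},
}

def cooccurring_verbs(noun: str, particle: str) -> dict:
    """Return the verb -> frequency set V for a (noun, particle) pair."""
    return E_N_near.get((noun, particle), {})

print(cooccurring_verbs("ame(rain)", "ga(nom)"))
# -> {'furu(fall)': 10, 'yamu(stop)': 3}
```

A symmetric mapping keyed by (verb, particle) gives E_V,near; since each records the same (noun, particle, verb, frequency) tuples, either can be built from the other.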

The second near co-occurrence data set is the set E_V,near, each element of which (v) is a triplet consisting of a verb, a case marking particle, and a set of nouns which co-occur with that verb and particle pair in a sentence:

    v = (verb, particle, {(n1, l1), (n2, l2), ...})

In this description, particle is a Japanese case marking particle; ni (i = 1, 2, ...) is a noun; and li (i = 1, 2, ...) is the frequency of occurrence of the combination of verb, particle and ni. E_V,near can be constructed from E_N,near, and vice versa. The following are examples of elements of E_V,near:

    (降る (fall), が (nominative case), {(雨 (rain), 10), (雪 (snow), 8), ...})
    (降る (fall), に (dative case), {(九州 (Kyushu), 1), ...})

3 Processing homonyms in a simple sentence

Using the near co-occurrence data sets, the most feasible word among possible homonyms can be selected within the scope of a sentence. Our hypothesis states that the most feasible noun or combination of nouns has the largest number of verbs with which it can co-occur in a sentence.

The structure of an input Japanese sentence written in kana-characters can be simplified as follows:

    N1·P1, N2·P2, ..., Nm·Pm, V

where Ni (i = 1, 2, ..., m) is a noun, Pi (i = 1, 2, ..., m) is a particle and V is a verb.

3.1 Procedure

Following is the procedure for finding the most feasible combination of words for an input kana-string which has the above simplified Japanese sentence structure. This procedure can also accept an input kana-string which does not include a final position verb.

Step 1: Let m = 0 and Ti = ε (i = 1, 2, ...).

Step 2: If the input kana-string is null, go to Step 4. Otherwise read one block of the kana-string, that is N·P or V, from the left side of the input kana-string, and delete that block from the left side of the input kana-string.

Step 3: Find all homonymic kanji-variants Wk (k = 1, 2, ...) for the kana-string N or V which was read in Step 2. Increase m by 1. For each Wk (k = 1, 2, ...):

    1. If Wk is a noun, retrieve (Wk, P, Vk) from the near co-occurrence data set E_N,near and add the doublet (Wk, Vk) to Tm.
    2. If Wk is a verb, add the doublet (Wk, {(Wk, 0)}) to Tm.

Go to Step 2.

Step 4: From Ti (i = 1, 2, ..., m), find the combination

    (W1, V1), (W2, V2), ..., (Wm, Vm),  (Wi, Vi) ∈ Ti (i = 1, 2, ..., m)

which has the largest value of |∩(V1, V2, ..., Vm)|, where the function ∩(V1, V2, ..., Vm) is defined as follows:

    ∩(V1, V2, ..., Vm) = {(v, Σ_{i=1..m} ki) | (v, k1) ∈ V1 ∧ ... ∧ (v, km) ∈ Vm}

and |∩(V1, V2, ..., Vm)| is defined as:

    |∩(V1, V2, ..., Vm)| = Σ_{(v, k) ∈ ∩(V1, V2, ..., Vm)} k

The sequence of words W1, W2, ..., Wm is the most feasible combination of homonymic kanji-variants for the input kana-string.

3.2 An example of processing homonyms in a simple sentence

Following is an example of homonym processing using the above procedure. For the input kana-string "かわにはしを (kawa ni hashi o)", "かわ (kawa)" means a river and "はし (hashi)" means a bridge. "かわ (kawa)" and "はし (hashi)" both have homonymic kanji-variants:

    homonyms of "かわ (kawa)": 川 (river), 皮 (leather)
    homonyms of "はし (hashi)": 橋 (bridge), 箸 (chopsticks)

The near co-occurrence data for "川 (river)" and "皮 (leather)" followed by the particle に (dative case), and the near co-occurrence data for "橋 (bridge)" and "箸 (chopsticks)" followed by the particle を (accusative case), are shown below:

    (川 (river), に, {(行く (go), 8), (架ける (build), 6), (落とす (drop), 5)})
    (皮 (leather), に, {(塗る (paint), 6), (触る (touch), 3)})
    (橋 (bridge), を, {(渡る (walk across), 9), (架ける (build), 7), (落とす (drop), 4)})
    (箸 (chopsticks), を, {(使う (use), 7), (落とす (drop), 3)})

Following the procedure, the resultant frequency values are as follows:

    川に 橋を → 22 {(架ける, 13), (落とす, 9)}
    川に 箸を → 8 {(落とす, 8)}
    皮に 橋を → 0 {}
    皮に 箸を → 0 {}

Therefore, the most feasible combination of words is "川 (river) に 橋 (bridge) を".
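The Step 4 selection can be written down directly from the definitions of ∩ and |∩|. Below is a minimal Python rendering, a sketch rather than the authors' implementation; the function and variable names are ours, and the data are the romaji-transliterated entries of the kawa/hashi example:

```python
from itertools import product

# Near co-occurrence data of the example: the verbs co-occurring with
# each candidate kanji-variant and its particle, with frequencies.
candidates = [
    # homonymic kanji-variants of "kawa" followed by ni (dative)
    [("kawa(river)",   {"iku(go)": 8, "kakeru(build)": 6, "otosu(drop)": 5}),
     ("kawa(leather)", {"nuru(paint)": 6, "sawaru(touch)": 3})],
    # homonymic kanji-variants of "hashi" followed by wo (accusative)
    [("hashi(bridge)",     {"wataru(walk across)": 9, "kakeru(build)": 7,
                            "otosu(drop)": 4}),
     ("hashi(chopsticks)", {"tsukau(use)": 7, "otosu(drop)": 3})],
]

def intersect(*verb_sets):
    """The function cap(V1,...,Vm): verbs present in every Vi,
    with their frequencies summed over i."""
    common = set.intersection(*(set(vs) for vs in verb_sets))
    return {v: sum(vs[v] for vs in verb_sets) for v in common}

def score(*verb_sets):
    """|cap(V1,...,Vm)|: the total frequency of the shared verbs."""
    return sum(intersect(*verb_sets).values())

# Step 4: pick the combination of variants with the largest score.
best = max(product(*candidates),
           key=lambda combo: score(*(vs for _, vs in combo)))
print([w for w, _ in best])            # -> ['kawa(river)', 'hashi(bridge)']
print(score(*(vs for _, vs in best)))  # -> 22
```

Since river and bridge share the verbs for build (6 + 7) and drop (5 + 4), their combination scores 22, while river/chopsticks scores 8 and both leather pairings score 0, reproducing the result of the example.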

4 An experiment on processing homonyms in a simple sentence

4.1 Preparing a dictionary and a co-occurrence data file

4.1.1 A noun file

A noun file including 323 nouns was prepared, consisting of 190 nouns extracted from text concerning current topics and their 133 homonyms.

4.1.2 A co-occurrence data file

A co-occurrence data file was prepared. The record format of the file is specified as follows:

    [noun, case marking particle, verb, frequency of occurrence]

where the case marking particle is chosen from 8 kinds of particles, namely が, を, に, で, と, から, より and へ. The file includes 25,665 records of co-occurrence relations (79 records per noun) for the nouns in the noun file, obtained by merging 11,294 records from the EDR Co-occurrence Dictionary (EDR, 1994) with 15,856 records from handmade simple sentences.

4.1.3 An input file and an answer file

An input file for the experiment, which includes 1,129 simple sentences written in the kana alphabet, and an answer file, which includes the same 1,129 sentences written in kanji characters, were prepared. Every noun of the sentences in the files was chosen from the noun file.

4.1.4 A word dictionary

A word dictionary, which consists of the 323 nouns in the noun file and 23,912 verbs in a Japanese dictionary for kana-to-kanji conversion¹, was prepared. It is used to find all homonymic kanji-variants for each noun or verb of the sentences in the input file.

¹ This dictionary was made by AI Soft Co.

4.2 Experiment results

An experiment on processing homonyms in a simple sentence was carried out. In this experiment, kana-to-kanji conversion was applied to each of the sentences, i.e. the input kana-strings, in the above input file, and the near co-occurrence data sets were constructed from the above co-occurrence data file. Table 1 shows the results of kana-to-kanji conversion in the following two cases. In the first case, an input kana-string does not include a final position verb; that is, each verb of the kana-strings in the input file is neglected. In the second case, an input kana-string includes a final position verb.

The experiment has shown that the conversion is carried out at an accuracy rate of 79.6% per word, with a conversion rate of 93.1% per word, in the first case. In the same way, the accuracy rate is 93.8% per word, with a conversion rate of 14.5% per word, in the second case. We also conducted the same experiment using the method based on the word occurrence frequency, in order to compare our method with an ordinary method. It showed that the accuracy rate of the ordinary method is 72.2% per word in the first case and 77.8% per word in the second case. The accuracy rate of our method is thus 7.4% higher in the first case and 16.9% higher in the second case compared with the ordinary method. This clarifies that our method is more effective than the ordinary method based on the word occurrence frequency.

Table 1: Experiment results on processing homonyms in a simple sentence.

                          For sentences        For sentences
                          without a verb       with a verb
    Conversion rate       1053/1129 (93.3%)    166/1129 (14.7%)
      per sentence
    Accuracy rate         663/1053 (63.0%)     136/166 (81.9%)
      per sentence
    Conversion rate       2155/2315 (93.1%)    500/3444 (14.5%)
      per word
    Accuracy rate         1716/2155 (79.6%)    469/500 (93.8%)
      per word

5 An approximation of the far co-occurrence relation

We can approximate the far co-occurrence relation, i.e. the co-occurrence relation among words in a sequence of sentences, from the near co-occurrence data sets. The far co-occurrence data sets are described as follows:

    E_N,far = {(n1, t1), (n2, t2), ..., (n_ln, t_ln)}
    E_V,far = {(v1, u1), (v2, u2), ..., (v_lv, u_lv)}

where ni (i = 1, 2, ..., ln) is a noun, ti is the priority value of ni, vi (i = 1, 2, ..., lv) is a verb and ui is the priority value of vi.

The procedure for producing the far co-occurrence data sets is:

Step 1: Clear the far co-occurrence data sets:

    E_N,far = ε,  E_V,far = ε

Step 2: After each fixing of a noun N among homonyms in the process of kana-to-kanji conversion, renew the far co-occurrence data sets E_N,far and E_V,far by following these steps:

    1. Change all priority values ti (i = 1, 2, ..., ln) in the set E_N,far to f(ti) (for example, f(ti) = 0.95 ti). This process is intended to decrease priority with the passage of time.

    2. Change all priority values ui (i = 1, 2, ..., lv) in the set E_V,far to f(ui) as well.

    3. Let N be the noun determined in the process of kana-to-kanji conversion. Find all (vi, ki) (i = 1, 2, ..., q) which co-occur with the noun N followed by any particle in the near co-occurrence data set E_N,near. Add the new elements (vi, g(ki)) (i = 1, 2, ..., q) to the set E_V,far. If an element with the same verb vi already exists in E_V,far, add the value g(ki) to the priority value of that element instead of adding the new element. Here, g(ki) is a function for converting frequency of occurrence to priority value; for example, g(ki) = 1 - (1/ki).

    4. Let vi be a verb described in the previous step. Find all (nj, lj) (j = 1, 2, ..., q) which co-occur with the verb vi and any particle in the near co-occurrence data set E_V,near. Add the new elements (nj, h(ki, lj)) (j = 1, 2, ..., q) to the set E_N,far. If an element with the same noun nj already exists in E_N,far, add the value h(ki, lj) to the priority value of that element instead of adding the new element. Here, h(ki, lj) is a function for converting frequency of occurrence to priority value; for example, h(ki, lj) = g(ki)(1 - (1/lj)).

6 Processing homonyms in a sequence of sentences

Using the far co-occurrence data sets defined in the previous section, the most feasible word among homonyms can be selected in the scope of a sequence of sentences according to the following two cases:

Case 1: An input word written in kana-characters is a noun.
Case 2: An input word written in kana-characters is a verb.

6.1 Procedure for Case 1

Step 1: Find the set Sn:

    Sn = {(N1, T1), (N2, T2), ...}

where Ni (i = 1, 2, ...) is a homonymic kanji-variant for the input word written in kana-characters and Ti is the priority value for the homonym Ni, which can be retrieved from the far co-occurrence data set E_N,far.

Step 2: The noun Ni which has the greatest priority value Ti in Sn is the most feasible noun for the input word written in kana-characters.

6.2 Procedure for Case 2

Step 1: Find the set Sv:

    Sv = {(V1, U1), (V2, U2), ...}

where Vj (j = 1, 2, ...) is a homonymic kanji-variant for the input word written in kana-characters and Uj is the priority value for the homonym Vj, which can be retrieved from the far co-occurrence data set E_V,far.

Step 2: The verb Vj which has the greatest priority value Uj in Sv is the most feasible verb for the input word written in kana-characters.

7 Conclusion

We have proposed two new methods for processing Japanese homonyms based on the co-occurrence relation between a noun and a verb in a sentence, which can be obtained easily from corpora. Using these methods, we can evaluate the co-occurrence relation of words in a simple sentence by using the near co-occurrence data sets obtained from corpora. We can also evaluate the co-occurrence relation of words in different sentences by using the far co-occurrence data sets constructed from the near co-occurrence data sets in the course of processing input sentences. The far co-occurrence data sets are based on the proposition that it is more practical to maintain a relatively small amount of data on the semantic relations between words, changed dynamically in the course of processing, than to maintain a huge universal "thesaurus" data base, which does not appear to have been built successfully.

An experiment on kana-to-kanji conversion by the first method for 1,129 input simple sentences has shown that the conversion is carried out at 93.1% per word and the accuracy rate is 79.6% per word. This clarifies that the first method is more effective than the ordinary method based on the word occurrence frequency.

In the next stage of our study, we intend to evaluate the second method, based on the far co-occurrence data sets, by conducting experiments.

References

Kobayashi, T., et al. 1992. Realization of Kana-to-Kanji Conversion Using Neural Networks. Toshiba Review, Vol. 47, No. 11, pages 868-870, Japan.

Yamamoto, K., et al. 1992. Kana-to-Kanji Conversion Using Co-occurrence Groups. Proc. of 44th Conference of IPSJ, 4p-11, pages 189-190, Japan.

EDR. 1994. Co-occurrence Dictionary Ver. 2, TR-043, Japan.
