IASL Korean-Chinese CLIR System
Query Translation CLIR System based on Bilingual Dictionary and Co-occurrence Method

Yu-Chun Wang, Cheng-Wei Lee, Richard Tzong-Han Tsai, Wen-Lian Hsu*
*[email protected]
Academia Sinica, Taiwan, R.O.C.

We propose an architecture for retrieving Chinese documents based on Korean queries in the NTCIR CLIR K-C task. Our system uses a bilingual dictionary to perform query translation. We expand our bilingual dictionary by extracting words and their translations from Wikipedia, an online encyclopedia. To resolve the problem of translating Western people's names into Chinese, we propose a transliteration mapping method. We translate queries from Korean to Chinese by using a co-occurrence method.

Architecture

[Figure: system architecture with three stages, Query Processing, Query Translation, and Retrieval. Component labels: Korean Query, Rule-based Term Processing, KLT Term Extractor, Term Translation, Wikipedia, Daum K-C Bilingual Dictionary, Person Name Translation, Naver People Search, Term Disambiguation, CIRB Documents, CKIP AutoTag, CIRB Index, Lucene IR Engine, Answers.]

Query Processing

We use two different segmentation methods, one for the title of the query and the other for the other parts.

Predefined Processing Rules for Title part
Our rules:
1. Split the title into several eojeols by the space characters.
2. Remove Korean postpositions at the end of each eojeol.

KLT Term Extractor for other parts
We use the KLT Term Extractor to extract key words and remove stop words. The KLT Term Extractor is developed by Kookmin University, Korea.

Query Translation

Bilingual Dictionary
We use the free online Korean-Chinese bilingual dictionary provided by the Daum Korean web site.

Wikipedia
We use Wikipedia to expand our dictionaries for proper nouns. The procedure is as follows:
1. Send Korean terms to Korean Wikipedia.
2. Find the inter-language links in the Korean Wikipedia pages.
3. Follow the inter-language link to Chinese Wikipedia.
4. Construct the translation pair between the items in Korean and Chinese Wikipedia.

Person Name Translation
1. 그린스펀: Korean transliteration of the name "Greenspan".
2. Naver People Search returns "Greenspan", the original English name.
3. The CNA English-Chinese Transliteration Table maps it to 葛林斯潘, the Chinese transliteration of the name "Greenspan".

Term Disambiguation
Many different Chinese loanwords have the same pronunciation when written in the Hangul alphabet. For example, 이상 can correspond to 理想, 以上, 異常, or 異狀.

Mutual Information

\[
\mathrm{MI\ score}(tc_{ij} \mid Q) = \sum_{x=1,\, x \neq i}^{|Q|} \; \sum_{y=1}^{Z(qt_x)} \frac{\Pr(tc_{ij}, tc_{xy})}{\Pr(tc_{ij}) \, \Pr(tc_{xy})}
\]

Chinese Information Retrieval

Document Indexing
CIRB 4.0 documents are pre-processed to remove noise and then segmented by CKIP AutoTag to obtain words and part-of-speech (POS) tags. We adopt Lucene, an open source information retrieval engine. Our index is based on Chinese characters.

Lucene Query
One query example:
日本 or 韓國 or 漁業 or (協定 or 條約^0.25 or 合約^0.25 or 合同^0.25)

Performance

IASL CLIR K-C Performance

                 Rigid             Relax
Run              MAP     R-prec    MAP     R-prec
IASL-K-C-T-01    0.1118  0.1420    0.1392  0.1781
IASL-K-C-D-01    0.1022  0.1331    0.1274  0.1760

Acknowledgement
We would like to thank CKIP for providing us AutoTag for Chinese word segmentation.
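The mutual-information score used for term disambiguation above can be illustrated with a small Python sketch. It is not the authors' implementation. It assumes that tc_ij denotes the j-th Chinese translation candidate of the i-th query term, that Z(qt_x) is the number of candidates of term x, and that the probabilities are estimated as document frequencies over a corpus. The function names (`probability`, `mi_score`, `disambiguate`), the toy corpus, and the second query term 기후 (with the single candidate 氣候) are hypothetical and chosen only for illustration.

```python
from typing import List, Set


def probability(docs: List[Set[str]], *terms: str) -> float:
    """Fraction of documents that contain all of the given terms."""
    if not docs:
        return 0.0
    hits = sum(1 for doc in docs if all(term in doc for term in terms))
    return hits / len(docs)


def mi_score(candidate: str, term_index: int,
             candidates: List[List[str]], docs: List[Set[str]]) -> float:
    """Sum Pr(a, b) / (Pr(a) * Pr(b)) over the candidates of all OTHER query terms."""
    p_candidate = probability(docs, candidate)
    if p_candidate == 0.0:
        return 0.0  # candidate never occurs in the corpus: no evidence for it
    score = 0.0
    for x, other_candidates in enumerate(candidates):
        if x == term_index:
            continue  # skip the term being disambiguated (x != i)
        for other in other_candidates:
            p_other = probability(docs, other)
            if p_other == 0.0:
                continue  # avoid division by zero
            score += probability(docs, candidate, other) / (p_candidate * p_other)
    return score


def disambiguate(candidates: List[List[str]], docs: List[Set[str]]) -> List[str]:
    """Pick, for each query term, the candidate with the highest score."""
    return [
        max(options, key=lambda c, i=i: mi_score(c, i, candidates, docs))
        for i, options in enumerate(candidates)
    ]


# Hypothetical toy corpus: each document is a set of segmented Chinese words.
DOCS = [
    {"氣候", "異常"},
    {"氣候", "異常", "暖化"},
    {"理想", "教育"},
    {"以上", "人口"},
]

# Hypothetical query "기후 이상": candidates for 기후, then for 이상.
CANDIDATES = [["氣候"], ["理想", "以上", "異常", "異狀"]]

if __name__ == "__main__":
    print(disambiguate(CANDIDATES, DOCS))
```

In this toy setting only 異常 co-occurs with 氣候, so it is selected for 이상. Estimating the probabilities from document frequencies and skipping zero-probability terms are simplifying choices of the sketch, and the poster does not state how the probabilities were estimated.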
Intelligent Agent Systems Lab, Institute of Information Science, Academia Sinica.