USING CROSS-LANGUAGE CUES FOR STORY-SPECIFIC LANGUAGE MODELING

SANJEEV KHUDANPUR AND WOOSUNG KIM
Center for Language and Speech Processing, Johns Hopkins University
{sanjeev, woosung}@clsp.jhu.edu

I. INTRODUCTION

• Motivation:
  - Multilingual, multi-domain, multi-source linguistic resources are available.
  - Most resources are concentrated on popular languages such as English, French, and German.
  - Relatively small resources exist for Chinese, Arabic, and other languages.
  - Stochastic models require a large amount of data for training.
  - How can stochastic models be constructed in resource-deficient languages? Bootstrap by reusing models from resource-rich languages, e.g.:
    - a universal phone set for ASR;
    - exploiting parallel texts to project morphological analyzers, POS taggers, etc.

• We present:
  - An approach to sharpen an LM in a resource-deficient language using comparable text from resource-rich languages.
  - Story-specific language models built from parallel text.
  - An integration of machine translation (MT), cross-language information retrieval (CLIR), and language modeling (LM).
  - Use of resource-rich languages to improve the estimation of stochastic models in resource-deficient languages.

II. STORY-SPECIFIC CROSS-LANGUAGE LANGUAGE MODELS

[Figure: system overview. Chinese speech is decoded by an ASR system (acoustic model, Chinese dictionary, baseline Chinese LM) into an automatic transcription. Cross-language story alignment against a large English text collection yields parallel Chinese-English stories; machine translation (GIZA++) trained on a small Ch-En "parallel" corpus turns the aligned English story into a cross-language LM that augments the baseline Chinese LM.]

III. DATABASE

• Hong Kong News Chinese-English parallel text, July 1997 to April 2000
• Document- and sentence-level alignment: CLSP WS01 text summarization
• 18K document pairs

            Chinese       English       Docs
  Vocab     40.8K         38.8K         -
  Train     4.2M          4.3M          16K
  Dev       255K          263K          750
  Eval      177K          182K          682
  OOV       719 (0.4%)    734 (0.4%)    -

IV. BASELINE CHINESE LANGUAGE MODEL ESTIMATION

• Word-based LM after automatic segmentation
• Standard trigram LM with Good-Turing discounting and Katz back-off
• Trained on the Chinese portion of the Hong Kong News parallel text
• Estimation of character perplexity (a sketch follows this list):
  - Calculate log probabilities based on words.
  - Divide the cumulative log probability by the total number of characters.
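Stated as code, the character-perplexity bookkeeping is a short loop. A minimal sketch, assuming `word_logprob` is the trained word trigram returning log10 probabilities and each sentence is a list of segmented words; these names are illustrative, not from the poster:

```python
def character_perplexity(sentences, word_logprob):
    """Character perplexity: accumulate word-level log probabilities,
    then normalize by the number of characters instead of words."""
    total_logprob = 0.0   # cumulative log10 probability over all words
    total_chars = 0       # character count of the same text
    for sentence in sentences:            # each sentence: list of segmented words
        for k, word in enumerate(sentence):
            total_logprob += word_logprob(word, sentence[:k])  # log10 P(w | history)
            total_chars += len(word)      # a Chinese word is a short character string
    return 10.0 ** (-total_logprob / total_chars)
```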

V. CROSS-LANGUAGE LANGUAGE MODEL ESTIMATION

• Assume the document correspondence d_i^C <-> d_i^E is known for Chinese test document d_i^C, and estimate a story-specific unigram model:

    P(c \mid d_i^E) = \sum_{e \in d_i^E} P(c \mid e) \, \hat{P}(e \mid d_i^E)

• Document correspondence obtained by CLIR (a retrieval sketch follows this list):
  - For each Chinese test document d_i^C, create an English bag-of-words based on P(e \mid c).
  - Use it to find the English document with the highest cosine similarity:

    d_i^E = \arg\max_{d_j^E \in D^E} \mathrm{Sim}_{CL}(d_i^C, d_j^E)
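A minimal sketch of this retrieval step, assuming the translation table is a nested dict mapping each Chinese word to a distribution over English words. Raw translated term counts with unweighted cosine similarity are an assumption; the poster does not specify the term weighting.

```python
import math
from collections import Counter

def english_bag_of_words(chinese_doc, p_e_given_c):
    """Spread each Chinese token's count over its translations via P(e | c)."""
    bag = Counter()
    for c in chinese_doc:                 # chinese_doc: list of Chinese words
        for e, prob in p_e_given_c.get(c, {}).items():
            bag[e] += prob
    return bag

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm = math.sqrt(sum(x * x for x in u.values())) \
         * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm > 0.0 else 0.0

def best_english_doc(chinese_doc, english_docs, p_e_given_c):
    """Implements the argmax above: the English document (list of words)
    most similar to the translated query."""
    query = english_bag_of_words(chinese_doc, p_e_given_c)
    return max(english_docs, key=lambda d: cosine(query, Counter(d)))
```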

• Estimation of P(c \mid e) and P(e \mid c):
  - GIZA++: a statistical tool based on IBM Model 4.
  - Input: 16K Chinese-English parallel text documents (sentence aligned).
  - Output: a machine translation system consisting of several tables; only the translation tables P(e \mid c) and P(c \mid e) are used.
• Cross-language LM construction (a sketch follows this list):
  - Build story-specific cross-language LMs P(c \mid d_i^E).
  - Linear interpolation with the baseline trigram LM:

    P(c_k \mid c_{k-1}, c_{k-2}, d_i^E) = \lambda P(c_k \mid d_i^E) + (1 - \lambda) P(c_k \mid c_{k-1}, c_{k-2})
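A sketch of the cross-language unigram and its interpolation with the baseline trigram. Here `p_c_given_e` stands for the GIZA++ translation table loaded as a nested dict and `trigram_prob` for the baseline Katz back-off trigram; taking P̂(e | d_i^E) to be the relative frequency of e in the retrieved document is an assumption consistent with the hat notation above.

```python
from collections import Counter

def cross_language_unigram(english_doc, p_c_given_e):
    """P(c | d_E) = sum_e P(c | e) * P_hat(e | d_E), with P_hat(e | d_E)
    taken as the relative frequency of e in the retrieved English document."""
    counts = Counter(english_doc)
    total = sum(counts.values())
    p_c = Counter()
    for e, n in counts.items():
        for c, prob in p_c_given_e.get(e, {}).items():
            p_c[c] += prob * (n / total)
    return p_c

def interpolated_prob(c, history, p_c_doc, trigram_prob, lam):
    """Linear interpolation of the story-specific unigram with the
    baseline trigram, as in the equation above."""
    return lam * p_c_doc.get(c, 0.0) + (1.0 - lam) * trigram_prob(c, history)
```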

• Comparison with topic-dependent LMs:
  - Topic clustering: unsupervised K-means clustering (a clustering sketch follows this list).
  - Topic-dependent unigram LM:

    P(c_k \mid c_{k-1}, c_{k-2}, t_i) = \lambda P_{t_i}(c_k) + (1 - \lambda) P(c_k \mid c_{k-1}, c_{k-2})

  - Topic-dependent trigram LM:

    P(c_k \mid c_{k-1}, c_{k-2}, t_i) = \lambda P_{t_i}(c_k \mid c_{k-1}, c_{k-2}) + (1 - \lambda) P(c_k \mid c_{k-1}, c_{k-2})
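One plausible reading of "unsupervised K-means clustering" of the training stories is the scikit-learn sketch below; the tf-idf term weighting and the number of topics are assumptions, since the poster states neither.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def cluster_topics(stories, n_topics=50):
    """Assign each training story to one of n_topics clusters; each cluster's
    stories would then be pooled to estimate a topic LM P_{t_i}.
    n_topics=50 is a placeholder, not a value from the poster."""
    vectors = TfidfVectorizer().fit_transform(stories)  # tf-idf is an assumption
    return KMeans(n_clusters=n_topics, n_init=10).fit_predict(vectors)

# stories: one string per training document (whitespace-joined segmented words)
```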

VI. EXPERIMENTAL RESULTS

• LM perplexity results; interpolation weights are determined on the Dev data (one possible tuning procedure is sketched below):

  Language Model                      | Dev PPL       | Eval PPL
                                      | Word   Char   | Word   Char
  Baseline                            | 106    23.7   | 62.5   16.7
  Baseline + CL LM, given d_i^E       | 89.7   21.1   | 51.2   14.6
  Baseline + CLIR LM, guessed d_i^E   | 90.1   21.2   | 51.3   14.6
  Baseline + Topic Unigram            | 94.6   21.9   | 57.4   15.8
  Baseline + Topic Trigram            | 84.4   20.3   | 49.3   14.2
  Baseline + CLIR LM + Topic Trigram  | 80.1   19.6   | 44.6   13.3
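The poster says only that the interpolation weights are determined on the Dev data. A standard procedure (an assumption here, not spelled out on the poster) is EM re-estimation of the mixture weight, which increases the dev-set likelihood at every iteration:

```python
def tune_lambda(dev_probs, iterations=20, lam=0.5):
    """EM re-estimation of the interpolation weight on held-out data.
    dev_probs: one (p_story, p_baseline) pair per Dev token."""
    for _ in range(iterations):
        posterior_sum = 0.0
        for p_story, p_base in dev_probs:
            mix = lam * p_story + (1.0 - lam) * p_base
            if mix > 0.0:
                posterior_sum += lam * p_story / mix  # posterior of the story LM
        lam = posterior_sum / len(dev_probs)          # M-step: average posterior
    return lam
```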

VII. CONCLUSIONS

• Improvements: a 28.6% word-level perplexity reduction on the Eval set (CLIR LM combined with the topic trigram LM)
• A successful integration of MT, CLIR, and LM
• Future work:
  - Cross-language lexical triggers
  - Maximum entropy methods to combine cross-language LMs with other LMs
  - Applications to other tasks: MT, TDT
  - Applications to other resource-deficient languages: Arabic

VIII. ONGOING INVESTIGATION: CHINESE ASR TEST

• Baseline system: CLSP Workshop 2000, Mandarin pronunciation modeling
  - LM training data: People's Daily, Xinhua, China Radio (291M words)
  - Test set: 1997 and 1998 HUB-4NE test sets (1263 utterances, 12K words)
  - Lattice rescoring: first-pass lattices generated from a trigram LM
• No parallel English text exists for this Chinese test set
• CLIR: find the most similar document in the NAB'97 + TDT2 (NYT, APW) corpora (45K documents), querying with either the reference transcription (Ref.) or the first-pass 1-best output

  Language Model                    | PPL    | WER (%) | CER (%)
  Baseline Trigram                  | 263.6  | 46.1    | 25.1
  Baseline + CLIR with Ref.         | 231.1  | 45.3    | 24.7
  Baseline + CLIR with 1-Best List  | 236.8  | 45.4    | 24.7