USING CROSS-LANGUAGE CUES FOR STORY-SPECIFIC LANGUAGE MODELING

SANJEEV KHUDANPUR AND WOOSUNG KIM
Center for Language and Speech Processing, Johns Hopkins University
{sanjeev, woosung}@clsp.jhu.edu

I. INTRODUCTION

• Motivation:
  - Multilingual, multi-domain, multi-source linguistic resources are available.
  - Most resources are concentrated on popular languages such as English, French, and German.
  - Relatively small resources exist for Chinese, Arabic, and other languages.
  - Stochastic models require a large amount of data for training.
  - How can stochastic models be constructed in resource-deficient languages? Bootstrap by reusing models from resource-rich languages, e.g.:
    - a universal phone set for ASR;
    - exploiting parallel texts to project morphological analyzers, POS taggers, etc.

• We present:
  - An approach to sharpen an LM in a resource-deficient language using comparable text from resource-rich languages.
  - Story-specific language models built from parallel text.
  - An integration of machine translation (MT), cross-language information retrieval (CLIR), and language modeling (LM).
  - Use of resource-rich languages to improve the estimation of stochastic models in resource-deficient languages.

II. STORY-SPECIFIC CROSS-LANGUAGE LANGUAGE MODELS

[Figure: system overview. Chinese speech is decoded by an ASR system (acoustic model, Chinese dictionary, baseline Chinese LM) into an automatic transcription. Cross-language story alignment against a large English text collection yields parallel Chinese-English stories; machine translation (GIZA++) trained on a small Ch-En "parallel" corpus turns the aligned English story into a cross-language LM that augments the baseline Chinese LM.]

III. DATABASE

• Hong Kong News Chinese-English parallel text, July 1997 to April 2000
• Document- and sentence-level alignment: CLSP WS01 text summarization
• 18K document pairs

            Chinese       English       Docs
  Vocab     40.8K         38.8K         -
  Train     4.2M          4.3M          16K
  Dev       255K          263K          750
  Eval      177K          182K          682
  OOV       719 (0.4%)    734 (0.4%)    -

IV. BASELINE CHINESE LANGUAGE MODEL ESTIMATION

• Word-based LM after automatic segmentation
• Standard trigram LM with Good-Turing discounting and Katz back-off
• Trained on the Chinese portion of the Hong Kong News parallel text
• Estimation of character perplexity (a sketch follows this list):
  - Calculate log probabilities based on words.
  - Divide the cumulative log probability by the total number of characters.
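Stated as code, the character-perplexity bookkeeping is a short loop. A minimal sketch, assuming `word_logprob` is the trained word trigram returning log10 probabilities and each sentence is a list of segmented words; these names are illustrative, not from the poster:

```python
def character_perplexity(sentences, word_logprob):
    """Character perplexity: accumulate word-level log probabilities,
    then normalize by the number of characters instead of words."""
    total_logprob = 0.0   # cumulative log10 probability over all words
    total_chars = 0       # character count of the same text
    for sentence in sentences:            # each sentence: list of segmented words
        for k, word in enumerate(sentence):
            total_logprob += word_logprob(word, sentence[:k])  # log10 P(w | history)
            total_chars += len(word)      # a Chinese word is a short character string
    return 10.0 ** (-total_logprob / total_chars)
```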

V. CROSS-LANGUAGE LANGUAGE MODEL ESTIMATION

• Assume the document correspondence d_i^C <-> d_i^E is known for Chinese test document d_i^C, and estimate a story-specific unigram model:

    P(c \mid d_i^E) = \sum_{e \in d_i^E} P(c \mid e) \, \hat{P}(e \mid d_i^E)

• Document correspondence obtained by CLIR (a retrieval sketch follows this list):
  - For each Chinese test document d_i^C, create an English bag-of-words based on P(e \mid c).
  - Use it to find the English document with the highest cosine similarity:

    d_i^E = \arg\max_{d_j^E \in D^E} \mathrm{Sim}_{CL}(d_i^C, d_j^E)
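A minimal sketch of this retrieval step, assuming the translation table is a nested dict mapping each Chinese word to a distribution over English words. Raw translated term counts with unweighted cosine similarity are an assumption; the poster does not specify the term weighting.

```python
import math
from collections import Counter

def english_bag_of_words(chinese_doc, p_e_given_c):
    """Spread each Chinese token's count over its translations via P(e | c)."""
    bag = Counter()
    for c in chinese_doc:                 # chinese_doc: list of Chinese words
        for e, prob in p_e_given_c.get(c, {}).items():
            bag[e] += prob
    return bag

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm = math.sqrt(sum(x * x for x in u.values())) \
         * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm > 0.0 else 0.0

def best_english_doc(chinese_doc, english_docs, p_e_given_c):
    """Implements the argmax above: the English document (list of words)
    most similar to the translated query."""
    query = english_bag_of_words(chinese_doc, p_e_given_c)
    return max(english_docs, key=lambda d: cosine(query, Counter(d)))
```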

• Estimation of P(c \mid e) and P(e \mid c):
  - GIZA++: a statistical tool based on IBM Model 4.
  - Input: 16K Chinese-English parallel text documents (sentence aligned).
  - Output: a machine translation system consisting of several tables; only the translation tables P(e \mid c) and P(c \mid e) are used.
• Cross-language LM construction (a sketch follows this list):
  - Build story-specific cross-language LMs P(c \mid d_i^E).
  - Linear interpolation with the baseline trigram LM:

    P(c_k \mid c_{k-1}, c_{k-2}, d_i^E) = \lambda P(c_k \mid d_i^E) + (1 - \lambda) P(c_k \mid c_{k-1}, c_{k-2})
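A sketch of the cross-language unigram and its interpolation with the baseline trigram. Here `p_c_given_e` stands for the GIZA++ translation table loaded as a nested dict and `trigram_prob` for the baseline Katz back-off trigram; taking P̂(e | d_i^E) to be the relative frequency of e in the retrieved document is an assumption consistent with the hat notation above.

```python
from collections import Counter

def cross_language_unigram(english_doc, p_c_given_e):
    """P(c | d_E) = sum_e P(c | e) * P_hat(e | d_E), with P_hat(e | d_E)
    taken as the relative frequency of e in the retrieved English document."""
    counts = Counter(english_doc)
    total = sum(counts.values())
    p_c = Counter()
    for e, n in counts.items():
        for c, prob in p_c_given_e.get(e, {}).items():
            p_c[c] += prob * (n / total)
    return p_c

def interpolated_prob(c, history, p_c_doc, trigram_prob, lam):
    """Linear interpolation of the story-specific unigram with the
    baseline trigram, as in the equation above."""
    return lam * p_c_doc.get(c, 0.0) + (1.0 - lam) * trigram_prob(c, history)
```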

• Comparison with topic-dependent LMs:
  - Topic clustering: unsupervised K-means clustering (a clustering sketch follows this list).
  - Topic-dependent unigram LM:

    P(c_k \mid c_{k-1}, c_{k-2}, t_i) = \lambda P_{t_i}(c_k) + (1 - \lambda) P(c_k \mid c_{k-1}, c_{k-2})

  - Topic-dependent trigram LM:

    P(c_k \mid c_{k-1}, c_{k-2}, t_i) = \lambda P_{t_i}(c_k \mid c_{k-1}, c_{k-2}) + (1 - \lambda) P(c_k \mid c_{k-1}, c_{k-2})
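One plausible reading of "unsupervised K-means clustering" of the training stories is the scikit-learn sketch below; the tf-idf term weighting and the number of topics are assumptions, since the poster states neither.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def cluster_topics(stories, n_topics=50):
    """Assign each training story to one of n_topics clusters; each cluster's
    stories would then be pooled to estimate a topic LM P_{t_i}.
    n_topics=50 is a placeholder, not a value from the poster."""
    vectors = TfidfVectorizer().fit_transform(stories)  # tf-idf is an assumption
    return KMeans(n_clusters=n_topics, n_init=10).fit_predict(vectors)

# stories: one string per training document (whitespace-joined segmented words)
```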

VI. EXPERIMENTAL RESULTS

• LM perplexity results; interpolation weights are determined on the Dev data (one possible tuning procedure is sketched below):

  Language Model                      | Dev PPL       | Eval PPL
                                      | Word   Char   | Word   Char
  Baseline                            | 106    23.7   | 62.5   16.7
  Baseline + CL LM, given d_i^E       | 89.7   21.1   | 51.2   14.6
  Baseline + CLIR LM, guessed d_i^E   | 90.1   21.2   | 51.3   14.6
  Baseline + Topic Unigram            | 94.6   21.9   | 57.4   15.8
  Baseline + Topic Trigram            | 84.4   20.3   | 49.3   14.2
  Baseline + CLIR LM + Topic Trigram  | 80.1   19.6   | 44.6   13.3
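The poster says only that the interpolation weights are determined on the Dev data. A standard procedure (an assumption here, not spelled out on the poster) is EM re-estimation of the mixture weight, which increases the dev-set likelihood at every iteration:

```python
def tune_lambda(dev_probs, iterations=20, lam=0.5):
    """EM re-estimation of the interpolation weight on held-out data.
    dev_probs: one (p_story, p_baseline) pair per Dev token."""
    for _ in range(iterations):
        posterior_sum = 0.0
        for p_story, p_base in dev_probs:
            mix = lam * p_story + (1.0 - lam) * p_base
            if mix > 0.0:
                posterior_sum += lam * p_story / mix  # posterior of the story LM
        lam = posterior_sum / len(dev_probs)          # M-step: average posterior
    return lam
```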

VII. CONCLUSIONS

• Improvements: a 28.6% word-level perplexity reduction on the Eval set (CLIR LM combined with the topic trigram LM)
• A successful integration of MT, CLIR, and LM
• Future work:
  - Cross-language lexical triggers
  - Maximum entropy methods to combine cross-language LMs with other LMs
  - Applications to other tasks: MT, TDT
  - Applications to other resource-deficient languages: Arabic

VIII. ONGOING INVESTIGATION: CHINESE ASR TEST

• Baseline system: CLSP Workshop 2000, Mandarin pronunciation modeling
  - LM training data: People's Daily, Xinhua, China Radio (291M words)
  - Test set: 1997 and 1998 HUB-4NE test sets (1263 utterances, 12K words)
  - Lattice rescoring: first-pass lattices generated from a trigram LM
• No parallel English text exists for this Chinese test set
• CLIR: find the most similar document in the NAB'97 + TDT2 (NYT, APW) corpora (45K documents), querying with either the reference transcription (Ref.) or the first-pass 1-best output

  Language Model                    | PPL    | WER (%) | CER (%)
  Baseline Trigram                  | 263.6  | 46.1    | 25.1
  Baseline + CLIR with Ref.         | 231.1  | 45.3    | 24.7
  Baseline + CLIR with 1-Best List  | 236.8  | 45.4    | 24.7