Using Cross-Language Cues for Story-Specific Language Modeling

SANJEEV KHUDANPUR AND WOOSUNG KIM
Center for Language and Speech Processing / Johns Hopkins University
{sanjeev, woosung}@clsp.jhu.edu

I. INTRODUCTION

Motivation:
- Multilingual, multi-domain, multi-source linguistic resources are available, but most are concentrated on popular languages such as English, French, and German; resources for Chinese, Arabic, and other languages are comparatively small.
- Stochastic models require a large amount of data for training.
- How can stochastic models be constructed in resource-deficient languages? Bootstrap by reusing models from resource-rich languages, e.g.:
  - a universal phone set for ASR;
  - exploiting parallel texts to project morphological analyzers, POS taggers, etc.

We present:
- An approach to sharpening an LM in a resource-deficient language using comparable text from resource-rich languages.
- Story-specific language models built from parallel text.
- An integration of machine translation (MT), cross-language information retrieval (CLIR), and language modeling (LM).

II. STORY-SPECIFIC CROSS-LANGUAGE LANGUAGE MODELS

Chinese speech is first recognized using the baseline Chinese acoustic and language models. Cross-language story alignment then links the automatic transcription to the most similar story in a large English text collection; machine translation (GIZA++), trained on a small Chinese-English "parallel" corpus, converts that English story into a story-specific cross-language LM.

III. DATABASE

Hong Kong News text corpus, July 1997 to April 2000: Chinese-English parallel text with document- and sentence-level alignment (CLSP WS01 text summarization); 18K document pairs.

            Words (Ch)   Words (En)   Docs
  Vocab     40.8K        38.8K        -
  Train     4.2M         4.3M         16K
  Dev       255K         263K         750
  Eval      177K         182K         682
  OOV       719 (0.4%)   734 (0.4%)   -

IV. BASELINE CHINESE LANGUAGE MODEL ESTIMATION

- Word-based LM after automatic segmentation.
- Standard trigram LM with Good-Turing discounting and Katz back-off.
- Trained on the Chinese portion of the Hong Kong News parallel text.
- Estimation of character perplexity: compute log-probabilities over words, then divide the cumulative log-probability by the total number of characters.

V. CROSS-LANGUAGE LANGUAGE MODEL ESTIMATION

Assume the document correspondence d_i^C <-> d_i^E is known for a Chinese test document d_i^C. Then

    P(c | d_i^E) = sum_{e in E} P(c | e) * P^(e | d_i^E)

Document correspondence obtained by CLIR:
- For each Chinese test document d_i^C, create an English bag of words based on P(e | c).
- Use it to find the English document with the highest cosine similarity:

    d_i^E = argmax_{d_j^E in D^E} Sim_CL(d_i^C, d_j^E)

Estimation of P(c | e) and P(e | c): GIZA++ translation tables.
- GIZA++ is a statistical machine translation tool based on IBM Model 4.
- Input: 16K Chinese-English parallel text documents (sentence-aligned).
- Output: a machine translation system consisting of several tables; only the translation tables P(e | c) and P(c | e) are used.

Cross-language LM construction:
- Build story-specific cross-language LMs P(c | d_i^E).
- Linearly interpolate with the baseline trigram LM:

    P(c_k | c_{k-1}, c_{k-2}, d_i^E) = lambda * P(c_k | d_i^E) + (1 - lambda) * P(c_k | c_{k-1}, c_{k-2})

Comparison with topic-dependent LMs:
- Topic clustering: unsupervised K-means clustering.
- Topic-dependent unigram and trigram LMs:

    P(c_k | c_{k-1}, c_{k-2}, t_i) = lambda * P_{t_i}(c_k) + (1 - lambda) * P(c_k | c_{k-1}, c_{k-2})
    P(c_k | c_{k-1}, c_{k-2}, t_i) = lambda * P_{t_i}(c_k | c_{k-1}, c_{k-2}) + (1 - lambda) * P(c_k | c_{k-1}, c_{k-2})

VI. EXPERIMENTAL RESULTS

LM perplexity results; interpolation weights are determined on the Dev data.

  Language Model                        Dev PPL        Eval PPL
                                        Word   Char    Word   Char
  Baseline Trigram                      106    23.7    62.5   16.7
  Baseline + CL LM, given d_i^E         89.7   21.1    51.2   14.6
  Baseline + CLIR LM, guessed d_i^E     90.1   21.2    51.3   14.6
  Baseline + Topic Unigram              94.6   21.9    57.4   15.8
  Baseline + Topic Trigram              84.4   20.3    49.3   14.2
  Baseline + CLIR LM + Topic Trigram    80.1   19.6    44.6   13.3

VII. CONCLUSIONS

- Resource-rich languages can be used to improve the estimation of stochastic models in resource-deficient languages.
- Improvements: a 28.6% word-level perplexity reduction (combined with the topic trigram LM).
- A successful integration of MT, CLIR, and LM.
- Future work:
  - cross-language lexical triggers;
  - maximum-entropy methods to combine cross-language LMs with other LMs;
  - applications to other tasks (MT, TDT) and to other resource-deficient languages (e.g. Arabic).

VIII. ONGOING INVESTIGATION: CHINESE ASR TEST

- Baseline system: CLSP Workshop 2000, Mandarin pronunciation modeling.
- LM training data: People's Daily, Xinhua, China Radio; 291M words.
- Test set: 1997, 1998 HUB-4NE test set; 1,263 utterances, 12K words.
- Lattice rescoring: first-pass lattices generated from a bigram LM.
- No parallel English text exists for this Chinese test set.
- CLIR: find the most similar document in the NAB'97 + TDT2 (NYT, APW) corpora (45K docs).

  Language Model                     PPL     WER (%)   CER (%)
  Baseline Trigram                   263.6   46.1      25.1
  Baseline + CLIR with Ref.          231.1   45.3      24.7
  Baseline + CLIR with 1-Best List   236.8   45.4      24.7

Presented in.
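The story-specific cross-language LM marginalizes a translation table over the retrieved English story: P(c | d_E) = sum_e P(c | e) * P(e | d_E), with P(e | d_E) estimated as the relative frequency of e in the story. A minimal Python sketch of this computation, using a toy translation table in place of real GIZA++ output (the function name and data layout below are illustrative, not from the poster):

```python
from collections import defaultdict

def cross_language_unigram(english_doc, p_c_given_e):
    """Story-specific Chinese unigram: P(c | d_E) = sum_e P(c | e) * P(e | d_E).

    english_doc: list of English tokens in the retrieved story d_E.
    p_c_given_e: translation table mapping an English word e to a dict
                 {chinese_word: P(c | e)}, e.g. derived from GIZA++.
    """
    # Empirical distribution P(e | d_E): relative frequency in the story.
    n = len(english_doc)
    p_e = defaultdict(float)
    for e in english_doc:
        p_e[e] += 1.0 / n

    # Marginalize over English words to obtain P(c | d_E).
    p_c = defaultdict(float)
    for e, pe in p_e.items():
        for c, pce in p_c_given_e.get(e, {}).items():
            p_c[c] += pce * pe
    return dict(p_c)
```

If every English word in the story has a normalized translation row, the result is itself a proper distribution over Chinese words, ready for linear interpolation with the baseline trigram.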
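The CLIR step can be sketched the same way: project the Chinese test document into an English bag of words via P(e | c), then select the candidate English document with the highest cosine similarity. This is a simplified sketch under assumed names; a real system would typically apply TF-IDF weighting rather than raw expected counts:

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse word-weight vectors (dicts)."""
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def retrieve_english_story(chinese_doc, p_e_given_c, english_docs):
    """Return the index of the English document most similar to the
    English bag of words projected from the Chinese doc via P(e | c)."""
    query = Counter()
    for c in chinese_doc:
        for e, p in p_e_given_c.get(c, {}).items():
            query[e] += p  # expected English word counts
    sims = [cosine(query, Counter(d)) for d in english_docs]
    return max(range(len(english_docs)), key=lambda j: sims[j])
```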
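The character-perplexity estimate described in Section IV (word-level log-probabilities normalized by the total character count rather than the word count) can be written as a one-liner. Base-2 logarithms are an assumption here; the poster does not state the base:

```python
def character_perplexity(word_logprobs, words):
    """Character-level perplexity from word-level log2-probabilities:
    sum the per-word log-probs, then normalize by the total number of
    characters instead of the number of words."""
    total_chars = sum(len(w) for w in words)
    return 2.0 ** (-sum(word_logprobs) / total_chars)
```

This is how a word-based Chinese LM can be compared fairly against character-based models: the same cumulative log-probability is simply renormalized per character.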
