Rapid Unsupervised Topic Adaptation – a Latent Semantic Approach
Yik-Cheung Tam
CMU-LTI-09-016

Language Technologies Institute
School of Computer Science, Carnegie Mellon University
5000 Forbes Ave., Pittsburgh, PA 15213
www.lti.cs.cmu.edu

Thesis Committee:
Tanja Schultz (Chair)
Alex Waibel
Stephan Vogel
Sanjeev P. Khudanpur, Johns Hopkins University

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Language and Information Technologies

© 2009, Yik-Cheung Tam

To Cindy, and my family.

Abstract

In open-domain language exploitation applications, a wide variety of topics with swift topic shifts has to be captured. Consequently, it is crucial to rapidly adapt all language components of a spoken language system. This thesis addresses unsupervised topic adaptation in both monolingual and crosslingual settings. For automatic speech recognition, we rapidly adapt a language model of a source language. For statistical machine translation, we adapt a language model of a target language, a translation lexicon and a phrase table using a source text.

For monolingual adaptation, we propose latent Dirichlet-Tree allocation for Bayesian latent semantic analysis. Our model enables rapid incremental language model adaptation via caching the fractional topic counts of word hypotheses decoded from previous speech utterances. Latent Dirichlet-Tree allocation models topic correlation in a tree-based hierarchy and thus addresses the model initialization issue. To address the "bag-of-word" assumption in latent semantic analysis, we extend our approach to N-gram latent Dirichlet-Tree allocation. We investigate a fractional Kneser-Ney smoothing approach to handle fractional counts for topic models. The algorithm produces a more compact model compared to Witten-Bell smoothing. Using multi-stage language model adaptation via N-gram latent Dirichlet-Tree allocation, we achieve significant reductions in speech recognition errors with our large-scale GALE systems on two different languages: Mandarin and Arabic. For end-to-end translation of speech inputs, applying topic adaptation to automatic speech recognition is beneficial to translation performance.

For crosslingual adaptation, we propose bilingual latent semantic analysis for statistical machine translation. A key feature of bilingual latent semantic analysis is a one-to-one topic correspondence between the models of a source and a target language. Since topical information is language independent, our model enables the transfer of a topic distribution inferred from a source text to a target language for crosslingual adaptation. Our approach has two advantages: first, it can be applied before translation, and thus has an immediate impact on translation. Secondly, it does not rely on an initial translation output for adaptation, and

therefore does not suffer from translation errors. Together with N-gram latent Dirichlet-Tree allocation on a target language, we achieve significant improvement in translation performance using our large-scale GALE systems for text translation.

A limitation of bilingual latent semantic analysis is the requirement of parallel corpora, which are relatively expensive to collect. We propose a semi-supervised approach to incorporate non-parallel documents into model training. We achieve improvement in crosslingual language model adaptation performance, especially when bilingual resources are deficient.

Acknowledgments

First and foremost, thanks to my advisor, Tanja Schultz, for offering me a chance to learn and grow in the ISL/InterACT lab. I express my deep appreciation for her support, guidance and freedom to work on my thesis. Many thanks to Tanja Schultz, Stephan Vogel, Alex Waibel, and Sanjeev Khudanpur for reading my thesis. Their valuable comments have improved the quality of my thesis.

Special thanks to Thomas Schaaf for teaching me how to incorporate C++ modules into our IBIS speech decoder. I am very thankful for the opportunity to learn from Thomas, especially during the RT04 and TC-STAR evaluations. Many thanks to Christian Fügen, Sebastian Stüker, Florian Metze, Hua Yu and Stan Jou for sharing their experience on the decoder. I would like to thank Stephan Vogel for his advice on the STTK decoder for statistical machine translation. I am very grateful to Ian Lane for his insightful comments on my research and his contribution to our spoken language translation system for our ACL paper. My work has benefited from his suggestions and encouragement.

Many thanks to my colleagues for offering help and pointers over the past years: Mohamed Noamany, Mark Fuhs, Roger Hsiao, Udhyakumar Nallasamy and Qin Jin for my experiments on automatic speech recognition; Silja Hildebrand, Qin Gao, Nguyen Bach, Thuylinh Nguyen, Matthias Eck, Matthias Paulik, Anthony D'Auria, Bing Zhao and Joy Zhang for my experiments on statistical machine translation. Many results in this thesis would not be possible without their help. Special thanks to Freya Fridy, Ian Lane, Isaac Harris, Lisa Mauti, Matthias Eck, Matthias Paulik and Roger Hsiao for evaluating the quality of my translation output. Thanks to Isaac Harris for his faithful service and effective maintenance of the computing environment in our lab. Thanks to Lisa Mauti and Kelly Widmaier for making a happier workplace filled with fun parties, and Jie Yang for organizing dumpling parties during the Chinese New Year celebration.

I am deeply grateful to Ciprian Chelba and Milind Mahajan for mentoring me at the speech group of Microsoft Research in summer 2003. Their passion and guidance have

had a significant impact on my research direction. I am indebted to Brian Mak, my professor at the Hong Kong University of Science and Technology, for his guidance in the past.

Roger Hsiao has been a faithful friend and confidant. Thank you for your friendship and the discussions on my research. I enjoyed the great company of Fei Huang and Wen Wu as my squash and jogging partners, and of Jing-Hao Fei for sharing his interesting life experiences with me. Thank you for the friendship of my LTI classmates, including Antoine Raux, Banerjee Satanjeev, Betty Cheng, Brian Langner, Jason Zhang, Jean Oh, June Sisson, Robert Chen, Pam Suebvisai, Peter Suen and Yifen Huang. I am grateful for the love and support from the brothers and sisters of the Antioch cell group, including Steve Sheng and Chiapi Chao, Yunghui Li and Hwaning Lee, Roger Hsiao and Rui Zhao, Eugene Ooi and Meiru Ooi, Jason Tong and Ying Liu, Steve Sun, Chiachi Chuang, Jen Lee, Kai-Zhe Huang, Francis Hong, Mike Tang and Mary Kung. Special thanks to Yunghui for bringing me into the big family, and to Steve Sheng for his friendship and mentorship.

Finally, I would like to thank my family for their love and encouragement. I am very grateful to my mom for her love and patience over the past years. Thank you with all my heart!

Abbreviations

AM      Acoustic Model
ASR     Automatic Speech Recognition
BC      Broadcast Conversation
BN      Broadcast News
CER     Character Error Rate
EM      Expectation Maximization
FSA     Feature Space Adaptation
GALE    Global Autonomous Language Exploitation
GMM     Gaussian Mixture Model
HMM     Hidden Markov Model
LDA     Latent Dirichlet Allocation
LDC     Linguistic Data Consortium
LDTA    Latent Dirichlet-Tree Allocation
LM      Language Model
LSA     Latent Semantic Analysis
bLSA    Bilingual Latent Semantic Analysis
LSI     Latent Semantic Indexing
pLSA    Probabilistic Latent Semantic Analysis
MAP     Maximum a Posteriori
MERT    Minimum Error Rate Training
MFCC    Mel-Frequency Cepstral Coefficient
MLE     Maximum Likelihood Estimation
MLLR    Maximum Likelihood Linear Regression
fMLLR   Feature Maximum Likelihood Linear Regression
MMIE    Maximum Mutual Information Estimation
SAT     Speaker Adaptive Training
SMT     Statistical Machine Translation
SVD     Singular Value Decomposition
VTLN    Vocal Tract Length Normalization
WER     Word Error Rate

Contents

Abstract
Acknowledgments
Abbreviations

1 Introduction
  1.1 Motivation
  1.2 Proposed Research
    1.2.1 Monolingual Adaptation
    1.2.2 Crosslingual Adaptation
  1.3 Thesis Organization

2 Background
  2.1 Variational Bayes
  2.2 Latent Semantic Analysis
    2.2.1 Latent Semantic Indexing
    2.2.2 Probabilistic Latent Semantic Analysis
    2.2.3 Latent Dirichlet Allocation
    2.2.4 Correlated Topic Model
    2.2.5 Pachinko Allocation
    2.2.6 Bigram Latent Semantic Analysis
    2.2.7 Sentence-Level Topic Mixtures
  2.3 Bilingual Latent Semantic Analysis
    2.3.1 Bilingual Latent Semantic Indexing
    2.3.2 Bilingual Topic Admixture Model
  2.4 Language Model Smoothing
    2.4.1 Kneser-Ney Smoothing
    2.4.2 Witten-Bell Smoothing
  2.5 Unsupervised Language Model Adaptation
    2.5.1 Word Caching
    2.5.2 Word Triggering
    2.5.3 Marginal Adaptation
  2.6 Summary

3 Baseline Transcription and Translation Systems
  3.1 Background
  3.2 Mandarin Transcription
    3.2.1 Chinese-Specific Issues
    3.2.2 Audio Segmentation and Speaker Clustering
    3.2.3 Acoustic Modeling
    3.2.4 Language Modeling, Text Data, Normalization
    3.2.5 Pronunciation Lexicon
    3.2.6 Decoding Strategy
  3.3 Arabic Transcription
    3.3.1 Arabic-Specific Issues
    3.3.2 Decoding Strategy
    3.3.3 Performance Metrics
    3.3.4 Evaluation Sets
  3.4 Statistical Machine Translation
    3.4.1 Basic Components
    3.4.2 Word Alignment
    3.4.3 Phrase Extraction
    3.4.4 Minimum Error Rate Training
    3.4.5 Language Modeling
    3.4.6 Performance Metrics
    3.4.7 Chinese-To-English Translation System
  3.5 Summary

4 Monolingual N-gram LSA Based Language Model Adaptation
  4.1 Topic Caching
    4.1.1 Experiment
    4.1.2 Results
    4.1.3 Optimal Number of Topics
  4.2 Latent Dirichlet-Tree Allocation
    4.2.1 Experiment
    4.2.2 Training Convergence
    4.2.3 Effect of Dirichlet-Tree Structure
    4.2.4 Results
  4.3 Incremental Marginal Adaptation
    4.3.1 Experiment
    4.3.2 Results
  4.4 N-gram Latent Dirichlet-Tree Allocation
    4.4.1 Model Training
    4.4.2 Fractional Kneser-Ney Smoothing
    4.4.3 Two-stage Unsupervised Language Model Adaptation
    4.4.4 Experiment
    4.4.5 RT04 Mandarin Results
    4.4.6 GALE-P3 Results
    4.4.7 Discussion
    4.4.8 Practical Issues
  4.5 Summary

5 Bilingual N-gram LSA Based Adaptation
  5.1 Bilingual Latent Semantic Analysis
    5.1.1 Bilingual LSA Training
    5.1.2 Experiment
    5.1.3 Results
  5.2 Crosslingual Language Model Adaptation
    5.2.1 Experiment
    5.2.2 Results
  5.3 Translation Lexicon Adaptation
    5.3.1 Phrase Table Adaptation
    5.3.2 Results
  5.4 Text Translation Results
    5.4.1 Human Evaluation
    5.4.2 Discussion
  5.5 End-to-End Translation
    5.5.1 Optimal Number of Topics
    5.5.2 Results
  5.6 Non-Parallel Bilingual Latent Semantic Analysis
    5.6.1 Parallel Clusters
    5.6.2 Cluster-based Bilingual LSA Training
    5.6.3 Experiment
    5.6.4 Results
  5.7 Summary

6 Conclusions
  6.1 Contributions
  6.2 Summary of Results
  6.3 Future Challenges and Potentials

A Gibbs Sampling
  A.1 Latent Dirichlet Allocation
  A.2 Bigram Topic Model

B Latent Dirichlet-Tree Allocation
  B.1 Variational Multinomials
  B.2 Variational Dirichlet
  B.3 Conditional Multinomials
  B.4 Dirichlet Node in a Dirichlet-Tree
  B.5 Alternative Proof

Bibliography

List of Figures

1.1 A unified topic adaptation framework for speech translation.
2.1 Graphical representation of probabilistic latent semantic analysis.
2.2 Graphical representation of latent Dirichlet allocation.
2.3 Pachinko allocation employs a directed acyclic graph of Dirichlet nodes as a topic prior.
2.4 Graphical representation of bigram LSA. Adjacent words in a document are linked together to form a Markov chain from left to right.
2.5 Graphical representation of the sentence-level bigram topic mixture model.
2.6 Graphical representation of the bilingual topic admixture model. S and M denote the number of parallel sentences and the number of documents in the parallel corpora respectively. I and J denote the number of words in a target sentence and a source sentence respectively. a is the word alignment variable for f.
3.1 Block diagram of automatic speech recognition.
4.1 Perplexity (left) and character error rate (right) for topic and word caching on CCTV of the RT04 test set.
4.2 Left: Dirichlet-Tree prior of depth 2: each internal node is represented by a Dirichlet distribution over the branches. Right: Variational E-step as bottom-up propagation and summation of fractional topic counts.
4.3 Training log-likelihood of latent Dirichlet allocation (LDA) and latent Dirichlet-Tree allocation (LDTA) using the Xinhua News 2002 corpora (13M words).
4.4 Training log-likelihood of latent Dirichlet-Tree allocation with different numbers of branches in a Dirichlet node using the Xinhua News 2002 corpora (13M words).
4.5 Graphical model representation of a trigram LSA.
4.6 Fractional Kneser-Ney smoothing via propagation of discounts from a trigram language model to a lower-order bigram and a unigram language model.
4.7 Relative reduction in character error rate after bigram-LSA rescoring on the Mandarin Dev08 development set.
4.8 Relative reduction in character error rate after bigram-LSA rescoring on the Mandarin Eval07u (unsequestered) test set.
4.9 Relative reduction in character error rate after bigram-LSA rescoring on the Mandarin Eval07r (retest) test set.
4.10 Relative reduction in word error rate after bigram-LSA rescoring on the Arabic Dev07 development set.
4.11 Relative reduction in word error rate after bigram-LSA rescoring on the Arabic Dev08 set (unseen).
4.12 Relative reduction in word error rate after bigram-LSA rescoring on the Arabic Eval07u (unsequestered) test set.
4.13 Overall performance summary after applying the proposed unsupervised language model adaptation for the large-scale GALE-P3 evaluation on Mandarin and Arabic.
5.1 Bilingual LSA-based adaptation via transfer of the topic distribution from a source language to a target language for speech translation.
5.2 LSA bootstrapping via sharing of variational topic posteriors for parallel documents.
5.3 Training log likelihood of English LSA bootstrapped from Chinese LSA compared to flat monolingual English LSA.
5.4 Relative BLEU improvement of LSA adaptation compared to the unadapted baseline per document on MT06 using the medium-scale SMT system (500M).
5.5 Relative BLEU improvement of LSA adaptation compared to the unadapted baseline per document on MT06 using the GALE-P2.5 SMT system (2.7B).
5.6 Parallel clusters formed by monolingual documents (black dots) using M parallel seed documents.
5.7 Overall translation performance summary after crosslingual adaptation using the GALE-P2.5 system for text translation.
5.8 Overall translation performance summary after crosslingual adaptation using the GALE-P2.5 system for end-to-end translation.

List of Tables

3.1 Demi-syllables for Initial-Final modeling for Mandarin Chinese.
3.2 Acoustic model training data for Mandarin transcription.
3.3 Acoustic model configurations of the GALE Mandarin transcription systems. * denotes that (boosted) MMIE and genre-dependent modeling were applied to the GALE-P2 and P3 systems only.
3.4 Language model training data for Mandarin transcription.
3.5 Size of the acoustic model (AM) and language model (LM) training corpora and the size of the vocabulary for the Arabic transcription system, in terms of number of hours and word tokens.
3.6 Statistics of the Mandarin RT04 test set.
3.7 Statistics of the development and test sets for the GALE evaluations from phase 2 (P2) to phase 3 (P3). "Eval07u" and "Eval07r" stand for the unsequestered and re-test portions of Eval07 respectively.
3.8 Sources of the GALE Mandarin development/test sets.
3.9 Sources of the GALE Arabic development/test sets.
3.10 Size of the language model training corpora and the parallel training corpora for phrase extraction, in terms of number of words.
3.11 Statistics of the development sets and the test sets for statistical machine translation, containing newsgroup (NG), newswire (NW) and broadcast news (BN). Eval07u.BN stands for the unsequestered BN portion of Eval07 for speech translation. A confusion network (CN) is used to represent multiple translation options of a target phrase in the Eval07 test set.
3.12 Configurations of the baseline statistical machine translation systems.
4.1 Sample latent topics extracted from latent Dirichlet allocation.
4.2 Character error rate (%) on the RT04 test set with different numbers of topics in latent Dirichlet allocation using the GALE-P1-dryrun Mandarin transcription system. Overall relative reduction (Rel. ∆) compared to the unadapted baseline is reported. * denotes that the approach is statistically significant at the ≤5% significance level compared to the unadapted baseline.
4.3 Sample contiguous fragment of latent topics extracted from latent Dirichlet-Tree allocation.
4.4 Marginal adaptation results on character error rate (word perplexity) on the RT04 test set using the small-scale Mandarin RT04 ASR system (13M). Latent Dirichlet allocation (LDA) and latent Dirichlet-Tree allocation (LDTA) were compared. Overall relative reduction (Rel. ∆) compared to the unadapted baseline is reported. * denotes that the approach is statistically significant at the ≤5% significance level compared to the unadapted baseline.
4.5 Marginal adaptation results on character error rate (word perplexity) on the RT04 test set using the Mandarin GALE-P1 ASR system (800M). Latent Dirichlet allocation (LDA) and latent Dirichlet-Tree allocation (LDTA) were compared. Overall relative reduction (Rel. ∆) compared to the unadapted baseline is reported. * denotes that the approach is statistically significant at the ≤5% significance level compared to the unadapted baseline.
4.6 Character error rate (%) after applying LSA for decoding (denoted as LSA decode) using the GALE-P2 Mandarin transcription system, followed by incremental marginal adaptation (rescore) for lattice rescoring after cross-adapting with the IBM Mandarin transcription system on the Dev07 and Eval06 test sets. Overall relative reduction (Rel. ∆) compared to the unadapted baseline is reported. * denotes that the approach is significantly better than the unadapted baseline at the ≤5% level of significance.
4.7 Correlated bigram topics extracted from bigram LSA using the Xinhua News 2002 corpora (13M).
4.8 Character error rate (word perplexity) on the Mandarin RT04 test set. Bigram LSA (biLSA) was applied in addition to LSA. Unless specified, the LSA and bigram LSA models employ 200 topics. Overall relative reduction (Rel. ∆) compared to the unadapted baseline is reported. * denotes that bigram LSA is significantly better than LSA at the 0.1% significance level in terms of overall character error rate.
4.9 Lattice rescoring results in character error rate using the Mandarin GALE P3 system. Overall relative reduction (Rel. ∆) compared to the unadapted baseline (background) is reported. * denotes that bigram LSA (biLSA) and trigram LSA (triLSA) are significantly better than LSA at the ≤5% level of significance.
4.10 Lattice rescoring results in character error rate using the word lattices from the IBM P3 Mandarin system. Overall relative reduction (Rel. ∆) compared to the IBM system is reported.
4.11 Lattice rescoring results in word error rate using the Arabic GALE P3 system. Overall relative reduction (Rel. ∆) compared to the unadapted baseline (background) is reported. * denotes that bigram LSA (biLSA) is significantly better than LSA at the ≤5% level of significance.
4.12 Lattice rescoring and system combination results in word error rate using the word lattices from the IBM P3 systems. Overall relative reduction (Rel. ∆) compared to the unadapted baseline is reported.
4.13 Comparison of the size of the bigram LSA language model using Witten-Bell and fractional Kneser-Ney smoothing on Arabic and Chinese.
4.14 Lattice rescoring results on broadcast news (BN) and broadcast conversation (BC) in character error rate using the CMU-InterACT Mandarin transcription system for the GALE Phase-3 evaluation. * denotes that bigram LSA (biLSA) and trigram LSA (triLSA) are significantly better than LSA at the ≤5% level of significance.
4.15 Lattice rescoring results on broadcast news (BN) and broadcast conversation (BC) in word error rate using the CMU-InterACT Arabic transcription system for the GALE Phase-3 evaluation. * denotes that bigram LSA (biLSA) is significantly better than LSA at the ≤5% level of significance.
5.1 Size of the parallel training corpora for bilingual LSA training.
5.2 Parallel topics extracted by bLSA(CH,EN). Top words on the Chinese side are translated into English for illustration purposes.
5.3 Target word perplexity on MT06 using 67M-word (260M-word) parallel corpora for phrase extraction and 500M-word (2.7B-word) English corpora for language model training. The vocabulary size of the target language model is 1.3M (4.1M).
5.4 MT06 evaluation results on BLEU and NIST using 67M-word (260M-word) parallel corpora for phrase extraction and 500M-word (2.7B-word) English corpora for language model training. The vocabulary size of the target language model is 1.3M (4.1M). Four English references are used for scoring. English bigram LSA (biLSA) and trigram LSA (triLSA) are applied. * denotes that bilingual LSA adaptation is significantly better than the unadapted baseline at the 95% confidence level.
5.5 Examples demonstrating some degree of semantic paraphrasing with bilingual LSA.
5.6 Human evaluation results on sentence fluency and adequacy on MT06 using the GALE Phase-2.5 SMT system compared with bilingual LSA (bLSA). The worst score is 1 and the best score is 5.
5.7 Example where bilingual LSA gives better fluency than the unadapted baseline.
5.8 MT06 evaluation results on the average recall using the GALE-P2.5 SMT system.
5.9 MT06 evaluation results on the newsgroup (NG), newswire (NW) and broadcast news (BN) genres measured in BLEU using 67M-word (260M-word) parallel corpora for phrase extraction and 500M-word (2.7B-word) English corpora for language model training. The vocabulary size of the target language model is 1.3M (4.1M). Four English references are used for scoring.
5.10 MT06 evaluation results on the newsgroup (NG), newswire (NW) and broadcast news (BN) genres measured in NIST using 67M-word (260M-word) parallel corpora for phrase extraction and 500M-word (2.7B-word) English corpora for language model training. The vocabulary size of the target language model is 1.3M (4.1M). Four English references are used for scoring.
5.11 Translation results after crosslingual language model adaptation on the unsequestered broadcast news portion of the Mandarin Eval07 test set (Eval07u.BN) using the GALE-P2.5 SMT system with different numbers of latent topics K in bilingual LSA.
5.12 Speech translation results on the unsequestered broadcast news portion of the Mandarin Eval07 test set (Eval07u.BN) in BLEU and NIST using the GALE-P2.5 SMT system. Source inputs are word hypotheses from an unadapted background GALE-P3 Mandarin transcription system (CER=5.6%), an N-gram LSA-adapted system (CER=5.2%), and a manual reference (CER=0%). Confusion-network-like English references are used for scoring. Relative improvements in BLEU and NIST are reported with respect to the unadapted background word hypotheses before segmentation refinement.
5.13 Bilingual LSA training scenarios with pseudo-monolingual (p-mono) Donga news and real monolingual Xinhua News 2004 corpora.
5.14 New topical words which are not covered by the parallel corpora are extracted by bilingual LSA using pseudo-monolingual corpora. Words on the Chinese side are translated into English for illustration purposes.
5.15 Crosslingual language model adaptation performance in BLEU (target perplexity) for different training scenarios of bilingual LSA.

Chapter 1

Introduction

1.1 Motivation

Statistical language modeling (LM) is a crucial research area that has wide applications, including automatic speech recognition (ASR) and statistical machine translation (SMT). In speech translation, an input speech utterance X is first recognized into a text F of a source language. The text F is then translated into a text E of another language. These processes can be summarized by the following Bayes decision rules:

\[
\hat{F} = \arg\max_{F} p(F|X) = \arg\max_{F} \underbrace{p(X|F)}_{\text{acoustic model}} \cdot \underbrace{p(F)}_{\text{source language model}} \tag{1.1}
\]
\[
\hat{E} = \arg\max_{E} p(E|F) = \arg\max_{E} \underbrace{p(F|E)}_{\text{translation model}} \cdot \underbrace{p(E)}_{\text{target language model}} \tag{1.2}
\]
where equation 1.1 and equation 1.2 are the decision rules for automatic speech recognition and statistical machine translation respectively. Clearly, the statistical language models p(F) and p(E) play an important role in guiding the decoding processes via pruning unlikely word hypotheses.

An effective representation of a statistical language model p(w_1 w_2 ... w_I) is an N-gram

language model that makes a Markov assumption due to data sparseness:

\[
p(w_1^I) = \prod_{i=1}^{I} p(w_i|w_{i-1}\ldots w_1) \approx \prod_{i=1}^{I} p(w_i|w_{i-1}\ldots w_{i-N+1}) \tag{1.3}
\]
Usually, a 4-gram language model is common for automatic speech recognition and a 5-gram (sometimes up to 6-gram) language model for statistical machine translation, depending on the amount of training text. While an N-gram language model captures a local context well, it cannot capture a long-distance context due to the Markov assumption in equation 1.3. However, a long-distance topical context is useful for word prediction. For instance, if an input utterance is about "sports", the probability of an on-topic term (e.g. "basketball") should be increased while the probability of an off-topic term (e.g. "economy") should be de-emphasized. In broadcast news, topics can change from one story to the next. Therefore, a dynamic language model that adapts rapidly to the current word context is preferable.
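To make the Markov assumption in equation 1.3 concrete, the following minimal Python sketch scores a sentence with a trigram model estimated from toy counts. It is only an illustration: the corpus and counts are made up, and add-one smoothing stands in for the more refined smoothing techniques reviewed in Chapter 2.

```python
import math
from collections import defaultdict

def train_ngram_counts(corpus, n=3):
    """Collect N-gram and (N-1)-gram history counts from tokenized sentences."""
    ngrams, histories = defaultdict(int), defaultdict(int)
    for sentence in corpus:
        tokens = ["<s>"] * (n - 1) + sentence + ["</s>"]
        for i in range(n - 1, len(tokens)):
            history = tuple(tokens[i - n + 1:i])      # w_{i-N+1} ... w_{i-1}
            ngrams[history + (tokens[i],)] += 1
            histories[history] += 1
    return ngrams, histories

def sentence_log_prob(sentence, ngrams, histories, n=3, vocab_size=10000):
    """log p(w_1^I) under the Markov assumption, with simple add-one smoothing."""
    tokens = ["<s>"] * (n - 1) + sentence + ["</s>"]
    logp = 0.0
    for i in range(n - 1, len(tokens)):
        history = tuple(tokens[i - n + 1:i])
        num = ngrams[history + (tokens[i],)] + 1
        den = histories[history] + vocab_size
        logp += math.log(num / den)                   # log p(w_i | w_{i-N+1}^{i-1})
    return logp

corpus = [["the", "game", "was", "exciting"], ["the", "economy", "slowed", "down"]]
ngrams, histories = train_ngram_counts(corpus)
print(sentence_log_prob(["the", "game", "slowed", "down"], ngrams, histories))
```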

1.2 Proposed Research

In this thesis, we propose a unified unsupervised topic adaptation framework that can be applied in both monolingual and crosslingual fashions, to automatic speech recognition and statistical machine translation respectively. Our framework features rapid topic adaptation in the sense of using little unsupervised adaptation data. Not only does the framework adapt a language model in automatic speech recognition monolingually, but it also adapts a language model and a translation model in statistical machine translation crosslingually using an input text from another language. Both monolingual and crosslingual adaptation are performed via a combination of unigram and N-gram latent semantic analysis.

Figure 1.1 shows our unified topic adaptation framework for speech translation, which consists of three main components from left to right: automatic speech recognition, bilingual latent semantic analysis, and statistical machine translation. First a language model is adapted monolingually in automatic speech recognition, producing a word hypothesis of a source language. Bilingual latent semantic analysis bridges the gap between automatic


speech recognition and statistical machine translation via predicting and transferring the topic distribution of the word hypothesis from one language into another, so that a target language model and a translation model are adapted readily before translation.

Figure 1.1: A unified topic adaptation framework for speech translation. The three components, from left to right, are automatic speech recognition (source N-gram LM p(F) and source N-gram LSA), bilingual LSA (topic distribution transfer), and statistical machine translation (target N-gram LM p(E), target N-gram LSA and translation lexicon p(E|F) or p(F|E)), trained from source documents, source-target parallel/topic-comparable corpora and target documents respectively.

1.2.1 Monolingual Adaptation

First, a speech recognizer decodes an input audio into an initial word hypothesis, which is used to adapt a language model. Then the adapted language model is applied to re-decode the input audio and produce a final word hypothesis. There has been a significant amount of work on unsupervised language model adaptation for automatic speech recognition. A cache-based language model (Kuhn and Mori, 1990; Clarkson and Robinson, 1997) takes advantage of a long-distance context by tracking the frequency of recently occurring words in an exponentially decaying N-gram cache. The word triggering approach uses words from a past context to trigger the probability of future words via maximum entropy language modeling (Rosenfeld, 1994; Chen et al., 1998; Wu and Khudanpur, 2002). The information retrieval technique of (Mahajan et al., 1999) retrieves relevant documents from the background training corpora to build an in-domain N-gram language model.

Research closely related to this thesis includes latent semantic analysis based language model adaptation, such as singular value decomposition (Deerwester et al., 1990;

Bellegarda, 2000; Kim, 2004; Bellegarda, 2005), and probabilistic latent semantic analysis (Gildea and Hofmann, 1999; Mrva and Woodland, 2004; Akita and Kawahara, 2004). Our first language modeling work is based on latent Dirichlet allocation (Blei et al., 2003), which is a Bayesian approach for latent semantic analysis. First, we draw connections between latent Dirichlet allocation and the cache-based language model as topic caching (Tam and Schultz, 2005). Instead of caching the frequency of recently occurring words, we cache the frequency of recently occurring topics. Related work on latent Dirichlet allocation for language model adaptation has been investigated by other researchers based on the hidden Markov model (Hsu and Glass, 2006), another topic inference algorithm (Heidel et al., 2007), linear transformation (Chien and Chueh, 2008) and named entities (Liu and Liu, 2008).

The in-domain unigram language model generated from latent semantic analysis can be integrated into a background language model via linear interpolation or marginal adaptation (Kneser et al., 1997). To make marginal adaptation computationally less expensive, we investigate an incremental LSA-marginal adaptation for lattice rescoring that avoids manipulating a large background N-gram language model.

In the topic modeling community, modeling topic correlation (Blei and Lafferty, 2005; Li and McCallum, 2006) has shown better performance in word perplexity compared to latent Dirichlet allocation, which makes the topic independence assumption. We propose latent Dirichlet-Tree allocation (Tam and Schultz, 2007b), a tree-based latent semantic model to capture topic correlation. Our model generalizes latent Dirichlet allocation using a tree-based probabilistic prior, with comparable complexity of the training algorithm driven by a variational Expectation-Maximization procedure.

Latent semantic analysis makes the "bag-of-word" assumption, which ignores the word ordering in a document. To relax this assumption, we propose N-gram latent Dirichlet-Tree allocation to model the word ordering and the topical information simultaneously (Tam and Schultz, 2008). For rapid model training, we bootstrap N-gram latent semantic analysis using a well-trained latent semantic analysis that is based on unigrams. For better smoothing, we investigate a fractional Kneser-Ney smoothing algorithm that handles fractional counts and generalizes the original formulation based on integral counts (Kneser

and Ney, 1995). The sentence-level topic mixture model (Iyer and Ostendorf, 1999) is another approach based on a mixture of topic-dependent N-gram language models, but with a modeling assumption different from N-gram latent semantic analysis.

1.2.2 Crosslingual Adaptation

The idea of crosslingual adaptation is to exploit information in one language to adapt models in another language. For instance, crosslingual information retrieval first uses a decoded word hypothesis as an input query to retrieve relevant documents in another language (Eck et al., 2004; Zhao et al., 2004). The retrieved foreign documents are transformed back to the original language to train an in-domain unigram language model. The transformation approaches include statistical machine translation, crosslingual word triggers, and crosslingual latent semantic analysis using singular value decomposition (Kim, 2004). The in-domain unigram language model is applied to improve the performance of automatic speech recognition and statistical machine translation via language model adaptation for a resource-deficient language. In a co-operative multilingual speech translation setting where a human translator is involved, (Paulik et al., 2005b) employ statistical machine translation to translate an initial word hypothesis of a speech decoder into a target language to improve the recognition performance of the target language via language model adaptation.

The second component in Figure 1.1 is the proposed bilingual latent semantic analysis, which facilitates crosslingual adaptation for statistical machine translation. Since topical information is language independent, the topic distributions of the two sides of a parallel document pair are assumed to be identical. Therefore, bilingual latent semantic analysis is trained so that a one-to-one topic correspondence between a source and a target language is enforced using parallel document corpora. During adaptation, the topic distribution of an input document is inferred using the latent semantic analysis of the source language. Then the inferred topic distribution is transferred to the target language so that an in-domain unigram/N-gram language model is generated via linearly interpolating the topic-dependent unigram/N-gram language models of the target language. The proposed approach has two advantages: first,

it can be applied before translation, and thus has an immediate impact on the translation output. Secondly, it does not rely on an initial translation output for adaptation, and therefore does not suffer from translation errors. In other words, statistical machine translation is not required for crosslingual adaptation.

Translation lexicon adaptation is motivated by the observation that a source word can be translated into different target words depending on the topical context. One popular example is the word "bank", which can be related to either a "financial bank" or a "river bank". The Bilingual Topic Admixture Model for word alignment has been proposed to address this issue via explicit modeling of topic-dependent translation lexicons (Zhao and Xing, 2006). We address the same issue via an adaptation approach, so that the probabilities in a background translation word lexicon are adapted towards an input document without explicit modeling of topic-dependent translation lexicons. The scores of the phrase pairs in a phrase table are rescored using the adapted translation lexicons so that the scores are sensitive to the topical context of the input document. After the target language model, translation lexicon and phrase table have been adapted towards an input document, the sentences of the input document are translated.

One limitation of bilingual latent semantic analysis is the requirement of parallel documents for model training. Collecting parallel documents is relatively expensive compared to monolingual non-parallel documents. In addition, monolingual non-parallel documents have better topic and vocabulary coverage than parallel documents, especially in a resource-deficient scenario where parallel resources are scarce. Therefore, it is attractive to incorporate non-parallel documents into bilingual latent semantic analysis. Previous research includes an extension of bilingual singular value decomposition where a monolingual document is treated as a pseudo-parallel document by filling zeros into the missing entries of a bilingual document vector (Kim, 2004). We employ a semi-supervised approach to incorporate monolingual non-parallel documents via a notion of parallel clusters formed from the non-parallel documents. The parallel clusters serve as constraints for optimization in bilingual latent semantic analysis.

1.3 Thesis Organization

In Chapter 2, we cover the background materials relevant to the development of the thesis, starting with variational Bayes, a useful variational inference technique for graphical models including latent Dirichlet allocation. We review different approaches for latent semantic analysis, from the traditional vector-space approach to the modern Bayesian approach that enables an integration of prior knowledge into the model. We introduce higher-order models for latent semantic analysis and topic modeling, techniques for language model smoothing, and unsupervised language model adaptation.

In Chapter 3, we describe our baseline Mandarin and Arabic transcription systems and our Chinese-English statistical machine translation systems of different implementation scales.

In Chapter 4, we present our topic adaptation framework for automatic speech recognition via N-gram latent semantic analysis. Our approaches include topic caching for decoding, incremental marginal adaptation for lattice rescoring, latent Dirichlet-Tree allocation for tree-based latent semantic analysis, and N-gram latent Dirichlet-Tree allocation to relax the "bag-of-word" assumption. We evaluate our approaches in large-scale GALE evaluations on two languages: Mandarin and Arabic.

In Chapter 5, we extend our topic adaptation framework to crosslingual adaptation for statistical machine translation via bilingual latent semantic analysis. We adapt a target language model, a translation lexicon and a phrase table using an input source document. In addition, we apply N-gram latent semantic analysis on the target language. We evaluate our approaches on text translation and end-to-end speech translation using state-of-the-art statistical machine translation systems. To tackle the limitation of bilingual latent semantic analysis, we employ a semi-supervised approach to integrate monolingual non-parallel documents into model training. We evaluate this approach in a simulated scenario where parallel resources are deficient.

In Chapter 6, we summarize our contributions and propose possible future extensions.

In Appendix A, we describe an alternative approach for training latent Dirichlet

allocation and the bigram topic model using Gibbs sampling. In Appendix B, we include a complete mathematical derivation for latent Dirichlet-Tree allocation.

Chapter 2

Background

This chapter covers the basics and related work of the thesis, including variational Bayes, approaches for latent semantic analysis, language model smoothing and language model adaptation.

2.1 Variational Bayes

Variational Bayes (Jordan et al., 1999; Bishop et al., 2003) is a powerful technique for approximate inference in a directed graphical model. This approach has been applied to different models such as latent Dirichlet allocation, which is a graphical model for Bayesian latent semantic analysis. The solution of a variational posterior distribution has a generic form for any directed graphical model, which makes variational Bayes attractive and useful.

Given a joint probability distribution p(X, Z; Λ) over latent variables Z = {z_j} and observed variables X = {x_i} parametrized by Λ, the Expectation-Maximization algorithm can be employed to maximize the lower bound of the log likelihood L(X; Λ) of the

observations using Jensen's inequality as follows:

\[
\mathcal{L}(X;\Lambda) = \log \sum_{Z} p(Z|X;\Lambda^{(t-1)}) \cdot \frac{p(X,Z;\Lambda)}{p(Z|X;\Lambda^{(t-1)})} \tag{2.1}
\]
\[
\geq \sum_{Z} p(Z|X;\Lambda^{(t-1)}) \log \frac{p(X,Z;\Lambda)}{p(Z|X;\Lambda^{(t-1)})} \tag{2.2}
\]
\[
= E_{p}\!\left[\log \frac{p(X,Z;\Lambda)}{p(Z|X;\Lambda^{(t-1)})}\right] = Q(X,\Lambda;\Lambda^{(t-1)}) \tag{2.3}
\]
where the expectation is taken using the exact posterior distribution computed from Λ^(t−1) of the (t−1)-th iteration. Computing the exact posterior distribution over the latent variables can be non-trivial and sometimes intractable. Using the Bayes rule, the posterior distribution over the latent variables Z can be computed as follows:

\[
p(Z|X;\Lambda^{(t-1)}) = \frac{p(X,Z;\Lambda^{(t-1)})}{\sum_{Z'} p(X,Z';\Lambda^{(t-1)})} \tag{2.4}
\]
where the intractable part is the normalization term involving all possible assignments of the latent variables Z. In variational Bayes, instead of computing the exact posterior distribution over Z directly, a variational posterior distribution q(Z|X; Γ) is introduced to approximate the true posterior distribution. A convenient factorizable distribution is employed so that the latent variables are independent given an observation X:

\[
q(Z|X;\Gamma) = \prod_{j} q(z_j|X;\Gamma_j) = \prod_{j} q(z_j) \tag{2.5}
\]
The independence assumption is a trade-off between simplicity and accuracy of the posterior inference over the latent variables. After replacing p(Z|X; Λ^(t−1)) with q(Z|X; Γ) in equation 2.3, the auxiliary function using variational Bayes has the following form:

\[
Q_{vb}(X;\Lambda,\Gamma) = E_{q}\!\left[\log \frac{p(X,Z;\Lambda)}{q(Z|X;\Gamma)}\right] \tag{2.6}
\]

\[
= E_{q}[\log p(X,Z;\Lambda)] - \sum_{j} E_{q}[\log q(z_j)] \tag{2.7}
\]
where the variational parameters Γ are determined via iterative E-steps over each latent variable. By computing the partial derivative of the auxiliary function with respect to each q(z_j), the generic E-step can be derived:

\[
\frac{\partial Q_{vb}(X;\Lambda,\Gamma)}{\partial q(z_j)} = E_{q}[\log p(X,Z;\Lambda)]_{Z \setminus z_j} - \log q(z_j) + \text{constant} = 0 \tag{2.8}
\]

\[
\Longrightarrow \quad q(z_j) \propto e^{E_{q}[\log p(X,Z;\Lambda)]_{Z \setminus z_j}} \tag{2.9}
\]
where the expectation is taken over all other latent variables {z_i} excluding the current variable z_j. Instead of considering the full joint distribution for the expectation in equation 2.9, only the subset of conditional distributions involving z_j and its Markov blanket (i.e. parent nodes, child nodes and their co-parents) is needed, while the other q(z_i) are kept fixed. For a hidden Markov model with s_t and x_t being the hidden state and the observation at time t respectively, the variational E-step for q(s_t) can be derived easily using equation 2.9:

\[
q(s_t) \propto e^{E_{q}[\log p(s_t|s_{t-1})]_{\setminus s_t} + E_{q}[\log p(x_t|s_t)]_{\setminus s_t} + E_{q}[\log p(s_{t+1}|s_t)]_{\setminus s_t}} \quad \forall t \tag{2.10}
\]
\[
\propto e^{\sum_{i} q(s_{t-1}=i)\log p(s_t|s_{t-1}=i) + \log p(x_t|s_t) + \sum_{i} q(s_{t+1}=i)\log p(s_{t+1}=i|s_t)} \quad \forall t \tag{2.11}
\]
where the Markov blanket of s_t consists of s_{t−1}, x_t and s_{t+1}. The variational E-steps actually minimize the Kullback-Leibler divergence KL(q||p|X) between the variational posterior and the true posterior, since the difference between the log likelihood and the auxiliary log likelihood is:
\[
\log p(X) - E_{q}\!\left[\log \frac{p(X,Z)}{q(Z|X)}\right] = \sum_{Z} q(Z|X)\log p(X) - E_{q}\!\left[\log \frac{p(X,Z)}{q(Z|X)}\right] \tag{2.12}
\]
\[
= \sum_{Z} q(Z|X)\log\left(q(Z|X) \cdot \frac{p(X)}{p(X,Z)}\right) \tag{2.13}
\]
\[
= \sum_{Z} q(Z|X)\log \frac{q(Z|X)}{p(Z|X)} \tag{2.14}
\]
\[
= KL(q\,\|\,p\,|\,X) \tag{2.15}
\]

This implies that maximizing the auxiliary function using variational Bayes leads to minimizing the KL divergence. After the E-steps, the M-step is usually straightforward via a weighted maximum likelihood estimate of the model parameters, using the variational posteriors to re-weight the observations.
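As an illustration of the generic update in equations 2.9-2.11, the following NumPy sketch runs the fully factorized (mean-field) E-step for a toy discrete hidden Markov model. The transition matrix A, emission matrix B and prior pi are invented for the example and are not taken from any system in this thesis.

```python
import numpy as np

def mean_field_hmm_posteriors(A, B, pi, obs, n_iters=50):
    """Fully factorized variational posteriors q(s_t) for a discrete HMM,
    following the generic mean-field update of equation 2.11."""
    T, S = len(obs), A.shape[0]
    q = np.full((T, S), 1.0 / S)                      # initialize q(s_t) uniformly
    logA, logB, logpi = np.log(A), np.log(B), np.log(pi)
    for _ in range(n_iters):
        for t in range(T):
            # expected log transition from the previous state (or the initial prior)
            past = logpi if t == 0 else q[t - 1] @ logA
            # expected log transition into the next state (absent at the last frame)
            future = logA @ q[t + 1] if t + 1 < T else np.zeros(S)
            log_q = past + logB[:, obs[t]] + future
            q[t] = np.exp(log_q - log_q.max())
            q[t] /= q[t].sum()                        # normalize q(s_t)
    return q

# toy 2-state HMM with binary observations (illustrative numbers only)
A = np.array([[0.9, 0.1], [0.2, 0.8]])
B = np.array([[0.8, 0.2], [0.3, 0.7]])
pi = np.array([0.5, 0.5])
print(mean_field_hmm_posteriors(A, B, pi, obs=[0, 0, 1, 1, 1]))
```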

2.2 Latent Semantic Analysis

Latent semantic analysis (LSA) is an unsupervised technique to find a set of patterns to describe a data set. Different approaches based on a vector space model or a probabilistic model have been developed and applied to various areas including image and text modeling.

2.2.1 Latent Semantic Indexing

Latent semantic indexing (LSI) (Deerwester et al., 1990) is a useful technique for document indexing for semantic information retrieval. As a vector space model, latent semantic indexing searches for a set of basis vectors for document representation. Each basis vector represents a "semantic" dimension. To determine the basis vectors, a term-document matrix W_{VM} is formed by packing each training document as a column vector of W_{VM}, where M is the number of training documents. Each component in a document vector corresponds to a term frequency or other variants such as the popular TF.IDF, which is the multiplication of a term frequency with an inverse document frequency. Singular value decomposition is applied to decompose W_{VM} into three matrices: W_{VM} = U_{VK} · S_{KK} · V_{MK}^T. The matrix U contains K basis vectors spanning the latent semantic space. S is a diagonal matrix containing the corresponding eigenvalue of each basis vector. Each column of V^T represents the "coordinate" of a training document in the K-dimensional latent semantic space. Latent semantic indexing does not provide a natural way to compute the probability of a document. Moreover, there is a question about the validity of latent semantic indexing, since singular value decomposition minimizes the least-square error of the training documents assuming that the document vectors are normally distributed; however, each document vector has non-negative word counts, violating the Gaussian assumption.

Latent semantic indexing can be applied for language model adaptation (Bellegarda, 2000). To obtain the probability of a word w given a "bag-of-word" history h, a unigram



Figure 2.1: Graphical representation of probabilistic latent semantic analysis. language model can be generated from latent semantic indexing as follows:

sim(w, h)γ p w h (2.16) lsi( | ) = ′ γ w′ sim(w , h) where γ is a tuning factor ≫ 1 and sim(w, hP) defines the cosine similarity between w and h in the latent semantic space.

2.2.2 Probabilistic Latent Semantic Analysis

Probabilistic latent semantic analysis (pLSA) (Gildea and Hofmann, 1999) is a significant step towards probabilistic models from the vector space model in latent semantic indexing. The graphical model structure is shown in Figure 2.1 with the unshaded circles denoting the latent variables: a document index d and topic labels z. Each training document d is associated with a document-level topic distribution p(k|d). The document generation procedure is defined as follows:

1. Sample a document index d to retrieve a document-level topic distribution p(k|d).

Nd 2. For each word wi in a document w1 ,

• Sample a latent topic index zi from p(k|d).

• Sample wi from p(w|zi) (denoted as βwizi ).

13 Nd Nd The generative procedure defines the joint distribution p(d,w1 , z1 ) over the document Nd Nd index d, topic assignment z1 and the document w1 . With M training documents, the Nd marginal likelihood can be obtained by marginalizing out d and z1 :

M Nd K Nd pplsi({w1 }) = p(d) p(wi|zi = k) · p(zi = k|d) (2.17) i=1 Xd=1 Y Xk=1 The model parameters include the document-specific topic distribution p(k|d) and the topic-dependent unigram language model p(w|k) denoted as βwk. Since the number of model parameters for p(k|d) grows proportional to the number of document, overfitting is reported and therefore an annealed E-step is required to prevent overfitting (Gildea and Hofmann, 1999). The E-steps and the M-step are given as follows: E-step:

(t+1) (t) (t) p (zi = k|d) ∝ p (wi|zi = k) · p (k|d) (2.18) where the word-level topic posteriors p(zi = k|d) are re-estimated. M-step:

N (t+1) (t+1) p (k|d) ∝ p (zi = k|d) (2.19) i=1 X M Nd (t+1) (t+1) p (w|k) ∝ p (zi = k|d) · δ(w,wi|d) (2.20) i=1 Xd=1 X where the document-level topic posterior p(k|d) and the topic-dependent unigram lan- guage models p(w|k) are re-estimated. Nd denotes the number of word tokens in the document d. The document generation procedure for a test document is not well defined since the document-level topic posterior of a test document is unknown since it is not indexed by the latent variable d. But in practice, a folding-in procedure can be performed to in- clude the test document. In other words, the document-level topic posterior p(k|d) of the test document can be re-estimated via the Expectation-Maximization algorithm. For

14 β

z w θ N α M

Figure 2.2: Graphical representation of latent Dirichlet allocation. language model adaptation (Gildea and Hofmann, 1999; Mrva and Woodland, 2004; Akita and Kawahara, 2004), a unigram language model can be computed via linear interpolation:

K

pplsa(w|h) = p(w|k) · p(k|h) (2.21) Xk=1 where h denotes a “bag-of-word” history that is treated as a “document”. p(k|h) is the result of the EM iterations until convergence is reached with p(w|k) kept fixed. This ap- proach avoids the expensive normalization step required in equation 2.16 in latent semantic indexing.

2.2.3 Latent Dirichlet Allocation

To remedy the deficiency of document generation in probabilistic latent semantic analysis, latent Dirichlet allocation (Blei et al., 2003) has been proposed. Instead of sampling a document index to retrieve the corresponding topic distribution p(k|d) in Figure 2.1, a Dirichlet prior over a topic distribution θ is employed so that the topic distribution of a test

15 document can be sampled from a prior Dirichlet distribution:

K 1 α −1 K Dirichlet(θ; {α }) = θ k = e k=1(αk−1) log θk−log A(α) (2.22) k A(α) k k=1 P K Y k′=1 Γ(αk′ ) where A(α) = K (2.23) α ′ Γ(Q k′=1 k ) ∞ and Γ(z) = Ptz−1e−tdt (2.24) Z0 In equation 2.22, the Dirichlet distribution is rewritten in an exponential family since com- puting the expectation of the sufficient statistics {E[log θk]} is required in the variational E-steps later. Figure 2.2 illustrates the graphical representation of latent Dirichlet allo- cation with the topic distribution θ drawn from a Dirichlet prior parametrized by {αk}, which denotes the “pseudo-count” of topic k informally. Therefore, the model parameters

Λ are {αk} and {p(w|k)} denoted as {βwk}. The document generation of latent Dirichlet allocation is described as follows:

1. Sample θ from Dirichlet(θ; α) (different from pLSA)

N 2. For each word wi in a document w1 ,

• Sample a latent topic index zi from θ.

• Sample wi from p(w|zi) (denoted as βwzi ).

Variational EM (Blei et al., 2003) can be employed for model estimation. Appendix A provides an alternative approach based on collapsed Gibbs sampling for model estima- tion (Griffiths and Steyvers, 2004; Porteous et al., 2008). On the other hand, the E-step formulae can be seen directly using the generic form of variational EM introduced in Sec- tion 2.1. From the graphical representation, the Markov blanket of the latent variables θ and zi are {z1...zN } and {θ,wi} respectively. Using the generic form of variational EM in equation 2.9, the variational E-steps involving the latent variables θ and zi are shown as follows:

16 E-steps:

N E [log θ ]+E [log p(θ)] q(θ) ∝ e i=1 q zi q \θ (2.25) PN K ∝ e i=1 k=1 q(zi=k) log θk+log p(θ) (2.26) P KP N q(zi=k) ∝ p(θ) θk (2.27) k=1 i=1 K Y YK N αk−1 q(zi=k) ∝ θk θk (2.28) k=1 k=1 i=1 YK Y Y N αk+ i=1 q(zi=k)−1 ∝ θk (2.29) P kY=1 = Dirichlet(θ; {γk}) , (2.30) N

where γk = αk + q(zi = k) , (2.31) i=1 X Eq[log θk]+log p(wi|zi) q(zi = k) ∝ e , (2.32) K

where Eq[log θk] = Ψ(γk) − Ψ( γk′ ) , (2.33) ′ kX=1 ∂ log Γ(γk) and Ψ(γk) = (2.34) ∂γk Since the Dirichlet prior can be expressed as an exponential family, the expected sufficient statistics E[log θk] can be computed as the first derivative of the log partition function log A(α) with respect to αk giving the result in equation 2.33. Equation 2.31 and equa- tion 2.32 are executed iteratively until convergence is reached by measuring the relative change of norm of {γk} between successive iterations. Equation 2.31 can be understood intuitively by the fact that Dirichlet prior and multinomial distribution are a conjugate pair.

Thus, the variational posterior distribution q(θ; {γk}) is also a Dirichlet distribution where the posterior counts γk of topic k is re-estimated by accumulating the word-level topic posteriors q(zi = k) of a document plus the prior pseudo-count αk. M-step:

M Nd

p(w|k) ∝ q(zi = k|d) · δ(wi,w|d) (2.35) i=1 Xd=1 X 17 Parameters of a Dirichlet prior {αk} can be re-estimated using gradient ascent in the log space of {αk} to ensure that the final values are larger than zero. We first perform param- eter transformation using log(.): α˜k = log αk. Then we rewrite the auxiliary function as

Q(˜αk) and perform the gradient ascent as follows:

(t+1) (t) (t) ∂Q(˜α) α˜k =α ˜k + ρ · (2.36) ∂α˜k

(t) (t) ∂Q(α) ∂αk =α ˜k + ρ · · (2.37) ∂αk ∂α˜k

(t) (t) ∂Q(α) (t) =α ˜k + ρ · · αk (2.38) ∂αk where ρ(t) denotes the learning rate at iteration t. After the gradient ascent procedure

finishes, we exponentiate α˜k to obtain the final αk. Latent Dirichlet allocation makes an independence assumption over the topics due to the Dirichlet prior. In Chapter 4, we will describe latent Dirichlet-Tree allocation so that topic correlation is captured.

2.2.4 Correlated Topic Model

Correlated topic model (Blei and Lafferty, 2005) is an extension of the latent Dirichlet allocation to model topic correlation. Their approach replaces a Dirichlet prior with a logistic-normal prior. The document generation procedure is shown as follows:

1. Sample η from a multivariate Normal distribution N(µ, Σ) of dimension K.

n 2. For each word wi in a document w1 ,

• Sample a latent topic index zi from a Multinomial(f(η)).

• Sample wi from p(w|zi). where f(η) is a logistic normal distribution normalized between 0 and 1:

eηk f(ηk) = K (2.39) ηk′ k′=1 e

18P Dir(.)

Dir(.) Dir(.) Dir(.)

Topic 1 Topic 2 Topic 3 Topic 4 Topic 5

Figure 2.3: Pachinko allocation employs a directed acyclic graph of Dirichlet nodes as a topic prior.

The logistic normal distribution assumes that η is normally distributed and then mapped to a topic distribution. Topic correlations are thus modeled through the covariance ma- trix of the Normal distribution. Correlated topic model enables modeling pair-wise topic correlation. However, the non-conjugate logistic normal prior poses complication on vari- ational EM due to the computation of the expected log probability of a topic assignment

Eq[log f(ηk)] that does not have a closed-form solution. Taylor expansion is employed to approximate this term for variational inference.

2.2.5 Pachinko Allocation

Pachinko allocation (Li and McCallum, 2006) is another extension of latent Dirichlet al- location for modeling topic correlation. Their approach replaces a Dirichlet prior with a direct-acyclic Dirichlet graph where each node in the graph is modeled as a Dirichlet dis- tribution over the outgoing links which connects to other nodes in a top-down and fully connected fashion as shown in Figure 2.3. There are K nodes at the bottom layer of the model to denote the topics. Therefore, probability of a topic p(z|θ) is computed as product of branching probabilities from a root node to a leaf node analogous to a popular Japanese Pachinko game where a ball follows a random path from the top to the bottom. Pachinko allocation can be interpreted as generalization of latent Dirichlet-Tree allocation that we

19 Prior distribution over topic mixture weights θ

Latent topics z1 z2 zN

Observed words w1 w2 wN

Figure 2.4: Graphical representation of bigram LSA. Adjacent words in a document are linked together to form a Markov chain from left to right. have proposed independently on different applications. However, Pachinko allocation is more expensive in nature since the structure of the Dirichlet graph is undefined. Moreover, the model requires the Gibbs sampling procedure for model estimation while we present a variational Bayes approach for efficient model estimation.

2.2.6 Bigram Latent Semantic Analysis

Latent semantic analysis makes a “bag-of-word” assumption that word ordering is ignored. For document classification, word ordering may not be important. But from the language modeling perspective, word ordering is crucial since a trigram language model usually outperforms a unigram language model for word prediction. To relax the “bag-of-word” assumption, bigram LSA has been proposed (Wallach, 2006). The model modifies the graphical structure of latent Dirichlet allocation by connecting adjacent words together to form a Markov chain. Figure 2.4 shows the graphical representation of bigram LSA where the top node represents a prior distribution over topic distributions and the middle layer represents topic labels associated with each word. The document generation procedure of bigram LSA is similar to that of latent Dirichlet allocation except that a previous word is taken into consideration for generating a current word:

1. Sample θ from a Dirichlet prior p(θ).

20 Latent topic z

Observed words w1 w2 wN

Figure 2.5: Graphical representation of the sentence-level bigram topic mixture model.

2. For each word wi at the i-th position of a document:

(a) Sample a topic label: zi ∼ Multinomial(θ).

(b) Sample wi given wi−1 and the topic label zi: wi ∼ p(·|wi−1, zi).

In (Wallach, 2006), the Gibbs sampling procedure is applied for model training, which can be very slow since it requires drawing a significant number of samples for convergence. Usually 500 Gibbs iterations are common in practice for latent Dirichlet allocation (Por- teous et al., 2008). Therefore, this approach is difficult to scale up to large training data. Moreover, simple Laplace smoothing is employed which can give poor model smoothing. We address these issues using model bootstrapping and language model smoothing in this thesis, and show the effectiveness for topic adaptation in automatic speech recognition in Chapter 4 and statistical machine translation in Chapter 5.

2.2.7 Sentence-Level Topic Mixtures

Sentence-level topic mixture (Iyer and Ostendorf, 1999) is similar in spirit to bigram LSA that the latent topics can be modeled via a mixture of N-gram language models. However, their model structures are different. They assume that all words in a sentence are assigned with the same topic label, which differs from bigram LSA that each word is assigned with an independent topic label. Moreover, each sentence is assumed to be independent within a document while bigram LSA captures the sentence dependency via a probabilistic prior. Figure 2.5 illustrates the graphical model representation of the sentence-level topic N mixture. Specifically, the probability of a word sequence w1 using a bigram topic mixture

21 is given as follows:

N K N p(w1 ) = λk · p(wi|wi−1,k) (2.40) i=1 Y Xk=1 A “hard” document/sentence clustering is employed to train the cluster-specific N-gram language models for model initialization. For instance, each document/sentence can be represented as a vector with each component represented as TF.IDF (term frequency mul- tiplied by inverse document frequency). Then the K-means algorithm can be performed to form the clusters. This follows an Expectation-Maximization procedure to re-estimate the model parameters as follows: E-steps:

(t) Ns (t) Ns (t) p (z = k|w1 ) ∝ p (w1 |k) · λk (2.41)

M-step:

(t+1) (t) pML (v|u,k) ∝ Ck (u, v) (2.42)

(t+1) (t) Ns λk ∝ p (z = k|w1 ) (2.43) s (t) X (t) where Ck (u, v) = Ck (u, v|s) (2.44) s X (t) (t) (t) Ns and Ck (u, v|s) = C (u, v|s) · p (z = k|w1 ) (2.45) where s denotes the sentence index; lambdak denotes the topic mixture weights; C(u, v|s) denotes the bigram count of sentence s and Ck(u, v|s) denotes the fractional bigram count assigned to topic k in sentence s. To avoid a zero probability for an unseen bigram, the bigram model is linearly interpolated with a unigram model in a context dependent fashion as follows:

p(v|u,k) = (1 − φuk) · pML(v|u,k)+ φuk · pML(v|k) (2.46)

Nk(u, ·) where φuk = (2.47) Nk(u, ·)+ Ck(u, ·) C (u, v) and N (u, ·) = k = p(k|u, v) (2.48) k C(u, v) vX:(u,v) vX:(u,v) 22 where φuk is estimated using an analogous version of Witten-Bell smoothing for fractional counts. Nk(u, ·) denotes the fractional number of “unique” words following word u for topic k which falls back to the integral definition when the mixture model has only one topic. In Chapter 4, we will describe fractional Kneser-Ney smoothing which supports fractional counts for bigram LSA.

2.3 Bilingual Latent Semantic Analysis

The interest of extending latent semantic analysis from monolingual to crosslingual man- ner comes from the advantage of exploiting information from one language and applying them to another language. This notion applies naturally to statistical machine translation.

2.3.1 Bilingual Latent Semantic Indexing

Bilingual latent semantic indexing (Kim and Khudanpur, 2004) has been proposed to cap- ture a joint latent semantic space of a source and target language via singular value de- composition. In their approach, each of a parallel document pair are concatenated into a super-document vector for singular value decomposition. Their approach creates a LSA- based translation word lexicon described as follows:

sim(f, e)γ p f e (2.49) lsi( | ) = ′ γ f ′ sim(f , e) P where γ >> 1 as suggested in (Coccaro and Jurafsky, 1998). Incorporation of mono- lingual documents for bilingual latent semantic indexing is done by filling in zeros for the unknown components of a pseudo-bilingual document vector before singular value decomposition. However, this approach may undermine the basis vectors because of the unjustified zero co-occurrence counts between source and target words that may mislead the result of singular value decomposition.

23 2.3.2 Bilingual Topic Admixture Model

Bilingual Topic Admixture Model (Zhao and Xing, 2006) (BiTAM) has been proposed to incorporate latent topics into word alignment via topic-dependent translation lexicons. HM-BiTAM (Zhao and Xing, 2007) incorporates a hidden Markov model into BiTAM for word alignment. Among the three proposed variants in BiTAM, BiTAM-3 has yielded the best word alignment accuracy. The generative procedure of a sentence pair is described as follows:

1. Sample topic mixture weights θ from a Dirichlet prior.

J 2. For each word fj in a source sentence f1 ,

• Sample a latent topic index zj from Multinomial(θ).

• Sample an alignment variable aj ∈ [0, I] to index a word eaj in a target sen- I tence e0 where e0 denotes a NULL word.

• Sample fj from a topic-dependent translation lexicon p(fj|eaj , zj).

BiTAM can be trained using variational EM. Better results are obtained using the IBM-4 translation lexicon as an initial model. The graphical model representation of BiTAM is illustrated in Figure 2.6. Using the generic form of variational EM in equation 2.9, one can identify the Markov blanket of the latent variables θ, zj, and aj to be {z1...zJ }, {θ, fj, eaj } and {zj, fj, eaj } respectively. Therefore, the variational E-steps and M-step for a sentence pair can be shown as below:

24 E-steps:

J

γk = αk + q(zj = k) (2.50) j=1 X Eq[log θ;γk]+Eq[log p(fj |eaj ,zj )]\zj q(zj = k) ∝ e (2.51)

log p(aj )+Eq[log p(fj |eaj ,zj )]\a q(aj = i) ∝ e j (2.52) I

where Eq[log p(fj|eaj , zj)]\zj = q(aj = i) · log p(fj|ei, zj = k) (2.53) i=0 XK

and Eq[log p(fj|eaj , zj)]\aj = q(zj = k) · log p(fj|ei, zj = k) (2.54) k=1 X K

and Eq[log θk] = Ψ(γk) − Ψ( γk′ ) (2.55) ′ kX=1 M-step:

J I

p(f|e, k) ∝ δ(fj, f) · δ(ei, e) · q(zj = k) · q(aj = i) (2.56) j=1 i=0 X X where q(θ; {γk}), q(aj) and q(zj) are the variational posterior distributions for the topic I J mixture weights, word alignment index and topic index for each sentence pair {e1,f1 }. Although improvement in BLEU has been reported, the model training is computationally expensive. In Chapter 5, we will present a marginal adaptation approach so that latent topics can be incorporated into a background translation word lexicon without the explicit modeling of topic-dependent lexicons.

2.4 Language Model Smoothing

Language model smoothing is essential to avoid assigning zero probabilities to unseen events. The state-of-the-art smoothing is based on the Kneser-Ney approach (Kneser and Ney, 1995). Witten-Bell smoothing (Witten and Bell, 1991) is a competitive approach and generalizes to fractional counts (Iyer and Ostendorf, 1999).

25 e I p(f|e,k)

a

θ z f α J S M

Figure 2.6: Graphical representation of bilingual topic admixture model. S and M denote the number of parallel sentences and the number of documents in parallel corpora respec- tively. I and J denote the number of words in a target sentence and a source sentence respectively. a is the word alignment variable for f. 2.4.1 Kneser-Ney Smoothing

The state-of-the-art smoothing for a backoff N-gram language model is based on Kneser- Ney smoothing (Kneser and Ney, 1995). The belief of its success comes from the preser- vation of a marginal distribution. The interpolated form of a bigram language model using the absolute discounting can be expressed as follows: max{C(u, v) − D, 0} p (v|u) = + λ(u) · p (v) (2.57) KN C(u) KN where D is a discounting factor between 0 and 1. The lower-order distribution pKN (v) is treated as unknown parameters, which can be estimated using the preservation of a marginal distribution:

pˆ(v) = pKN (v|u) · pˆ(u) (2.58) u X where pˆ(v) is the marginal distribution estimated from background training data such that C(v) pˆ(v) = ′ . After substitution and manipulation of equations, we arrive at the fol- v′ C(v ) P 26 lowing solution for the lower-order distribution: N(·, v) pKN (v) = (2.59) v N(·, v) where N(·, v) denotes the number of unique wordsP preceding v.

2.4.2 Witten-Bell Smoothing

Witten-Bell smoothing (Witten and Bell, 1991) is motivated from the Good-Turing esti- mate which states that for any event that occurs r times, we should pretend that it occurs r∗ times according to the following relationship:

∗ r · nr = (r + 1) · nr+1 (2.60) where nr represents the number of events that occur r times. With this notion, the total mass of unseen (u, v) is equal to n1 corresponding to the number of bigrams which occur once. Therefore, the total predicted mass of seen and unseen bigrams is estimated as

N1(u, ·)+C(u, ·) where N1(u, ·) denotes the number of word types following u and occur once, and C(u, ·) denotes the unigram count of word u. For a backoff language model, the probability mass reserved for the unigram distribution is expected to be N1(u,·) . Since N1(u,·)+C(u,·) there may be a chance that N1(u, ·) is equal to zero, N1+(u, ·), which denotes the number of word types following word u, is used instead to avoid zero probabilities. In summary, a bigram language model with Witten-Bell smoothing can be expressed as follows:

C(u, ·) N1+(u, ·) pW B(v|u) = pML(v|u)+ pW B(v) (2.61) N1+(u, ·)+ C(u, ·) N1+(u, ·)+ C(u, ·) where C(u, v) p (v|u) = (2.62) ML C(u, ·)

For comparison, Witten-Bell smoothing employs a maximum likelihood estimate for the bigram distribution while Kneser-Ney smoothing employs a discounted maximum like- lihood estimate. Moreover, Kneser-Ney smoothing is grounded on the marginal preserva- tion principle while Witten-Bell smoothing is motivated by the Good-Turing scheme. In

27 Chapter 4, we will generalize the Kneser-Ney smoothing with fractional counts for N-gram latent semantic analysis.

2.5 Unsupervised Language Model Adaptation

Language model adaptation is essential because of the mismatch between the training and the test domains. Besides LSA-based approaches, there are different approaches such as cache-based language model (Kuhn and Mori, 1990; Clarkson and Robinson, 1997), marginal adaptation (Kneser et al., 1997) and word triggering via maximum en- tropy model (Rosenfeld, 1994), to name just a few.

2.5.1 Word Caching

Word caching (Kuhn and Mori, 1990; Clarkson and Robinson, 1997) is an interesting ap- proach for language model adaptation. The hypothesis is that the recently used words have a higher chance to occur again in the future. A cache-based language model can be imple- mented by tracking the frequency of recently occurred words in an exponentially decaying N-gram cache, serving as a self-triggering model. This approach is effective for super- vised adaptation such as a dictation task where a user can help correct speech recognition errors. However, this approach can be harmful for unsupervised adaptation since it boosts the probability of mis-recognized words by increasing their word frequencies. In Chapter 4, we show that language model adaptation via latent semantic analysis can be interpreted as a cache-based language model. Instead of caching the frequency of words, the LSA approach caches the frequency of the recent topics and thus is more robust against speech recognition errors for unsupervised adaptation.

28 2.5.2 Word Triggering

Word triggering has been proposed to capture a long-distance relationship between words (Rosen- feld, 1994; Chen et al., 1998; Wu and Khudanpur, 2002) complementary to an N-gram language model. In this approach, all possible word pairs within a context window are considered and filtered to produce a list of potential word triggers. Then the average mu- tual information for each word trigger is computed and those with high average mutual information are selected as final word triggers. Finally, the selected word triggers serve as constraints for maximum entropy language modeling or they are linearly interpolated with a background language model. Word triggering can be extended to crosslingual settings (Kim and Khudanpur, 2004) where the word pairs between a source and a target language are extracted from document- aligned parallel corpora. Similarly, average mutual information I(f, e) are employed to select the crosslingual word triggers. The word-trigger probability of a source word f given a target word e is computed as follows: I(f, e) p f e (2.63) ( | ) = ′ f ′ I(f , e) where I(f, e) is set to zero whenever (f, e)Pis not a trigger pair. Around 1% relative reduction in character error rate was reported for crosslingual language model adaptation on a Mandarin ASR system for broadcast news with the baseline performance of 28.8% in character error rates (Kim, 2004). For unsupervised language model adaptation, word triggering may be sensitive to speech recognition errors similar to word caching because incorrectly recognized words may also trigger other irrelevant words.

2.5.3 Marginal Adaptation

While linear interpolation is a convenient technique for language model adaptation, in- terpolating an N-gram language model with a unigram language model may destroy the “syntactic” structure of the N-gram language model especially for function words that are usually position sensitive, e.g. “I am”, “You are” etc. Motivated by information the- ory, marginal adaptation (Kneser et al., 1997) finds an adapted language model pa(w|h)

29 such that the Kullback-Leibler divergence between pa(w|h) and the background language model pbg(w|h) is minimized subject to marginal constraints. The objective function to minimize is as in equation 2.64,

Minimize h pa(h) · KL (pa(.|h)||pbg(.|h)) such that ∀w : p (h) · p (w|h) =p ˜(w) (2.64) P h a a ∀h p w|h P : w a( ) = 1 P where p˜(w) is some unigram distribution relevant to the test domain. One can write the

Lagrangian of the objective function, take the derivative with respect to pa(w|h) and set it to zero (equation 2.65–2.66).

pa(w|h) D(pa(.|.)) = pa(h) · pa(w|h) · log h w pbg(w|h) (2.65) −P w λw ( Ph pa(h) · pa(w|h) − p˜(w)) − µ p w|h − Ph h ( Pw a( ) 1) P P

∂D(.) pa(w|h) = pa(h) · (1+ log ) − λw · pa(h) − µh =0 pbg(w|h) ∂pa(w|h) λw (2.66) ⇒ pa(w|h) ∝ pbg(w|h) · e λj ·fj (h,w) ∝ pbg(w|h) · e j P where

1 if w = j fj(h, w) = (2.67) ( 0 otherwise fj(h, w) is a unigram feature function independent of h. Since the solution of the adapted language model is in an exponential form, the optimization problem is similar to the max- imum entropy settings (Rosenfeld, 1994). Therefore, λj can be solved using the general- ized iterative scaling (GIS) (Darroch and Ratcliff, 1972). In the literature, only one GIS iteration can be applied since pa(h) is unknown for computing feature expectation using a given model. However, it can be approximated with a background model pbg(h) at the

30 first iteration as in equation 2.68–2.70,

(1) (0) E˜[fj(h, w)] ∀j : λj = λj + log (2.68) E[fj(h, w)]

p˜(w, h) · fj(h, w) λ(0) h,w (2.69) = j + log (0) (0) h,wPpa (w|h)pa (h) · fj(h, w) p˜(w) = log P (2.70) pbg(w) (0) (0) (0) where pa (w|h)= pbg(w|h), pa (h)= pbg(h) and λj =0. In summary, the form of the adapted model is a rescaled version of the background lan- guage model: α(w) · p (w|h) p (w|h) = bg (2.71) a Z(h) where Z(h) is a normalization factor to guarantee that probabilities sum to unity. α(w) is a scaling factor that is commonly approximated as follows: p˜(w) ǫ α(w) ≈ (2.72) p (w)  bg  where ǫ is a tuning factor between 0 and 1 to compensate for the approximation due to the limitation of one GIS iteration. In general, marginal adaptation is very expensive due to the computation of the normalization factor Z(h). However, an efficient normalization is available for a backoff N-gram language model (Kneser et al., 1997). The idea is to further impose a constraint that the total probability of the observed transition (h, w) in background training corpora is conserved after language model adaptation:

pa(w|h)= pbg(w|h)= Mass(h) (2.73) w:(Xh,w)∈T w:(Xh,w)∈T where the summation is taken only on the observed history and the current word (h, w) in training corpora T . Given that the background language model has a standard backoff structure plus the above constraint, the adapted language model has the following recursive backoff formula: α(w)·pbg(w|h) if (h,w) ∈ T p (w|h) = z0(h) (2.74) a ˆ ( bo(h) · pa(w|h) otherwise

31 where 1 Mass(h) = (2.75) z0(h) w:(h,w)∈T α(w) · pbg(w|h) and P 1 − Mass(h) bo(h) = (2.76) ˆ 1 − w:(h,w)∈T pa(w|h) bo(h) denotes the backoff weight of the wordP history h to ensure that pa(w|h) sums to unity. The backoff weights need to be updated accordingly after all the N-gram probability entries are adapted. hˆ denotes the reduced word history of h. The intuition behind the factor z0(h) is to perform “normalization” similar to equation 2.71, but the summation involves only a subset of words observed in T with the same word history h. When w is a stopword such as auxiliary verbs, articles, conjunctions, sentence bound- ary markers and punctuations, we do not adapt their N-gram probabilities because pre- dicting stopwords mostly relies on syntactic context but not topical context. We can easily model this effect by inserting a new branch in equation 2.74 to indicate that pa(w|h) = pbg(w|h) when w is a stopword. Hence, the computation of Mass(h) and z0(h) needs to be modified with stopwords being excluded from summation in equation 2.73 and equa- tion 2.75, respectively. In Chapter 4 we will show that using the LSA marginals for p˜(w) are effective for unsupervised marginal adaptation. We will describe a computationally inexpensive version of marginal adaptation for incremental lattice rescoring.

2.6 Summary

We have covered background materials relevant to the thesis including variational Bayes, approaches for latent semantic analysis for high-order models, language model smoothing, and language model adaptation techniques. In the following chapters, we will present our unified unsupervised topic adaptation framework for monolingual and crosslingual adaptation. In Chapter 4, we will describe monolingual language model adaptation for

32 automatic speech recognition based on N-gram latent semantic analysis. In Chapter 5, we will extend our framework to crosslingual adaptation for statistical machine translation based on bilingual N-gram latent semantic analysis.

33 34 Chapter 3

Baseline Transcription and Translation Systems

We describe the baseline transcription and translation systems used for our topic adap- tation experiments in this thesis. The systems are built mainly for large-scale evaluation including Mandarin transcription, Arabic transcription and Chinese-to-English statistical machine translation.

3.1 Background

Our research effort centers on the GALE (Global Autonomous Language Exploitation) program 1. The goal is to recognize, translate and extract useful information on broadcast news and broadcast conversation audio in multiple languages. Our research effort in this thesis focuses on transcription and translation parts of the GALE program. In the following sections, we describe our baseline transcription systems for Mandarin Chinese and Arabic as the source languages, and the Chinese-to-English translation systems, which translate an input Chinese text into English.

1http://www.darpa.mil/ipto/programs/gale/gale.asp, last consulted: 23 June 2009

35 ASR

Segmentation & clustering Decoder word hypotheses Source audio utterance

Acoustic model Language model

HMM context clustering

phone set dictionary vocabulary

Figure 3.1: Block diagram of automatic speech recognition.

3.2 Mandarin Transcription

Our first Mandarin transcription system was implemented for the RT04 evaluation (Yu et al., 2004). Our recent GALE Mandarin transcription systems (Hsiao et al., 2008) have been implemented based on the RT04 system with significant performance improve- ment due to the increased amount of training data from the GALE program, improved and clustering (Section 3.2.2), discriminative training on acoustic models (Section 3.2.3), cross adaptation (Section 3.2.6) and system combination. All systems were based on hidden Markov models (HMM) (Rabiner, 1989) implemented us- ing JRTK (Finke et al., 1997). Speech recognition was implemented using the IBIS de- coder (Soltau et al., 2001). The architecture of our transcription system is illustrated in Figure 3.1. An input audio is first segmented into speech utterances followed by automatic speaker clustering. Then a speech recognizer decodes an input speech utterance X into a word hypothesis W using an acoustic model p(X|W ) and a language model p(W ) using the Bayes decision rule:

Wˆ = argmax p(W |X) = argmax p(X|W ) · p(W ) (3.1) W W acoustic model language model | {z } | {z } 36 Table 3.1: Demi-syllables for Initial-Final modeling for Mandarin Chinese. Initial bcchdfghjklmnpqrsshtwxyzzh Final aaianangaoeeienengeriiaianiangiaoieiningiongiu o ong ou u ua uai uan uang ue ui un uo ¨u¨ue ng

3.2.1 Chinese-Specific Issues

Chinese text is not segmented at the word level. In other words, a sentence is simply a sequence of characters, with no spaces in between. It is not trivial to segment Chinese text into words. To make matters worse, since the distinction between words and phrases is weak, a sentence can have several acceptable segmentation with the same meaning. For language modeling purposes, it is important to have a good word list and to segment the training data properly. While the number of words can be unlimited, there are only about 6.7K characters in simplified Chinese. A Chinese character is pronounced as a syllable, hence Chinese is a mono-syllabic language. A syllable can have five different tones: flat, rising, dipping, falling, and neutral. There are about 1300 unique tonal syllables, or 408 unique syllables disregarding tones. Studies have shown that the realization of tones is context sensitive, an effect known as tone sandhi. For example, when a word comprises two third-tone characters, the first character will be realized in a second tone. The out-of-vocabulary issue can be alleviated by adding a closed set of Chinese char- acters into the search vocabulary of a speech decoder so that if a new word is spoken, there is still a chance that the speech decoder may recognize the individual characters of the word. Pinyin is the official romanization system for Mandarin Chinese. While most European languages are transcribed at the phone level, Pinyin is essentially a demi-syllable level representation, also known as initial-final: an initial is typically a consonant; a final can be either a monophthong, a diphthong or a triphthong. There are 23 initials and 35 finals in Mandarin as shown in Table 3.1. Since the Pinyin representation is standard, it is easy to find pronunciation lexicons in this format. Alternatively to using pinyin, one can use a phonetic representation for pronunciations. The LDC 1997 Mandarin CallHome lexicon

37 (LDC96L15) contains phonetic transcriptions for about 44K words, usinga phone set of 38 phones. While phonemes are well studied and understood, they are not the most natural representation for Chinese. It also remains unclear whether there is a widely accepted phonetic transcription standard for Chinese. In Mandarin Chinese, some words can have the same pinyin and tone transcription but with different meanings. For example, “Shu4-Mu4” can mean “tree” or “number” depending on a context. Therefore, it is very useful to use topical information from the context for word disambiguation.

3.2.2 Audio Segmentation and Speaker Clustering

Audio segmentation in our system (Yu et al., 2004) is implemented via an HMM seg- menter with four classes: Speech, Noise, Silence, and Music. The speech features are 13-dimension MFCC plus their first and second derivatives. Each class is represented by a Gaussian mixture model with 64 Gaussians. The system is trained on 3 hours of manually annotated HUB4 shows. The resulting speech segments (the Noise, Silence, and Music segments are ignored) are then grouped into several clusters, each of which ideally correspond to an individ- ual speaker. A hierarchical, agglomerative clustering technique with Bayesian Informa- tion Criterion (BIC) is applied (Jin and Schultz, 2004). A tied Gaussian mixture model (TGMM) is built on the whole set of speech segments, serving as a background model. A Gaussian mixture model for each cluster is trained via adaptation of the background TGMM. Each segment is considered as a cluster at an initial step. We define the distance between two clusters by the Generalized Likelihood Ratio (GLR): p(X|θ) D(C1,C2)= − log (3.2) p(X1|θ1) p(X2|θ2) where X1, X2, and X are feature vectors in clusters C1,C2, and the merged cluster of

C1 and C2, respectively. θ1, θ2, and θ are statistical models built on X1, X2, and X, respectively. We can see from equation 3.2 that the smaller the distance, the closer the two clusters

38 Table 3.2: Acoustic model training data for Mandarin transcription.

Mandarin systems Hour Source RT04/GALE-P1-dryrun 96hr HUB4m (LDC98S73), TDT4 subset (LDC2005S11) GALE-P1 558hr +GALE-Y1data(LDC2007E99) GALE-P2/GALE-P3 1300hr +GALE-Y2data(LDC2008E38) are to each other. At each step, the two closest clusters are merged and the model of a new cluster is re-estimated. Clustering terminates when the BIC stopping threshold is ex- ceeded. Setting the threshold is an art that highly depends on the nature of an input audio. In the GALE-P2 and GALE-P3 evaluations, a snippet show, whichis a piece of story on a consistent topic, is relatively shorter (1-2 minutes) compared to the GALE-P1 evaluation. In addition, an input audio file can contain multiple snippet shows. We therefore perform clustering across snippets on the same show, which means that all speech segments from different snippets of the same show are pooled together for clustering and a unique speaker label is shared across different snippets. We apply two different BIC thresholds depending on the number of snippets per show to reduce the chance of underestimating the number of speakers in a multiple-snippet show. Underestimating the number of speakers will affect the performance of acoustic model adaptation since speech segments from an irrelevant speaker may be used undesirably for adaptation.

3.2.3 Acoustic Modeling

Table 3.2 shows the audio data for acoustic model training for the RT04 and GALE tran- scription systems. We benefit from an increased amount of the in-domain training data from the GALE program 2 so that we can afford more codebook and Gaussian parameters. For feature extraction, we use 13 Mel-Frequency Cepstral Coefficients (MFCC) per frame. Cepstral mean and variance normalization is performed on a speaker/cluster basis.

2http://projects.ldc.upenn.edu/gale/data/DataMatrix.html, last consulted 30 Mar 2009

39 Dynamic features are extracted by concatenating 15 adjacent frames (current frame ±7 adjacent frames), then using linear discriminant analysis (Duda et al., 2001) to produce an output feature vector with 42 dimensions. Vocal tract length normalization (VTLN) is performed on a speaker/cluster basis (Zhan and Westphal, 1997). As described before, the acoustic modeling units can be either Initial-Finals (I-F) or phones. In both cases, context- dependent models are built and then clustered using a decision tree with a set of phonet- ically motivated questions. Each Initial and Final demi-syllables employ a 3-state and 5-state hidden Markov model while each phone has 3 states, all with a strict left-to-right topology. We find that both systems give comparable performance, with the initial-final system slightly better than the phone-based system. We take advantage of cross-adapting among the syllable-based and the phone-based systems to improve the recognition accu- racy.

Model Training

Since the manual transcripts of the GALE training audio are manually segmented with speaker labels, speech segmentation and clustering are not involved in the GALE train- ing audio. The model training can be performed by bootstrapping a context-independent model using a legacy model. Then the context-independent model is used to estimate a context-dependent model, which is further refined to a speaker-independent and a speaker- adaptive model for multi-pass decoding. Given our legacy RT04 system, we perform initial forced alignment on the GALE training utterances to obtain the state alignment. Then we use maximum likelihood estima- tion to train the acoustic models. For context-independent models, we apply the following steps:

1. Perform linear discriminant analysis (Duda et al., 2001) to project a window of MFCC feature vectors into a 42-dimension feature vector.

2. Extract feature samples for each HMM state according to the state alignment of the training utterances.

40 3. Perform the K-means clustering to obtain an initial Gaussian mixture model with a fixed number of Gaussians.

4. Run eight iterations of label training (i.e. with a fixed state alignment) to refine the acoustic model. Viterbi or Baum-Welch training can also be applied at the expense of increased computation.

5. Refine the state alignment using the updated model and repeat the above steps as needed.

We initialize an unclustered context-dependent model with the context independent model via sharing the Gaussian parameters but with unique Gaussian mixture weights per unclus- tered model. Then we train a speaker independent (SI) model for the first-pass decoding:

1. Perform one iteration of Viterbi training to obtain the mixture weights of each un- clustered context-dependent model.

2. Perform top-down state clustering based on the Gaussian mixture weights of the unclustered model using the decision tree approach that maximizes the information gain after each node is split (Rogina, 1997). A phonetically motivated question set is applied, producing tied quinphone states (or senone) with a fixed number of codebooks.

3. Perform linear discriminant analysis to project a window of MFCC feature vectors into a 42-dimension feature vector.

4. Extract feature samples for each HMM state according to the state alignment of the context-independent system.

5. Perform the merge and split training to grow the Gaussian mixtures incrementally depending on the minimum occupancy count of a Gaussian and the maximum num- ber of Gaussians allowed (which is set to 100 per codebook for the GALE Mandarin systems).

6. Perform global semi-tied covariance (STC) training (Gales, 1999).

41 Below are the procedures for speaker-adaptive training:

1. Determine the warping factors for vocal tract length normalization (Zhan and West- phal, 1997) for each training speaker.

2. Perform linear discriminant analysis using the warped MFCC features as inputs.

3. Extract feature samples for each HMM state.

4. Perform the merge and split training for each HMM state followed by semi-tied covariance training.

5. Perform speaker-adaptive training (SAT) using a single feature space transform per speaker (Gales, 1997), known as the feature space adaptation (FSA) or the feature- space maximum likelihood linear regression (fMLLR).

With the syllable-to-phone mapping, we can bootstrap an initial phone model easily. The phone-based systems follow the same training procedures using a phone-based word lexicon. In addition, we attempt to maximize the difference between the phone-based and the syllable-based system via the genre-dependent modeling. We incorporate a binary question to ask whether a phonetic context is from broadcast news (BN) or broadcast conversation (BC). Then the state clustering procedure will determine if this question is applied for node splitting.

Discriminative Training

Since the hidden Markov model is not a correct model for speech, discriminative train- ing is an essential technique to correct the modeling assumption and thus giving signif- icant improvement in recognition accuracy. Maximum mutual information estimation (MMIE) (Valtchev et al., 1997) and boosted MMIE (BMMIE) (Povey et al., 2008) are common techniques for discriminative training and have been applied in our GALE-P3 Mandarin transcription system (Hsiao et al., 2008). Recent advancement in discriminative

42 training includes better smoothing via controlling the degree of improvement of condi- tional likelihood on the model space (Hsiao et al., 2009) and the feature space (Hsiao and Schultz, 2009). Starting with the syllable-based speaker-adaptive model using maximum likelihood estimation, we decode the GALE training utterances to generate the word lattices to rep- resent a compact set of competing hypotheses for each utterance. MMIE aims at maximizing the posterior probability of a reference compared to the competing hypotheses in a word lattice. The objective function of MMIE is: R p (X |M ) · p(s ) F (λ)= log λ r sr r (3.3) MMI p (X |M ) · p(s) r=1 s∈lat λ r s X where λ represents the acoustic model parametersP to be optimized; Xr is the r-th training utterance; sr is the reference and Ms represents the corresponding HMM state sequence of sentence s. Maximizing FMMI (λ) improves the posterior probability of the reference in a lattice. This function can be optimized using the extended Baum-Welch algorithm. The update equations of Gaussian means and covariances, without the smoothing parts, are: xnum − xden + D µ µ r r r r (3.4) ˆr = num den γr − γr + Dr Snum − Sden + D (Σ + µ µ′ ) ˆ r r r r r r µ µ′ (3.5) Σr = num den − ˆr ˆr γr − γr + Dr ′ where xr and Sr are the weighted sum of features xt and xtxt for the r-th Gaussian, re- spectively; γr represents the occupancy count; Dr is a constant to control the learning rate and to ensure Σr is positive definite. The superscripts num and den specify the statistics belonging to the numerator or denominator of FMMI (λ). For MMIE, the numerator statis- tics are the same as that of maximum likelihood estimation, while denominator statistics are collected from the word lattices. Intuitively, some paths may contain more error than other paths in a word lattice. Boosted MMIE boosts the importance of competitors that make large error. The objec- tive function is shown as follows: R pλ(Xr|Msr ) · p(sr) FBMMI (λ)= log (3.6) p (X |M ) · p(s) · e−bA(s,sr) r=1 s λ r s X P 43 Table 3.3: Acoustic model configurations of the GALE Mandarin transcription systems. * denotes that (boosted) MMIE and genre-dependent modeling were applied to the GALE- P2 and P3 systems only. Model 1st-pass 2nd-pass 3rd-pass Unit I-F phone I-F Model SI SAT-FSA SAT-FSA Training ML ML (B)MMIE* #Codebook 10K 10K 10K #Gaussian 805K 825K 784K Genre-dependent no yes* no Algorithms - STC&VTLN STC&VTLN

where A(s,sr) is the raw phone accuracy of sentence s (Povey, 2003); b is the boosting factor larger than zero. Hence, the likelihood of the competitors with higher error are boosted. The update equation of BMMIE is the same as MMIE, but the denominator statistics are altered when we compute the forward-backward scores on the lattices. In other words, the likelihood score of each word arc in a lattice is subtracted by b · A(s,sr). In our system, the boosting factor is set to 0.5. We apply I-smoothing with τ = 100 for both MMIE and BMMIE training. The maximum likelihood model is used as a backoff in I-smoothing. Finally, the discriminatively trained model is used for the third-pass decoding in our GALE system. Table 3.3 summarizes the acoustic model configurations for the GALE Mandarin tran- scription systems.

3.2.4 Language Modeling, Text Data, Normalization

We use several corpora for our language model development for the RT04 and the GALE systems: Mandarin Chinese News Text, TDT{2,3,4}, Mandarin Gigaword corpora v1-2, the HUB4 1997 acoustic training transcript, the CALLHOME Mandarin transcript, the GALE acoustic training and web transcripts, and the web data shared among the com-

44 Table 3.4: Language model training data for Mandarin transcription.

Mandarin systems Word token Source RT04-small 13M XinhuaNews2002(LDC2003T09) RT04-full/GALE-P1-dryrun 300M Mandarin Chinese News Text (LDC95T13), TDT2 (LDC2001T57), TDT3 (LDC2001T58), TDT4 (LDC2005T16), Mandarin Gigaword cor- pora v1 (LDC2003T09), the HUB4 1997 acoustic training transcript (LDC98S73), the CALLHOME Mandarin transcript (LDC96T16) GALE-P1 800M +GALE-Y1transcripts(LDC2007E100) + Mandarin Gigaword corpora v2 (LDC2005T14) GALE-P2 1.0B +GALE-Y2transcripts(LDC2008E39) and web data GALE-P3 1.5B +GALE-Y3transcripts(LDC2008E39) and shared web data

peting teams of the GALE program. We divide the training data into subsets based on their sources, including the acoustic training transcripts for broadcast news and broad- cast conversation as separate sources. These give us 14 sources including BBC, CCTV, NTDTV, RFA, Central News Agency, China Radio, People’s Daily, Sina News, Xinhua News, Zaobao News, Phoenix TV, broadcast news, broadcast conversation and others. Ta- ble 3.4 summarizes the text data for language modeling in different Mandarin systems. The RT04-small and RT04-full systems are employed for small-scale topic-adaptation ex- periments while the GALE systems are used for large-scale evaluation. Any text that falls into the excluded time frame (specified in the evaluation specification) are removed from the training corpora.

45 Before training a statistical language model, we first process the Chinese text data to normalize for ASCII numbers, ASCII strings and punctuations. We devise heuristic rules in combination with a Maximum Entropy classifier to normalize numbers. The classifier classifies whether an input number is a digit string (e.g. telephone number) or a number quantity based on the surrounding word context. We map English words to a special to- ken +english+, human noises (such as breath and cough) to +human noise+. Non-human (environmental) noises are removed from the training transcript. Since punctuations pro- vide word boundary information useful for word segmentation, they are removed only after word segmentation. Word segmentation is based on a maximal substring matching approach that locates the longest possible word segment at each character position. Since proper names are often incorrectly segmented, we later on add the LDC Named-Entity (NE) list (LDC2005T34) into a wordlist provided from the official LDC segmenter 3. The NE list contains different semantic categories, such as organization, company, person and location names. Having them in the wordlist greatly improves segmentation quality, which translates to more accurate predictions in an N-gram language model. After word segmen- tation, we choose the vocabulary to be the top-K most frequent words. The commonlyused Chinese characters (6.7k) are then added into the vocabulary, giving the 63k vocabulary for the RT04 system. For cross-site collaboration between IBM, JHU and CMU-InterACT for the GALE evaluation, we merge our word lists plus some frequently occurring English words, giving a fixed 108k vocabulary for the GALE systems. Unless specified, a background 4-gram language model is trained with the modified Kneser-Ney smoothing using the SRI language model toolkit (Stolcke, 2002) in this thesis. Each of the source-dependent language models are linearly interpolated with the interpo- lation weights estimated using a heldout set.

3.2.5 Pronunciation Lexicon

Our pronunciation lexicon is based on the LDC CallHome Mandarin lexicon, which con- tains about 44k words. Pronunciations for words not covered by the LDC lexicon are

3http://projects.ldc.upenn.edu/Chinese/segmenter/Mandarin.fre, last consulted 27 March 2009

46 generated using a maximal matching method. The idea is similar to our word segmenta- tion algorithm. We first compile a list of all possible character segments for each covered vocabulary word. For each uncovered word, the algorithm repeatedly searches for the longest matching character segment from the beginning to the end of the word, producing a sequence of character segments. Pronunciations of these segments are then concatenated to produce the pronunciation for a new word. There are 23 initials and 35 finals, and 38 phonemes defined by the CallHome lexicon. Eight additional phonemes are used to model human noises, environmental noises and silence. We use the demi-syllable-to-phoneme mappings provided by the CallHome lexicon to convert a demi-syllable lexicon into a phone-based lexicon.

3.2.6 Decoding Strategy

We employ a three-pass decoding strategy for the GALE evaluation. Given the manual segmentation of the test audio, automatic speaker clustering is performed. Then we use the speaker-independent model to decode the test utterances to obtain the initial word hy- potheses for acoustic model adaptation using vocal tract length normalization (Zhan and Westphal, 1997), feature-space adaptation (Gales, 1997) and model-space maximum like- lihood linear regression (Leggetter and Woodland, 1995) with a maximum of 256 classes in a binary regression tree. The word confidence is also applied to weight the importance of each frame during adaptation. Then we use the adapted phone-based system for the second-pass decoding. Similarly, the word hypotheses from the second-pass decoding are used to adapt the speaker-adaptive syllable-based system for the third-pass decoding.

Cross-Adaptation with IBM

For the GALE evaluation, we perform cross-adaptation between our system and the IBM system via exchanging the word hypotheses from the final self-adapted systems. In other words, each system first goes through a multi-pass decoding procedure on test utterances. Then the word hypotheses from the IBM system are treated as transcription references for acoustic model adaptation on our system. Similarly, the IBM system is adapted using the

47 Table 3.5: Size of the acoustic model (AM) and language model (LM) training corpora and the size of vocabulary for the Arabic transcription system in terms of number of hours and word tokens. Arabic AM LM Vocab. GALE-P3 1500hr 962M 737k(OOV ≈1.0%) word hypotheses from our system. To facilitate the cross-adaptation procedure, we use a common search vocabulary and a speaker clustering database for decoding.

3.3 Arabic Transcription

Besides Mandarin, we also evaluate our language model adaptation approach on Ara- bic transcription. Our Arabic transcription system adopts a similar training procedure as our Mandarin transcription system described in Section 3.2.3. More details can be found in (Noamany et al., 2007). The training corpora for acoustic and language model training for the GALE-P3 evaluation are shown in Table 3.5.

3.3.1 Arabic-Specific Issues

Two Arabic-specific issues are worth mentioning. First, the written Modern Standard Ara- bic (MSA) lacks short vowels in between two consonants. Vowelization is required to restore the vowels for accurate acoustic modeling training. Given all possible configura- tion of vowelization of a transcribed utterance, forced alignment can be applied to select the most likely configuration. However, it is challenging to vowelize the text data with high accuracy for language modeling. Therefore, it is generally preferred to allow the speech decoder to choose the best vowelized model during decoding. Therefore, our Ara- bic language model is trained on unvowelized text. Secondly, Arabic has rich morphology. A new word can be formed by attaching pre- fixes and suffices to a word stem. As a result, the vocabulary size in Arabic (over 700k)

48 is much larger than that in Mandarin (108k). This makes the data sparseness issue more critical for language modeling. Large vocabulary is required to maintain an acceptable out-of-vocabulary rate within 1% on a development set. In our preliminary experiment, we reduce the vocabulary size via to alleviate the data sparseness issue. How- ever, this is not effective to improve the lattice rescoring performance since most of the word transitions in a lattice become identical after stemming and thus losing the discrimi- native power. Therefore, we do not apply word stemming in our experiments.

3.3.2 Decoding Strategy

Similar to our Mandarin transcription system, the Arabic transcription system employs a three-pass decoding strategy. The first-pass decoding employs an unvowelized speaker- independent model in which a vowel between consonants is not restored. The second-pass decoding employs an unvowelized speaker-adaptive model that uses the word hypotheses from the first-pass decoding for acoustic model adaptation using vocal tract length normal- ization, feature-space adaptation and model-space maximum likelihood linear regression. In the third-pass decoding, we employ a vowelized model in which the vowel is restored in the pronunciation lexicon. Interleaving the use of unvowelized and vowelized models may maximize the system difference in terms of different error patterns so that the adaptation performance may be enhanced.

3.3.3 Performance Metrics

Performance metrics are the word perplexity and the character (word) error rate defined as follows:

− 1 N log p(w |w w w ) Perplexity = e N i=1 i i−3 i−2 i−1 (3.7) I + DP + S Word error rate = (3.8) N where N denotes the number of word tokens in a reference. A word history is usually rep- resented as a trigram history wi−3wi−2wi−1. I, D, and S denote the insertion, deletion and

49 Table 3.6: Statistics of the Mandarin RT04 test set.

RT04 Duration Genre CCTV 0.33hr BN NTDTV 0.33hr BN RFA 0.33hr BNplusphoneinterview

Table 3.7: Statistics of the development and test sets for the GALE evaluations from phase 2 (P2) to phase 3 (P3). “Eval07u” and “Eval07r” stand for unsequestered and re-test portions of Eval07 respectively.

Mandarin Chinese Arabic P2 P3 P3 Genre Eval06 Dev07 Dev08 Eval07u Eval07r Dev07 Dev08 Eval07u BN 0.59hr 1.07hr 0.49hr 0.63hr - 1.71hr - 2.05hr BC 0.45hr 1.38hr 0.48hr 0.52hr - 0.87hr - 2.03hr ALL 1.04hr 2.53hr 0.98hr 1.15hr 1.64hr 2.58hr 3.04hr 4.08hr substitution error after aligning the recognized word sequence with the reference using dy- namic programming. For fair comparisons between different approaches, an optimal word error rate is reported after tuning an optimal word insertion penalty (lp) and a language model weight (lz) that are usually combined in the following manner:

Score(X,W) = log p(X|W )+ lz · log p(W ) − lp (3.9) where X and W denote a speech utterance and a word sequence respectively.

3.3.4 Evaluation Sets

We have the RT04 eval set and the GALE development/test sets to benchmark the recog- nition performance as shown in Table 3.6 and Table 3.7 respectively. The Mandarin RT04 eval set are mainly broadcast news while the GALE development/test sets are a mixture of

50 Table 3.8: Sources of the GALE Mandarin development/test sets.

Set Source Eval06 CCTV4, NTDTV, PHOENIX Dev07 CCTV1, CCTV4, CCTVNEWS, NTDTV, PHOENIX Dev08 BEIJING, CCTV1, CCTV2, CCTV4, CCTV7, CCTVNEWS, NTDTV, PHOENIX, VOA Eval07u ANHUI, CCTV1, CCTV2, CCTV4, CCTV7, CCTVNEWS, NT- DTV, PHOENIX Eval07r CCTV1, CCTV4, CCTV7, CCTVNEWS, NTDTV, PHOENIX

Table 3.9: Sources of the GALE Arabic development/test sets.

Set Source Dev07 ABUDHABI ALAM ALJZ ARABIYA DUBAISCO IRAQIYAH KUWAITTV LBC SCOLA SYRIANTV Dev08 ALAM ALHIWAR ALHURRA ALJZ ALMANAR ALURDUNYA ARABIYA DUBAISCO IRAQIYAH KUWAITTV LBC OMANTV SAUDITV SCOLA SYRIANTV Eval07u ABUDHABI ALAM ALJZ ARABIYA DUBAISCO IRAQIYAH KUWAITTV LBC OMANTV SCOLA SYRIANTV

51 broadcast news and broadcast conversation. Since broadcast conversation is more sponta- neous in speaking style, speech recognition becomes more challenging on broadcast con- versation than broadcast news. Table 3.8 shows the sources of the GALE development and test sets. Originally, Eval07 is designed for the GALE-P2 evaluation. Because of system retesting, known as the GALE-P2.5 evaluation, Eval07 is divided into the unsequestered portion (Eval07u) and the retest portion (Eval07r). Eval07u is treated as an internal devel- opment/test set while the retest portion is part of the official test set for the GALE-P2.5 evaluation. Aligning word hypotheses with the reference transcription using hubscr07.pl 4 pro- duces a SGML (Standard Generalized Markup Language) file for a baseline system and an alternative system. Then the Matched Pairs Sentence-Segment Word Error (MAPSSWE) approach (Gillick and Cox, 1989) is performed for significance testing using the NIST scoring tool (sclite) (Pallett et al., 1990) with the following command: cat baseline.sgml alternative.sgml | sc stats -p -t mapsswe -v -u -n result.txt

3.4 Statistical Machine Translation

Our baseline statistical machine translation system is trained using parallel training sen- tences. We first introduce different components of statistical machine translation and de- scribe our baseline systems.

3.4.1 Basic Components

Statistical machine translation (SMT) usually consists of three components: a translation model p(F |E), a language model p(E) and a distortion model d(F, E) where F is an input sentence of a source language and E is an output sentence of a target language. From the Bayes point of view, the decision rule for statistical machine translation is identical to that

4http://www.itl.nist.gov/iad/mig/tests/rt/2003-spring/tools/hubscr07.pl, last consulted 9 April 2009

52 for automatic speech recognition:

Eˆ = argmax p(E|F ) = argmax p(F |E) · p(E) (3.10) E E translation model target language model The translation model is analogous to an acoustic| {z model,} while a language|{z} model is re- quired on both tasks. However, statistical machine translation requires a distortion model to help re-order output words/phrases on the target language. In this thesis, topic adapta- tion applies to the translation model and the language model of the target language.

3.4.2 Word Alignment

Parallel sentences are required to obtain a word translation lexicon p(f|e) in statistical machine translation. Therefore, we need to know how a source word fj at position j of a J I source sentence F = f1 aligns to a target word ei at position i of a target sentence E = e0. Prevalent word alignment models include the IBM Model 1–5 (Brown et al., 1994), the HMM model (Vogel et al., 1996) and Model 6 (Och and Ney, 2003). A latent alignment variable aj of a source word fj is a position index of a target word eaj in a target sentence E. Below is a generative procedure for a parallel sentence pair using the IBM Model 1:

J For each position j of a source sentence f1 ,

• Sample an alignment variable aj ∈ [0, I] uniformly to pick a word eaj in a I target sentence e0 where e0 denotes a NULL word, meaning that fj does not align to any target word.

• Sample fj from a word translation lexicon p(f|eaj ).

J The generative procedure defines the joint likelihood of the source sentence f1 and the J I word alignment sequence a1 given the target sentence e0: J J J I p(f1 , a1 |e0) ∝ p(aj|I) · p(fj|eaj ) (3.11) j=1 Y 1 J J ∝ p(f |e ) (3.12) I +1 j aj j=1   Y 53 The IBM Model 1 samples each alignment variable aj independently and uniformly. IBM

Model 2 replaces the uniform distribution with a distribution p(aj|I). The HMM model introduces the first-order dependence of the alignment variables via p(aj|aj−1, I). IBM Model 3-5 further improves the word alignment via the notion of fertility and the inverted alignment set Bi. With fertility, a target word ei can generate multiple source words, with

Bi containinga set of positions of the source words. Model 3 makes a zero-order distortion model over Bi using p(Bi|ei) while Model 4-5 employ a first-order model p(Bi|Bi−1, ei) which depends on the previous inverted alignment set Bi−1. The generative procedure of the IBM Model 4 for p(F |E) is described as follows:

I • For each position i of a target sentence e1, sample a fertility factor φi for ei of how many source words f are generated (which can be zero).

• Sample a fertility factor φ0 for the NULL word e0 from a binomial distribution I Binomial(φ| i=1 φi).

P I • For each position i of e0, sample the source words given ei up to the fertility count

using the word translation lexicon, that is fik ∼ p(f|ei) ∀k =1...φi.

I • For each position i of e1, sample the inverted word alignment set Bi. To start with,

the position of the first source word fi1 is determined by sampling the jump dis-

tance ∆1 using the probability distribution p=1(∆|class(fi1), class(ei)) relative to

the center position of the previous non-empty Bc(i) denoted as Bc(i). In other words,

Bi1 = Bc(i) + ∆1. The word classes of the source and target words can be deter- mined using a word clustering algorithm (Brown et al., 1992). The use of word classes is to reduce the number of model parameters and improve the model gener-

alization. The jump distances ∆k for other remaining source words fik (with k> 1)

are sampled using another probability distribution p>1(∆|class(fik)) monolingually.

In other words, Bik = Bik−1 + ∆k.

• The inverted alignment positions B0k corresponding to the NULL word e0 is sam- pled with a uniform distribution 1 , assuming that each permutation is equally φ0! likely.

54 The generative procedure of IBM Model 4 corresponds to the joint likelihood of the source J J I sentence f1 and the word alignment sequence a1 given the target sentence e0:

J J I J I I p(f1 , a1 |e0) = p(f1 , B0 |e0) (3.13) I I I φi

= p(φ0| φi) p(φi|ei) · p(fik|ei) · i=1 i=1 i=0 X Y Y kY=1 fertility model translation model | I {z } | {z } p=1(Bi1 − Bc(i)|class(fi1), class(ei)) · i=1 Y distortion model for the first source position | I φi {z } p>1(Bik − Bik−1|class(fik)) · i=1 Y kY=2 distortion model for the remaining source positions 1 | {z } (3.14) φ0!

distortion model for B0 Model 3-4 is said to be deficient because|{z} both models ignore whether a source word has been chosen while sampling Bi. In addition, probability mass is reserved for source po- sitions outside a sentence boundary. Therefore, the probability distribution does not sum to unity. Model 5 addresses this deficiency by excluding the source positions which have been already sampled. Model 6 is a log-linear combination of Model 4 and the HMM model. Given the word alignment of parallel sentences, estimating a word translation lexicon can be performed via the word alignment counts C(f, e) of how many times a source word f is aligned to a target word e in training corpora: C(f, e) p f e (3.15) ( | ) = ′ f ′ C(f , e) A backward translation lexicon p(e|f) can beP estimated in a similar fashion. Expectation- Maximization or Viterbi training is involved to refine the word alignment and then re- estimate the translation lexicons iteratively until convergence.

55 3.4.3 Phrase Extraction

Word alignment is an essential step towards state-of-the-art phrase-based statistical ma- chine translation (Koehn et al., 2003; Vogel et al., 2003). The advantage of using phrase translations is that local word re-ordering can be captured within a phrase pair < f,˜ e˜ >, which can be extracted from the word alignment between a parallel sentence pair. The aligned phrase pairs that are consistent with the word alignment are collected: The words in a legal phrase pair are only aligned to each other, and not to words outside (Och et al., 1999). Similar to estimating a word translation lexicon, we can use a phrase alignment count C(f,˜ e˜) to estimate the phrase translation probability known as the phrase score:

C(f,˜ e˜) p(f˜|e˜) = (3.16) ˜′ f˜′ C(f , e˜) Alternatively, the translation probability canP be computed using the best word alignment of a phrase pair, known as the lexical weighting, as follows:

pw(f˜|e˜) ≈ max p(f˜|e,˜ a) (3.17) a 1 = max p(fj|ei) (3.18) a |{i :(i, j) ∈ a}| j Y ∀(Xi,j)∈a where a is some word alignment configuration of a phrase pair observed in all parallel sentence pairs. Typically, phrase extraction creates a phrase table with four scores for each phrase pair: the phrase scores and the lexical weightings in both translation directions.

3.4.4 Minimum Error Rate Training

Motivated from the maximum entropy modeling, a direct modeling approach (Och and I J Ney, 2002) is prevalent to model the posterior probability p(e1|f1 ) via a set of feature I J functions hm(e1, f1 ):

M I J m=1 λmhm(e1,f1 ) I J e p(e1|f1 ) = P J (3.19) Z(f1 )

56 where each feature function hm(·) takes a sentence pair as an input and produces a real number as an output. For instance, feature functions include language model, distortion model, word count, phrase count, phrase translation scores and lexical weightings. For instance, the feature function for a language model takes the target sentence (and ignore I the source sentence) and returns the total language model score of the target sentence e1.

Minimum error rate training (Och, 2003) is applied to optimize the feature weights {λm} with an optimization criterion such as BLEU, which will be discussed in Section 3.4.6. Given the N-best translation candidates of each of the input sentences in a development set, the feature weights are adjusted iteratively to re-rank the N-best lists such that the cost function is optimized.

3.4.5 Language Modeling

Similar to automatic speech recognition, an N-gram language model is usually employed for statistical machine translation. The language model is trained only on text from the target language. Since our baseline system translates from Chinese to English, our target language model is trained on English text from monolingual non-parallel corpora and the English side of bilingual parallel training corpora. The SRILM toolkit (Stolcke, 2002) is used for language model training using the modified Kneser-Ney smoothing. Similar to language modeling for automatic speech recognition, the English text are partitioned according to sources so that a source-dependent language model is trained. The language models are linearly interpolated with the interpolation weights estimated using a heldout set.

3.4.6 Performance Metrics

Bilingual Evaluation Understudy (BLEU) (Papineni et al., 2002) measures the quality of translation based on the statistical closeness of translated sentences {si} to reference trans-

57 lations {ri} with I sentence pairs in a test set. The BLEU metric is defined as follows:

N 1 ·log p BLEU = BP · e n=1 N n (3.20) I P count(n-gram) i=1 n-gram∈si where pn = (3.21) I count (n-gram) Pi=1 Pn-gram∈si sys L − max(0, ref −1) and BP = eP PLsys (3.22) where count(n-gram) is the n-gram co-occurrence in a translated sentence and a corre- sponding reference translation. countsys(n-gram) is the n-gram count in the translated sentence only. pn is the modified n-gram precision. BP is the brevity factor to penal- ize shorter translation than the reference translation where Lref and Lsys denote the total length of the reference and the system translation on the test set respectively. Usually, N is set to 4, known as the 4-gram BLEU score. The NIST metric (Doddington, 2002) attempts to weight the n-gram co-occurrence based on information since some n-gram may be more informative than the others. The formulation is defined as follows:

N all n-grams that co-occur info(n-gram) NIST = BP · (3.23) 1 n=1 P n-gram∈si X count((n-1)gram) info(n-gram) = log P (3.24) 2 count(n-gram) L b·log min( sys ,1) BP = e 2 Lref (3.25) where count(n-gram) is the count of occurrences of (w1 w2 ... wn) and count((n-1)gram) is the count of occurrences of (w1 w2 ... wn−1) in all reference translations. b is chosen to make the brevity penalty (BP) equal to 0.5 when the number of words in the system output is 2/3rds of the average number of words in the reference translation.

3.4.7 Chinese-To-English Translation System

Our baseline systems include a small-scale RT04 system (Tam et al., 2007a), medium- scale GALE development system, and large-scale GALE-P2.5 SMT system (Hildebrand

58 Table 3.10: Size of the language model training corpora and the parallel training corpora for phrase extraction in terms of number of words.

LMtraining //trainingcorpora SMTSystem English Source Chinese English RT04 80M Donga,Xinhua2004 35M 43M GALE-dev 500M Xinhua(1995-2006) 59M 67M GALE-P2.5 2.7B GigawordV3(LDC2007T07) 232M 260M

et al., 2008) translating from Chinese to English. The parallel training corpora for system development are shown in Table 3.10. Part of the Chinese-English bilingual corpora for the GALE system are available from the LDC 5. The RT04 system employs online phrase extraction using the PESA approach (phrase pair extraction as sentence splitting) (Vogel, 2005). To facilitate the efficiency of online phrase extraction, parallel training sentences are indexed via a suffix array (Manber and Myers, 1993; Zhang and Vogel, 2006) and pre-loaded into memory before decoding. The IBM Model-1 lexicon is used for scoring the phrase pairs during decoding. For the medium-scale and large-scale GALE systems, the IBM Model-4 is used for word alignment using a parallel version of GIZA++ toolkit (Och and Ney, 2003). Phrase extraction and scoring are performed using the Moses toolkit (Koehn et al., 2007) 4-gram English language models are employed for the RT04 and the GALE devel- opment systems while a 5-gram language model is trained for the GALE-P2.5 system. The text pre-processing steps include tokenization on the English side and on the Chinese side: automatic word segmentation using a revised version of the Stanford Chinese word segmenter (Tseng et al., 2005), replacement of traditional Chinese characters by their sim- plified equivalent and 2byte to 1byte ASCII character normalization. Sentence pairs with unbalanced sentence length are removed from the training corpora.

5http://projects.ldc.upenn.edu/gale/data/catalog.html,last consulted 2 April 2009

59 Table 3.11: Statistics of the development sets and the test sets for statistical machine trans- lation containing newsgroup (NG), newswire (NW) and broadcast news (BN). Eval07u.BN stands for the unsequestered BN portion of Eval07 for speech translation. Confusion net- work (CN) is used to represent multiple translation options of a target phrase in Eval07 test set.

Set Sentence Document/Show Reference Genre RT04-dev 272 4 1 BN RT04-eval 522 3 1 BN,BC MT03(dev) 919 100 4 NW MT06(test) 1664 79 4 NG,NW,BN Eval07u.BN(ASRtest) 314 32 CN BN

Decoding

Decoding is performed by constructing a translation lattice which contains all possible matched bilingual phrase pairs of an input source sentence. In the GALE-P2.5 system, part-of-speech based word re-ordering (Rottmann and Vogel, 2007) is performed on an input sentence to produce an input source lattice before building the translation lattice. Search is then performed on this lattice using our STTK beam-search decoder (Vogel et al., 2003). The word re-ordering window is set to 4 and 3 for the RT04 system and the medium-scale system respectively while monotonic decoding is applied for the GALE- P2.5 system since word re-ordering is already applied in the source lattices. An optimal path is returned with a maximum translation score consisting of a log-linear combination of feature functions including a language model probability, distortion penalty, word-count penalty, phrase count and phrase-alignment scores.

Development/Evaluation Set

Table 3.11 shows the development sets and the evaluation sets. The RT04 sets are used to tune and evaluate the RT04 system while MT03 and MT06 sets are used to tune and

60 Table 3.12: Configurations of the baseline statistical machine translation systems. RT04 GALE-dev GALE-P2.5 Scale small medium large Translation model phrase-based (online) phrase-based phrase-based Wordalignment - IBMModel4 IBMModel4 Target language model 4-gram 4-gram 5-gram Distortion model distance-based distance-based part-of-speech Input format sentence sentence word lattice #featurefunctions 10 8 9

evaluate the GALE systems respectively. The Mandarin RT04 sets are originally designed for the RT04 broadcast news evaluation for automatic speech recognition. MT03 con- tains newswire documents while MT06 comprises other genre such as newsgroup and broadcast news. The development sets are used to tune the weights of the feature func- tions via minimum error rate training. For end-to-end speech translation, we use the unsequestered broadcast news portion of Mandarin Eval07 for evaluation. The English translation reference of Eval07 employs a confusion-network-like representation to en- capsulate multiple translation options of an English phrase. For instance, the following sentence “Over 50 car models with domestic brands have reduced their prices [at the same time//collectively//all].” encapsulates different translation options compactly using a confusion network. We perform significance testing using a bootstrapping approach (Zhang and Vogel, 2004) that repeatedly draws random subsets of translated sentences from a baseline system so that the score of a performance metric is computed for each random subset. As a result, an empirical distribution over the scores is formed and a 95% confidence interval with respect to the baseline system is constructed. We report statistical significance when the score of an alternative approach exceeds the baseline confidence interval. Table 3.12 shows the summary of the baseline RT04, GALE development and GALE- P2.5 systems.

61 3.5 Summary

We have described our baseline Mandarin and Arabic transcription systems and the base- line Chinese-to-English statistical machine translation systems. Our systems employ the current state-of-the-art techniques for training and decoding. In the following chapters, we describe our unified topic adaptation framework for the baseline transcription and transla- tion systems.

62 Chapter 4

Monolingual N-gram LSA Based Language Model Adaptation

We present unsupervised language model adaptation based on latent semantic analysis. Firstly, we propose a topic caching approach that caches topic counts of a word context in contrast to traditional word caching that caches word counts. We introduce incremen- tal marginal adaptation for lattice rescoring that is analogous to full marginal adaptation on a background language model. We propose latent Dirichlet-Tree allocation for model- ing topic correlation to generalize latent Dirichlet allocation for latent semantic analysis. Lastly, we extend latent Dirichlet-Tree allocation to its N-gram version to relax the “bag- of-word” assumption and address the model training and smoothing issues. We evaluate our approaches for unsupervised language model adaptation on large scale GALE evalua- tions on Mandarin and Arabic.

4.1 Topic Caching

Cache-based language model (Kuhn and Mori, 1990; Clarkson and Robinson, 1997) en- ables rapid language model adaptation by capturing the dynamics of natural languages via caching the frequency of the recently occurred words. This approach is computationally

63 efficient since only word counts are needed to store and manipulate online. It offers sig- nificant improvement in word perplexity for supervised language model adaptation. How- ever, caching the word counts from speech recognition hypotheses is not appropriate for unsupervised language model adaptation since the probability of a mis-recognized word is increased after word caching. We present a topic caching approach via latent Dirich- let allocation (Tam and Schultz, 2005) that employs a Dirichlet prior. The Dirichlet prior can be interpreted as a dynamic cache to store the fractional topic counts in the E-steps described in Section 2.2.3: E-steps:

γk = αk + q(zi = k) (4.1) Xi∈h Eq[log θk] q(zi = k) ∝ e · p(wi|k) (4.2) where αk denotes the prior pseudo-count for topic k and q(zi = k) is the fractional topic count of the i-th word in the context h. Equation 4.1 means caching the fractional topic counts from a word context. After topic caching, we generate an adapted unigram language model via linear interpolation as follows:

K

plda(w|h) = p(w|k) · p(k|θ) · q(θ|h; {γk}) (4.3) Zθ k=1 K X

= p(w|k) · θk · q(θ|h; {γk}) (4.4) k=1 Zθ XK

= p(w|k) · Eq[θk|h] (4.5) Xk=1 γk where Eq[θk|h] = K (k =1...K) (4.6) k=1 γk q(θ|h; {γk}) denotes a variational DirichletP posterior over the topic mixture weights θ with K topics. Given the word hypotheses decoded from past speech utterances, unsupervised language model adaptation can be performed as follows:

1. Cache the fractional topic counts.

64 2. Re-compute an adapted unigram model.

3. Update the topic cache table {αk} in the Dirichlet prior as:

αk ← λ · αk + ci · q(zi = k) (4.7) Xi∈h where λ ∈ [0, 1] is a scaling factor of the history which can be tuned on a heldout

set, and ci denotes the confidence score of the i-th word from the context buffer h.

4. Perform (log) linear interpolation with a background language model and then de- code the next utterance.

5. If a topic boundary (e.g. at the end of an audio show) is given, clear the context

buffer and reset the cache table to the background {αk}.

Discounting the prior counts with λ in equation 4.7 is desirable since an audio show, such as in broadcast news, can contain multiple independent stories and the information from the past utterances crossing an unknown topic boundary are irrelevant to the current topic. For example, the following sentences extracted from CCTV audio news are marked with topic boundaries at the sentence level using latent Dirichlet allocation:

okay let ’s break in to a piece of news that we just received

according to a report by south korean ytn cable tv two trains carrying flammable materi- als collided and exploded at the ryongchon train station in p yong- an- buk- doin north korea at 1 pm on the 22 nd local time which was 2 pm beijing time

the explosion might have killed several thousands of people and injured 3000 others

okay let ’s continue our focus on financial news

Moreover, the automatic topic assignments can vary within a sentence:

65 Table 4.1: Sample latent topics extracted from latent Dirichlet allocation. Latent topic Top words (translated from Chinese) “economy” development, economy, country, society, world, globe “sport” competition,candidate,rank, sport,result,champion “health” disease, therapy, AIDS, hospital, health, patient, people “technology” company, information, network, system, technology “education” hong kong, education, mainland, student, expert

okay let ’s continue our focus on financial news

The computational complexity of the E-steps is O(TMK) where T denotes the number of iterations in the E-steps and M denotes the number of words in the context buffer. Log-linear interpolation is computationally efficient during decoding since the scores are usually expressed in logarithm and thus the computation only involves few floating point operations.

4.1.1 Experiment

We compared topic caching with word caching for incremental unsupervised language model adaptation on different language model training scenarios from scarce data (1M words) to large data (300M words) drawn from the Chinese Gigaword corpora V1. The word cache-based language model was a unigram language model that dynamically adapted to the past decoded hypotheses using the decaying word counts, and was then interpolated with the background trigram language model. We trained a background trigram language model and latent Dirichlet allocation using the same amount of data on each test scenario. The training corpora for latent Dirichlet allocation were organized into documents where each document was roughly a piece of news story annotated in the corpora. We did not remove function words from training otherwise the unigram probability of function words would be under-estimated. Table 4.1 shows examples of latent topics in latent Dirichlet allocation sorted by the unigram probability p(w|k). The number of latent topics K was

66 800 19 trigram trigram trigram + word caching trigram + word caching trigram + topic caching (LDA) trigram + topic caching (LDA)

700 18

600 17

500 16 Perplexity

400 Character error rate (%) 15

300 14

200 13 1M(2002) 13M(2002) 300M 1M(2002) 13M(2002) 300M # of training words # of training words Figure 4.1: Perplexity (Left) and the character error rate (Right) for topic and word caching on CCTV of the RT04 test set. set to 50 motivated by (Blei et al., 2003). We used the official Mandarin RT04 development set for parameter tuning such as the interpolation weight between the background language model and the dynamic unigram language model generated from latent Dirichlet allocation, and the history scaling factor λ. Optimal weight for the trigram language model was between 0.7−0.9 and the word history scaling factor λ was between 0.3 − 0.4 based on word perplexity of the RT04 development set. We employed our Mandarin RT04 transcription system to decode the CCTV show of the RT04 test set. Since the topic boundary was not given in the test set, the word history buffer was cleared only at the end of the audio file.

4.1.2 Results

Figure 4.1 shows the language model adaptation performance on perplexity and character error rate with different sizes of the training corpora. Although the word caching approach was more effective in reducing perplexity compared to topic caching, degradation in char- acter error rate was obtained in Figure 4.1 (Right) , which corresponded to the results reported in (Clarkson and Robinson, 1998). On the other hand, topic caching reduced the recognition error rate on different training scenarios. When the language model training data are scarce, the recognition performance degrades and the word hypotheses contain

67 LM (300M) CCTV NTDTV RFA ALL Rel. ∆ background 13.1% 17.5 35.7 21.5 - 50topic 13.1 17.6 35.2 21.4 0.5 100topic 13.2 17.3 34.6 21.1* 1.9 200 topic 12.7 17.1 34.9 21.0* 2.3 300topic 13.2 17.0 34.7 21.1* 1.9 400topic 12.9 17.3 34.7 21.1* 1.9 500topic 13.1 17.2 34.4 21.0* 2.3

Table 4.2: Character error rate (%) on the RT04 test set with different number of topics in latent Dirichlet allocation using the GALE-P1-dryrun Mandarin transcription system. Overall relative reduction (Rel. ∆) compared to the unadapted baseline is reported. * denotes that the approach is statistically significant at ≤ 5% significance level compared to the unadapted baseline. more recognition errors. However, topic caching still achieved improvement in the recog- nition performance compared to the unadapted baseline. The observation suggests that topic caching is more robust against speech recognition errors than word caching, making it suitable for unsupervised language model adaptation.

4.1.3 Optimal Number of Topics

The next question is the optimal number of topics for latent Dirichlet allocation. Empir- ically, we varied the number of topics for LSA training from 50 to 500 and performed unsupervised language model adaptation using the GALE-P1-dryrun Mandarin transcrip- tion system. All models were trained on the same training corpora with 300M Chinese words. We compared the recognition performance on the full RT04 test set having three audio shows: CCTV, NTDTV and RFA. Table 4.2 shows the character error rate with different number of topics for topic caching. Better recognition performance was achieved by increasing the number of topics. The optimal number of topics was 200 in terms of optimal recognition performance and

68 minimal model size. The error reduction was statistically significant at a ≤ 5% signifi- cance level compared to the unadapted baseline. The setting of K = 200 is employed for the rest of the experiments.

4.2 Latent Dirichlet-Tree Allocation

One assumption in latent Dirichlet allocation is the use of a Dirichlet prior, which asserts that the topics are independent. In other words, knowing the proportion of one topic does not provide any information about the proportion of another topic. In reality, the assumption may not be true since topics may be correlated. For instance, news articles in a newspaper website are usually organized into the main-topic and sub-topic hierarchy. Intuitively, it would be advantageous to model the topic correlation, which motivates the extension of latent Dirichlet allocation into latent Dirichlet-Tree allocation (LDTA) (Tam and Schultz, 2007b). Latent Dirichlet-Tree allocation captures the topic correlation via a structural Dirichlet-Tree prior (Minka, 1999; III, 1991). In fact, a Dirichlet prior is a special case of a Dirichlet-Tree prior since a Dirichlet distribution can be visualized as a flat tree with depth one. Sampling from a Dirichlet distribution becomes labeling the branches under a node with probability values summing to unity. In general, a Dirichlet- Tree can have different depth and structure. Figure 4.2 illustrates a depth-two Dirichlet- Tree where the root node is a Dirichlet distribution with more than two branches while the Dirichlet nodes at the bottom only allow binary branches. Given a Dirichlet-Tree of a fixed structure parametrized by a set of Dirichlet parameters N {αjc}, a document w1 is generated as follows:

1. Sample a vector of branch probabilities bj ∼ Dirichlet(·; {αjc}) for each node

j = 1...J where {αjc} denotes the parameter of a Dirichlet distribution at node j, that is, the pseudo-counts of the outgoing branch c at node j.

2. Compute the topic distribution as in equation 4.8,

δjc(k) θk = bjc (4.8) jc Y 69 j=1 j=1 Dir(.) Dir(.)

0.1+0.2 0.3+0.4

j=2 j=3 j=J j=2 j=3 Dir(.) Dir(.) Dir(.) Dir(.) Dir(.) propagate

Latent topics topic 1 topic 2 topic 3topic 4 topic K−1 topic K q(z=k) 0.1 0.2 0.3 0.4

Figure 4.2: Left: Dirichlet-Tree prior of depth 2: Each internal node is represented by a Dirichlet distribution over the branches. Right: Variational E-step as bottom-up propaga- tion and summation of fractional topic counts.

where δjc(k) is an indicator function which sets to unity when the c-th branch of the j-th node leads to the leaf node of topic k and zero otherwise. The k-th topic weight

θk is computed as the product of sampled branch probabilities from the root node to the leaf node corresponding to topic k.

N 3. For each word wi in a document w1 ,

• Sample a latent topic index zi from Multinomial(θ)

• Sample wi from p(w|zi).

N The joint distribution of the latent variables (that is, the topic sequence z1 and the Dirich- N let nodes over their child branches bj) and an observed document w1 can be written as equation 4.9,

N N N J J p(w1 , z1 , b1 ) = p(b1 |{αjc}) p(wi|zi) · θzi (4.9) i=1 Y where

J J αjc−1 p(b1 |{αjc}) = Dirichlet(bj; {αjc}) ∝ bjc (4.10) j=1 jc Y Y 70 Similar to training latent Dirichlet allocation, we apply variational Bayes to optimize the lower bound of the marginalized document likelihood using the Jensen’s inequality (equa- tion 4.11):

N N J N N J p(w1 , z1 , b1 ;Λ) log p(w1 ;Λ) = log q(z1 , b1 ;Γ) · N J (4.11) bJ q(z1 , b1 ;Γ) 1 zN Z X1 N N J N J p(w1 , z1 , b1 ;Λ) ≥ q(z1 , b1 ;Γ) · log N J (4.12) bJ q(z1 , b1 ;Γ) 1 zN Z X1 N = Q(w1 ;Λ, Γ) (4.13) where

N N J N p(w1 , z1 , b1 ;Λ) Q(w1 ;Λ, Γ) = Eq[log N J ] (4.14) q(z1 , b1 ;Γ) N J J N N p(z1 |b1 ) p(b1 ; {αj}) = Eq[log p(w1 |z1 )] + Eq[log N ]+ Eq[log J ] (4.15) q(z1 ) q(b1 ; {γj})

N J N J q(z1 , b1 ;Γ) = i=1 q(zi) · j=1 q(bj) is a factorizable variational posterior distribution over the latent variablesQ parametrizedQ by Γ which are determined in the E-steps. Λ are the model parameters for the Dirichlet tree {αjc} and the topic-dependent unigram language model {p(w|k)}. The Dirichlet-Tree posterior has the same form as the Dirichlet-Tree N prior given the topic sequence z1 since

J N N J J p(b1 |z1 ) ∝ p(z1 |b1 ) · p(b1 ; {αjc}) (4.16) N δjc(zi) αjc−1 ∝ bjc · bjc (4.17) i=1 jc ! jc Y Y N Y (αjc+ i=1 δjc(zi))−1 = bjc (4.18) P jc Y J ′ = Dirichlet(bj; {γjc}) (4.19) j=1 Y

Therefore, the conjugate property suggests that the posterior branch count γjc can be com- puted by accumulating the expected branch counts from the current observations. Due to

71 the same graphical structure, the E-steps of latent Dirichlet-Tree allocation is similar to latent Dirichlet allocation: E-steps:

N

γjc = αjc + Eq[δjc(zi)] (4.20) i=1 XN K

= αjc + qik · δjc(k) (4.21) i=1 X Xk=1 Eq [log θk;{γjc}] qik ∝ p(wi|k) · e (4.22) where

Eq[log θk] = δjc(k)Eq[log bjc] (4.23) jc X

= δjc(k) Ψ(γjc) − Ψ( γjc) (4.24) jc c ! X X

N where qik denotes q(zi = k|w1 ) meaning the variational topic posterior of word wi. Equa- tion 4.20 and equation 4.22 are executed iteratively until convergence is reached. Equa- tion 4.20 can be implemented as propagation and summation of fractional topic counts qik from the leaf nodes to the root node in a bottom-up fashion as shown in Figure 4.2 (Right). M-step:

N

pˆ(w|k) ∝ qik · δ(wi,w) ∝ Ck(w) (4.25) i=1 X The re-estimation formula for {p(w|k)} is the weighted relative word frequency in equa- tion 4.25 where δ(wi,w) denotes a Kronecker Delta function. The {αjc} parameters can be re-estimated with iterative methods such as Newton-Raphson or simple gradient as- cent procedure. Appendix B provides a full derivation based on variational Expectation- Maximization algorithm.

72 Table 4.3: Sample contiguous fragment of latent topics extracted from latent Dirichlet- Tree allocation.

Latent topic index Top words (translated from Chinese) “topic-61” education, student, school, teacher, learning “topic-62” university, expert, high-level, education, training “topic-63” employment, expert, labor, work, career “topic-64” research, china, science, technology, scientist “topic-65” gene, human, clone, research, biology “topic-66” research, discover, cell, gene, treatment “topic-67” transplant, surgery, patient, liver, hospital “topic-68” information, network, service, web, client “topic-69” system, computer, technology, computer, chip, software

4.2.1 Experiment

We compared latent Dirichlet-Tree allocation with latent Dirichlet allocation via unsuper- vised marginal adaptation. For rapid benchmarking, we first employed the small-scale Mandarin RT04 transcription system for experiments followed by a large-scale evalua- tion using the Mandarin GALE-P1 transcription system. The LSA marginals were com- puted separately for each approach at the show level on the RT04 test set. Then the LSA marginals were employed for marginal adaptation. For latent Dirichlet-Tree allocation, a balanced binary tree was employed. Table 4.3 shows the correlated topics extracted via latent Dirichlet-Tree allocation. We observe contiguous fragments of correlated topics corresponding to the leaf nodes of the tree from a left to right fashion. Topics 61–63 are closely related to a general topic “education” and topics 68–69 are closely related to a general topic “information technology”. The results suggest that the tree structure enforces proximity constraint over the topics.

73 -1.1e+08

-1.15e+08

-1.2e+08

-1.25e+08 Training log-likelihood

-1.3e+08

-1.35e+08 0 10 20 30 40 50 60 70 80 90 100 # of training iterations LDA (200 topics) LDA (400 topics) LDTA (200 topics) LDTA (400 topics)

Figure 4.3: Training log-likelihood of latent Dirichlet allocation (LDA) and latent Dirichlet-Tree allocation (LDTA) using the Xinhua News 2002 corpora (13M words).

4.2.2 Training Convergence

Figure 4.3 shows the training convergence of latent Dirichlet allocation and latent Dirichlet- Tree allocation in terms of the training log likelihood using 200 and 400 topics. Both training approaches started with the same model p(w|k) that were initialized with uniform distributions while their prior distributions were initialized randomly. Latent Dirichlet- Tree allocation converged significantly faster than latent Dirichlet allocation in terms of the number of training iterations. This effect was more significant when the number of topics increased to 400. The rapid convergence is attributed to the structured Dirichlet- Tree prior that restricts the model space compared to the unstructured Dirichlet prior. In other words, an observed topic triggers its correlated topics via the tree structure while the topic independence assumption in the Dirichlet prior lacks this effect. We conclude that la- tent Dirichlet-Tree allocation is useful because it does not suffer from model initialization issue.

74 -1.14e+08

-1.16e+08

-1.18e+08

-1.2e+08

-1.22e+08

-1.24e+08

-1.26e+08 Training log-likelihood -1.28e+08

-1.3e+08

-1.32e+08

-1.34e+08 0 10 20 30 40 50 60 70 80 90 100 # of training iterations branch=2 branch=4 branch=20 branch=3 branch=10 branch=200 (LDA)

Figure 4.4: Training log-likelihood of latent Dirichlet-Tree allocation with different num- ber of branches in a Dirichlet node using the Xinhua News 2002 corpora (13M words). 4.2.3 Effect of Dirichlet-Tree Structure

Figure 4.4 illustrates the effect of the tree structure in terms of the training likelihood by varying the number of branches in a Dirichlet node. Results show that the training con- vergence is optimal when a balanced binary tree is employed. As the number of branches increases, the independence assumption among topics becomes stronger and thus slow- ing down the training convergence. In an extreme case in latent Dirichlet allocation, the training convergence is the slowest.

4.2.4 Results

Table 4.4 shows the word perplexity and the character error rate after LSA marginal adap- tation using a small-scale Mandarin RT04 transcription system trained on the 13M cor- pora. Latent Dirichlet-Tree allocation reduces the overall perplexity and character error rate relatively by 7–12% and 2% respectively compared to latent Dirichlet allocation, and by 10%–17.5% and 4.0% compared to the unadapted 4-gram language model. Latent Dirichlet-Tree allocation performs better than latent Dirichlet allocation and the unadapted

75 Table 4.4: Marginal adaptation results on character error rate (word perplexity) on the RT04 test set using the small-scale Mandarin RT04 ASR system (13M). Latent Dirichlet allocation (LDA) and latent Dirichlet-Tree allocation (LDTA) were compared. Overall relative reduction (Rel. ∆) compared to the unadapted baseline is reported. * denotes that the approach is statistically significant at ≤ 5% significance level compared to the unadapted baseline. LM (13M) CCTV NTDTV RFA ALL Rel. ∆ background 15.6%(748) 22.1(1718) 40.0(3655) 25.3 - +LDA(100iter) 15.1(695) 21.7(1669) 39.6(3451) 24.8* 2.0(5.6) +LDTA(100iter) 14.4(629) 21.5(1547) 38.9(3015) 24.3* 4.0 (17.5) baseline at a ≤ 5% significance level. Table 4.5 shows the adaptation performance for a large-scale evaluation using the Man- darin GALE-P1 transcription system trained on the 800M corpora. Due to the limita- tion of computation resources, we performed only 20 and 50 training iterations for latent Dirichlet-Tree allocation and latent Dirichlet allocation respectively. Due to the slow train- ing convergence, latent Dirichlet allocation required more training iterations before yield- ing an acceptable adaptation performance. Latent Dirichlet-Tree allocation yielded 5% relative perplexity reduction compared to latent Dirichlet allocation with no degradation in the overall character error rate. Compared to the unadapted baseline, latent Dirichlet- Tree allocation reduced the relative perplexity and the character error rate by 8.9%–14.5% and 2.5% respectively, which were statistically significantly at a ≤ 5% significance level. Therefore, we use latent Dirichlet-Tree allocation for the rest of our experiments due to its stable performance in terms of training convergence and language model adaptation.

4.3 Incremental Marginal Adaptation

Marginal adaptation is useful for integrating an in-domain knowledge via latent semantic marginals (Federico, 2002; Tam and Schultz, 2006). Incremental marginal adaptation for

76 Table 4.5: Marginal adaptation results on character error rates (word perplexity) on the RT04 test set using the Mandarin GALE-P1 ASR system (800M). Latent Dirichlet alloca- tion (LDA) and latent Dirichlet-Tree allocation (LDTA) were compared. Overall relative reduction (Rel. ∆) compared to the unadapted baseline is reported. * denotes that the approach is statistically significant at ≤ 5% significance level compared to the unadapted baseline. LM (800M) CCTV NTDTV RFA ALL Rel. ∆ background 8.3%(359) 14.4(868) 26.3(778) 15.9 - +LDA(50iter) 8.1(332) 14.0(834) 25.6(703) 15.5* 2.5(9.6) +LDTA(20iter) 8.3(313) 14.2(791) 25.3(665) 15.5* 2.5(14.5) decoding is computationally expensive due to the computation of the normalization factor for all N-gram entries in a background language model. Therefore, marginal adaptation is usually applied for domain adaptation where the background language model is adapted offline. We propose an incremental marginal adaptation approach for lattice rescoring. The cost of marginal adaptation for lattice rescoring is computationally inexpensive since only a few outgoing word links actually emerge from a context node in a lattice. Thus, computing the normalization factor can be done efficiently. The adapted language model scores for each word link (i, j) is analogous to equation 2.71 for full marginal adaptation:

α(w ) · lm (i, j) lm i, j ij bg Mass i (4.26) a( ) = ′ · ( ) j′∈Out(i) α(wij′ ) · lmbg(i, j ) ′ ′ where Mass(i) = P lmbg(i, j )= lma(i, j ) (4.27) ′ ′ j ∈XOut(i) j ∈XOut(i) p (w ) ǫ and α(w ) = lda ij (4.28) ij p (w )  bg ij  where wij is the word label associated to the link (i, j). Out(i) denotes a set of links from node i. Mass(i) is introduced to ensure that the total probability mass from node i is conserved after adaptation, which is similar in spirit to fast marginal adaptation in equation 2.73.

77 Dev07 Eval06 BN BC ALL Rel. ∆ BN BC ALL Rel. ∆ background 7.5% 18.8 13.9 - 15.1 26.8 20.4 - +LSA decode 7.4 18.6* 13.8* 1.4 14.7* 26.5* 20.0* 2.0 +LSA decode & rescore 7.3* 18.1* 13.4* 4.3 14.4* 26.2* 19.7* 3.4

Table 4.6: Character error rate (%) after applying LSA for decoding (denoted as LSA decode) using the GALE-P2 Mandarin transcription system followed by incremental marginal adaptation (rescore) for lattice rescoring after cross-adapting with the IBM Man- darin transcription system on Dev07 and Eval06 test sets. Overall relative reduction (Rel. ∆) compared to the unadapted baseline is reported. * denotes the approach is significantly better than the unadapted baseline at ≤5% level of significance.

4.3.1 Experiment

The language model adaptation experiment was performed during the GALE-P2 evalua- tion for Mandarin. Firstly, we cross-adapted the GALE-P2 Mandarin transcription system using the word hypotheses from the IBM Mandarin transcription system. Then we applied topic caching via latent Dirichlet-Tree allocation during decoding to generate word lattices followed by incremental LSA-marginal adaptation.

4.3.2 Results

Table 4.6 shows the recognition performance after applying topic caching (LSA decode) followed by incremental marginal adaptation (LSA rescore). Topic caching reduced the character error rate relatively by 1.4% and 2.0% on Dev07 and Eval06 respectively com- pared to the unadapted baseline. Incremental marginal adaptation further reduced the char- acter error rate relatively by 2.0% and 1.5% on Dev07 and Eval06 respectively compared to topic caching. The total relative reductions after applying both techniques were 4.3% and 3.4% compared to the unadapted baseline. All reductions were statistically significant at a ≤ 5% significance level.

78 Prior distribution over topic mixture weights θ

Latent topics z1 z2 zN

Observed words w1 w2 wN

Figure 4.5: Graphical model representation of a trigram LSA. 4.4 N-gram Latent Dirichlet-Tree Allocation

One issue in latent semantic analysis is the “bag-of-word” assumption that ignores word ordering. For document classification, word ordering may not be important. But for lan- guage modeling, word ordering is crucial since a trigram language model usually outper- forms a unigram language model for word prediction. In Chapter 2, we describe a bigram topic model to relax this assumption by connecting adjacent words in a document to- gether to form a Markov chain in latent Dirichlet allocation. We present the N-gram latent Dirichlet-Tree allocation (Tam and Schultz, 2008) based on the bigram topic model (Wal- lach, 2006) and latent Dirichlet-Tree allocation (Tam and Schultz, 2007b). The graphical model representation of a trigram LSA is shown in Figure 4.5. The original formulation of the bigram topic model does not address two important issues: efficient model training and smoothing. We propose an efficient training algorithm for the N-gram latent Dirichlet-Tree allocation via variational Expectation-Maximization algorithm and model bootstrapping which are scalable to large data sets in Section 4.4.1. We formulate the fractional Kneser-Ney smoothing 1 for model smoothing. Our formu- lation generalizes the original Kneser-Ney formulation (Kneser and Ney, 1995) which supports only integral counts in Section 4.4.2. We apply the N-gram latent Dirichlet-Tree

1This method was briefly mentioned in (Xu et al., 2003) without detail in a different context. (Bisani and Ney, 2008) formulated this method independently in a grapheme-to-phoneme setting.

79 allocation for the large-scale GALE evaluation for automatic speech recognition in this chapter, and in statistical machine translation in Chapter 5.

4.4.1 Model Training

Gibbs sampling is employed in the original bigram topic model. Despite its simplicity, it can be slow and inefficient since it usually requires hundreds of sampling iterations for convergence. We present a variational Bayes approach for model training. For simplicity, we only show the formulation for bigram LSA, but it is straightforward to generalize to N N N-gram LSA. The joint likelihood of a document w1 , the latent topic sequence z1 and θ using bigram LSA can be written as follows:

N N N p(w1 , z1 , θ) = p(θ) · p(zi|θ) · p(wi|wi−1, zi) (4.29) i=1 Y

N N By introducing a factorizable variational posterior distribution q(z1 , θ;Γ) = q(θ)· i=1 q(zi) over the latent variables and applying the Jensen’s inequality, the lower boundQ of the marginalized document likelihood can be derived as follows:

p(wN , zN , θ;Λ) log p(wN ;Λ, Γ) = log q(zN , θ;Γ) · 1 1 1 1 q(zN , θ;Γ) θ z ...z 1 Z 1XN p(wN , zN , θ;Λ) ≥ q(zN , θ;Γ) · log 1 1 1 q(zN , θ;Γ) θ z ...z 1 Z 1XN p(θ) N p(z |θ) N = E [log ]+ E [log i ]+ E [log p(w |w , z )] q q(θ) q q(z ) q i i−1 i i=1 i i=1 N X X = Q(w1 ;Λ, Γ)

N where the expectation is taken using the variational posterior q(z1 , θ). For the E-steps, we compute the partial derivative of the auxiliary function Q(·) with respect to q(zi) and the parameter γjc in the Dirichlet-Tree posterior q(θ). Setting the derivatives to zero yields:

80 E-steps:

Eq[log θk;{γjc}] q(zi = k) ∝ p(wi|wi−1,k) · e for k =1..K (4.30) N

γjc = αjc + Eq[δjc(zi)] (4.31) i=1 XN K

= αjc + q(zi = k) · δjc(k) (4.32) i=1 X Xk=1 where Eq[log θk] = δjc(k) · Eq[log bjc] (4.33) jc X

= δjc(k) Ψ(γjc) − Ψ( γjc) (4.34) jc c ! X X where equation 4.31 is motivated from the conjugate property that the Dirichlet-Tree pos- N terior given the topic sequence z1 has the same form as the Dirichlet-Tree prior, which has been introduced in Section 4.2. Equation 4.30 and equation 4.31 are applied itera- tively until convergence is reached. For the M-step, we compute the partial derivative of the auxiliary function Q(·) over all training documents d with respect to a topic bigram probability p(v|u,k) and set it to zero: M-step (unsmoothed):

Nd

p(v|u,k) ∝ q(zi = k|d) · δ(wi−1,u)δ(wi, v) (4.35) i=1 Xd X C (u, v|k) = d d (4.36) V C u, v′ k dP v′=1 d( | ) C(u, v|k) = P P (4.37) V ′ v′=1 C(u, v |k) where Nd denote the number ofP words in document d and δ(wi, v) is a 0 − 1 Kronecker

Delta function to test if the i-th word in document d is vocabulary v. Cd(u, v|k) denotes the fractional counts of a bigram (u, v) belonging to topic k in document d. Intuitively, equation 4.37 simply computes the relative frequency of the bigram (u, v). However, this solution is not practical since the model assigns a zero probability to an unseen bigram.

81 Therefore, bigram LSA should be smoothed properly. One simple approach is to use the Laplace smoothing by adding a small count δ to all the bigrams (Wallach, 2006). How- ever, this approach can lead to worse performance since it will bias the bigram probability towards a uniform distribution when the vocabulary size V gets large. Our approach is to represent p(v|u,k) as a standard backoff language model smoothed by fractional Kneser- Ney smoothing as described in Section 4.4.2. Model initialization is crucial for variational EM. We employ a bootstrapping approach using a well-trained LSA as an initial model for bigram LSA so that p(wi|wi−1,k) is approximated by p(wi|k) in equation 4.30. It saves computation and avoids keeping the full bigram LSA in memory during the Expectation-Maximization training. To make the training procedure more practical, we apply bigram pruning during statistics accumulation in the M-step when the bigram count in a document is less than a threshold, say 0.1. This heuristic is reasonable since only a small portion of topics are “active” to a bigram. With the sparsity, there is no need to store K copies of accumulators for each bigram and thus reducing the memory requirement significantly. For simplicity, the pruned bigram counts are re-assigned to the most likely topic of the current document so that the counts are conserved. For practical implementation, accumulators are saved into a disk in batches for count merging using the SRILM toolkit. In the final step, each topic-dependent language model is smoothed individually using the merged count file.

4.4.2 Fractional Kneser-Ney Smoothing

The state-of-the-art smoothing for a backoff language model is based on the Kneser-Ney smoothing (Kneser and Ney, 1995). The belief of its success is due to the preservation of marginal distributions. However, the original formulation is defined only on integral counts, which is not suitable for bigram LSA using fractional counts. We investigate the fractional Kneser-Ney smoothing as a generalization of the original formulation. The interpolated form using absolute discounting can be expressed as follows:

max{C(u, v) − D, 0} p (v|u) = + λ(u) · p (v) (4.38) KN C(u) KN

82 where D is a discounting factor. In the original formulation, D lies between 0 and 1. But in our formulation, D can be any positive number. Intuitively, D controls the degree of smoothing. If D is set to zero, the model is unsmoothed; If D is too big, bigram counts smaller than D are pruned from the language model. λ(u) ensures that the bigram probability sums to unity. After summing over all possible v on both sides of equation 4.38 and re-arranging terms, λ(u) becomes:

max{C(u, v) − D, 0} 1 = + λ(u) (4.39) C(u) v X max{C(u, v) − D, 0} =⇒ λ(u) = 1 − (4.40) C(u) v X C(u, v) − D = 1 − (4.41) C(u) v:C(Xu,v)>D C(u) − C(u, v)+ D 1 = v:C(u,v)>D v:C(u,v)>D (4.42) P C(u) P C(u, v)+ D 1 = v:C(u,v)≤D v:C(u,v)>D (4.43) P C(u) P C (u, ·)+ D · N (u, ·) = ≤D >D (4.44) C(u)

where C≤D(u, ·) denotes the sum of bigram counts following u and smaller than D.

N>D(u, ·) denotes the number of word types following u with the bigram counts bigger than D.

In the Kneser-Ney smoothing, the lower-order distribution pKN (v) is treated as an unknown parameter that can be estimated using the preservation of marginal distributions:

pˆ(v) = pKN (v|u) · pˆ(u) (4.45) u X where pˆ(v) is the marginal distribution estimated from background training data so that

83 C(v) pˆ(v)= ′ . Therefore, we substitute equation 4.38 into equation 4.45: v′ C(v ) P

max{C(u, v) − D, 0} C(v) = + λ(u) · p (v) · C(u) (4.46) C(u) KN u X  

= max{C(u, v) − D, 0} + pKN (v) C(u) · λ(u)(4.47) u ! u X X C(v) − max{C(u, v) − D, 0} =⇒ p (v) = u (4.48) KN C(u) · λ(u) P u C(v) − C (·, v)+ D · N (·, v) = P>D >D (4.49) u C(u) · λ(u) C (·, v)+ D · N (·, v) = ≤D P >D (using equation 4.44) (4.50) u C≤D(u, ·)+ D · N>D(u, ·) C (·, v)+ D · N (·, v) C ′′ (v) P ≤D >D (4.51) = ′ ′ = ′′ ′ v′ C≤D(·, v )+ D · N>D(·, v ) v′ C (v ) P P Equation 4.51 generalizes the Kneser-Ney smoothing to both integral and fractional counts.

In the original formulation, C≤D(u, ·) equals to zero since each observed bigram count must be at least one by definition with D less than one. As a result, the D term cancels out yielding the original formulation that counts the number of word type preceding v and thus recovering the original formulation. Intuitively, the numerator in equation 4.51 measures the total discounts of the observed bigrams ending at v. In other words, the fractional Kneser-Ney smoothing estimates the lower-order probability distribution using the relative frequency over discounts instead of word counts. With this approach, each topic-dependent language model in bigram LSA can be smoothed using our formulation. In general, the fractional Kneser-Ney smoothing can be applied to a higher-order back- off language model including a factored language model via propagating the discounts to the lower-order distribution for estimation. Figure 4.6 illustrates that discounts are propa- gated from a trigram language model to a bigram language model, and then from a bigram language model to a unigram language model for model estimation in a recursive manner. In this case, the modified bigram and unigram counts, and the corresponding models are

84 ′ ′′ C (v,w) C (w) C(u,v,w)

′ C>D3(u,v,w) − D3 C>D2 (v,w) − D2

p(w|v,u) pKN (w|v) pKN (w)

Figure 4.6: Fractional Kneser-Ney smoothing via propagation of discounts from a trigram language model to a lower-order bigram and a unigram language model. computed as follows:

′ C (v,w) = C≤D3 (·,v,w)+ D3 · N>D(·,v,w) (4.52) C ′′ w C′ ,w D N ′ ,w ( ) = ≤D2 (· )+ 2 · >D2 (· ) (4.53) and

C′(v,w) − D p (w|v) = 2 (4.54) KN C′(v, ·) C ′′ (w) p (w) = (4.55) KN C ′′ (·) where Di denotes the discounting constant at each language model order. As a sanity check, as D3 goes to infinity, trigram LSA should fall back to bigram LSA. In this case, one can tune on the discounting constants so that trigram LSA would perform at least as well as bigram LSA.

4.4.3 Two-stage Unsupervised Language Model Adaptation

Unsupervised language model adaptation is performed by first inferring a topic distri- bution using word hypotheses from the first-pass decoding via variational inference in equation 4.30–4.31. Relative frequency over the branch posterior counts γjc is applied on each Dirichlet node j. The maximum a posteriori topic mixture weight θˆ and the adapted

85 unigram, bigram and trigram LSA models are computed as follows:

δjc(k) γjc θˆk ∝ for k =1...K (4.56) γ ′ jc c′ jc Y   K P pa(w) = p(w|k) · θˆk (4.57) k=1 XK ˆ pa(w|v) = p(w|v,k) · θk (4.58) k=1 XK

pa(w|u, v) = p(w|u,v,k) · θˆk (4.59) Xk=1

The LSA marginals are integrated into a background N-gram language model pbg(w|h) via marginal adaptation as follows:

p (w) ǫ p(1)(w|h) ∝ a · p (w|h) (4.60) a p (w) bg  bg  Marginal adaptation has a close connection to maximum entropy modeling since the marginal constraints can be encoded as unigram features. Intuitively, bigram LSA would be inte- grated in the same fashion by introducing bigram marginal constraints. However, we found that integrating bigram features via marginal adaptation did not offer further improvement compared to only integrating unigram features. Marginal adaptation corresponds to only one iteration of generalized iterative scaling (GIS). Due to millions of bigram features, one GIS iteration may not be sufficient for convergence. On the other hand, simple linear interpolation is effective in our experiment. The final language model adaptation formula is provided using equation 4.57– 4.60 as a two-stage process:

(2) (1) pa (w|h) = λ1 · pa (w|h)+ λ2 · pa(w|v)+ λ3 · pa(w|u, v) (4.61)

where λ1 + λ2 + λ3 = 1 (4.62)

and λi ≥ 0 ∀i (4.63)

where {λi} are tuned to optimize the performance on a development set.

86 4.4.4 Experiment

As motivated from the previous experiments, the number of latent topics were set to 200 for LSA, bigram LSA and trigram LSA unless specified. The discounting factor D for the fractional Kneser-Ney smoothing was set to 0.4 for bigram LSA while the higher-order discounting factor for trigram LSA was set to 2.4 to maintain a reasonably compact model. We did a sanity check that trigram LSA fell back to bigram LSA when the higher-order discounting factor was large. For rapid benchmarking, we first evaluated N-gram LSA using the small-scale Man- darin RT04 transcription system with unsupervised marginal adaptation at a show level similar to the experiments on latent Dirichlet-Tree allocation in Section 4.2.1. However, re-decoding was applied after language model adaptation instead of lattice rescoring. Then, we evaluated the adaptation approach on Mandarin and Arabic languages using the GALE-P3 transcription systems. Topic caching was applied for decoding on the Man- darin system but not on the Arabic system. The word hypotheses from the final decoding passes of a discriminatively trained Initial-Final Mandarin system and a vowelized Ara- bic system were taken for incremental LSA-marginal adaptation and N-gram LSA lattice rescoring as described in Section 4.4.3. Dev08 and Dev07 were employed as the devel- opment set for the Mandarin and Arabic respectively. Statistical significance tests were applied to compare the LSA and the N-gram LSA performance.

4.4.5 RT04 Mandarin Results

Table 4.7 shows the correlated bigram topics sorted by the joint bigram probability p(v|u,k)· p(u|k). Mostof the top bigrams appear eitheras phrases orwords attached with a stopword such as {(’s in English). Table 4.8 shows the language model adaptation results in word perplexity and character error rate. Applying both LSA and bigram LSA yielded consistent improvement over LSA in the range of 6.4%–8.5% relative reduction in perplexity and 2.5% relative reduction in the overall character error rate. The reduction in character error rate was statistically

87 Table 4.7: Correlated bigram topics extracted from bigram LSA using the Xinhua news 2002 corpora (13M).

Latent topic Top bigrams sorted by p(u, v|k) “topic-61” {+¦ (’s student), {+s¸(’s education), s¸+{(education ’s) ¦D+{(school ’s), è#+Á(youth class), £Ÿ+s¸(quality of education) “topic-62” |b+w÷(expert cultivation), L¦+DŸ(university chancellor) ø+Ö(famous), Ä+°D(high-school), {+¦ (’s student) “topic-63” Z+öÌâF(and social security), {+Ò(’s employment), +|Ê(unemployed officer), Ò+« (employment position) “topic-64” {+ÏÄ(’s research), Û+¦V(expert people), + ­(etc area) Ô+b(biological technology), ÏÄ+Ä*(research result) “topic-65” |¡+äO(Human DNA sequence), {+äO(’s DNA) Ô+b(biological technology), vÎ+šûÜ(embryo stem cell) significant at a 0.1% significance level. We compared fractional Kneser-Ney smoothing with Witten-Bell smoothing which also supports fractional counts. The results showed that Kneser-Ney smoothing performed slightly better than Witten-Bell smoothing in word perplexity and character error rate. Increasing the number of topics from 30 to 200 in bigram LSA helped despite model sparsity. We applied extra EM iterations initialized with the bootstrapped bigram LSA but no further performance improvement was observed.

4.4.6 GALE-P3 Results

Mandarin Results

The upper section of Table 4.9 shows the overall LM adaptation results on Mandarin before cross adaptation with the IBM Mandarin system. To illustrate the effect of LSA during de- coding, the first background setting was compared to the second background setting which enabled topic caching for decoding similar to the experiments in Section 4.1.1. Both set-

88 Table 4.8: Character error rate (word perplexity) on the Mandarin RT04 test set. Bigram LSA (biLSA) was applied in addition to LSA. Unless specified, the LSA and bigram LSA models employ 200 topics. Overall relative reduction (Rel. ∆) compared to the unadapted baseline is reported. * denotes that bigram LSA is significantly better than LSA at 0.1% significance level in terms of overall character error rate.

LM (13M) CCTV NTDTV RFA ALL Rel. ∆ background 15.3%(748) 21.8(1718) 39.5(3655) 24.9 - +LSA 14.4(629) 21.5(1547) 38.9(3015) 24.3 2.4 +biLSA (Kneser, K=30) 14.5 (604) 20.7 (1502) 39.0(2736) 24.1 3.2 +biLSA(Witten) 14.1(594) 20.9(1452) 38.3(2628) 23.8 4.4 +biLSA (Kneser) 14.0 (587) 20.8 (1448) 38.2(2586) 23.7* 4.8 tings were then followed by lattice rescoring using LSA marginal adaptation, and linear interpolation of bigram LSA (biLSA) and trigram LSA (triLSA) with the LSA-adapted language model. Applying LSA for decoding helped on all test sets compared to the un- adapted baseline. The reduction in character error rate was additive to those obtained from lattice rescoring. Moreover, applying LSA for lattice rescoring yielded further reduction in character error rate. Bigram-LSA lattice rescoring yielded additional reduction in char- acter error rate compared to LSA rescoring which was statistically significant at a ≤5% significance level in all test cases. Compared to the unadapted baseline, the relative reduc- tion in character error rate after LSA and bigram LSA adaptations were between 5%–6.9% which were statistically significant at a 0.1% significance level in all test cases. Replacing bigram-LSA with trigram LSA yielded similar recognition performance. By combining bigram-LSA and trigram-LSA together via simple score averaging, we achieved slight re- duction in character error rate on Eval07r without degrading the performance on the other sets. The final relative reduction in character rate rate on Eval07r was 7.6% after applying all adaptations. We did not attempt 4-gram LSA rescoring since the additive reduction from trigram LSA was marginal. The results after cross adaptation followed a similar trend with 3.3%–6.7% relative reduction in character error rate compared to the unadapted baseline which were statistically significant at a ≤ 5% significance level.

89 Table 4.9: Lattice rescoring results in character error rate using the Mandarin GALE P3 system. Overall relative reduction (Rel. ∆) compared to the unadapted baseline (back- ground) is reported. * denotes that bigram LSA (biLSA) and trigram LSA (triLSA) are significantly better than LSA at ≤5% level of significance.

Mandarin Dev08 Rel. ∆ Eval07u Rel. ∆ Eval07r Rel. ∆ background 11.6% - 14.0 - 11.8 - +LSA 11.4 1.7 13.9 0.7 11.6 1.7 +biLSA(Kneser) 11.0* 5.2 13.6* 2.9 11.1* 5.9 background(LSA) 11.5 0.9 13.8 1.4 11.7 0.8 +LSA 11.2 3.4 13.7 2.1 11.5 2.5 +biLSA(Witten) 10.9* 6.0 13.3* 5.0 11.1* 5.9 +biLSA (Kneser) 10.8* 6.9 13.3* 5.0 11.0* 6.8 +triLSA (Kneser) 10.8* 6.9 13.3* 5.0 11.1* 5.9 +bi & triLSA (Kneser) 10.8* 6.9 13.3* 5.0 10.9* 7.6 Cross-adaptation with IBM background 9.0 - 10.8 - 9.0 - +LSA 8.9 1.1 10.6 1.9 8.7 3.3 +biLSA (Kneser) 8.6* 4.4 10.4 3.7 8.5* 5.6 background(LSA) 9.0 0.0 10.6 1.9 8.8 2.2 +LSA 8.8 2.2 10.5 2.8 8.6 4.4 +biLSA(Kneser) 8.7 3.3 10.3* 4.6 8.4* 6.7

Table 4.10: Lattice rescoring results in character error rate using the word lattices from the IBM P3 Mandarin system. Overall relative reduction (Rel. ∆) compared to the IBM system is reported.

Mandarin Dev08 Rel. ∆ Eval07u Rel. ∆ Eval07r Rel. ∆ Rescoring the best single IBM system IBM(neuralLM) 6.7 - 8.3 - 6.6 - +biLSA(Kneser) 6.7 0.0 8.1 2.4 6.5 1.5

90 Table 4.11: Lattice rescoring results in word error rate using the Arabic GALE P3 system. Overall relative reduction (Rel. ∆) compared to the unadapted baseline (background) is reported. * denotes that bigram LSA (biLSA) is significantly better than LSA at ≤ 5% level of significance.

Arabic Dev07 Rel. ∆ Dev08 Rel. ∆ Eval07u Rel. ∆ background 14.3% - 16.4 - 22.7 - +LSA 14.2 0.7 16.4 0.0 22.7 0.0 +biLSA(Witten) 13.9 2.8 15.9* 3.0 22.4* 1.3 +biLSA (Kneser) 13.8* 3.5 15.9* 3.0 22.5 0.9 Cross-adaptation with IBM background 11.8 - 13.9 - 20.3 - +LSA 11.8 0.0 13.8 0.7 20.3 0.0 +biLSA (Kneser) 11.7 0.8 13.6* 2.2 20.1 1.0

During the GALE-P3 evaluation, we rescored the word lattices generated from the best single IBM Mandarin system, which was rescored with a neural network language model (Bengio et al., 2003; Schwenk, 2007). The results are shown in Table 4.10. Our adaptation approach yielded further relative reduction in character error rate by 2.4% and 1.5% on Eval07u and Eval07r respectively compared to the IBM baseline system. This implies that neural network language model and bigram LSA may capture complimentary information and thus combining two approaches yielded additional gain. Given a well- tuned state-of-the-art IBM system, the gain from bigram LSA rescoring is reasonable.

Arabic Results

The performance trend was similar on Arabic as shown in Table 4.11. LSA rescoring gave slight reduction in word error rate compared to the unadapted baseline. Moreover, bigram LSA achieved additional reduction in word error rate compared to LSA on the unseen Dev08 which was statistically significant at a 0.1% significance level. After cross adaptation with the IBM Arabic system, bigram LSA yielded 2.2% relative reduction in

91 Table 4.12: Lattice rescoring and system combination results in word error rate using the word lattices from the IBM P3 systems. Overall relative reduction (Rel. ∆) compared to the unadapted baseline is reported.

Arabic                 Dev07   Rel. ∆   Dev08   Rel. ∆   Eval07u   Rel. ∆
The best single IBM system
U-BN                   9.9%    -        11.1    -        14.3      -
UBM                    9.3     -        10.6    -        13.7      -
UBM+biLSA              9.1     2.2      10.6    0.0      13.6      0.7
IBM system combination
UBM+U-BN               9.1     -        10.3    -        13.4      -
UBM+biLSA+U-BN         8.9     2.2      10.3    0.0      13.4      0.0

word error rate compared to the unadapted baseline, which was statistically significant. The reductions on Dev07 and Eval07u were not statistically significant. Similar to the Mandarin evaluation, we rescored the word lattices generated from the best single IBM Arabic system as shown in Table 4.12. Again, our approach achieved further reductions in word error rate of 2.2% and 0.7% relative on Dev07 and Eval07u compared to the IBM baseline system. Finally, lattice combination 2 (Hsiao et al., 2008) was applied to two IBM systems, named UBM and U-BN, with the UBM system rescored with bigram LSA. With bigram LSA rescoring, the word error rate was further reduced by 2.2% relative on Dev07 after lattice combination.

LM Smoothing

Not only is fractional Kneser-Ney smoothing comparable to Witten-Bell smoothing in performance, but the model is also more compact. Table 4.13 shows the compressed size of bigram LSA with different smoothing schemes. Fractional Kneser-Ney smoothing produced a more compact model than Witten-Bell smoothing, with over 35% relative

2Results were obtained from Ian Lane using the tool implemented by Mark Fuhs.

Table 4.13: Comparison of the size of bigram LSA language model using the Witten-Bell and the fractional Kneser-Ney smoothing on Arabic and Chinese.

Scheme        Arabic LM   Chinese LM
Witten-Bell   3.7 GB      3.4 GB
Kneser-Ney    2.4 GB      2.1 GB

reduction in model size for Arabic and Chinese. The reduction in model size is due to the absolute discounting scheme employed in fractional Kneser-Ney smoothing, where bigram counts smaller than the discounting constant D are pruned.
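To make the size reduction concrete, the following Python sketch illustrates the pruning effect described above. It is a simplified illustration under assumed variable names and an assumed discount value, not the thesis implementation: entries whose fractional count does not exceed the discount D receive no discounted mass and need not be stored.

```python
# Illustrative sketch: absolute discounting over fractional bigram counts.
# Bigrams whose fractional count is at most the discount D are dropped,
# which is where the model-size reduction comes from.

def discount_and_prune(fractional_bigram_counts, D=0.4):
    """fractional_bigram_counts: dict mapping (v, w) -> fractional count."""
    kept = {}
    for bigram, count in fractional_bigram_counts.items():
        discounted = count - D
        if discounted > 0.0:          # counts <= D are pruned away
            kept[bigram] = discounted
    return kept

counts = {("of", "the"): 3.2, ("rare", "word"): 0.15, ("the", "topic"): 0.7}
print(discount_and_prune(counts))     # ("rare", "word") is removed
```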

4.4.7 Discussion

The performance breakdown over the broadcast news (BN) and broadcast conversation (BC) genres is shown in Table 4.14 and Table 4.15 for Mandarin and Arabic respectively. It is interesting that bigram LSA generally works better on BN than on BC in terms of relative reduction in character error rate, even though the recognition accuracy is already much better on BN than on BC. One explanation is that BN is similar to the language model training text, which contains a large amount of newspaper text and only a limited amount of audio transcript. On the other hand, BC is more spontaneous in nature, with more repetition and hesitation, which are rare events in newspaper text. Figure 4.7 and Figure 4.10 show the recognition error rates per show on the Mandarin and Arabic development sets respectively. Spikes with positive magnitude indicate that bigram LSA is effective, and negative spikes indicate the opposite. Both figures show that bigram LSA is effective across the majority of the shows. The relative reduction fluctuates for the "easy" shows with error rates below 6%: a small number of misrecognitions can produce a large change in the relative error rate, which explains the fluctuation. On the other hand, there are more shows with positive performance with error rates between 6% and 20% (termed "medium" difficulty). As the error rate of a show increases, the relative reduction decreases. In general, the results suggest that bigram LSA is more effective on the shows with "medium" difficulty

Table 4.14: Lattice rescoring results on broadcast news (BN) and broadcast conversation (BC) in character error rate using the CMU-InterACT Mandarin transcription system for the GALE Phase-3 evaluation. * denotes that bigram LSA (biLSA) and trigram LSA (triLSA) are significantly better than LSA at ≤5% level of significance.

Mandarin                 Dev08 BN  Rel. ∆   Dev08 BC  Rel. ∆   Eval07u BN  Rel. ∆   Eval07u BC  Rel. ∆
background               5.4%      -        17.8      -        5.6         -        24.8        -
+LSA                     5.0       7.4      17.7      0.6      5.6         0.0      24.7        0.4
+biLSA (Kneser)          5.0       7.4      17.0*     4.5      5.4         3.6      24.3        2.0
background (LSA)         5.5       -ve      17.4      2.2      5.7         -ve      24.3        2.0
+LSA                     5.2       3.7      17.1      3.9      5.5         1.8      24.2        2.4
+biLSA (Witten)          5.0       7.4      16.6*     6.7      5.2*        7.1      23.9        3.6
+biLSA (Kneser)          4.9*      9.3      16.7      6.2      5.2*        7.1      23.8*       4.0
+triLSA (Kneser)         5.1       5.6      16.3*     8.4      5.2*        7.1      23.9        3.6
+bi & triLSA (Kneser)    5.0       7.4      16.5*     7.3      5.2*        7.1      23.9        3.6
Cross-adaptation with IBM
background               4.2       -        13.7      -        4.1         -        19.6        -
+LSA                     4.1       2.4      13.5      1.5      3.9         4.9      19.4        1.0
+biLSA                   3.9*      7.1      13.3      2.9      3.6*        12.2     19.3        1.5
background (LSA)         4.2       0.0      13.7      0.0      3.9         4.9      19.4        1.0
+LSA                     4.1       2.4      13.6      0.7      3.8         7.3      19.2        2.0
+biLSA                   3.8*      9.5      13.5      1.5      3.5*        14.6     19.1        2.6

Table 4.15: Lattice rescoring results on broadcast news (BN) and broadcast conversation (BC) in word error rate using the CMU-InterACT Arabic transcription system for the GALE Phase-3 evaluation. * denotes that bigram LSA (biLSA) is significantly better than LSA at ≤5% level of significance.

Arabic               Dev07 BN  Rel. ∆   Dev07 BC  Rel. ∆   Eval07u BN  Rel. ∆   Eval07u BC  Rel. ∆
background           11.6      -        19.4      -        20.8        -        24.7        -
+LSA                 11.5      0.9      19.2      1.0      20.6        1.0      24.8        -ve
+biLSA (Witten)      11.0*     5.2      19.0      2.1      20.4*       1.9      24.6        0.4
+biLSA (Kneser)      11.0*     5.2      18.9      2.6      20.4*       1.9      24.7        0.0
Cross-adaptation with IBM
background           9.9       -        15.1      -        18.7        -        22.0        -
+LSA                 9.8       1.0      15.5      -ve      18.6        0.5      21.9        0.5
+biLSA (Kneser)      9.8       1.0      15.2      -ve      18.2*       2.7      22.1        -ve

Figure 4.7: Relative reduction in character error rate (%) after bigram-LSA rescoring, plotted against the baseline character error rate of each show (BN and BC), on the Mandarin Dev08 development set.

Figure 4.8: Relative reduction in character error rate (%) after bigram-LSA rescoring, plotted against the baseline character error rate of each snippet (BN and BC), on the Mandarin Eval07u (unsequestered) test set.

Figure 4.9: Relative reduction in character error rate (%) after bigram-LSA rescoring, plotted against the baseline character error rate of each snippet (ALL), on the Mandarin Eval07r (retest) test set.

Figure 4.10: Relative reduction in word error rate (%) after bigram-LSA rescoring, plotted against the baseline word error rate of each show (BN and BC), on the Arabic Dev07 development set.

Figure 4.11: Relative reduction in word error rate (%) after bigram-LSA rescoring, plotted against the baseline word error rate of each show (ALL), on the Arabic Dev08 set (unseen).

Figure 4.12: Relative reduction in word error rate (%) after bigram-LSA rescoring, plotted against the baseline word error rate of each snippet (BN and BC), on the Arabic Eval07u (unsequestered) test set.

than the "easy" shows. In addition, the results agree with the observation that most of the recognition errors on the "easy" shows involve function words rather than topical words. A similar trend is observed on the test sets, as shown in Figure 4.8 and Figure 4.9 for Mandarin and Figure 4.11 for Arabic. One exception is the Arabic Eval07u test set, where adaptation is not effective, especially on broadcast conversation. Broadcast conversation is more disfluent and spontaneous in speaking style than broadcast news, and we lack sufficient training data for better modeling. Therefore, part of our future work is to build a better model for broadcast conversation.

4.4.8 Practical Issues

Several points are worth mentioning to make bigram LSA work in practice. First of all, bigram LSA can be too big to fit into memory for a large-scale evaluation. One solution is to limit the vocabulary to the subset of words occurring in the word lattices. For instance, the base vocabulary of our GALE-P3 Arabic transcription system is 737k, while the subset on Dev07 is only 11k. Therefore, it is sufficient to load only the bigram entries

Figure 4.13: Overall performance summary after applying the proposed unsupervised language model adaptation for the large-scale GALE-P3 evaluation on Mandarin and Arabic.

covered by this subset of the vocabulary. Sentence boundaries do not exist in a word lattice. We employ a simple approach to detect sentence boundaries by mapping a silence token <SIL> to a sentence boundary token when the silence duration exceeds a threshold value, say, 0.2 seconds. This prevents bigram LSA from looking up a bigram that crosses a sentence boundary, which would result in a wrong backoff to the unigram model. For stopwords such as auxiliary verbs, articles, conjunctions, sentence boundary markers and punctuation, we do not adapt the N-gram probabilities because predicting stopwords relies mostly on the syntactic context rather than the topical context. The amounts of training data from audio transcripts and newspaper text are unbalanced. Therefore, it is desirable to put a higher weight on the audio transcripts than on the newspaper text by reweighting the N-gram counts in the M-step of the bigram LSA training.
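The following Python sketch illustrates the first two practical steps above: restricting the loaded bigram entries to the lattice vocabulary, and inserting a sentence boundary at long silences. The function names, the boundary token "<s>" and the data structures are assumptions for illustration, not the thesis implementation:

```python
# Illustrative sketch of two practical steps for lattice rescoring with bigram LSA:
# (1) load only bigram entries whose words occur in the lattice vocabulary,
# (2) map a long silence token to a sentence boundary, drop short silences.
# The token "<s>" for the sentence boundary is an assumption for illustration.

SIL, BOUNDARY = "<SIL>", "<s>"

def restrict_to_lattice_vocab(bigram_entries, lattice_vocab):
    """Keep only bigrams (v, w) whose words both occur in the lattice vocabulary."""
    return {bg: p for bg, p in bigram_entries.items()
            if bg[0] in lattice_vocab and bg[1] in lattice_vocab}

def insert_boundaries(tokens, durations, threshold=0.2):
    """Map silences longer than the threshold to sentence boundary tokens."""
    out = []
    for tok, dur in zip(tokens, durations):
        if tok == SIL:
            if dur >= threshold:
                out.append(BOUNDARY)
        else:
            out.append(tok)
    return out

print(insert_boundaries(["I", "<SIL>", "see", "<SIL>", "you"],
                        [0.0, 0.5, 0.0, 0.05, 0.0]))
```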

4.5 Summary

We have investigated a Bayesian latent semantic approach for unsupervised language model adaptation for automatic speech recognition via N-gram latent Dirichlet-Tree allocation. Topic caching is more robust against speech recognition errors than word caching for unsupervised language model adaptation. Incremental marginal adaptation for lattice rescoring is computationally inexpensive and has yielded improvement in recognition performance. Latent Dirichlet-Tree allocation generalizes latent Dirichlet allocation by modeling topic correlation in a tree-based hierarchy, showing rapid training convergence and competitive language model adaptation performance. N-gram latent Dirichlet-Tree allocation has yielded additive gains over latent Dirichlet-Tree allocation by relaxing the "bag-of-word" assumption. Efficient model bootstrapping and smoothing have made this approach applicable to the large-scale evaluation. Figure 4.13 summarizes our contributions towards better recognition performance using our GALE Mandarin and Arabic systems. Empirical results have demonstrated the effectiveness of N-gram latent Dirichlet-Tree allocation for unsupervised language model adaptation, achieving statistically significant reductions in recognition error rates on two different languages.

Chapter 5

Bilingual N-gram LSA Based Adaptation

In Chapter 4, we have shown that monolingual N-gram LSA is effective for unsupervised language model adaptation for automatic speech recognition. In this chapter, we extend this idea to crosslingual adaptation for statistical machine translation.

5.1 Bilingual Latent Semantic Analysis

The success of language model adaptation for automatic speech recognition has motivated applying the same monolingual language model adaptation approach to the target language in statistical machine translation (SMT). Previous adaptation approaches employ an initial translation of an input text (Kim and Khudanpur, 2003; Paulik et al., 2005a). However, this scheme may depend on the quality of the initial translation. Moreover, it requires two decoding passes. We present a novel bilingual LSA framework (Tam et al., 2007b) to perform language model adaptation (Tam et al., 2007a) across languages, enabling adaptation in one language based on an adaptation text in another language in a single decoding pass. Bilingual LSA consists of two models based on latent Dirichlet-Tree allocation: one for each language

trained on parallel document corpora. The key feature of bilingual LSA is a one-to-one topic correspondence between a source and a target LSA model. For instance, say topic 10 of a source LSA model is about politics. Then topic 10 of a target LSA model also corresponds to politics, and so forth. During language model adaptation, we first infer the topic mixture weights of a source text using the source LSA model. We then transfer the inferred mixture weights into LSA (and N-gram LSA) on the target language for language model adaptation. Since bilingual LSA adapts the target language model before translation, it does not require the adaptation text to be pre-translated as in monolingual adaptation. For the same reason, propagation of translation errors can be avoided by using the source text for adaptation. The challenge in bilingual LSA is to enforce a one-to-one topic correspondence. Our proposal is to assume that the topic distributions of a parallel document pair are identical. The assumption is reasonable for parallel document pairs that are faithful translations. In the variational Expectation-Maximization algorithm, this can be easily achieved by sharing the variational topic posteriors between a parallel document pair so that a common latent topic space is enforced in an unsupervised fashion. Since the topic space is language independent, our approach supports topic transfer in multiple language pairs in O(G), where G is the number of languages. The bilingual LSA framework can also be extended to adapt a translation lexicon via marginal adaptation (Tam and Schultz, 2007a) so that the likelihood of a bilingual phrase is sensitive to the topics of an input source text. Thus, a background phrase table is enhanced with additional phrase scores computed using the adapted translation lexicon. The weights for the additional phrase and language model feature functions are then optimized via minimum error rate training. Figure 5.1 illustrates the idea of topic transfer from a source to a target LSA followed by language model adaptation, translation lexicon adaptation and phrase table adaptation.

5.1.1 Bilingual LSA Training

Bilingual LSA training is based on sharing the document-level topic posterior distribution between a parallel document pair. It consists of two stages: In the first stage, we perform

Figure 5.1: Bilingual LSA-based adaptation via transfer of topic distribution from a source language to a target language for speech translation.

monolingual LSA training using the variational Expectation-Maximization algorithm (see equations 4.20–4.25 for latent Dirichlet-Tree allocation) on the source documents in the parallel corpora. We use the source LSA to compute the term e^{E_q[\log \theta_k]} in equation 4.22 for each source document. In the second stage, we apply the same term e^{E_q[\log \theta_k]} to bootstrap the target LSA, which is the key to enforcing a one-to-one topic correspondence. The hyperparameters of the variational Dirichlet posteriors of each node in the Dirichlet-Tree are now shared between the source and target models. Precisely, we apply only equation 4.22 with fixed e^{E_q[\log \theta_k]} in the E-step, and equation 4.25 in the M-step, to estimate {p(w|k)} for the target LSA model. Figure 5.2 illustrates the idea of enforcing a one-to-one topic correspondence of parallel document pairs while bootstrapping a target LSA model from a source LSA model, denoted as bLSA(src,tgt). Since the topic posteriors are pre-computed, the E-step is non-iterative, resulting in rapid LSA training. In short, given a monolingual LSA, we can rapidly bootstrap LSA models of new languages using parallel corpora. Since the topic transfer can be bi-directional, we can perform the bilingual LSA training in a reverse manner, that is, training a target LSA model followed by bootstrapping a source

Figure 5.2: LSA bootstrapping via sharing of variational topic posteriors for parallel documents.

Table 5.1: Size of the parallel training corpora for bilingual LSA training.

Language   Words   Documents
Chinese    41M     96k
English    50M     96k

LSA model, denoted as bLSA(tgt,src). The bilingual LSA training procedure is general and can be applied to different probabilistic models such as latent Dirichlet allocation and probabilistic LSA. In our experiments, we employ latent Dirichlet-Tree allocation due to its stable training convergence and competitive test performance for automatic speech recognition in Chapter 4.
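As a concrete illustration of the non-iterative bootstrapping step, the following Python sketch accumulates target word–topic counts using fixed per-document topic posteriors shared from the source side and normalizes them into p(w|k). This is a deliberately simplified, assumption-laden illustration (document-level posteriors used directly as fractional counts), not the thesis implementation:

```python
import numpy as np

# Minimal sketch of bootstrapping a target LSA model from fixed topic posteriors.
# Assumptions: topic_posteriors[d] is the K-dimensional posterior inferred from
# the source side of parallel document d, and target_docs[d] is the list of
# target-language word ids of that document. A single pass suffices because
# the posteriors are held fixed.

def bootstrap_target_lsa(target_docs, topic_posteriors, vocab_size, num_topics):
    counts = np.zeros((num_topics, vocab_size))
    for doc, post in zip(target_docs, topic_posteriors):
        for w in doc:                      # fractional count for each token
            counts[:, w] += post
    counts += 1e-3                         # small smoothing to avoid zeros
    return counts / counts.sum(axis=1, keepdims=True)   # rows are p(w|k)

docs = [[0, 1, 1, 2], [2, 3, 3]]
posteriors = [np.array([0.9, 0.1]), np.array([0.2, 0.8])]
p_w_given_k = bootstrap_target_lsa(docs, posteriors, vocab_size=4, num_topics=2)
print(p_w_given_k.round(3))
```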

5.1.2 Experiment

Bilingual LSA was trained using the Chinese–English parallel document corpora consisting of the FBIS corpus, Xinhua News, Hong Kong News, Donga News 1 and Sinorama articles. The combined corpora contained 96k parallel documents with 41M Chinese words

1http://{china,english}.donga.com

and 50M English words, as shown in Table 5.1. Bilingual LSA training did not take advantage of the larger parallel corpora used in phrase extraction due to the loss of document boundary information. However, encouraging results were still achieved. The number of latent topics K for bilingual LSA was set to 200, based on our experience with language model adaptation for automatic speech recognition. A balanced binary Dirichlet-tree prior was used. The source and target vocabularies in bilingual LSA were limited to words occurring in the phrase table. The Stanford Chinese word segmenter (Tseng et al., 2005) was applied to segment the Chinese side of the parallel corpora. Monolingual LSA training was first applied on the Chinese side, followed by LSA bootstrapping on the English side. Prior empirical results indicated that the reverse bootstrapping direction resulted in similar performance. For N-gram LSA training, we used the English side of the bilingual LSA to bootstrap the bigram LSA and the trigram LSA as described in Section 4.4.1. Since this is monolingual N-gram LSA training on the English side, the background language model training data were included.

5.1.3 Results

The proposed bilingual LSA training approach successfully enforced a one-to-one topic correspondence and extracted parallel topics, as shown in Table 5.2. The Chinese and English topical words in the table are strongly correlated and many of them are translation pairs, indicating that bilingual LSA works as crosslingual word triggers via topics. Figure 5.3 demonstrates that our proposed approach leads to rapid training convergence due to sharing of the variational Dirichlet posteriors with the Chinese LSA model, compared to a monolingual English LSA starting with the same flat model. On the other hand, monolingual LSA training reached a better training likelihood when more training iterations were applied, which is reasonable since the bootstrapping approach constrains the parameter space so that a one-to-one topic correspondence is satisfied, while the parameter space of monolingual LSA training is unconstrained.

Table 5.2: Parallel topics extracted by bLSA(CH,EN). Top words on the Chinese side are translated into English for illustration purposes.

Topic    Top words sorted by p(w|k)
CH-40    xiaofang 'fire control', jichang 'airport', fuwu 'services', huojing 'fire accident', chuanzhi 'ship', chengke 'passengers'
EN-42    fire, airport, services, department, marine, air, service, passengers

5.2 Crosslingual Language Model Adaptation

Marginal language model adaptation in crosslingual settings can be performed in almost the same manner as in monolingual settings as described in Section 2.5.3 except that a source text is used for adaptation in the crosslingual case. Firstly, we estimate the topic weights of latent Dirichlet-Tree allocation on the source language using equation 5.1.

\hat{\theta}_k^{(CH)} \propto \prod_{jc} \left( \frac{\gamma_{jc}}{\sum_{c'} \gamma_{jc'}} \right)^{\delta_{jc}(k)}          (5.1)

Figure 5.3: Training log likelihood of the English LSA bootstrapped from the Chinese LSA compared to a flat monolingual English LSA, plotted against the number of training iterations.

Then we apply the source topic weights into the target LSA and the N-gram LSA to obtain in-domain marginals as in equation 5.2.

p_{EN}(w) = \sum_{k=1}^{K} \hat{\theta}_k^{(CH)} \cdot p^{(EN)}(w|k)
p_{EN}(w|v) = \sum_{k=1}^{K} \hat{\theta}_k^{(CH)} \cdot p^{(EN)}(w|v,k)          (5.2)
p_{EN}(w|u,v) = \sum_{k=1}^{K} \hat{\theta}_k^{(CH)} \cdot p^{(EN)}(w|u,v,k)

Finally, we apply marginal adaptation to incorporate LSA into a background language model as described in Section 2.5.3. The adapted bigram or trigram LSA are added as additional language model feature functions to compute the posterior probability of a target sentence given a source sentence in statistical machine translation (See Section 3.4.4 for background information).
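A minimal Python sketch of the crosslingual transfer in equation 5.2, assuming the source topic weights have already been inferred; the array and function names are illustrative, not the thesis code:

```python
import numpy as np

# Sketch of equation 5.2: in-domain target marginals from transferred topic weights.
# theta_src:    K-dim topic weights inferred from the source (e.g., Chinese) text.
# p_w_given_k:  (K, V)    unigram LSA  p^{(EN)}(w|k)
# p_w_given_vk: (K, V, V) bigram  LSA  p^{(EN)}(w|v,k)  (dense here for clarity)

def adapt_unigram(theta_src, p_w_given_k):
    return theta_src @ p_w_given_k                        # p_EN(w)

def adapt_bigram(theta_src, p_w_given_vk):
    # sum_k theta_k * p(w|v,k) for every history v
    return np.tensordot(theta_src, p_w_given_vk, axes=1)  # p_EN(w|v), shape (V, V)

K, V = 3, 5
theta = np.array([0.7, 0.2, 0.1])
uni = np.random.dirichlet(np.ones(V), size=K)
bi = np.random.dirichlet(np.ones(V), size=(K, V))
print(adapt_unigram(theta, uni).sum())        # ~1.0
print(adapt_bigram(theta, bi).sum(axis=1))    # each history sums to ~1.0
```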

Table 5.3: Target word perplexity on MT06 using 67M-word (260M-word) parallel corpora for phrase extraction and 500M-word (2.7B-word) English corpora for language model training. Vocabulary size of the target language model is 1.3M (4.1M).

Language model               Perplexity   Rel. ∆
Baseline EN 4-gram (500M)    154          -
bilingual LSA-adapted        127          17.5%
mono LSA-adapted             125          18.8%
Baseline EN 5-gram (2.7B)    147          -
bilingual LSA-adapted        131          10.9%

5.2.1 Experiment

The marginal adaptation approach described in Section 2.5.3 was applied to an English background language model for each source test document. Words on the stopword list2 plus punctuation were filtered out from language model adaptation since the usage of stopwords usually does not depend on topical context.

5.2.2 Results

Table 5.3 shows that the proposed approach effectively reduced the English word perplexity by 17.5% and 10.9% relative for the 4-gram and 5-gram language models used in the medium-scale system and the GALE system respectively, compared to the unadapted language model. Bilingual LSA adaptation still helped even on a huge 5-gram language model trained on a large amount of text. We also performed monolingual language model adaptation using the translated hypotheses from the decoder. This gave slightly better performance than bilingual LSA adaptation but with a two-pass decoding scheme.

2See http://www.dcs.gla.ac.uk/idom/ir resources/linguistic utils/stop words, last consulted 22 October 2008

5.3 Translation Lexicon Adaptation

Not only does the bilingual LSA approach apply to language model adaptation, but it also applies to translation lexicon adaptation. Translation lexicon adaptation is motivated by the observation that a source word can be translated into different target words depending on the topical context. One popular example is the word "bank", which can be related to either a "financial bank" or a "river bank". The adapted translation lexicon can be used to score the phrase pairs depending on the topical context of an input document. Motivated by information theory, we formulate the problem as marginal adaptation under the bilingual LSA framework. The goal is to minimize the Kullback-Leibler divergence between the adapted lexicon p_a(c|e) and the background lexicon p_bg(c|e) such that the lexical marginals computed from the adapted lexicon are equal to the in-domain source marginals p(c|d_ch), which are estimated a priori using the source document d_ch. Thus the objective function to minimize is as in equation 5.3,

\text{Minimize } \sum_e p_a(e) \cdot KL\!\left(p_a(\cdot|e)\,\|\,p_{bg}(\cdot|e)\right)
\text{such that } \forall c: \sum_e p_a(e) \cdot p_a(c|e) = p(c|d_{ch})          (5.3)
\qquad\qquad\;\; \forall e: \sum_c p_a(c|e) = 1

We write the Lagrangian of the objective function, take the derivative with respect to p_a(c|e) and set it to zero (equations 5.4–5.5):

D(p_a(\cdot|\cdot)) = \sum_e \sum_c p_a(e) \cdot p_a(c|e) \cdot \log \frac{p_a(c|e)}{p_{bg}(c|e)}
 - \sum_c \lambda_c \left( \sum_e p_a(e) \cdot p_a(c|e) - p(c|d_{ch}) \right) - \sum_e \mu_e \left( \sum_c p_a(c|e) - 1 \right)          (5.4)

\frac{\partial D(\cdot)}{\partial p_a(c|e)} = p_a(e) \cdot \left( 1 + \log \frac{p_a(c|e)}{p_{bg}(c|e)} \right) - \lambda_c \cdot p_a(e) - \mu_e = 0          (5.5)
\Rightarrow p_a(c|e) \propto p_{bg}(c|e) \cdot e^{\lambda_c} \propto p_{bg}(c|e) \cdot e^{\sum_j \lambda_j \cdot f_j(c,e)}
where

f_j(c,e) = \begin{cases} 1 & \text{if } c = j \\ 0 & \text{otherwise} \end{cases}          (5.6)

f_j(c, e) is a unigram feature function independent of e. Since the solution of the adapted lexicon has an exponential form, the optimization problem is similar to the maximum entropy setting. Therefore, we solve for λ_j using generalized iterative scaling (GIS) (Darroch and Ratcliff, 1972) as in equations 5.7–5.11,

\forall j: \; \lambda_j^{(t+1)} = \lambda_j^{(t)} + \log \frac{\tilde{E}[f_j(c,e)]}{E[f_j(c,e)]}          (5.7)

= \lambda_j^{(t)} + \log \frac{\sum_{c,e} \tilde{p}(c,e|d_{ch}) \cdot f_j(c,e)}{\sum_{c,e} p_a^{(t)}(c|e) \cdot p_a(e) \cdot f_j(c,e)}          (5.8)
= \lambda_j^{(t)} + \log \frac{\sum_{e} \tilde{p}(c=j, e|d_{ch})}{\sum_{e} p_a^{(t)}(c=j|e) \cdot p_a(e)}          (5.9)
= \lambda_j^{(t)} + \log \frac{p(c=j|d_{ch})}{\sum_{e} p_a^{(t)}(c=j|e) \cdot p_a(e)}          (5.10)
\approx \lambda_j^{(t)} + \log \frac{p(c=j|d_{ch})}{\sum_{e} p_a^{(t)}(c=j|e) \cdot p_{blsa}(e|d_{ch})}          (5.11)

where t denotes the GIS iteration index, with p_a^{(0)}(c|e) = p_bg(c|e) and λ_j^{(0)} = 0. p_a(e) is approximated by the English LSA marginals p_blsa(e|d_ch) from the bilingual LSA. Since the range of e in (5.11) is limited to the number of possible translation word pairs (c, e) in the lexicon, computing the denominator term is efficient without evaluating all possible e. We estimate p(c|d_ch) using the smoothed relative word frequency of the source text with the Good-Turing discounting scheme. Since the optimization is convex, a globally optimal solution of the adapted lexicon is guaranteed. Since the source marginals p(c|d_ch) are accurately estimated using the source text, the adapted lexicon is expected to outperform the background lexicon in terms of the conditional likelihood p(C|E), where C = c_1^I and E = e_1^J denote the translation pair of a Chinese and an English sentence respectively.
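A minimal Python sketch of the GIS update in equation 5.11, using illustrative data structures (dictionaries keyed by word strings) and a fixed number of iterations; the names, toy values and stopping rule are assumptions, not the thesis implementation:

```python
import math

# Sketch of translation lexicon adaptation via GIS (equation 5.11).
# p_bg[e][c] : background lexicon p_bg(c|e)
# p_src[c]   : in-domain source marginals p(c|d_ch)
# p_e[e]     : English LSA marginals p_blsa(e|d_ch)

def adapt_lexicon(p_bg, p_src, p_e, iterations=20):
    lam = {c: 0.0 for c in p_src}                    # lambda_j, initialized to 0
    for _ in range(iterations):
        # current adapted lexicon p_a(c|e) ∝ p_bg(c|e) * exp(lambda_c)
        p_a = {}
        for e, row in p_bg.items():
            z = sum(p * math.exp(lam[c]) for c, p in row.items())
            p_a[e] = {c: p * math.exp(lam[c]) / z for c, p in row.items()}
        # GIS update: lambda_c += log( p(c|d_ch) / sum_e p_a(c|e) * p_blsa(e|d_ch) )
        for c in lam:
            denom = sum(p_a[e].get(c, 0.0) * p_e[e] for e in p_bg)
            if denom > 0.0 and p_src[c] > 0.0:
                lam[c] += math.log(p_src[c] / denom)
    return p_a

p_bg = {"bank": {"yinhang": 0.6, "hean": 0.4}, "river": {"hean": 0.7, "heliu": 0.3}}
p_src = {"yinhang": 0.7, "hean": 0.2, "heliu": 0.1}
p_e = {"bank": 0.6, "river": 0.4}
adapted = adapt_lexicon(p_bg, p_src, p_e)
print({e: {c: round(p, 3) for c, p in row.items()} for e, row in adapted.items()})
```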

5.3.1 Phrase Table Adaptation

Ideally, an adapted translation lexicon can be applied directly during phrase extraction. However, this requires extra implementation work in the phrase extraction algorithm. An alternative approach is to take a background phrase table and assume that good phrase

pairs are already captured in the table. With the adapted translation word lexicons, we can score each phrase pair (c_1^I, e_1^J) in the background phrase table similarly to IBM Model 1 (equation 5.12),

p_a(c_1^I | e_1^J) = \prod_{i=1}^{I} \frac{1}{J} \sum_{j=1}^{J} p_a(c_i | e_j)          (5.12)
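For illustration, a small Python sketch of the phrase-pair scoring in equation 5.12 under an adapted lexicon; the lexicon contents below are toy values, not real model scores:

```python
# Sketch of equation 5.12: score a phrase pair with the adapted lexicon,
# in the spirit of IBM Model 1. lexicon[(c, e)] holds p_a(c|e).

def score_phrase_pair(chinese_phrase, english_phrase, lexicon):
    score = 1.0
    J = len(english_phrase)
    for c in chinese_phrase:                       # product over source words
        score *= sum(lexicon.get((c, e), 1e-9) for e in english_phrase) / J
    return score

lexicon = {("hanguo", "korea"): 0.57, ("zhengfu", "government"): 0.8}
print(score_phrase_pair(["hanguo", "zhengfu"], ["korea", "government"], lexicon))
```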

5.3.2 Results

Marginal adaptation resulted in a sharper translation lexicon in which the uncertainty of a word-to-word translation was reduced. With the word context "according to a report by south korean ytn cable tv", for instance, the probability of translating the English word Korea into the related (correct) Chinese translation hanguo was boosted from 0.32 to 0.57, while the probability of the unrelated (incorrect) translation fangwen 'visit' was greatly de-emphasized from 1.8 × 10^{-4} to 8.7 × 10^{-7} after bilingual LSA adaptation. Redistribution of probability mass from unrelated words to related words occurs during translation lexicon adaptation according to the topical context of a source text.

5.4 Text Translation Results

The upper section of Table 5.4 shows the translation performance in BLEU and NIST on MT06 using the medium-scale SMT system. 2% relative improvement in BLEU was

Table 5.4: MT06 evaluation results on BLEU and NIST using 67M-word (260M-word) parallel corpora for phrase extraction and 500M-word (2.7B-word) English corpora for language model training. Vocabulary size of the target language model is 1.3M (4.1M). Four English references are used for scoring. English bigram LSA (biLSA) and trigram LSA (triLSA) are applied. * denotes that bilingual LSA adaptation is significantly better than the unadapted baseline at 95% confidence interval.

Language model                                BLEU (%)   Rel. ∆   NIST    Rel. ∆
Baseline EN 4-gram (500M)                     28.06      -        8.71    -
bilingual LSA-adapted                         28.62      2.0      8.80    1.0
bilingual LSA-adapted lexicon                 28.59      1.9      8.92*   2.4
bilingual LSA-adapted + lexicon               28.91*     3.0      8.97*   3.0
mono LSA-adapted                              28.41      1.2      8.81    1.1
mono LSA-adapted lexicon                      28.72      2.4      8.96*   2.9
mono LSA-adapted + lexicon                    28.97*     3.2      9.00*   3.3
bilingual LSA-adapted + biLSA                 29.08*     3.6      8.99*   3.2
bilingual LSA-adapted + triLSA                29.42*     4.8      9.05*   3.9
bilingual LSA-adapted + bi & triLSA           29.42*     4.8      9.08*   4.2
bilingual LSA-adapted + lexicon + triLSA      29.38*     4.7      9.04*   3.8
Baseline EN 5-gram (2.7B)                     31.49      -        9.23    -
bilingual LSA-adapted                         31.94      1.4      9.31    0.9
bilingual LSA-adapted lexicon                 32.03      1.7      9.34    1.2
bilingual LSA-adapted + lexicon               32.09      1.9      9.37    1.5
bilingual LSA-adapted + triLSA                32.13      2.0      9.37    1.5
bilingual LSA-adapted + lexicon + triLSA      32.28      2.5      9.38*   1.6

achieved compared to the unadapted baseline after applying bilingual LSA-based language model adaptation and translation lexicon adaptation separately. When both techniques were applied simultaneously, the gain was additive, giving 3% relative improvement in BLEU compared to the unadapted baseline. The improvement was statistically significant at a 95% confidence interval [27.29%, 28.84%] with respect to the unadapted baseline. The same performance trend in NIST was also observed, with 3% relative improvement compared to the unadapted baseline. The improvement was statistically significant at a 95% confidence interval [8.61, 8.85] with respect to the unadapted baseline. The middle section of Table 5.4 shows that monolingual LSA adaptation using the first-pass translated hypotheses achieved performance similar to bilingual LSA adaptation using a source text. In other words, the source text and the initial MT hypotheses were equally effective for LSA adaptation, since LSA is robust against translation errors in the adaptation text. We conjecture that the quality of translation of topical unigrams is acceptable in the initial translation. In terms of computation, however, bilingual LSA is more elegant and requires only a single decoding pass compared to monolingual LSA. Table 5.5 shows some sample sentences demonstrating some degree of semantic paraphrasing with bilingual LSA, such as people of Denmark versus Danish people, and told versus sighed. We applied bigram and trigram LSA as additional language model feature functions. Table 5.4 shows that bigram LSA yielded additive relative improvements of 1.6% and 2.2% in BLEU and NIST respectively compared to bilingual LSA. Replacing bigram LSA with trigram LSA achieved further relative improvements of 1.2% and 0.7% in BLEU and NIST respectively compared to bigram LSA. Adding bigram LSA and trigram LSA together yielded a slight gain in NIST but equal performance in BLEU. This implies that trigram LSA already covers most of the information from bigram LSA. Incorporating only trigram LSA is sufficient for good performance and avoids the bigram LSA training. The lower section of Table 5.4 shows the translation performance using the GALE-P2.5 SMT system. The performance trend was similar to the medium-scale system. Improvements in BLEU and NIST were observed after applying bilingual LSA-based language model adaptation and translation lexicon adaptation. Additive gain was obtained

Table 5.5: Examples demonstrating some degree of semantic paraphrasing with bilingual LSA.

Sample output 1
Baseline:       To achieve the extensive support from the international community to save this problem, the government of Denmark, and Denmark is very important.
Bilingual LSA:  To achieve the extensive support from the international community to save this problem, the Danish government and people of Denmark is very important.
Reference:      It is extremely important to the Danish government and the Danish people to obtain the broad support of the international community to pass through this difficulty.
Sample output 2
Baseline:       In an interview Hoffman CBS news magazine "60 minutes" ...
Bilingual LSA:  Hoffman told the CBS news magazine "60 minutes" ...
Reference:      Hoffman sighed when doing an interview with America's CBS news magazine "60 minutes"

after applying both techniques together, yielding 1.9% relative improvement in BLEU compared to the unadapted baseline. Adding trigram LSA further improved the BLEU score, with 2.5% relative improvement compared to the unadapted baseline. The gain in BLEU was smaller than on the medium-scale setting, and it was marginally not statistically significant at a 95% confidence interval [30.70, 32.34] with respect to the unadapted baseline. This may be explained by a stronger baseline 5-gram language model with an increased amount of training text and a better word reordering strategy in the GALE system. The overall improvement in NIST followed a similar trend, with 1.6% relative improvement compared to the unadapted baseline. The gain was statistically significant at a 95% confidence interval [9.147, 9.378] with respect to the unadapted baseline.

5.4.1 Human Evaluation

Human evaluation was carried out to compare the translation performance of the bilingual LSA-adapted GALE SMT system with the unadapted baseline. For comparison purposes, only the test sentences which had different translations from the two SMT systems were considered. Due to limited resources, only a random subset of test sentences was used. The test sentences were randomly divided into a core set and a remaining set. Each grader worked on the same core set, while the remaining set was subdivided into non-overlapping sets for each grader. The core set and each grader-specific set contained 30 and 131 sentences respectively. Each grader assigned two scores to each sentence from the two different systems based on fluency and adequacy with respect to the English references, each ranging from 1 (worst) to 5 (best). Four graders were involved in the human evaluation. Table 5.6 shows the human evaluation results on sentence fluency and adequacy. Consistent improvement in fluency but slight degradation in adequacy were observed across most graders on the bilingual LSA-adapted sentences. Overall, bilingual LSA achieved a better average score than the unadapted baseline, although the gain was not statistically significant. Table 5.7 shows an example in which bilingual LSA gives better fluency than the unadapted baseline. It is surprising that slight degradation of adequacy was observed in the human evaluation

Table 5.6: Human evaluation results on sentence fluency and adequacy on MT06 using the GALE Phase-2.5 SMT system compared with the bilingual LSA (bLSA). Worst score is 1 and the best score is 5.

             Fluency              Adequacy             Average
Grader ID    baseline   bLSA      baseline   bLSA      baseline   bLSA
1            3.15       3.29      3.76       3.70      3.46       3.50
2            3.34       3.38      3.28       3.26      3.31       3.32
3            2.88       3.03      2.97       2.95      2.93       2.99
4            3.96       4.00      3.92       3.79      3.94       3.90

Table 5.7: Example where bilingual LSA gives a better fluency than the unadapted baseline.

Sample output 3
Baseline:       It is necessary to cultivate the sense of innovation in the whole society, vigorously promote innovative spirit, courage competition, strive to create a good atmosphere of talent.
Bilingual LSA:  It is necessary to cultivate the sense of innovation in the whole society, vigorously promote the spirit of innovation, and be bold enough to compete and strive to create a good atmosphere of talent.
Reference:      Anhui must foster innovative knowledge among the entire society, greatly promote a spirit of willingness to innovate and compete, and exert itself to build an excellent atmosphere where human resources come forth in large numbers.

Table 5.8: MT06 evaluation results on the average recall using the GALE-P2.5 SMT system.

Language model                              Recall (%)   Rel. ∆
Baseline EN 5-gram (2.7B)                   46.99        -
bilingual LSA-adapted                       47.45        1.0
bilingual LSA-adapted lexicon               47.44        1.0
bilingual LSA-adapted + lexicon             47.74        1.6
bilingual LSA-adapted + triLSA              48.02        2.2
bilingual LSA-adapted + lexicon + triLSA    47.93        2.0

results. Perhaps the number of graders is not large enough to represent the actual performance. Therefore, we employ recall to measure meaning preservation as follows:

\text{Recall} = \frac{\#\text{ of matched unigrams}}{\#\text{ of unigrams in a reference}}          (5.13)

Stopwords and punctuation were removed before calculating the recall. Since MT06 has four English references, the recall against each reference was computed, and the average value over the translated hypotheses from the GALE-P2.5 system is shown in Table 5.8. Our approaches achieved better recall than the unadapted baseline, suggesting that bilingual LSA adaptation may better preserve the meaning of the source text.
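For concreteness, a small Python sketch of the recall measure in equation 5.13, averaged over multiple references; the stopword list here is a toy placeholder, not the one used in the experiments:

```python
# Sketch of equation 5.13: unigram recall against a reference, averaged over
# several references, after removing stopwords and punctuation.
from collections import Counter
import string

STOPWORDS = {"the", "a", "of", "to", "and", "in", "is"}   # toy placeholder list

def clean(tokens):
    return [t.lower() for t in tokens
            if t.lower() not in STOPWORDS and t not in string.punctuation]

def unigram_recall(hypothesis, reference):
    hyp, ref = Counter(clean(hypothesis)), Counter(clean(reference))
    matched = sum(min(ref[w], hyp[w]) for w in ref)
    return matched / max(sum(ref.values()), 1)

def average_recall(hypothesis, references):
    return sum(unigram_recall(hypothesis, r) for r in references) / len(references)

hyp = "the danish government and people of denmark is very important".split()
refs = ["it is extremely important to the danish government and the danish people".split()]
print(round(average_recall(hyp, refs), 3))
```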

5.4.2 Discussion

Table 5.9 and Table 5.10 show the performance breakdown over the newsgroup (NG), newswire (NW) and broadcast news (BN) genres in BLEU and NIST respectively. Adaptation consistently helped on the newswire documents using both the medium-scale and the GALE-P2.5 SMT systems. This observation is reasonable since the training data are mostly from newswire. On the other hand, the adaptation performance on broadcast news and newsgroup is inconsistent. Figure 5.4 and Figure 5.5 show the corresponding relative improvement in BLEU per document using the medium-scale and the GALE-P2.5 SMT system respectively. The trends

Figure 5.4: Relative BLEU improvement (%) of LSA adaptation compared to the unadapted baseline per document on MT06 using the medium-scale SMT system (500M), plotted against BLEU per document (broadcast news, newswire and newsgroup).

Figure 5.5: Relative BLEU improvement (%) of LSA adaptation compared to the unadapted baseline per document on MT06 using the GALE-P2.5 SMT system (2.7B), plotted against BLEU per document (broadcast news, newswire and newsgroup).

Table 5.9: MT06 evaluation results on newsgroup (NG), newswire (NW) and broadcast news (BN) genre measured on BLEU using 67M-word (260M-word) parallel corpora for phrase extraction and 500M-word (2.7B-word) English corpora for language model training. Vocabulary size of the target language model is 1.3M (4.1M). Four English references are used for scoring.

BLEU (%)
Language model                                NG      Rel. ∆   NW      Rel. ∆   BN      Rel. ∆
Baseline EN 4-gram (500M)                     23.62   -        28.38   -        30.98   -
bilingual LSA-adapted                         23.93   1.3      29.19   2.9      31.29   1.0
bilingual LSA-adapted lexicon                 24.22   2.5      28.81   1.5      31.60   2.0
bilingual LSA-adapted + lexicon               24.76   4.8      29.38   3.5      31.33   1.1
mono LSA-adapted                              23.78   0.7      28.95   2.0      31.10   0.4
mono LSA-adapted lexicon                      24.30   2.9      29.25   3.1      31.24   0.8
mono LSA-adapted + lexicon                    24.79   5.0      29.51   4.0      31.30   1.0
bilingual LSA-adapted + biLSA                 24.73   4.7      29.72   4.7      31.39   1.3
bilingual LSA-adapted + triLSA                25.10   6.3      30.22   6.5      31.42   1.4
bilingual LSA-adapted + bi & triLSA           25.22   6.8      30.35   6.9      31.22   0.8
bilingual LSA-adapted + lexicon + triLSA      25.14   6.4      30.09   6.0      31.50   1.7
Baseline EN 5-gram (2.7B)                     27.71   -        31.34   -        34.72   -
bilingual LSA-adapted                         27.97   0.9      31.95   1.9      35.02   0.9
bilingual LSA-adapted lexicon                 27.73   0.1      31.85   1.6      35.68   2.8
bilingual LSA-adapted + lexicon               27.70   -ve      32.08   2.4      35.52   2.3
bilingual LSA-adapted + triLSA                28.01   1.1      32.10   2.4      35.41   2.0
bilingual LSA-adapted + triLSA + lexicon      27.75   0.1      32.26   2.9      35.84   3.2

Table 5.10: MT06 evaluation results on newsgroup (NG), newswire (NW) and broadcast news (BN) genre measured on NIST using 67M-word (260M-word) parallel corpora for phrase extraction and 500M-word (2.7B-word) English corpora for language model training. Vocabulary size of the target language model is 1.3M (4.1M). Four English references are used for scoring.

NIST
Language model                                NG     Rel. ∆   NW     Rel. ∆   BN     Rel. ∆
Baseline EN 4-gram (500M)                     7.15   -        8.51   -        8.32   -
bilingual LSA-adapted                         7.25   1.4      8.61   1.2      8.39   0.8
bilingual LSA-adapted lexicon                 7.39   3.4      8.71   2.4      8.46   1.7
bilingual LSA-adapted + lexicon               7.45   4.2      8.77   3.1      8.46   1.7
mono LSA-adapted                              7.25   1.4      8.63   1.4      8.38   0.7
mono LSA-adapted lexicon                      7.44   4.1      8.78   3.2      8.43   1.3
mono LSA-adapted + lexicon                    7.50   4.9      8.83   3.8      8.45   1.6
bilingual LSA-adapted + biLSA                 7.48   4.6      8.78   3.2      8.49   2.0
bilingual LSA-adapted + triLSA                7.55   5.6      8.88   4.3      8.44   1.4
bilingual LSA-adapted + bi & triLSA           7.61   6.4      8.94   5.1      8.46   1.7
bilingual LSA-adapted + lexicon + triLSA      7.58   6.0      8.88   4.3      8.44   1.4
Baseline EN 5-gram (2.7B)                     7.79   -        8.96   -        8.73   -
bilingual LSA-adapted                         7.83   0.5      9.08   1.3      8.75   0.2
bilingual LSA-adapted lexicon                 7.91   1.5      9.10   1.6      8.78   0.6
bilingual LSA-adapted + lexicon               7.90   1.4      9.15   2.1      8.78   0.6
bilingual LSA-adapted + triLSA                7.87   1.0      9.12   1.8      8.84   1.3
bilingual LSA-adapted + triLSA + lexicon      7.92   1.7      9.13   1.9      8.83   1.1

Table 5.11: Translation results after crosslingual language model adaptation on the unsequestered broadcast news portion of the Mandarin Eval07 test set (Eval07u.BN) using the GALE-P2.5 SMT system with different numbers of latent topics K in bilingual LSA.

Source input               BLEU (%)   Rel. ∆   NIST   Rel. ∆
Reference (OOV=0.075%)     17.37      -        5.53   -
K=100                      17.69      1.8      5.58   0.9
K=200                      17.74      2.1      5.59   1.1
K=300                      17.51      0.8      5.56   0.5

on both graphs are different: adaptation helps on documents with high BLEU scores (say ≥35%) using the medium-scale system, while the opposite holds for the GALE-P2.5 system. On the other hand, adaptation generally helped on documents with BLEU scores between 20% and 30% on both systems.

5.5 End-to-End Translation

For end-to-end speech translation, we evaluated the effectiveness of topic adaptation using different source inputs on our GALE-P2.5 Mandarin-English SMT system without the part-of-speech reordering feature. To investigate the effect of topic adaptation on transcription towards downstream translation, we translated word hypotheses from the unadapted and the LSA-adapted GALE-P3 Mandarin transcription systems, with character error rates of 5.6% and 5.2% respectively, on the unsequestered broadcast news portion of the Eval07 test set. We also translated the manual transcription to serve as an upper bound for comparison.

5.5.1 Optimal Number of Topics

Table 5.11 shows the performance of crosslingual language model adaptation with different numbers of latent topics in bilingual LSA. With the number of topics set to 200, bilingual LSA yielded the best translation performance in terms of BLEU and NIST. The same

result was found in monolingual language model adaptation for automatic speech recognition in Section 4.1.3.

5.5.2 Results

Table 5.12 shows the development of Mandarin-English speech translation using the GALE transcription and translation systems. Translating the N-gram LSA-adapted word hypotheses, which have a lower character error rate, successfully carried over to better translation performance, yielding 0.9% relative improvement in both BLEU and NIST compared to the background unadapted word hypotheses. Using the manual transcription as input gave the best translation performance, with 5.9% and 4.7% relative improvement in BLEU and NIST respectively compared to the background unadapted word hypotheses. Although the Chinese side of the parallel corpora and the inputs were segmented consistently using the Stanford segmenter, the out-of-vocabulary (OOV) rate on the manual transcription was 1.7%, which was moderately high due to the ability of the Stanford segmenter to hypothesize new vocabulary during segmentation. Unfortunately, out-of-vocabulary Chinese words cannot be translated because of the possible segmentation mismatch with a phrase table. Therefore, we attempted to reduce the mismatch via segmentation refinement. First, we extracted the Chinese words from the phrase table to form a word list. Then, a maximal matching segmenter with this word list was applied to the Chinese test inputs that had been segmented with the Stanford segmenter. As a result, the OOV terms were further segmented into smaller terms which may conform better to the segmentation of the phrase table. As shown in the second sub-table of Table 5.12, the OOV rate dropped significantly to less than 0.1%. In addition, segmentation refinement carried over to better translation performance, yielding 3% relative improvement in BLEU on the manual transcription compared to the corresponding counterpart with Stanford segmentation only. Similar results were obtained on the automatic transcription, with 2.8% relative improvement in BLEU compared to the corresponding counterparts. In other words, the benefit of using N-gram LSA-adapted word hypotheses was maintained after segmentation refinement.
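A minimal Python sketch of the greedy maximal (longest-first) matching refinement described above, applied to an OOV compound against an illustrative word list extracted from the phrase table; this is a sketch under assumed names, not the thesis implementation:

```python
# Sketch of segmentation refinement by greedy maximal matching against the
# word list extracted from the phrase table. Spans that cannot be matched
# fall back to single characters, which reduces the OOV rate against the table.

def maximal_match(text, word_list, max_len=6):
    words, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in word_list:
                words.append(candidate)
                i += length
                break
    return words

phrase_table_words = {"中国", "政府", "新闻", "发布会"}
print(maximal_match("中国政府新闻发布会", phrase_table_words))
# ['中国', '政府', '新闻', '发布会']
```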

Table 5.12: Speech translation results on the unsequestered broadcast news portion of the Mandarin Eval07 test set (Eval07u.BN) on BLEU and NIST using the GALE-P2.5 SMT system. Source inputs are word hypotheses from an unadapted background GALE-P3 Mandarin transcription system (CER=5.6%), an N-gram LSA-adapted system (CER=5.2%), and a manual reference (CER=0%). Confusion-network-like English references are used for scoring. Relative improvement in BLEU and NIST are reported with respect to the unadapted background word hypotheses before segmentation refinement.

Source input               BLEU (%)   Rel. ∆   NIST   Rel. ∆
ASR hypo (background)      15.92      -        5.28   -
ASR hypo (N-gram LSA)      16.07      0.9      5.33   0.9
Reference (OOV=1.7%)       16.86      5.9      5.53   4.7
After Segmentation Refinement
ASR hypo (background)      16.36      2.8      5.28   0.0
ASR hypo (N-gram LSA)      16.52      3.8      5.34   1.1
Reference (OOV=0.075%)     17.37      9.1      5.53   4.7
After LM Adaptation
ASR hypo (background)      16.67      4.7      5.34   1.1
ASR hypo (N-gram LSA)      16.80      5.5      5.37   1.7
Reference (OOV=0.075%)     17.74      11.4     5.59   5.9
After Lexical Adaptation
ASR hypo (background)      16.52      3.8      5.32   0.8
ASR hypo (N-gram LSA)      16.90      6.2      5.40   2.3
Reference (OOV=0.075%)     17.72      11.3     5.58   5.7
After LM + Lexical Adaptation
ASR hypo (background)      16.88      6.0      5.37   1.7
ASR hypo (N-gram LSA)      17.18      7.9      5.43   2.8
Reference (OOV=0.075%)     17.99      13.0     5.63   6.6

After segmentation refinement, we applied crosslingual adaptation on the target language models, translation lexicons and phrase tables incrementally, as shown in Table 5.12. Similar to the results on text translation, language model adaptation and translation model adaptation improved the translation performance individually. When both adapted models were applied simultaneously, the gain was additive, yielding 3.2–3.6% relative improvement in BLEU and 1.7–1.8% relative improvement in NIST compared to the corresponding counterparts of the unadapted SMT systems. However, the improvement was not statistically significant at a 95% confidence interval [15.12%, 17.65%] in BLEU and [5.100, 5.459] in NIST compared to the unadapted speech translation system. Overall, it is beneficial to apply topic adaptation to improve recognition accuracy, which translates to better translation performance. In addition, the gains from better recognition accuracy and sharper translation models after topic adaptation were additive, yielding 1.8% and 1.1% relative improvement in BLEU and NIST respectively compared to the background unadapted word hypotheses. Compared with the translation results on manual transcription, there is still much room for further improvement in speech translation. In other words, we expect that improving the upstream recognition accuracy will improve the downstream translation performance.

5.6 Non-Parallel Bilingual Latent Semantic Analysis

We have shown the effectiveness of bilingual latent semantic analysis for crosslingual adaptation for statistical machine translation. The limitation of bilingual latent semantic analysis is the requirement of parallel corpora for model training. Since parallel corpora are more expensive to collect than monolingual non-parallel corpora, incorporating non-parallel corpora into bilingual latent semantic analysis is attractive. Moreover, non-parallel corpora generally cover a broader range of topics and vocabulary than parallel corpora. The main issue is that blind incorporation of non-parallel corpora may destroy the one-to-one topic correspondence in bilingual latent semantic analysis, since the alignment between a source and a target monolingual document is unknown or even non-existent. To work around the issue of the unknown document alignment, we employ a semi-supervised learning approach where some parallel seed documents are given. The smoothness

Figure 5.6: Parallel clusters formed by monolingual documents (black dots) using M parallel seed documents.

assumption (Chapelle et al., 2006) says that if two points x_1 and x_2 are close in a high-density region, the corresponding outputs y_1 and y_2 should be close as well. In our setting, each parallel source-target document pair is treated as an input-output point in some space. With the smoothness assumption, we associate each monolingual non-parallel document with the closest parallel document via a document similarity measure and discard those that are distant from all parallel documents. As a result, a partial alignment between source and target monolingual non-parallel documents is recovered at the document cluster level. The parallel clusters then serve as constraints for the cluster-based bilingual LSA training via the Lagrangian theory (Tam and Schultz, 2009).

5.6.1 Parallel Clusters

We propose a platform for integrating monolingual non-parallel documents via parallel clusters. The concept of parallel clusters is depicted in Figure 5.6. The idea is to use parallel seed documents to form initial clusters, each containing only a single document. A parallel cluster is then populated by associating each monolingual source or target document with the closest parallel document based on a similarity measure. We represent each document d as a K-dimensional topic posterior vector p(k|d) inferred

by monolingual LSA. The similarity between two documents is computed as follows:

D(d, d_c) = \sum_{k=1}^{K} \sqrt{p(k|d) \cdot p(k|d_c)}          (5.14)

where d_c is the parallel seed document in cluster c. When D(d, d_c) is equal to one, the input documents are considered identical in the topic sense. Consequently, a partial document alignment between monolingual documents is recovered at the cluster level. Intuitively, monolingual documents within a cluster are expected to come from similar topics. We prevent "noisy" monolingual documents from folding into a cluster by setting a threshold τ so that any monolingual document whose similarity D to every cluster centroid falls below τ is removed.
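A small Python sketch of the cluster assignment step under equation 5.14, treating D as a similarity (a Bhattacharyya coefficient between topic posteriors); the names and the threshold value are illustrative assumptions, not the thesis implementation:

```python
import numpy as np

# Sketch of populating parallel clusters (equation 5.14): each monolingual
# document, represented by its topic posterior p(k|d), is assigned to the
# closest parallel seed document; documents whose best similarity falls
# below the threshold tau are discarded as "noisy".

def bhattacharyya(p, q):
    return float(np.sum(np.sqrt(p * q)))          # equals 1.0 iff p == q

def assign_to_clusters(mono_posteriors, seed_posteriors, tau=0.7):
    assignments = {}                               # doc index -> cluster index
    for d, p in enumerate(mono_posteriors):
        sims = [bhattacharyya(p, seed) for seed in seed_posteriors]
        best = int(np.argmax(sims))
        if sims[best] >= tau:
            assignments[d] = best
    return assignments

seeds = [np.array([0.98, 0.01, 0.01]), np.array([0.01, 0.01, 0.98])]
monos = [np.array([0.9, 0.05, 0.05]), np.array([1/3, 1/3, 1/3])]
print(assign_to_clusters(monos, seeds))   # the near-uniform document is discarded
```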

5.6.2 Cluster-based Bilingual LSA Training

The development of cluster-based bilingual LSA training assumes that the average topic distributions of a source and a target cluster are identical. In other words, a one-to-one topic correspondence within a pair of parallel clusters is assumed. Given a pair of parallel clusters C = {C^{(i)}, C^{(j)}}, where i and j denote the indices of the source and target language respectively, this assumption can be encoded as follows:

\forall k: \; E[p^{(i)}(k|C^{(i)})] = E[p^{(j)}(k|C^{(j)})]          (5.15)
\Longrightarrow \frac{\sum_{d \in C^{(i)}} p^{(i)}(k|d)}{|C^{(i)}|} = \frac{\sum_{d \in C^{(j)}} p^{(j)}(k|d)}{|C^{(j)}|}          (5.16)

where d and k denote a document and a topic index respectively. For monolingual LSA training using latent Dirichlet allocation (Blei et al., 2003), the lower bound of the log likelihood of a document W_d = w_1 ... w_{N_d}, denoted as Q(W_d), is:

\log p(W_d) = \log \int_\theta \sum_Z p(W_d, Z, \theta) = \log \int_\theta p(\theta) \prod_{n=1}^{N_d} \sum_{z_n} p(z_n|\theta) \cdot p(w_n|z_n) \, d\theta
\geq E_q\!\left[\log \frac{p(\theta)}{q(\theta)}\right] + \sum_{n=1}^{N_d} \left( E_q\!\left[\log \frac{p(z_n|\theta)}{q(z_n)}\right] + E_q[\log p(w_n|z_n)] \right)
= Q(W_d)

where Z = z_1 ... z_{N_d} and θ denote the latent topic sequence and the topic distribution vector sampled from a Dirichlet prior respectively. The lower-bound value is obtained via Jensen's inequality using a factorizable variational posterior distribution over the latent

variables q(Z, θ|d) = q(θ) \prod_{n=1}^{N_d} q(z_n). Therefore, the objective function for bilingual LSA training with a pair of clusters is the sum of the lower-bound log likelihoods of the documents in the source and target clusters, subject to the topic correspondence constraint in equation 5.16. With the Lagrange multipliers λ_{Ck}, the objective function is:

Q(W; \Lambda, \Gamma) = \sum_{d \in C^{(i)}} Q^{(i)}(W_d; \Lambda, \Gamma) + \sum_{d \in C^{(j)}} Q^{(j)}(W_d; \Lambda, \Gamma)
 + \sum_{C,k} \lambda_{Ck} \left( \frac{\sum_{d \in C^{(i)}} p^{(i)}(k|d)}{|C^{(i)}|} - \frac{\sum_{d \in C^{(j)}} p^{(j)}(k|d)}{|C^{(j)}|} \right)
\text{where } p(k|d) \approx \frac{\sum_{n=1}^{N_d} q(z_n = k|d)}{N_d} \text{ for large } N_d

To derive the E-steps, we compute the partial derivative of Q(W; Λ, Γ) with respect to q^{(i)}(z_n = k|d) subject to \sum_{k=1}^{K} q^{(i)}(z_n = k|d) = 1, which yields the following solution:

q^{(i)}(z_n = k|d) = p^{(i)}(w_{dn}|k) \cdot e^{E_q[\log \theta_k^{(i)}] + \mu_{dn}^{(i)}} \cdot e^{\frac{\lambda_{Ck}}{|C^{(i)}| \cdot N_d}}          (5.17)

where μ_{dn}^{(i)} is the Lagrange multiplier for probability normalization in q^{(i)}(z_n = k|d). If we assume that each document has the same number of words, so that N_d ≈ N, we can use equation 5.17 to construct the estimated p(k|d), which is substituted back into the left-hand side of equation 5.16. After rearranging terms of the resulting equation, we obtain the following result:

e^{\frac{\lambda_{Ck}}{|C^{(i)}| \cdot N}} = \frac{E[p(k|C^{(j)})]}{\frac{1}{|C^{(i)}| \cdot N} \sum_{d \in C^{(i)}} \sum_n p^{(i)}(w_{dn}|k) \cdot e^{E_q[\log \theta_k^{(i)}] + \mu_{dn}^{(i)}}}          (5.18)
\approx \frac{E[p(k|C^{(j)})]}{E[p(k|C^{(i)})]} = r_{j/i}(k|C)          (5.19)

where r_{j/i}(k|C) is the topic ratio between the target and source cluster in C. Substituting r_{j/i}(k|C) into equation 5.17 and using the E-steps of latent Dirichlet allocation introduced in Chapter 2, we arrive at the following variational E-steps for a source document in C^{(i)}:

E-steps:

q^{(i)}(z_n = k|d) \propto p^{(i)}(w_{dn}|k) \cdot e^{E_q[\log \theta_k^{(i)}]} \cdot r_{j/i}(k|C)          (5.20)
\gamma_{dk}^{(i)} = \alpha_k^{(i)} + \sum_{n=1}^{N_d} q^{(i)}(z_n = k|d)          (5.21)

The E-steps resemble those in latent Dirichlet allocation except that an extra term r_{j/i}(k|C) is introduced to enforce the one-to-one topic correspondence between C^{(i)} and C^{(j)} in equation 5.15. By symmetry, the E-steps for documents in the target language j proceed in a similar fashion. After performing the E-steps on all monolingual documents, r_{j/i}(k|C) is updated using equation 5.19 and then substituted back into the E-steps iteratively until convergence is reached. The M-step is the same as in latent Dirichlet allocation.
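The following Python sketch outlines the alternation between the reweighted E-step (equation 5.20) and the topic-ratio update (equation 5.19) for a single cluster pair. It is a deliberate simplification under several assumptions (LDA-style posteriors, shared and fixed topic-prior term, fixed word-topic probabilities, illustrative names), not the thesis implementation:

```python
import numpy as np

# Schematic sketch of the cluster-constrained E-step: per-token topic
# posteriors are reweighted by the topic ratio r_{j/i}(k|C), and the ratio
# is re-estimated from the resulting cluster-level topic distributions.

def e_step(docs, p_w_given_k, exp_Eq_log_theta, ratio):
    posteriors = []
    for doc in docs:
        q = p_w_given_k[:, doc].T * exp_Eq_log_theta * ratio   # (len(doc), K)
        posteriors.append(q / q.sum(axis=1, keepdims=True))
    return posteriors

def cluster_topic_dist(posteriors):
    return np.mean([q.mean(axis=0) for q in posteriors], axis=0)

def train_pair(src_docs, tgt_docs, p_src, p_tgt, exp_theta, iters=10):
    K = exp_theta.shape[0]
    r_src, r_tgt = np.ones(K), np.ones(K)          # r_{j/i} and r_{i/j}
    for _ in range(iters):
        q_src = e_step(src_docs, p_src, exp_theta, r_src)
        q_tgt = e_step(tgt_docs, p_tgt, exp_theta, r_tgt)
        mean_src, mean_tgt = cluster_topic_dist(q_src), cluster_topic_dist(q_tgt)
        r_src, r_tgt = mean_tgt / mean_src, mean_src / mean_tgt   # equation 5.19
    return q_src, q_tgt

K, V = 2, 4
p_src = np.random.dirichlet(np.ones(V), size=K)
p_tgt = np.random.dirichlet(np.ones(V), size=K)
exp_theta = np.ones(K)
q_src, q_tgt = train_pair([[0, 1, 2]], [[3, 0]], p_src, p_tgt, exp_theta)
print(cluster_topic_dist(q_src).round(3), cluster_topic_dist(q_tgt).round(3))
```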

5.6.3 Experiment

The bilingual LSA training employed parallel Chinese–English corpora from the Donga news websites containing 28k parallel documents with 13M Chinese characters and 9M English words. We applied latent Dirichlet-Tree allocation with 50 latent topics. We employed a small-scale RT04 SMT system to evaluate the performance of crosslingual language model adaptation. To study the effect of incorporating pseudo and real monolingual non-parallel corpora into bilingual LSA, we randomly split the corpora into two parts: 10% of the documents (2.8k) as parallel seed documents and the remaining 90% as pseudo-monolingual documents (25k), for which the one-to-one document correspondence was omitted. We compared different bilingual LSA training scenarios from A to F as shown in Table 5.13. Scenario A used only 10% of the parallel corpora as a baseline. Scenario B incorporated the remaining 90% pseudo-monolingual portion in addition to scenario A without constraints, i.e., the topic ratios r_{*/*}(k|C) in equation 5.20 were set to 1, meaning that parallel clusters were not applied. Scenario C had the same settings as scenario B except that the parallel clusters were applied. Scenario D resembled scenario B except using the real

Table 5.13: Bilingual LSA training scenarios with pseudo-monolingual (p-mono) Donga news and real monolingual Xinhua news 2004 corpora.

Scenario                      # Chinese doc   # English doc
A. 10% // (baseline)          2.8k            2.8k
B. + p-mono (blind)           +14.5k          +13.7k
C. + p-mono (// cluster)      +14.5k          +13.7k
D. + real mono (blind)        +18.4k          +19.2k
E. + real mono (// cluster)   +18.4k          +19.2k
F. 100% // (golden line)      28k             28k

monolingual non-parallel corpora from the Chinese and English Xinhua news 2004 corpora. Scenario E shared the same rationale as scenario C but used the real monolingual non-parallel corpora. Scenario F served as an ideal case where 100% of the parallel corpora were available. We compared these scenarios for crosslingual language model adaptation at the story level using the manual transcription of the RT04 test set of the source language, comprising CCTV, NTDTV and RFA shows. The performance metrics were target word perplexity and BLEU.

5.6.4 Results

Table 5.14 shows the top new words discovered by bilingual LSA from the pseudo-monolingual corpora after filtering out words that were already covered in the parallel seed documents. The new words tended to be crosslingual word triggers, suggesting that our approach worked well in the pseudo-monolingual case. Table 5.15 shows the results in target word perplexity and BLEU after crosslingual language model adaptation via marginal adaptation. The baseline bilingual LSA in scenario A showed a reduction in perplexity compared to the unadapted language model, which was surprisingly decent given the small amount of parallel training data. Incorporating pseudo-monolingual documents further reduced perplexity in scenario C compared to scenario A, and approached the ideal case of scenario F using the full parallel corpora. Given that scenarios A and F set the overall

Table 5.14: New topical words which are not covered by the parallel corpora, extracted by bilingual LSA using pseudo-monolingual corpora. Words on the Chinese side are translated into English for illustration purposes.

Topics               Top new words sorted by p(w|k)
"CH (Art)"           film reward, ballet, art festival, ballet club, edinburgh, orchestra, rock-n-roll, spartacus
"EN (Art)"           ballet, ballads, edinburgh, pianist, hop, hip, boa, spartacus, wax, sf, swan, beast, oscar
"CH (Economy)"       export rate, life condition, 2nd season international oil, durable, greenspan, trade deficit
"EN (Economy)"       diesel, greenspan, durable, dived, revalation, bottomed, recessions, nonferrous, iea
"CH (Electronics)"   router, broadband service, album, connector, bundled with, coupon, broadcast
"EN (Electronics)"   3g, bro, pixel, copying, piracy, sw, fingerprint, telephony, cartoonists, sos

Given that scenarios A and F set the overall upper-bound and lower-bound perplexity of 117 and 111 respectively, our approach was reasonable with an overall perplexity of 113. On the other hand, folding in monolingual corpora without parallel clusters as constraints in scenario B degraded perplexity compared to scenario A. This indicates that using parallel clusters as constraints is crucial when incorporating monolingual non-parallel documents. We observed a similar trend in perplexity when the real monolingual corpora were employed: perplexity was reduced in scenario E, but deteriorated in scenario D even further compared to scenario B. This implies that our approach becomes even more critical when incorporating real monolingual documents. Regarding translation performance, consistent improvement in BLEU was observed in scenarios C and E, mirroring their perplexity performance, although the gain was not significant. But since the difference in BLEU between the best scenario F and the baseline scenario A was only 0.21%, the gains of 0.19% and 0.15% after incorporating monolingual corpora using our approach in scenarios C and E respectively were reasonable, using a single target reference for scoring.

Table 5.15: Crosslingual language model adaptation performance in BLEU (target perplexity) on different training scenarios for bilingual LSA.

Scenario                      CCTV         NTDTV         RFA          OVERALL
BGEN4-gram                    16.12 (85)   14.04 (127)   8.83 (189)   13.22 (126)
A. 10% // (baseline)          16.26 (78)   14.09 (115)   8.90 (181)   13.28 (117)
B. + p-mono (blind)           16.46 (81)   14.29 (116)   8.68 (189)   13.36 (121)
C. + p-mono (// cluster)      16.52 (75)   14.31 (109)   8.95 (178)   13.47 (113)
D. + real mono (blind)        15.66 (91)   14.28 (135)   8.87 (192)   13.12 (133)
E. + real mono (// cluster)   16.30 (76)   14.40 (114)   9.04 (178)   13.44 (115)
F. 100% // (golden-line)      16.44 (74)   14.38 (107)   9.06 (172)   13.49 (111)


5.7 Summary

We have proposed bilingual N-gram LSA for crosslingual adaptation in statistical machine translation. Our approach is based on bilingual LSA, which enables a latent topic distribution to be efficiently transferred from a source language to a target language by enforcing a one-to-one topic correspondence between the source and target LSA. During testing, crosslingual adaptation can be performed simultaneously on the SMT models by inferring the topic distribution of a source text and then applying the inferred distribution to the target language. Since adaptation is performed before translation, it does not require the adaptation text to be pre-translated as in monolingual adaptation. Therefore, an immediate impact on the translation output is achieved in a single decoding pass. Rapid LSA bootstrapping for a new language can be performed from a well-trained LSA of another language. Results show that our approach has reduced the word perplexity of the target language model significantly. When the adapted language model or lexicon is applied separately, improvement in BLEU and NIST scores has been observed. When both models are applied simultaneously, the gain is additive.

Figure 5.7: Overall translation performance summary after crosslingual adaptation using the GALE-P2.5 system for text translation.

Figure 5.8: Overall translation performance summary after crosslingual adaptation using the GALE-P2.5 system for end-to-end translation.

Trigram LSA has improved the translation performance further compared to LSA. On the medium-scale SMT system, the improvement is statistically significant at the 95% confidence level with respect to the unadapted baseline. Effective language model adaptation improves word reordering, while a better translation lexicon leads to a better phrase table. Our approach works well in a large-scale evaluation with consistent improvement using the GALE-P2.5 SMT system. The improvement in the NIST score is statistically significant at the 95% confidence level. For end-to-end translation, the gain from improved recognition accuracy and sharper translation models after topic adaptation is additive. Figure 5.7 and Figure 5.8 summarize our contributions towards better translation performance using our GALE-P2.5 system on text translation and end-to-end translation.

The limitation of bilingual LSA training using parallel documents is relaxed by incorporating monolingual non-parallel documents using a semi-supervised approach. We have enforced a one-to-one topic correspondence between the parallel clusters populated with monolingual non-parallel documents. The proposed bilingual LSA training is based on the variational Expectation-Maximization algorithm and Lagrangian theory. Our approach has incorporated monolingual corpora successfully and has yielded slightly better crosslingual language model adaptation performance compared to the baseline without the monolingual non-parallel corpora. Incorporating monolingual corpora without the parallel clusters can lead to severe performance degradation, implying that a one-to-one topic correspondence between the parallel clusters is crucial. This approach has potential for building a crosslingual word trigger model with enhanced vocabulary coverage in a resource-deficient scenario where only a small amount of parallel documents is available. Language model adaptation, translation lexicon adaptation and incorporating monolingual non-parallel corpora address different aspects of statistical machine translation. Intuitively, language model adaptation improves fluency via better word reordering. Translation lexicon adaptation reduces the ambiguity of word translation options. Incorporating monolingual non-parallel corpora potentially addresses the out-of-vocabulary issue. We speculate that language model adaptation is more important than translation lexicon adaptation, which is in turn more important than incorporating monolingual non-parallel corpora.

It is widely accepted that the average length of a phrase match between the source text and the phrase table governs the quality of local word reordering, and thus the quality of translation. As the length of a matched source phrase grows, the number of translation options for the source phrase decreases. Therefore, the effectiveness of translation lexicon adaptation may be reduced. The impact of incorporating monolingual non-parallel corpora may be the least, especially when the bilingual resources are rich, since the out-of-vocabulary issue is then less severe. However, among the three, its potential for future research is expected to be the largest.

Chapter 6

Conclusions

Adaptation is an indispensable part of speech and natural language applications due to the mismatch between a background model and a test domain. Our unified topic adaptation framework via latent semantic analysis reduces this mismatch within and across languages.

6.1 Contributions

We list the contributions as follows:

• We have shown that latent Dirichlet allocation for Bayesian latent semantic analysis (LSA) can be applied to unsupervised language model adaptation via topic caching. Topic caching is more robust against speech recognition errors than word caching. As a cache model, adaptation can be performed rapidly from a small amount of adaptation text. Our results have shown improvement in recognition performance (Section 4.1.2).

• We have proposed latent Dirichlet-Tree allocation to model topic correlation, generalizing latent Dirichlet allocation. Our model can be trained using an efficient variational Expectation-Maximization algorithm. Our approach addresses the model initialization issue via a structured topic prior so that model training converges faster than for latent Dirichlet allocation, which is crucial for training on a large volume of data (Section 4.2.2).

• We have proposed incremental marginal adaptation for lattice rescoring that has yielded additive improvement after topic caching (Section 4.3.2).

• We have employed N-gram LSA to relax the “bag-of-word” assumption. Efficient model training and smoothing are the major problems in N-gram LSA for large-scale application. We have addressed the model training issue via a bootstrapping algorithm and a variational Expectation-Maximization algorithm. We have investigated a fractional Kneser-Ney approach to smooth N-gram LSA, which employs fractional counts. The smoothing algorithm generalizes the original formulation, producing a more compact model compared to Witten-Bell smoothing. In addition, the smoothing algorithm applies to other higher-order language models, including a factored language model. Our results have shown that N-gram LSA yields significant improvement in recognition performance compared to LSA for large-scale GALE evaluations on two languages: Mandarin and Arabic (Section 4.4.6).

• We have extended our monolingual topic adaptation to crosslingual adaptation via bilingual LSA. Since bilingual LSA captures a one-to-one topic correspondence between a source and a target language, adaptation on the target language can be performed using information from the source language. As a consequence, pre-translation of a source text is not required to adapt the target models in statistical machine translation, giving immediate impact before translation. We have shown that adapting a language model and a translation lexicon together has yielded additive improvement in translation quality. Applying N-gram LSA on the target language as an additional language model feature function further improves translation quality for text translation (Section 5.4). For end-to-end translation, we have shown that the gain from better recognition accuracy and sharper translation models after topic adaptation is additive (Section 5.5).

• We have relaxed the requirement of using parallel corpora for bilingual LSA training via incorporating monolingual non-parallel corpora in a semi-supervised fashion.

Our approach improves the performance of crosslingual language model adaptation compared to blind incorporation of monolingual non-parallel corpora. In addition, our approach has potential for building a crosslingual word trigger model with enhanced vocabulary coverage in a resource-deficient scenario where parallel resources are scarce (Section 5.6.4).

6.2 Summary of Results

Using our baseline systems trained on a sufficiently large amount of training data, we have achieved the following results: For automatic speech recognition, we have achieved significant relative reductions in recognition error rates in the range of 5–7% and 3% respectively on the Mandarin and Arabic test sets after applying the proposed unsupervised language model adaptation for the GALE-P3 evaluation. The reductions are statistically significant at a 0.1% significance level compared to an unadapted baseline that employs state-of-the-art techniques such as discriminative training and acoustic model adaptation. For statistical machine translation, we have achieved significant relative improvements in BLEU and NIST of 4.8% and 4.2% respectively on MT06 after topic adaptation using the GALE Chinese-English development system (500M). The improvement is statistically significant at the 95% confidence level compared to an unadapted baseline. In addition, we have achieved significant improvement in NIST using the GALE-P2.5 Chinese-English system compared to an unadapted baseline. For end-to-end translation, we have achieved 1.8% and 1.1% relative improvements in BLEU and NIST respectively after end-to-end topic adaptation compared to the unadapted baseline. In summary, topic adaptation for language models encourages better recognition accuracy on topical words in automatic speech recognition. The results are important for applications such as spoken language understanding systems, since topical words usually carry the most important information of an utterance compared to stopwords, which contribute much less to the automatic understanding process.

6.3 Future Challenges and Potentials

Although our topic adaptation approach is beneficial to overall recognition performance, less gain is observed on broadcast conversation compared to broadcast news. A major challenge is the mismatch between our training corpora, which are mostly newspaper text with relatively few audio transcripts, and the conversational test domain. In addition, broadcast conversation is spontaneous in speaking style with many disfluencies, including hesitations and repetitions. The spontaneous events in broadcast conversation are usually independent of the topic context. Thus we believe that N-gram latent semantic analysis is not a good model for broadcast conversation. An improved language modeling approach for broadcast conversation deserves much attention in future research. Statistical machine translation usually requires parallel sentences so that a phrase table can be built for translation. A major hurdle for this approach is the inability to translate out-of-vocabulary terms that are not covered in the phrase table. As a consequence, the benefit of crosslingual language model adaptation may be reduced since the N-gram entries containing the out-of-vocabulary terms are never used during decoding. This problem is severe for minority language pairs where parallel resources are deficient, producing a significant number of out-of-vocabulary terms. Therefore, leveraging monolingual non-parallel corpora is a research direction to discover potential translation candidates for out-of-vocabulary terms. We envision that our approaches can be extended to different applications, such as crosslingual semantic search. Semantic search is an interesting application where the goal is to deliver the information queried by a user rather than having a user sort through a list of loosely related keyword results. Bilingual LSA lends itself well to the search application since it can work as a crosslingual trigger between an input user query and output objects without machine translation or dictionary lookup. In addition, bilingual LSA can also be extended to other multimedia such as images and videos. For instance, text annotation on images has been shown feasible using latent Dirichlet allocation (Blei and Jordan, 2003). Many objects on the Internet, including images and videos, are not annotated and indexed. Without proper indexing, there is little hope of retrieving these objects given a user query.

On the other hand, a user query and the clicked output objects can be treated as “parallel” data. Our semi-supervised approach for bilingual LSA can be applied similarly to this scenario via the notion of parallel clusters that incorporate unannotated objects into the clusters of a target side. Since our approach has shown potential to discover novel crosslingual word triggers that are not covered in parallel corpora, we speculate that the concept of crosslingual triggering may be applied to crosslingual semantic search to trigger unannotated objects given an input query.

Appendix A

Gibbs Sampling

Gibbs sampling is another useful approach for posterior inference based on Monte Carlo methods. It has the nice property that the Gibbs sampler converges to the true distribution given a sufficiently large number of sampling iterations. However, it is usually slow compared to variational Bayes. Nevertheless, Gibbs sampling has been applied to latent Dirichlet allocation as a convenient alternative to variational Bayes.

A.1 Latent Dirichlet Allocation

The basic principle of Gibbs sampling is to draw samples from a probability distribution conditioned on the other variables with some fixed values. In latent Dirichlet allocation, the latent variables are the topic mixture weights $\theta$ and the topic index $z_i$ for each word in a document $w_1^N$. In principle, the conditional distributions to consider are $p(\theta|z_1^N, w_1^N)$ and $p(z_i|z_{-i}, \theta, w_1^N)$ where $z_{-i}$ denotes the topic sequence of $w_1^N$ excluding $z_i$. However, using the conjugate property between a Dirichlet and a multinomial distribution, we can marginalize out $\theta$ easily so that we only need to draw samples for $z_i$. This approach is known as collapsed Gibbs sampling (Griffiths and Steyvers, 2004). Assuming that the topic-dependent unigram language model $\{p(w|k)\}$ is an unknown variable, we can derive the conditional distribution for $z_i$ in document $d$ as follows:


\begin{align}
p(z_i|w_i, z_{-i}, w_{-i}) &\propto p(w_i|z_i, z_{-i}, w_{-i}) \cdot p(z_i|z_{-i}, w_{-i}) \tag{A.1}\\
&\propto p(w_i|z_i) \cdot p(z_i|z_{-i}) \tag{A.2}
\end{align}
where
\begin{align}
p(z_i|z_{-i}) &= \int_\theta p(z_i|\theta) \cdot p(\theta|z_{-i}) \, d\theta \tag{A.3}\\
&= \int_\theta \theta_{z_i} \cdot p(\theta|z_{-i}) \, d\theta \tag{A.4}\\
&= E[\theta_{z_i}|z_{-i}] \tag{A.5}\\
&= \frac{\alpha_k + C_d(k)_{-i}}{\sum_{k'=1}^{K} \left( \alpha_{k'} + C_d(k')_{-i} \right)} \tag{A.6}\\
\text{and}\quad p(w_i = w|z_i = k) &= \frac{C(w,k)_{-i} + \xi}{C(k)_{-i} + V \cdot \xi} \tag{A.7}
\end{align}
where $C_d(k)_{-i}$ denotes the total integral count of topic $k$ according to the current values of the other topic samples $z_{j \neq i}$ in document $d$, i.e. $C_d(k)_{-i} = \sum_{j \neq i} \delta(k, z_j)$. Similarly, $C(w,k)_{-i}$ denotes the total integral count of word $w$ assigned to topic $k$ in the corpus excluding the count from word $w_i$. A small count $\xi$ is applied for simple Laplace smoothing to avoid zero probability, where $V$ denotes the size of the vocabulary. Informally, Gibbs sampling performs a leave-one-out maximum likelihood estimation of the probability distributions by holding out the $i$-th token in equation A.6 and equation A.7.
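To make the sampling updates in equations A.6 and A.7 concrete, the following is a minimal sketch of a collapsed Gibbs sampler for latent Dirichlet allocation. The corpus format, the function and variable names, the symmetric hyperparameters and the fixed number of sweeps are illustrative assumptions and not the exact implementation used in this thesis.

\begin{verbatim}
import numpy as np

def collapsed_gibbs_lda(docs, V, K, alpha=0.1, xi=0.01, n_iter=200, seed=0):
    """Collapsed Gibbs sampler for LDA (illustrative sketch).

    docs: list of documents, each a list of word ids in [0, V).
    Returns the final topic assignments and the count tables.
    """
    rng = np.random.default_rng(seed)
    C_dk = np.zeros((len(docs), K))   # C_d(k): topic counts per document
    C_wk = np.zeros((V, K))           # C(w,k): word-topic counts
    C_k = np.zeros(K)                 # C(k): topic totals
    z = [rng.integers(K, size=len(doc)) for doc in docs]  # random init

    for d, doc in enumerate(docs):    # accumulate the initial counts
        for i, w in enumerate(doc):
            k = z[d][i]
            C_dk[d, k] += 1; C_wk[w, k] += 1; C_k[k] += 1

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k_old = z[d][i]       # hold out the i-th token (leave-one-out)
                C_dk[d, k_old] -= 1; C_wk[w, k_old] -= 1; C_k[k_old] -= 1
                # p(z_i=k|.) prop. to (C(w,k)+xi)/(C(k)+V*xi) * (alpha_k + C_d(k))
                p = (C_wk[w] + xi) / (C_k + V * xi) * (alpha + C_dk[d])
                k_new = rng.choice(K, p=p / p.sum())
                z[d][i] = k_new       # put the token back with its new topic
                C_dk[d, k_new] += 1; C_wk[w, k_new] += 1; C_k[k_new] += 1
    return z, C_dk, C_wk
\end{verbatim}

The denominator of equation A.6 is constant over $k$ and is therefore absorbed into the final normalization of the sampling distribution.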

A.2 Bigram Topic Model

Similarly, the Gibbs sampling formula for the bigram topic model (Wallach, 2006) is as follows:

\begin{align}
p(z_i|w_i, z_{-i}, w_{-i}) &\propto p(w_i|w_{i-1}, z_i) \cdot p(z_i|z_{-i}) \tag{A.8}\\
&\propto p(w_i = v|w_{i-1} = u, z_i = k) \cdot p(z_i|z_{-i}) \tag{A.9}\\
&\propto \frac{C(u,v,k)_{-i} + \xi}{C(u,k)_{-i} + V \cdot \xi} \cdot \frac{\alpha_k + C_d(k)_{-i}}{\sum_{k'=1}^{K} \left( \alpha_{k'} + C_d(k')_{-i} \right)} \tag{A.10}
\end{align}
where $C(u,v,k)_{-i}$ denotes the count of bigram $(u, v)$ assigned to topic $k$ excluding the $i$-th word token.
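As a sketch of how equation A.10 changes the sampler above, only the word-emission factor differs: the counts are indexed by the bigram $(u, v)$ and by the history $u$ under each topic. The function and argument names below are illustrative assumptions.

\begin{verbatim}
import numpy as np

def bigram_topic_prob(C_uvk, C_uk, C_dk, u, v, d, V, alpha=0.1, xi=0.01):
    """Normalized p(z_i = k | .) for the bigram topic model (equation A.10).

    C_uvk[u, v]: length-K counts of bigram (u, v) under each topic,
    C_uk[u]:     length-K counts of history u under each topic,
    C_dk[d]:     length-K topic counts in document d,
    all with the i-th token already removed.
    """
    emission = (C_uvk[u, v] + xi) / (C_uk[u] + V * xi)  # p(w_i=v | w_{i-1}=u, k)
    prior = alpha + C_dk[d]                             # document-topic factor
    p = emission * prior
    return p / p.sum()
\end{verbatim}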

Appendix B

Latent Dirichlet-Tree Allocation

Similar to latent Dirichlet allocation, we apply the variational Expectation-Maximization algorithm for model training. We define the following notation for the derivation:

$\alpha_{j\cdot}$ : The Dirichlet parameter vector of the $j$-th node in a Dirichlet-Tree.

$b_{j\cdot}$ : A probability vector over the branches sampled from the $j$-th node so that $\sum_c b_{jc} = 1$, where $c$ is a branch index.

$\delta_{jc}(k)$ : A 0-1 indicator which is set to one when the path from the root node to the $k$-th leaf node passes through the $c$-th branch of the $j$-th Dirichlet node, and zero otherwise.

$\{\beta_{vk}\}$ : A shorthand for $p(v|k)$ where $v$ is the vocabulary index and $k$ is the topic index.

$\Lambda$ : The model parameters containing $\{\alpha_{j\cdot}\}$ and $\{\beta_{vk}\}$.

$w_1^N$ : The word sequence of an input document.

$M$ : The number of documents in the training corpora.

$\phi_{nk}$ : A shorthand for the variational multinomial posterior $q(z_n = k)$ of the $n$-th word assigned to topic $k$ in a document.

The latent variables are the topic assignments $z_1^N = z_1 z_2 \ldots z_N$ and the branching variables $b_1^J = b_1 b_2 \ldots b_J$, where $J$ denotes the number of nodes in the Dirichlet-Tree. Using variational Bayes, the auxiliary function to maximize is:


\begin{align}
Q(w_1^N; \Lambda, \Gamma) &= E_q\left[\log \frac{p(w_1^N, z_1^N, b_1^J; \Lambda)}{q(z_1^N, b_1^J; \Gamma)}\right] \tag{B.1}\\
&= E_q[\log p(w_1^N|z_1^N)] + E_q\left[\log \frac{p(z_1^N|b_1^J)}{q(z_1^N)}\right] + E_q\left[\log \frac{p(b_1^J)}{q(b_1^J)}\right] \tag{B.2}
\end{align}

The first term in equation B.2 is computed as follows:

\begin{align}
E_q[\log p(w_1^N|z_1^N)] &= E_q\left[\log \prod_{n=1}^{N} \beta_{w_n z_n}\right] = \sum_{n=1}^{N} E_q[\log \beta_{w_n z_n}] \tag{B.3}\\
&= \sum_{n=1}^{N} \sum_{k=1}^{K} \phi_{nk} \log \beta_{w_n k} \tag{B.4}
\end{align}
Using the following relation:

\begin{align}
p(z_n = k|b_1^J) &= \prod_{jc} b_{jc}^{\delta_{jc}(k)} \tag{B.5}\\
\Longrightarrow \log p(z_n = k|b_1^J) &= \sum_{jc} \delta_{jc}(k) \cdot \log b_{jc} \tag{B.6}
\end{align}
The second term in equation B.2 is computed as follows:

\begin{align}
E_q\left[\log \frac{p(z_1^N|b_1^J)}{q(z_1^N)}\right] &= \sum_{n=1}^{N} E_q\left[\log \frac{p(z_n|b_1^J)}{q(z_n)}\right] \tag{B.7}\\
&= \sum_{n=1}^{N} \left( E_q[\log p(z_n|b_1^J)] - E_q[\log q(z_n)] \right) \tag{B.8}\\
&= \sum_{n=1}^{N} \sum_{k=1}^{K} \phi_{nk} \left( \sum_{jc} \delta_{jc}(k) \cdot E_q[\log b_{jc}] - \log \phi_{nk} \right) \tag{B.9}
\end{align}
Using the following relation:

\begin{align}
p(b_1^J) = \prod_{j=1}^{J} \mathrm{Dirichlet}(b_j; \alpha_{j\cdot}) = \prod_{j=1}^{J} \left( \frac{\Gamma(\sum_c \alpha_{jc})}{\prod_c \Gamma(\alpha_{jc})} \cdot \prod_c b_{jc}^{\alpha_{jc}-1} \right) \tag{B.10}
\end{align}
The third term in equation B.2 is computed as follows:

\begin{align*}
E_q\left[\log \frac{p(b_1^J)}{q(b_1^J)}\right] =& \sum_{j=1}^{J} \left( \log \Gamma\Big(\sum_c \alpha_{jc}\Big) - \sum_c \log \Gamma(\alpha_{jc}) + \sum_c (\alpha_{jc} - 1)\, E_q[\log b_{jc}] \right)\\
&- \sum_{j=1}^{J} \left( \log \Gamma\Big(\sum_c \gamma_{jc}\Big) - \sum_c \log \Gamma(\gamma_{jc}) + \sum_c (\gamma_{jc} - 1)\, E_q[\log b_{jc}] \right)
\end{align*}

B.1 Variational Multinomials

We isolate the terms of the auxiliary function in equation B.2 which depend on $\phi_{nk}$. Then we introduce a Lagrange multiplier $\lambda_n$ for each $n$-th position to ensure that $\sum_{k=1}^{K} \phi_{nk} = 1$:

\begin{align}
Q_{[\phi_{nk}]} = \phi_{nk} \log \beta_{w_n k} + \phi_{nk} \left( \sum_{jc} \delta_{jc}(k) \cdot E_q[\log b_{jc}] - \log \phi_{nk} \right) + \lambda_n \phi_{nk} \tag{B.11}
\end{align}

By computing the partial derivative of $Q_{[\phi_{nk}]}$ with respect to $\phi_{nk}$, we have:

\begin{align}
\frac{\partial Q_{[\phi_{nk}]}}{\partial \phi_{nk}} &= \log \beta_{w_n k} + \sum_{jc} \delta_{jc}(k) \cdot E_q[\log b_{jc}] - \log \phi_{nk} - 1 + \lambda_n = 0 \tag{B.12}\\
\Longrightarrow \phi_{nk} &\propto \beta_{w_n k} \cdot e^{\sum_{jc} \delta_{jc}(k) \cdot E_q[\log b_{jc}]} \tag{B.13}
\end{align}

B.2 Variational Dirichlet

We isolate the terms of the auxiliary function in equation B.2 which depend on $\gamma_{jc}$.

\begin{align*}
Q_{[\gamma_{j\cdot}]} =& \sum_c \Big( \alpha_{jc} + \sum_{nk} \phi_{nk} \cdot \delta_{jc}(k) - \gamma_{jc} \Big) \cdot \Big( \Psi(\gamma_{jc}) - \Psi\Big(\sum_{c'} \gamma_{jc'}\Big) \Big)\\
&- \log \Gamma\Big(\sum_c \gamma_{jc}\Big) + \sum_c \log \Gamma(\gamma_{jc})\\
\frac{\partial Q_{[\gamma_{j\cdot}]}}{\partial \gamma_{jc}} =& \;\Psi'(\gamma_{jc}) \cdot \Big( \alpha_{jc} + \sum_{nk} \phi_{nk} \cdot \delta_{jc}(k) - \gamma_{jc} \Big)\\
&- \Psi'\Big(\sum_{c'} \gamma_{jc'}\Big) \cdot \sum_{c'} \Big( \alpha_{jc'} + \sum_{nk} \phi_{nk} \cdot \delta_{jc'}(k) - \gamma_{jc'} \Big) = 0\\
\Longrightarrow \gamma_{jc} =& \;\alpha_{jc} + \sum_{nk} \phi_{nk} \cdot \delta_{jc}(k)
\end{align*}
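Putting the closed-form updates for $\phi_{nk}$ (equation B.13) and $\gamma_{jc}$ together, the variational E-step for a single document can be sketched as follows. The data structures (a list of per-node branch parameters and 0-1 branch-to-leaf indicator matrices), the fixed number of update sweeps and the variable names are illustrative assumptions rather than the exact implementation of this thesis.

\begin{verbatim}
import numpy as np
from scipy.special import digamma

def ldta_e_step(word_ids, beta, alpha, delta, n_iter=20):
    """Variational E-step of latent Dirichlet-Tree allocation (sketch).

    word_ids: word indices of one document (length N).
    beta:     V x K matrix of topic-word probabilities p(v|k).
    alpha:    list over nodes j of branch parameters alpha[j] (length C_j).
    delta:    list over nodes j of C_j x K 0-1 indicators delta_jc(k).
    Returns the variational posteriors phi (N x K) and gamma (per node).
    """
    N, K = len(word_ids), beta.shape[1]
    phi = np.full((N, K), 1.0 / K)
    gamma = [a.copy() for a in alpha]

    for _ in range(n_iter):
        # E_q[log b_jc] = Psi(gamma_jc) - Psi(sum_c gamma_jc) for each node
        elog_b = [digamma(g) - digamma(g.sum()) for g in gamma]
        # phi_nk prop. to beta_{w_n k} * exp(sum_jc delta_jc(k) E_q[log b_jc])
        log_prior_k = sum(e @ d for e, d in zip(elog_b, delta))  # length K
        log_phi = np.log(beta[word_ids] + 1e-100) + log_prior_k
        log_phi -= log_phi.max(axis=1, keepdims=True)
        phi = np.exp(log_phi)
        phi /= phi.sum(axis=1, keepdims=True)
        # gamma_jc = alpha_jc + sum_{n,k} phi_nk delta_jc(k)
        expected_k = phi.sum(axis=0)
        gamma = [a + d @ expected_k for a, d in zip(alpha, delta)]
    return phi, gamma
\end{verbatim}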

B.3 Conditional Multinomials

We isolate the terms of the auxiliary function in equation B.2 which only depend on $\beta_{vk}$. By considering all training documents indexed by $d$ and introducing Lagrange multipliers $\lambda_k$ to enforce $\sum_{v=1}^{V} \beta_{vk} = 1$ for all $k$, we get:

\begin{align}
Q_{[\beta]} = \sum_{d=1}^{M} \sum_{n=1}^{N_d} \sum_{k=1}^{K} \phi_{dnk} \log \beta_{w_{dn} k} + \sum_{k=1}^{K} \lambda_k \Big( \sum_{v=1}^{V} \beta_{vk} - 1 \Big) \tag{B.14}
\end{align}

Taking the derivative with respect to $\beta_{vk}$ and setting it to zero, we get:

\begin{align}
\beta_{vk} \propto \sum_{d=1}^{M} \sum_{n=1}^{N_d} \phi_{dnk} \cdot \delta(w_{dn}, v) \tag{B.15}
\end{align}
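A sketch of the corresponding M-step for the topic-word multinomials, which accumulates equation B.15 over all training documents. The names mirror the E-step sketch above, and the small smoothing constant is an added assumption to avoid zero probabilities.

\begin{verbatim}
import numpy as np

def ldta_m_step_beta(docs, phis, V, K, smooth=1e-6):
    """Re-estimate beta_vk from equation B.15 (illustrative sketch).

    docs: list of word-id arrays; phis: list of matching N_d x K phi matrices.
    """
    counts = np.full((V, K), smooth)      # tiny smoothing against zeros
    for word_ids, phi in zip(docs, phis):
        np.add.at(counts, word_ids, phi)  # add phi_dn. onto row w_dn
    return counts / counts.sum(axis=0, keepdims=True)
\end{verbatim}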

B.4 Dirichlet Node in a Dirichlet-Tree

The terms of the auxiliary function which contain $\alpha_{j\cdot}$ of the $j$-th node in a Dirichlet-Tree are:

\begin{align*}
Q_{[\alpha_{j\cdot}]} = \sum_{d=1}^{M} \left( \log \Gamma\Big(\sum_c \alpha_{jc}\Big) - \sum_c \log \Gamma(\alpha_{jc}) + \sum_c (\alpha_{jc} - 1) \cdot \Big( \Psi(\gamma_{djc}) - \Psi\Big(\sum_{c'} \gamma_{djc'}\Big) \Big) \right)
\end{align*}

Taking the derivative with respect to $\alpha_{jc}$ gives:

\begin{align*}
\frac{\partial Q_{[\alpha_{j\cdot}]}}{\partial \alpha_{jc}} = -M \cdot \left( \Psi(\alpha_{jc}) - \Psi\Big(\sum_{c'} \alpha_{jc'}\Big) \right) + \sum_{d=1}^{M} \left( \Psi(\gamma_{djc}) - \Psi\Big(\sum_{c'} \gamma_{djc'}\Big) \right)
\end{align*}

Simple constrained gradient ascent can be applied to ensure that $\alpha_{jc} > 0$ via the parameter transformation $\tilde{\alpha}_{jc} = \log \alpha_{jc}$.
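The gradient above can be followed directly in the log domain. Below is a minimal sketch of that update for a single Dirichlet node, assuming the per-document variational parameters of that node have already been collected; the step size and the iteration count are illustrative choices.

\begin{verbatim}
import numpy as np
from scipy.special import digamma

def update_alpha_node(alpha_j, gamma_dj, lr=0.01, n_iter=100):
    """Gradient ascent on log(alpha_jc) for one Dirichlet-Tree node (sketch).

    alpha_j:  length-C branch parameters of node j.
    gamma_dj: M x C variational Dirichlet parameters of node j (one row per doc).
    """
    log_alpha = np.log(alpha_j)
    M = gamma_dj.shape[0]
    for _ in range(n_iter):
        alpha = np.exp(log_alpha)
        # dQ/d alpha_jc = -M (Psi(alpha_jc) - Psi(sum_c alpha_jc))
        #                 + sum_d (Psi(gamma_djc) - Psi(sum_c gamma_djc))
        grad = (-M * (digamma(alpha) - digamma(alpha.sum()))
                + (digamma(gamma_dj)
                   - digamma(gamma_dj.sum(axis=1, keepdims=True))).sum(axis=0))
        log_alpha += lr * grad * alpha    # chain rule for the log transform
    return np.exp(log_alpha)
\end{verbatim}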

B.5 Alternative Proof

We can use the generic solution of the variational E-steps to derive the formula for latent Dirichlet-Tree allocation. Recall that the generic solution has the following form:

\begin{align}
q(z_j) \propto e^{E_q[\log p(X, Z; \Lambda)]_{\setminus z_j}} \tag{B.16}
\end{align}

where the expectation is taken over all the other latent variables $\{z_i\}$ excluding $z_j$. Instead of considering the full joint distribution for the expectation in equation B.16, only a subset of the local conditional distributions involving $z_j$ is actually needed, that is $p(z_j|\cdot)$ or $p(\cdot|\cdots z_j \cdots)$.

The remaining conditional distributions which do not contain $z_j$ are cancelled out due to probability normalization.

In latent Dirichlet-Tree allocation, the observation is the word sequence $w_1^N$, and the latent variables are the topic assignments $z_1^N$ and the branching probabilities of each node in the Dirichlet-Tree $b_1^J$. Using the relation $p(z|b_1^J) = \prod_{jc} b_{jc}^{\delta_{jc}(z)}$ and applying equation B.16, the variational posterior of a Dirichlet node $j$ is:

N K where γjc = αjc + i=1 k=1 q(zi = k)δjc(k). Similarly, the variational posterior of a topic assignment zi forP wordP wi is:

\begin{align}
q(z_i) &\propto e^{E_q[\log p(z_i|b_1^J)]_{\setminus z_i} + E_q[\log p(w_i|z_i)]_{\setminus z_i}} \tag{B.24}\\
&= e^{\sum_{jc} \delta_{jc}(k) E_q[\log b_{jc}] + \log p(w_i|z_i)} \tag{B.25}\\
&\propto p(w_i|z_i)\, e^{\sum_{jc} \delta_{jc}(k) E_q[\log b_{jc}]} \tag{B.26}
\end{align}

Bibliography

Y. Akita and T. Kawahara. Language model adaptation based on PLSA of topics and speakers. In Proceedings of International Conference on Spoken Language Processing, pages 1045–1048, Jeju Island, Korea, October 2004.

J. Bellegarda. Latent semantic mapping: Dimensionality reduction via globally optimal continuous parameter modeling. In Proceedings of IEEE Automatic Speech Recognition and Understanding Workshop, pages 127–132, San Juan, Puerto Rico, November 2005.

J. Bellegarda. Exploiting latent semantic information in statistical language modeling. IEEE Transactions on Acoustics, Speech and Signal Processing, 88(8):63–75, August 2000.

Y. Bengio, R. Ducharme, and P. Vincent. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137–1155, 2003.

M. Bisani and H. Ney. Joint-sequence models for grapheme-to-phoneme conversion. Speech Communication, 50:434–451, 2008.

C. Bishop, D. Spiegelhalter, and J. Winn. VIBES: A variational inference engine for Bayesian networks. In Advances in Neural Information Processing Systems, volume 15, pages 777–784, 2003.

D. Blei and M. Jordan. Modeling annotated data. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 127–134, 2003.

D. Blei and J. Lafferty. Correlated topic models. In Advances in Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 2005.

D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. In Journal of Machine Learning Research, pages 1107–1135, 2003.

P. F. Brown, V. J. D Pietra, P. V. deSouza, J. C. Lai, and R. L. Mercer. Class-based N-gram models of natural language. Computational Linguistics, 18(4):467–479, 1992.

P. F. Brown, S. D. Pietra, V. J. D Pietra, and R. L. Mercer. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263–311, 1994.

O. Chapelle, B. Schölkopf, and A. Zien, editors. Semi-Supervised Learning. MIT Press, Cambridge, MA, 2006.

S. Chen, K. Seymore, and R. Rosenfeld. Topic adaptation for language modeling using unnormalized exponential models. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 2, pages 681–684, Seattle, Washington, USA, May 1998.

J. T. Chien and C. H. Chueh. Latent Dirichlet language model for speech recognition. In IEEE Workshop on Spoken Language Technology, pages 201–204, Goa, India, December 2008.

P. Clarkson and T. Robinson. The applicability of adaptive language modeling for the broadcast news task. In Proceedings of International Conference on Spoken Language Processing, pages 233–236, Sydney, Australia, November–December 1998.

P. R. Clarkson and A. J. Robinson. Language model adaptation using mixtures and an exponentially decaying cache. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 2, pages 799–802, Munich, Bavaria, Germany, April 1997.

N. Coccaro and D. Jurafsky. Towards better integration of semantic predictors in statistical language modeling. In Proceedings of International Conference on Spoken Language Processing, Sydney, Australia, November–December 1998.

J. N. Darroch and D. Ratcliff. Generalized iterative scaling for log-linear models. The Annals of Mathematical Statistics, 43(5):1470–1480, 1972.

S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman. Indexing by latent semantic analysis. Journal of the American Society of Information Science, 41(6):391–407, 1990.

G. Doddington. Automatic evaluation of MT quality using N-gram co-occurrence statistics. In Proceedings of the Human Language Technology Conference, pages 138–145, San Diego, California, USA, March 2002.

R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. John Wiley and Sons, Inc., second edition, 2001.

M. Eck, S. Vogel, and A. Waibel. Language model adaptation for statistical machine translation based on information retrieval. In Proceedings of LREC, Lisbon, Portugal, May 2004.

M. Federico. Language model adaptation through topic decomposition and MDI estimation. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 1, pages 773–776, Orlando, Florida, USA, May 2002.

M. Finke, J. Fritsch, P. Geutner, K. Ries, T. Zeppenfeld, and A. Waibel. The JanusRTk Switchboard/Callhome 1997 Evaluation System. In Proceedings of the LVCSR Hub5-e Workshop, 1997.

M. J. F Gales. Maximum likelihood linear transformation for HMM-based speech recog- nition. Technical Report CUED/F-INFENG/TR291, Cambridge University, 1997.

M. J. F Gales. Semi-tied covariance matrices for hidden Markov models. IEEE Transactions on Acoustics, Speech and Signal Processing, 7(3):272–281, May 1999.

D. Gildea and T. Hofmann. Topic-based language models using EM. In Proceedings of European Conference on Speech Communication and Technology, pages 2167–2170, Budapest, Hungary, September 1999.

L. Gillick and S. Cox. Some statistical issues in the comparison of speech recognition algorithms. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 1, pages 532–535, Glasgow, Scotland, May 1989.

T. Griffiths and M. Steyvers. Finding scientific topics. In Proceedings of National Academy of Science, volume 101(1), pages 5228–5235, 2004.

A. Heidel, H. Chang, and L. Lee. Language model adaptation using latent Dirichlet allocation and an efficient topic inference algorithm. In Proceedings of Interspeech, pages 2361–2364, Antwerp, Belgium, August 2007.

A. Hildebrand, K. Rottmann, T. Notari, Q. Gao, S. Hewavitharana, N. Bach, and S. Vogel. Recent improvements in the CMU large-scale Chinese-English SMT system. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 77–80, The Ohio State University, Columbus, Ohio, USA, June 2008.

H. Hsiao and T. Schultz. Generalized discriminative feature transformation for speech recognition. In Proceedings of Interspeech, Brighton, UK, September 2009.

H. Hsiao, M. Fuhs, Y. C. Tam, Q. Jin, and T. Schultz. The CMU-InterACT 2008 Mandarin transcription system. In Proceedings of Interspeech, pages 1445–1448, Brisbane, Australia, September 2008.

H. Hsiao, Y. C. Tam, and T. Schultz. Generalized Baum-Welch algorithm for discriminative training on large vocabulary continuous speech recognition system. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pages 3769–3772, Taipei, Taiwan, April 2009.

B. J. Hsu and J. Glass. Style and topic language model adaptation using HMM-LDA. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 373–381, Sydney, Australia, July 2006. Association for Computational Linguistics.

S. Y. Dennis III. On the hyper-Dirichlet type 1 and hyper-Liouville distributions. Communications in Statistics – Theory and Methods, 20(12):4069–4081, 1991.

R. Iyer and M. Ostendorf. Modeling long distance dependence in language: Topic mixtures versus dynamic cache models. IEEE Transactions on Speech and Audio Processing, 7(1):30–39, January 1999.

Qin Jin and Tanja Schultz. Speaker segmentation and clustering in meetings. In Proceedings of the International Conference on Spoken Language Processing, pages 597–600, Jeju Island, Korea, October 2004.

M. I. Jordan, Z. Ghahramani, T. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. In Machine Learning, volume 37(2), pages 183–233, 1999.

W. Kim. Language model adaptation for automatic speech recognition and statistical machine translation. PhD thesis, Johns Hopkins University, 2004. Distributed by JHU.

W. Kim and S. Khudanpur. Language model adaptation using cross-lingual information. In Proceedings of European Conference on Speech Communication and Technology, pages 3129–3132, Geneva, Switzerland, September 2003.

W. Kim and S. Khudanpur. Lexical triggers and latent semantic analysis for cross-lingual language model adaptation. ACM Transactions on Asian Language Information Processing, 3(2):94–112, June 2004.

R. Kneser and H. Ney. Improved backing-off for M-gram language modeling. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 1, pages 181–184, Detroit, Michigan, USA, May 1995.

R. Kneser, J. Peters, and D. Klakow. Language model adaptation using dynamic marginals. In Proceedings of European Conference on Speech Communication and Technology, pages 1971–1974, Rhodes, Greece, September 1997.

P. Koehn, F. J. Och, and D. Marcu. Statistical phrase based translation. In Proceedings of the Joint Conference on Human Language Technologies and the Annual Meeting of the North American Chapter of the Association of Computational Linguistics (HLT/NAACL), pages 48–54, Edmonton, Canada, May 2003.

P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, and E. Herbst. Moses: Open source toolkit for statistical machine translation. In ACL 2007: proceedings of demo and poster sessions, pages 177–180, Prague, Czech Republic, June 2007.

R. Kuhn and R. De Mori. A cache-based natural language model for speech recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1990.

C. J. Leggetter and P. C. Woodland. Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models. Computer Speech & Language, 9(2):171–185, April 1995.

W. Li and A. McCallum. Pachinko allocation: DAG-structured mixture models of topic correlations. In Proceedings of the 23rd international conference on Machine learning, pages 577–584. ACM, 2006.

Y. Liu and F. Liu. Unsupervised language model adaptation via topic modeling based on named entity hypotheses. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pages 4921–4924, Las Vegas, Nevada, USA, April 2008.

M. Mahajan, D. Beeferman, and X. D. Huang. Improved topic-dependent language modeling using information retrieval techniques. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, Seattle, Washington, May 1999.

U. Manber and E. W. Myers. Suffix arrays: A new method for on-line string searches. In SIAM Journal on Computing, pages 935–948, 1993.

T. Minka. The Dirichlet-Tree distribution (Technical Report), 1999.

D. Mrva and P. C. Woodland. A PLSA-based language model for conversational telephone speech. In Proceedings of International Conference on Spoken Language Processing, pages 2257–2260, Jeju Island, Korea, October 2004.

M. Noamany, T. Schaaf, and T. Schultz. Advances in the CMU-InterACT Arabic Gale transcription system. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics, pages 129–132, Rochester, New York, USA, April 2007. Association for Computational Linguistics.

F. J. Och. Minimum error rate training in statistical machine translation. In Erhard Hinrichs and Dan Roth, editors, Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pages 160–167, Sapporo, Japan, 2003.

F. J. Och and H. Ney. Discriminative training and maximum entropy models for statistical machine translation. In Proceedings of the 40th Annual meeting of the Association for Computational Linguistics, pages 295–302, Philadelphia, Pennsylvania, USA, July 2002.

F. J. Och and H. Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51, 2003.

F. J. Och, C. Tillmann, and H. Ney. Improved alignment models for statistical machine translation. In Pascale Fung and Joe Zhou, editors, Proceedings of the Joint SIGDAT Conference of Empirical Methods in Natural Language Processing and Very Large Corpora, pages 20–28, University of Maryland, College Park, MD, USA, June 1999.

D. Pallett, W. Fisher, and J. Fiscus. Tools for the analysis of benchmark speech recognition tests. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 1, pages 97–100, Albuquerque, New Mexico, USA, April 1990.

K. Papineni, S. Roukos, T. Ward, and W. Zhu. BLEU: A method for automatic evaluation of machine translation. In Proceedings of 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA, July 2002. Association for Computational Linguistics.

M. Paulik, C. Fügen, T. Schaaf, T. Schultz, S. Stüker, and A. Waibel. Document driven machine translation enhanced automatic speech recognition. In Proceedings of Interspeech, 2005a.

M. Paulik, S. Stüker, C. Fügen, T. Schultz, T. Schaaf, and A. Waibel. Speech Translation Enhanced Automatic Speech Recognition. In Automatic Speech Recognition and Understanding Workshop, 2005b.

I. Porteous, A. Ascuncion, D. Newman, A. Ihler, P. Smyth, and M. Welling. Fast collapsed Gibbs sampling for latent Dirichlet allocation. In KDD ’08: Proceeding of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 569–577, New York, NY, USA, 2008. ACM.

D. Povey. Discriminative Training for Large Vocabulary Speech Recognition. PhD thesis, Cambridge University Engineering Dept, 2003.

D. Povey, D. Kanevsky, B. Kingsbury, B. Ramabhadran, G. Saon, and K. Visweswariah. Boosted MMI for model and feature-space discriminative training. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pages 4057–4060, Las Vegas, Nevada, USA, April 2008.

L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, February 1989.

I. Rogina. Automatic architecture design by likelihood-based context clustering with crossvalidation. In Proceedings of the European Conference on Speech Communication and Technology, pages 1223–1226, Rhodes, Greece, September 1997.

R. Rosenfeld. Adaptive Statistical Language Modeling: A Maximum Entropy Approach. PhD thesis, School of Computer Science, Carnegie Mellon University, Pittsburgh, April 1994.

K. Rottmann and S. Vogel. Word reordering in statistical machine translation with a POS-based distortion model. In Proceedings of the 11th International Conference on Theoretical and Methodological Issues in Machine Translation, pages 171–180, Skövde, Sweden, September 2007.

H. Schwenk. Continuous space language models. Journal of Computer Speech and Language, 3(21):492–518, 2007.

H. Soltau, F. Metze, C. Fügen, and A. Waibel. A one-pass decoder based on polymorphic linguistic context. In Automatic Speech Recognition and Understanding Workshop, 2001.

A. Stolcke. SRILM - an extensible language modeling toolkit. In Proceedings of International Conference on Spoken Language Processing, pages 901–904, Denver, Colorado, USA, September 2002.

Y. C. Tam and T. Schultz. Language model adaptation using variational Bayes inference. In Proceedings of Interspeech, pages 5–8, Lisbon, Portugal, September 2005.

Y. C. Tam and T. Schultz. Unsupervised language model adaptation using latent semantic marginals. In Proceedings of Interspeech, Pittsburgh, PA, September 2006.

Y. C. Tam and T. Schultz. Bilingual LSA-based translation lexicon adaptation for spoken language translation. In Proceedings of Interspeech, pages 2461–2464, Antwerp, Belgium, August 2007a.

Y. C. Tam and T. Schultz. Correlated latent semantic model for unsupervised language model adaptation. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 4, pages 41–44, Honolulu, Hawaii, USA, April 2007b.

Y. C. Tam and T. Schultz. Correlated bigram LSA for unsupervised LM adaptation. In Advances in Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 2008.

Y. C. Tam and T. Schultz. Incorporating monolingual corpora into bilingual latent semantic analysis for crosslingual LM adaptation. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pages 4821–4824, Taipei, Taiwan, April 2009.

Y. C. Tam, I. Lane, and T. Schultz. Bilingual LSA-based LM adaptation for spoken language translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, pages 520–527, Prague, Czech Republic, June 2007a.

Y. C. Tam, I. Lane, and T. Schultz. Bilingual LSA-based adaptation for statistical machine translation. Machine Translation, 21(4):187–207, December 2007b.

H. Tseng, P. Chang, G. Andrew, D. Jurafsky, and C. Manning. A conditional random field word segmenter. In Proceedings of the Fourth SIGHAN Workshop on Chinese Language Processing, 2005.

V. Valtchev, J. J. Odell, P. C. Woodland, and S. J. Young. MMIE training of large vocabulary speech recognition systems. Speech Communication, 22:303–314, 1997.

S. Vogel. PESA: Phrase pair extraction as sentence splitting. In MT Summit X, pages 251–258, Phuket, Thailand, September 2005.

S. Vogel, H. Ney, and C. Tillmann. HMM-based word alignment in statistical translation. In The 16th International Conference on Computational Linguistics, pages 836–841, Center for Sprogteknologi, Copenhagen, August 1996.

S. Vogel, Y. Zhang, F. Huang, A. Tribble, A. Venogupal, B. Zhao, and A. Waibel. The CMU statistical translation system. In MT Summit IX, pages 402–409, New Orleans, USA, September 2003.

H. M. Wallach. Topic modeling: Beyond bag-of-words. In Proceedings of the 23rd International Conference on Machine Learning, pages 977–984, New York, NY, USA, 2006. ACM Press.

I. Witten and T. Bell. The zero-frequency problem: Estimating the probabilities of novel events in adaptive text compression. IEEE Transactions on Information Theory, 37(4):1085–1094, 1991.

J. Wu and S. Khudanpur. Building a topic-dependent maximum entropy language model for very large corpora. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 1, pages 777–780, Orlando, Florida, USA, May 2002.

P. Xu, A. Emami, and F. Jelinek. Training connectionist models for the structured language model. In Proceedings of Empirical Methods on Natural Language Processing, Sapporo, Japan, July 2003.

H. Yu, Y. C. Tam, T. Schaaf, S. Stüker, Q. Jin, M. Noamany, and T. Schultz. The ISL RT04 Mandarin Broadcast News Evaluation System. In EARS Rich Transcription Workshop, 2004.

P. Zhan and M. Westphal. Speaker normalization based on frequency warping. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 2, pages 1039–1042, Munich, Bavaria, Germany, April 1997.

Y. Zhang and S. Vogel. Measuring confidence intervals for the machine translation evaluation metrics. In Proceedings of the Tenth Conference on Theoretical and Methodological Issues in Machine Translation, pages 85–94, Baltimore, Maryland, USA, October 2004.

Y. Zhang and S. Vogel. Suffix array and its applications in empirical natural language processing. In Technical Report CMU-LTI-06-010, Carnegie Mellon University, Pittsburgh, PA, USA, 2006.

B. Zhao and E. P. Xing. BiTAM: Bilingual topic admixture models for word alignment. In Proceedings of the Coling/ACL, pages 969–976, Sydney, Australia, July 2006.

B. Zhao and E. P. Xing. HM-BiTAM: Bilingual topic exploration, word alignment, and translation. In Advances in Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 2007.

B. Zhao, M. Eck, and S. Vogel. Language model adaptation for statistical machine translation via structured query models. In Proceedings of Coling, Geneva, Switzerland, August 2004.
