SpeechQoogle: An Open-Domain Question Answering System with Speech Interface

Guoping Hu1, Dan Liu2, Qingfeng Liu2, Renhua Wang1

1. iFly Speech Lab, University of Science and Technology of China, Hefei 2. Research of Anhui USTC iFlyTEK Co., Ltd., Hefei [email protected]

Abstract. In this paper, we propose a new and valuable research task: an open-domain question answering system with speech interface. A first prototype, SpeechQoogle, is constructed from three separate modules: speech recognition, question answering (QA) and speech synthesis. The speech interface improves the utility of a QA system, but it also brings several new challenges: 1) the distorting effect of speech recognition errors; 2) only one answer can be returned; and 3) the returned answer must be understandable when presented purely in speech. To meet these challenges in SpeechQoogle's construction, we first carefully choose the FAQ-based question answering technique for the QA module because of its inherent advantages, and collect 600,000 QA pairs to support it. We then develop a dedicated acoustic model and language model for the speech recognition module, which raises the character accuracy (ACC) to 87.17%. Finally, in open-set testing the integrated prototype successfully answers 56.25% of spoken questions, a satisfying and inspiring performance given that many potential improvements remain unexploited.

Keywords: Question Answering, Speech Interface, Speech Recognition, Speech Synthesis

1 Introduction

Question answering (QA) is the task of answering humans' natural language questions. The task has received a great deal of attention and great progress has been achieved in recent years, especially since the launch of the QA track at TREC in 1999. Introducing a speech interface into a QA system promises to improve its utility in two respects. First, the user can communicate with the system in the most convenient way: spoken language. Second, the speech interface enables the QA system to be accessed through a telephone or mobile phone from anywhere at any time, while a traditional QA system can only be demonstrated or put into practical use through an input textbox on a web page where the user types the query in natural language.

Motivated by these advantages, in this paper we construct an open-domain question answering system with speech interface, named SpeechQoogle, with three separate modules: speech recognition, question answering and speech synthesis. To the best of our knowledge, SpeechQoogle is the first prototype of a QA system with speech interface.

Together with the improvement in convenience and availability, several new challenges are embedded in a question answering system with speech interface: 1) the distorting effect of speech recognition errors; 2) only one answer candidate can be replied to the user; and 3) the answer should be syntactically and semantically understandable purely in speech. These three challenges defeat traditional question answering techniques, including question type prediction and matching, syntactic or semantic parsing and other question understanding techniques.

Frequently asked questions (FAQ) based QA was proposed in recent years. In this technique, question answering is recast as a retrieval problem: given a question, the most relevant QA pair is retrieved from a large-scale collection of QA pairs, and the answer part of the retrieved pair is returned to the user as the final answer. Obviously, this FAQ-based technique can generate answers whose syntactic and semantic integrity is ensured, and it can tolerate more speech recognition errors. The scale and quality of the QA pair collection is the most important factor for FAQ-based QA. Fortunately, the popularity of question/answering websites gives us a very easy way to collect QA pairs. In our experiments, about 600,000 high-quality Chinese QA pairs are collected from just one question/answering website: http://zhidao.baidu.com. With this support, our question answering module successfully answers 72.50% of text questions in open-set testing.

Speech recognition performance remains the biggest problem even when a proper question answering technique is employed. Therefore, a dedicated acoustic model and language model are developed for the speech recognition module. We trained the acoustic model on a collected spontaneous speech corpus and the language model on the question part of the QA pair collection. A recognition experiment on 400 spoken questions shows that the character ACC reaches 87.17%, which is almost high enough to support the QA module.

Several powerful toolkits are employed to build the final SpeechQoogle prototype. Lucene.NET [1] serves as the retrieval engine in the question answering module, Julius [14] serves as the LVCSR engine in the speech recognition module, and the Mandarin TTS system InterPhonic 4.0 [22] is utilized in the speech synthesis module.
Experimental results show that 56.25% of spoken questions can be successfully answered by SpeechQoogle, which is a satisfying and inspiring responding performance. The rest of the paper is organized as follows. We review related work in the next section. Then we discuss both the values and the challenges embedded in a question answering system with speech interface in section 3. The architecture of the SpeechQoogle system is described in section 4. Section 5 presents the experimental results. Finally, we conclude and summarize future work in section 6.

2 Related Work

A large body of research work relates to question answering systems with speech interface. We categorize it into three aspects.

2.1 Text-based Question Answering

Text-based question answering has received a great deal of attention in recent years, especially since the launch of the QA track at TREC in 1999 [19]. In early years, the QA task was generally restricted to answering factoid questions [5, 8, 11, 15], while answering questions beyond the factoid was proposed and researched in the last few years [13, 17]. The work in [13] presents a novel and practical approach to answering natural language questions by using the large number of FAQ pages available on the web. What we want to build is also an open-domain, non-factoid question answering system, so we adopt this FAQ-based technique in the question answering module of our SpeechQoogle system. Another trend in question answering is the popularity of question answering websites, e.g., http://zhidao.baidu.com and http://iask.sina.com.cn. These websites provide platforms for millions of web users to share their knowledge: someone posts a question on the web and waits for someone else to answer it. All the question/answering histories are available on the websites, and we utilize these resources to build our QA pair collection.

2.2 Speech-Driven Web Retrieval and Speech-Driven Question Answering

Speech-driven web retrieval aims to retrieve needed information from the web via spoken queries. This research is mostly motivated by the expectation of providing a solution for accessing the internet just through a telephone or mobile device. A speech-driven web retrieval system first employs speech recognition to transcribe the spoken query into a keyword sequence, and these keywords are then sent to a traditional web retrieval system. How to return the retrieval result to the user is either left as future work or solved by displaying the few top-ranked short snippets or titles on the small screen of a mobile device. Several methods have been proposed to improve speech recognition performance, including language model adaptation [10], mobile channel processing [6] and extracting key semantic terms [18].

Speech-driven question answering was proposed for Japanese, and a series of papers has been published [2, 3, 4] since 2002. In their work, speech recognition is integrated into a traditional QA system to accept spoken questions, but the output of their system is still several ranked text snippets, while in SpeechQoogle only the best answer is synthesized into speech and returned to the user. Without the speech output limitation, their question answering technique remains based on relevant passage retrieval and answer extraction, while the FAQ-based QA technique is carefully chosen for SpeechQoogle to satisfy that limitation. Another difference between their work and SpeechQoogle is that their research focuses only on language model tuning, while we develop both a dedicated acoustic model and a dedicated language model to improve speech recognition performance, and we also investigate and employ more techniques in language model training.

2.3 Spoken Dialog System and Speech-to-Speech Translation System

Spoken dialog systems [21] and speech-to-speech translation systems [16] have been intensively researched in the field of spoken language processing. However, state-of-the-art dialog systems only operate on specific-domain question answering dialogs, and speech-to-speech translation systems likewise work only in very narrow domains, such as booking a room or asking the way, whereas a question answering system with speech interface, including our SpeechQoogle, is expected to handle open-domain questions. The basic idea underlying a spoken dialog system or a speech-to-speech translation system is to understand specific-domain questions according to rules or syntactic and semantic parsers, while we simply recast question answering as a retrieval problem over a prepared large collection of QA pairs.

3 Values and Challenges Embedded in Question Answering System with Speech Interface

A question answering system with speech interface can simply be regarded as the integration of a speech interface into a traditional text-based question answering system, but this integration brings several additional and far-reaching values. First, the speech interface enables the user to communicate with the system in the most natural and convenient way: spoken language. Second, a question answering system with speech interface can serve users who are illiterate or unable to write. Third, the new QA system can work over a speech communication channel, that is, it can be accessed by telephone or mobile phone. This enables users who are not familiar with computers or the internet to operate a QA system easily, which is especially important because there are about 700 million telephone and mobile customers in China, while the number of internet users is only about 100 million.

An ideal question answering system with speech interface should be able to answer users' questions just through 'talking': it should utilize the conversation history and try to obtain additional information from users through automatically generated disambiguating queries. Obviously, such an ideal system is still far away. Considering the current limitations of speech recognition and language understanding technologies, we reduce the complexity of the system by assuming the following conditions are satisfied:

1) Context independency: Similar to traditional question answering, we assume the user can entirely define his/her question in one or two short sentences (5~40 syllables), and the system only needs to analyze the input spoken sentences and generate the corresponding answer. Neither communication history nor additional information needs to be considered.

2) Questions on common knowledge: Still similar to traditional question answering, the expected questions are questions on common knowledge such as "how long is the Yellow River?" but not questions such as "is it going to rain tomorrow?"

3) High-quality speech: We assume the users of the system are quite cooperative, with clear and standard pronunciation, a quiet operating environment, and speech collected and transmitted in a high-quality format and channel.

These assumptions greatly increase the chance of building a practical question answering system with speech interface. But the task is still very difficult because the following challenges remain:

1) The speech input method: Large vocabulary continuous speech recognition (LVCSR) must be employed to handle spoken language input, but LVCSR is still far from being a fully reliable technique. Therefore, how to improve speech recognition performance for the question answering setting is an important research task.

2) The inherent conflict between speech recognition and question answering: speech recognition always tries to select the hypothesis with maximum likelihood, while for question answering the words with lower probability carry more information (made concrete by the sketch after this list). That is to say, the words that contribute most to question answering have the lowest probabilities of being correctly transcribed. This conflict is inherent and almost unsolvable; the only way to relieve it is to push speech recognition accuracy as high as possible.
3) The speech output method: Speech can only be output serially, at a speed limited to about 4~6 syllables per second. Therefore almost only one answer candidate can be returned to the user, which poses a great challenge to the ranking function of the question answering module. Furthermore, the single returned answer must have syntactic and semantic integrity to ensure that it is understandable purely in speech.

4) The coverage of the QA pair collection: We want to build an open-domain question answering system, so millions of different questions could be sent to it. Since the employed question answering technique retrieves relevant questions from a prepared QA pair collection, only covered questions can be answered well. Therefore, how to improve the coverage of the QA pair collection is an essential challenge for the question answering system.
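To make challenge 2 concrete, the following minimal Python sketch computes idf over a toy set of document frequencies. The counts are invented for illustration and only the collection size mirrors our 600,000-pair scale: rare content words receive the highest idf, i.e., the highest retrieval weight, yet they are exactly the words a maximum-likelihood recognizer is most likely to miss.

```python
import math

# Toy document frequencies; the counts below are invented for illustration.
# Function words appear almost everywhere, the informative entity is rare.
N = 600_000  # hypothetical collection size, matching our QA pair scale
doc_freq = {"how": 150_000, "long": 40_000, "is": 300_000,
            "the": 450_000, "yellow": 900, "river": 2_500}

def idf(word: str) -> float:
    """Inverse document frequency: idf(w) = log(N / df(w))."""
    return math.log(N / doc_freq[word])

for w in sorted(doc_freq, key=idf, reverse=True):
    print(f"{w:8s} idf = {idf(w):.2f}")
# 'yellow' and 'river' dominate retrieval, yet a recognizer's language
# model assigns them the lowest prior probability: the conflict above.
```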

4 SpeechQoogle System Description

In this section, we describe our question answering system with speech interface, named SpeechQoogle. The architecture of SpeechQoogle is presented in Figure 1; a minimal code sketch of the pipeline follows the figure. Three separate modules handle the successive stages of the system's pipeline. The first module is speech recognition, in which the spoken question from the user is transcribed into a text question. The second module is question answering, in which the text question is treated as a retrieval query, the most relevant question-answer pair is retrieved from a large-scale collection of QA pairs, and the answer part of the retrieved pair is regarded as the text answer to the user's spoken question. The final module is speech synthesis, in which the text answer is synthesized into a speech answer and returned to the user.

[Figure 1: pipeline diagram. Spoken Question → Speech Recognition → Text Question → Question Answering (backed by the QA Pair Collection) → Text Answer → Speech Synthesis → Speech Answer]

Fig. 1. Architecture of SpeechQoogle.
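To fix ideas, here is a minimal Python sketch of the pipeline in Figure 1. The three helper functions are placeholders standing in for the actual engines (Julius, the Lucene.NET-based retrieval module and InterPhonic 4.0); their names and signatures are illustrative assumptions, not real APIs.

```python
def recognize(spoken_question_wav: bytes) -> str:
    """Speech recognition: transcribe the spoken question (placeholder)."""
    raise NotImplementedError  # stands in for the Julius LVCSR engine

def answer(text_question: str) -> str:
    """Question answering: retrieve the most relevant QA pair and return
    its answer part (placeholder for the Lucene.NET-based module)."""
    raise NotImplementedError

def synthesize(text_answer: str) -> bytes:
    """Speech synthesis: render the text answer as speech (placeholder
    for InterPhonic 4.0)."""
    raise NotImplementedError

def speech_qoogle(spoken_question_wav: bytes) -> bytes:
    """The three-stage pipeline of Figure 1."""
    text_question = recognize(spoken_question_wav)
    text_answer = answer(text_question)
    return synthesize(text_answer)
```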

4.1 Speech Recognition

Speech recognition is employed to transcribe the spoken question into Chinese characters. Obviously, large vocabulary continuous speech recognition (LVCSR) is needed for the SpeechQoogle system, and according to our former experience, a dedicated acoustic model and language model should be developed for the LVCSR system to ensure recognition performance.

We adopted the traditional HMM-based acoustic model training algorithm, a mature technique powered by the HTK toolkit [20]. What deserves more attention is the speech corpus for model training. Various speech corpora have been developed over the long research history of speech recognition, and most of them are recorded in reading style, but what we need in SpeechQoogle is spontaneous style. Therefore a spontaneous speech corpus was recorded and annotated to train a matching acoustic model for SpeechQoogle.

We also adopted the traditional N-gram language modeling technique. A special effort in language model development is the preparation of the training corpus. Obviously, the text we want to model in SpeechQoogle consists of question sentences, so a traditional newspaper corpus does not help. Since a large-scale collection of QA pairs is gathered to support the question answering module, we simply take the question parts of these QA pairs and build an exactly matched LM training corpus (see the sketch at the end of this subsection). Word segmentation based on a large dictionary of 270,000 entries [7] is employed to split the text corpus into words, together with automatic proper name recognition. Finally, the CMU-Cambridge statistical language modeling toolkit [9] is used to build the language model.

Julius is employed as the LVCSR engine in the speech recognition module, with the acoustic model and language model specially developed for SpeechQoogle. The language model weight and penalty are also carefully tuned on a developing set.
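As a rough illustration of this corpus preparation, the Python sketch below takes only the question part of each QA pair, segments it, and counts n-grams. The segment() stand-in and the n-gram order of 3 are assumptions for the example; the actual system uses the 270,000-entry dictionary segmenter [7] and the CMU-Cambridge toolkit [9].

```python
from collections import Counter

def segment(question: str) -> list[str]:
    """Stand-in for dictionary-based Chinese word segmentation."""
    return question.split()  # illustrative only

def ngram_counts(qa_pairs: list[tuple[str, str]], n: int = 3) -> Counter:
    """Count n-grams over the question parts only; answers are ignored,
    exactly as in the LM training corpus described above."""
    counts: Counter = Counter()
    for question, _answer in qa_pairs:
        words = ["<s>"] * (n - 1) + segment(question) + ["</s>"]
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts
```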

4.2 Question Answering

Question answering is the kernel module of SpeechQoogle. Given the text question automatically transcribed by the speech recognition module, the question answering module must generate the corresponding answer. Unlike a traditional question answering system, only one answer candidate is returned to the user, and this unique answer must be understandable when presented only as synthesized speech. Furthermore, what we want to build is open-domain question answering. Therefore we must choose a question answering technique with inherent advantages against these problems.

After carefully considering all the above requirements, we finally employed the FAQ-based question answering technique in SpeechQoogle. This technique was motivated by the many FAQ pages and question/answering websites available on the internet. In this technique, a large number of QA pairs are first collected from these resources and indexed. Given a question, FAQ-based QA ranks the QA pairs according to the relevance between each pair and the input question, and the answer parts of the top pairs are regarded as answer candidates and returned to the user.

For SpeechQoogle, a large number of question-answer pairs were first collected from one question/answering website. We indexed all the QA pairs with the Lucene.NET toolkit; only the question part is indexed for search in the current version of SpeechQoogle. Given a question, we search the indexed collection with the search function provided by Lucene.NET. The relevance score defined in Lucene.NET is essentially the inner product of the query vector and the document vector, where each word is weighted by its tf*idf value, as sketched below.
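The following sketch mimics this retrieval step with scikit-learn's TfidfVectorizer: index only the question parts, score by the inner product of tf*idf vectors, and return the answer part of the best-matching pair. Lucene's actual formula adds length normalization and boosting factors omitted here, and the two toy QA pairs are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Two invented QA pairs standing in for the 600,000-pair collection.
qa_pairs = [
    ("how long is the yellow river", "It is about 5,464 kilometers long."),
    ("who compiled the analects", "Confucius' disciples compiled it."),
]

vectorizer = TfidfVectorizer()
# Index only the question part of each pair, as in SpeechQoogle.
question_matrix = vectorizer.fit_transform(q for q, _ in qa_pairs)

def faq_answer(transcribed_question: str) -> str:
    """Return the answer part of the QA pair whose question part has the
    highest inner-product tf*idf score against the query."""
    query_vec = vectorizer.transform([transcribed_question])
    scores = (question_matrix @ query_vec.T).toarray().ravel()
    return qa_pairs[int(scores.argmax())][1]

print(faq_answer("how long is the yellow river"))
```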

4.3 Speech Synthesis

Speech synthesis is an almost solved problem, so we simply integrate the commercial Mandarin TTS system InterPhonic 4.0 into our SpeechQoogle system. The text answer generated by the question answering module is sent directly to the InterPhonic 4.0 system, and quite fluent speech is synthesized and returned to the user.

5 Experiments

SpeechQoogle is a quite complex system and its construction involves several subtasks. We need to prepare, improve and evaluate all three separate modules to ensure that the finally integrated SpeechQoogle system achieves a satisfying performance.

5.1 Building the QA Pair Collection

We collected question-answer pairs from one question/answering website: http://zhidao.baidu.com. In total we crawled about one million web pages, and one question/answer pair can be extracted from each page. Improper QA pairs, such as those with too long a question or answer, are filtered out in our experiment; a minimal filtering sketch follows. Finally we built a large collection of about 600,000 high-quality QA pairs. We indexed all the QA pairs with the Lucene.NET toolkit, so we can retrieve the top N relevant QA pairs from this collection for any given question.
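The sketch below illustrates this filtering step. The length thresholds are assumptions chosen for the example, since the paper does not report the exact cut-offs.

```python
crawled_pairs = [
    ("how long is the yellow river?", "About 5,464 kilometers."),
    ("", "an answer with an empty question is dropped"),
    ("x" * 200, "a question this long is dropped too"),
]

def keep(question: str, answer: str,
         max_q_chars: int = 40, max_a_chars: int = 500) -> bool:
    """Drop pairs with empty or overly long questions/answers.
    Thresholds are illustrative, not the paper's actual values."""
    return (0 < len(question) <= max_q_chars
            and 0 < len(answer) <= max_a_chars)

qa_collection = [(q, a) for q, a in crawled_pairs if keep(q, a)]
print(len(qa_collection))  # 1
```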

5.2 Building Dataset for Evaluation

To evaluate the question answering module, we built two question sets from another question/answering website, http://iask.sina.com.cn, which can be regarded as open test sets with respect to the QA pair collection built from http://zhidao.baidu.com. Dataset 1 contains 320 questions and is used as the developing set in the speech recognition tuning experiments; dataset 2 contains 80 questions and is used as the testing set in all experiments. We then retrieved the top 100 relevant QA pairs with the Lucene.NET toolkit for each question in the two datasets and employed one assistant to annotate a relevance grade for each QA pair. Three grades are defined: useful denotes that the answer can be returned to the user directly, relevant means the answer contains some information useful to the user, and none indicates a totally useless answer.

We measure question answering performance by R@N, defined as the ratio of questions for which a useful answer, or a relevant answer counted as 0.5 of a useful answer, can be found in the top N ranked answers (a minimal implementation is given just before Table 1). For example, if 100 questions are tested and, within the top N ranked answers, a useful answer is found for 50 questions and only a relevant answer for 10 questions, the R@N score is (50 + 0.5 * 10)/100 * 100% = 55%.

Table 1 shows statistics on the two question sets. The R@1 and R@100 scores are extremely high, which is unusual. After analysis, we found the main reason: only rational, compact and frequently asked questions were selected from http://iask.sina.com.cn by the assistant in our experiments. This experiment still validates the question answering module, because it is an open-set test and what we want to cover in SpeechQoogle is exactly rational, frequently asked questions presented in compact language.
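Here is a minimal implementation of R@N as defined above, checked against the worked example in the text:

```python
def r_at_n(grades_per_question: list[list[str]], n: int) -> float:
    """R@N: each question scores 1 if a 'useful' answer appears in its top
    N ranked answers, 0.5 if only a 'relevant' one does, 0 otherwise."""
    total = 0.0
    for grades in grades_per_question:
        top = grades[:n]
        if "useful" in top:
            total += 1.0
        elif "relevant" in top:
            total += 0.5
    return total / len(grades_per_question)

# Worked example from the text: 100 questions, a useful answer in the top N
# for 50 of them, only a relevant answer for 10, none for the remaining 40.
grades = [["useful"]] * 50 + [["relevant"]] * 10 + [["none"]] * 40
print(f"{r_at_n(grades, 1):.0%}")  # 55%
```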

Table 1. Statistical information on the developing and testing question sets.

Question set      No. of questions   Avg. length (syllables)   R@1       R@100
Developing set    320                10.28                     76.72%    97.81%
Testing set       80                 9.85                      72.50%    96.87%

We also employed 4 assistants (2 males and 2 females) to record all 400 questions (for each assistant, 80 questions from the developing set and 20 questions from the testing set) in a formal but spontaneous style. The spoken questions were recorded in a quiet office room through a normal microphone connected to a computer.

5.3 Speech Recognition Experiments

In the speech recognition experiments, we evaluate two acoustic models. One is built on the published 863 speech corpus (about 137 hours), which is recorded in reading style; the other is trained on a self-collected spontaneous speech database (about 70 hours). The two acoustic models are named AM_R and AM_S respectively.

For comparison, we trained three language models on two different text corpora. One corpus is a traditional newspaper corpus (about 300M bytes); the other is the question part of all collected QA pairs (including bad QA pairs), about 40M bytes. Text normalization was employed to convert special characters such as Arabic numerals into the corresponding Chinese characters. Word segmentation based on a large dictionary with 270,000 word entries was applied to the question text corpus, together with automatic proper name recognition. The 60,000 top-frequency words were then selected as the word list for the language model, and other words were re-split into these 60,000 words. Finally we trained our language model (named LM_Q_A) on the word segmentation result of the question text corpus. To evaluate the effect of refining the word segmentation, we also used a traditional 60,000-word dictionary to simply segment the question text corpus and trained another language model (named LM_Q_B). A language model on the newspaper corpus (named LM_N) was also developed with the same word segmentation setting as LM_Q_B.

Finally, we tuned the language model weight and penalty on the developing dataset for each acoustic model and language model combination; a minimal grid-search sketch is given just before Table 2. The speech recognition performance measured in Chinese character ACC is shown in Table 2. From the results we can see that the character ACC improved by about 33% relative through the specially developed language model and acoustic model. Most of the improvement is contributed by the language model adaptation. Refining the word segmentation brings a further improvement of about 2%, and substituting AM_S for AM_R yields about a 6% absolute improvement. The final character ACC exceeds 87%, which provides quite strong support for the QA module.
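The tuning loop can be pictured as the grid search below. The decode() placeholder stands in for running the Julius decoder with a given LM weight and insertion penalty; the grid values echo the magnitudes in Table 2 but are otherwise assumptions.

```python
def decode(wav: bytes, lm_weight: float, penalty: float) -> str:
    """Placeholder for running the Julius decoder with these parameters."""
    raise NotImplementedError

def character_acc(hyp: str, ref: str) -> float:
    """Character accuracy: 1 - edit_distance(hyp, ref) / len(ref)."""
    d = [[i + j if i * j == 0 else 0 for j in range(len(hyp) + 1)]
         for i in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]))
    return 1.0 - d[-1][-1] / len(ref)

def tune(dev_set: list[tuple[bytes, str]]) -> tuple[float, float]:
    """Grid-search LM weight and penalty for the best developing-set ACC."""
    best_acc, best_params = -1.0, (0.0, 0.0)
    for w in (10.0, 12.5, 15.0, 17.5):      # illustrative grid values
        for p in (-15.0, -10.0, -5.0, 0.0):
            acc = sum(character_acc(decode(wav, w, p), ref)
                      for wav, ref in dev_set) / len(dev_set)
            if acc > best_acc:
                best_acc, best_params = acc, (w, p)
    return best_params
```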

Table 2. Recognition performance of some acoustic and language model combinations.

Acoustic & Language Model   Best LM Weight / Penalty   ACC, Developing Set    ACC, Testing Set
AM_R & LM_N                 15.0 / -10.0               63.72% (baseline)      65.37% (baseline)
AM_S & LM_N                 15.0 / -10.0               67.58% (+6.06%)        68.32% (+4.51%)
AM_R & LM_Q_A               15.0 / -5.0                84.19% (+32.12%)       80.48% (+23.11%)
AM_S & LM_Q_B               15.0 / -5.0                86.46% (+35.69%)       85.70% (+31.10%)
AM_S & LM_Q_A               15.0 / -10.0               88.63% (+39.09%)       87.17% (+33.35%)

(relative improvements over the AM_R & LM_N baseline in parentheses)

5.4 Question Answering Experiments

After the spoken question is transcribed into a text question, we retrieve the most relevant question from the prepared QA pair collection. In the current version of SpeechQoogle, only the best recognition hypothesis, as a Chinese character string, is utilized in the question answering module. Because only the top-ranked answer is synthesized into speech and returned to the user, we adopt R@1 as the measure for the question answering module and for SpeechQoogle's final responding performance. Table 3 shows the final results. We can see that question answering performance depends heavily on speech recognition accuracy; the best experimental setting answers 56.25% of spoken questions in open-set testing, a relative loss of only about 22.41% compared with the text-based question answering upper bound.

Table 3. Responding performance of SpeechQoogle.

Acoustic & Language Model   R@1, Developing Set     R@1, Testing Set
AM_R & LM_N                 33.13% (baseline)       33.13% (baseline)
AM_S & LM_N                 32.03% (-3.32%)         35.00% (+5.64%)
AM_R & LM_Q_A               57.97% (+74.98%)        53.13% (+60.37%)
AM_S & LM_Q_B               59.38% (+79.23%)        52.50% (+58.47%)
AM_S & LM_Q_A               62.50% (+88.65%)        56.25% (+69.79%)
Original Text Question      76.72% (upper bound)    72.50% (upper bound)

(relative improvements over the AM_R & LM_N baseline in parentheses)

6 Conclusion and Future Work

In this paper, we propose a new and valuable research task in spoken language processing: an open-domain question answering system with speech interface, which is a natural extension of the traditional text-based question answering system. We first analyze both the significant values and the considerable challenges embedded in such a system. Speech recognition, question answering and speech synthesis form the three separate modules of its pipeline, and suitable techniques are carefully determined based on this analysis, especially the FAQ-based technique for the question answering module. The SpeechQoogle system, the first open-domain question answering system with speech interface, is then built and evaluated. About 600,000 high-quality QA pairs were collected from one question/answering website to support the question answering module, and 400 rational, frequently asked and compact questions were collected from another question/answering website as the test question set. Experimental results show that, with the support of the 600,000 QA pairs, the question answering module answers 72.50% of open-set testing questions. The 400 spoken questions recorded for tuning and testing SpeechQoogle's responding performance show that 87.17% character ACC can be achieved by the acoustic and language models developed for the speech recognition module, and finally SpeechQoogle successfully answers 56.25% of spoken questions in open-set testing, a satisfying and inspiring performance given that many potential improvements remain to be investigated in the future.

Future work includes: 1) Extracting more information than just the best transcribed Chinese character string from speech recognition and utilizing it in the question answering module, including word posterior probability (WPP), N-best recognition hypotheses and pinyin-level recognition results. 2) Utilizing more information in the question answering module, especially the answer part of the QA pair; deeper syntactic and semantic analysis, such as question-type prediction, can also be considered in the ranking function over QA pairs. 3) Modifying the optimization goal of the speech recognition module to maximize the final question answering performance instead of likelihood or character ACC alone; for example, maximizing the sum of the idf values of correctly transcribed words may benefit the overall question answering system with speech interface.

References

[1] Apache Lucene: a high-performance, full-featured text search engine library. http://lucene.apache.org
[2] Tomoyosi Akiba, Atsushi Fujii, Katunobu Itou. Effects of Language Modeling on Speech-driven Question Answering. CoRR cs.CL/0407028 (2004)
[3] T. Akiba, A. Fujii, K. Itou. Collecting spontaneously spoken queries for . In: Proceedings of the 4th International Conference on Language Resources and Evaluation, volume 4, pages 1439-1442, 2004
[4] Tomoyosi Akiba, Katunobu Itou, Atsushi Fujii, Tetsuya Ishikawa. Towards Speech-Driven Question Answering: Experiments Using the NTCIR-3 Question Answering Collection. In: Proceedings of the Third NTCIR Workshop, 2003
[5] Eric Brill, Susan Dumais, Michele Banko. An Analysis of the AskMSR Question-Answering System. EMNLP 2002
[6] Eric Chang, Frank Seide, Helen M. Meng, Zhuoran Chen, Yu Shi, Yuk-Chi Li. A System for Spoken Query Information Retrieval on Mobile Devices. IEEE Trans. Speech and Audio Processing, 10(8): 531-541, 2002
[7] Ben-Feng Chen, Guo-Ping Hu, Ren-Hua Wang. Large lexicon construction for TTS system. ISCSLP 2002, Taipei, August 2002
[8] Charles L. A. Clarke, Gordon V. Cormack, Thomas R. Lynam, C. M. Li, G. L. McLearn. Web Reinforced Question Answering (MultiText Experiments for TREC 2001). TREC 2001
[9] P. Clarkson, R. Rosenfeld. Statistical Language Modeling Using the CMU-Cambridge Toolkit. In: Proceedings of Eurospeech-97, September 1997
[10] Alexander Franz, Brian Milch. Searching the Web by Voice. COLING 2002
[11] Eduard H. Hovy, Laurie Gerber, Ulf Hermjakob, Michael Junk, Chin-Yew Lin. Question Answering in Webclopedia. TREC 2000
[12] Abraham Ittycheriah, Salim Roukos. IBM's Statistical Question Answering System - TREC 11. TREC 2002
[13] Valentin Jijkoun, Maarten de Rijke. Retrieving answers from frequently asked questions pages on the web. CIKM 2005: 76-83
[14] A. Lee, T. Kawahara, K. Shikano. Julius - an open source real-time large vocabulary recognition engine. In: Proceedings of the European Conference on Speech Communication and Technology, pages 1691-1694, September 2001
[15] Dragomir R. Radev, Weiguo Fan, Hong Qi, Harris Wu, Amardeep Grewal. Probabilistic question answering on the web. WWW 2002: 408-419
[16] M. Rayner, P. Bouillon. A flexible speech to speech phrasebook translator. In: Proceedings of the Workshop on Speech-to-Speech Translation: Algorithms and Systems, Philadelphia, Pennsylvania, 2002
[17] Radu Soricut, Eric Brill. Automatic question answering using the web: Beyond the Factoid. Information Retrieval 9(2): 191-206 (2006)
[18] Gang Wang, Tat-Seng Chua, Yong-Cheng Wang. Extracting Key Semantic Terms from Chinese Speech Query for Web Searches. ACL 2003, Sapporo, Japan, July 2003: 248-255
[19] E. Voorhees, D. Tice. The TREC-8 question answering track evaluation. In: Proceedings of the 8th Text Retrieval Conference, pages 83-106, 1999
[20] S. Young, G. Evermann, T. Hain, D. Kershaw, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, P. Woodland. The HTK Book (for HTK Version 3.2), December 2002. http://htk.eng.cam.ac.uk/
[21] V. Zue, S. Seneff, J. Glass, J. Polifroni, C. Pao, T. J. Hazen, L. Hetherington. JUPITER: A telephone-based conversational interface for weather information. IEEE Transactions on Speech and Audio Processing, 8(1), 2000
[22] InterPhonic 4.0 Text-To-Speech System Introduction. http://iflytek.com/product_speech/product_speechInterPhonic4.0_2.htm