3 Spoken Technology 3-1 Overview of Spoken Language Communication Technologies

KASHIOKA Hideki

The goal of Spoken Language Communication Laboratory, Universal Communication Research Institute, NICT, is to realize multi language communication technologies with spoken language regardless of who, where, when, how and in which language users speak. Toward this goal, we will intensively develop ICT for a human-machine interface, such as multilingual recognition, multilingual speech synthesis, and spoken dialogue technology. In this pa- per, we indicate these technologies overview. Keywords Speech recognition, Speech synthesis, Dialogue processing

1 Introduction log system technologies, which are the major technologies that constitute spoken language With the development of information and communication technology. communication technologies, it is desired to achieve communication between various peo- 2 Spoken language ple in various environments and situations, for communication technology example, communication between people in distant locations or who use different languag- Spoken language communication technol- es. To realize such , multilin- ogy is the technology of not only recording gual spoken language communications need to and utilizing the information of various be studied for people’s smooth interactions in speech-based daily communications but also any , at any time and in any place overcoming communication barriers. The ma- with any expressions. The research and devel- jor technologies for spoken language commu- opment of spoken language communication nication are speech recognition technology for technology is an important issue since spoken transforming speech dialog to text, speech language communication is one of the most synthesis technology for providing the speech natural human communications. In our daily output of text information, and dialog system life, along with the rapid spread of smart technology for supporting speech-based inter- phones, voice services to access various kinds actions. of information are now available and used by many people. 2.1 Speech recognition technology In this paper we give an overview of We often receive a large amount of speech speech recognition, speech synthesis, and dia- information not only in conversations with

KASHIOKA Hideki 11

JM-3-1-下版-20121126-KASHIOKA.indd 11 13/01/11 16:24 others but also from various announcements, ken dialog system used in devices like smart TV, radio, and videos on the Internet. Speech phones. Applications in call center systems are recognition technology transforms such speech also expected. The utilization of speech recog- into text. nition for captioning news and other TV pro- For the transform, the technology learns grams or internet videos has been requested from many corpora the models of extracting for a variety of reasons (e.g. for supporting speech features and using them for the trans- people with disabilities or for recording mo- form of speech. It then learns the models of tion pictures). To use speech recognition prac- verbalizing the obtained character strings as tically in these applications, we need to work , phrases and sentences, and checks the on the issues of noise processing, long sen- input speech against the learned models. The tence processing, and precision improvement models of extracting speech features and using etc. Multilingual support is also an important them for the transform of speech are called issue. acoustic models and the models of verbalizing the obtained character strings as words, phras- 2.2 Speech synthesis technology es and sentences are called language models. Speech information is often necessary in Spoken Language Communication Laboratory various situations. In particular speech synthe- has developed a high-speed highly accurate sis is actually used in public transportation and speech recognition system[1] using Weighted disaster-prevention announcements. When in- Finite State Transducer (WFST) as a search formation that people want to receive in a spo- model, as well as acoustic and language mod- ken form is given as text, speech synthesis els, to browse the results of checking speech technology is used to create speech informa- against these models. The speech recognition tion from that text. system is used in a spoken dialog system and a To synthesize speech from text, the speech speech translation system. We have a Japanese synthesis technology is comprised of a text dictionary of 650,000 words and can process a analysis unit for parsing the text with syntactic six- speech within 1 RTF (Real Time and semantic information, and a synthesis en- Factor). gine for determining how intonation and Speech that we receive in various environ- rhythm are used to produce sounds from the ments may contain non-speech sounds. To words and phrases in the text. The text analy- perform speech recognition, we therefore need sis unit extracts the word and phrase informa- not only the above-mentioned speech-to-text tion and the synthesis engine uses the acoustic transform technology but also technology to model for speech synthesis. At present, Spoken recognize the non-speech sounds as noise and Language Communication Laboratory uses an a voice activity detection technology to extract HMM speech synthesis method to implement the speech section that includes the target speech synthesis supporting not only Japanese speech. In addition, microphones that receive but also English, Chinese, Korean, Indonesian, are not always ideal for speech rec- Vietnamese, and Malay[2] . Since speech syn- ognition. A familiar example is applications thesis acoustic models cover different lan- on mobile terminals such as smart phones. To guages and have different voice qualities and process various kinds of noises and speech, speech styles, the speech style and voice quali- such applications use noise reduction process- ty can be changed by switching the models. ing and build a speech recognition model (in With a focus on this fact, a voice selector was particular an acoustic model) of higher noise developed. It produces speech in a voice simi- resistance from noise-contained sound data lar to the original speaker’s by selecting a before making speech recognition. model that has similar voice characteristic to A familiar application of speech recogni- the speaker’s in speech translation and other tion is the speech translation system and spo- processes. The naturalness of synthesized

12 Journal of the National Institute of Information and Communications Technology Vol. 59 Nos. 3/4 2012

JJM-3-1-下版-20121126-KASHIOKA.inddM-3-1-下版-20121126-KASHIOKA.indd 1122 113/01/113/01/11 116:246:24 speech is also enhanced by improving the filter can be acknowledged using speech expres- used for the development of the model. sions, intonations and other speech informa- The speech synthesis system is necessary tion. Spoken Language Communication to build a speech translation system and a Laboratory built a tourist information speech speech dialog system. Since it is used in con- dialog corpus and developed a Dialog versations with people, synthesized speech Management System [3] with WFST to with enriched naturalness is required. Speech achieve rapid and accurate speech language synthesis acoustic models are built from a understanding. speech corpus of the same person. However, The dialog system technology is used di- research suggests that the model’s naturalness rectly for building a spoken dialog system. is higher when the corpus is made from con- Unlike QA-systems, users can keep a dialogue versations than when it is made from reading with this system and obtain appropriate infor- text. Since building the speech synthesis mation. It can also be applied to not only the acoustic models costs a lot, the automatic dialog system but also a mechanism that un- building of models is another important issue. derstands and predicts the context. This is an important technology to realize appropriate 2.3 Dialog system technologies context-based processing in various speech To maintain a spoken dialog and make ap- communication technologies. propriate questions and answers, it is neces- sary to perceive the situation and environment 3 Conclusions of the dialog and understand the dialog itself. In speech communications, not only the dia- We have given an overview of the speech log’s content but also the speech information recognition, speech synthesis, and dialog sys- may provide information related to the situa- tem technologies, which all constitute spoken tion and environment. Information can also be language communication technology. The obtained by considering a series of speeches. Spoken Language Communication Laboratory, Dialog system technology is a technology for Universal Communication Research Institute, managing dialogs comprehensively, under- NICT, has been promoting research and devel- standing them, and predicting and producing opment with a focus on the problems and fu- the next speech in various occasions. ture of these elemental technologies. It not Dialog system technologies can be only performs research and development into classified into speech understanding technolo- the individual elemental technologies but also gy for the understanding of speech, context combines them with those developed by other processing technology for producing a re- laboratories to realize an integrated system sponse according to the underlying dialog act that can be used in actual society. Specifically, as obtained in the speech understanding pro- the Laboratory has developed the speech trans- cess, by considering the association with sur- lation system VoiceTra by integrating these rounding information services, and speech speech recognition, speech synthesis, and mul- producing technology for producing a re- tilingual translation technologies, and the spo- sponse sentence from the response content. To ken dialog system AssisTra by integrating understand dialog, it is necessary to under- speech recognition, speech synthesis, and dia- stand the dialog content and dialog act. The log system technologies. These systems are re- dialog content can be understood by describ- leased as applications available on smart ing the content as a logical proposition based phone devices for demonstration and field ex- on the understanding of proper nouns, various periment purposes. In the future we will pro- expressions, homonyms, and different expres- mote the research and development of technol- sions of the same target. The dialog act, such ogy for building a speech archive and of as a request, question, or information offering, multilingual communication technology for

KASHIOKA Hideki 13

JJM-3-1-下版-20121126-KASHIOKA.inddM-3-1-下版-20121126-KASHIOKA.indd 1133 113/01/113/01/11 116:246:24 providing smooth communications against speech information contained in speech data various communication barriers by utilizing and video data as well as text information.

References 1 Dixon Paul Richard, Chiori Hori, and Hideki Kashioka, “A COMPARISON OF DYNAMIC WFST DECODING APPROACHES,” In Proc. ICASSP, 2012. 2 Yoshinori Shiga, “EFFECT OF ANTI-ALIASING FILTERING ON THE QUALITY OF SPEECH FROM AN HMM- BASED SYNTHESIZER,” In Proc. ICASSP, 2012. 3 C. Hori, K. Ohtake, T. Misu, H. Kashioka, and S. Nakamura, “Statistical Dialog Management Applied to WFST-based Dialog Systems,” In Proc. ICASSP, pp. 4793–4796, 2009.

(Accepted June 14, 2012)

KASHIOKA Hideki, Ph.D. Director, Spoken Language Communication Laboratory, Universal Communication Research Institute Spoken Language Processing, Speech Translation, Spoken Dialogue

14 Journal of the National Institute of Information and Communications Technology Vol. 59 Nos. 3/4 2012

JM-3-1-下版-20121126-KASHIOKA.indd 14 13/01/11 16:24