Japanese LVCSR on the Spontaneous Scheduling Task with JANUS-3

JAPANESE LVCSR ON THE SPONTANEOUS SCHEDULING TASK WITH

JANUS-3

T. Schultz, D. Kol l, and A. Waibel

Interactive Systems Lab oratories

University of Karlsruhe Germany, Carnegie Mellon University USA

ftanja,koll,waib [email protected]

ABSTRACT

Dialogs Utts Units Vo cab

This pap er presents our ndings during the devel-

Japanese Verbmobil Database

opment of the recognition engine for the Japanese

Total 800 11308 322K 2668

part of the VERBMOBIL sp eech-to-sp eech transla-

Training data

tion pro ject. We describ e an ecient metho d to

Bo otstrap 190 3122 92K 1879

b o otstrap a large vo cabulary sp eech recognizer for

Final system 730 10278 305K 2598

sp ontaneously sp oken Japanese sp eech from a Ger-

Test data OOV

man recognizer and show that the amount of e ort in

Acoustic mo dels 10 189 4K 0.35

Language mo del 70 1030 17K 0.6

developing the system could b e reduced by using this

rapid cross language b o otstrapping technique. The

Table 1: The Japanese sp ontaneous scheduling task

Japanese recognizer is integrated into the VERBMO-

BIL system and shows very promising results achiev-

ing 9.3 word error rate.

ing the JANUS-3 Sp eech Recognition To olkit in two

1. INTRODUCTION

passes: First a fast b o otstrap technique from a Ger-

man recognizer was applied on a subset of 190 di-

The overall goal of the rst phase of the VERBMO-

alogs of the training database. In the second step

BIL pro ject is to build a sp eech-to-sp eech transla-

the training pass on the full database with 730 di-

tion system from b oth German and Japanese sp on-

alogs lead to the nal system.

taneously sp oken input sp eech to English, German

and Japanese output in an app ointment scenario [1].

The Japanese recognizer describ ed in this pap er is

2. THE JAPANESE SPONTANEOUS

b eeing designed to be part of this translation sys-

SCHEDULING TASK

tem. Unlike Japanese dictation systems [2] there is

The domain of the VERBMOBIL scenario is limited

no need for our recognizer to map the output onto

to app ointment scheduling. Native sp eakers were

Japanese kanji/kana characters. Due to the many

given a time schedule and were asked to schedule

homophones in the Japanese language this mapping

a meeting in a human-to-human dialog session. The

requires syntactic and semantic knowledge ab out the

sp eaking style of the dialog partners is not restricted,

uttered sentence. In the VERBMOBIL system this

therefore sp ontaneous phenomena like noise, stutter-

knowledge can b e p ostp oned to the syntactic and se-

ing, false starts and nongrammatical sentences o ccur.

mantic analysis mo dules [3].

A push-to-talk button and a close sp eaking micro-

One p eculiarity of Japanese is, that sentences are

phone were used to avoid cross-talk e ects.

written in strings of kanji and kana characters with-

The Japanese part of the VERBMOBIL data-

out white space b etween adjacentwords. Thus the

base consists of 800 dialogs sp oken by 324 di erent

Japanese language lacks the natural segmentation of

native sp eakers. On average the dialogs cover 14 ut-

sp eechinto words, which can conveniently b e used as

terances each of a length of ab out 30 words. All di-

basic units for recognition purp oses, as known from

alogs are collected byATR Interpreting Telecommu-

Indo europ ean languages like English. In order to de-

nication Lab oratories and the University of Electro-

termine such basic sp eech units for the recognition

Communications in Tokyo Japan. Further infor-

system the transcrib ed sp eech data has b een seg-

mation ab out the corpus and collection pro cedures

mented by a semi-automatic morphological analysis

are given in [4].

program describ ed in [4].

The recognition engine was evaluated based on Human-to-human dialogs tend to b e at a higher

the database describ ed in the following section us- level of sp ontaneity than in human-to-computer sce-

narios, used for example in the ATIS-Task. A com-

Phone Typ e with German without

parison of cross-talk vs push-to-talk scenario for the

counterparts counterparts

Spanish scheduling task showed that cross-talk is

harder to recognize b ecause it is more noisy [5]. On

Vowels ae io 4&0

the other hand push-to-talk leads to longer and more

a: e: i: o: 4:

complex utterances 38 vs 10 words p er utterance on

Semi vowels j

average, making this task more dicult for sp eech

Consonants bdfghklm

translation.

nn2psStvz

A ricates ts tS dZ

Misc 1 Silence mo del

3. SYSTEM BOOTSTRAPPING

1 Glottal stop

It has b een shown earlier in [6] that for small vo cabu-

11 noise mo dels

lary continuous read sp eech the cross language b o ot-

Table 2: Japanese phoneme mo dels [Worldb et]

strapping technique from English to Japanese leads

to reasonable results. We expand this approachto

large vo cabulary sp ontaneously sp oken sp eech.

and with dialect sp eci c variants. For full coverage

3.1. Phoneme Set

of the 190 segmented training data a dictionary with

Running an alignment of a German phoneme rec-

1879 word units was built. The segmentation ap-

ognizer on Japanese input sp eech gave us the im-

proachwe applied results in relatively small vo cab-

pression, that the Japanese phonetic is very similar

ulary growth rates compared to the not segmented

to German. Therefore we decided to b o otstrap the

data. The size of the dictionary is even smaller than

Japanese phoneme set with German mo dels devel-

for the German scheduling task as can b e seen from

op ed for the German VERBMOBIL recognizer sys-

gure 1.

tem. Only 4 of the 31 phonemes required for acous-

7000

tic mo deling have no German counterparts. These

with segmentation

ere b o otstrapp ed as follows:/4/ from /u/, /4:/ from w no segmentation

6000

/u:/, /&0/ from /s/, and /dZ/ from /tS/. To cop e

with the e ects arising in the sp ontaneously sp oken

5000

data, like i.e. stuttering, false starts, or mumbling,

sp ecial noise mo dels [7] were included. All together

4000

44 di erent phonemes are used to mo del Japanese

sp eech: 31 sp eech mo dels, 11 noise mo dels, 1 silence

3000 table 2. and 1 glottal stop ref. GSST

entries in dictionary * e determined

After training a Japanese system w 2000

the similaritybetween the Japanese and German pho-

set by p erforming the following exp eriment:

neme 1000

we trained a SCHMM for the combined phoneme set

b oth German and Japanese input and ran a with 0

0 dialogs 50 100 150 200

pro cedure. This leads to the result that clustering

not segmented units 11000 23000 35000 52000

the most similar phonemes in this order are the con-

Figure 1: Vo cabulary growth of segmented vs not

sonant /z/ and /b/, the a ricate /ts/ and the semi

segmented word units; lab el "GSST" gives the dic-

vowel /j/. Japanese short vowels are similar to the

tionary size for the German VERBMOBIL database

long version of their German counterparts.

3.2. Pronunciation dictionary

3.3. Acoustic Mo dels Bo otstrapping

The Japanese syllable based writing system kana

To b o otstrap the Japanese system we to ok a German provides an almost phonetic transcription of the Ja-

context-indep endent 3-state HMM recognizer. Each panese sp oken language. Given a kana-transcription

state of the HMM is mo deled by one co deb o ok. Each of the sp oken dialogs the pronunciation dictionary

co deb o ok contains 16 mixture Gaussian distribution required for training the recognition system could

of a 32 dimensional feature space. 16 Mel-scale co ef- be built automatically using a simple mapping al-

cients, p ower and their rst and second derivatives gorithm. The output was p ost-edited by native ex-

are calculated from the 16kHz sampled input sp eech. p erts, adding pronunciation variants mainly for cop-

Mean subtraction is applied. The amount of features ing with di erences in the sp eaking styles of male

is reduced to 32 co ecients by computing a Linear and female sp eakers e.g. "atashi" as a variant for

Discriminant Analysis LDA. "watashi" in female sp eech for the English word "I"

We created a copy of the German recognizer con-

Systems Word Error

taining the co deb o oks and distributions of the Ger-

man counterparts for the Japanese mo dels as de-

Japanese recognizer

scrib ed in table 2. Additional copies for co deb o oks

First b o otstrapp ed version 65.0

and distributions from similar German phonemes are

Context-indep endent system 20.9

added for the 4 phonemes not found in the German

Context-dep endent system 13.0

phoneme set. As can b e seen in table 3 this recog-

Final system 9.3

nizer has a reasonable p erformance, though it has

German recognizer

never seen any Japanese data. In the next step this

Used for b o otstrapping 38.1

b o otstrapp ed version was trained 3 iterations. A new

Currently b est system 12.5

LDA had b een calculated, and new co deb o oks were

clustered. Another 3 training iteration made out the

Table 3: System p erformance

context-indep endent Japanese system, which leads

to 20.9 word error rate.

b een mo deled as a mixture of 32 Gaussians with di-

3.4. Further Improvements

agonal covariance. In order to increase recognition

After training and testing the context-indep endent

sp eed the dimensionality of the feature set was re-

system a context-dep endent system was develop ed

duced to the rst 24 LDA parameters of the feature

based on the JANUS-3 to olkit. Analyzing the schedul-

set describ ed in section 3.3. Lab el b o osting using

ing database shows that only 54000 di erent sept-

sup ervised MLLR-adaption was used to improve the

phones a context of 3 phonemes to the right and

quality of the database lab eling.

to the left could b e found. The Japanese phonetic

seems to be more restricted compared to German

sp eech in the equivalent app ointmentscheduling task

4.2. Language Mo deling

200.000 quint-phones and for English in the switch-

b oard task 500.000 triphones.

For the nal system, trigram language mo dels have

We mo deled 54000 sept-phones in the context-

b een built on the 730 dialogs of the training set using

dep endent Japanese system and clustered these sept-

a Kneser/Ney backo scheme for unseen bi- resp ec-

phones to 600 decision-tree-clustered p olyphone mo-

tively trigrams.

dels as describ ed in [8]. The nal context-dep endent

Our previous studies suggest that mo deling noises

system has ab out 2000 distributions over the 600

like regular words improves the recognition p erfor-

p olyphone mo dels. The phonetic questions needed

mance mo derately [7]. Breathing and key click noises

for the clustering pro cedure could be derived from

are for example much more common at the b egin-

the German phonetic questions.The resulting context-

ning and the end than in the middle of an utter-

dep endent Japanese sp eech recognition system achieves

ance. Therefore the noise events and hesitations e.g.

13.0 word error rate.

/eeto/ and /ano/ are mo deled like regular words in

computing their language mo del probabilities.

The morphological tagger used to segment the

4. FINAL SYSTEM

Japanese text transcriptions of the training dialogs

lead to a reduction of the vo cabulary size from roughly

For the initial b o otstrap only 190 out of the 800 di-

10.000 to 2500. In doing so we get shorter, more fre-

alogs were used for training the system. This lead

quent sequences. One drawbackis however the re-

to reasonable word accuracy on a system with com-

duction of the predictive power of n-gram mo dels.

paratively few parameters in short time. The nal

Another is the necessity for checking the output of

system however made use of the full data-set for b oth

the segmentation to ol which has to b e done by native

language mo deling and re ning the acoustic mo dels.

exp erts. In [9] a fully automatic statistical approach

Using all training data the dictionary size increased

is describ ed which nd sequences of text that are

to 2598 words.

b oth statistically imp ortant and semantically mean-

ingful.

4.1. Acoustic mo deling

Whereas the data b eing sucient for acoustic

mo deling, the corpus of approximately 300K words Using the full data-set for the estimation of the HMM-

over a 2598 word vo cabulary might still be a little parameters we changed the system structure to a

small for accurate trigramm estimation. However the fully continuous approach with an increased number

task limitation lead to a low p erplexity of 12.8 and of co deb o oks. The p olyphonic tree of all o ccurring

a slow growth of vo cabulary as observed in gure 1, sept-phones containing cross-word mo dels with up

resulting in a OOV rate of only 0.6 on the 17K to one phoneme lo okahead to adjacent words has

words test set. b een clustered to 2000 co deb o oks, each of which has

ural Language Pro cessing and Sp eechTechnol- 4.3. Deco ding

ogy: Results of the 3rd KONVENS Conference,

During recognition a two pass searchis p erformed:

Bielefeld Germany 1996.

in the rst pass the search dictionary is organized as

[4] A. Kurematsu, M. Kitamura, T. Nakasuji, and

a phoneme tree, the second op erates on a at word

A. Waib el: Data Col lection of Japanese Spon-

list containing the most likely word-hyp othesis of the

taneous Speech on Scheduling Task and Devel-

rst pass relative error reduction: 6. The re-

opment of Speech Dictionary in: Pro c. of Eu-

sulting word-lattices are rescored, making full use of

rosp eech, Rho des 1997.

the trigram language mo del relative error reduction:

10. The recognition accuracy of the resulting nal

[5] P. Zhan, K. Ries, M. Gavalda, D. Gates,

system using the full dictionary of 2598 words could

A. Lavie, and A. Waib el: JANUS-II: To-

thus b e improved to 9.3 word error rate.

wards Spontaneous Spanish Speech Recognition

Table 3 compares the word error rate b etween the

in: Pro c. of ICSLP, Philadelphia 1996.

describ ed Japanese systems and the German recog-

[6] B. Wheatley, K. Kondo, W. Anderson, and Y.

nition engine of Karlsruhe University [10], whichwas

Muthusamy: On Evaluation of Cross Language

the winner of the VERBMOBIL evaluation in 1996.

Adaptation for Rapid HMM Development in a

New Language in: Pro c. of ICASSP, Adelaide

5. CONCLUSION

1994.

From the ab ove exp eriments we conclude that the

[7] T. Schultz and I. Rogina: Acoustic and

cross language b o otstrapping of Japanese acoustic

Language Modeling of Human and Nonhu-

mo dels from German mo dels is a very ecient tech-

man Noises for Human-to-Human Spontaneous

nique even for large vo cabulary sp ontaneously sp o-

Speech Recognition in: Pro c. of ICASSP, Detroit

ken sp eech. Phonetic mismatches b etween the Ger-

1995.

man and Japanese phoneme set are tolerable. A Ja-

[8] M. Finke and I. Rogina: Wide Context Acous-

panese LVCSR system with go o d p erformance could

tic Modeling in Read vs Spontaneous Speech in:

b e develop ed in short time. The resulting nal sys-

Pro c. of ICASSP, Munich 1997.

tem is nowintegrated into the VERBMOBIL sp eech-

to-sp eech translation pro ject. With a word error rate

[9] L.J. Tomokyo-May eld and K. Ries: What

of 9.3 it achieves very promising results.

makes a word: Learning base units in Japanese

for speech recognition in: Pro c. of the ACL Nat-

ural Language Learning workshop, Madrid July

6. ACKNOWLEDGMENTS

1997.

The JANUS pro ject is partly funded by grant 413-

[10] M. Finke, P. Geutner, H. Hild, T. Kemp, K.

4001-01IV101S3 from the German Ministry of Sci-

Ries, and Martin Westphal: The Karlsruhe-

ence and Technology BMBF as a part of the VERB-

Verbmobil Speech Recognition Engine in: Pro c.

MOBIL pro ject. We gratefully acknowledge supp ort

of ICASSP, Munich 1997.

and co op eration with ATR Interpreting Telecommu-

nication Lab oratories and the University of Electro-

Communications in Tokyo, Japan. The authors wish

to thank all memb ers of the Interactive Systems Lab-

oratories, esp ecially Michael Finke for useful discus-

sion and active supp ort.

7. REFERENCES

[1] T. Bub, W. Wahlster, and A. Waib el: VERB-

MOBIL: The Combination of Deep and Shal low

Processing for Spontaneous Speech Translation

in: Pro c. of ICASSP, Munich 1997.

[2] T. Matsuoka, K. Ohtsuki, T. Mori, S.

Furui, and K. Shirai: Japanese Large-

Vocabulary Continous-Speech Recognition Us-

ing a Business-Newspaper Corpus in: Pro c. of

ICASSP, Munich 1997.

[3] M. Siegel: De niteness and Number in Japa-

nese to German Machine Translation in: Nat-