JAPANESE LVCSR ON THE SPONTANEOUS SCHEDULING TASK WITH
JANUS-3
T. Schultz, D. Kol l, and A. Waibel
Interactive Systems Lab oratories
University of Karlsruhe Germany, Carnegie Mellon University USA
ftanja,koll,waib [email protected]
ABSTRACT
Dialogs Utts Units Vo cab
This pap er presents our ndings during the devel-
Japanese Verbmobil Database
opment of the recognition engine for the Japanese
Total 800 11308 322K 2668
part of the VERBMOBIL sp eech-to-sp eech transla-
Training data
tion pro ject. We describ e an ecient metho d to
Bo otstrap 190 3122 92K 1879
b o otstrap a large vo cabulary sp eech recognizer for
Final system 730 10278 305K 2598
sp ontaneously sp oken Japanese sp eech from a Ger-
Test data OOV
man recognizer and show that the amount of e ort in
Acoustic mo dels 10 189 4K 0.35
Language mo del 70 1030 17K 0.6
developing the system could b e reduced by using this
rapid cross language b o otstrapping technique. The
Table 1: The Japanese sp ontaneous scheduling task
Japanese recognizer is integrated into the VERBMO-
BIL system and shows very promising results achiev-
ing 9.3 word error rate.
ing the JANUS-3 Sp eech Recognition To olkit in two
1. INTRODUCTION
passes: First a fast b o otstrap technique from a Ger-
man recognizer was applied on a subset of 190 di-
The overall goal of the rst phase of the VERBMO-
alogs of the training database. In the second step
BIL pro ject is to build a sp eech-to-sp eech transla-
the training pass on the full database with 730 di-
tion system from b oth German and Japanese sp on-
alogs lead to the nal system.
taneously sp oken input sp eech to English, German
and Japanese output in an app ointment scenario [1].
The Japanese recognizer describ ed in this pap er is
2. THE JAPANESE SPONTANEOUS
b eeing designed to be part of this translation sys-
SCHEDULING TASK
tem. Unlike Japanese dictation systems [2] there is
The domain of the VERBMOBIL scenario is limited
no need for our recognizer to map the output onto
to app ointment scheduling. Native sp eakers were
Japanese kanji/kana characters. Due to the many
given a time schedule and were asked to schedule
homophones in the Japanese language this mapping
a meeting in a human-to-human dialog session. The
requires syntactic and semantic knowledge ab out the
sp eaking style of the dialog partners is not restricted,
uttered sentence. In the VERBMOBIL system this
therefore sp ontaneous phenomena like noise, stutter-
knowledge can b e p ostp oned to the syntactic and se-
ing, false starts and nongrammatical sentences o ccur.
mantic analysis mo dules [3].
A push-to-talk button and a close sp eaking micro-
One p eculiarity of Japanese is, that sentences are
phone were used to avoid cross-talk e ects.
written in strings of kanji and kana characters with-
The Japanese part of the VERBMOBIL data-
out white space b etween adjacentwords. Thus the
base consists of 800 dialogs sp oken by 324 di erent
Japanese language lacks the natural segmentation of
native sp eakers. On average the dialogs cover 14 ut-
sp eechinto words, which can conveniently b e used as
terances each of a length of ab out 30 words. All di-
basic units for recognition purp oses, as known from
alogs are collected byATR Interpreting Telecommu-
Indo europ ean languages like English. In order to de-
nication Lab oratories and the University of Electro-
termine such basic sp eech units for the recognition
Communications in Tokyo Japan. Further infor-
system the transcrib ed sp eech data has b een seg-
mation ab out the corpus and collection pro cedures
mented by a semi-automatic morphological analysis
are given in [4].
program describ ed in [4].
The recognition engine was evaluated based on Human-to-human dialogs tend to b e at a higher
the database describ ed in the following section us- level of sp ontaneity than in human-to-computer sce-
narios, used for example in the ATIS-Task. A com-
Phone Typ e with German without
parison of cross-talk vs push-to-talk scenario for the
counterparts counterparts
Spanish scheduling task showed that cross-talk is
harder to recognize b ecause it is more noisy [5]. On
Vowels ae io 4&0
the other hand push-to-talk leads to longer and more
a: e: i: o: 4:
complex utterances 38 vs 10 words p er utterance on
Semi vowels j
average, making this task more dicult for sp eech
Consonants bdfghklm
translation.
nn2psStvz
A ricates ts tS dZ
Misc 1 Silence mo del
3. SYSTEM BOOTSTRAPPING
1 Glottal stop
It has b een shown earlier in [6] that for small vo cabu-
11 noise mo dels
lary continuous read sp eech the cross language b o ot-
Table 2: Japanese phoneme mo dels [Worldb et]
strapping technique from English to Japanese leads
to reasonable results. We expand this approachto
large vo cabulary sp ontaneously sp oken sp eech.
and with dialect sp eci c variants. For full coverage
3.1. Phoneme Set
of the 190 segmented training data a dictionary with
Running an alignment of a German phoneme rec-
1879 word units was built. The segmentation ap-
ognizer on Japanese input sp eech gave us the im-
proachwe applied results in relatively small vo cab-
pression, that the Japanese phonetic is very similar
ulary growth rates compared to the not segmented
to German. Therefore we decided to b o otstrap the
data. The size of the dictionary is even smaller than
Japanese phoneme set with German mo dels devel-
for the German scheduling task as can b e seen from
op ed for the German VERBMOBIL recognizer sys-
gure 1.
tem. Only 4 of the 31 phonemes required for acous-
7000
tic mo deling have no German counterparts. These
with segmentation
ere b o otstrapp ed as follows:/4/ from /u/, /4:/ from w no segmentation
6000
/u:/, /&0/ from /s/, and /dZ/ from /tS/. To cop e
with the e ects arising in the sp ontaneously sp oken
5000
data, like i.e. stuttering, false starts, or mumbling,
sp ecial noise mo dels [7] were included. All together
4000
44 di erent phonemes are used to mo del Japanese
sp eech: 31 sp eech mo dels, 11 noise mo dels, 1 silence
3000 table 2. and 1 glottal stop ref. GSST
entries in dictionary * e determined
After training a Japanese system w 2000
the similaritybetween the Japanese and German pho-
set by p erforming the following exp eriment:
neme 1000
we trained a SCHMM for the combined phoneme set
b oth German and Japanese input and ran a with 0
0 dialogs 50 100 150 200
pro cedure. This leads to the result that clustering
not segmented units 11000 23000 35000 52000
the most similar phonemes in this order are the con-
Figure 1: Vo cabulary growth of segmented vs not
sonant /z/ and /b/, the a ricate /ts/ and the semi
segmented word units; lab el "GSST" gives the dic-
vowel /j/. Japanese short vowels are similar to the
tionary size for the German VERBMOBIL database
long version of their German counterparts.
3.2. Pronunciation dictionary
3.3. Acoustic Mo dels Bo otstrapping
The Japanese syllable based writing system kana
To b o otstrap the Japanese system we to ok a German provides an almost phonetic transcription of the Ja-
context-indep endent 3-state HMM recognizer. Each panese sp oken language. Given a kana-transcription
state of the HMM is mo deled by one co deb o ok. Each of the sp oken dialogs the pronunciation dictionary
co deb o ok contains 16 mixture Gaussian distribution required for training the recognition system could
of a 32 dimensional feature space. 16 Mel-scale co ef- be built automatically using a simple mapping al-
cients, p ower and their rst and second derivatives gorithm. The output was p ost-edited by native ex-
are calculated from the 16kHz sampled input sp eech. p erts, adding pronunciation variants mainly for cop-
Mean subtraction is applied. The amount of features ing with di erences in the sp eaking styles of male
is reduced to 32 co ecients by computing a Linear and female sp eakers e.g. "atashi" as a variant for
Discriminant Analysis LDA. "watashi" in female sp eech for the English word "I"
We created a copy of the German recognizer con-
Systems Word Error
taining the co deb o oks and distributions of the Ger-
man counterparts for the Japanese mo dels as de-
Japanese recognizer
scrib ed in table 2. Additional copies for co deb o oks
First b o otstrapp ed version 65.0
and distributions from similar German phonemes are
Context-indep endent system 20.9
added for the 4 phonemes not found in the German
Context-dep endent system 13.0
phoneme set. As can b e seen in table 3 this recog-
Final system 9.3
nizer has a reasonable p erformance, though it has
German recognizer
never seen any Japanese data. In the next step this
Used for b o otstrapping 38.1
b o otstrapp ed version was trained 3 iterations. A new
Currently b est system 12.5
LDA had b een calculated, and new co deb o oks were
clustered. Another 3 training iteration made out the
Table 3: System p erformance
context-indep endent Japanese system, which leads
to 20.9 word error rate.
b een mo deled as a mixture of 32 Gaussians with di-
3.4. Further Improvements
agonal covariance. In order to increase recognition
After training and testing the context-indep endent
sp eed the dimensionality of the feature set was re-
system a context-dep endent system was develop ed
duced to the rst 24 LDA parameters of the feature
based on the JANUS-3 to olkit. Analyzing the schedul-
set describ ed in section 3.3. Lab el b o osting using
ing database shows that only 54000 di erent sept-
sup ervised MLLR-adaption was used to improve the
phones a context of 3 phonemes to the right and
quality of the database lab eling.
to the left could b e found. The Japanese phonetic
seems to be more restricted compared to German
sp eech in the equivalent app ointmentscheduling task
4.2. Language Mo deling
200.000 quint-phones and for English in the switch-
b oard task 500.000 triphones.
For the nal system, trigram language mo dels have
We mo deled 54000 sept-phones in the context-
b een built on the 730 dialogs of the training set using
dep endent Japanese system and clustered these sept-
a Kneser/Ney backo scheme for unseen bi- resp ec-
phones to 600 decision-tree-clustered p olyphone mo-
tively trigrams.
dels as describ ed in [8]. The nal context-dep endent
Our previous studies suggest that mo deling noises
system has ab out 2000 distributions over the 600
like regular words improves the recognition p erfor-
p olyphone mo dels. The phonetic questions needed
mance mo derately [7]. Breathing and key click noises
for the clustering pro cedure could be derived from
are for example much more common at the b egin-
the German phonetic questions.The resulting context-
ning and the end than in the middle of an utter-
dep endent Japanese sp eech recognition system achieves
ance. Therefore the noise events and hesitations e.g.
13.0 word error rate.
/eeto/ and /ano/ are mo deled like regular words in
computing their language mo del probabilities.
The morphological tagger used to segment the
4. FINAL SYSTEM
Japanese text transcriptions of the training dialogs
lead to a reduction of the vo cabulary size from roughly
For the initial b o otstrap only 190 out of the 800 di-
10.000 to 2500. In doing so we get shorter, more fre-
alogs were used for training the system. This lead
quent sequences. One drawbackis however the re-
to reasonable word accuracy on a system with com-
duction of the predictive power of n-gram mo dels.
paratively few parameters in short time. The nal
Another is the necessity for checking the output of
system however made use of the full data-set for b oth
the segmentation to ol which has to b e done by native
language mo deling and re ning the acoustic mo dels.
exp erts. In [9] a fully automatic statistical approach
Using all training data the dictionary size increased
is describ ed which nd sequences of text that are
to 2598 words.
b oth statistically imp ortant and semantically mean-
ingful.
4.1. Acoustic mo deling
Whereas the data b eing sucient for acoustic
mo deling, the corpus of approximately 300K words Using the full data-set for the estimation of the HMM-
over a 2598 word vo cabulary might still be a little parameters we changed the system structure to a
small for accurate trigramm estimation. However the fully continuous approach with an increased number
task limitation lead to a low p erplexity of 12.8 and of co deb o oks. The p olyphonic tree of all o ccurring
a slow growth of vo cabulary as observed in gure 1, sept-phones containing cross-word mo dels with up
resulting in a OOV rate of only 0.6 on the 17K to one phoneme lo okahead to adjacent words has
words test set. b een clustered to 2000 co deb o oks, each of which has
ural Language Pro cessing and Sp eechTechnol- 4.3. Deco ding
ogy: Results of the 3rd KONVENS Conference,
During recognition a two pass searchis p erformed:
Bielefeld Germany 1996.
in the rst pass the search dictionary is organized as
[4] A. Kurematsu, M. Kitamura, T. Nakasuji, and
a phoneme tree, the second op erates on a at word
A. Waib el: Data Col lection of Japanese Spon-
list containing the most likely word-hyp othesis of the
taneous Speech on Scheduling Task and Devel-
rst pass relative error reduction: 6. The re-
opment of Speech Dictionary in: Pro c. of Eu-
sulting word-lattices are rescored, making full use of
rosp eech, Rho des 1997.
the trigram language mo del relative error reduction:
10. The recognition accuracy of the resulting nal
[5] P. Zhan, K. Ries, M. Gavalda, D. Gates,
system using the full dictionary of 2598 words could
A. Lavie, and A. Waib el: JANUS-II: To-
thus b e improved to 9.3 word error rate.
wards Spontaneous Spanish Speech Recognition
Table 3 compares the word error rate b etween the
in: Pro c. of ICSLP, Philadelphia 1996.
describ ed Japanese systems and the German recog-
[6] B. Wheatley, K. Kondo, W. Anderson, and Y.
nition engine of Karlsruhe University [10], whichwas
Muthusamy: On Evaluation of Cross Language
the winner of the VERBMOBIL evaluation in 1996.
Adaptation for Rapid HMM Development in a
New Language in: Pro c. of ICASSP, Adelaide
5. CONCLUSION
1994.
From the ab ove exp eriments we conclude that the
[7] T. Schultz and I. Rogina: Acoustic and
cross language b o otstrapping of Japanese acoustic
Language Modeling of Human and Nonhu-
mo dels from German mo dels is a very ecient tech-
man Noises for Human-to-Human Spontaneous
nique even for large vo cabulary sp ontaneously sp o-
Speech Recognition in: Pro c. of ICASSP, Detroit
ken sp eech. Phonetic mismatches b etween the Ger-
1995.
man and Japanese phoneme set are tolerable. A Ja-
[8] M. Finke and I. Rogina: Wide Context Acous-
panese LVCSR system with go o d p erformance could
tic Modeling in Read vs Spontaneous Speech in:
b e develop ed in short time. The resulting nal sys-
Pro c. of ICASSP, Munich 1997.
tem is nowintegrated into the VERBMOBIL sp eech-
to-sp eech translation pro ject. With a word error rate
[9] L.J. Tomokyo-May eld and K. Ries: What
of 9.3 it achieves very promising results.
makes a word: Learning base units in Japanese
for speech recognition in: Pro c. of the ACL Nat-
ural Language Learning workshop, Madrid July
6. ACKNOWLEDGMENTS
1997.
The JANUS pro ject is partly funded by grant 413-
[10] M. Finke, P. Geutner, H. Hild, T. Kemp, K.
4001-01IV101S3 from the German Ministry of Sci-
Ries, and Martin Westphal: The Karlsruhe-
ence and Technology BMBF as a part of the VERB-
Verbmobil Speech Recognition Engine in: Pro c.
MOBIL pro ject. We gratefully acknowledge supp ort
of ICASSP, Munich 1997.
and co op eration with ATR Interpreting Telecommu-
nication Lab oratories and the University of Electro-
Communications in Tokyo, Japan. The authors wish
to thank all memb ers of the Interactive Systems Lab-
oratories, esp ecially Michael Finke for useful discus-
sion and active supp ort.
7. REFERENCES
[1] T. Bub, W. Wahlster, and A. Waib el: VERB-
MOBIL: The Combination of Deep and Shal low
Processing for Spontaneous Speech Translation
in: Pro c. of ICASSP, Munich 1997.
[2] T. Matsuoka, K. Ohtsuki, T. Mori, S.
Furui, and K. Shirai: Japanese Large-
Vocabulary Continous-Speech Recognition Us-
ing a Business-Newspaper Corpus in: Pro c. of
ICASSP, Munich 1997.
[3] M. Siegel: De niteness and Number in Japa-
nese to German Machine Translation in: Nat-