8th International Seminar on Speech Production (ISSP 2008)

Performing Identification and Discrimination Experiments for Vowels and Voiced Plosives by Using a Neurocomputational Model of Speech Production and Speech Perception

Bernd J. Kröger, Jim Kannampuzha, Christiane Neuschaefer-Rube
RWTH Aachen University and University Hospital Aachen
E-mail: [email protected], [email protected], [email protected]

Abstract

A neurocomputational model of speech production and speech perception is introduced. After training, i.e. after mimicking early phases of speech acquisition, the model is capable of producing and perceiving vowels and CV-syllables (C = voiced plosives). Different instances of the model were trained for representing different "virtual subjects", which are then used as listeners in identification and discrimination experiments. First results indicate that a typical feature of speech perception – i.e. categorical perception – emerges in a straightforward way from our neurocomputational production-perception model.

1 Introduction

Our work on this modeling topic started with the development of a neural model of speech production (Kröger et al. 2006a and Kröger et al. 2006b). The work was inspired by the approach introduced by Guenther et al. (2006) and Guenther (2006). A main feature of our model is the separation of the control module into feedforward and feedback control components. First training results indicated that, at the neural level of sensorimotor associations, motor and sensory states are ordered with respect to phonetic features like "front-back" and "high-low" in the case of vowels and with respect to phonetic features like "place of articulation" in the case of voiced plosives. This phenomenon is called "phonetotopy" (Kröger et al. in press). Since sensory (auditory and somatosensory) feedback is important for obtaining a phonetotopic ordering of speech items, the feedback loop of the production model was extended in order to integrate perception for speech sounds and syllables. This paper describes how the model can now be used in experiments focusing on speech sound identification and discrimination.

2 The Model

The organization of the model is given in Fig. 1. The model comprises neural maps for motor planning and motor programming/execution, for auditory and somatosensory representations of speech items (sounds, syllables, or words), for phonetic features, and for linguistic information (i.e. phonemic representations of speech items). The neural mappings connecting these maps are trained using pre-linguistic proto-vocalic and proto-consonantal training data (babbling training data) and later on language-specific vocalic and consonantal training data (imitation training data, Kröger et al. in press). Thus the model separates neural structure (i.e. maps) and knowledge (i.e. synaptic link weight adjustment for the mappings during training).

For speech production, feed-forward and feedback control loops (or control streams) are implemented. For speech perception, the dual-stream approach (Hickok and Poeppel 2007) is implemented. The ventral stream can be interpreted in terms of passive, non-motor-assisted perception using a direct link from the auditory maps to the mental lexicon, while the dorsal stream – activating frontal motor areas – uses existing neural networks of the feed-forward speech production part of the model. It is assumed that the dorsal stream of speech perception is mainly active during lower-level perception tasks like


recognition of phonetic-phonological sound features, while the ventral stream directly activates lexical items as a whole. For the perception experiments described in this paper, exclusively the dorsal stream is used.

Figure 1: Neurocomputational model of speech production and speech perception. Boxes with black outline represent neural maps; arrows indicate processing paths or neural mappings. Boxes without black outline indicate processing modules (not specified in detail currently).

Feed-forward control starts with a linguistic representation of a speech item, activated on the level of the phonemic map. Each frequent syllable activates its prestored sensory states (auditory and somatosensory; somatosensory comprises tactile and proprioceptive states) and its prestored motor plan state (a detailed description of motor plan states is given in Kröger et al. in press). Prestored motor and sensory states are trained or learned during speech acquisition training stages. This results in a self-organization of the phonetic map and its phonetic-to-sensory and phonetic-to-motor-plan mappings (Kröger et al. in press). The motor plan of an infrequent syllable is generated by the motor planning module. The motor plan of frequent as well as of infrequent speech items is executed via the motor execution module and leads to a time course of positions and velocities for all model articulators of the vocal tract model. A description of the vocal tract model is given by Birkholz et al. (2006), Birkholz and Kröger (2006 and 2007), and by Kröger and Birkholz (2007). The vocal tract model is capable of generating articulatory movement patterns (i.e. location, velocity, and shape of each model articulator for each time instant) and the acoustic speech signal. These output signals of the vocal tract model serve as input information for the feedback control component. Following preprocessing, the current sensory (auditory and somatosensory) signals activate neurons within the primary sensory maps. Then a comparison of the current sensory state with the prestored sensory state for the speech item under production is done within the sensory-phonetic processing modules. As a result, an error signal is calculated for correcting the current motor execution, if the current sensory state deviates from the stored sensory state for the current speech item.
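The sensory comparison step described above can be sketched in a few lines. This is a minimal illustration, not the model's implementation: the state vectors, gain, and deviation threshold are invented for the example.

```python
import numpy as np

def feedback_error(current_state, prestored_state, gain=0.5, threshold=0.05):
    """Compare the current sensory state with the prestored target state
    and return a corrective signal if the deviation exceeds a threshold.
    Gain and threshold are illustrative assumptions."""
    deviation = prestored_state - current_state
    if np.linalg.norm(deviation) < threshold:
        return np.zeros_like(deviation)   # state matches the target: no correction
    return gain * deviation               # corrective signal toward the target

# Example: an auditory state as a bark-scaled formant vector (F1-F3);
# the concrete numbers are invented for illustration.
target = np.array([2.5, 12.0, 14.5])      # prestored auditory state
current = np.array([3.0, 11.5, 14.5])     # slightly deviating production
correction = feedback_error(current, target)
print(correction)                          # non-zero correction toward the target
```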


3 Experiments using the trained model

Identification and discrimination experiments are based on a vocalic or a consonantal stimulus continuum (e.g. an [i]-[e]-[a] continuum or a [ba]-[da]-[ga] continuum). Each stimulus continuum comprises 13 stimuli in the case of our experiments. Classic identification and discrimination experiments are performed by human subjects. In identification experiments, subjects are asked to assign each stimulus to a phoneme category (forced choice test). In discrimination experiments, subjects are asked to determine whether two stimuli are identical or not. The discrimination experiment is always designed in such a way that the two stimuli are never identical but exhibit a constant (small) distance on the physical scale defining the stimulus continuum.

In this study, human subjects are replaced by "virtual subjects" or "virtual listeners". Each virtual listener is realized by an instance of our neurocomputational model. All instances of the model share the same organization as given in Fig. 1, but they are trained using (i) a different (random) initialization of link weights and (ii) a different (random) ordering of the training items during learning. After training, these different random initializations and orderings lead to qualitatively the same organization of the phonetic map and of the phonetic-to-sensory, phonetic-to-motor-plan, and phonetic-to-phonemic mappings. However, the phonetic map and the related mappings exhibit individual differences from model instance to model instance. Examples of phonetic maps of different model instances are given in Fig. 2 (for training details see Kröger et al. in press and Kröger et al. accepted).
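The construction of such a 13-step continuum can be sketched as follows, assuming stimuli are represented as feature vectors (here bark-scaled F1/F2 values, whose concrete endpoint numbers are invented for illustration and are not the model's data).

```python
import numpy as np

N_STIMULI = 13  # number of stimuli per continuum, as in the experiments above

def make_continuum(endpoint_a, endpoint_b, n=N_STIMULI):
    """Linearly interpolate n stimuli between two endpoint feature vectors,
    so that neighboring stimuli exhibit a constant distance on the scale."""
    steps = np.linspace(0.0, 1.0, n)[:, None]
    return (1.0 - steps) * endpoint_a + steps * endpoint_b

stim_i = np.array([2.5, 13.5])   # assumed [i] endpoint (F1, F2 in bark)
stim_a = np.array([7.0, 10.5])   # assumed [a] endpoint
continuum = make_continuum(stim_i, stim_a)
print(continuum.shape)           # 13 stimuli, each a 2-dimensional feature vector
```

In an identification run, each virtual listener would then label every one of the 13 stimuli in a forced-choice fashion; in a discrimination run, pairs of stimuli at a fixed step distance along this continuum would be presented.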

Figure 2: Self-organizing phonetic map (15x15 neurons) for vowels for three different instances of the neurocomputational model. Bars from left to right within each neuron square: phonemic link weight values for /i/, /e/, /a/, /o/, and /u/. Horizontal lines: bark-scaled values of the first three formants, i.e. auditory link weight values. Outlined boxes indicate phonemic link weight values above 80% for a phoneme and thus mark neurons representing different realizations of that phoneme. Neurons realizing the same phoneme form clusters within the phonetic map.
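A self-organizing map of the general kind shown in Fig. 2 can be sketched in a few lines. The 15x15 grid size matches the figure; the training data (random formant-like vectors), learning rate, and neighborhood width are illustrative assumptions, not the training procedure of Kröger et al.

```python
import numpy as np

rng = np.random.default_rng(0)
GRID = 15                                              # 15x15 map, as in Fig. 2
weights = rng.uniform(0.0, 1.0, (GRID, GRID, 3))       # auditory (F1-F3) link weights
coords = np.stack(np.meshgrid(np.arange(GRID), np.arange(GRID),
                              indexing="ij"), axis=-1)

def train_step(x, lr=0.1, sigma=2.0):
    """One SOM update: find the best-matching unit for input x and pull
    it and its grid neighbors toward x (Gaussian neighborhood)."""
    global weights
    dist = np.linalg.norm(weights - x, axis=-1)
    bmu = np.unravel_index(np.argmin(dist), dist.shape)        # best-matching unit
    grid_dist = np.linalg.norm(coords - np.array(bmu), axis=-1)
    h = np.exp(-(grid_dist ** 2) / (2 * sigma ** 2))           # neighborhood kernel
    weights += lr * h[..., None] * (x - weights)

# Train on random normalized formant-like vectors (illustrative only);
# with real vowel data, similar inputs come to cluster on the map.
for _ in range(500):
    train_step(rng.uniform(0.0, 1.0, 3))
```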


4 Modeling Identification and Discrimination

While a speech sound (e.g. a vowel) is produced by activating the appropriate neuron in the phonemic map and subsequently a neuron of the phonetic map (see Fig. 2 and Kröger et al. accepted), speech sound identification is done by looking for the most activated neuron within the phonemic map as activated by the neural auditory pathway. The sensory state of the speech sound under identification activates a neuron within the sensory map and subsequently neurons in the phonetic and in the phonemic map (dorsal stream of speech perception; the ventral stream deals with more complex speech items like words). It is assumed in our model that the ability for sound discrimination is proportional to the distance of activation of the two stimuli on the level of the phonetic map. Thus, discrimination of two speech sounds is easy if the distance between the neurons mainly activated by the two sounds is large. Discrimination becomes more and more difficult as the distance between the two speech items on the level of the phonetic map decreases (Kröger et al. accepted).

5 Results

The results of our preliminary perception experiments performed using our neurocomputational model are in agreement with the results of classical identification and discrimination experiments performed by humans. Identification scores exhibit typical phoneme regions and phoneme boundaries. Measured discrimination (i.e. discrimination scores resulting from discrimination experiments) is higher than calculated discrimination (i.e. discrimination calculated on the basis of individual identification scores) in the case of vowels, while both are comparable in the case of consonants, indicating that consonant perception is more categorical than vowel perception (Kröger et al. accepted).
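The measures described above can be sketched in simplified form: identification picks the most activated phonemic unit, modeled discriminability grows with the distance between the mainly activated phonetic-map neurons, and a simple label-based predictor (a common approach for deriving "calculated discrimination" from identification scores, not necessarily the authors' exact formula) estimates how often two stimuli would receive different labels.

```python
import numpy as np

PHONEMES = ("i", "e", "a", "o", "u")   # vowel categories as in Fig. 2

def identify(phonemic_activation, phonemes=PHONEMES):
    """Forced-choice identification: the most activated phonemic unit wins."""
    return phonemes[int(np.argmax(phonemic_activation))]

def modeled_discrimination(bmu_a, bmu_b):
    """Discriminability proportional to the distance between the two
    best-matching neurons on the phonetic map (grid coordinates)."""
    return float(np.linalg.norm(np.asarray(bmu_a, float) - np.asarray(bmu_b, float)))

def predicted_discrimination(p_a, p_b):
    """Label-based predictor: probability that two stimuli, with
    identification score vectors p_a and p_b, receive different labels."""
    return 1.0 - float(np.dot(p_a, p_b))

print(identify([0.1, 0.7, 0.1, 0.05, 0.05]))                        # most active unit
print(predicted_discrimination([1, 0, 0, 0, 0], [0, 1, 0, 0, 0]))   # across a boundary
print(predicted_discrimination([1, 0, 0, 0, 0], [1, 0, 0, 0, 0]))   # same category
```

Under this predictor, stimulus pairs straddling a phoneme boundary are maximally discriminable while within-category pairs are not, which is the signature of categorical perception noted in the results above.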
6 Acknowledgments

This work was supported in part by the German Research Council, Grant No. KR 1439/13-1.

References

Birkholz P, Jackel D, Kröger BJ (2006) Construction and control of a three-dimensional vocal tract model. Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2006, Toulouse, France) pp. 873-876

Birkholz P, Kröger BJ (2006) Vocal tract model adaptation using magnetic resonance imaging. Proceedings of the 7th International Seminar on Speech Production (Belo Horizonte, Brazil) pp. 493-500

Birkholz P, Kröger BJ (2007) Simulation of vocal tract growth for articulatory speech synthesis. Proceedings of the 16th International Congress of Phonetic Sciences (Saarbrücken, Germany) pp. 377-380

Guenther FH (2006) Cortical interaction underlying the production of speech sounds. Journal of Communication Disorders 39: 350-365

Guenther FH, Ghosh SS, Tourville JA (2006) Neural modeling and imaging of the cortical interactions underlying syllable production. Brain and Language 96: 280-301

Hickok G, Poeppel D (2007) Towards a functional neuroanatomy of speech perception. Trends in Cognitive Sciences 4: 131-138

Kröger BJ, Birkholz P, Kannampuzha J, Neuschaefer-Rube C (2006a) Learning to associate speech-like sensory and motor states during babbling. Proceedings of the 7th International Seminar on Speech Production (Belo Horizonte, Brazil) pp. 67-74

Kröger BJ, Birkholz P, Kannampuzha J, Neuschaefer-Rube C (2006b) Modeling sensory-to-motor mappings using neural nets and a 3D articulatory speech synthesizer. Proceedings of the 9th International Conference on Spoken Language Processing (Interspeech 2006 – ICSLP, Pittsburgh, Pennsylvania) pp. 565-568

Kröger BJ, Birkholz P (2007) A gesture-based concept for speech movement control in articulatory speech synthesis. In: Esposito A, Faundez-Zanuy M, Keller E, Marinaro M (eds.) Verbal and Nonverbal Communication Behaviours, LNAI 4775 (Springer Verlag, Berlin, Heidelberg) pp. 174-189, http://dx.doi.org/10.1007/978-3-540-76442-7_16

Kröger BJ, Kannampuzha J, Neuschaefer-Rube C (accepted) Towards a neurocomputational model of speech production and perception. Speech Communication, http://dx.doi.org/10.1016/j.specom.2008.08.002

Kröger BJ, Kannampuzha J, Lowit A, Neuschaefer-Rube C (in press) Phonetotopy within a neurocomputational model of speech production and speech acquisition. In: Fuchs S, Loevenbruck H, Pape D, Perrier P (eds.) Some aspects of speech and the brain (Peter Lang, Berlin) pp. 59-90
