Aspects of Phonological Fusion James E
Total Page:16
File Type:pdf, Size:1020Kb
Journal of Experimental Psychology: Human Perception and Performance 1975, Vol. 104, No. 2, 105-120 Aspects of Phonological Fusion James E. Cutting Yale University and Haskins Laboratories, New Haven, Connecticut Phonological fusion occurs when the phonemes of two different speech stimuli are combined into a new percept that is longer and linguistically more complex than either of the two inputs. For example, when PAY is presented to one ear and LAY to the other, the subject often perceives PLAY. The present article is an investigation of the conditions necessary and sufficient for fusion to occur. The rules governing phonological fusion appear to be the same for synthetic and natural speech, but synthetic stimuli fuse more readily. Fusion occurs considerably more often in di- chotic stimulus presentation than in binaural presentation. The phenomenon is remarkably tolerant of differences in relative onset time between the to-be-fused stimuli and of relative differences in fundamental frequency, intensity, and vocal tract configuration. Although phonological fusion is insensitive to such nonlinguistic stimulus parameters, it is sensitive to lin- guistic variations at the semantic, phonemic, and acoustic levels. Most of the dichotic listening literature ments of both stimuli are combined to form has dealt with the phenomenon of perceptual a new percept that is longer and linguistically rivalry. Given a different stimulus presented more complex than either of the two inputs. to each ear at the same time, the subject This phenomenon is called phonological typically reports hearing one or both of fusion because it conforms to phonological them. The different information contained rules of English: Given BANKET/LAN- in each stimulus is not combined into a sin- KET, the subject reports hearing BLAN- gle percept. Thus, for example, given the KET, not LBANKET (Day, Note 1). dichotic digits FOUR/SIX, the subject Nonword fusions also occur; for example, never reports hearing SORE or FIX. Per- GAB/LAB yields GLAB, not LGAB (Day, ceptual fusion does occur when certain vari- 1968). According to phonological rules of ables are taken in account. In several types cluster information in English, initial stop of fusion phenomena, the stimulus variables consonant + liquid clusters are allowed, but that facilitate fusion appear to be psycho- initial liquid + stop clusters are not. Day linguistic in nature. For example, given the (1970) found that when stimuli are pre- dichotic pair PAHDUCT/RAHDUCT, the sented dichotically with these constraints re- subject often reports hearing PRODUCT moved, fusion occurs in both directions. (Day, 1968). In this type of fusion, seg- Given the stimuli TASS/TACK, for exam- This research was supported by National Insti- ple, the subject reports hearing TASK on tute of Child Health and Human Development some trials and TACKS on others. Both Grant HD-01994 to the Haskins Laboratories, and /sk/ and /ks/ clusters are permissible in was based in part on a PhD thesis submitted to final position in English. the Department of Psychology of Yale University. Thanks go to Ruth S. Day, with whom I have Phonological fusion cannot be explained as never had a colorless conversation; Franklin S. a response bias for acceptable English words. Cooper, with whom I have never had a fruitless Day (1968) found that when different pro- conversation; and the members of the New Haven ductions of PAHDUCT are presented to Dance Ensemble, with whom I never talked about this at all. each ear, subjects report hearing PAH- The author is now at Wesleyan University, DUCT. That is, they do not report hearing Middletown, Connecticut and Haskins Laborato- the acceptable English word that corre- ries. Requests for reprints should be sent to James E. Cutting, Haskins Laboratories, 270 sponds most closely to the nonword inputs. Crown Street, New Haven, Connecticut 06510. Likewise, RAHDUCT/RAHDUCT yields 105 106 JAMES E. CUTTING RAHDUCT. Only when the stimuli are each formant gliding to the resting frequency of PAHDUCT/RAHDUCT does fusion occur, the following vowel. Stop consonants began with 50 msec of formant transitions gliding into the and it occurs regardless of which stimulus following vowel. Voice onset time was 0 msec is presented to which ear. In the present for voiced consonants and +50 msec for voiceless article, I will consider a fusion response to consonants. Natural speech versions of the same occur when a stimulus pair, each item of items were spoken by the author and recorded on audio tape. An effort was made to match utter- which has n phonemes, yields a percept of ances in terms of pitch, intensity, and duration, n + 1 phonemes (following Day, 1968). but some variation was unavoidable. Both syn- The studies presented here are an anal- thetic and natural speech stimuli were digitized ysis of phonological fusion and the condi- and stored on disk file for preparation of diotic tions necessary and sufficient for fusion to and dichotic tapes (Cooper & Mattingly, 1969). Subjects. Sixteen Yale University undergradu- occur. Of primary interest are the modes ates participated in identification and fusion tasks. of presentation that optimize the occurrence Each received course credit for his or her efforts. of fusion responses and the stimulus vari- As in all experiments in this paper, subjects were ables under those conditions that enhance right-handed native American English speakers with no history of hearing trouble and with no and detract from the fused percept. Both experience listening to synthetic speech. They linguistic and physical (nonlinguistic) vari- listened in groups of four to tapes played on an ables are of interest. Ampex AG-500 dual-track tape recorder. Audi- To undertake such a project, one must be tory signals were sent through a listening sta- tion to matched Telephonies earphones (Model able to control the stimuli in a precise man- TDH39). Gains on the tape recorder and listen- ner, and such control is only available with ing station were adjusted so that stimuli were pre- a computer-driven speech synthesizer. How- sented at approximately 80 db. (re 20 /u,N/m2). ever, synthetic speech has a characteristic These criteria and procedures were used for all experiments in the present article. No subject "cold-in-the-nose" quality which may effect served in more than one experiment. The fusion phonological fusion. To insure that the task in the present study preceded the identification present studies are not concerned with arti- task, but it is more instructive to consider them facts inherent to synthetic speech, fusion was in reverse order. first observed for synthetic and natural Task 1: Identification of Individual Stimuli speech tokens of the same items. All stimuli, natural and synthetic, were recorded on a diotic identification tape. The tape consisted SYNTHETIC VERSUS NATURAL SPEECH of a random sequence of 120 trials: (4 sets of Experiment 1: Baseline Data stimuli) X (2 versions of each set: natural and synthetic) X (3 stimuli per set) X (5 observa- General Method tions per stimulus). Subjects wrote down the en- tire word that they heard presented. There was a Tape preparation. Four sets of stimuli of the 3-sec interval between items. same general pattern were selected: the PAY set (PAY, LAY, RAY) ; the BED set (BED, LED, Results RED) ; the CAM set (CAM, LAMB, RAM) ; and the GO set (GO, LOW, ROW). Two ver- All stimuli were highly identifiable. Each sions of each stimulus were prepared, one in syn- synthetic and natural speech item was cor- thetic speech and one in natural speech. Synthetic stimuli were prepared on the Haskins Laborato- rectly identified on better than 93% of all ries' parallel-resonance synthesizer. All stimuli trials. Cluster responses (such as PLAY) were identical in pitch contour, rising from 116 to were reported on less than 2% of all trials. 124 Hz and then falling to 82 Hz. Within each Identification tasks were a part of each ex- set all stimuli were identical in duration (3SO, 450, 400, and 350 msec for the four sets, respectively), periment presented in this article. Since the and all had the same acoustic structure except for stimuli and results generally did not differ the initial ISO msec. Liquid stimuli within each from those in Experiment 1, the identifica- set (those beginning with /!/ and /r/) differed tions are not discussed in all experiments. only in the direction and extent of the third- formant transition. This acoustic cue distinguishes Task 2: Identification of Fusible Pairs the two liquids (O'Connor, Gerstman, Liberman, D&lattre, & Cooper, 1957). Each liquid stimulus All stimuli used in the identification task were began with 50-msec steady state prerelease formant also used in the fusion task. However, instead of resonances, followed by 100-msec transitions in presenting one stimulus at a time, two stimuli PHONOLOGICAL FUSION 107 were presented, one to each ear. The dichotic pair TABLE 1 consisted of a stop stimulus and a liquid stimulus PERCENT FUSION RESPONSES FOR EACH PAIR IN from the same set; for example, PAY/LAY or EXPERIMENT 1, IN BOTH SYNTHETIC SPEECH PAY/RAY. All stimuli and possible fusions were AND NATURAL SPEECH VERSIONS monosyllabic, high-frequency English words: PAY/ LAY -» (yields) PLAY, PAY/RAY -» PRAY, Percent fusion BED/LED -» BLED, BED/RED -» BREAD, Fusible pair CAM/LAMB -» CLAM, CAM/RAM -> CRAM, Synthetic Natural M GO/LOW -» GLOW, and GO/ROW -» GROW. Most of these items and fusions have Thorndike- PAY/LAY 78 41 59 Lorge (1944) word frequencies of at least 100 per PAY/RAY 70 33 51 million. No alveolar stop-consonant stimuli were (55) chosen because /tl/ and /dl/ clusters do not ap- BED/LED 48 33 40 pear in initial position in English. BED/RED 46 19 33 (37) Two dichotic tapes were prepared, one using CAM/RAM 63 29 46 synthetic stimuli and one using natural speech CAM/LAMB 57 17 37 stimuli.