Taming the Heathkit-Votrax Speech Synthesizer
Behavior Research Methods, Instruments, & Computers
1990, 22 (2), 256-259

PHILLIP L. EMERSON, DORIS C. KARNISKY, and CARLA J. KASTANIS
Cleveland State University, Cleveland, Ohio

Three inexpensive text-to-speech synthesizers are described, intelligibility data from a pilot experiment are reported, and software is offered that has been written to facilitate the phonemic programming of the Heathkit-Votrax synthesizer.

Correspondence should be addressed to Phillip L. Emerson, Cleveland State University, Cleveland, OH 44115.

Copyright 1990 Psychonomic Society, Inc.

Our main contribution is a set of IBM PC programs designed to ease the task of finding intelligible phoneme sequences to substitute for English word spellings in programming the Heathkit HV2000 speech synthesizer, which is based on the Votrax SC02 IC chip. However, it would be hard to understand why such software is important and useful, without a bit of background in the rapidly developing area of computer speech processing. Therefore, we will begin with some distinctions and facts concerning the art of speech processing on personal computers. Then we will briefly describe some experimental data that we have collected, concerning the intelligibilities of three inexpensive synthesizers, including the Heathkit-Votrax. Our experiences in conducting those experiments led us to realize the need for software of the kind that we are now making available to other scientific and educational users.

Speech Processing on Small Computers

A large tree diagram might be the best way to organize all that is known about speech processing on small computers. We mention this large tree only to keep our own modest twig of a contribution in perspective; there are many large and interesting limbs that cannot be explored here. The first big fork occurs between computer recognition and computer production of speech. Our software is concerned with speech production, and up that branch we find that the next main fork has three limbs. These are three different methods, which can be ordered along a dimension that consists of the amounts of high-speed primary memory required. They are: (1) pulse code modulation, or PCM; (2) linear predictive coding, or LPC; and (3) formant coding, or FC. This nomenclature is not entirely self-explanatory, but it has become rather standard.

PCM is the straightforward method in which a sound wave is represented as a digitized recording of amplitude numbers taken at equally spaced time intervals. As a general rule, the Nyquist requirement for intelligible speech is about 8,000 PCM samples per second. Thus, if each amplitude is encoded as an 8-bit number (which is often found adequate, as well as convenient), a 64K segment of memory is required to produce an 8-sec stretch of PCM speech. The computer produces the sound by putting the amplitude numbers through a D/A converter at equally spaced time intervals, to an audio amplifier and speaker. A personal computer CPU can make 8,000 D/A conversions per second, but it has little time left over to do other things concurrently. The 64K segment is mentioned because the Intel 8088/86 computer design makes it easy to address 64K of data memory, but it is a bit of a programming nuisance to address more. Popular versions of C and Pascal for MS-DOS have the capability to address more than 64K, but some versions of BASIC do not.

LPC takes advantage of commonly occurring stretches of the acoustic wave of natural speech, in which one PCM sample is very closely predictable from the preceding one or from a short preceding series. Thus, a predictive algorithm can be invoked, sometimes implemented as a slave CPU programmed with ROM, to fill in with the needed PCM outputs, thereby leaving the central CPU freer to do other chores. The central CPU is not in such a tight timing loop, and its primary memory representation of the sound wave can be much more compact. The memory representation is a record, in some form, of occasional changes that need to be made in the output sound wave, rather than a detailed record of the sound wave itself.

FC is a more direct attempt to utilize findings from research on natural speech. There are several broad bands of frequencies observable in speech spectrograms that have a lot to do with how speech sounds are perceived. They are called formants, and each may change in center frequency for different vowel sounds. Such changes tend to be fairly smooth, except at boundaries between voiced and unvoiced speech segments. The formants tend to disappear for nonvoiced consonants, which look more like a splattering in the spectrogram than a set of bands, but the formant transitions between voiced and unvoiced segments are important even for the perception of unvoiced phonemes. Before small computers were available for speech processing, speech was synthesized by an FC method of sketching formants on a chart, which was then fed to a device much like a player piano.

LPC and FC implementations are often based on mathematical models of the acoustical characteristics of the vocal system. The goals of LPC and FC are to produce acceptable speech from a much smaller memory representation than is needed with PCM, and with lower data transfer rates. The catch is that most inexpensive implementations do not sound very natural, and that sometimes they are not very intelligible, whereas PCM sounds essentially the same as the natural voice that produced the sound that has been recorded.

The Three Inexpensive Synthesizers

The Appendix gives characteristics of the three synthesizers that were available to us for testing on an IBM PC. The main purpose of the Heathkit and B. G. Micro products is to have the computer produce artificial speech from ASCII text. That function is included with the COVOX product, but as just one option. The COVOX system also includes a PCM mode, which is used with some of the supplied software to do experimental editing of soundwaves. We have not yet tried to use the COVOX system in the PCM mode, because so far we have been focusing on the production of speech from text.

Each synthesizer uses its own set of rules for the text-to-sound conversions. These rules are built into the synthesizers and are not specified explicitly in the user documentation. However, it is known from other sources that FC is the method of the Votrax IC used by Heathkit, and that LPC is the method of the Smoothtalker software (from First Byte, Inc.) used by COVOX. It is quite evident from listening to each synthesizer that the set of rules is a crude subset of those that would be needed to produce high-quality speech. For this reason, each synthesizer comes with facilities for using an auxiliary software dictionary file (that the user must create) containing alternate spellings of words or phrases that are mispronounced by the built-in rules. Before conducting the experiment to be described, the experimenters worked for a number of hours to construct such a lookup file for each synthesizer, repeatedly trying different spellings to get a set of common words to sound acceptable to our ears. This turned out to be a very tedious and fatiguing task, requiring much retyping of almost the same strings while trying to remember the sounds produced by the various retypings. This is the task for which we developed the new software to be described below.

The Experiment

This experiment is described quite briefly, because it was little more than a pilot study to teach us more about the needed methodology than about the synthesizers per se. Each of three groups of 10 subjects was tested for the correct recognition of 99 words (English versions of Swadesh's universal concepts; Miller, 1981), with one of the three synthesizers. Each word was presented in isolation on a single trial, in which the listener was allowed up to two repetitions of the word but was penalized in the scoring for each repetition. The test for overall differences among the synthesizers was not significant at the .05 level. However, the word main effects were highly significant, and the word × synthesizer interaction was very significant. Thus, across synthesizers, some words were consistently intelligible and some unintelligible, and other words were quite intelligible with one synthesizer but not with another. The Venn diagrams of Figures 1 and 2 show the best and worst words, for each synthesizer, and the overlapping subsets.

Figure 1. Words with average intelligibility scores below 17%, for three synthesizers. (Venn diagram; word lists not reproduced.)

Figure 2. Words with average intelligibility scores above 83%, for three synthesizers. (Venn diagram with regions labeled B. G. MICRO, COVOX, and HEATHKIT; word lists not reproduced.)

The grand mean of intelligibility scores averaged over synthesizers, words, and subjects was about 43%. This may seem too low for any useful application, but it is really not as bad as it may seem.

a word, playback is obtained whenever desired by the stroke of a certain key. Thus, the usual editing functions of BACKSPACE, DELETE, INSERT, OVERSTRIKE,
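The PCM storage figures quoted earlier in this article (about 8,000 samples per second, 8 bits per sample, and a 64K segment for roughly 8 sec of speech) can be checked with a few lines of arithmetic. The Python sketch below is only an illustration and is not part of the software described here; the function names are hypothetical, and the default values are simply the numbers given in the text.

```python
# Worked check of the PCM figures quoted in the text.
# Assumptions (taken from the article): 8,000 samples/sec, 8 bits/sample,
# one 64K memory segment. Function names are hypothetical.

SAMPLE_RATE_HZ = 8_000      # samples per second
BITS_PER_SAMPLE = 8         # one byte per amplitude value
SEGMENT_BYTES = 64 * 1024   # one 64K segment


def pcm_bytes(seconds, sample_rate_hz=SAMPLE_RATE_HZ,
              bits_per_sample=BITS_PER_SAMPLE):
    """Bytes needed to hold `seconds` of uncompressed single-channel PCM."""
    return seconds * sample_rate_hz * bits_per_sample // 8


def seconds_in_segment(segment_bytes=SEGMENT_BYTES,
                       sample_rate_hz=SAMPLE_RATE_HZ,
                       bits_per_sample=BITS_PER_SAMPLE):
    """Duration of PCM speech that fits in a memory segment of the given size."""
    return segment_bytes * 8 / (sample_rate_hz * bits_per_sample)


def sample_period_us(sample_rate_hz=SAMPLE_RATE_HZ):
    """Time available between successive D/A conversions, in microseconds."""
    return 1_000_000 / sample_rate_hz


if __name__ == "__main__":
    print(pcm_bytes(8))            # 64000 bytes for 8 sec of speech
    print(seconds_in_segment())    # 8.192 sec fit in one 64K segment
    print(sample_period_us())      # 125.0 microseconds per sample
```

The 125-microsecond budget per sample gives a rough sense of why a CPU that performs the D/A conversions itself has little time left for other work.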