US005706392A United States Patent (19) 11 Patent Number: 5,706,392 Goldberg et al. 45 Date of Patent: Jan. 6, 1998

54 PERCEPTUAL SPEECH CODER AND 4,856,068 8/1989 Quatieri, Jr. et al...... 395/2.36 METHOD 4,972,484 11/1990 Theile et al...... 395/2.36 5010,574 4/1991 Wang ... 395/2.31 5,054,073 10/1991 Yazu ... 395/2.4 75 Inventors: Randy G. Goldberg, Princeton; James 5,157,760 10/1992 Akagiri ... 395/2.36 L. Flanagan, Warren, both of N.J. 5,264.846 11/1993 Oikawa ...... 395/2.36 5,305,420 4/1994 Nakamura et al ...... 395/2.8 73) Assignee: Rutgers, The State University of New Jersey, Piscataway, N.J. Primary Examiner-Benedict V. Safourek Attorney, Agent, or Firm-Weil, Gotshal & Manges LLP 21 Appl. No.: 457.517 57 ABSTRACT 22 Filed: Jun. 1, 1995 Simultaneous and temporal masking of digital speech data is 51) Int. Cl. ... G10 L9/18 applied to an MBE-based speech coding technique to achieve additional, substantial compression of coded speech 52) U.S. Cl...... 395/2.19; 395/2.36 over existing coding techniques, while enabling synthesis of 58 Field of Search ...... 395/2.17, 2.19, coded speech with minimal perceptual degradation relative 395/2.23, 2.24, 2.35. 2.36, 2.38: 455/72; to the human auditory system. A real-time perceptual coder 341/87, 76 and decoder is disclosed in which speech may be sampled at References Cited 10 kHz. coded at an average rate of less than 2 bits/sample, 56) and reproduced in a manner that is perceptually transparent U.S. PATENT DOCUMENTS to a human listener. The coder compresses speech segments 3,349,183 10/1967 Campenella ...... 395/2.91 that are inaudible due to simultaneous or temporal masking, 3,715,512 2/1973 Kelly ...... 395/228 while audible speech segments are not compressed. 4,053,712 10/1977 Reindl ...... 395/2.24 4,461,024 7/1984 Rengger et al...... 395/242 5 Claims, 7 Drawing Sheets

NPUT DEVICE 101 ANAOG SPEECH INPUT 100 SIGNAL 1 O ANALOG ANALYSS MODULE 102 DGITAL SPEECH 04 SIGNAL 06 108 110 MBE ANALYSS MODULE W/UV FUNDAMENTAL BITS FREQUENCY MAGNTUDESCOMPLEX 2 114 116 AUDITORY ANALYSISM ODULE PERCEPTUAL WEIGHT LABELS AUDIBITY THRESHOLDING 120 MODULE ADJUSTED COMPLEX 22 MAGNTUDES

QUANTIZATION MODULE

CUANTIZED 126 128- QUANTIZED REAL FunOAMENTAL MAGNITUDES FREGUENCY 130 NFORMATION PACKING MODUE

OUTPUT DATA STREAM 132 U.S. Patent Jan. 6, 1998 Sheet 1 of 7 5,706,392

ANALOG SPEECH INPUT SGNAL 10 102 DIGITAL SPEECH 108 SIGNAL 106 is 110 W/UV FUNDAMENTAL FREGUENCY MAGNITUDEScoMPLEX 112- 114 AUDITORY ANALYSS MODULE

PERCEPTUAL WEIGHT LABELS AUDBLITY THRESHOLDING MODULE

ADJUSTED COMPLEX MAGNITUDES

GUANTZATION MODULE

GUANTIZED 126 OUANTIZED REAL FUNDAMENTA MAGNUDES FRECUENCY

130 NFORMATION PACKING MODULE

OUTPUT DATA STREAM 132 FG.1 U.S. Patent Jan. 6, 1998 Sheet 2 of 7 5,706,392

ACOUSTIC MASKING

RELATIVE TIME (rns) RELATIVE FREQ. (Hz)

FG.2 U.S. Patent Jan. 6, 1998 Sheet 3 of 7 5,706,392

SPECTROGRAM AREASTEMPORALLY

- r - a m - - - - -a - - AND FREGUENCY t MASKED iI : 302

| l / - INTENSE FREOUENCY COMPLEX SIGNAL 300

TIME FIG.3 U.S. Patent Jan. 6, 1998 Sheet 4 of 7 5,706,392

CJBOIOA

ONS

ONTS

1-2|9uu29uueB1-,

S30N3 no MO

U.S. Patent Jan. 6, 1998 Sheet 5 of 7 5,706,392

* Un Voiced

Un voiced

Un voiced

†29. /oi 089 BONETIS£ U.S. Patent Jan. 6, 1998 Sheet 6 of 7 5,706,392

DECODER 602 INPUT STREAM 600

604 UNPACKING MODULE

FUNDAMENTAL REA Vf UV REGUENCY MANUDEs 610 BITS >60 608 1 612 s MBE SPEECH SYNTHESIS MODULE SYNTHESIZED DIGITAL SPEECH 616 SIGNAL

ANALOG SYNTHESIS MODULE 618

ANALOG SPEECH 620 OUTPUT SIGNAL OUTPUT DEVICE 622

FG.6 U.S. Patent Jan. 6, 1998 Sheet 7 of 7 5,706,392

710

702 7 14 SERIAL SERIAL 32 BIT BUS N OUT SHARED TMNG 32 BT BUS SRAM NETWORK DSP PROCESSOR

32 BIT BUS

GENERAL-PURPOSE PROCESSOR 5,706,392 1. 2 PERCEPTUAL SPEECH CODER AND compressed form. A period of silence may be determined by METHOD comparison to a reference level, which may vary with frequency. Such coding techniques are illustrated, for BACKGROUND OF THE INVENTION example, in U.S. Pat. No. 4,053,712 to Reindl, titled "Adap tive Digital Coder and Decoder," and U.S. Pat No. 5,054. 1. Field of the Invention 073 to Yazu, titled "Voice Analysis and Synthesis Dependent The present invention relates generally to speech coding, Upon a Silence Decision." and in particular, to a method and apparatus for perceptual In addition, it is known that in a complex spectrum of speech coding in which monaural masking properties of the sound, certain weak components in the presence of stronger human auditory system are applied to eliminate the coding 10 ones are not detectable by a human's auditory system. This of unnecessary signals. occurs due to a process known as monaural masking, in 2. Description of the Prior Art which the detectability of one sound is impaired by the Digital transmission of coded speech is becoming increas presence of another sound. Simultaneous masking relates to ingly important in a wide variety of applications, such as frequency. If a tone is sounded in the presence of a strong multi-media conferencing systems, cockpit-to-tower speech 15 tone nearby in frequency (particularly in the same critical transmissions for pilot/controller communications, and band but this is not essential), its threshold of audibility is wireless telephone transmissions. By reducing the amount of elevated. Temporal masking relates to time. If a loud sound data needed to code speech, one may optimally utilize the is followed closely in time by a weaker one, the loud sound limited resources of transmission bandwidth. The impor can elevate the threshold of audibility of the weaker one and tance of efficient digital storage of coded speech is also 20 render it inaudible. Temporal masking also arises, but to a becoming increasingly important in contexts such as voice lesser extent, when the weaker sound is presented prior to messaging, answering machines, digital speech recorders, the strong sound. Masking effects are also observed when and storage of large speech databases for low bit-rate speech one or both sounds are bands Of ; that is, distinct coders. Economies in storage memory may be obtained masking effects arise from tone-on-tone masking, tone-on through high quality low bit-rate coding. 25 noise masking, noise-on-tone masking, and noise-on-noise Vocoders (derived from the words "VOice CODER") are masking. devices used to code speech in digital form. Successful Acoustic coding algorithms utilizing simultaneous mask speech coding has been achieved with channel vocoders, ing are in use today to compress wide-band (7 kHz to 20 kHz formant vocoders, linear prediction (LPC) vocoders, homo bandwidth) acoustic signals. Two examples are Johnston's morphic vocoders, and code excited linear prediction techniques and the Motion Picture Experts Group's (MPEG) (CELP) vocoders. In all of these vocoders, speech is mod standard for audio coding. The duration of the analysis eled as overlapping time segments, each of which is the window used in these wide-band coding techniques is 8 ms response of a linear system excited by an excitation signal to 10 ms, yielding frequency resolution of about 100 Hz. typically made up of a periodic impulse train (optionally Such methods are effective for wide-band audio above 5 modified to resemble a glottal pulse train), random noise, or 35 kHz, in which critical bandwidths are greater than 200 Hz. a combination of the two. For each time segment of speech, But, for the 0 to 5 kHz frequency region that comprises excitation parameters and parameters of the linear system speech, these methods are not at all effective, as 25 Hz may be determined, and then used to synthesize speech frequency resolution is required to determine masked when needed. regions of a signal. Moreover, because speech coding may Another vocoder that has been used to achieve successful be performed more efficiently than coding of arbitrary speech coding is the multi-band excitation (MBE) vocoder. acoustic signals (due to the additional knowledge that MBE coding relies on the insight that most of the energy in speech is produced by a human vocal tract), a speech-based voiced speech lies at harmonics of a fundamental frequency coding method is preferable to a generic one. (i.e., the pitch frequency), and thus, an MBE vocoder has 45 segments centered at harmonics of the pitch frequency. Also, SUMMARY OF THE INVENTON MBE coding typically recognizes that many speech seg Accordingly, it is an objective of the present invention to ments are not purely voiced (i.e., speech sounds, such as apply properties of human speech production and auditory vowels, produced by chopping of a steady flow of air into systems to greatly reduce the required capacity for coding quasi-periodic pulses by the vocal chords) or unvoiced (i.e., 50 speech, with minimal perceptual degradation of the speech speech sounds, such as the fricatives "f" and "s." produced signal. by noise-like turbulence created in the vocal tract due to By applying simultaneous masking and temporal masking constriction). Thus, while many vocoders typically have one to coded speech, one may disregard certain unnecessary Voiced/Unvoiced (V/UV) decision per frame, MBE vocod speech signals to achieve additional, substantial compres ers typically implement a separate V/UV decision for each 55 sion of coded speech over existing coding techniques, while segment in each frame of speech. enabling synthesis of coded speech with minimal perceptual MBE coding as well as other speech-coding techniques degradation. With the method and apparatus of the present are known in the art. For a particular description of MBE invention, speech may be sampled at 10 kHz, coded at an coding and decoding, see D. W. Griffin, The multi-band average rate of less than 2 bits/sample, and reproduced in a excitation vocode, PhD Dissertation: Massachusetts Insti manner that is perceptually transparent to a listener. tute of Technology (February 1987); D. W. Griffin & J. S. The perceptual speech coder of the present invention Lim, "Multiband excitation vocoder." IEEE Transactions on operates by first filtering, sampling, and digitizing an analog Acoustics, Speech, and Signal Processing, Vol. 36, No. 8 speech input signal. Each frame of the digital speech signal (August 1986), which are incorporated here by reference. is passed through an MBE coderfor obtaining a fundamental Also, a number of techniques are known in the art to 65 frequency, complex magnitude information, and V/UV bits. compress coded speech. In particular, techniques are in use This information is then passed through an auditory analysis to code speech in which periods of silence are represented in module, which examines each frame from the MBE coder to 5,706,392 3 4 determine whether certain segments of each frame are module 106 performs multi-band excitation (MBE) analysis inaudible to the human auditory system due to simultaneous on digital speech signal 104 to produces MBE output 108 or temporal masking. If a segment is inaudible, it is zeroed comprising a fundamental frequency 110. complex magni out when passed through the next block, an audibility tudes 112, and V/UV bits 114. MBE analysis module 106 may be assembled in the manner suggested by Griffin et at. thresholding module. In the preferred embodiment, this cited above, or other known ways. module eliminates segments that are less than 6 dB above a MBE analysis module 106 first calculates fundamental calculated audibility threshold and also eliminates entire frequency 110, which is the pitch of the current frame of frames of speech that are identified as being silent. The digital speech signal 104. The fundamental frequency, or reduced information signal is then passed through a quan pitch frequency, may be defined as the reciprocal of an tization module for assigning quantized values, which are 10 interval on a speech waveform (or a glottal waveform) that passed to an information packing module for packing into an defines one dominant period. Pitch plays an important role output data stream. The output data stream may be stored or in an acoustic speech signal, as the prosodic information of transmitted. When the output data stream is recovered, it an utterance is primarily determined by this parameter. The may be unpacked by a decoder and synthesized into speech ear is more sensitive to changes of fundamental frequency that is perceptually transparent to a listener. 15 than to changes in other speech signal parameters by an The present invention represents a significant advance order of magnitude. Thus, the quality of speech synthesized ment over known techniques of speech coding. Through use from a coded signal is influenced by an accurate measure of of the present invention, codes for both silent and non-silent fundamental frequency 110. Literally hundreds of pitch periods of speech may be compressed. Applying principles extraction methods and algorithms are known in the art. For of monaural masking, only speech that is audibly perceptible 20 a detailed survey of several methods of pitch extraction, see to a human is coded, enabling significant, additional com W. Hess, Pitch Determination of Speech Signals, Springer pression over known techniques of speech coding. Verlag. New York, N.Y. (1983), which is incorporated here BRIEF DESCRIPTION OF THE DRAWINGS by reference. 25 After calculating fundamental frequency 110, MBE The foregoing advantages of the present invention are analysis module 106 computes complex magnitudes 112 and apparent from the following detailed description of the V/UV bits 114 for each segment of the current frame. preferred embodiment of the invention with reference to the Segments, or sub-bands, are centered at harmonics of fun drawings, in which: damental frequency 110 with one segment per harmonic FIG. 1 shows a block diagram of the perceptual speech 30 within the range of 0 to 5 kHz. The number of segments per coder of the present invention; frame typically varies from as few as 18 to as many as 60. FIG. 2 shows representative psycho-acoustic masking From the estimated spectral envelope of the frame, the data for simultaneous and temporal masking; energy level in each segment is estimated as if excited by FIG. 3 illustrates masking effects on a speech signal of purely voiced or purely unvoiced excitation. Segments are 35 then classified as voiced (V) or unvoiced (UV) excitation by simultaneous and temporal masking; computing the error in fitting the original speech signal to a FIG. 4 shows sample frames of quantized coded speech periodic signal of fundamental frequency 110 in each seg data prior to packing; ment. If the match is good, the error will be low, and the FIG. 5 shows bit patterns for the sample frames of FIG. segment is considered voiced. If the match is poor, a high 4 after packing. error level is detected, and the segment is considered FIG. 6 shows a block diagram of the perceptual speech unvoiced. With regard to complex magnitudes 112, voiced decoder of the present invention; and segments contain both phase and magnitude information, FIG. 7 shows a block diagram of a representative hard and are modeled to have energy only at the pertinent ware configuration of a real-time perceptual coder, harmonic; unvoiced segments contain only magnitude 45 information, and are modeled to have energy spread uni DETALED DESCRIPTION OF A PREFERRED formly throughout the pertinent segment. EMBODIMENT Although the preferred embodiment has been shown and described as using MBE analysis for transforming digital 1. Perceptual Speech Coding speech signal 104 into the frequency domain, otherforms of FIG. 1 shows a block diagram of the perceptual speech 50 coding are suitable for use with the present invention. For coder 10 of the present invention, in which analog speech example, in embodiments that only perform temporal input signal 100 is provided to the coder, and output data masking, most classical methods of speech coding work stream 132 is produced for transmission or storage. well. In addition, varying degrees of simultaneous masking Analog speech input signal 100 enters the system via a may be obtained with other coding schemes. While MBE microphone, tape storage, or other input device 101, and is 55 analysis provides a preferred solution in terms of allowing processed in analog analysis module 102. Analog analysis closely-spaced frequencies to be easily discerned and module 102 filters analog input speech signal 100 with a masked, those of skill in the art will find other forms of lowpass anti-aliasing filter preferably having a cut-off fre spectral analysis and speech coding useful with the temporal quency of 5 kHz so that speech can be sampled at a and/or simultaneous masking features of the present inven frequency of 10 kHz without aliasing. The signal is then tion. sampled and windowed, preferably with a Hamming MBE output signal 110 is then passed through auditory window, into 10 ms frames. Finally, the signal is quantized analysis module 116. This module determines whether any into digital speech signal 104 for further processing. segments are inaudible to the human auditory system due to In MBE analysis module 106, each frame of digital simultaneous or temporal masking. To perform the masking speech signal 104 is transformed into the frequency domain 65 process, auditory analysis module 116 associates with each to obtain a spectral representation of this digital, time segment of each frame of MBE output signal 110 (the domain signal. In the preferred embodiment, MBE analysis segment outputs), a perceptual weight label 118, which 5,706,392 5 6 indicates whether the speech in the segment is masked by is below a pre-determined level (T, then the whole frame speech in certain other segments. is labeled as silence, and parameters for this frame need be An illustrative but not exclusive way to calculate a transmitted in substantially compressed form. The threshold perceptual weight label 118 is by comparing segment out level. T. may be fixed if the noise conditions in the puts and determining how much above the threshold of 5 application are known, or it may be adaptively set with a audibility, if at all, each segment is. Psycho-acoustic data simple energy analysis of the input signal. Such as that shown in FIG. 2 is used in this calculation. For The collective effect of auditory analysis module 116 and each segment, the masking effects of the frequency compo audibility thresholding module 120 is to isolate islands of nents in the present frame, the previous 10 frames, and the next 3 frames (resulting in a 30 ms delay) are calculated. A perceptually-significant signals, as illustrated in FIG. 3. segment's label-originally initialized to an arbitrary high 10 Digital capacity may then be assigned to efficiently and value of 100 dB (relative to 0.0002 micro-bars)-is then set accurately code intense complex signal 300, while using a equal to the difference between the threshold of audibility compressed coding scheme for areas temporally and fre for the unmasked signal and the largest masking effect. The quency masked 302. preferred embodiment assumes that masking is not additive, After MBE, auditory, and audibility analysis has been so only the largest masking effect is used. This assumption 5 performed in modules 106, 116, and 120, the resulting provides a reasonable approximation of physical masking in parameters are quantized to reduce storage and transmittal which masking is somewhat additive. demands. Both the fundamental frequency and the complex To calculate the degree that one frequency segment masks magnitudes are quantized. The V/UV bits are stored as one another, four parameters are calculated: A the amplitude bit per decision and do not require further quantization. To difference between the two frequency components; t the perform quantization, fundamental frequency 110 and time difference between the two frequency components;f adjusted complex magnitudes 122 are passed to quantization the frequency difference between the two frequency com module 124, which produces quantized fundamental fre ponents; and The the level above the threshold of quency 126 and quantized real magnitudes 128. audibility, without masking, of the masked frequency seg 25 In the preferred embodiment, a 10-bit linear quantization ment. Psycho-acoustic data (like that shown in FIG. 2) is scheme is used to code fundamental frequency 110 of each then utilized to determine masking. In the preferred frame. A minimum (80 Hz) and maximum (280 Hz) allow embodiment, one of four psycho-acoustic data sets is uti able fundamental frequency value is used for the quantiza lized depending on the classification of each of the masking tion limits. The range between these limits is broken into and masked segments as tone-like (voiced) or noise-like linear quantization levels and the closest level is chosen as (unvoiced); that is, separate data sets are preferably used for our initial quantization estimate. Since the number of seg tone-on-tone masking, tone-on-noise masking, noise-on ments perframe is directly calculated from the fundamental tone masking, and noise-on-noise masking. The amount of frequency of the frame, quantization module 124 must masking (M) is calculated by interpolating to the point in ensure that quantization of fundamental frequency 110 does the psycho-acoustic data determined by the calculated 35 not change the calculated number of segments per frame parameters Aartar far and The The value of Mpe from the actual number. To do this, the module calculates the is subtracted from the value of The to calculate a new number of segments per frame (the 5 kHz bandwidth is threshold of audibility, Th. The lowest value of divided by the fundamental frequency) for fundamental The for each segment-based on an analysis of the frequency 110 and quantized fundamental frequency 124. If masking effects, if any, of each of the masking segments in these values are equal, fundamental frequency quantization the 14 frames reviewed-is stored as a masked segment's is complete. If they are not equal, quantization module 124 perceptual weight label 118. adjusts quantized fundamental frequency 126 to the nearest Perceptual weight labels 118 and complex magnitudes quantized value that would make its number of segments per 112 are then passed to audibility thresholding module 120, frame equal that of fundamental frequency 110. With ten which Zero-outs unnecessary segments. If the effective 45 bits, the quantization levels are small enough to ensure the intensity of a segment is less than the threshold of audibility existence of such a quantized fundamental frequency 126. for the segment, it is not perceivable to the human auditory Adjusted complex magnitudes 122 at each harmonic of System and comprises an unnecessary signal. At a minimum, the fundamental frequency are also quantized in module segments having a negative or zero perceptual weight label 124. They are first converted into their polar coordinate 118 fall into this class, and such segments may be zeroed-out representation, and the resulting real magnitude and phase by setting their respective complex magnitudes 112 to zero. components are quantized into separate 8-bit packets. The In addition, certain segments having positive perceptual real magnitudes and phases are each quantized, preferably weight labels 118 may also be zeroed-out (and preferably are using adaptive differential pulse-code modulation to permit additional data compression). This result arises out (ADPCM) coding with a one word memory. This technique of the fact that the threshold at which a signal can be heard 55 is well known in the art. For a particular description of in normal room conditions is a bit greater than the threshold ADPCM coding, see P. Cummiskey, et at., "Adaptive quan of audibility, which is empirically defined under laboratory tization in differential pcm coding of speech,” Bell System conditions for isolated frequencies. In particular, the can Technical Journal, 52-7:1105-18 (September 1973) and N. cellation of segments having perceptual weight labels 118 S. Jayant, "Adaptive quantization with a one-word less than or equal to 6 dB was found to be perceptually memory." Bell System Technical Journal, 52-7:1119-44 insignificant in normal room conditions. When signals (September 1973), which are incorporated here by reference. above this level were removed, perceptual degradation was In the preferred embodiment, magnitudes are quantized in observed in the synthesized quantized speech signal. When 256 steps, requiring 8 data bits. Phases are quantized in 224 signals below this level were removed, maximum compres steps, also requiring 8 data bits. The 224 eight bit words sion was not achieved. 65 decimally represented as 0 to 223 (00000000 to 11011111 Preferably, silence detection is also performed in audibil binary) are used to represent all possible output code words ity thresholding module 120. If the total energy in a frame of the phase quantization scheme. The unused 32 words in 5,706,392 7 8 the phase data are reserved to communicate information (not tive silence frames in the buffer is sent to output data stream related to the phase of the complex magnitudes) about 132. If the buffer is full and a seventeenth consecutive zeroed-segments (i.e. segments zeroed-out by audibility silence frame is detected, the code representing 16 frames of thresholding module 120) and silence frames. silence is sent to output data stream 132, the buffer is reset Silence frames are often clustered together in time, that is, to zero, and the process continues. silence often is found in intervals greater than 10 ms. So as Second, if the frame is non-silence but starts with a not to waste 8 bits for each silence frame detected, 16 words zeroed-segment (which corresponds to either the highest or are reserved to represent silence frames. The sixteen 8-bit lowest harmonic in a frame based on whether the frame is words decimally represented as 224 to 239 (11100000 to odd or even), one of the sixteen codes reserved for this 11101111 binary) are reserved to represent 1 to 16 frames of 10 situation is used. Abuffer like the one used for silence-frame silence. When one of these code words is encountered where coding is used to determine how many consecutive zeroed an 8-bit phase is expected, the present frame (and up to the segments to code. next 15 frames) are silence. All the silence codes begin with Third, if the frame is non-silence but starts with a mag the 4-bit sequence 1110, which fact may be used to increase nitude corresponding to an unvoiced segment, then phase efficiencies in decoding. 15 information is not used in re-synthesis and an arbitrary code Due to the formant structure of speech and the varying (00000000 binary) representing zero phase is sent to output nature of speech in the timeffrequency plane, magnitudes are data stream 132. often zeroed (due to masking) in clusters. So as not to waste Fourth, if the frame is non-silence and starts with a a full eight bits for each zeroed-segment, 16 words are magnitude corresponding to a voiced segment, then the reserved to represent 1 to 16 consecutive zeroed-segments. 20 quantized phase value is sent to output data stream 132. The sixteen 8-bit words decimally represented as 240 to 255 The next 10-bits sent to output data stream 132 after an (11110000 to 11111111 binary) are reserved to represent 1 to 8-bit non-silence phase value is sent is the 10-bit codeword 16 zeroed-segments. When one of these code words is representing the quantized fundamental frequency of the encountered where an 8-bit phase is expected, the present frame, quantized fundamental frequency 126. Every frame magnitude (and up to the next 15 magnitudes) need not be 25 of speech has one quantized fundamental frequency associ produced. All the zero magnitude codes begin with the 4-bit ated with it, and this data must be sent for all non-silence sequence 1111, which fact may be used to increase efficien frames. Even when every segment in a frame is unvoiced, an cies in decoding. arbitrary or default fundamental frequency was used in Preferably, quantization module 124 quantizes in a circu dividing the spectrum into frequency segments, and trans lar order. This method capitalizes on the fact that a process 30 mission of this frequency is pertinent so that the number of that limits the short-time standard deviation of a signal to be frequency bins in the frame may be calculated. quantized causes reduced quantization error in differential The rest of the information in the frame pertains to coding. Thus, magnitude coding starts with the lowest magnitudes, phases, and V/UV decisions. The next bit sent frequency sub-band of the first frame and continues with the to output data stream 132 contains V/UV information for the next highest sub-band until the end of the frame is reached. 35 first non-zeroed segment of speech. An 8-bit word contain The next frame begins with the highest frequency sub-band ing the magnitude of the first segment follows the V/UV bit. and decreases until the lowest frequency level is reached. All The rest of the data in the frame is sent as follows: a V/UV odd frames are coded from lowest frequency to highest bit, followed by either an 8-bit word representing quantized frequency, and all even frames are coded in the reverse magnitude (Unvoiced segments) or two 8-bit words repre order. A silence frame is not included in the calculations of senting quantized phase and quantized magnitude (Voiced odd and even frames. Thus, if a first frame is non-silence, a segments). Phase information is sent before magnitude second frame is silence, and a third frame is non-silence, the information so that zeroed-segments are coded without first frame is treated as an odd frame and the third frame is sending a dummy magnitude. When a string of 1 to 15 treated as an even frame. Phase quantization is in the same zeroed-segments is sent to output data stream 132, a V/UV order to keep congruence in the decoding process. 45 bit is sent delimiting a phase codeword to be sent next, and After quantization has been performed in module 124, the then the correct codeword is sent. resulting information is packed into output data stream 132 To illustrate the information packing process that occurs in packing module 130. For each frame, the information to in information packing module 130, FIG. 4 shows sample pack includes quantized fundamental frequency 126, quan 50 frames of quantized coded speech data prior to packing. Six tized real magnitudes 128, and V/UV bits 114. The percep frames of speech are shown in FIG. 4, each with differing tual coder is designed to code all this information in output numbers of harmonics. The number of harmonics shown in data stream 132 comprising data at or below 20 kbits/s for each frame is much less than would occur for actual speech all speech utterances. This real-time data rate translates to 2 data, the amount of information shown being limited for bits/sample for a 10 kHz sampling rate of analog speech 55 purposes of illustration. input signal 100. FIG. 5 shows bit patterns for the sample frames of FIG. Since the portion of quantized real magnitudes 128 com 4 after packing. The first eight bits 510 of packed data prising the quantized phase track contains the encoding contains phase information for the lowest harmonic sub packing information, the first eight bits in each frame will be band 412 of the first frame 410. The next ten bits 512 of the phase of the first harmonic. There are four possible packed data code the fundamental frequency of the first situations. frame 410. The next bit 514 is a bit classifying the lowest First, if the frame being quantized is labeled a silence harmonic sub-band 412 as voiced. The next eight bits 516 frame, then one of the 16 codes representing silence frames contain magnitude information for the lowest harmonic is used. Only one code-word is used for up to 16 frames of sub-band 412 of the first frame 410. silence. A buffer holds the number of silence frames previ 65 The remaining three harmonic sub-bands 414 in the first ously seen (up to 15), and as soon as a non-silence frame is frame 410 as well as the two highest harmonic sub-bands detected, the 8-bit code representing the number of consecu 422 in the second frame 420 are zeroed-segments. Since 5,706,392 9 10 ordering is circular, a code that represents five segments of prising real magnitudes 610. Buffers in these decoders for zeroes must be transmitted. One bit 518 classifying the quantization step size, predictor values, and multiplier val second frame 420 as voiced is transmitted, followed by an ues are set to default parameters used to encode the first eight-bit code 520 indicating the five segments of zeroes. segment. The default values are known prior to transmission The fundamental frequency of the first segment is encoded of signal data. Codes will be deciphered and quantization with the number of segments in the first frame, indicating step size will be adjusted by dividing by the multiplier used that three segments of zeros are within the boundary of first in encoding. This multiplier can be determined directly from frame 410, and thus, two segments of zeros are part of the present codeword. Other values may also be needed to second frame 420. initialize decoder 600, such as the reference level and An 8-bit code corresponding to zero phase 522 is then 10 quantization step size for computing fundamental frequency sent. This code represents the (mock) phase of the first 608. non-zero segment in the frame, which is unvoiced. The next 10-bits 524 are the quantized fundamental frequency of the Upon completion of the unpacking process in module second frame 420. For each of the two unvoiced segments 604, unpacked output 606 is provided to MBE speech 426, a 1-bit segment classifier and an 8-bit coded magnitude synthesis module 614 for synthesis into synthesized digital 526 are sent to the data stream. For the remaining voiced 15 speech signal 616. The complex magnitudes of all the voiced segment 428, a 1-bit segment classifier, an 8-bit magnitude frames are calculated with a polar to rectangular calculation code, and an 8-bit phase code 528 are sent to the data stream. on the quantized dam. Then all of the frame information is The third through fifthframes 430 are all silence frames, and sent to an MBE speech synthesis algorithm, which may be a single 8-bit code 530 is used to transmit this information. assembled in the manner suggested by Griffin et al. cited Frame six 460 is coded 560 similarly to the first two above, or other known ways. 2. Perceptual Speech Decoding For example, synthesized digital speech signal 616 may FIG. 6 shows a block diagram of the perceptual speech be synthesized as follows. First, unpacked output 606 is decoder 600 of the present invention. Output data stream separated into voiced and unvoiced sections as dictated by 132-directly, from storage, or through a form of V/UV bits 612. Real magnitudes 610 will contain phase and magnitude information for voiced segments, while unvoiced transmission-is used as decoder input stream 602. Decoder segments will only contain magnitude information. Voiced 600 decodes this signal and synthesizes it into analog speech speech is then synthesized from the voiced envelope seg output signal 620. ments by summing sinusoids at frequencies of the harmonics Information unpacking module 604 unpacks decoder of the fundamental frequency, using magnitude and phase input stream 602 so that the synthesizer can identify each dictated by real magnitudes 610. Unvoiced speech is then segment of each frame of speech. Information packing synthesized from the unvoiced segments of real magnitudes module 604 extracts a plurality of quantized information, 610. The Short Time Fourier Transform (STFT) of broad including pitch information, W/UV information, complex band is amplitude scaled (a different amplitude magnitude information, and information indicating which per segment) so as to resemble the spectral shape of the frames were declared as silence frames and which segments 35 unvoiced portion of each frame of speech. An inverse were zeroed. Module 604 produces unpacked output 606 frequency transform is then applied, each segment is comprising fundamental frequency 608, real magnitudes windowed, and the overlap add method is used to assemble 610 (which contains real magnitudes and phases), and V/UV the synthetic unvoiced speech. Finally, the voiced and bits 612 for each frame of speech to synthesize. unvoiced speech are added to produce synthesized digital The procedure for unpacking the information proceeds as speech signal 616. follows. The first eight bits are read from the data stream. If Synthesized digital speech signal 616 is provided to they correspond to silence frames, silence can be generated analog synthesis module 618 to produce analog speech and sent to the speech output and the next eight bits are read output signal 620. In the preferred embodiment, the digital from decoder input stream 602. As soon as the first eight bits signal is sent to a digital-to-analog converter using a 10 kHz in a frame do not represent one or more silence frames, then 45 the frame must be segmented. The next 10 bits are read from sampling rate and then filtered by a 5 kHz low-pass analog decoder input stream 602, which contain the quantized anti-image postfilter. The resulting analog speech output fundamental frequency for the present frame as well as the signal 620 may be sent to speakers, headphones, tape number of segments in the frame. Fundamental frequency storage, or some other output device 622 for immediate or 608 is extracted and the number of segments is calculated delayed listening. Alternatively, decoded digital speech sig before continuing to read data from the stream. nal 616 may be stored prior to analog synthesis in suitable In the preferred embodiment, two buffers are used to store application contexts. the state of unpacking. A first buffer contains the V/UV state 3. Hardware Configuration of Perceptual Coder of the present harmonic, and a second buffer counts the FIG. 7 shows a block diagram of a representative hard number of harmonics that are left to calculate for the present 55 ware configuration of a real-time perceptual coder 700. It frame. One bit is read to obtain V/UV bit 612 for the present comprises volatile storage 702, non-volatile storage 704, harmonic. Eight bits (a magnitude codeword) are read for DSP processor 706, general-purpose processor 708, A/D each expected unvoiced segment (or for an expected voiced converter 710, D/A converter 712, and timing network 714. segment with a reserved codeword), or sixteen bits (a phase Perceptual coding and decoding do not require significant and a magnitude codeword) are read for each expected storage space. Volatile storage 702, such as dynamic or static voiced segment. If the first eight bits in a frame declared RAM, of approximately 50 kilobytes is required for holding voiced correspond to a codeword that was reserved for temporary data. This requirement is trivial, as most modern zeroed-segments, the number of segments represented in processors carry this much storage in on-board cache this codeword are treated as voiced segments with zero memory. In addition, non-volatile storage 704, such as amplitude and phase. 65 ROM, PROM, EPROM, EEPROM, or magnetic or optical In the preferred embodiment, two ADPCM decoders are disk, is needed to store application software for performing used to obtain quantized phase and magnitude values com the perceptual coding process shown in FIG. 1. application 5,706,392 11 12 software for performing the perceptual decoding process obtained at bit rates of less than 20 kbits/sec for 10 kHz shown in FIG. 4, and look-up tables for holding masking sampled (5 kHz bandwidth) speech. This rate of 2 bits/ data, such as that shown in FIG. 3. The size of this storage sample is one-quarter that available with standard 8-bit space will depend on the particular techniques used in the u-law coding (used in present day telephony), yet yields application software and the approximations used in prepar comparable reproduction quality. Several listening tests ing masking data, but typically will be less than 50 kilobytes. were performed using degradation mean opinion score Perceptual coding requires a high but realizable FLOP (DMOS) tests. In these tests. listeners found that test utter rate to run in real-time mode. The coding process shown in ances had sound quality equal to that of reference utterances FIG. 1 comprises four computational parts, including MBE and that coding was effectively transparent. analysis module 106, auditory analysis module 116, audi O Listening tests were also performed to determine the bility thresholding module 120, and quantization module optimal operating point of the decoder. The operating point 124. These modules may be implemented with algorithms of the coder was found to have an optimal auditory threshold requiring O(n), O(n), O(1), and O(n) operations, level of 6 dB (i.e., segments were zeroed in auditory respectively, where n is the number of frames used in thresholding analysis if they had a perceptual weight label analysis. In the preferred embodiment, 14 frames (3 frames 15 118 less than or equal to 6 dB). Decreasing the auditory for backward masking, 10 frames for forward masking, and threshold level to 4 dB still coded the MBE synthesized data the present frame) are used. The decoding process shown in transparently, but increased the bit consumption of the coder FIG. 6 is computationally light and may be implemented by approximately two percent. Increasing the auditory with an algorithm requiring O(n) operations. In real-time threshold to 8 dB decreases the coding requirements by less mode, the coding and decoding algorithms must keep up than two percent, but lost the property of transparent coding with the analog sampling rate, which is 10 kHz in the of the MBE synthesized speech in 50% of the utterances preferred embodiment. tested. Real-time perceptual coder 700 includes DSP processor In addition, listening tests found that the use of 8 bits for 706 for front-end MBE analysis, which is multiplication coding phase and magnitude information comprising quan addition intensive, and at least one fairly high performance 25 tized real magnitudes 126 was optimal. Increasing the bit general-purpose processor 708 for the rest of the algorithm, allotment to either caused an increase in total bit require which is decision intensive. The heaviest computing demand ments by about 10%, but did not result in a performance gain is made by the auditory analysis module 116, which requires since transparent coding of the MBE speech was already on the order of 100 MFLOPS. To meet this load, general achieved without the use of additional bits. However, if the purpose processor 708 may be a single high-performance 30 eight-bit allocations were decreased, the property of trans processor, such as the DEC Alpha, or several "regular" parent coding of the MBE synthesized speech in all tested processors, such as the Motorola 68030 or Intel 80386. As utterances was lost. processor speed and performance increase, most future Perceptual coding according to the present invention may processors are likely to be sufficient for use as general be used in a variety of different applications, including processor 708. 35 high-quality speech coders, low bit-rate coders, and A/D converter 710 is used to filter, sample, and digitize an perceptual-weighting front-ends for beam-steering routines analog input signal, and D/A converter 712 is used to for microphone array systems. Perceptual coding schemes synthesize an analog signal from digital information and may be used for system applications, including speech may optionally filter this signal prior to output. Timing compression in multi-media conferencing systems, cockpit network 714 provides clocking functions to the various to-tower speech communication, wireless telephone integrated circuits comprising real-time perceptual coder communication, voice messaging systems, digital speech 700. recorders, digital answering machines, and storage of large speech databases. However, the invention is not limited to Numerous variations on real-time perceptual coder 700 these applications or to the disclosed embodiment. Those may be made. For example, a single integrated circuit that 45 incorporates the functionality provided by the plurality of persons of ordinary skill in the art will recognize that integrated circuits comprising real-time perceptual coder modifications to the preferred embodiment may be made, 700 may be designed. For applications with limited process and other applications pursued, without departure from the spirit of the invention as claimed below. ing power, a real-time perceptual coder with increased We claim: efficiency may be designed in which approximations are 50 used in the psycho-acoustic look-up tables to calculate 1. A method for coding an analog speech signal, said masking effects. Alternatively, a system may be designed in method comprising the steps of: which only simultaneous or temporal masking is imple filtering, sampling, and digitizing said analog speech mented to reduce computational complexity. signal to produce a digital speech signal, said digital The principles of perceptual coding also apply to other 55 speech signal comprising a plurality of frames; contexts. Elements of real-time perceptual coder 700 may be performing frequency analysis on said digital speech incorporated into existing vocoders to either lower the bit signal to produce spectral output data for each of said rate or improve the quality of existing coding techniques. frames, said spectral output data comprising segments, Also, the principles of the present invention may be used to at least two of said segments being approximately 25 enhance automated word recognition. The auditory model of Hz or closer in frequency; the present invention is able to perceptually weigh regions in performing auditory analysis on said spectral output data the time-frequency plane of speech. If perceptually trivial to identify segments of said frames that are inaudible to information is removed from speech prior to feature the human auditory system due to simultaneous or extraction, it may be possible to create a better feature set temporal masking effects; and due to the reduction of unnecessary information. 65 coding said spectral output data into an output data stream A high-quality speech coder was developed for testing. in which said inaudible segments are compressed and With this coder, relatively transparent speech coding was audible segments are not compressed. 5,706,392 13 14 2. A coder for coding a speech signal comprising a create a coded representation of said speech signal masking segment and a masked segment approximately 25 wherein said masked segment is compressed and said Hz or closer in frequency to said masking segment, said masking segment is not compressed. coder comprising: 3.The method of claim 1, wherein said frequency analysis storage means for storing first application software, sec 5 comprises MBE coding. ond application software, and masking data; 4. The coder of claim 2 wherein one integrated circuit a first processor connected to said storage means for using includes said first processor and said second processor. said first application software to generate spectral data 5. The coder of claim 4 wherein said first application software includes MBE coding to generate said spectral for said speech signal; and 10 a second processor connected to said storage means and data. said first processor for using said second application software, said masking data, and said spectral data to