US008942977B2

(12) United States Patent
Chen

(10) Patent No.: US 8,942,977 B2
(45) Date of Patent: Jan. 27, 2015

(54) SYSTEM AND METHOD FOR SPEECH RECOGNITION USING PITCH-SYNCHRONOUS SPECTRAL PARAMETERS

(71) Applicant: Chengjun Julian Chen, White Plains, NY (US)

(72) Inventor: Chengjun Julian Chen, White Plains, NY (US)

(*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 0 days.

(21) Appl. No.: 14/216,684

(22) Filed: Mar. 17, 2014

(65) Prior Publication Data: US 2014/0200889 A1, Jul. 17, 2014

Related U.S. Application Data

(63) Continuation-in-part of application No. 13/692,584, filed on Dec. 3, 2012, now Pat. No. 8,719,030.

(51) Int. Cl.: G10L 15/00 (2013.01); G10L 15/02 (2006.01); G10L 25/90 (2013.01); G10L 25/93 (2013.01)

(52) U.S. Cl.: CPC G10L 15/02 (2013.01); G10L 25/90 (2013.01); G10L 25/93 (2013.01); USPC 704/235; 704/254

(58) Field of Classification Search: CPC G10L 15/02; G10L 15/04; USPC 704/254, 235, 256; see application file for complete search history.

(56) References Cited

U.S. PATENT DOCUMENTS

5,917,738 A * 6/1999 Pan ...... 708/403
6,311,158 B1 * 10/2001 Laroche ...... 704/269
6,470,311 B1 * 10/2002 Moncur ...... 704/208
H2172 H * 9/2006 Staelin et al. ...... 704/207

OTHER PUBLICATIONS

Hess, Wolfgang. "A pitch-synchronous digital feature extraction system for phonemic recognition of speech." IEEE Transactions on Acoustics, Speech and Signal Processing 24.1 (1976): 14-25.
Mandyam, Giridhar, and Neeraj Magotra. "Application of the discrete Laguerre transform to speech coding." Asilomar Conference on Signals, Systems and Computers. IEEE Computer Society, 1995.
Legat, Milan, J. Matousek, and Daniel Tihelka. "On the detection of pitch marks using a robust multi-phase algorithm." Speech Communication 53.4 (2011): 552-566.
Wikipedia entry for tonal languages (Dec. 15, 2011).

* cited by examiner

Primary Examiner: Vincent P. Harper

(57) ABSTRACT

The present invention defines a pitch-synchronous parametrical representation of speech signals as the basis of speech recognition, and discloses methods of generating the said pitch-synchronous parametrical representation from speech signals. The speech signal first goes through a pitch-marks picking program to identify the pitch periods. The speech signal is then segmented into pitch-synchronous frames. An ends-matching program equalizes the values at the two ends of the waveform in each frame. Using Fourier analysis, the speech signal in each frame is converted into a pitch-synchronous amplitude spectrum. Using Laguerre functions, the said amplitude spectrum is converted into a unit vector, referred to as the timbre vector. By using a database of correlated phonemes and timbre vectors, the most likely phoneme sequence of an input speech signal can be decoded in the acoustic stage of a speech recognition system.

12 Claims, 9 Drawing Sheets

[FIG. 1: block diagram of the recognition system: asymmetric window 101, PCM signal 102, profile function 104, picking peaks 105, pitch marks as frame endpoints 106, extension to unvoiced and silence PCMs 107, complete set of frame endpoints 108, ends matching into cyclic frames 112, discrete Fourier analysis 113, interpolation 115, continuous spectra 116, timbre vectors 118, database of phonemes vs. timbre vectors 119, decoding stages 120-123, output text 124.]
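The endpoint-completion step in FIG. 1 (procedure 107, producing the complete set 108) divides unvoiced and silence gaps at intervals roughly equal to the average voiced pitch period. A minimal sketch follows; the function name `complete_endpoints`, the twice-average gap test, and the rounding rule are illustrative assumptions, not the patent's code:

```python
def complete_endpoints(pitch_marks, total_len):
    """Extend voiced-section pitch marks to a complete set of frame
    endpoints by splitting unvoiced/silent gaps into roughly
    pitch-period-sized intervals (a sketch of procedure 107 in FIG. 1)."""
    marks = sorted(pitch_marks)
    diffs = [b - a for a, b in zip(marks, marks[1:])]
    # Average voiced pitch period; 80 samples is an arbitrary fallback.
    avg = round(sum(diffs) / len(diffs)) if diffs else 80
    pts = [0] + marks + [total_len]
    out = []
    for a, b in zip(pts, pts[1:]):
        gap = b - a
        if gap > 2 * avg:
            # Unvoiced or silence: subdivide at a constant interval.
            n = round(gap / avg)
            out.extend(round(a + gap * i / n) for i in range(n))
        else:
            out.append(a)
    out.append(total_len)
    return out

# Three voiced pitch marks inside a 1000-sample signal.
eps = complete_endpoints([400, 480, 560], 1000)
```

The voiced marks survive unchanged, and the leading and trailing gaps are cut into near-pitch-period pieces.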

[FIG. 2 (Sheet 2): panels (A) and (B), contrasting the prior-art overlapping, shifting process windows with the pitch-synchronous frames of the present invention.]
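The prior-art scheme contrasted in FIG. 2(A), a Hamming window of about 25 msec shifted by about 10 msec as the Background section below describes, can be sketched as follows; the helper name and parameter defaults are illustrative:

```python
import numpy as np

def frame_signal(signal, fs, win_ms=25.0, shift_ms=10.0):
    """Conventional fixed-shift framing (FIG. 2(A)): window positions are
    unrelated to the pitch periods, so frames straddle phoneme boundaries."""
    win = int(round(fs * win_ms / 1000.0))      # ~25 ms window length
    shift = int(round(fs * shift_ms / 1000.0))  # ~10 ms hop
    w = np.hamming(win)
    starts = range(0, len(signal) - win + 1, shift)
    return np.array([w * signal[s:s + win] for s in starts])

# One second of samples at 8 kHz yields 200-sample windows every 80 samples.
frames = frame_signal(np.ones(8000), fs=8000)
```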

[FIG. 3 (Sheet 3): an example of the asymmetric window function for finding pitch marks.]

[FIG. 4 (Sheet 4): an example of the profile function for finding pitch marks.]

[FIG. 5 (Sheet 5): number of pitch marks versus the time scale of the antisymmetric window (msec), used to optimize the window scale.]
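The pitch-mark search of FIGS. 3-5 convolutes the signal with an asymmetric window and picks peaks of the resulting profile function. In the sketch below, the patent's exact window formula is not legible in this scan, so an antisymmetric, Hann-tapered stand-in window is assumed, along with the helper names and the 0.4 threshold ratio:

```python
import numpy as np

def profile_function(signal, scale):
    """Convolve the signal with an antisymmetric, Hann-tapered window on
    (-scale, scale). This is a stand-in for the patent's asymmetric window,
    whose exact formula is not legible in this scan."""
    n = np.arange(-scale, scale + 1)
    w = -np.sin(np.pi * n / scale) * np.hanning(2 * scale + 1)
    return np.convolve(signal, w, mode="same")

def pick_pitch_marks(signal, scale, threshold_ratio=0.4):
    """Peaks of the profile function above a threshold become pitch marks."""
    p = profile_function(signal, scale)
    thr = threshold_ratio * p.max()
    return [i for i in range(1, len(p) - 1)
            if p[i] > thr and p[i] >= p[i - 1] and p[i] > p[i + 1]]

# Toy glottal-pulse train: one unit impulse every 80 samples.
sig = np.zeros(800)
sig[::80] = 1.0
marks = pick_pitch_marks(sig, scale=40)  # marks fall one period apart
```

On this toy pulse train the detected marks are spaced exactly one 80-sample period apart; for real speech the window scale would be tuned per speaker, as FIG. 5 suggests.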

[FIG. 6 (Sheet 6): the ends-meeting program equalizing the values at the two ends of the waveform in a pitch period (items 603, 604).]
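The ends-matching idea of FIG. 6, adjusting each raw frame so its two endpoint values agree before Fourier analysis, can be sketched like this. Ramping the mismatch out over the last M samples, with M about one tenth of the frame length, is one plausible reading of the fragmentary description in this scan ("where M is about N/10; otherwise x(n) = x0(n)"), not the patent's exact correction:

```python
import numpy as np

def ends_match(frame, m=None):
    """Ramp the end-to-start mismatch out over the last m samples,
    leaving the rest of the frame untouched (x(n) = x0(n) elsewhere)."""
    x = np.asarray(frame, dtype=float).copy()
    if m is None:
        m = max(2, len(x) // 10)          # M ~ N/10, per the legible fragment
    gap = x[-1] - x[0]
    x[-m:] -= gap * np.linspace(0.0, 1.0, m)  # last sample -> first sample
    return x

def amplitude_spectrum(frame):
    """Pitch-synchronous amplitude spectrum: FFT magnitude of one cyclic
    frame, with len(frame)//2 + 1 points (about half the frame length)."""
    return np.abs(np.fft.rfft(frame)) / len(frame)

frame = np.sin(2 * np.pi * 3 * np.arange(80) / 80) + 0.3  # one toy period
cyc = ends_match(frame)                  # now cyclic: both ends equal
spec = amplitude_spectrum(cyc)           # harmonic 3 dominates
```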

[FIG. 7 (Sheet 7): the amplitude spectrum in a pitch period, showing the raw data, the data after interpolation, and the spectrum recovered from the Laguerre expansion.]

[FIG. 8 (Sheet 8): graph of the Laguerre functions, plotted against frequency (kHz).]
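The Laguerre machinery behind FIGS. 7-8 can be sketched as follows: orthonormal associated Laguerre functions built from the standard three-term recurrence, coefficients obtained by numerically integrating the spectrum against them, and the coefficient vector normalized to unit length as a stand-in timbre vector. The frequency-to-x scaling, num = 32 coefficients, and k = 0 are illustrative choices, not the patent's values:

```python
import numpy as np
from math import factorial

def laguerre_functions(num, k, x):
    """First `num` orthonormal associated Laguerre functions at points x:
    Phi_n(x) = sqrt(n!/(n+k)!) * exp(-x/2) * x**(k/2) * L_n^k(x),
    with L_n^k built by the three-term recurrence
    (n+1) L_{n+1}^k = (2n+1+k-x) L_n^k - (n+k) L_{n-1}^k."""
    L = np.zeros((num, len(x)))
    L[0] = 1.0
    if num > 1:
        L[1] = 1.0 + k - x
    for n in range(1, num - 1):
        L[n + 1] = ((2 * n + 1 + k - x) * L[n] - (n + k) * L[n - 1]) / (n + 1)
    norm = np.array([np.sqrt(factorial(n) / factorial(n + k))
                     for n in range(num)])
    return norm[:, None] * L * np.exp(-x / 2) * x ** (k / 2)

def timbre_vector(spectrum, num=32, k=0, x_max=50.0):
    """C_n = integral of A(x) * Phi_n(x) dx, normalized to a unit vector."""
    x = np.linspace(1e-6, x_max, len(spectrum))
    c = laguerre_functions(num, k, x) @ spectrum * (x[1] - x[0])
    return c / np.linalg.norm(c)

spec = np.exp(-np.linspace(1e-6, 50.0, 512) / 10.0)  # toy smooth spectrum
tv = timbre_vector(spec)
```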

[FIG. 9 (Sheet 9): proximity indices versus frame index for the phoneme sequence [iao].]
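The proximity-index decoding illustrated in FIG. 9 can be mimicked with a reciprocal-distance index. The text mentions a small positive ε = 0.1 to avoid infinity, but the exact formula is not legible in this scan, so the form below, the helper names, and the toy 3-dimensional vectors are all assumptions:

```python
import numpy as np

def proximity_index(v, mean, eps=0.1):
    """Larger when v is closer to the phoneme's mean timbre vector;
    eps keeps the index finite when the vectors coincide."""
    return 1.0 / (np.linalg.norm(np.asarray(v) - np.asarray(mean)) + eps)

def decode_frame(v, database):
    """Pick the phoneme whose mean timbre vector maximizes the index."""
    return max(database, key=lambda ph: proximity_index(v, database[ph]))

# Hypothetical 3-dimensional "timbre vectors" for the phonemes of [iao].
db = {"i": np.array([1.0, 0.0, 0.0]),
      "a": np.array([0.0, 1.0, 0.0]),
      "o": np.array([0.0, 0.0, 1.0])}
frames = [np.array([0.9, 0.1, 0.0]),
          np.array([0.1, 0.9, 0.1]),
          np.array([0.0, 0.2, 0.9])]
decoded = [decode_frame(v, db) for v in frames]  # -> ['i', 'a', 'o']
```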

SYSTEM AND METHOD FOR SPEECH RECOGNITION USING PITCH-SYNCHRONOUS SPECTRAL PARAMETERS

The present application is a continuation in part of patent application Ser. No. 13/692,584, entitled "System and Method for Speech Synthesis Using Timbre Vectors", filed Dec. 3, 2012, by inventor Chengjun Julian Chen.

FIELD OF THE INVENTION

The present invention generally relates to automatic speech recognition, in particular to automatic speech recognition using pitch-synchronous spectral parameters, in particular timbre vectors.

BACKGROUND OF THE INVENTION

Speech recognition is an automatic process to convert the voice signal of speech into text, which has three steps. The first step, acoustic processing, reduces the speech signal to a parametric representation. The second step is to find the most probable sequences of phonemes from the said parametric representation of the speech signal. The third step is to find the most probable sequence of words from the possible phoneme sequences and a language model. The current invention relates to a new type of parametric representation of the speech signal and to the process of converting the speech signal into that parametric representation.

In current commercial speech recognition systems, the speech signal is first multiplied by a shifting process window, typically a Hamming window of about 25 msec duration and about 10 msec shift, to form a frame; see FIG. 2(A). A set of parameters is produced from each windowed speech signal. Therefore, for each 10 msec, a set of parameters representing the speech signal in the 25 msec window duration is produced. The most widely used parameter representations are linear prediction coefficients (LPC) and mel-frequency cepstral coefficients (MFCC). Such a method has flaws. First, the positions of the processing windows are unrelated to the pitch periods; therefore, pitch information and spectral information cannot be cleanly separated. Second, because the window duration is typically 2.5 times greater than the shift time, a phoneme boundary is always crossed by two or three consecutive windows. In other words, a large number of frames cross phoneme boundaries; see FIG. 2(A).

A better way of parameterizing the speech signal is first to segment the speech signal into frames that are synchronous to the pitch periods; see FIG. 2(B). For voiced sections of the speech signal, 211, each frame is a single pitch period, 213. For unvoiced signals, 212, the frames 214 are segmented for convenience, typically into frames approximately equal to the average pitch periods of the voiced sections. The advantages of the pitch-synchronous parameterization are: First, the speech signal in a single frame represents only the spectrum, or timbre, of the speech, decoupled from pitch; therefore, timbre information is cleanly separated from pitch information. Second, because a phoneme boundary must be either a boundary between a voiced section and an unvoiced section or a pitch-period boundary, each frame has a unique phoneme identity; therefore, each parameter set has a unique phoneme identity. The accuracy of speech recognition can thereby be improved. (See Part E of Springer Handbook of Speech Processing, Springer Verlag, 2008.)

SUMMARY OF THE INVENTION

The present invention defines a pitch-synchronous parametrical representation of speech signals as the basis for speech recognition, and discloses methods of generating the said pitch-synchronous parametrical representation from speech signals, in particular timbre vectors.

According to an exemplary embodiment of the invention, see FIG. 1, a speech signal first goes through a pitch-marks picking program to pick the pitch marks. The pitch marks are sent to a process unit to generate a complete set of segmentation points. The speech signal is segmented into pitch-synchronous frames according to the said segmentation points. An ends-meeting program is executed to make the values at the two ends of every frame equal. Using Fourier analysis, the speech signal in each frame is converted into a pitch-synchronous amplitude spectrum; then Laguerre functions are used to convert the said pitch-synchronous amplitude spectrum into a unit vector characteristic of the instantaneous timbre, referred to as the timbre vector. Those timbre vectors constitute the parametrical representation of the speech signal.

Using recorded speech by a speaker or a number of speakers reading a prepared text which contains all phonemes of the target language, an acoustic database can be formed. The speech signal of the read text is converted into timbre vectors. The phoneme identity of each timbre vector is determined by correlating to the text. The average timbre vector and variance for each individual phoneme are collected from the paired records, which forms an acoustic database.

During speech recognition, the incoming speech signal is first converted into a sequence of timbre vectors. Those timbre vectors are then compared with the timbre vectors in the database to find the most likely phoneme sequence. The possible phoneme sequence is then sent to a language decoder to find the most likely text.

BRIEF DESCRIPTION OF DRAWINGS

FIG. 1 is a block diagram of a speech recognition system using pitch-synchronous spectral parameters.

FIG. 2 shows the fundamental difference between the prior-art signal processing methods, which use an overlapping and shifting process window, and the pitch-synchronous method of the present invention.

FIG. 3 is an example of the asymmetric window for finding pitch marks.

FIG. 4 is an example of the profile function for finding the pitch marks.

FIG. 5 is a chart of the number of pitch marks as a function of the window scale, for optimizing the window scale.

FIG. 6 shows the ends-meeting program to equalize the values of the two ends of the waveform in a pitch period.

FIG. 7 is an example of the amplitude spectrum in a pitch period, including the raw data, those after interpolation, and those recovered from a Laguerre-function expansion.

FIG. 8 is a graph of the Laguerre functions.

FIG. 9 is an example of the proximity indices.

DETAILED DESCRIPTION OF THE INVENTION

Various exemplary embodiments of the present invention are implemented on a computer system including one or more processors and one or more memory units. In this regard, according to exemplary embodiments, steps of the various methods described herein are performed on one or more computer processors according to instructions encoded on a computer-readable medium.

FIG. 1 is a block diagram of the automatic speech recognition system according to an exemplary embodiment of the present invention. The input signal 102, typically in PCM (pulse-code modulation) format, is first convoluted with an asymmetric window 101 to generate a profile function 104. The peaks 105 in the profile function, with values greater than a threshold, are assigned as pitch marks 106 of the speech signal, which are the frame endpoints in the voiced sections of the input speech signal 102. The pitch marks only exist for the voiced sections of the speech signal. Using a procedure 107, those frame endpoints are extended into the unvoiced and silence sections of the PCM signal, typically by dividing those sections with a constant time interval roughly equal to the average pitch period in the voiced sections. A complete set of frame endpoints 108 is generated. Through a segmenter 109, using the said frame endpoints, the PCM signal 102 is then segmented into raw frames 110. In general, the PCM values at the two ends of a raw frame do not match; by performing Fourier analysis on those raw frames, artifacts would be generated. An ends-matching procedure 111 is applied to each raw frame to convert it into a cyclic frame 112, which can be legitimately treated as a sample of a continuous periodic function. Then, Fourier analysis 113 is applied to each said frame 112 to generate amplitude Fourier coefficients 114. According to the sampling theorem, the number of points of the amplitude spectrum is one half of the number of points of each frame; therefore, it is a discrete amplitude spectrum. Using an interpolation procedure 115, the discrete amplitude spectrum is extended to a large number of points on the frequency axis, typically 512 or 1024 points, to generate a virtually continuous spectral function. The continuous spectral function is then expanded using Laguerre functions, 117, to generate a set of expansion coefficients. [The rest of this passage, describing how the Laguerre expansion coefficients become the timbre vectors 118, is not legible in this scan.]

For voiced sections of the speech signal, 211, each frame is a single pitch period, 213. For unvoiced signals, 212, the frames 214 are segmented for convenience, typically into frame sizes approximately equal to the average pitch periods of the voiced sections. The advantages of the pitch-synchronous parameterization are: First, the speech signal in a single frame represents only the spectrum or timbre of the speech, decoupled from pitch; therefore, timbre information is cleanly separated from pitch information. Second, because a phoneme boundary must be either a boundary between a voiced section and an unvoiced section or a pitch-period boundary, each frame has a unique phoneme identity, and therefore each parameter set has a unique phoneme identity. The accuracy of speech recognition can be improved. (See Part E of Springer Handbook of Speech Processing, Springer Verlag, 2008.)

To segment the speech signal into pitch-synchronous frames, one known method is to rely on simultaneously acquired electroglottograph (EGG) signals. For speech recognition, in most cases there is no electroglottograph instrument. However, to segment the speech signals into pitch-synchronous frames, one does not require the exact glottal closure instants; it only requires the identification of a section in a pitch period where the variation is weak. Based on the observed waveforms, a method to identify the weakly varying section in a pitch period is designed. It is based on the fact that at the starting moment of a pitch period, the signal variation is the greatest. Therefore, by convoluting the speech signal with an asymmetric window function w(n), shown in FIG. 3, the location with the weakest variation can be found. An example of an asymmetric window function is defined on an interval (-N, N). [Most of the following columns, containing the formulas, is not legible in this scan. The legible fragments state that the ends-matching correction applies only near the frame ends, "where M is about N/10; otherwise x(n) = x0(n)" (see FIG. 6); that the Laguerre expansion coefficients are calculated by C_n = ∫ A(ω) Φ_n(ω) dω, where k is an integer, typically k = 0, 2 or 4, and the associated Laguerre polynomials enter the definition of Φ_n; and that the timbre proximity index contains a small positive number ε (here ε = 0.1) to avoid infinity.]

The timbre proximity index is greater if the two phonemes are similar. FIG. 9 shows an example of the variation of the timbre proximity index with the frame index, for a sequence of three IPA phonemes, [iao]: 901 is the variation of P with regard to the base phoneme [i], 902 is the variation of P with regard to the base phoneme [a], and 903 is the variation of P with regard to the base phoneme [o]. Therefore, the phoneme identity of each pitch period can be identified. A speech recognition system of high accuracy can be built based on this method.

While this invention has been described in conjunction with the exemplary embodiments outlined above, it is evident that many alternatives, modifications and variations will be apparent to those skilled in the art. Accordingly, the exemplary embodiments of the invention, as set forth above, are intended to be illustrative, not limiting. Various changes may be made without departing from the spirit and scope of the invention.

I claim:

1. A method of automatic speech recognition to convert speech signal into text using one or more processors, comprising:
A) segmenting the speech signal into pitch-synchronous frames, wherein for voiced sections each said frame is a single pitch period;
B) for each frame, equalizing the two ends of the waveform using an ends-matching program;
C) generating an amplitude spectrum of each said frame using Fourier analysis;
D) transforming each said amplitude spectrum into a timbre vector using Laguerre functions;
E) performing acoustic decoding to find a list of most likely phonemes or sub-phoneme units for each said timbre vector by comparing with a timbre vector database;
F) decoding the sequence of the list of the most likely phonemes or sub-phoneme units using a language model database to find out the most likely text;
wherein the segmenting of the speech-signal is based on an analysis of the speech signals using an asymmetric window, which includes:
a) conducting, for a speaker, a test to find the best size of the asymmetric window;
b) convoluting the speech-signal with the said asymmetric window to form a profile function;
c) picking up the maxima in the said profile function as segmentation points;
d) extending the segmentation points to unvoiced sections.

2. The method in claim 1, wherein segmenting of the speech-signal is based on the glottal closure instants derived from simultaneously recorded electroglottograph signals and by analyzing the sections of speech signal where glottal closure signals do not exist.

3. The method in claim 1, wherein the acoustic decoding comprises distinguishing different voiced phonemes by computing a timbre distance between each said timbre vector and the timbre vectors of different voiced phonemes in the timbre vector database.

4. The method in claim 1, wherein the acoustic decoding comprises distinguishing different unvoiced consonants by computing a timbre distance between each said timbre vector and the timbre vectors of different unvoiced consonants in the timbre vector database.

5. The method in claim 1, wherein the different tones in tone languages are identified using the frame durations and the slope of changes in frame durations in the said timbre vectors.

6. The method in claim 1, wherein the timbre vector database is constructed by the steps comprising:
recording speech-signal by a speaker or a number of speakers reading a prepared text which contains all phonemes of the target language into digital form;
segmenting the speech signal into pitch-synchronous frames, wherein for voiced sections each said frame is a single pitch period;
generating amplitude spectra of the said frames using Fourier analysis;
transforming the said amplitude spectra into timbre vectors using Laguerre functions;
transcribing the prepared text into phonemes or sub-phoneme units;
identifying the phoneme of each said timbre vector by comparing with the phonemes or sub-phoneme transcription of the prepared text;
collecting the pairs of timbre vectors and the corresponding phonemes or sub-phoneme units to form a database.

7. A system of automatic speech recognition to convert speech-signal into text comprising one or more data processing apparatus; and a computer-readable medium coupled to the one or more data processing apparatus having instructions stored thereon which, when executed by the one or more data processing apparatus, cause the one or more data processing apparatus to perform a method comprising:
A) segmenting the speech signal into pitch-synchronous frames, wherein for voiced sections each said frame is a single pitch period;
B) for each frame, equalizing the two ends of the waveform using an ends-matching program;
C) generating an amplitude spectrum of each said frame using Fourier analysis;
D) transforming each said amplitude spectrum into a timbre vector using Laguerre functions;
E) performing acoustic decoding to find a list of most likely phonemes or sub-phoneme units for each said timbre vector by comparing with a timbre vector database;
F) decoding the sequence of the list of the most likely phonemes or sub-phoneme units using a language model database to find out the most likely text;
wherein the segmenting of the speech-signal is based on an analysis of the speech signals using an asymmetric window, including:
a) conducting, for a speaker, a test to find the best size of the asymmetric window;
b) convoluting the speech-signal with the said asymmetric window to form a profile function;
c) picking up the maxima in the said profile function as segmentation points;
d) extending the segmentation points to unvoiced sections.

8. The system in claim 7, wherein segmenting of the speech-signal is based on the glottal closure instants derived from simultaneously recorded electroglottograph signals and by analyzing the sections of speech signal where glottal closure signals do not exist.

9. The system in claim 7, wherein the acoustic decoding comprises distinguishing different voiced phonemes by computing a timbre distance between each said timbre vector and the timbre vectors of different voiced phonemes in the timbre vector database.

10. The system in claim 7, wherein the acoustic decoding comprises distinguishing different unvoiced consonants by computing a timbre distance between each said timbre vector and the timbre vectors of different unvoiced consonants in the timbre vector database.

11. The system in claim 7, wherein the different tones in tone languages are identified using the frame durations and the slope of changes in frame durations in the said timbre vectors.

12. The system in claim 7, wherein the timbre vector database is constructed by the steps comprising:
recording speech-signal by a speaker or a number of speakers reading a prepared text which contains all phonemes of the target language into digital form;
segmenting the speech signal into pitch-synchronous frames, wherein for voiced sections each said frame is a single pitch period;
generating amplitude spectra of the said frames using Fourier analysis;
transforming the said amplitude spectra into timbre vectors using Laguerre functions;
transcribing the prepared text into phonemes or sub-phoneme units;
identifying the phoneme of each said timbre vector by comparing with the phonemes or sub-phoneme transcription of the prepared text;
collecting the pairs of timbre vectors and the corresponding phonemes or sub-phoneme units to form a database.

* * * * *