(12) United States Patent (10) Patent N0.: US 8,942,977 B2 Chen (45) Date of Patent: Jan
Total Page:16
File Type:pdf, Size:1020Kb
US008942977B2 (12) United States Patent (10) Patent N0.: US 8,942,977 B2 Chen (45) Date of Patent: Jan. 27, 2015 (54) SYSTEM AND METHOD FOR SPEECH (56) References Cited RECOGNITION USING U.S. PATENT DOCUMENTS PITCH-SYNCHRONOUS SPECTRAL PARAMETERS 5,917,738 A * 6/1999 Pan ............................. .. 708/403 6,311,158 B1* 10/2001 Laroche .. 704/269 (71) Applicant: Chengiun Julian Chen, White Plains, 6,470,311 B1* 10/2002 Moncur .... .. 704/208 NY (U S) H2172 H * 9/2006 Staelin et a1. ............... .. 704/207 OTHER PUBLICATIONS Inventor: Chengiun Julian Chen, White Plains, (72) Hess, Wolfgang. “A pitch-synchronous digital feature extraction sys NY (U S) tem for phonemic recognition of speech.” Acoustics, Speech and Notice: Subject to any disclaimer, the term of this Signal Processing, IEEE Transactions on 24.1 (1976): 14-25.* (*) Mandyam, Giridhar, Nasir Ahmed, and Neeraj Magotra. “Applica patent is extended or adjusted under 35 tion of the discrete Laguerre transform to speech coding.” Asilomar U.S.C. 154(b) by 0 days. Conference on Signals, Systems and Computers. IEEE Computer Society, 1995* (21) Appl. No.: 14/216,684 Legat, Milan, J. Matousek, and Daniel Tihelka. “On the detection of pitch marks using a robust multi-phase algorithm.” Speech Commu (22) Filed: Mar. 17, 2014 nication 53.4 (2011): 552-566.* Wikipedia entry for tonal languages (Dec. 15, 2011).* (65) Prior Publication Data * cited by examiner US 2014/0200889 A1 Jul. 17,2014 Primary Examiner * Vincent P Harper (57) ABSTRACT Related US. Application Data The present invention de?nes a pitch- synchronous parametri Continuation-in-part of application No. 13/692,584, cal representation of speech signals as the basis of speech (63) recognition, and discloses methods of generating the said ?led on Dec. 3, 2012, now Pat. No. 8,719,030. pitch-synchronous parametrical representation from speech (51) Int CL signals. The speech signal is ?rst going through a pitch-marks @101, 15/00 (201301) picking program to identify the pitch periods. The speech @101, 15/02 (200601) signal is then segmented into pitch-synchronous frames. An G101, 25/90 (201301) ends-matching program equalizes the values at the two ends G101, 25/93 (201301) of the waveform in each frame. Using Fourier analysis, the (52) us CL speech signal in each frame is converted into a pitch-synchro CPC ............... .. G10L 15/02 (2013.01); G] 0L 25/90 nous 9mplimde SPeCme- Using Laguem? functions, the said (201301); G] 0L 25/93 (2013 01) amphtude spectrum is converted mto a umt vector, referred to USPC ......................................... .. 704/235; 704/254 as the timbre Vector- BY using a database of correlated pho (58) Field of Classi?cation Search nemes and timbre vectors, the most likely phoneme sequence CPC .............................. .. G10L 15/02- G10L 15/04 Ofan Input Speech Slgnal can be decoded 1“ the must” Stage USPC ........................................ .. 704/254, 235, 256 Ofa Speech recognition System See application ?le for complete search history. 12 Claims, 9 Drawing Sheets 101, ASYMMETRIC PCM SIGNAL~102 OUTPUT \124 WINDOW1 1 l /-109 TEXT ~ 1.123 12.2 1 104- PROFILElFUNCTION RAW Plums/110 @2333? E " $643365 PICKINC: PEAKS ENDS MATCHING ‘ m E 105 1 CYCLIC+FRAMES ’112 gamma/121 120 , PlTCH MARKS As :, 55 UENCES ' 106 FRAME ErDPOINTs FOUREFlANALYSlSDISCRETE PE R 11141'13 Q ‘ 107 \ EXTENSION TO 1 s U A 3:28;? <- PHONEMES UNVOICED AND I INTERPOLATION I/ 115 V5. TIMBRE SILENCE PCMS { 119 VECTORS 108 + commuous SPECTRA/1 16 I \COMPLETE SET OF _ l ’1 17 r118 FRAME ENDPOINTS " TIMBRE VECTORS US. Patent Jan. 27, 2015 Sheet 2 0f9 US 8,942,977 B2 (“J (u U”) E (13) w m an -=:r cu N . c; PM m m g u“ m _ at) m '9», {*1 a: 22* g N _ m v a“! m ‘2 ’ 3 if N 2% (A) (B) US. Patent Jan.27,2015 Sheet30f9 US 8,942,977 B2 0.? w m6wdv.0Nd / 22: .0E v.0-.wd-w.oNo- _ 0-o. m6 To m6 Nd 70 o To No md iol m6 Fl (MW US. Patent Jan.27,2015 Sheet40f9 US 8,942,977 B2 Zié22,222;2 \NQQN?? 3wgw Kan/2wQ4“é»,//mm/.,mm/anmuni/r,%?xmwj 41/_ a 5%wow3% 3209/F?Zi;{125; 6memEF QUE US. Patent Jan. 27, 2015 Sheet 5 0f9 US 8,942,977 B2 Timescaieoftheantisymmetric-a1Windnw{mass} FIG.5 C) (23 6 CD (I) {33% C3 @123 (3:: CD (I) In wr m 04 1"“ gnawed @361 gm Jaqwnu main US. Patent Jan. 27, 2015 Sheet 6 0f9 US 8,942,977 B2 603’ $6041" FIG.6 US. Patent Jan. 27, 2015 Sheet 7 0f9 US 8,942,977 B2 E23?gmme@2232me2WEE wu?wmumwwwmw[$thme5.0% vowwok.mow mg... h.OE wok. i will! P0N\T, Lumwads waggidi US. Patent Jan. 27, 2015 Sheet 8 0f9 US 8,942,977 B2 803 m this ii 22 sf 5 "AVAVAVAVAVAV r “E M t; A A m E: 6 v V v (1) :3 A m 4 V V ? O 1 ‘ | | ‘ I I ‘ I 0 \ 2 4 6 8 18 kHz 802 8 803 US. Patent Jan. 27, 2015 Sheet 9 0f9 US 8,942,977 B2 (C?) 8 m\ N no CD (\l :55 m\ n Em < (I . m (D g u. at E “r” U’\ grip! M! ' C) (\l r“ C) “=— xepu! A1,!ngon US 8,942,977 B2 1 2 SYSTEM AND METHOD FOR SPEECH speech recognition, and discloses methods of generating the RECOGNITION USING said pitch-synchronous parametrical representation from PITCH-SYNCHRONOUS SPECTRAL speech signals, in particular timbre vectors. PARAMETERS According to an exemplary embodiment of the invention, see FIG. 1, a speech signal is ?rst going through a pitch-marks The present application is a continuation in part of patent picking program to pick the pitch marks. The pitch marks are application Ser. No. 13/692,584, entitled “System and Method for Speech Synthesis Using Timbre Vectors”, ?led sent to a process unit to generate a complete set of segmen Dec. 3, 2012, by inventor Chengjun Julian Chen. tation points. The speech signal is segmented into pitch synchronous frames according to the said segmentation FIELD OF THE INVENTION points. An ends-meeting program is executed to make the values at the two ends of every frame equal. Using Fourier The present invention generally relates to automatic speech analysis, the speech signal in each frame is converted into a recognition, in particular to automatic speech recognition pitch-synchronous amplitude spectrum, then Laguerre func using pitch-synchronous spectral parameters, for example in tions are used to convert the said pitch-synchronous ampli particular timbre vectors. tude spectrum into a unit vector characteristic to the instan taneous timbre, referred to as the timbre vector. Those timbre BACKGROUND OF THE INVENTION vectors constitute the parametrical representation of the speech signal. Speech recognition is an automatic process to convert the Using recorded speech by a speaker or a number of speak voice signal of speech into text, which has three steps. The 20 ers reading a prepared text which contains all phonemes of the ?rst step, acoustic processing, reduces the speech signal into target language, an acoustic database can be formed. The a parametric representation. The second step is to ?nd the speech signal of the read text is converted into timbre vectors. most possible sequences of phonemes from the said para The phoneme identity of each timbre vector is determined by metrical representation of the speech signal. The third step is correlating to the text. The average timbre vector and variance to ?nd the most possible sequence of words from the possible 25 phoneme sequence and a language model. The current inven for each individual phoneme is collected from the paired tion is related to a new type of parametric representation of record, which forms an acoustic database. speech signal and the process of converting speech signal into During speech recognition, the incoming speech signal is that parametric representation. ?rst converted into a sequence of timbre vectors. Those tim In current commercial speech recognition systems, the bre vectors are then compared with the timbre vectors in the speech signal is ?rst multiplied by a shifting process window, 30 database to ?nd the most likely phoneme sequence. The pos typically a Hamming window of duration about 25 msec and sible phoneme sequence is then sent to a language decoder to a shifts about 10 msec, to form a frame, see FIG. 2(A). A set ?nd out the most likely text. of parameters is produced from each windowed speech sig nal. Therefore, for each 10 msec, a set of parameters repre BRIEF DESCRIPTION OF DRAWINGS senting the speech signal in the 25 msec window duration is 35 produced. The most widely used parameter representations FIG. 1 is a block diagram of a speech recognition systems are linear prediction coef?cients (LPC) and mel-frequency using pitch-synchronous spectral parameters. cepstral coef?cients (MFCC). Such a method has ?aws. First, FIG. 2 shows the fundamental difference between the the positions of the processing windows are unrelated to the prior-art signal processing methods using a overlapping and pitch periods. Therefore, pitch information and spectral infor 40 shifting process window and the pitch-synchronous method mation cannot be cleanly separated. Second, because the of the present invention. window duration is typically 2.5 times greater that the shift FIG. 3 is an example of the asymmetric window for ?nding time, a phoneme boundary is always crossed by two or three pitch marks. consecutive windows. In other words, large number of frames FIG. 4 is an example of the pro?le function for ?nding the cross phoneme boundaries, see FIG.