(12) Patent Application Publication (10) Pub. No.: US 2008/0133228A1 Rao (43) Pub
Total Page:16
File Type:pdf, Size:1020Kb
US 2008O133228A1 (19) United States (12) Patent Application Publication (10) Pub. No.: US 2008/0133228A1 Rao (43) Pub. Date: Jun. 5, 2008 (54) MULTIMODAL SPEECH RECOGNITION (52) U.S. Cl. ................................. 704/231: 704/E15.001 SYSTEM (57) ABSTRACT (76)76) InventorI tOr: AshwinSW P. Rao,a0, Seallle,Seattle, WA (US(US) The disclosure describes an overall system/method for text Correspondence Address: input using a multimodal interface with speech recognition. WHITAKER LAW GROUP Specifically, pluralities of modes interact with the main 755 WINSLOW WAY EAST, SUITE 304 speech mode to provide the speech-recognition system with BANBRIDGE ISLAND WA 98110 partial knowledge of the text corresponding to the spoken 9 utterance forming the input to the speech recognition system. (21) Appl. No.: 11/948,757 The knowledge from other modes is used to dynamically change the ASR system's active vocabulary thereby signifi (22) Filed: Nov.30, 2007 cantly increasing recognition accuracy and significantly reducing processing requirements. Additionally, the speech Related U.S. Application Data recognition system is configured using three different system configurations (always listening, partially listening, and (60) Provisional application No. 60/872,467, filed on Nov. push-to-speak) and for each one of those three different user 30, 2006. interfaces are proposed (speak-and-type, type-and-speak, O O and speak-while-typing). Finally, the overall user-interface of Publication Classification the proposed system is designed such that it enhances existing (51) Int. Cl. standard text-input methods; thereby minimizing the behav GOL 5/00 (2006.01) ior change for mobile users. Multimodal System Using Speech As One Input 100 y (a) DISPLAY HANDWRITING OTHER Hello John, RECOGNITION Can you call me back A S A P (a 123456789? SOFT-KEYPAD OPTICAL CHARACTER / RECOGNITION f f Coyffroi, SIGNAS f 1 101 As 1 y SYSTEM A. QWERTY KEYBOARD SPEECH MODE 10 (MICROPHONE INPUT) Mode Configurations: 1. Always Listening CONTROL SIGNALS 2. Partially Listening To mitigate Interference-Noise introduced by 3. Push-to-Speak other modes: 1. Signal obtained from Display and/or input ASR Options: modes Voice Detection only, 2. Send signals to ASR modules: Search, Audio Speech Features only, Buffers, Language Models, Acoustic Models Speech Recognition Ctc. 3. Signal Processing to filter noise (optional) Patent Application Publication Jun. 5, 2008 Sheet 1 of 29 US 2008/O133228A1 FIG. 1A Multimodal System Using Speech As One Input DISPLAY HANDWRITING Hello John, RECOGNITION Can you call me back A S A P (a) 123456789? SOFT-KEYPAD OPTICAL CHARACT RECOGNITION / / 101 Asy A. QWERTY KEYBOARD \ /27 SYSTEM SPEECH MODE (MICROPHONE INPUT) Mode Configurations: 1. Always Listening CONTROL SIGNALS 2. Partially Listening To mitigate Interference-Noise introduced by 3. Push-to-Speak other modes: ASR Options: in1. Signal obtained from Displavplay and/or inputpu so s 2. Send signals to ASR modules: Search, Audio Speec eatures on y, Buffers, Language Models, Acoustic Models Speech Recognition i etc o 3. Signal Processing to filter noise (optional) Patent Application Publication Jun. 5, 2008 Sheet 2 of 29 US 2008/O133228A1 Figure 1B ASR Engine For Multimodal System USER SPEAKS Generic ASR Engine: using word or syllable or phoneme or other recognition methodologies Inputs from - Front End Signal Processing other a modcs likc keypad, Finite State stylus, Grammars or Acoustic including Statistical Models Spoken Language Alphabet- Models Spelling Spech Features (preferably Voice Search Recognition Post-Process NBest Matrix or Run Multipass Recognition Return Final Results Patent Application Publication Jun. 5, 2008 Sheet 3 of 29 US 2008/O133228A1 FIGURE 2A PE beginning 210 Symbols, 220 W BASE LEXICON Emoticons 4 ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' Mav Include: 201 Phrases beginning with Za y AlphaNumeric 211 1. Words and/or Phrases SRaed >20,000) 221 WordsordS bbeginning 2. Symbols with A 212 3. Numbers CannedPhrases 2222 4. Alphabet-set SCS 5. Emoticons Words beginning 6. Control-Commands Control- 223 with Ab 213 7. Canned Phrases (<10) Commands // 1 ........... Spelling Input From Words/ A WAWords 2. beginni eglinnling Other Input Modes 230 Phrases 214 224 FIGURE 2B VOICETAP ALPHABET NUMBERS Patent Application Publication Jun. 5, 2008 Sheet 4 of 29 US 2008/O133228A1 FIGURE 2C FIGURE 2D; 28O MULTIPASS RECOGNITIONUSED (NON-REAL-TIME) Patent Application Publication Jun. 5, 2008 Sheet 5 of 29 US 2008/O133228A1 FIGURE 2E PARTIAL SPELLINGS FIGURE 2F: ADDING NEW WORDS 292 CREATE SUB-SETS AS PERFIGURE 2A, LET THIS BE S1 CHECK FOR NEW WORD (DONE AT APPLICATION RUN-TIME) CREATE NEW SUB-SET BY LOOKING FOR INITIAL 296 LETTERS SIMILAR TO FIG 2A; CALL THIS S2 SEARCHING AGAINST NEW WORDS SEARCHUSING THE SUB-SET LEXICON INITIALIZED AS PERFIGURE 2A TO GET BEST-MATCH WORD W1 SEARCH AGAINST ANETWORK OF PHONEMES TO GENERATE BEST-MATCHING PHONES DO DYNAMIC PROGRAMMING BETWEEN PHONES AND PRONUNCIATIONS FOR W1 PLUS PRONUNCLATIONS FOR ALL WORDS IN S2; OUTPUT THE LIST OF WORDS RANKED BY THEIR SCORE Patent Application Publication Jun. 5, 2008 Sheet 6 of 29 US 2008/O133228A1 FIGURE 2G: FIGURE 3A: One embodiment of algorithm for generating sub-set lexicons Initialize i=A Base System Select frequently used words Lexicon (common to application) beginning (10,000 and with 1 and Save in iTXT OC words/phrases) Increment Example: if i=A, i=B: if i=B, i=C............. if i=Y, i=Z Increasing number of words reduces accuracy and increases memory/processing...less than 200 word per recommended Patent Application Publication Jun. 5, 2008 Sheet 7 of 29 US 2008/O133228A1 FIGURE 3B: Increment “i’,& 5 Example: if i=A, j=A if i=A, j=B Select all words beginning with and save in ij.TXT Base System Some combination Lexicon Generate Pronunciation for words in of i,j may have (10,000 and i.TXT large number of Ore words depending words/phrases) on application; Gcncratic Grammar and compile choose number search network ij.DAT based on hardware resources; if number large split i,j into i,j,k etc. FIGURE 3C Base System Lexicon Select ambiguous input letters for (10,000 and example A, B, C One Select Next Letter, words/phrases) example D for DEF key on mobile Sclect words beginning with A, B, C and Save in A.Txt Patent Application Publication Jun. 5, 2008 Sheet 8 of 29 US 2008/O133228A1 FIGURE 4A: Method for predicting Word “Meeting” USER SPEAKS: “Meeting SYSTEM WAITS FOR A KEY-PRESS EVENT TO ACTIVATEy A STATE 402 NO User Presses any Letter Key? WAIT s Activate ASR Engine with new lexicon: Words beginning with Pressed Letter (example M) 405 - - 406 Recognize using stored features which may be the SPEECH FEATURE waveform or traditional ASR front-end features as in Cepstrum, EXAMPLE OUTPUT:“Minute” “Meeting “Me” Additional v Inputs from Post-Process NBest to rearrange all words in User: D NBest Matrix such that their initial letters are example 407 “Mee 08 Display Result “MEETING 409 Patent Application Publication Jun. 5, 2008 Sheet 9 of 29 US 2008/O133228A1 FIGURE 4B Method for predicting Symbol “Comma' SYSTEM WAITS FOR A KEY-PRESS EVENT TO ACTIVATE SYMBOL STATE 410 USER PRESSES A BUTTON MAPPED FOR SYMBOLS: EXAMPLE PRESS CENTER-KEY TWICE 11 413 NO 412 User Speaks “Comma' WAIT s Activate ASR Engine with SYMBOL lexicon 414 Recognize in real-time EXAMPLE OUTPUT: ? ; , 4 1. 5 Additional y Inputs from User to - Ho Display Result Select the 9 o417 COrrect choice 416 Patent Application Publication Jun. 5, 2008 Sheet 10 of 29 US 2008/O133228A1 FIGURE 4C Method for predicting Word “Meeting” USER SPEAKS: “PREDICT Meeting 418 ASR with Grammar “PREDICT <JUNK>'': 421 Output “PREDICT" 420 EXIT Input fro o other ES modes like keypad or spoken Activate ASR Engine with new lexicon: alphabet: “M Words beginning with M 423 SPEECH FEATURES o o424 ASR Engine-SEARCH Return Result “MEETING 425 Patent Application Publication Jun. 5, 2008 Sheet 11 of 29 US 2008/O133228A1 FIGURE 4D: Method for predicting Phrase “Where is the Meeting” USER SPEAKS: “Where Is The Meeting 426 System waits for input from user 429 428 User inputs letter? WAIT Input from user using keypad: s “W IT M Activate ASR Engine with new lexicon: Phrases with words beginning with W, I, T, M SPEECH FEATURES 430 ASR Engine-SEARCH Return Result 433 “WHERE IS THE MEETING Patent Application Publication Jun. 5, 2008 Sheet 12 of 29 US 2008/O133228A1 FIGURE 4E Algorithm for predicting Phrase “Where is the Meeting” USER SPEAKS: “PREDICT Where Is The Meeting” 434 ASR with Grammar “PREDICT <JUNK>'': S. S.S.S.S. s. s. SSSR: 88: RSSS 437 Output “PREDICT"? EXIT s SPEECH FEATURES Dictation ASR Engine with statistical Language Model OR Phoneme ASR Engine: Outputs Matrix of NBest Results 438 Input from other Post-Process NBest to rearrange all words in E. O NBest Matrix Such that their initial characters alphabet: are W., I, T, M 440 “W IT M 439 Return Result 441 “WHERE IS THE MEETING Patent Application Publication Jun. 5, 2008 Sheet 13 of 29 US 2008/O133228A1 FIGURE 5A Speak-and-Type Interface Using Push-to-Speak Configuration USER uses Push-to-speak button to Speak: “Meeting” 501 SYSTEM WAITS FOR A KEY-PRESS EVENT TO ACTIVATE A STATE 502 504 NO O3 User Presses any Letter Key? WAIT s Activate ASR Engine with lexicon: Words beginning with Pressed Letter (example M) Recognize using stored features which may be the SPEECH FEATURE waveform or traditional ASR front-end features as in epstrum; EXAMPLE OUTPUT:“Minute” “Meeting” “Me” Additional v Inputs from Re-recognize by activating ASR with ME.dat USer; or MEE.dat or Post-Process NBest to example rearrange all words in NBest Matrix Such “Mee’ that their initial letters are “Mee” 508 507 Display Result “MEETING 509 Patent Application Publication Jun.