US 2008O133228A1 (19) United States (12) Patent Application Publication (10) Pub. No.: US 2008/0133228A1 Rao (43) Pub. Date: Jun. 5, 2008

(54) MULTIMODAL (52) U.S. Cl...... 704/231: 704/E15.001 SYSTEM (57) ABSTRACT (76)76) InventorI tOr: AshwinSW P. Rao,a0, Seallle,Seattle, WA (US(US) The disclosure describes an overall system/method for text Correspondence Address: input using a multimodal interface with speech recognition. WHITAKER LAW GROUP Specifically, pluralities of modes interact with the main 755 WINSLOW WAY EAST, SUITE 304 speech mode to provide the speech-recognition system with BANBRIDGE ISLAND WA 98110 partial knowledge of the text corresponding to the spoken 9 utterance forming the input to the speech recognition system. (21) Appl. No.: 11/948,757 The knowledge from other modes is used to dynamically change the ASR system's active vocabulary thereby signifi (22) Filed: Nov.30, 2007 cantly increasing recognition accuracy and significantly reducing processing requirements. Additionally, the speech Related U.S. Application Data recognition system is configured using three different system configurations (always listening, partially listening, and (60) Provisional application No. 60/872,467, filed on Nov. push-to-speak) and for each one of those three different user 30, 2006. interfaces are proposed (speak-and-type, type-and-speak, O O and speak-while-typing). Finally, the overall user-interface of Publication Classification the proposed system is designed such that it enhances existing (51) Int. Cl. standard text-input methods; thereby minimizing the behav GOL 5/00 (2006.01) ior change for mobile users.

Multimodal System Using Speech As One Input 100 y

(a) DISPLAY HANDWRITING OTHER Hello John, RECOGNITION Can you call me back A S A P (a 123456789?

SOFT-KEYPAD OPTICAL CHARACTER / RECOGNITION f f Coyffroi, SIGNAS f 1 101 As 1 y SYSTEM A. QWERTY KEYBOARD

SPEECH MODE 10 (MICROPHONE INPUT) Mode Configurations: 1. Always Listening CONTROL SIGNALS 2. Partially Listening To mitigate Interference-Noise introduced by 3. Push-to-Speak other modes: 1. Signal obtained from Display and/or input ASR Options: modes Voice Detection only, 2. Send signals to ASR modules: Search, Audio Speech Features only, Buffers, Language Models, Acoustic Models Speech Recognition Ctc. 3. Signal Processing to filter noise (optional) Patent Application Publication Jun. 5, 2008 Sheet 1 of 29 US 2008/O133228A1

FIG. 1A Multimodal System Using Speech As One Input

DISPLAY HANDWRITING Hello John, RECOGNITION Can you call me back A S A P (a) 123456789?

SOFT-KEYPAD OPTICAL CHARACT RECOGNITION / / 101 Asy A. QWERTY KEYBOARD \ /27 SYSTEM SPEECH MODE (MICROPHONE INPUT) Mode Configurations: 1. Always Listening CONTROL SIGNALS 2. Partially Listening To mitigate Interference-Noise introduced by 3. Push-to-Speak other modes: ASR Options: in1. Signal obtained from Displavplay and/or inputpu so s 2. Send signals to ASR modules: Search, Audio Speec eatures on y, Buffers, Language Models, Acoustic Models Speech Recognition i etc o 3. Signal Processing to filter noise (optional) Patent Application Publication Jun. 5, 2008 Sheet 2 of 29 US 2008/O133228A1

Figure 1B ASR Engine For Multimodal System

USER SPEAKS

Generic ASR Engine: using or syllable or phoneme or other recognition methodologies Inputs from - Front End Signal Processing other a modcs likc keypad, Finite State stylus, Grammars or Acoustic

including Statistical Models Spoken Language Alphabet- Models Spelling Spech Features (preferably Voice

Search Recognition

Post-Process NBest Matrix or Run Multipass Recognition

Return Final Results Patent Application Publication Jun. 5, 2008 Sheet 3 of 29 US 2008/O133228A1

FIGURE 2A PE beginning 210

Symbols, 220 W BASE LEXICON Emoticons 4 ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' Mav Include: 201 Phrases beginning with Za y AlphaNumeric 211 1. and/or Phrases SRaed >20,000) 221 WordsordS bbeginning 2. Symbols with A 212

3. Numbers CannedPhrases 2222

4. Alphabet-set SCS

5. Emoticons Words beginning 6. Control-Commands Control- 223 with Ab 213 7. Canned Phrases (<10) Commands // 1 ......

Spelling Input From Words/ A WAWords 2. beginni eglinnling Other Input Modes 230 Phrases 214 224

FIGURE 2B

VOICETAP ALPHABET NUMBERS Patent Application Publication Jun. 5, 2008 Sheet 4 of 29 US 2008/O133228A1

FIGURE 2C

FIGURE 2D; 28O

MULTIPASS RECOGNITIONUSED (NON-REAL-TIME) Patent Application Publication Jun. 5, 2008 Sheet 5 of 29 US 2008/O133228A1

FIGURE 2E

PARTIAL SPELLINGS

FIGURE 2F:

ADDING NEW WORDS 292 CREATE SUB-SETS AS PERFIGURE 2A, LET THIS BE S1

CHECK FOR NEW WORD (DONE AT APPLICATION RUN-TIME)

CREATE NEW SUB-SET BY LOOKING FOR INITIAL 296 LETTERS SIMILAR TO FIG 2A; CALL THIS S2

SEARCHING AGAINST NEW WORDS

SEARCHUSING THE SUB-SET LEXICON INITIALIZED AS PERFIGURE 2A TO GET BEST-MATCH WORD W1

SEARCH AGAINST ANETWORK OF PHONEMES TO GENERATE BEST-MATCHING PHONES

DO DYNAMIC PROGRAMMING BETWEEN PHONES AND PRONUNCIATIONS FOR W1 PLUS PRONUNCLATIONS FOR ALL WORDS IN S2; OUTPUT THE LIST OF WORDS RANKED BY THEIR SCORE Patent Application Publication Jun. 5, 2008 Sheet 6 of 29 US 2008/O133228A1

FIGURE 2G:

FIGURE 3A: One embodiment of algorithm for generating sub-set lexicons

Initialize i=A

Base System Select frequently used words Lexicon (common to application) beginning (10,000 and with 1 and Save in iTXT OC words/phrases)

Increment Example: if i=A, i=B: if i=B, i=C...... if i=Y, i=Z Increasing number of words reduces accuracy and increases memory/processing...less than 200 word per recommended Patent Application Publication Jun. 5, 2008 Sheet 7 of 29 US 2008/O133228A1

FIGURE 3B:

Increment “i’,& 5 Example: if i=A, j=A if i=A, j=B Select all words beginning with and save in ij.TXT

Base System Some combination Lexicon Generate Pronunciation for words in of i,j may have (10,000 and i.TXT large number of Ore words depending words/phrases) on application;

Gcncratic Grammar and compile choose number search network ij.DAT based on hardware resources; if number large split i,j into i,j,k etc.

FIGURE 3C

Base System Lexicon Select ambiguous input letters for (10,000 and example A, B, C One Select Next Letter, words/phrases) example D for DEF key on mobile

Sclect words beginning with A, B, C and Save in A.Txt Patent Application Publication Jun. 5, 2008 Sheet 8 of 29 US 2008/O133228A1

FIGURE 4A: Method for predicting Word “Meeting”

USER SPEAKS: “Meeting

SYSTEM WAITS FOR A KEY-PRESS

EVENT TO ACTIVATEy A STATE 402 NO User Presses any Letter Key? WAIT s Activate ASR Engine with new lexicon: Words beginning with Pressed Letter (example M) 405

- - 406 Recognize using stored features which may be the SPEECH FEATURE waveform or traditional ASR front-end features as in Cepstrum, EXAMPLE OUTPUT:“Minute” “Meeting “Me” Additional v Inputs from Post-Process NBest to rearrange all words in User: D NBest Matrix such that their initial letters are example 407 “Mee

08 Display Result “MEETING 409 Patent Application Publication Jun. 5, 2008 Sheet 9 of 29 US 2008/O133228A1

FIGURE 4B Method for predicting Symbol “Comma'

SYSTEM WAITS FOR A KEY-PRESS EVENT TO ACTIVATE SYMBOL STATE 410

USER PRESSES A BUTTON MAPPED FOR SYMBOLS: EXAMPLE PRESS CENTER-KEY TWICE 11

413 NO 412 User Speaks “Comma' WAIT s Activate ASR Engine with SYMBOL lexicon 414

Recognize in real-time EXAMPLE OUTPUT: ? ; , 4 1. 5 Additional y Inputs from User to - Ho Display Result Select the 9 o417 COrrect choice 416 Patent Application Publication Jun. 5, 2008 Sheet 10 of 29 US 2008/O133228A1

FIGURE 4C Method for predicting Word “Meeting”

USER SPEAKS: “PREDICT Meeting 418

ASR with Grammar “PREDICT '':

421 Output “PREDICT" 420 EXIT Input fro o other ES modes like keypad or spoken Activate ASR Engine with new lexicon:

alphabet: “M Words beginning with M 423

SPEECH FEATURES o 424o ASR Engine-SEARCH

Return Result “MEETING 425 Patent Application Publication Jun. 5, 2008 Sheet 11 of 29 US 2008/O133228A1

FIGURE 4D: Method for predicting Phrase “Where is the Meeting”

USER SPEAKS: “Where Is The Meeting 426

System waits for input from user

429 428 User inputs letter? WAIT Input from user using keypad:

s “W IT M Activate ASR Engine with new lexicon:

Phrases with words beginning with W, I, T, M

SPEECH FEATURES 430 ASR Engine-SEARCH

Return Result 433 “WHERE IS THE MEETING Patent Application Publication Jun. 5, 2008 Sheet 12 of 29 US 2008/O133228A1

FIGURE 4E Algorithm for predicting Phrase “Where is the Meeting”

USER SPEAKS: “PREDICT Where Is The Meeting” 434

ASR with Grammar “PREDICT '': S. S.S.S.S. s. s. SSSR: 88: RSSS

437 Output “PREDICT"? EXIT s SPEECH FEATURES Dictation ASR Engine with statistical Language Model OR Phoneme ASR Engine: Outputs Matrix of NBest Results 438

Input from other Post-Process NBest to rearrange all words in E. O NBest Matrix Such that their initial characters alphabet: are W., I, T, M 440 “W IT M 439 Return Result 441 “WHERE IS THE MEETING Patent Application Publication Jun. 5, 2008 Sheet 13 of 29 US 2008/O133228A1

FIGURE 5A Speak-and-Type Interface Using Push-to-Speak Configuration

USER uses Push-to-speak button to Speak: “Meeting” 501

SYSTEM WAITS FOR A KEY-PRESS EVENT TO ACTIVATE A STATE 502

504 NO O3 User Presses any Letter Key? WAIT s Activate ASR Engine with lexicon: Words beginning with Pressed Letter (example M)

Recognize using stored features which may be the SPEECH FEATURE waveform or traditional ASR front-end features as in epstrum; EXAMPLE OUTPUT:“Minute” “Meeting” “Me”

Additional v Inputs from Re-recognize by activating ASR with ME.dat USer; or MEE.dat or Post-Process NBest to example rearrange all words in NBest Matrix Such “Mee’ that their initial letters are “Mee” 508

507

Display Result “MEETING 509

Patent Application Publication Jun. 5, 2008 Sheet 14 of 29 US 2008/O133228A1

FIGURE SB Type-and-Speak Interface Using Push-to-Speak Configuration

SYSTEM WAITS FOR A KEY-PRESS EVENT TO ACTIVATE A STATE

510 User Presses any Letter Kcy? NO WAIT ES Activate ASR Engine with lexicon: Words beginning with Pressed Letter (example M) y USER uses Push-to-speak button to Speak: “Meeting”

Recognize in real-time EXAMPLE OUTPUT:"Minute” “Meeting” “Me”

SPEECH FE

Additional Inputs from Re-recognize by activating ASR with ME.dat User: or MEE.dat or Post-Process NBest to example rearrange all words in NBest Matrix such “Mee that their initial letters are “Mee”

Display Result “MEETING Patent Application Publication Jun. 5, 2008 Sheet 15 of 29 US 2008/0133228A1

FIGURE SC: Speak-and-Type Interface Using Partial Listening Configuration

SYSTEM MICROPHONE IS ON 512

USER Speaks: “Meeting"

Buffer to collect SYSTEM WAITS FOR A KEY-PRESS EVENT samples or TO ACTIVATE A STATE features

If contro 517 received, User Presses any Letter Key? Stop collecting v YES samples 1. Turn Microphone Off or Reset Search or Stop and store Recognition etc. 518 waveform 2. Activate ASR Engine with lexicon: o or features Words beginning with Pressed Letter (example M)

516 v Recognize using stored features which may be the waveform or traditional ASR front-end features as in Cepstrum; 519 EXAMPLE OUTPUT:"Minute” “Meeting” “Me” y Additional Re-recognize by activating ASR with ME.dat Inputs from or MEE.dat or Post-Process NBest to User, example rearrange all words in NBest Matrix such that their initial letters are "Mee” “Mee’? 521

5 2 O Display Results o2 2

DID USER FINALIZE THE RESULT DISPLAYED AS NO CORRECT 523 YES Patent Application Publication Jun. 5, 2008 Sheet 16 of 29 US 2008/0133228A1

FIGURE SD: Type-and-Speak Interface Using Partial Listening Configuration

USER Types letter(s): “M” 524

SYSTEM MICROPHONE IS ON or 525 AUDIO BUFFERS ARE COLLECTED

Buffer to collect SYSTEM DETECTS BEGIN AND END OF 527 Samples or SPEECH AND/OR WAITS FOR NEXT LETTER features

If control received, Stop collecting Samples If End of Speech Detected, activate and Store lexicon with previous typed letter, waveform else Activate ASR Engine with or features lexicon words beginning with all entered letters 530

If additional keypad only inputs received and no further speech detected Additional V Keypad Re-recognize by activating ASR with ME.dat only Input or MEE.dat or Post-Process NBest to rearrange all words in NBest Matrix Such 534 that their initial letters are “Mee' 535

Display Results 36

DID USER FINALIZE THE RESULT DISPLAYED AS NO CORRECT? 537 YES Patent Application Publication Jun. 5, 2008 Sheet 17 of 29 US 2008/0133228A1

FIGURE SE: Speak-and-Type Interface Using Always Listening Configuration

SYSTEM MICROPHONEv IS ON o38

USER Speaks: “Meeting” 539

Buffer to collect SYSTEM WAITS FOR A KEY-PRESS EVENT Samples or TO ACTIVATE A STATE 541 features v 543 NO If control User Presses any Letter Key? received, Stop collecting samples and store 1. Reset the Search Network waveform 2. Activate ASR Engine with lexicon: or features Words beginning with enteredletter (example M) 544

Recognize using Stored features which may be the waveform or traditional ASR front-end features EXAMPLE OUTPUT:“Minute” “Meeting “Me” 545 Additional Inputs from User: Re-recognize by activating ASR with ME.dat example or MEE.dat or Post-Process NBest to Mee rearrange all words in NBest Matrix such that their initial letters are “Mee' 547

Display Results 548

DID USER FINALIZE THE RESULT DISPLAYED AS NO CORRECT? 54 YES Patent Application Publication Jun. 5, 2008 Sheet 18 of 29 US 2008/O133228A1

FIGURE SF: Type-and-Speak Interface Using Always Listening Configuration MICROPHONE IS ALWAYS ON AND RECOGNIZERIS CONTINUOUSLY PROCESSING 550

USERTYPES LETTER(s) 551

SYSTEM RESETS SEARCH NETWORK AND STORED AUDIO BUFFERS 552 v End of Speech Detected? If End-of- 553 Speech detected, Store utterance detected If End of Specch Detected, activate 556 lexicon with all previous typed

letters 555

Recognize using stored features; e.g. “Meet” “Met” 558 w If Delete Key detected

Additional Keypad Re-recognize by activating ASR with ME.dat only Input or MEE.dat or Post-Process NBest to rearrange all words in NBest Matrix such that their initial letters are “Mee' 560 561

Display Results 562

DID USER FINALIZE THE RESULT DISPLAYED AS No | CORRECT? 563 YES Patent Application Publication Jun. 5, 2008 Sheet 19 of 29 US 2008/O133228A1

FIGURE SG

INPUT NAMEs JOHN -> SPEAK "JOHN SMITH" -> (E.G. JOHN TYPE J" SMITH) SELECT JOHN SMITH Or TYPE "J" -> SPEAK "JOHN SMITH" Or SPEAK "JOHN SMITH" While TYPING

PHRASES AMDRIV -> SPEAK "AM DRIVING"-G) (E.G. AM SELECT AM DRIVING TYPE "A" DRIVING) Or Type "A"-> Speak "Am Driving" Or

SYMBOS PRESS ( ) SPEAK "SEMICOLON” (E.G HOLD G) SEM.COON) SYMBOL-G) SCROLL --> SELECTSEMICOLON SHORT WORDS (E.G., CAL) 2 (OR 222) > SPEAK “CALL

LONGWORDS (E.G CONSEQUENCE)

2-) SPEAK: SPEAK 6 -> "CONSEQUENCE "CONSEQUENCE" 6 -> C O” 2 (or 222) -> 7 -> 3 -> Or SELECT SELECT CONSEQUENCE CONSEQUENCE SPEAK: "CONSEQUENCE CD OP” Patent Application Publication Jun. 5, 2008 Sheet 20 of 29 US 2008/O133228A1

FIGURE SH

INPUT NAMES SAY JOHN SMITH" SPEAK "JOHN SMITH" -> (E.G. JOHN SMITH) TYPE “J” Or TYPE "J" -> SPEAK "JOHN SMITH" Or SPEAK "JOHN SMITH" While TYPNG "J"

PHRASES SAY "AM DRIVING" (E.G. "AMDRIVING") SPEAK "AMDRIVING">TYPE "A" Or Type "A"->Speak "Am Driving Or Speak "Am Driving" while Typing "A" SYMBOLS SAY "SEMCOLON” SELECT SYMBOL MODE & (E.G. "SEMICOLON") ALONG WITH OTHER SAY "SEMICOLON” VOICE COMMANDS CHARACTERS SAY C-CHARLE USE HANDSFREE OPTION (E.G. "C") SAY "CD" NUMBERS SAY "12"... USE HANDSFREE OPTION (E.G. "12") SAY “12 23.” SOLATED WORDS NOT ACCURATE HANDSFREE MULTIMODE (E.G. "CALL") "CD" -> 222 G "AB" -> "CALL" “LM" -> “LM" COMMANDS - WORDS NOT ACCURATE ALLOWS SIMULTANEOUS USE OF COMMANDS LIKE ERASE THAT, CHANGE-THAT WHILE DCTATINGWORDS KEYPAD + SPEECH NOT AVAILABLE ALLOWS SIMULTANEOUS SIMULTANEOUSLY ACTIVE USE OF SPEECH AND/OR (MULTIMODAL) KEYPAD VOICE-MODE ACTIVATION PUSH-TO-SPEAK OR ALWAYS LSTENING IN BUTTON ADDITION TO PUSH-TO SPEAK SPEAKING DISTANCE CLOSE TO MC OR SAME DISTANCE AS HEADPHONE TYPNG Patent Application Publication Jun. 5, 2008 Sheet 21 of 29 US 2008/O133228A1

FIGURE 6

FIGURE 6-A FIGURE 6-B

NTERFACE WHEN THE SOFTWARE USER SPEAKS INSERT GERG APPLICATION ISTURNED ONUSER ARONOV': THE N-BEST CHOICE COULD SPEAKINTO THE TO LIST IS DISPLAYED; USER COULD WINDOW ORTYPE AMOBILE NEXT SPEAK “CHANGE THAT NUMBER OR INSERTA CONTACT AND/OR MAYUALLY SCROLLIDOWN NAMEUSING THE TO SELECT N-BEST ALTERNATIVE MENU-Y ADD-GdCONTACT ORSAY BEGIN WOICEDAL.” WHICH WILL PLACE A CALL OR SAY 66BEGIN MESSAGE? TO OCTATE

Message body To" field Patent Application Publication Jun. 5, 2008 Sheet 22 of 29 US 2008/0133228A1

FIGURE 6 CONTINUED

FIGURE 6-C FIGURE 6-D

USERSPEAKS 6WHAT CANSAY.” TO USERSAYS ONE OF THE CANNED GET HELP WINDOW. THE ASR PHRASES, NAMELY AM DRIVING”. ENGINES CONFIGURED TO NOTICE THAT THE ALTERNATIVE SUMULTANEOUSLY LISTEN FOR “AMRUNNING LATE'S DISPLAYED COMMANDS, WORDS, AND THE NN-BEST WINDOW. PREDCT-CONTROL-COMMAND

S. S SNSS S s: Si:SASON:- St.s &SS S :

Patent Application Publication Jun. 5, 2008 Sheet 23 of 29 US 2008/O133228A1

FIGURE 6 CONTINUED

FIGURE 6-E FIGURE-F

USERSPEAKS SMN EF’ USING USERSAYS “PREDICT MEETING: N VOICETAP TO ENTER ME”. AGAIN, BEST ALTERNATIVES NAMELY ALTERNATIVES ARE DISPLAYED VEDIATE AND METERARE BELOW. DISPLAYED WHICH BOTH BEGIN WTHLETTERS

E. .. 3. Patent Application Publication Jun. 5, 2008 Sheet 24 of 29 US 2008/O133228A1

FIGURE 6 CONTINUED

FIGURE 6-G FIGURE 6-H

USERTYPESH AND SAYS “HAS BEEN THE HEP MENU FORDICTATION POSTPONED FOLLOWED BY MODES DISPLAYED WHEN THE “NSERT EXCLAMATON USERSAYS WHAT CAN SAY'. NOTICE THAT THE USER INTERFACES AS PERFIGURE 2A

S S sr SS t 8x8: ŠS M

Patent Application Publication Jun. 5, 2008 Sheet 25 of 29 US 2008/O133228A1

FIGURE 6 CONTINUED

FIGURE 6 FIGURE 6

THE INTERFACE CONTROLS THE THE USER COULD SELECT USER BEHAVOR BY PROVIDING DFFERENT MODES BY CLICKING FEEDBACKLIKE SPEAK SOFTER MENU...HERE A CHOICE BETWEEN VOICETAP AND VOCEPREDCTS SHOWN

S SSSSS Š SSSSSSS s: is as

Patent Application Publication Jun. 5, 2008 Sheet 26 of 29 US 2008/O133228A1

FIGURE 7

FIGURE 7-A FIGURE 7-B USERSPEAKS “ASHWIN RAO: THE THEN-BEST CHOICE LIST IS SYSTEMPROMPTS FOR THE INTIAL DISPLAYED WHICH INCLUDE ALL LETTERS: USERENTERS THE CONTACTS BEGINNING WITH “A”: LETTER ‘A’ USER MANUALLY SCROLLS TO SELECT N-BEST ALTERNATIVE

FIGURE 7-C FIGURE 7-D USERSELECTS CONTACT: USER IN MESSAGE COMPOSITION VAY PRESS TALKBUTTON TO WINDOW, SPEAKSWORD PLACEA CALL OR DOWN-ARROW “ARRIVED AND THEN TYPES TO GO TO MESSAGE COMPOSITION LETTER “A NOTICE THAT THE WINDOW ALTERNATIVES ARE DISPLAYED IN

N-BEST WINDOW Patent Application Publication Jun. 5, 2008 Sheet 27 of 29 US 2008/O133228A1

FIGURE 7 (ctd) FIGURE 7-E FIGURE 7-F FIGURE 7-G USER MAY CONTINUE TO USER MAY PRESS USER MAY CLEAR SPEAK-AND-TYPE, OR MENU-) SETTINGSTO ENTIRE TEXT, GET SIMPLY TYPE IN THIS SELECT OPTIONS LIKE HELPEXT OR SEND CASE TYPES “F. OBSERVE SELECTION BETWEEN s THAT IN THE ABSENCE OF PUSH-TO-SPEAK AND SS BY PRESSING SPEECH INPUTOR ALWAYS-LISTENING NU. HERE, USER BACKGROUND NOISE, THE METHODS, RESETTING OF CLEARED MESSAGE SYSTEM DOES PURETEXT USERWORD FREQUENCIES AND ENTERED NEW PREDCON EC. MESSAGE

Seitsia:

Sea; Gls

5 Exit

Patent Application Publication Jun. 5, 2008 Sheet 28 of 29 US 2008/O133228A1

Figure 8

FIGURE 8-A FIGURE 8-B USER INSERTS CONTACT AND IS IN USER PUSHES VOICEPREDIC BUTTON MESSAGE COMPOSITION WINDOW. TIE AND SPEAKS THE WORD ARRIVED', “VOICEPREDICT' SOFT-KEY SAPUSH- SYSTEMDISPLAYS LISTENING ON TOP TOTALKBUTTON AND TALK-RELEASE ON SOFT-KEY

FIGURE 8-C FIGURE 8-D USER RELEASES VOICEPREDECT SOFT USERENTERS THE LETTERA AND KEY AND SYSTEMDISPLAYS NOW TYPE SYSTEMPREDICTS USING SPOKEN A LETTER’ UTTARENCE AND THE LETTER: CHOICES DISPLAYED IN WINDOW BELOW

Patent Application Publication Jun. 5, 2008 Sheet 29 of 29 US 2008/O133228A1

906 DE:|—| r—.

FOG 6(61-)

|06 US 2008/033228A1 Jun. 5, 2008

MULTIMODAL SPEECH RECOGNITION 0007. Thirdly, these methods design a speech mode that is SYSTEM independent and different compared to existing designs of non-speech modes, thus introducing a significant behavioral CROSS-REFERENCE TO RELATED change for users: (a) to input a Name (Greg Simon) on a APPLICATIONS mobile device from a contact list, a standard keypad interface 0001. This patent application claims priority to U.S. Pro may include “Menu-Insert-Select Greg Simon’: a standard visional Patent Application No. 60/872,467, entitled “Large voice-dial software requires the user to say “Call Greg Vocabulary Multimodal Text-Input System with Speech as Simon' which has a different interface-design (b) to input a One of the Input Modes', by Ashwin Rao, filed Nov.30, 2006, phrase user enters the words of that phrase one by one by and to U.S. Provisional patent application Ser. No. s using either triple-tap or predictive-text-input; a speech dic entitled “Multimodal System for Large Text Input Using tation Software requires the user to push-and-hold a button Speech Recognition', by Ashwin Rao, filed Oct. 19, 2007, and speak the whole sentence, once again a different inter both of which are incorporated herein by reference. face-design which requires users to say the entire message continuously. FIELD OF THE INVENTION 0002 The invention relates to the field of ambiguous or SUMMARY OF THE INVENTION unambiguous input recognition. More specifically, the inven tion relates to the fields of data input and speech recognition. 0008 Generally stated, the invention is directed to a mul timodal interface for entering text into devices, when speech BACKGROUND INFORMATION is one of the input modes and mechanical devices (e.g., key pad, keyboard, touch-sensitive display, or the like) form the 0003. The problem of entering text into devices having other input modes. Small form factors (like cellular phones, personal digital 0009. This invention is a system, and more generally a assistants (PDAs), RIM Blackberry, the Apple iPod, and oth method, for multimodal text-input when speech is used ers) using multimodal interfaces (especially using speech) simultaneously with other input modes including the keypad. has existed for a while now. This problem is of specific importance in many practical mobile applications that Specifically, the proposed invention relies on (a) dividing the include text-messaging (short messaging service or SMS, base vocabulary into small number of vocabulary-sets multimedia messaging service or MMS, Email, instant mes depending on the specific application, (b) further breaking saging or IM), wireless Internet browsing, and wireless con down the vocabulary within some or all of these sets into tent search. Vocabulary-Sub-sets, based on the partial spelling of the indi 0004 Although many attempts have been made to address vidual words (or phrases or more generally text) within that the above problem using “Speech Recognition', there has Sub-set, (c) generating acoustic search networks correspond been limited practical success. The inventors have deter ing to the Sub-set Vocabularies by pre-compiling them prior to mined that this is because existing systems use speech as an system launch or dynamically during system run-time, (d) independent mode, not simultaneous with other modes like activating or building one of the Sub-set acoustic networks the keypad; for instance a user needs to push-to-speak (may based on user-input from other input modes, (e) selecting a be using a Camera button on a mobile device) to activate the system configuration to turn on the speech mode from three speech mode and then speak a command or dictate a phrase. different system configurations (always listening, partial lis Further on, these systems simply allow for limited vocabulary tening or push-to-speak) based on users preference, (f) for command-and-control speech input (for instance say a song each one of the three system configurations, designing the title or a name or a phrase to search for a ringtone etc.). system to handle three different user-interfaces (speak-and Finally, these systems require the users to learn a whole new type, type-and-speak, and speak-while-typing), (g) resorting interface-design without offering providing them the motiva to multi-pass speech recognition as and when required, (h) tion to do so. depending on the choice of the system configuration for 0005. In general, contemporary systems have several limi speech mode, using signal processing and hardware/software tations. Firstly, when the vocabulary list is large (for instance signals to mitigate noise introduced by other input modes, and 1000 isolated-words or 5000 small phrases as in names/com (i) grouping all system components to yield an overall system mands or 20000 long phrases as in Song titles) and so on, (a) whose user-interface builds on top of standard mobile user these systems’ recognition accuracies are not satisfactory interfaces thereby resulting in a minimal behavioral change even when they are implemented using a distributed client for end-users. server architecture, (b) the multimodal systems are not prac 0010 Thus, in accordance with this invention, users may tical to implement on current hardware platforms, (c) the input any text (example, from a large Vocabulary systems are not fast enough given their implementation, and consisting of tens of thousands of words and/or phrases and (d) these systems recognition accuracies degrade in the additionally consisting of symbols, emoticons, Voice com slightest of background noise. mands, manual controls) using a multi-modal interface with 0006 Secondly, the activation (and de-activation) of the increased speed, ease and accuracy. For instance, a mobile speech mode requires a manual control input (for instance a user may speak a word and Subsequently input the first letter push-to-speak camera-button on a mobile device). The pri of that word and have the proposed system complete that mary reason for doing so is to address problems arising due to word for the user. As another example, a user may input the background noises; using the popular push-to-speak method first letter of a song-title and Subsequently speak that song the systems microphone is ON only when the user is speak title and have the device display the word or download/play ing; hence non-speech signals are filtered out. that song. US 2008/033228A1 Jun. 5, 2008

0011. In another aspect, the interface design builds on top 0036 FIG. 5H compares the speech features of the pro ofexisting mobile interface designs and hence may be viewed posed system to prior-art speech recognition systems for as a “voice-analog to existing standard text-input modes. mobile devices; 0037 FIG. 6 illustrates an application of the invention for BRIEF DESCRIPTION OF THE DRAWINGS “text-messaging wherein the invention is viewed as a mul 0012. The foregoing aspects and many of the attendant timodal interface along with a hands-free mode; advantages of this invention will become more readily appre 0038 FIG. 7 illustrates an application of the invention for ciated as the same become better understood by reference to “text-messaging wherein the invention is viewed as a mul the following detailed description, when taken in conjunction timodal interface using the "always listening configuration; with the accompanying drawings, wherein: 0039 FIG. 8 illustrates an application of the invention for 0013 FIG. 1A describes one embodiment of an overall “text-messaging wherein the invention is viewed as a mul multimodal system; timodal interface using the “push-to-speak’ configuration; 0014 FIG. 1B describes one embodiment of an ASR and engine in a multimodal system; 0040 FIG. 9 is a block diagram representing a computing 0.015 FIG. 2A illustrates one embodiment of a method for device in the form of a general purpose computer system with dividing a large base Vocabulary into Small Sub-set Vocabu which embodiments of the present invention may be imple laries for building Sub-set acoustic networks: mented. 0016 FIG. 2B illustrates an example of the word-graph state-machine; DETAILED DESCRIPTION OF THE PREFERRED 0017 FIG. 2C shows another example of word-graph EMBODIMENT state-machine; 0041. The following disclosure describes a multimodal 0018 FIG. 2D shows example of variations of word-graph speech recognition system. By way of background, speech state-machine for Words shown in FIG. 2C recognition is the art of transforming audible sounds (e.g., 0019 FIG. 2E shows possible variations in the state-ma speech) to text. “Multimodal” refers to a system that provides chine of FIG. 2C: a user with multiple modes of interfacing with the system 0020 FIG. 2F: Method for adding and searching new beyond traditional keyboard or keypad input and displays or words to system; voice outputs. Specifically for this invention, multimodal 0021 FIG. 2G: Example of Word-Graph State-Machine implies that the system is “intelligent” in that it has an option for method in FIG. 2F ... only a portion of states from FIG. to combine inputs and outputs from the one or more non 2C is shown for simplicity; speech input modes and output modes, with the speech input 0022 FIG. 3A: one embodiment of algorithm for gener mode, during speech recognition and processing. ating Sub-set lexicons; 0023 FIG. 3B illustrates another embodiment of the algo 1. Overview of the Disclosure rithms for generating Sub-set lexicons and search networks; 0024 FIG. 3C: One embodiment of algorithm for gener 0042. An embodiment of an overall system for multimo ating Sub-set lexicons for ambiguous inputs as in a 9-digit dal input is described. Next, three embodiments are ; described: (1) a system wherein the microphone is active all 0025 FIG. 4A: one embodiment of system in FIG. 2A for the time, thereby enabling speech input to be concurrently predicting words; active along with other input-modes namely keypad, key 0026 FIG. 4B: one embodiment of system in FIG. 2A for board, Stylus, soft-keys etc.; (2) a system wherein the micro predicting symbols; phone is only partially active enabling speech input prior to a 0027 FIG. 4C: one embodiment of Algorithm for system manual text input; and (3) a system wherein the microphone in FIG. 2A for predicting words using Voice-controls; is manually controlled (example using a push-to-speak but 0028 FIG. 4D: one embodiment of system in FIG. 2A for ton) thereby enabling speech mode to be active only if and predicting phrases; when desired. Further, for each one of the three embodiments, 0029 FIG. 4E: another embodiment of Algorithm for sys three different user-interfaces are described (speak-and-type, tem in FIG. 4D for predicting phrases; type-and-speak, and speak-while-typing). Finally, an overall 0030 FIG. 5A: a system in FIG. 2A using push-to-speak user-interface design is described which enhances existing configuration; example of speak-and-type user-interface is mobile user-interfaces and hence exhibits minimal behavioral considered; change for end-users. 0031 FIG. 5B: a system in FIG. 2A using push-to-speak configuration; example of type-and-speak user-interface is 2. Specific Embodiments and Implementations considered; 0043 FIG. 1A is a conceptual overview of a system for 0032 FIG.5C: a system in FIG. 2A using partial-listening large-vocabulary multimodal text-input using a plurality of configuration using speak-and-type user-interface FIG.5D: a input methods, such as speech 101. keypad 102, handwriting system in FIG. 2A using partial-listening configuration using recognition 103, Qwerty keyboard 104, soft-keyboard 105, type-and-speak user-interface; optical character recognition 106, and/or other non-speech 0033 FIG.5E: a system in FIG. 2A using always-listening input modes 107, in which embodiments of the invention may configuration using speak-and-type user-interface; be implemented. The system 100 is but one example and 0034 FIG.5F: a system in FIG.2A using always-listening others will become apparent to those skilled in the art. configuration using type-and-speak user-interface; 0044) The system 100 includes a speech recognition sys 0035 FIG. 5G illustrates an embodiment of the overall tem 110 (also referred to as an Automatic Speech Recognizer system proposed which may be viewed as a voice-analog or ASR) for attempting to convert audible input (e.g., speech) interface to existing text-input interfaces; to recognized text. It will be appreciated by those skilled in US 2008/033228A1 Jun. 5, 2008

the art and others that a typical speech recognition system 110 201 may include any one or more of words and/or phrases operates by accessing audio data through a microphone/ (perhaps in excess of 20,000), symbols, numbers, an alphabet Soundcard combination, processing the audio data using a set, emoticons, control commands, and/or canned phrases front-end signal processing module to generate feature vec (generally, but not always, less than 10). In FIG. 2A, the base tors, and Subsequently doing pattern-matching in a search lexicon 201 (including pronunciations) is divided into the module using knowledge from acoustic-model(s) and lan small sub-set lexicons (201-214) including sub-set acoustic guage-model(s). The system itself may exist as Software or networks (those skilled in speech recognition will appreciate may be implemented on a computing device; and thus may that different forms of acoustic networks may be designed include memory (ROM, RAM, etc.), storage, processors with the basic principle being to break down words into their (fixed-point, floating-point, etc.), interface ports, and other Sub-word representations, as in phonetic pronunciations or hardware components. syllables etc., and if required combine that with certain rules, 0045. In various implementations, the system 100 may 'grammars', or statistics, “language models’) which are operate by (a) having the ASR system 110 configured in an compiled either prior to system launch or dynamically during always-listening configuration, a partially-listening configu system's run-time; those skilled in art will further recognize ration, or a push-to-speak configuration, (b) accessing letters/ that compiling grammars for grammar-based systems are characters from any of the input modes (101-107), (c) miti fairly straightforward whereas in speech-to-text systems the gating any noise generated by the non-speech modes (102 networks are dynamically compiled on the fly based on lan 107) using software-signals to the ASR system 110 and/or guage model caching and using fast-match and detailed signal processing to filter noise, (d) reducing the systems match techniques. For example a base system Vocabulary of base application vocabulary dynamically into Sub-set (in 20,000 words plus 30 symbols plus 26 alphabet-letters plus 9 cluding Sub of Sub-sets and Sub-Sub-Sub-sets etc) Vocabular digits plus some commands may be broken down first into ies to significantly increase recognition accuracies and to Sub-set Vocabularies where symbols, alphanumeric, com permit practical real-time implementation of the overall sys mands, and words form the Sub-sets. Each Sub-set may be tem (optional), (e) handling the Sub-set Vocabularies using further broken down into sub-sub-set. For instance the 20,000 user-interface designs in standard text-input systems (op words sub-set may be broken down into sub-sub-sets such tional), and (f) dynamically interacting one or more of the that words beginning with “A”, “B”, and so on form the modes (101-107) with the ASR module. sub-sub-sets. Specific details on this may be found in the 0046 Those skilled in the art will appreciate that the sys “Predictive Speech-to-Text patent filed earlier by the inven tem 100 may be applied to any application requiring text tors. The method 200 improves over other attempts at multi input, including (but not limited to) mobile text-messaging modal input because other attempts try to use the base (SMS, MMS, Email, Instant Messaging), mobile search, vocabulary “as is”, which causes those systems to fail to mobile music download, mobile calendar/task entry, mobile deliver acceptable performance. navigation, and similar applications on other machines like 0049 Firstly, the small sub-set acoustic networks (equiva the Personal Digital Assistants, PCs, Laptops, Automobile lent to sub-set lexicons) (201-214), may be either pre-com Telematics systems. Accessible Technology systems etc. piled during system design or may be compiled during system Additionally, several implementations of the system includ start-up or may be dynamically compiled at System run-time ing a client-only, a server-only, and client-server architecture (referencing to either using grammar based ASR systems or may be employed for realizing the system. statistical language model based speech-to-text systems). 0047 FIG. 1B illustrates in greater detail the ASR engine Obviously, pre-compiling results in lowest processing introduced above in FIG. 1A. In operation, the user speaks requirements at run-time with highest memory requirements through a microphone. The speech samples are digitized and while building at run-time results in highest processing processed using signal processing (as in Linear Prediction or requirements at run-time with lowest memory requirements. Cepstrum or Mel Cepstrum etc) and front-end features are Whatever the implementation, having the sub-set networks generated. These are fed to the Search module wherein a yield in real-time practical systems, and additionally they dynamic programming search is carried out using knowledge significantly increase speech recognition accuracy. For from acoustic models and language models; the later may be instance, instead of recognizing against 20,000 words and 30 implemented as finite state grammars or statistical language symbols and 26 alphabet-letters and 9 digits and some com models. In the proposed invention (a multimodal system), the mands all at the same time, the system 100 may be recogniz difference is that the language model has an additional input ing around 300 words (because of knowledge that only words from other modes. Thus the language models are dynamically beginning with the two letters indicated by other modes are adapted based on inputs from other inputs. For example if a valid), 30 symbols, 26 VoiceTap-alphabet, 9 VoiceTap-digits user speaks Meeting and also has typed or types “M”, then the and some commands. Alternatively, if the symbols, alphabet, knowledge that the word begins with “M” is passed on to the digits, commands, phrases etc separated into Sub-set lexicons language model. In practice, one may envision implementing (similar to the way they may be broken down into multiple this by modifying the Search module directly instead of functions in mobile devices) and Subsequently into separate changing the language models. Further, for certain applica speech recognition states, the system 100 may be recognizing tions, one could use the inputs from other modes to adapt the only 300 words or only 50 phrases or only symbols or only acoustic models (including template based models as used in letters and digits or only commands. The method 200 may be speaker dependent ASR systems) and/or the signal process modified using minor modifications depending on the overlap ing algorithms. between the states; for instance if letters and symbols are 0048 FIG. 2A illustrates a method 200 for dividing a base desired to be combined into a single state, then an “Insert” lexicon 201 into smaller sub-set (including sub of sub-sets key-command may be added to symbols to distinguish them and sub-sub-sub-sets etc) lexicons (210-214). Extensions to from letters. The sub-set lexicons (210-214) may themselves ambiguous inputs may be easily addressed. The base lexicon include Phrases or other Text. US 2008/033228A1 Jun. 5, 2008

0050. In addition, observe that the method 200 also well-known art in ASR systems. Alternatively, another receives input from other input modes 230 (which may be a scheme is described in FIG. 2F for adding new words not in speech mode) to enable it to reduce the base lexicon 201 into the base lexicon 201 and searching for the same. To add a new sub-set lexicons (210-214). It is further emphasized that the word, first the system 100 creates the initial sub-set lexicons inputs from the other input modes 230 may include ambigu (210-214) as illustrated in FIG. 2A (step 292). The initial ous inputs (as is common in 9-digit keypad inputs), mis-typed sub-set lexicons (210-214) will be referred to as S1. Next, the inputs, mis-recognized voice inputs (when using ASR system 100 receives a new word from a user, such as through designed for spoken alphabet or military alphabet coding or manual input or the like, and the system 100 checks S1 for the VoiceTap coding recognition). new word (step 294). Not finding the new word, the system 0051 FIG.2B illustrates a finite state-machine search net 100 creates a new sub-set lexicon based on the initial letter(s) work (at the word-level). Those skilled in art will realize that, of the new word. This new sub-set lexicon is referred to as S2. in practice, the final network would include phonemes as 0056 To search for new words, the system searches the determined by the pronunciations of the words and that the sub-set lexicon S1 to get a best-match word, referred to as W1 figure is a general representation for both fixed-grammar (step 297). The system 100 also searches against a network of based ASR systems and statistical language model based phonemes to generate the best-matching phones (step 298). dictation or speech-to-text systems. As shown in the word Then, the system 100 performs dynamic programming graph state machine 245, the system 100 moves from a one between phones and pronunciations for W1 plus pronuncia state to some other state using a manual control. The system tions for all words in S2 (step 299). The list of resulting words 100 detects and keeps track of the states. The states may be may be output ranked by their respective scores. manually selected by users using buttons on keypad and/or 0057. Other standard techniques for adding words and the states may be combined and selected automatically using their pronunciations to existing Sub-set lexicons may be also Voice command-controls. In the later case, knowledge of state included. For example, a modification of the state-machine may be easily detected based on best-matching recognition from FIG. 2C that includes an additional Phoneme-network output and may be used to post-process the ASR engine's 110 path is shown in FIG. 2G. recognition output. For instance if a best-matching hypoth 0.058 An example algorithm for generating the sub-set esis belonged to the state “symbol 246 then only those alter lexicons is provided in FIGS. 3A and 3B. FIG. 3A illustrates natives belonging to this state (symbol 246) are displayed in the case where a lexicon is being created based on a single an N-Best choice window. Thus if the user said “Insert Semi letter being known, while FIG.3B illustrates the case where a colon” the system 100 will only display alternative symbols lexicon is being created based on a given combination of two as choices. (or more) characters. As shown in FIG.3A, the process begins 0052 FIGS. 2C and 2D illustrate additional word-graph by initializing a variable “i' to some initial value, such as the state machines (270, 280). FIG. 2C illustrates a word-graph letter “A” (block 301). Words from the base system lexicon state machine 270 based on using voice commands, and FIG. that begin with the letter represented by 'i' are added to a text 2D illustrates a variation 280 of the word-graph state machine file having a name of i.TXT, where the letter 'i' represents the 270. For instance in FIG. 2C a state machine for implement currently selected letter value (e.g., “A” currently) (block ing a handsfree system is shown wherein an “Insert key 303). At this point, the variable “i' is incremented to the next word is used for symbols and emoticons, a “Predict” key possible value, e.g., the next letter in the alphabet sequence, word is used for words while the remaining states have no and the process repeats (block 305). The process runs again keywords. In FIG. 2D another state machine for handsfree and continues to repeat until a different pre-compiled search implementation is shown wherein some of the words have a network has been created for every possible value of 'i' user-interface option of speaking-and-spelling the words, 0059. As shown in FIG.3B, a similar process is performed some words have the option to insertakeyword whereas other except that more search networks are created based on two words are simply recognized using an out-of-vocabulary letter strings. Accordingly, the process illustrated in FIG. 3B (OOV) rejection scheme (well-known in ASR systems for differs from the process in FIG. 3A in that two variables 'i' word-spotting) and multipass recognition is resorted to. and are used (block 311), and those variables are individu 0053 FIG. 2E illustrates an additional word-graph state ally incremented (block 313) until the possible combinations machine 290 for phrases. Here, a user could say “Predict' of search networks have been generated. followed by a Phrase or simply a phrase or actually spell the 0060. Observe that for the first set (FIG. 3A), the most entire phrase. common words are used based on what is commonly referred 0054 Those skilled in art will appreciate that other varia to as n-gram statistics in language modeling. This is because tions of word-graphs may also be obtained, including (but not if the system vocabulary is 26,000 words there is all possibil limited to) (a) using partial spellings, (b) using speak-and ity that each list will on an average be 1000 words. Many partially-spell mode, (c) using a hybrid Scheme wherein users devices do not have the processing power to search an active spellas in “Predict WORD" for short words and say “Predict list of 1000 words. Even if they do, the confusion between Word” when the word is long, and (d) using finer categoriza words reduces the recognition accuracy. Experiments con tion like “Predict COLORYellow’ or “Predict CITY Boston ducted by the inventors have revealed that 200 words is a good or “Predict WEB ADDRESS Google-dot-com” or “Predict limit for many practical applications. PHRASE Where are you” or “Predict BUSINESS Docu 0061 For the second set (FIG.3B), combination of the 1st ment” or “Predict ROMANCE Missing” etc. and 2nd characters are used to create the words. In this case, 0055 Finally, a word-graph may be adapted based on the most of the combinations have words less than 200. If they are users behavior, frequently used words/phrases, corrected more than 200, once again n-gram counts may be used to words etc. One simple way to add new words is to simply select the top 200 words and the rest may be split into 3-char add/remove the words and their pronunciations from the sub acter lists or other techniques may be incorporated. Obvi set vocabularies or sub-set acoustic networks. This is fairly ously, the actual process may be run off-line (on a PC or other US 2008/033228A1 Jun. 5, 2008

machine) and the final search-network files may be packaged employed to increase recognition speeds. Example output into a single data-structure. Alternatively, if the networks are may be “minute”, “meeting, or “me”. being dynamically built then only words and pronunciations 0067. The N-best results may be post-processed (step 407) may be stored and cached at run-time to build the actual to rearrange the N-best words such that their initial letters are networks. “mee'. Additional inputs are received from the user(step 408) 0062 Several extensions to the above are possible. For until the correct word is recognized, and the result is then instance, based on the frequency of usage of words by the user displayed (step 409). (also referred to as dynamic statistical language modeling) 0068 FIG. 4B illustrates an example method for predict the words from 3-letter lists may be moved into 2-letter lists ing the comma symbol “”. The system waits for a key-press and subsequently to 1-letter lists and finally to a list that does event to move to the symbol state (step 410). When the user not require any letter input, meaning pure speech-to-text; and presses a button mapped to the symbols State, the system vice-versa. For instance if the word “consequence' was moves to that state (step 411). For example, the user may included in the condat search network because its unigram press the center key twice to indicate that a symbol is desired. count for English language is low and a user begins using the The system then waits (step 412) until the user speaks the word “consequence', then "consequence may be moved to word “comma” (step 413), at which point the system activates co.dat and further up to c.dat and so on and finally be the word the ASR engine with the symbol sub-set lexicon (step 414). in the main recognition pass. Similarly if “although' is word The ASR engine then recognizes the waveform in real-time ina.datand not being used then it could be moved to al.datand (step 415). Additional inputs from the user may be received to alt.dat and so on. For speech-to-text systems, the statistical help select the correct choice if the initial recognition selec language models may be adapted in a similar fashion and tion is incorrect (step 416). The system finally displays the cached at run-time based on additional language model correct result (step 417). Those skilled in art will recognize knowledge of the spelling of the words. that this example is standard technique used in ASR systems 0063. Several other extensions are possible for selecting for command recognition. The difference here is that the words. For instance a combination of longer and more fre technique is embedded in the multimodal framework and the quent words may be grouped together and using the so-called symbol state gets activated by a manual control; after recog confusion matrix (well-known in speech recognition) the top nition of the symbol the system goes back to the default state 200 words may be selected. which may be words. 0064. Additional information on this feature may be 0069 FIG. 4C illustrates another example method for pre obtained from currently-pending U.S. patent application Ser. dicting the word “meeting with a system that uses voice No. 1 1/941,910, which is hereby incorporated by reference in controls. The method begins when the user speaks the phrase its entirety for all purposes. Those skilled in the art will “predict meeting” (step 418). In response, the system appreciate that a variety of other techniques may alternatively launches the ASR engine with the grammar “predict be used for this step. '' (step 419), where using the Sub-set lexicon (step 424). keypad. Thus, beginning with the letter A (step 315), each of The system then returns the result “meeting”. the ambiguous letters that are associated with the letter A are 0070 FIG. 4D illustrates an example method for predict identified (e.g., A, B, and C) (step 317). Next, the words from ing phrases. To begin, the user speaks the phrase “where is the the base lexicon that begin with all of the identified ambigu meeting” (step 426). The system waits for input from the user ous letters are grouped and saved in a Sub-set lexicon for the (steps 427, 428) until the user inputs a letter (step 429). In key letter “A” (step 319). The next letter in the series (e.g., the response, the system activates the ASR engine with the sub letter D since the letters A-C are grouped) is selected and the set lexicon based on the letter received from the user (step method repeats (step 321). The method continues until 430). The system receives more input from the user (step 431) completion. and continues to refine the search operation. In this example, 0066 FIGS. 4A-E display different embodiments of the the system recognizes phrases based on the initial letters of proposed system for different tasks including symbols, each word in the phrase (W. I. T. M). The ASR engine con words, and phrases. First, in FIG. 4A, a method is shown for tinues to recognize (step 432) until the appropriate phrase is predicting the word “meeting. First, the user speaks the word identified, and the results are returned (step 433). “meeting” (step 401). The system waits for a key-press event 0071 FIG. 4E illustrates another example method for pre (or other input mode event) to activate a state (step 402). Until dicting phrases. To begin, the user speaks the phrase “predict the user presses any letter key, the system waits (step 404). where is the meeting” (step 434). The system initiates the When the user presses a letter key, the system activates the ASR engine with the grammar “predict '' (step 435), ASR engine with the new lexicon (step 405). In this example, where indicates a place marker for a waveform to the first pressed letter would be the letter M, thus the new recognize in response to the control command "predict'. The lexicon is the sub-set lexicon associated with the letter M. system launches the ASR engine (step 438) with a statistical Using that lexicon, the system recognizes using stored fea language model or phoneme. The ASR engine outputs a tures which may be the waveform or traditional ASR front matrix of N-best results. The user provides additional input end features, such as in Cepstrum. Since the recognition in using other input modes, such as the letters W. I. T. and M this example is not real-time, utterance detection may be (step 439). The system post-processes the N-best list to rear US 2008/033228A1 Jun. 5, 2008

range the words such that their initial letters are W.I.T. and M (step 528). The system ultimately detects an “end of speech” (step 440). With that input, the system recognizes and returns condition, meaning that the user has paused or stopped mak the result “where is the meeting” (step 441). ing an utterance, or the user presses a key (e.g., the letter E) 0072. In FIGS. 5A-E, different methods that implement (step 529). At that point, the system activates the ASR engine the system in different embodiments are illustrated in detail. with the appropriate lexicon, which may be a sub-set lexicon This includes Push-to-Speak configuration (FIGS. 5A and if a letter was pressed or it may be the base lexicon if an 5B), Partial Listening configuration (FIGS. 5C and 5D), and end-of-speech was detected (step 530). The ASR engine Always Listening configuration (FIGS. 5E and 5F). The begins recognizing the stored features (step 532), or it may push-to-speak configuration implies either holding down a ignore the utterance if it is too short to recognize (step 531). key while speaking; or pressing a key, Speaking, and pressing As the user provides additional input letters (step 534), the the key again; or pressing a key to activate and end-of-speech system re-recognizes by activating the ASR engine with the is automatically detected. They may be selected based on user preference, hardware limitations, and background noise con new sub-set lexicon based on the additional input letters, or by ditions. post-processing the N-best list using the additional input let 0073. In FIG. 5A, the user presses a “push-to-speak' but ters (step 535). This method continues until the proper result ton and speaks “meeting” (step 501). The system waits for a is recognized and displayed (step 536). The user may be key-press event to activate a state (step 502/503). When the prompted to indicate that the displayed results are correct user presses a letter key (step 504), the system activates the (step 537). If the results were incorrect, the method repeats ASR engine with the sub-set lexicon for words beginning (step 535) until the correct results are identified. with the pressed letter (step 505). The ASR engine recognizes 0077. Other optimizations may be easily envisioned. For using stored speech features which may be the waveform or instance waveform collected between keystrokes may all be traditional ASR front-end features, such as in Cepstrum (step compared using confidence scores or other techniques and 506). As the user provides additional input letters (step 507), this way the system may allow a user to speak before or after the system re-recognizes by activating the ASR engine with or while typing a word. the new sub-set lexicon based on the additional input letters, 0078. In FIG. 5E, the system microphone is initially on or by post-processing the N-best list using the additional (step 538) when the user speaks “meeting” (step 539). The input letters (step 508). This method continues until the system buffers the spoken text (step 540) and waits for a proper result is recognized and displayed (step 509). key-press event to activate a state (step 541). If a control 0074 The method illustrated in FIG. 5B performs as command is received, the system stops collecting samples described in FIG. 5A, except that the user presses input keys and stores the waveform or features for processing (step 542). to input letters (step 510) prior to pressing a push-to-speak When the user ultimately presses a letter key (step 543), the button and speaking the word “meeting” (step 511). In this system resets the search network and activates the ASR implementation, the system can begin recognizing the wave engine with the appropriate sub-set lexicon (step 544). The form in real-time because the sub-set lexicon will have been system recognizes using the stored features which may be the already selected based on the prior letter input. waveform or traditional ASR front-end features, as in Cep 0075. In FIG. 5C, the system microphone is initially on strum (step 545). As the user provides additional input letters (step 512) when the user speaks “meeting” (step 513). The (step 546), the system re-recognizes by activating the ASR system buffers the spoken text (step 514) and waits for a engine with the new sub-set lexicon based on the additional key-press event to activate a state (step 515). If a control input letters, or by post-processing the N-best list using the command (for instance any letter key) is received, the system additional input letters (step 547). This method continues stops collecting samples and stores the waveform or features until the proper result is recognized and displayed (step 548). for processing (step 516). Additionally, when the user presses The user may be prompted to indicate that the displayed a letter key (step 517), the system turns the microphone offor results are correct (step 549). If the results were incorrect, the resets the search or stops recognizing etc. and activates the method repeats (step 547) until the correct results are identi ASR engine with the appropriate sub-set lexicon (step 518). fied. The system recognizes using the stored features which may (0079. In FIG. 5F, the microphone is always on and the be the waveform or traditional ASR front-end features, as in recognizer is continuously processing (step 550), when the Cepstrum (step 519). As the user provides additional input user types a letter (step 551). In response, the system resets letters (step 520), the system re-recognizes by activating the the current search network and the stored audio buffers (step ASR engine with the new sub-set lexicon based on the addi 552). Until the end-of-speech is detected (step 553), the sys tional input letters, or by post-processing the N-best list using tem waits (step 554). When the end-of-speech is detected, the the additional input letters (step 521). This method continues system activates the Sub-set lexicon based on the previously until the proper result is recognized and displayed (step 522). typed letters (step 555). In addition, the utterance that was The user may be prompted to indicate that the displayed detected is stored for processing (step 556). If the utterance results are correct (step 523). If the results were incorrect, the that was detected is too short for recognition, the system method repeats (step 521) until the correct results are identi ignores the utterance (step 557). The system begins recogniz fied. ing the utterance using the stored waveform and features (step 0076. In FIG.5D, the user types the letter “M” (step 524). 558). If a special key is pressed (e.g., the Delete key), the At this point, the system microphone is on and/or audio system re-recognizes by activating the ASR engine with the buffers are being collected (step 525). The buffers collect new Sub-set lexicon based on additional input letters (step audio samples or features (step 526). The system detects a 560), or by post-processing the N-best list using the addi begin and end of speech, and/or waits for the next letter to be tional input letters (step 561). This method continues until the input (step 527). If a control command is received, the system proper result is recognized and displayed (step 562). The user stops collecting samples and stores the waveform or features may be prompted to indicate that the displayed results are US 2008/033228A1 Jun. 5, 2008

correct (step 563). If the results were incorrect, the method Further notice that the N-Best window in FIG. 6F only dis repeats (step 561) until the correct results are identified. plays words that begin with ME (the a priori information 0080 FIG.5G is a table comparing the multimodal input provided when user entered M, E). The MENU includes equivalents for manual input mechanisms for achieving cer settings like SEND MESSAGE, CHOOSE VOICETAP OR tain illustrative desired outputs. Note that the speech compo VOICEPREDICT MODE, and TURN NBEST ON/OFF etc. nent of the multimodal system builds on-top of existing stan I0085. Referring to FIGS. 6G and 6H, the extension of dard non-speech text-input designs, and hence may be viewed recognition for phrases is displayed. In a multimodal fashion, as a voice-analog to existing input methods specifically the the user enters Hand simply says the sentence. The user could mobile user-interface. For instance, instead of having to type have entered H B P and said the sentence for increased accu 4 or 5 letters of a Contact-Name, the system invokes the user racy. Observe that the HELP window contents change to say "Contact-Name” and begin typing as usual; because of depending on which mode the user is in. For instance, the the additional speech input the system completes the name in mode shown in FIG. 6H has all the sub-set lexicons present: 1 or 2 letters as opposed to 4 or 5 letters. Overall, the user the first two lines denote VoiceTap for alphabet and numbers interface stays the same with an additional option to speak respectively, the 3rd line is for VoicePredict, the 4th for before or while or after typing. Other examples in the Figure Canned-Phrases, 5th for symbols, 6th for emoticons like Smi refer to inputting Phrases, symbols, and words. For each one ley and 7-10 belong to the command state. Notice that this of these, the proposed system retains the existing interface mode is a hands-free mode since any text may be entered and builds on top of it to provide additional options of speed without touching the keypad. ing up input using speech. By having Such a design it is I0086) Referring to FIGS. 6I and 6J, it is shown that envisioned that the overall behavioral change for the mobile because standard phones may not have the ability to adjust user is minimum. microphone-gain-levels automatically, a user-interface 0081 FIG. 5H illustrates a table comparing the multimo prompts the user to “Speak softer . . . . Additionally, a user dal input equivalents for ordinary speech recognition input could select a sub-set of the states of FIG. 2A by using a mechanisms for achieving certain illustrative desired outputs. MENU button. In FIG. 6J for instance, the states are divided Note that other systems attempt to do “open dictation' and into two groups. VoicePredict mode is purely multimodal fail to yield desired accuracies and/or acceptable perfor because the user has to use the keypad to enterinitial letters of aCC. the word/phrase. VoiceTap on the other hand is a hands-free 0082 FIGS. 6A-6J illustrate an illustrative user interface mode where user could use VoiceTap to enter initial letters in that may be presented in certain embodiments. Referring to addition to multimode. FIGS. 6A and 6B, the illustrated application may be a pre 0087 FIGS. 7A-7G illustrate another embodiment of a loaded software on a device and/or may be downloaded over user interface that may be implemented using the "always the-air (OTA) by end-users. It may be launched by a simple listening configuration. Notice that the user-interface is push of a button. As shown in Figure on top right, user could enhanced by building on top of the standard user-interface. say “Insert Greg Aronov or alternatively use a multimodal Thus, users have the option to fall back to an interface with method: type in “G” and then say “Insert Greg Aronov'. which they are familiar in situations where speaking is unde Notice that the TO: field may be changed to have a general sirable. In FIG. 7-D, a speak-and-type user interface is cho purpose address field to address multiple applications like sen. However, the same system may be extended for type Text-Messaging, Mobile-Search, and Navigation etc. For and-speak interface or type-speak interface. instance, one could envision saying “Insert GOOGLE DOT I0088 Finally, FIGS. 8A-8D illustrate another embodi COM followed by “Begin Search” to simply modify the ment of a user interface that may be implemented. FIGS. user-interface for Mobile-Internet-Browsing. The Message 8A-8D illustrates an embodiment of an overall user-interface body may then be viewed as a “WINDOW' to enter search as an enhanced multimodal mobile user interface using a related queries. “Push-to-Speak’ configuration for SMS/Email application. 0.083 Referring to FIGS. 6C and 6D, The ON-DEVICE I0089. Those skilled in art will recognize that other modi help shown may also include help for other modes like a help fications may be made to the system to yield an enhanced for the popular and/or the triple-tap method multimodal user-interface. For instance, one could add com and/or the Graffiti method etc. Different HELP modes may be mands like “Switch to T9” and “Switch to iTap” and “Switch simply activated by speaking for example “HOW DOIUSE to TrippleTap' and “Switch to ' and MULTI-TAP' etc. The Microphone ON/OFF is an optional “Help Me with iTap' or “Help Me With Graffiti” etc. button to control the MIC when desired. For instance, when 0090 This system is an independent stand-alone large the background in noisy, the MIC may be simply turned OFF. vocabulary multimodal text-input system for mobile devices. Alternatively, to eliminate the automatic Voice-Activity-De Although the application displayed is for text-messaging, tector (VAD) and to increase ASR accuracy in noise, a user those skilled in art will recognize that with minor modifica may control the VAD manually: turn Mic-ON, speak, and turn tions, the same system could be used for MMS, IM, Email, Mic-Off and so on. The canned phrases displayed in FIG. 6D Mobile-Search, Navigation or all of them put-together in one may be used for emergency mode. application. However, portions of the system may be used in I0084. Referring to FIGS. 6E and 6F, the simultaneous conjunction with other systems for instance for error-correc multimodality is shown. The user could use VoiceTap to enter tion of Dictation systems, for handicapped access systems the letters M and E but alternatively the user could have etc. Further, the system may be incorporated with many other simply typed the letters using TrippleTap or T9. Note that the modes to yield the next-generation multimodal interface N-Best window displays only those alternatives that belong to including: (a) keypad (b) -keyboard (c) soft-keyboard the same sub-set lexicon as that chosen in the Main window. (d) handwriting recognition (e) graffiti recognition (f) optical This is the outcome of the implementation of the system in the character recognition (g) 9-digit telephone keypad (h) Word form of a state-machine, a state for each sub-set lexicon. completion algorithms (i) Predictive Text methods () Multi US 2008/033228A1 Jun. 5, 2008

Tap methods and (k) VoiceMail attachments (which could be embodiments. The mobile device 901 may be any handheld included within the body of the text itself as opposed to an computing device and not just a cellular phone. For instance, independent mode). the mobile device 901 could also be a mobile messaging 0091 Those skilled in art will recognize that other modi device, a personal digital assistant, a portable music player, a fications may be made to the system to yield an enhanced global positioning satellite (GPS) device, or the like. multimodal user-interface. For instance, one could add com Although described here in the context of a handheld mobile mands like “Switch to Predictive Text and “Switch to phone, it should be appreciated that implementations of the TrippleTap' and “Switch to Handwriting Recognition' and invention could have equal applicability in other areas. Such “Help Me with Prediction' etc. as conventional wired telephone systems and the like. 0092. Further, extensions to multiple languages (other than English) including but not limited to Chinese, Japanese, (0098. In this example, the mobile device 901 includes a Korean, India (more generally Asian languages), French, processor unit 904, a memory 906, a storage medium 913, an German (more generally European languages), Spanish, Por audio unit 931, an input mechanism 932, and a display 930. tuguese, Brazilian (other Latin American languages), Rus The processor unit 904 advantageously includes a micropro sian, Mexican, and in general any language. Some modifica cessor or a special-purpose processor Such as a digital signal tions may also been made to adjust for pronunciation processor (DSP), but may in the alternative be any conven variations, accents, and accounting for the large base alpha tional form of processor, controller, microcontroller, State bet, so as to customize the system for that language. machine, or the like. 0093. For instance, for languages like Chinese (referred to (0099. The processor unit 904 is coupled to the memory as languages based on ideographic characters as opposed to 906, which is advantageously implemented as RAM memory alphabetic characters) the phone keys are mapped to basic holding software instructions that are executed by the proces handwritten strokes (or their stroke categories) and the user sor unit 904. In this embodiment, the software instructions could input the strokes for the desired character in a tradi stored in the memory 906 include a display manager 911, a tional order. Alternatively, based on a mapping of the keys to runtime environment or operating system 910, and one or a phonetic alphabet, the user may enter the phonetic spelling more other applications 912. The memory 906 may be on of the desired character. In both these cases, the user selects board RAM, or the processor unit 904 and the memory 906 the desired character among the many that match the input could collectively reside in an ASIC. In an alternate embodi sequence from an array of alternatives that is typically dis ment, the memory 906 could be composed of firmware or played. The system proposed in this invention may be easily flash memory. extended for these languages: (a) in a way that text prediction 0100. The storage medium 913 may be implemented as methods extend to multiple languages, (b) since these lan any nonvolatile memory, such as ROM memory, flash guages have many times different words depending on the memory, or a magnetic disk drive, just to name a few. The tone or prosody of the word, additional states may be added to storage medium 913 could also be implemented as a combi do Such selection, and (c) additional states to insert diacritics nation of those or other technologies, such as a magnetic disk (example vowel accents) for placing them on the proper char drive with cache (RAM) memory, or the like. In this particular acters of the word being spoken or corrected may be included. embodiment, the storage medium 913 is used to store data 0094. Those skilled in art will also appreciate that the during periods when the mobile device 901 is powered off or proposed invention leads to many interesting variants includ without power. The storage medium 913 could be used to ing the ability to send only a code determining the text and store contact information, images, call announcements such decoding the text at the receiver based on that code and a as ringtones, and the like. pre-set rule. For instance, assume the User typed the Phrase 0101 The mobile device 901 also includes a communica “Received Your Message', and then the system may send the tions module 921 that enables bi-directional communication information about where that phrase resides in the system between the mobile device 901 and one or more other com lexicon and use the same to retrieve it at the receiver (assum puting devices. The communications module 921 may ing that the receiver has the same lexicon). include components to enable RF or other wireless commu 0095. Further, there has been a surge in activities like nications, such as a cellular telephone network, Bluetooth standardizing the short forms or abbreviations being used in connection, wireless local area network, or perhaps a wireless mobile communication. This includes phrases like Be-Right wide area network. Alternatively, the communications mod Back (BRB), Laughing-Out-Loud (LOL) etc. The problem ule 92.1 may include components to enable land line or hard here is that the list is large and its hard for end-users to wired network communications, such as an Ethernet connec remember all of them. The system's Phrase state may be tion, RJ-11 connection, universal serial bus connection, IEEE easily extended to handle a multitude of such phrases. A user 1394 (Firewire) connection, or the like. These are intended as would then simply say the Phrase and enter initial letter; for non-exhaustive lists and many other alternatives are possible. instance Speak “Laughing out Loud' and type “L” or “L O' 0102 The audio unit 931 is a component of the mobile or “L O L and so on. device 901 that is configured to convert signals between ana 0096. Certain of the components described above may be log and digital format. The audio unit 931 is used by the implemented using general computing devices or mobile mobile device 901 to output sound using a speaker 932 and to computing devices. To avoid confusion, the following discus receive input signals from a microphone 933. The speaker sion provides an overview of one implementation of Such a 932 could also be used to announce incoming calls. general computing device that may be used to embody one or 0103) A display 930 is used to output data or information more components of the system described above. in a graphical form. The display could be any form of display 0097 FIG. 9 is a functional block diagram of a sample technology, such as LCD, LED, OLED, or the like. The input mobile device 901 that may be configured for use in certain mechanism 932 may be any keypad-style input mechanism. implementations of the disclosed embodiments or other Alternatively, the input mechanism 932 could be incorpo US 2008/033228A1 Jun. 5, 2008

rated with the display 930, such as the case with a touch c) Speak while providing letter or character or any other sensitive display device. Other alternatives too numerous to visual inputs. mention are also possible. 4. The system of claim3, wherein each of the user-interface accessing speech from the user is designed using three differ The claimed invention is: ent configurations including: 1. A multimodal system for receiving inputs via more than a) Push-to-Speak Configuration; wherein the system waits one mode from a user and for interpretation and display of the for a signal to begin processing speech as in manual text intended by the inputs received through those multiple push-to-speak button; modes, the system comprising: b) Partially Listening Configuration; wherein the system a) a user input device having a plurality of modes with one begins processing speech based on a user-implied signal mode for accessing speech and each of the remaining as in space-bar to begin processing and typing a letter to modes being associated with an input method for enter end processing; and ing letters or characters or words or phrases or sentences c) Always Listening Configuration; wherein the system is or graphics or image or any combination; simultaneously processing speech along with other b) a memory containing a plurality of acoustic networks, inputs from other modes. each of the plurality of acoustic networks associated 5. The system of claim 1, wherein the speech input or its equivalent signal representation as in features is stored in with one or more inputs from the one or more modes; memory for Subsequent speech recognition or c) a mechanism to feedback the system output to a user of enabling users to speak only once and be able to do Subse the input device, in the form of a visual display or an quent error correction. audible speaker, and 6. The system of claim 1, wherein the user input device is d) a processor to: a mobile device, a mobile phone, a Smartphone, an Ultra i) process the inputs received via the multiple modes and Mobile PC, a Laptop, and/or a Palmtop. select and/or build an appropriate acoustic network; 7. The system of claim 1 wherein the system automatically ii) perform automatic speech recognition using the or manually defaults to a pure text prediction system in the selected/built acoustic network; and event that the speech input from the user is either absent or is iii) return the recognition outputs to the user for Subse unreliable and the system has the ability to automatically quent user-actions. switch back-and-forth between text-prediction and speech 2. The system of claim 1, wherein one or more acoustic recognition based on the design of the system set by the networks are compiled during system design and/or compiled system designer or end-user(s). at System start-up and/or compiled by the system during 8. The system of claim 1 wherein the system includes an run-time or a combination. option to be configured as a hands-free system by using the 3. The system of claim 1, wherein the mode for accessing speech mode for entering letters or characters or words or speech from the user has different types of use-interfaces, phrases or sentences or graphics or image or a combination. such that the user could selectively: 9. The system of claim 1, wherein the system is indepen a) Speak before providing letter or character or any other dent of the underlying hard-ware platform and/or the under visual inputs; lying software/operating system. b) Speak after providing letter or character or any other visual inputs; and/or c c c c c