US 20080235029 A1
(19) United States
(12) Patent Application Publication: Cross et al.
(10) Pub. No.: US 2008/0235029 A1
(43) Pub. Date: Sep. 25, 2008

(54) SPEECH-ENABLED PREDICTIVE TEXT SELECTION FOR A MULTIMODAL APPLICATION

(76) Inventors: Charles W. Cross, Wellington, FL (US); Igor R. Jablokov, Charlotte, NC (US)

Correspondence Address: INTERNATIONAL CORP (BLF), c/o BIGGERS & OHANIAN, LLP, P.O. BOX 1469, AUSTIN, TX 78767-1469 (US)

(21) Appl. No.: 11/690,471

(22) Filed: Mar. 23, 2007

Publication Classification

(51) Int. Cl.: G10L 11/00 (2006.01)

(52) U.S. Cl.: 704/275; 704/E15.001

(57) ABSTRACT

Methods, apparatus, and products are disclosed for speech-enabled predictive text selection for a multimodal application, the multimodal application operating on a multimodal device supporting multiple modes of interaction including a voice mode and one or more non-voice modes, the multimodal application operatively coupled to an automatic speech recognition ('ASR') engine through a VoiceXML interpreter, including: identifying, by the VoiceXML interpreter, a text prediction event, the text prediction event characterized by one or more predictive texts for a text input field of the multimodal application; creating, by the VoiceXML interpreter, a grammar in dependence upon the predictive texts; receiving, by the VoiceXML interpreter, a voice utterance from a user; and determining, by the VoiceXML interpreter using the ASR engine, recognition results in dependence upon the voice utterance and the grammar, the recognition results representing a user selection of a particular predictive text.

[Drawing sheets 1 through 6 of the published application; only the figure labels and legible callouts are reproduced here.]

FIG. 1: Network diagram of an exemplary system: multimodal devices (152) running a multimodal application (195) with an X+V page (124) and text field (101) in a multimodal browser (196); VoiceXML interpreter (192); speech engine (153); data communications network (100); web server (147) serving X+V pages; voice server running a voice server application (188); voice prompts and responses; speech for recognition.

FIG. 2: Block diagram of a voice server (151): processor (156); RAM (168) holding the voice server application (188), VoiceXML interpreter (192) with dialog and FIA, speech engine (153) with ASR engine (150), grammar (104), lexicon (106), acoustic model (108), and TTS engine (194), and operating system (154); front side, memory, video, and expansion buses; bus adapter (158); video, disk drive, I/O, and communications adapters; data communications network (100); user input devices; data storage.

FIG. 3: Functional block diagram of a thin client architecture: multimodal device (152) with multimodal browser (196), multimodal application (195), X+V page (124), text field (101), sound card with codec, and voice services module (130); VOIP connection (216) through a data communications network (100) to a web server (147) and a voice server (151) running the voice server application (188), VoiceXML interpreter (192) with FIA and dialog, and speech engine (153) with ASR engine, lexicon, acoustic model, grammar, and TTS engine.

FIG. 4: Block diagram of a multimodal device (152): RAM (168) holding the multimodal application (195) with X+V page (124) and text field (101), multimodal browser (196), embedded VoiceXML interpreter (192) with dialog and FIA, speech engine (153) with grammar, lexicon, acoustic model, TTS engine, and ASR engine, and operating system; buses and adapters; sound card with codec; speaker; microphone; data communications network (100).

FIG. 5: Line drawing of a multimodal device (152): multimodal browser (196), VoiceXML interpreter (192), GUI, speech engine, and ASR; a text input field (101) containing the typed characters 're' and predictive texts (e.g., 'research', 'restaurant').

FIG. 6: Flow chart of an exemplary method: the multimodal browser (196) renders the predictive texts on a GUI of the multimodal device in dependence upon the text prediction event; the VoiceXML interpreter (192) identifies a text prediction event characterized by one or more predictive texts for a text field of the multimodal application; creates a grammar in dependence upon the predictive texts, generating a grammar rule for the grammar; creates a user prompt for a voice utterance in dependence upon the predictive texts and prompts the user for the voice utterance in dependence upon the user prompt; receives a voice utterance from a user; determines, using an ASR engine, recognition results in dependence upon the voice utterance and the grammar; and renders at least a portion of the recognition results in the text field.

SPEECH-ENABLED PREDICTIVE TEXT SELECTION FOR A MULTIMODAL APPLICATION

BACKGROUND OF THE INVENTION

[0001] 1. Field of the Invention

[0002] The field of the invention is data processing, or, more specifically, methods, apparatus, and products for speech-enabled predictive text selection for a multimodal application.

[0003] 2. Description of Related Art

[0004] User interaction with applications running on small devices through a keyboard or stylus has become increasingly limited and cumbersome as those devices have become increasingly smaller. In particular, small handheld devices like mobile phones and PDAs serve many functions and contain sufficient processing power to support user interaction through multimodal access, that is, by interaction in non-voice modes as well as voice mode. Devices which support multimodal access combine multiple user input modes or channels in the same interaction allowing a user to interact with the applications on the device simultaneously through multiple input modes or channels. The methods of input include speech recognition, keyboard, touch screen, stylus, mouse, handwriting, and others. Multimodal input often makes using a small device easier.

[0005] Multimodal applications are often formed by sets of markup documents served up by web servers for display on multimodal browsers. A 'multimodal browser,' as the term is used in this specification, generally means a web browser capable of receiving multimodal input and interacting with users with multimodal output, where modes of the multimodal input and output include at least a speech mode. Multimodal browsers typically render web pages written in XHTML+Voice ('X+V'). X+V provides a markup language that enables users to interact with a multimodal application, often running on a server, through spoken dialog in addition to traditional means of input such as keyboard strokes and mouse pointer action. Visual markup tells a multimodal browser what the user interface is to look like and how it is to behave when the user types, points, or clicks. Similarly, voice markup tells a multimodal browser what to do when the user speaks to it. For visual markup, the multimodal browser uses a graphics engine; for voice markup, the multimodal browser uses a speech engine. X+V adds spoken interaction to standard web content by integrating XHTML (eXtensible Hypertext Markup Language) and speech recognition vocabularies supported by VoiceXML. For visual markup, X+V includes the XHTML standard. For voice markup, X+V includes a subset of VoiceXML. For synchronizing the VoiceXML elements with corresponding visual interface elements, X+V uses events. XHTML includes voice modules that support speech synthesis, speech dialogs, command and control, and speech grammars. Voice handlers can be attached to XHTML elements and respond to specific events. Voice interaction features are integrated with XHTML and can consequently be used directly within XHTML content.

[0006] In addition to X+V, multimodal applications also may be implemented with Speech Application Tags ('SALT'). SALT is a markup language developed by the Salt Forum. Both X+V and SALT are markup languages for creating applications that use voice input/speech recognition and voice output/speech synthesis. Both SALT applications and X+V applications use underlying speech recognition and synthesis technologies or 'speech engines' to do the work of recognizing and generating human speech. As markup languages, both X+V and SALT provide markup-based programming environments for using speech engines in an application's user interface. Both languages have language elements, markup tags, that specify what the speech-recognition engine should listen for and what the synthesis engine should 'say.' Whereas X+V combines XHTML, VoiceXML, and the XML Events standard to create multimodal applications, SALT does not provide a standard visual markup language or eventing model. Rather, it is a low-level set of tags for specifying voice interaction that can be embedded into other environments. In addition to X+V and SALT, multimodal applications may be implemented in Java with a Java speech framework, in C++, for example, and with other technologies and in other environments as well.

[0007] As mentioned above, a user may interact with a multimodal application by typing text on a keypad of a multimodal device. The drawback to this mode of user interaction is that it is difficult for a user to enter text because the small size of the device typically prohibits providing a full-size keyboard to the user. To partially overcome this limitation, predictive text input technology has been developed that accumulates a context composed of the words already typed by a user and the letters of the word currently being typed by the user. Such predictive text input technology uses the accumulated context to predict several possible words that the user intends to input. The user may then select the word that matches the user's intended input, thereby reducing the number of keystrokes required by the user. The drawback to current predictive text input technology, however, is that the user must manually select one of several possible words as the user's intended input through a graphical user interface. Furthermore, current predictive text input technology in general does not take advantage of the speech mode of user interaction available to a user of a multimodal device. Readers will therefore appreciate that room for improvement exists in predictive text selection for a multimodal application.

SUMMARY OF THE INVENTION

[0008] Methods, apparatus, and products are disclosed for speech-enabled predictive text selection for a multimodal application, the multimodal application operating on a multimodal device supporting multiple modes of interaction including a voice mode and one or more non-voice modes, the multimodal application operatively coupled to an automatic speech recognition ('ASR') engine through a VoiceXML interpreter, including: identifying, by the VoiceXML interpreter, a text prediction event, the text prediction event characterized by one or more predictive texts for a text input field of the multimodal application; creating, by the VoiceXML interpreter, a grammar in dependence upon the predictive texts; receiving, by the VoiceXML interpreter, a voice utterance from a user; and determining, by the VoiceXML interpreter using the ASR engine, recognition results in dependence upon the voice utterance and the grammar, the recognition results representing a user selection of a particular predictive text.

[0009] The foregoing and other objects, features and advantages of the invention will be apparent from the following more particular descriptions of exemplary embodiments of the invention as illustrated in the accompanying drawings wherein like reference numbers generally represent like parts of exemplary embodiments of the invention.

BRIEF DESCRIPTION OF THE DRAWINGS

[0010] FIG. 1 sets forth a network diagram illustrating an exemplary system for speech-enabled predictive text selection for a multimodal application according to embodiments of the present invention.

[0011] FIG. 2 sets forth a block diagram of automated computing machinery comprising an example of a computer useful as a voice server in speech-enabled predictive text selection for a multimodal application according to embodiments of the present invention.

[0012] FIG. 3 sets forth a functional block diagram of exemplary apparatus for speech-enabled predictive text selection for a multimodal application according to embodiments of the present invention.

[0013] FIG. 4 sets forth a block diagram of automated computing machinery comprising an example of a computer useful as a multimodal device in speech-enabled predictive text selection for a multimodal application according to embodiments of the present invention.

[0014] FIG. 5 sets forth a line drawing of a multimodal device useful in speech-enabled predictive text selection for a multimodal application according to embodiments of the present invention.

[0015] FIG. 6 sets forth a flow chart illustrating an exemplary method of speech-enabled predictive text selection for a multimodal application according to embodiments of the present invention.

DETAILED DESCRIPTION OF EXEMPLARY EMBODIMENTS

[0016] Exemplary methods, apparatus, and products for speech-enabled predictive text selection for a multimodal application according to embodiments of the present invention are described with reference to the accompanying drawings, beginning with FIG. 1. FIG. 1 sets forth a network diagram illustrating an exemplary system for speech-enabled predictive text selection for a multimodal application (195) according to embodiments of the present invention. Speech-enabled predictive text selection for a multimodal application in this example is implemented with a multimodal application (195) operating in a multimodal browser (196) on a multimodal device (152). The multimodal application (195) of FIG. 1 is composed of at least one X+V page (124). The X+V page (124) specifies a text input field (101) for receiving text from a user. The multimodal device (152) supports multiple modes of interaction including a voice mode and one or more non-voice modes of user interaction with the multimodal application (195). The voice mode is represented here with audio output of voice prompts and responses (314) from the multimodal devices and audio input of speech for recognition (315) from a user (128). Non-voice modes are represented by input/output devices such as keyboards and display screens on the multimodal devices (152). The multimodal application (195) is operatively coupled to an automatic speech recognition ('ASR') engine (150) through a VoiceXML interpreter (192). The operative coupling may be implemented with an application programming interface ('API'), a voice services module, or a VOIP connection as explained in more detail below.

[0017] The multimodal browser (196) of FIG. 1 provides an execution environment for the multimodal application (195). To support the multimodal browser (196) in processing the multimodal application (195), the system of FIG. 1 includes a VoiceXML interpreter (192). The VoiceXML interpreter (192) is a software module of computer program instructions that accepts voice dialog instructions and other data from a multimodal application, typically in the form of a VoiceXML <form> element. The voice dialog instructions include one or more grammars, data input elements, event handlers, and so on, that advise the VoiceXML interpreter (192) how to administer voice input from a user and voice prompts and responses to be presented to a user. The VoiceXML interpreter (192) administers such dialogs by processing the dialog instructions sequentially in accordance with a VoiceXML Form Interpretation Algorithm ('FIA').

[0018] The VoiceXML interpreter (192) of FIG. 1 is improved for speech-enabled predictive text selection for a multimodal application (195) according to embodiments of the present invention. The VoiceXML interpreter (192) may operate generally for speech-enabled predictive text selection for a multimodal application (195) according to embodiments of the present invention by: identifying a text prediction event, the text prediction event characterized by one or more predictive texts for the text input field (101) of the multimodal application (195); creating a grammar in dependence upon the predictive texts; receiving a voice utterance from a user; and determining, using the ASR engine (150), recognition results in dependence upon the voice utterance and the grammar, the recognition results representing a user selection of a particular predictive text.

[0019] In the example of FIG. 1, the VoiceXML interpreter (192) may also operate generally for speech-enabled predictive text selection for a multimodal application (195) according to embodiments of the present invention by: creating a user prompt for the voice utterance in dependence upon the predictive texts; and prompting the user for the voice utterance in dependence upon the user prompt. The VoiceXML interpreter (192) may further operate generally for speech-enabled predictive text selection for a multimodal application (195) according to embodiments of the present invention by: rendering at least a portion of the recognition results in the text input field (101). In the example of FIG. 1, the multimodal browser (196) may operate generally for speech-enabled predictive text selection for a multimodal application (195) according to embodiments of the present invention by: rendering the predictive texts on a graphical user interface of the multimodal device in dependence upon the text prediction event.

[0020] As mentioned above, the VoiceXML interpreter (192) identifies a text prediction event. A text prediction event is an event that is triggered each time a user enters a character into a text input field. The text prediction event may occur when the user types a character in the text input field (101) of the multimodal application (195). The text prediction event may also occur when the user speaks a character for input in the text input field (101) of the multimodal application (195). When triggered, the text prediction event activates a predictive text algorithm that determines one or more possible words that the user intends to input into the text input field. The text prediction event may be implemented according to the Document Object Model ('DOM') Events specification, the XML Events specification, or any other standard as will occur to those of skill in the art.
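By way of illustration, here is a minimal sketch of an X+V page of the kind described above, with an XHTML text input field whose keystrokes raise an event and a VoiceXML dialog for the voice mode. The sketch is illustrative only and is not taken from the drawings; the element ids, the 'keyup' event type, and the empty dialog body are assumptions, and the grammar for the dialog would be created at run time by the VoiceXML interpreter in dependence upon the predictive texts.

<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:vxml="http://www.w3.org/2001/vxml"
      xmlns:ev="http://www.w3.org/2001/xml-events">
  <head>
    <!-- voice dialog administered by the VoiceXML interpreter; a grammar
         created in dependence upon the predictive texts is supplied here -->
    <vxml:form id="predictionDialog">
      <vxml:field name="selection">
        <vxml:prompt>Say one of the predicted words.</vxml:prompt>
      </vxml:field>
    </vxml:form>
    <!-- XML Events listener: each keystroke in the text input field raises
         an event that activates the predictive text algorithm and, through
         the text prediction event, the voice dialog -->
    <ev:listener event="keyup" observer="in1" handler="#predictionDialog"/>
  </head>
  <body>
    <form action="">
      <!-- the text input field (101) of the multimodal application -->
      <input type="text" id="in1" name="in1"/>
    </form>
  </body>
</html>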

[0021] As mentioned above, the VoiceXML interpreter (192) creates a grammar based on predictive texts of the predictive text event. A grammar communicates to the ASR engine (150) the words and sequences of words that currently may be recognized. In the example of FIG. 1, a grammar includes grammar rules that advise an ASR engine or a voice interpreter which words and word sequences presently can be recognized. Grammars for use according to embodiments of the present invention may be expressed in any format supported by an ASR engine, including, for example, the Java Speech Grammar Format ('JSGF'), the format of the W3C Speech Recognition Grammar Specification ('SRGS'), the Augmented Backus-Naur Format ('ABNF') from the IETF's RFC2234, in the form of a stochastic grammar as described in the W3C's Stochastic Language Models (N-Gram) Specification, and in other grammar formats as may occur to those of skill in the art. Grammars typically operate as elements of dialogs, such as, for example, a VoiceXML <menu> or an X+V <form>. A grammar's definition may be expressed in-line in a dialog. Or the grammar may be implemented externally in a separate grammar document and referenced from within a dialog with a URI. Here is an example of a grammar expressed in JSGF:

<grammar scope="dialog"><![CDATA[
  #JSGF V1.0;
  grammar command;
  <command> = [remind me to] call | phone | telephone <name> <when>;
  <name> = bob | martha | joe | pete | chris | john | artoush | tom;
  <when> = today | this afternoon | tomorrow | next week;
  ]]>
</grammar>

[0022] In this example, the elements named <command>, <name>, and <when> are rules of the grammar. Rules are a combination of a rulename and an expansion of a rule that advises an ASR engine or a VoiceXML interpreter which words presently can be recognized. In the example above, rule expansions include conjunction and disjunction, and the vertical bars '|' mean 'or.' An ASR engine or a VoiceXML interpreter processes the rules in sequence, first <command>, then <name>, then <when>. The <command> rule accepts for recognition 'call' or 'phone' or 'telephone' plus, that is, in conjunction with, whatever is returned from the <name> rule and the <when> rule. The <name> rule accepts 'bob' or 'martha' or 'joe' or 'pete' or 'chris' or 'john' or 'artoush' or 'tom,' and the <when> rule accepts 'today' or 'this afternoon' or 'tomorrow' or 'next week.' The command grammar as a whole matches utterances like these, for example:

[0023] "phone bob next week,"

[0024] "telephone martha this afternoon,"

[0025] "remind me to call chris tomorrow," and

[0026] "remind me to phone pete today."
[0027] A multimodal device on which a multimodal application operates is an automated device, that is, automated computing machinery or a computer program running on an automated device, that is capable of accepting from users more than one mode of input, keyboard, mouse, stylus, and so on, including speech input, and also providing more than one mode of output such as, graphic, speech, and so on. A multimodal device is generally capable of accepting speech input from a user, digitizing the speech, and providing digitized speech to a speech engine for recognition. A multimodal device may be implemented, for example, as a voice-enabled browser on a laptop, a voice browser on a telephone handset, an online game implemented with Java on a personal computer, and with other combinations of hardware and software as may occur to those of skill in the art. Because multimodal applications may be implemented in markup languages (X+V, SALT), object-oriented languages (Java, C++), procedural languages (the C programming language), and in other kinds of computer languages as may occur to those of skill in the art, a multimodal application may refer to any software application, server-oriented or client-oriented, thin client or thick client, that administers more than one mode of input and more than one mode of output, typically including visual and speech modes.

[0028] The system of FIG. 1 includes several example multimodal devices:

[0029] personal computer (107) which is coupled for data communications to data communications network (100) through wireline connection (120),

[0030] personal digital assistant ('PDA') (112) which is coupled for data communications to data communications network (100) through wireless connection (114),

[0031] mobile telephone (110) which is coupled for data communications to data communications network (100) through wireless connection (116), and

[0032] laptop computer (126) which is coupled for data communications to data communications network (100) through wireless connection (118).

[0033] Each of the example multimodal devices (152) in the system of FIG. 1 includes a microphone, an audio amplifier, a digital-to-analog converter, and a multimodal application capable of accepting from a user (128) speech for recognition (315), digitizing the speech, and providing the digitized speech to a speech engine for recognition. The speech may be digitized according to industry standard codecs, including but not limited to those used for Distributed Speech Recognition as such. Methods for 'COding/DECoding' speech are referred to as 'codecs.' The European Telecommunications Standards Institute ('ETSI') provides several codecs for encoding speech for use in DSR, including, for example, the ETSI ES 201 108 DSR Front-end Codec, the ETSI ES 202 050 Advanced DSR Front-end Codec, the ETSI ES 202 211 Extended DSR Front-end Codec, and the ETSI ES 202 212 Extended Advanced DSR Front-end Codec. In standards such as RFC3557 entitled

[0034] RTP Payload Format for European Telecommunications Standards Institute (ETSI) European Standard ES 201 108 Distributed Speech Recognition Encoding and the Internet Draft entitled

[0035] RTP Payload Formats for European Telecommunications Standards Institute (ETSI) European Standard ES 202 050, ES 202 211, and ES 202 212 Distributed Speech Recognition Encoding,

the IETF provides standard RTP payload formats for various codecs. It is useful to note, therefore, that there is no limitation in the present invention regarding codecs, payload formats, or packet structures.

Speech for speech-enabled predictive text selection for a multimodal application according to embodiments of the present invention may be encoded with any codec, including, for example:

[0036] AMR (Adaptive Multi-Rate Speech coder),

[0037] ARDOR (Adaptive Rate-Distortion Optimized sound codeR),

[0038] Dolby Digital (A/52, AC3),

[0039] DTS (DTS Coherent Acoustics),

[0040] MP1 (MPEG audio layer-1),

[0041] MP2 (MPEG audio layer-2) Layer 2 audio codec (MPEG-1, MPEG-2 and non-ISO MPEG-2.5),

[0042] MP3 (MPEG audio layer-3) Layer 3 audio codec (MPEG-1, MPEG-2 and non-ISO MPEG-2.5),

[0043] Perceptual Audio Coding,

[0044] FS-1015 (LPC-10),

[0045] FS-1016 (CELP),

[0046] G.726 (ADPCM),

[0047] G.728 (LD-CELP),

[0048] G.729 (CS-ACELP),

[0049] GSM,

[0050] HILN (MPEG-4 Parametric audio coding), and

[0051] others as may occur to those of skill in the art.

[0052] As mentioned, a multimodal device according to embodiments of the present invention is capable of providing speech to a speech engine for recognition. The speech engine (153) of FIG. 1 is a functional module, typically a software module, although it may include specialized hardware also, that does the work of recognizing and generating or 'synthesizing' human speech. The speech engine (153) implements speech recognition by use of a further module referred to in this specification as an ASR engine (150), and the speech engine carries out speech synthesis by use of a further module referred to in this specification as a text-to-speech ('TTS') engine (not shown). As shown in FIG. 1, a speech engine (153) may be installed locally in the multimodal device (107) itself, or a speech engine (153) may be installed remotely with respect to the multimodal device, across a data communications network (100) in a voice server (151). A multimodal device that itself contains its own speech engine is said to implement a 'thick multimodal client' or 'thick client,' because the thick multimodal client device itself contains all the functionality needed to carry out speech recognition and speech synthesis, through API calls to speech recognition and speech synthesis modules in the multimodal device itself, with no need to send requests for speech recognition across a network and no need to receive synthesized speech across a network from a remote voice server. A multimodal device that does not contain its own speech engine is said to implement a 'thin multimodal client' or simply a 'thin client,' because the thin multimodal client itself contains only a relatively thin layer of multimodal application software that obtains speech recognition and speech synthesis services from a voice server located remotely across a network from the thin client. For ease of explanation, only one (107) of the multimodal devices (152) in the system of FIG. 1 is shown with a speech engine (153), but readers will recognize that any multimodal device may have a speech engine according to embodiments of the present invention.

[0053] A multimodal application (195) in this example provides speech for recognition and text for speech synthesis to a speech engine through the VoiceXML interpreter (192).

[0054] As shown in FIG. 1, the VoiceXML interpreter (192) may be installed locally in the multimodal device (107) itself, or the VoiceXML interpreter (192) may be installed remotely with respect to the multimodal device, across a data communications network (100) in a voice server (151). In a thick client architecture, a multimodal device (152) includes both its own speech engine (153) and its own VoiceXML interpreter (192). The VoiceXML interpreter (192) exposes an API to the multimodal application (195) for use in providing speech recognition and speech synthesis for the multimodal application. The multimodal application (195) provides dialog instructions, VoiceXML <form> elements, grammars, input elements, event handlers, and so on, through the API to the VoiceXML interpreter, and the VoiceXML interpreter administers the speech engine on behalf of the multimodal application. In the thick client architecture, VoiceXML dialogs are interpreted by a VoiceXML interpreter on the multimodal device. In the thin client architecture, VoiceXML dialogs are interpreted by a VoiceXML interpreter on a voice server (151) located remotely across a data communications network (100) from the multimodal device running the multimodal application (195).

[0055] The VoiceXML interpreter (192) provides grammars, speech for recognition, and text prompts for speech synthesis to the speech engine (153), and the VoiceXML interpreter (192) returns speech engine (153) output to the multimodal application in the form of recognized speech, semantic interpretation results, and digitized speech for voice prompts. In a thin client architecture, although the VoiceXML interpreter (192) is located remotely from the multimodal client device in a voice server (151), the API for the VoiceXML interpreter is still implemented in the multimodal device (152), with the API modified to communicate voice dialog instructions, speech for recognition, and text and voice prompts to and from the VoiceXML interpreter on the voice server (151). For ease of explanation, only one (107) of the multimodal devices (152) in the system of FIG. 1 is shown with a VoiceXML interpreter (192), but readers will recognize that any multimodal device may have a VoiceXML interpreter according to embodiments of the present invention. Each of the example multimodal devices (152) in the system of FIG. 1 may be configured for speech-enabled predictive text selection for a multimodal application by installing and running on the multimodal device a VoiceXML interpreter and an ASR engine that supports speech-enabled predictive text selection for a multimodal application according to embodiments of the present invention.

[0056] The use of these four example multimodal devices (152) is for explanation only, not for limitation of the invention. Any automated computing machinery capable of accepting speech from a user, providing the speech digitized to an ASR engine through a VoiceXML interpreter, and receiving and playing speech prompts and responses from the VoiceXML interpreter may be improved to function as a multimodal device according to embodiments of the present invention.

[0057] The system of FIG. 1 also includes a voice server (151), which is connected to data communications network (100) through wireline connection (122). The voice server (151) is a computer that runs a speech engine (153) that provides voice recognition services for multimodal devices by accepting requests for speech recognition and returning text representing recognized speech. Voice server (151) also provides speech synthesis, text to speech ('TTS') conversion, for voice prompts and voice responses (314) to user input in multimodal applications such as, for example, X+V applications, SALT applications, or Java voice applications.

[0058] The system of FIG. 1 includes a data communications network (100) that connects the multimodal devices (152) and the voice server (151) for data communications. A data communications network for speech-enabled predictive text selection for a multimodal application according to embodiments of the present invention is a data communications network composed of a plurality of computers that function as data communications routers connected for data communications with packet switching protocols. Such a data communications network may be implemented with optical connections, wireline connections, or with wireless connections. Such a data communications network may include intranets, internets, local area data communications networks ('LANs'), and wide area data communications networks ('WANs'). Such a data communications network may implement, for example:

[0059] a link layer with the Ethernet(TM) Protocol or the Wireless Ethernet(TM) Protocol,

[0060] a data communications network layer with the Internet Protocol ('IP'),

[0061] a transport layer with the Transmission Control Protocol ('TCP') or the User Datagram Protocol ('UDP'),

[0062] an application layer with the HyperText Transfer Protocol ('HTTP'), the Session Initiation Protocol ('SIP'), the Real Time Protocol ('RTP'), the Distributed Multimodal Synchronization Protocol ('DMSP'), the Wireless Access Protocol ('WAP'), the Handheld Device Transfer Protocol ('HDTP'), the ITU protocol known as H.323, and

[0063] other protocols as will occur to those of skill in the art.

[0064] The system of FIG. 1 also includes a web server (147) connected for data communications through wireline connection (123) to network (100) and therefore to the multimodal devices (152). The web server (147) may be any server that provides to client devices X+V markup documents (125) that compose multimodal applications. The web server (147) typically provides such markup documents via a data communications protocol, HTTP, HDTP, WAP, or the like. That is, although the term 'web' is used to describe the web server generally in this specification, there is no limitation of data communications between multimodal devices and the web server to HTTP alone. A multimodal application in a multimodal device then, upon receiving from the web server (147) an X+V markup document as part of a multimodal application, may execute speech elements by use of a VoiceXML interpreter (192) and speech engine (153) in the multimodal device itself or by use of a VoiceXML interpreter (192) and speech engine (153) located remotely from the multimodal device in a voice server (151).

[0065] The arrangement of the multimodal devices (152), the web server (147), the voice server (151), and the data communications network (100) making up the exemplary system illustrated in FIG. 1 are for explanation, not for limitation. Data processing systems useful for speech-enabled predictive text selection for a multimodal application according to various embodiments of the present invention may include additional servers, routers, other devices, and peer-to-peer architectures, not shown in FIG. 1, as will occur to those of skill in the art. Data communications networks in such data processing systems may support many data communications protocols in addition to those noted above. Various embodiments of the present invention may be implemented on a variety of hardware platforms in addition to those illustrated in FIG. 1.

[0066] Speech-enabled predictive text selection for a multimodal application according to embodiments of the present invention in a thin client architecture may be implemented with one or more voice servers, computers, that is, automated computing machinery, that provide speech recognition and speech synthesis. For further explanation, therefore, FIG. 2 sets forth a block diagram of automated computing machinery comprising an example of a computer useful as a voice server (151) in speech-enabled predictive text selection for a multimodal application according to embodiments of the present invention. The voice server (151) of FIG. 2 includes at least one computer processor (156) or 'CPU' as well as random access memory (168) ('RAM') which is connected through a high speed memory bus (166) and bus adapter (158) to processor (156) and to other components of the voice server (151).

[0067] Stored in RAM (168) is a voice server application (188), a module of computer program instructions capable of operating a voice server in a system that is configured for speech-enabled predictive text selection for a multimodal application according to embodiments of the present invention. Voice server application (188) provides voice recognition services for multimodal devices by accepting requests for speech recognition and returning speech recognition results, including text representing recognized speech, text for use as variable values in dialogs, and text as string representations of scripts for semantic interpretation. Voice server application (188) also includes computer program instructions that provide text-to-speech ('TTS') conversion for voice prompts and voice responses to user input in multimodal applications such as, for example, X+V applications, SALT applications, or Java Speech applications. Voice server application (188) may be implemented as a web server, implemented in Java, C++, or another language, that supports speech-enabled predictive text selection for a multimodal application according to embodiments of the present invention.

[0068] The voice server (151) in this example includes a speech engine (153). The speech engine is a functional module, typically a software module, although it may include specialized hardware also, that does the work of recognizing and synthesizing human speech. The speech engine (153) includes an automated speech recognition ('ASR') engine (150) for speech recognition and a text-to-speech ('TTS') engine (194) for generating speech. The speech engine (153) also includes a grammar (104) created by a VoiceXML interpreter (192) in dependence upon predictive texts for a predictive text event. The speech engine (153) also includes a lexicon (106) and a language-specific acoustic model (108). The language-specific acoustic model (108) is a data structure, a table or database, for example, that associates Speech Feature Vectors with phonemes representing, to the extent that it is practically feasible to do so, all pronunciations of all the words in a human language. The lexicon (106) is an association of words in text form with phonemes representing pronunciations of each word; the lexicon effectively identifies words that are capable of recognition by an ASR engine. Also stored in RAM (168) is a Text To Speech ('TTS') Engine (194), a module of computer program instructions that accepts text as input and returns the same text in the form of digitally encoded speech, for use in providing speech as prompts for and responses to users of multimodal systems.
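As an illustration of the association the lexicon provides, here is a minimal sketch of two lexicon entries expressed in the W3C Pronunciation Lexicon markup. The format is an assumption for illustration only; the specification above does not prescribe any particular lexicon format, and the example words and pronunciations are hypothetical.

<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0" alphabet="ipa" xml:lang="en-US"
         xmlns="http://www.w3.org/2005/01/pronunciation-lexicon">
  <!-- each lexeme associates a word in text form with phonemes
       representing a pronunciation of the word -->
  <lexeme>
    <grapheme>research</grapheme>
    <phoneme>rɪˈsɝtʃ</phoneme>
  </lexeme>
  <lexeme>
    <grapheme>restaurant</grapheme>
    <phoneme>ˈrɛstərɑnt</phoneme>
  </lexeme>
</lexicon>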

[0069] The voice server application (188) in this example is configured to receive, from a multimodal client located remotely across a network from the voice server, digitized speech for recognition from a user and pass the speech along to the ASR engine (150) for recognition. ASR engine (150) is a module of computer program instructions, also stored in RAM in this example. In carrying out speech-enabled predictive text selection for a multimodal application, the ASR engine (150) receives speech for recognition in the form of at least one digitized word and uses frequency components of the digitized word to derive a Speech Feature Vector ('SFV'). An SFV may be defined, for example, by the first twelve or thirteen Fourier or frequency domain components of a sample of digitized speech. The ASR engine can use the SFV to infer phonemes for the word from the language-specific acoustic model (108). The ASR engine then uses the phonemes to find the word in the lexicon (106).

[0070] In the example of FIG. 2, the voice server application (188) passes the speech along to the ASR engine (150) for recognition through a VoiceXML interpreter (192). The VoiceXML interpreter (192) is a software module of computer program instructions that accepts voice dialogs (121) from a multimodal application running remotely on a multimodal device. The dialogs (121) include dialog instructions, typically implemented in the form of a VoiceXML <form> element. The voice dialog instructions include one or more grammars, data input elements, event handlers, and so on, that advise the VoiceXML interpreter (192) how to administer voice input from a user and voice prompts and responses to be presented to a user. The VoiceXML interpreter (192) administers such dialogs by processing the dialog instructions sequentially in accordance with a VoiceXML Form Interpretation Algorithm ('FIA') (193).

[0071] The VoiceXML interpreter (192) of FIG. 2 is improved for speech-enabled predictive text selection for a multimodal application (195) according to embodiments of the present invention. The VoiceXML interpreter (192) may operate generally for speech-enabled predictive text selection for a multimodal application according to embodiments of the present invention by: identifying a text prediction event, the text prediction event characterized by one or more predictive texts for the text input field of the multimodal application; creating a grammar in dependence upon the predictive texts; receiving a voice utterance from a user; and determining, using the ASR engine (150), recognition results in dependence upon the voice utterance and the grammar, the recognition results representing a user selection of a particular predictive text.

[0072] In the example of FIG. 2, the VoiceXML interpreter (192) may also operate generally for speech-enabled predictive text selection for a multimodal application according to embodiments of the present invention by: creating a user prompt for the voice utterance in dependence upon the predictive texts; and prompting the user for the voice utterance in dependence upon the user prompt. The VoiceXML interpreter (192) may operate generally for speech-enabled predictive text selection for a multimodal application according to embodiments of the present invention by: rendering at least a portion of the recognition results in the text input field.

[0073] Also stored in RAM (168) is an operating system (154). Operating systems useful in voice servers according to embodiments of the present invention include UNIX(TM), Linux(TM), Microsoft NT(TM), IBM's AIX(TM), IBM's i5/OS(TM), and others as will occur to those of skill in the art. Operating system (154), voice server application (188), VoiceXML interpreter (192), speech engine (153), including ASR engine (150), and TTS Engine (194) in the example of FIG. 2 are shown in RAM (168), but many components of such software typically are stored in non-volatile memory also, for example, on a disk drive (170).

[0074] Voice server (151) of FIG. 2 includes bus adapter (158), a computer hardware component that contains drive electronics for high speed buses, the front side bus (162), the video bus (164), and the memory bus (166), as well as drive electronics for the slower expansion bus (160). Examples of bus adapters useful in voice servers according to embodiments of the present invention include the Intel Northbridge, the Intel Memory Controller Hub, the Intel Southbridge, and the Intel I/O Controller Hub.

[0075] Examples of expansion buses useful in voice servers according to embodiments of the present invention include Industry Standard Architecture ('ISA') buses and Peripheral Component Interconnect ('PCI') buses.

[0076] Voice server (151) of FIG. 2 includes disk drive adapter (172) coupled through expansion bus (160) and bus adapter (158) to processor (156) and other components of the voice server (151). Disk drive adapter (172) connects non-volatile data storage to the voice server (151) in the form of disk drive (170). Disk drive adapters useful in voice servers include Integrated Drive Electronics ('IDE') adapters, Small Computer System Interface ('SCSI') adapters, and others as will occur to those of skill in the art. In addition, non-volatile computer memory may be implemented for a voice server as an optical disk drive, electrically erasable programmable read-only memory (so-called 'EEPROM' or 'Flash' memory), RAM drives, and so on, as will occur to those of skill in the art.

[0077] The example voice server of FIG. 2 includes one or more input/output ('I/O') adapters (178). I/O adapters in voice servers implement user-oriented input/output through, for example, software drivers and computer hardware for controlling output to display devices such as computer display screens, as well as user input from user input devices (181) such as keyboards and mice. The example voice server of FIG. 2 includes a video adapter (209), which is an example of an I/O adapter specially designed for graphic output to a display device (180) such as a display screen or computer monitor. Video adapter (209) is connected to processor (156) through a high speed video bus (164), bus adapter (158), and the front side bus (162), which is also a high speed bus.

[0078] The exemplary voice server (151) of FIG. 2 includes a communications adapter (167) for data communications with other computers (182) and for data communications with a data communications network (100). Such data communications may be carried out serially through RS-232 connections, through external buses such as a Universal Serial Bus ('USB'), through data communications networks such as IP data communications networks, and in other ways as will occur to those of skill in the art. Communications adapters implement the hardware level of data communications through which one computer sends data communications to another computer, directly or through a data communications network. Examples of communications adapters useful for speech-enabled predictive text selection for a multimodal application according to embodiments of the present invention include modems for wired dial-up communications, Ethernet (IEEE 802.3) adapters for wired data communications network communications, and 802.11 adapters for wireless data communications network communications.

[0079] For further explanation, FIG. 3 sets forth a functional block diagram of exemplary apparatus for speech-enabled predictive text selection for a multimodal application in a thin client architecture according to embodiments of the present invention. The example of FIG. 3 includes a multimodal device (152) and a voice server (151) connected for data communication by a VOIP connection (216) through a data communications network (100). A multimodal application (195) operates in a multimodal browser (196) on the multimodal device (152), and a voice server application (188) operates on the voice server (151). The multimodal application (195) may be composed of at least one X+V page (124) that executes in the multimodal browser (196). The X+V page (124) of FIG. 3 specifies a text input field (101) for receiving text from a user.

[0080] In the example of FIG. 3, the multimodal device (152) supports multiple modes of interaction including a voice mode and one or more non-voice modes. The exemplary multimodal device (152) of FIG. 3 supports voice with a sound card (174), which is an example of an I/O adapter specially designed for accepting analog audio signals from a microphone (176) and converting the audio analog signals to digital form for further processing by a codec (183). The example multimodal device (152) of FIG. 3 may support non-voice modes of user interaction with keyboard input, mouseclicks, a graphical user interface ('GUI'), and so on, as will occur to those of skill in the art.

[0081] In addition to the voice server application (188), the voice server (151) also has installed upon it a speech engine (153) with an ASR engine (150), a grammar (104), a lexicon (106), a language-specific acoustic model (108), and a TTS engine (194), as well as a VoiceXML interpreter (192) that includes a form interpretation algorithm (193). VoiceXML interpreter (192) interprets and executes a VoiceXML dialog (121) received from the multimodal application and provided to VoiceXML interpreter (192) through voice server application (188). VoiceXML input to VoiceXML interpreter (192) may originate from the multimodal application (195) implemented as an X+V client running remotely in a multimodal browser (196) on the multimodal device (152). The VoiceXML interpreter (192) administers such dialogs by processing the dialog instructions sequentially in accordance with a VoiceXML Form Interpretation Algorithm ('FIA') (193).

[0082] VOIP stands for 'Voice Over Internet Protocol,' a generic term for routing speech over an IP-based data communications network. The speech data flows over a general-purpose packet-switched data communications network, instead of traditional dedicated, circuit-switched voice transmission lines. Protocols used to carry voice signals over the IP data communications network are commonly referred to as 'Voice over IP' or 'VOIP' protocols. VOIP traffic may be deployed on any IP data communications network, including data communications networks lacking a connection to the rest of the Internet, for instance on a private building-wide local area data communications network or 'LAN.'

[0083] Many protocols are used to effect VOIP. The two most popular types of VOIP are effected with the IETF's Session Initiation Protocol ('SIP') and the ITU's protocol known as 'H.323.' SIP clients use TCP and UDP port 5060 to connect to SIP servers. SIP itself is used to set up and tear down calls for speech transmission. VOIP with SIP then uses RTP for transmitting the actual encoded speech. Similarly, H.323 is an umbrella recommendation from the standards branch of the International Telecommunications Union that defines protocols to provide audio-visual communication sessions on any packet data communications network.

[0084] The apparatus of FIG. 3 operates in a manner that is similar to the operation of the system of FIG. 2 described above. Multimodal application (195) is a user-level, multimodal, client-side computer program that presents a voice interface to user (128), provides audio prompts and responses (314) and accepts input speech for recognition (315). Multimodal application (195) provides a speech interface through which a user may provide oral speech for recognition (315) through microphone (176) and have the speech digitized through an audio amplifier (185) and a coder/decoder ('codec') (183) of a sound card (174) and provide the digitized speech for recognition to ASR engine (150). Multimodal application (195), through the multimodal browser (196), an API (316), and a voice services module (130), then packages the digitized speech in a recognition request message according to a VOIP protocol, and transmits the speech to voice server (151) through the VOIP connection (216) on the network (100).

[0085] Voice server application (188) provides voice recognition services for multimodal devices by accepting dialog instructions, VoiceXML segments, and returning speech recognition results, including text representing recognized speech, text for use as variable values in dialogs, and output from execution of semantic interpretation scripts, as well as voice prompts. Voice server application (188) includes computer program instructions that provide text-to-speech ('TTS') conversion for voice prompts and voice responses to user input in multimodal applications providing responses to HTTP requests from multimodal browsers running on multimodal devices.

[0086] The voice server application (188) receives speech for recognition from a user and passes the speech through API calls to VoiceXML interpreter (192) which in turn uses an ASR engine (150) for speech recognition. The ASR engine receives digitized speech for recognition, uses frequency components of the digitized speech to derive an SFV, uses the SFV to infer phonemes for the word from the language-specific acoustic model (108), and uses the phonemes to find the speech in the lexicon (106). The ASR engine then compares speech found as words in the lexicon to words in a grammar (104) to determine whether words or phrases in speech are recognized by the ASR engine.

[0087] The multimodal application (195) is operatively coupled to the ASR engine (150) through the VoiceXML interpreter (192). In this example, the operative coupling to the ASR engine (150) through a VoiceXML interpreter (192) is implemented with a VOIP connection (216) through a voice services module (130). The voice services module is a thin layer of functionality, a module of computer program instructions, that presents an API (316) for use by an application level program in providing dialogs (121) and speech for recognition to a VoiceXML interpreter and receiving in response voice prompts and other responses, including action identifiers according to embodiments of the present invention. The VoiceXML interpreter (192), in turn, utilizes the speech engine (153) for speech recognition and generation services.

[0088] The VoiceXML interpreter (192) of FIG. 3 is improved for speech-enabled predictive text selection for a multimodal application (195) according to embodiments of the present invention. The VoiceXML interpreter (192) may operate generally for speech-enabled predictive text selection for a multimodal application (195) according to embodiments of the present invention by: identifying a text prediction event, the text prediction event characterized by one or more predictive texts for the text input field (101) of the multimodal application (195); creating a grammar (104) in dependence upon the predictive texts; receiving a voice utterance from a user (128); and determining, using the ASR engine (150), recognition results in dependence upon the voice utterance and the grammar (104), the recognition results representing a user selection of a particular predictive text.

[0089] In the example of FIG. 3, the VoiceXML interpreter (192) may also operate generally for speech-enabled predictive text selection for a multimodal application (195) according to embodiments of the present invention by: creating a user prompt for the voice utterance in dependence upon the predictive texts; and prompting the user (128) for the voice utterance in dependence upon the user prompt. The VoiceXML interpreter (192) may operate generally for speech-enabled predictive text selection for a multimodal application (195) according to embodiments of the present invention by: rendering at least a portion of the recognition results in the text input field (101). In the example of FIG. 3, the multimodal browser (196) may operate generally for speech-enabled predictive text selection for a multimodal application (195) according to embodiments of the present invention by: rendering the predictive texts on a graphical user interface of the multimodal device (152) in dependence upon the text prediction event.
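To make the prompt creation and selection steps concrete, the following is a minimal sketch of a VoiceXML dialog such as a VoiceXML interpreter might construct; it is not taken from the specification. The prompt wording, the field and grammar names, and the logging of the result are illustrative assumptions; the grammar referenced is one created in dependence upon the predictive texts, and in an actual X+V application the recognition results would be rendered in the text input field (101).

<vxml:form id="selectPrediction">
  <vxml:field name="prediction">
    <!-- user prompt created in dependence upon the predictive texts -->
    <vxml:prompt>Say research or restaurant.</vxml:prompt>
    <!-- grammar created in dependence upon the same predictive texts -->
    <vxml:grammar src="#predictivetexts"/>
    <vxml:filled>
      <!-- the recognition results represent the user's selection of a
           particular predictive text -->
      <vxml:log>selected: <vxml:value expr="prediction"/></vxml:log>
    </vxml:filled>
  </vxml:field>
</vxml:form>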
The data communications functions of tion for a multimodal application according to embodiments the voice services module (130) are transparent to applica of the present invention. The multimodal application (195) tions that call the API (316). At the application level, calls to implements speech recognition by accepting speech utter the API (316) may be issued from the multimodal broWser ances for recognition from a user and sending the utterance (196), Which provides an execution environment for the mul for recognition through VoiceXML interpreter API calls to timodal application (195). the ASR engine (150). The multimodal application (195) [0091] Speech-enabled predictive text selection for a mul implements speech synthesis generally by sending Words to timodal application of a multimodal application according to be used as prompts for a user to the TTS engine (194). As an embodiments of the present invention in thick client architec example of thick client architecture, the multimodal applica tures is generally implemented With multimodal devices, that tion (195) in this example does not send speech for recogni is, automated computing machinery or computers. In the tion across a netWork to a voice server for recognition, and the system of FIG. 1, for example, all the multimodal devices multimodal application (195) in this example does not receive (152) are implemented to some extent at least as computers. synthesiZed speech, TTS prompts and responses, across a For further explanation, therefore, FIG. 4 sets forth a block netWork from a voice server. All grammar processing, voice diagram of automated computing machinery comprising an recognition, and text to speech conversion in this example is example of a computer useful as a multimodal device (152) in performed in an embedded fashion in the multimodal device speech-enabled predictive text selection for a multimodal (152) itself. application according to embodiments of the present inven [0095] More particularly, multimodal application (195) in tion. In a multimodal device implementing a thick client this example is a user-level, multimodal, client-side computer architecture as illustrated in FIG. 4, the multimodal device program that provides a speech interface through Which a user (152) has no connection to a remote voice server containing a may provide oral speech for recognition through microphone VoiceXML interpreter and a speech engine. Rather, all the (176), have the speech digitiZed through an audio ampli?er components needed for speech synthesis and voice recogni (185) and a coder/decoder (‘codec’) (183) of a sound card tion in speech-enabled predictive text selection for a multi (174) and provide the digitiZed speech for recognition to ASR modal application according to embodiments of the present engine (150). The multimodal application (195) may be invention are installed or embedded in the multimodal device implemented as a set or sequence of X+V pages (124) execut itself. ing in a multimodal broWser (196) or microbroWser that [0092] The example multimodal device (152) of FIG. 4 passes VoiceXML grammars and digitiZed speech by calls includes several components that are structured and operate through aVoiceXML interpreter API directly to an embedded similarly to parallel components of the voice server, having VoiceXML interpreter (192) for processing. The embedded the same draWing reference numbers, as described above With VoiceXML interpreter (192) may in turn issue requests for reference to FIG. 
2: at least one computer processor (156), speech recognition through API calls directly to the embed frontside bus (162), RAM (168), high speed memory bus ded ASR engine (150). The embedded VoiceXML interpreter US 2008/0235029 A1 Sep.25,2008

The embedded VoiceXML interpreter (192) may then issue requests to the action classifier (132) to determine an action identifier in dependence upon the recognized result provided by the ASR engine (150). Multimodal application (195) also can provide speech synthesis, TTS conversion, by API calls to the embedded TTS engine (194) for voice prompts and voice responses to user input.

[0096] The multimodal application (195) is operatively coupled to the ASR engine (150) through a VoiceXML interpreter (192). In this example, the operative coupling through the VoiceXML interpreter is implemented using a VoiceXML interpreter API (316). The VoiceXML interpreter API (316) is a module of computer program instructions for use by an application level program in providing dialog instructions, speech for recognition, and other input to a VoiceXML interpreter and receiving in response voice prompts and other responses. The VoiceXML interpreter API presents the same application interface as is presented by the API of the voice service module (130 on FIG. 3) in a thin client architecture. At the application level, calls to the VoiceXML interpreter API may be issued from the multimodal browser (196), which provides an execution environment for the multimodal application (195) when the multimodal application is implemented with X+V. The VoiceXML interpreter (192), in turn, utilizes the speech engine (153) for speech recognition and generation services.

[0097] The VoiceXML interpreter (192) of FIG. 4 is improved for speech-enabled predictive text selection for a multimodal application (195) according to embodiments of the present invention. The VoiceXML interpreter (192) may operate generally for speech-enabled predictive text selection for a multimodal application (195) according to embodiments of the present invention by: identifying a text prediction event, the text prediction event characterized by one or more predictive texts for the text input field (101) of the multimodal application (195); creating a grammar in dependence upon the predictive texts; receiving a voice utterance from a user; and determining, using the ASR engine (150), recognition results in dependence upon the voice utterance and the grammar, the recognition results representing a user selection of a particular predictive text.

[0098] In the example of FIG. 4, the VoiceXML interpreter (192) may also operate generally for speech-enabled predictive text selection for a multimodal application (195) according to embodiments of the present invention by: creating a user prompt for the voice utterance in dependence upon the predictive texts; and prompting the user for the voice utterance in dependence upon the user prompt. The VoiceXML interpreter (192) may operate generally for speech-enabled predictive text selection for a multimodal application (195) according to embodiments of the present invention by: rendering at least a portion of the recognition results in the text input field (101). In the example of FIG. 4, the multimodal browser (196) may operate generally for speech-enabled predictive text selection for a multimodal application (195) according to embodiments of the present invention by: rendering the predictive texts on a graphical user interface of the multimodal device in dependence upon the text prediction event.

[0099] The multimodal application (195) in this example, running in a multimodal browser (196) on a multimodal device (152) that contains its own VoiceXML interpreter (192) and its own speech engine (153) with no network or VOIP connection to a remote voice server containing a remote VoiceXML interpreter or a remote speech engine, is an example of a so-called ‘thick client architecture,’ so-called because all of the functionality for processing voice mode interactions between a user and the multimodal application, as well as all or most of the functionality for speech-enabled predictive text selection for a multimodal application according to embodiments of the present invention, is implemented on the multimodal device itself.

[0100] For further explanation of a thick client architecture, FIG. 5 sets forth a line drawing of a multimodal device useful in speech-enabled predictive text selection for a multimodal application (195) according to embodiments of the present invention. In the example of FIG. 5, the multimodal application (195) operates in a multimodal browser (196) on the multimodal device (152). The multimodal application (195) of FIG. 5 specifies a text input field (101) for receiving text from a user. The multimodal device (152) of FIG. 5 supports multiple modes of interaction including a voice mode and one or more non-voice modes. The multimodal application (195) is operatively coupled to an ASR engine (150) of a speech engine (153) through a VoiceXML interpreter (192). In the example of FIG. 5, the operative coupling is implemented using an API exposed by the VoiceXML interpreter (192) to the multimodal browser (196), which provides an execution environment for the multimodal application (195).

[0101] In the example of FIG. 5, the VoiceXML interpreter (192) is improved for speech-enabled predictive text selection for the multimodal application (195) according to embodiments of the present invention. The VoiceXML interpreter (192) of FIG. 5 identifies a text prediction event. As mentioned above, a text prediction event is an event that is triggered each time a user enters a character into the text input field (101). The text prediction event may occur when the user types a character in the text input field (101) of the multimodal application (195). The text prediction event may also occur when the user speaks a character for input in the text input field (101) of the multimodal application (195). The text prediction event is characterized by one or more predictive texts (502) for the text input field (101) of the multimodal application (195). That is, when triggered, the text prediction event activates a predictive text algorithm that determines one or more predictive texts (502) that the user intends to input into the text input field. In the example of FIG. 5, a text prediction event is triggered when the user enters the character ‘r,’ when the user enters the character ‘e,’ and when the user enters the character ‘s’ in the text input field (101).

[0102] The multimodal browser (196) of FIG. 5 renders the predictive texts (502) on a graphical user interface (‘GUI’) (500) of the multimodal device (152) in dependence upon the text prediction event. As mentioned above, the most recent text prediction event occurs when the user enters the character ‘s’ in the text input field (101). Based on the text prediction event, a text prediction algorithm generates the predictive texts (502), including ‘research,’ ‘restaurant,’ and ‘restore,’ and renders the predictive texts (502) on the GUI (500) of the multimodal device (152).

[0103] In the example of FIG. 5, the VoiceXML interpreter (192) creates a grammar in dependence upon the predictive texts (502) and receives a voice utterance from the user of the multimodal device (152). Using the voice utterance, the grammar, and the ASR engine (150), the VoiceXML interpreter (192) determines recognition results that represent a user selection of a particular predictive text (502). For example, the VoiceXML interpreter (192) may receive digitized speech from a user representing one of the predictive texts (502) and use the ASR engine (150) to determine that the digitized speech represented the word ‘restaurant.’ The VoiceXML interpreter (192) of FIG. 5 may then render the recognition result ‘restaurant’ in the text input field (101).
[0104] In some embodiments, the VoiceXML interpreter (192) may create a user prompt for the voice utterance in dependence upon the predictive texts (502). For example, after the predictive text event is triggered, the VoiceXML interpreter (192) of FIG. 5 may generate a user prompt stating, ‘You can say research, restaurant, restore.’ The VoiceXML interpreter (192) may then prompt the user for a voice utterance using the user prompt.

[0105] For further explanation, FIG. 6 sets forth a flow chart illustrating an exemplary method of speech-enabled predictive text selection for a multimodal application according to embodiments of the present invention. In this example, the multimodal application (195) is composed of at least one X+V page (124). The X+V page (124) specifies a text input field (101) for receiving text from a user. The multimodal application (195) operates in a multimodal browser (196) on a multimodal device supporting multiple modes of interaction including a voice mode and one or more non-voice modes of user interaction with the multimodal application. The voice mode may be implemented in this example with audio output through a speaker and audio input through a microphone. Non-voice modes may be implemented by user input devices such as, for example, a keyboard and a mouse.

[0106] The multimodal application is operatively coupled to an ASR engine through a VoiceXML interpreter (192). The operative coupling provides a data communications path from the multimodal application (195) to an ASR engine for grammars, speech for recognition, and other input. The operative coupling also provides a data communications path from the ASR engine to the multimodal application (195) for recognized speech, semantic interpretation results, and other results. The operative coupling may be effected with a VoiceXML interpreter (192 on FIG. 4) when the multimodal application is implemented in a thick client architecture. When the multimodal application is implemented in a thin client architecture, the operative coupling may include a voice services module (130 on FIG. 3), a VOIP connection (216 on FIG. 3), and a VoiceXML interpreter (192 on FIG. 3).

[0107] The method of FIG. 6 includes identifying (600), by the VoiceXML interpreter (192), a text prediction event (602). The text prediction event (602) represents an event that is triggered each time a user enters a character into the text input field (101). The text prediction event (602) of FIG. 6 may occur when the user types a character in the text input field (101) of the multimodal application (195). The text prediction event may also occur when the user speaks a character for input in the text input field (101) of the multimodal application (195). In the example of FIG. 6, the text prediction event (602) is characterized by one or more predictive texts (604) for the text input field (101) of the multimodal application (195). That is, when triggered, the text prediction event (602) activates a predictive text algorithm that generates one or more predictive texts (604) that the user might intend to input into the text input field (101). Any predictive text algorithm as will occur to those of skill in the art may be useful in speech-enabled predictive text selection for a multimodal application according to embodiments of the present invention. In the example of FIG. 6, the text prediction event (602) may be implemented according to the Document Object Model (‘DOM’) Events Specification. The DOM is created by a multimodal browser (196) when the multimodal application (195) is loaded.

[0108] The VoiceXML interpreter (192) may identify (600) a text prediction event (602) according to the method of FIG. 6 by receiving an event notification according to the DOM event model to execute an ECMAScript script. The DOM event model is specified according to the DOM Events Specification. For further explanation, consider the following segment of an exemplary multimodal application:
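What follows is a minimal sketch of such a segment, assuming X+V’s XML Events (‘ev:’) attribute syntax for binding the handler to the text input field; only the identifiers ‘prediction-event’ and ‘input1’ are fixed by the description below, and the handler name is illustrative:

    <input type="text" id="input1" name="input1"/>

    <script type="text/javascript" id="prediction-event"
            ev:event="prediction-event" ev:observer="input1">
      // Illustrative handler body; paragraphs [0112] and [0114] below
      // describe what the script does with the predictive texts.
      handlePredictionEvent();
    </script>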
[0109] The exemplary multimodal application segment above specifies an ECMAScript script identified as ‘prediction-event.’ The VoiceXML interpreter executes the prediction-event script when a text prediction event originates in the text input field identified as ‘input1.’ Readers will note that the exemplary multimodal application segment is for explanation and not for limitation.

[0110] The method of FIG. 6 also includes rendering (606), by the multimodal browser (196), the predictive texts (604) on a graphical user interface of the multimodal device in dependence upon the text prediction event (602). The multimodal browser (196) may render (606) the predictive texts (604) on a graphical user interface according to the method of FIG. 6 by displaying each of the predictive texts (604) in a window on the GUI adjacent to the text input field (101) as illustrated, for example, on the GUI (500) described with reference to FIG. 5.

[0111] The method of FIG. 6 includes creating (608), by the VoiceXML interpreter (192), a grammar (104) in dependence upon the predictive texts (604). In the method of FIG. 6, creating (608), by the VoiceXML interpreter (192), a grammar (104) in dependence upon the predictive texts (604) includes generating (610) a grammar rule for the grammar (104) that specifies each predictive text (604) as an alternative for recognition. The VoiceXML interpreter (192) may then create (608) a grammar (104) according to the method of FIG. 6 by combining the grammar rule with a grammar template and storing the result as the grammar (104). For further explanation, consider another segment of the multimodal application illustrated above:
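A sketch of such a segment, consistent with the description in the following paragraph, might take this form; the identifiers ‘voice-search,’ ‘word-grammar,’ ‘prediction-event,’ and ‘input1’ are from that description, while the markup details and the names ‘predictiveTexts’ and ‘grammarTemplate’ are illustrative assumptions:

    <vxml:form id="voice-search">
      <vxml:field name="search">
        <!-- 'word-grammar' is empty when the page is loaded; the
             prediction-event script stores a generated grammar here -->
        <vxml:grammar id="word-grammar"></vxml:grammar>
      </vxml:field>
    </vxml:form>

    <script type="text/javascript" id="prediction-event"
            ev:event="prediction-event" ev:observer="input1">
      // Generate a rule that lists each predictive text as an
      // alternative for recognition, combine the rule with a grammar
      // template, and store the result as the 'word-grammar' grammar
      // of the 'voice-search' dialog.
      var rule = predictiveTexts.join(" | ");
      var grammar = grammarTemplate.replace("$rule$", rule);
      document.getElementById("word-grammar").textContent = grammar;
    </script>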
[0112] The exemplary multimodal application segment includes a VoiceXML dialog identified as ‘voice-search.’ The voice-search dialog specifies a grammar identified as ‘word-grammar’ that is initially empty when the exemplary multimodal application is loaded. As mentioned above, the exemplary multimodal application segment contains an ECMAScript script identified as ‘prediction-event’ that is executed by the VoiceXML interpreter when a text prediction event occurs for a particular text input field identified as ‘input1.’ The prediction-event script instructs a VoiceXML interpreter to generate a grammar rule that specifies each predictive text as an alternative for recognition, combine the grammar rule with a grammar template, and store the result as the ‘word-grammar’ grammar of the ‘voice-search’ dialog.
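For the predictive texts ‘research,’ ‘restaurant,’ and ‘restore’ of FIG. 5, the generated rule might, for example, take the following JSGF form; the grammar format is an illustrative assumption, and an SRGS XML rule would serve equally well:

    #JSGF V1.0;
    grammar words;
    // each predictive text is an alternative for recognition
    public <words> = research | restaurant | restore;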
[0113] The method of FIG. 6 also includes creating (612), by the VoiceXML interpreter (192), a user prompt (614) for the voice utterance (620) in dependence upon the predictive texts (604). The user prompt (614) of FIG. 6 represents a phrase provided to the user to solicit user input and may be implemented using the VoiceXML <prompt> element. In a manner similar to creating a grammar, the VoiceXML interpreter (192) may create a user prompt (614) for the voice utterance (620) according to the method of FIG. 6 by combining the predictive texts (604) with a prompt template and storing the result as the user prompt (614). For further explanation, consider again the segment of the multimodal application illustrated above:
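A sketch of the prompt portion of that segment, consistent with the description in the following paragraph (the name ‘promptTemplate’ and the template mechanics are illustrative assumptions):

    <vxml:field name="search">
      <!-- 'prompt1' is empty when the page is loaded; the
           prediction-event script stores a generated prompt here -->
      <vxml:prompt id="prompt1"></vxml:prompt>
      <vxml:grammar id="word-grammar"></vxml:grammar>
    </vxml:field>

    // Additional lines of the 'prediction-event' script: combine the
    // predictive texts with a prompt template, store the result as
    // 'prompt1', then activate the 'voice-search' dialog.
    var words = predictiveTexts.join(", ");
    document.getElementById("prompt1").textContent =
        promptTemplate.replace("$words$", words);
    // yields, e.g., 'You can say research, restaurant, restore.'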
[0114] As mentioned above, the exemplary multimodal application segment includes a VoiceXML dialog identified as ‘voice-search.’ In addition to specifying the ‘word-grammar’ grammar that is initially empty when the exemplary multimodal application is loaded, the voice-search dialog specifies a user prompt identified as ‘prompt1’ that is initially empty when the exemplary multimodal application is loaded. As mentioned above, the exemplary multimodal application segment contains an ECMAScript script identified as ‘prediction-event’ that is executed by the VoiceXML interpreter when a text prediction event occurs for a particular text input field identified as ‘input1.’ The prediction-event script instructs a VoiceXML interpreter to generate a grammar rule that specifies each predictive text as an alternative for recognition, combine the grammar rule with a grammar template, and store the result as the ‘word-grammar’ grammar of the ‘voice-search’ dialog. The prediction-event script also instructs a VoiceXML interpreter to combine the predictive texts with a prompt template and store the result as the user prompt ‘prompt1’ of the ‘voice-search’ dialog. The prediction-event script ends by instructing the VoiceXML interpreter to activate the ‘voice-search’ dialog for prompting the user and obtaining recognition results.

[0115] The method of FIG. 6 includes prompting (616), by the VoiceXML interpreter (192), the user for the voice utterance (620) in dependence upon the user prompt (614). The VoiceXML interpreter (192) may prompt (616) the user for the voice utterance (620) according to the method of FIG. 6 by passing the user prompt (614) to a text-to-speech (‘TTS’) engine, receiving a synthesized version of the user prompt (614) from the TTS engine, and providing the synthesized version of the user prompt (614) to the multimodal browser (196) for rendering to the user through a speaker of the multimodal device.
[0116] The method of FIG. 6 also includes receiving (618), by the VoiceXML interpreter (192), a voice utterance (620) from a user. The voice utterance (620) of FIG. 6 represents digitized human speech provided to the multimodal application (195) by a user of a multimodal device. As mentioned above, the multimodal application (195) may acquire the voice utterance (620) from a user through a microphone and encode the voice utterance in a suitable format for storage and transmission using any CODEC as will occur to those of skill in the art. In a thin client architecture, the VoiceXML interpreter (192) may receive (618) the voice utterance (620) from the multimodal application (195) according to the method of FIG. 6 as part of a call by the multimodal application (195) to a voice services module (130 on FIG. 3) to provide voice recognition services. The voice services module then, in turn, passes the voice utterance (620) to the VoiceXML interpreter (192) through a VOIP connection (216 on FIG. 3) and a voice server application (188 on FIG. 3). In a thick client architecture, the VoiceXML interpreter (192) may receive (618) the voice utterance (620) from the multimodal application (195) according to the method of FIG. 6 as part of a call directly to an embedded VoiceXML interpreter (192) by the multimodal application (195) through an API exposed by the VoiceXML interpreter (192).

[0117] The method of FIG. 6 includes determining (622), by the VoiceXML interpreter (192) using an ASR engine, recognition results (624) in dependence upon the voice utterance (620) and the grammar (104). The recognition results (624) of FIG. 6 represent a user selection of a particular predictive text. The VoiceXML interpreter (192) may determine (622) recognition results (624) using the ASR engine according to the method of FIG. 6 by passing the voice utterance (620) and the grammar (104) created by the VoiceXML interpreter (192) to an ASR engine for speech recognition, receiving the recognition results (624) from the ASR engine, and storing the recognition results (624) in an ECMAScript data structure such as, for example, the application variable array ‘application.lastresult$’ or some other field variable array for a VoiceXML field specified by the X+V page (124). ECMAScript data structures represent objects in the Document Object Model (‘DOM’) at the scripting level in an X+V page.

[0118] The ‘application.lastresult$’ array holds information about the last recognition generated by an ASR engine for the VoiceXML interpreter (192). The ‘application.lastresult$’ is an array of elements where each element, application.lastresult$[i], represents a possible result through the following shadow variables:

[0119] application.lastresult$[i].confidence, which specifies the confidence level for this recognition result. A value of 0.0 indicates minimum confidence, and a value of 1.0 indicates maximum confidence.

[0120] application.lastresult$[i].utterance, which is the raw string of words that compose this recognition result. The exact tokenization and spelling is platform-specific (e.g. "five hundred thirty" or "5 hundred 30" or even "530").

[0121] application.lastresult$[i].inputmode, which specifies the mode in which the user provided the voice utterance. Typically, the value is voice for a voice utterance.

[0122] application.lastresult$[i].interpretation, which is an ECMAScript variable containing output from ECMAScript post-processing script typically used to reformat the value contained in the ‘utterance’ shadow variable.
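For example, an ECMAScript fragment in an X+V page might read the most confident result through these shadow variables as follows (a minimal sketch):

    // Read the most confident recognition result through the shadow
    // variables described above.
    var best  = application.lastresult$[0];
    var words = best.utterance;    // e.g. 'restaurant'
    var score = best.confidence;   // 0.0 (minimum) to 1.0 (maximum)
    var mode  = best.inputmode;    // typically 'voice'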
[0123] When the VoiceXML interpreter (192) stores the recognition results (624) in an ECMAScript field variable array for a field specified in the multimodal application (195), the recognition results (624) may be stored in the field variable array using shadow variables similar to the application variable ‘application.lastresult$.’ For example, a field variable array may represent a possible recognition result through the following shadow variables:

[0124] name$[i].confidence,

[0125] name$[i].utterance,

[0126] name$[i].inputmode, and

[0127] name$[i].interpretation,

[0128] where ‘name$’ is a placeholder for the field identifier for a VoiceXML field in the multimodal application (195) specified to store the recognition results (624). For example, a field variable array identified as ‘search$’ may be used to store recognition results for the ‘search’ field of the ‘voice-search’ dialog in the exemplary multimodal application segment above.

[0129] The method of FIG. 6 also includes rendering (626), by the VoiceXML interpreter (192), at least a portion of the recognition results (624) in the text input field (101). The VoiceXML interpreter (192) may render (626) at least a portion of the recognition results (624) in the text input field (101) according to the method of FIG. 6 by assigning the recognition result (624) having the highest confidence level to an element of the DOM representing the text input field (101) and allowing the multimodal browser (196) to refresh a GUI with the new value for the element of the DOM representing the text input field (101). For further explanation, consider again the exemplary multimodal application segment:
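A sketch of the assignment portion of the ‘voice-search’ dialog, consistent with the explanation in the following paragraphs; the <filled> and <assign> structure follows standard VoiceXML usage, while the assignment expression itself is an illustrative assumption:

    <vxml:field name="search">
      <vxml:prompt id="prompt1"></vxml:prompt>
      <vxml:grammar id="word-grammar"></vxml:grammar>
      <vxml:filled>
        <!-- executed only when the 'search' field has been filled;
             assigns the most confident result to the DOM element
             representing the text input field 'input1' -->
        <vxml:assign name="document.getElementById('input1').value"
                     expr="search$[0].utterance"/>
      </vxml:filled>
    </vxml:field>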
[0130] As mentioned above, the recognition results obtained from executing the ‘voice-search’ dialog may be stored in the ‘search’ field variable array, which is ordered according to each result’s confidence level from highest to lowest. The exemplary multimodal application segment above assigns the value of the recognition result having the highest confidence level to the element of the DOM representing the text input field identified as ‘input1.’

[0131] To further understand how the VoiceXML interpreter (192) assigns at least a portion of the recognition results (624) to a DOM element representing the text input field (101), readers will note that the assignment is contained in a VoiceXML <assign> element, which is in turn contained in a VoiceXML <filled> element. The exemplary <filled> element above is only executed by the VoiceXML interpreter (192) when the VoiceXML interpreter (192) is able to fill the field specified by the parent <field> element with a value. For example, the VoiceXML interpreter (192) will execute the exemplary <filled> element above when the ‘search’ field of the ‘voice-search’ dialog is filled with a value from the recognition result ‘application.lastresult$.’ Upon executing the exemplary <filled> element, the VoiceXML interpreter (192) assigns the recognition result having the highest confidence level to a DOM element representing the text input field (101).

[0132] Exemplary embodiments of the present invention are described largely in the context of a fully functional computer system for speech-enabled predictive text selection for a multimodal application. Readers of skill in the art will recognize, however, that the present invention also may be embodied in a computer program product disposed on signal bearing media for use with any suitable data processing system. Such signal bearing media may be transmission media or recordable media for machine-readable information, including magnetic media, optical media, or other suitable media. Examples of recordable media include magnetic disks in hard drives or diskettes, compact disks for optical drives, magnetic tape, and others as will occur to those of skill in the art. Examples of transmission media include telephone networks for voice communications and digital data communications networks such as, for example, Ethernets™ and networks that communicate with the Internet Protocol and the World Wide Web. Persons skilled in the art will immediately recognize that any computer system having suitable programming means will be capable of executing the steps of the method of the invention as embodied in a program product. Persons skilled in the art will recognize immediately that, although some of the exemplary embodiments described in this specification are oriented to software installed and executing on computer hardware, nevertheless, alternative embodiments implemented as firmware or as hardware are well within the scope of the present invention.

[0133] It will be understood from the foregoing description that modifications and changes may be made in various embodiments of the present invention without departing from its true spirit. The descriptions in this specification are for purposes of illustration only and are not to be construed in a limiting sense. The scope of the present invention is limited only by the language of the following claims.

What is claimed is:

1. A computer-implemented method of speech-enabled predictive text selection for a multimodal application, the multimodal application operating on a multimodal device supporting multiple modes of interaction including a voice mode and one or more non-voice modes, the multimodal application operatively coupled to an automatic speech recognition (‘ASR’) engine through a VoiceXML interpreter, the method comprising:
identifying, by the VoiceXML interpreter, a text prediction event, the text prediction event characterized by one or more predictive texts for a text input field of the multimodal application;
creating, by the VoiceXML interpreter, a grammar in dependence upon the predictive texts;
receiving, by the VoiceXML interpreter, a voice utterance from a user; and
determining, by the VoiceXML interpreter using the ASR engine, recognition results in dependence upon the voice utterance and the grammar, the recognition results representing a user selection of a particular predictive text.

2. The method of claim 1 further comprising rendering, by the VoiceXML interpreter, at least a portion of the recognition results in the text input field.

3. The method of claim 1 further comprising:
creating, by the VoiceXML interpreter, a user prompt for the voice utterance in dependence upon the predictive texts; and
prompting, by the VoiceXML interpreter, the user for the voice utterance in dependence upon the user prompt.

4. The method of claim 1 further comprising rendering, by a multimodal browser, the predictive texts on a graphical user interface of the multimodal device in dependence upon the text prediction event.

5. The method of claim 1 wherein creating, by the VoiceXML interpreter, a grammar in dependence upon the predictive texts further comprises: generating a grammar rule for the grammar, the grammar rule specifying each predictive text as an alternative for recognition.

6. The method of claim 1 wherein the text prediction event occurs when the user types a character in the text input field of the multimodal application.

7. The method of claim 1 wherein the text prediction event occurs when the user speaks a character for input in the text input field of the multimodal application.

8. Apparatus for speech-enabled predictive text selection for a multimodal application, the multimodal application operating on a multimodal device supporting multiple modes of interaction including a voice mode and one or more non-voice modes, the multimodal application operatively coupled to an automatic speech recognition (‘ASR’) engine through a VoiceXML interpreter, the apparatus comprising a computer processor and a computer memory operatively coupled to the computer processor, the computer memory having disposed within it computer program instructions capable of:
identifying, by the VoiceXML interpreter, a text prediction event, the text prediction event characterized by one or more predictive texts for a text input field of the multimodal application;
creating, by the VoiceXML interpreter, a grammar in dependence upon the predictive texts;
receiving, by the VoiceXML interpreter, a voice utterance from a user; and
determining, by the VoiceXML interpreter using the ASR engine, recognition results in dependence upon the voice utterance and the grammar, the recognition results representing a user selection of a particular predictive text.

9. The apparatus of claim 8 further comprising computer program instructions capable of rendering, by the VoiceXML interpreter, at least a portion of the recognition results in the text input field.

10. The apparatus of claim 8 further comprising computer program instructions capable of:
creating, by the VoiceXML interpreter, a user prompt for the voice utterance in dependence upon the predictive texts; and
prompting, by the VoiceXML interpreter, the user for the voice utterance in dependence upon the user prompt.

11. The apparatus of claim 8 further comprising computer program instructions capable of rendering, by a multimodal browser, the predictive texts on a graphical user interface of the multimodal device in dependence upon the text prediction event.