US 20080235029A1
(19) United States
(12) Patent Application Publication — Cross et al.
(10) Pub. No.: US 2008/0235029 A1
(43) Pub. Date: Sep. 25, 2008

(54) SPEECH-ENABLED PREDICTIVE TEXT SELECTION FOR A MULTIMODAL APPLICATION

(76) Inventors: Charles W. Cross, Wellington, FL (US); Igor R. Jablokov, Charlotte, NC (US)

Correspondence Address: INTERNATIONAL CORP (BLF), c/o BIGGERS & OHANIAN, LLP, P.O. BOX 1469, AUSTIN, TX 78767-1469 (US)

(21) Appl. No.: 11/690,471

(22) Filed: Mar. 23, 2007

Publication Classification

(51) Int. Cl. G10L 11/00 (2006.01)
(52) U.S. Cl. ........ 704/275; 704/E15.001

(57) ABSTRACT

Methods, apparatus, and products are disclosed for speech-enabled predictive text selection for a multimodal application, the multimodal application operating on a multimodal device supporting multiple modes of interaction including a voice mode and one or more non-voice modes, the multimodal application operatively coupled to an automatic speech recognition ('ASR') engine through a VoiceXML interpreter, including: identifying, by the VoiceXML interpreter, a text prediction event, the text prediction event characterized by one or more predictive texts for a text input field of the multimodal application; creating, by the VoiceXML interpreter, a grammar in dependence upon the predictive texts; receiving, by the VoiceXML interpreter, a voice utterance from a user; and determining, by the VoiceXML interpreter using the ASR engine, recognition results in dependence upon the voice utterance and the grammar, the recognition results representing a user selection of a particular predictive text.
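The abstract's 'predictive texts' for a text input field can be illustrated with a minimal prefix-matching sketch. This is a hedged illustration only: the vocabulary, function name, and matching strategy are assumptions for exposition, not taken from the patent, and real predictive-text engines also weigh the context of previously typed words.

```python
def predict(prefix, vocabulary, limit=3):
    """Return up to `limit` candidate words beginning with the letters
    typed so far (a simplified stand-in for predictive text input
    technology, which also uses accumulated word context)."""
    prefix = prefix.lower()
    matches = [w for w in vocabulary if w.lower().startswith(prefix)]
    return sorted(matches)[:limit]

# A user has typed 're' into the text field; these are the candidates.
vocabulary = ["research", "restaurant", "rest", "date", "dinner"]
candidates = predict("re", vocabulary)
```

In the patent's scheme, such candidates are what the VoiceXML interpreter receives in a text prediction event.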
[Drawing sheets 1–6: OCR residue; only the content summarized below is recoverable.]

FIG. 1 — Network diagram: multimodal devices running a multimodal application (with an X+V page containing a text field 101), a multimodal browser, a VoiceXML interpreter, and a speech engine, connected over a data communications network to a web server (147) serving X+V pages and to a voice server running a voice server application with its own VoiceXML interpreter and speech engine; voice prompts and responses flow to the user, speech flows to the system for recognition.

FIG. 2 — Block diagram of an exemplary voice server: processor, front-side bus, bus adapter, and RAM holding the voice server application, a VoiceXML interpreter (with dialog and form interpretation algorithm) and a speech engine comprising an ASR engine with grammar (104), lexicon (106), and acoustic model, plus a TTS engine; also a video adapter, expansion bus, operating system, communications adapter, I/O adapter, disk drive adapter, user input devices, and data storage connecting to other computers over a data communications network.

FIG. 3 — Thin-client architecture: a multimodal device (multimodal browser, multimodal application, X+V page, text field 101, sound card, speaker) connected through a voice services module and a VOIP connection over the data communications network to a web server (147) and to a voice server whose voice server application uses a VoiceXML interpreter (form interpretation algorithm, dialog) and a speech engine (ASR engine, grammar, lexicon, acoustic model, TTS engine).

FIG. 4 — Thick-client architecture: a multimodal device whose RAM holds the multimodal application (X+V page, text field), multimodal browser, VoiceXML interpreter (dialog, form interpretation algorithm), and speech engine (ASR engine, grammar 104, lexicon 106, acoustic model, TTS engine), together with a sound card and codec, communications adapter, and related hardware connecting to other computers over a data communications network.

FIG. 5 — A multimodal device GUI showing a text input field (101) holding the typed letters 're' and predictive texts for those letters, such as 'research' and 'restaurant'.

FIG. 6 — Flow chart of speech-enabled predictive text selection: the multimodal browser renders the predictive texts on a GUI of the multimodal device in dependence upon a text prediction event; the VoiceXML interpreter identifies a text prediction event characterized by one or more predictive texts for a text field of the multimodal application; creates a grammar in dependence upon the predictive texts, generating a grammar rule for the grammar; creates a user prompt for a voice utterance in dependence upon the predictive texts; prompts the user for the voice utterance in dependence upon the user prompt; receives a voice utterance from the user; determines, using an ASR engine, recognition results in dependence upon the voice utterance and the grammar; and renders at least a portion of the recognition results in the text field.
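The flow recoverable from FIG. 6 can be sketched end to end as follows. This is a minimal Python sketch under stated assumptions: the class and method names are invented for illustration, and a real X+V implementation would pass the grammar and utterance to an ASR engine rather than do exact string matching.

```python
class PredictiveTextSelector:
    """Toy stand-in for the VoiceXML-interpreter flow of FIG. 6:
    a text prediction event supplies candidate words, a grammar is
    created from them, and a user utterance is matched against it."""

    def __init__(self):
        self.grammar = set()

    def on_text_prediction_event(self, predictive_texts):
        # Identify the text prediction event and create a grammar
        # in dependence upon its predictive texts.
        self.grammar = {t.lower() for t in predictive_texts}

    def recognize(self, voice_utterance):
        # Receive a voice utterance and determine recognition results
        # against the grammar (exact match here; an ASR engine in the
        # patent's design).
        word = voice_utterance.strip().lower()
        return word if word in self.grammar else None

selector = PredictiveTextSelector()
selector.on_text_prediction_event(["research", "restaurant"])
text_field = selector.recognize("restaurant")  # rendered in the text field
```

Constraining recognition to the predictive texts is what lets the spoken selection replace the manual GUI selection that the Background identifies as a drawback.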
SPEECH-ENABLED PREDICTIVE TEXT SELECTION FOR A MULTIMODAL APPLICATION

BACKGROUND OF THE INVENTION

[0001] 1. Field of the Invention

[0002] The field of the invention is data processing, or, more specifically, methods, apparatus, and products for speech-enabled predictive text selection for a multimodal application.

[0003] 2. Description of Related Art

[0004] User interaction with applications running on small devices through a keyboard or stylus has become increasingly limited and cumbersome as those devices have become increasingly smaller. In particular, small handheld devices like mobile phones and PDAs serve many functions and contain sufficient processing power to support user interaction through multimodal access, that is, by interaction in non-voice modes as well as voice mode. Devices which support multimodal access combine multiple user input modes or channels in the same interaction, allowing a user to interact with the applications on the device simultaneously through multiple input modes or channels. The methods of input include speech recognition, keyboard, touch screen, stylus, mouse, handwriting, and others. Multimodal input often makes using a small device easier.

[0005] Multimodal applications are often formed by sets of markup documents served up by web servers for display on multimodal browsers. A 'multimodal browser,' as the term is used in this specification, generally means a web browser capable of receiving multimodal input and interacting with users with multimodal output, where modes of the multimodal input and output include at least a speech mode. Multimodal browsers typically render web pages written in XHTML+Voice ('X+V'). X+V provides a markup language that enables users to interact with a multimodal application, often running on a server, through spoken dialog in addition to traditional means of input such as keyboard strokes and mouse pointer action. Visual markup tells a multimodal browser what the user interface is to look like and how it is to behave when the user types, points, or clicks. Similarly, voice markup tells a multimodal browser what to do when the user speaks to it. For visual markup, the multimodal browser uses a graphics engine; for voice markup, the multimodal browser uses a speech engine. X+V adds spoken interaction to standard web content by integrating XHTML (eXtensible Hypertext Markup Language) and speech recognition vocabularies supported by VoiceXML. For visual markup, X+V includes the XHTML standard. For voice markup, X+V includes a subset of VoiceXML. For synchronizing the VoiceXML elements with corresponding visual interface elements, X+V uses events. XHTML includes voice modules that support speech synthesis, speech dialogs, command and control, and speech grammars. Voice handlers can be attached to XHTML elements and respond to specific events. Voice interaction features are integrated with XHTML and can consequently be used directly within XHTML content.

[0006] In addition to X+V, multimodal applications also may be implemented with Speech Application Tags ('SALT'). SALT is a markup language developed by the Salt Forum. Both X+V and SALT are markup languages for creating applications that use voice input/speech recognition and voice output/speech synthesis. Both SALT applications and X+V applications use underlying speech recognition and synthesis technologies or 'speech engines' to do the work of recognizing and generating human speech. As markup languages, both X+V and SALT provide markup-based programming environments for using speech engines in an application's user interface. Both languages have language elements, markup tags, that specify what the speech-recognition engine should listen for and what the synthesis engine should 'say.' Whereas X+V combines XHTML, VoiceXML, and the XML Events standard to create multimodal applications, SALT does not provide a standard visual markup language or eventing model. Rather, it is a low-level set of tags for specifying voice interaction that can be embedded into other environments. In addition to X+V and SALT, multimodal applications may be implemented in Java with a Java speech framework, in C++, for example, and with other technologies and in other environments as well.

[0007] As mentioned above, a user may interact with a multimodal application by typing text on a keypad of a multimodal device. The drawback to this mode of user interaction is that it is difficult for a user to enter text because the small size of the device typically prohibits providing a full-size keyboard to the user. To partially overcome this limitation, predictive text input technology has been developed that accumulates a context composed of the words already typed by a user and the letters of the word currently being typed by the user. Such predictive text input technology uses the accumulated context to predict several possible words that the user intends to input. The user may then select the word that matches the user's intended input, thereby reducing the number of keystrokes required by the user. The drawback to current predictive text input technology, however, is that the user must manually select one of several possible words as the user's intended input through a graphical user interface. Furthermore, current predictive text input technology in general does not take advantage of the speech mode of user interaction available to a user of a multimodal device. Readers will therefore appreciate that room for improvement exists in predictive text selection for a multimodal application.

SUMMARY OF THE INVENTION

[0008] Methods, apparatus, and products are disclosed for speech-enabled predictive text selection for a multimodal application, the multimodal application operating on a multimodal device supporting multiple modes of interaction including a voice mode and one or more non-voice modes, the multimodal application operatively coupled to an automatic speech recognition ('ASR') engine through a VoiceXML interpreter, including: identifying, by the VoiceXML interpreter, a text prediction event, the text prediction event characterized by one or more predictive texts for a text input field of the multimodal application; creating, by the VoiceXML interpreter, a grammar in dependence upon the predictive texts; receiving, by the VoiceXML interpreter, a voice utterance from a user; and determining, by the VoiceXML interpreter using the ASR engine, recognition results in dependence upon the voice utterance and the grammar, the recognition results representing a user selection of a particular predictive text.

[0009] The foregoing and other objects, features and advantages of the invention will be apparent from the following more particular descriptions of exemplary embodiments of the invention as illustrated in the accompanying drawings
wherein like reference numbers generally represent like parts of exemplary embodiments of the invention.

BRIEF DESCRIPTION OF THE DRAWINGS

[0010] FIG. 1 sets forth a network diagram illustrating an exemplary system for speech-enabled predictive text selection for a multimodal application according to embodiments of the present invention.

[0017] The multimodal browser (196) of FIG. 1 provides an execution environment for the multimodal application (195). To support the multimodal browser (196) in processing the multimodal application (195), the system of FIG. 1 includes a VoiceXML interpreter (192). The VoiceXML interpreter (192) is a software module of computer program instructions that accepts voice dialog instructions and other data from a multimodal application, typically in the form of a VoiceXML
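Paragraph [0008] describes the VoiceXML interpreter creating a grammar in dependence upon the predictive texts. The patent does not specify a grammar format at this point in the text; as a hedged illustration, such a rule might be assembled as a JSGF-style alternation over the candidate words (the grammar syntax, grammar name, and helper name below are assumptions, not the patent's):

```python
def grammar_from_predictive_texts(predictive_texts):
    """Build a JSGF-style grammar whose single public rule's
    alternatives are the predictive texts, so an ASR engine
    constrained by this grammar can only return one of them."""
    alternatives = " | ".join(predictive_texts)
    return ("#JSGF V1.0;\n"
            "grammar textprediction;\n"
            f"public <selection> = {alternatives} ;")

grammar_text = grammar_from_predictive_texts(["research", "restaurant"])
```

A grammar of this shape would then be handed to the ASR engine so that recognition results necessarily represent a user selection of a particular predictive text.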