An Intelligent Interface for Keyboard and Mouse Control - Providing Full Access to PC Functionality via Speech

Bill Manaris, Computer Science Department, College of Charleston, Charleston, SC 29424, [email protected]
Renée McCauley, Computer Science Department, College of Charleston, Charleston, SC 29424, [email protected]
Valanne MacGyvers, Psychology Department, University of Louisiana, Lafayette, LA 70504, [email protected]

ABSTRACT

SUITEKeys is a speech user interface for motor-disabled computer users. This interface provides access to all available functionality of a computer by modeling interaction at the physical keyboard and mouse level. SUITEKeys is currently implemented for MS Windows platforms. Its architecture integrates a continuous, speaker-independent recognition engine with natural language processing components. It makes extensive use of dialog management to improve recognition accuracy. SUITEKeys extends the speech-keyboard metaphor through functionality visible on its graphical user interface, and accessible through speech. Experimental results indicate that speech input for alphanumeric data entry is a much more effective modality than existing alternatives for the target user group. Such alternatives include miniaturized keyboards, stylus "soft" keyboards, and handwriting recognition software.

Keywords

Speech recognition, natural language processing, intelligent user interfaces, accessibility.

INTRODUCTION

Studies indicate that speech interaction with a virtual keyboard and mouse is a very effective input modality for motor-control challenged users, in terms of data entry, task completion, and error rates [1, 2, 3]. These results apply to both permanent and temporary (task-induced) motor disabilities. Examples of the latter include users entering arbitrary alphanumeric data in mobile devices and hands-busy environments. Speech interaction with a virtual keyboard and mouse has been shown to be far better than alternative modalities such as mouthstick, handstick,¹ miniaturized keyboards, stylus "soft" keyboards, and handwriting recognition software [3].

This paper presents SUITEKeys 1.0, a speech user interface for providing access to a virtual keyboard and mouse. SUITEKeys integrates state-of-the-art components for speech recognition, natural language processing, and dialog management, to provide a speaker-independent, continuous-speech interface. It is implemented on MS Windows platforms (95, 98, NT, 2000, and potentially Windows CE, for palmtop PCs).

This speech-interface concept provides motor-disabled users with universal access to any device that requires/supports alphanumeric data entry, such as palmtop PCs, cellular phones, and camcorders. It may also be used by able-bodied users with temporary, task-induced motor disabilities, such as users entering arbitrary (non-word) alphanumeric data in mobile computing and/or hands-busy environments. Specifically, as the physical dimensions of traditional input devices shrink, the motor skills of the user become less effective, almost inadequate. For instance, studies show that users may experience severe reduction in data entry speed (words per minute) when switching from a regular QWERTY keyboard to a mobile-device keyboard alternative. Specifically, these studies report a 75% reduction on a stylus-based "soft" keyboard (PalmPilot) and a 60% reduction on a telephone keypad [4, 5].

SUITEKeys allows users to transcribe sequences of keystrokes and mouse actions using spoken language. It assumes that the user speaks English and has minimal or no speech impediments. It models an 84-key keyboard (no numeric pad) and two-button mouse functionality.² Other applications, including the operating system, treat the generated keyboard and mouse events as having originated from the corresponding physical devices.

Other speech applications, such as NaturallySpeaking and Voice, do not provide access to all the functionality available through the physical keyboard and mouse [6, 7]. This is because they are designed for higher-level tasks, such as Dictation and Command & Control. They allow input of keystrokes either by requiring a fixed keyword as prefix (e.g., "Press a"), or by entering a spell mode.

Copyright © 2001, AAAI. All rights reserved.

¹ Mouthstick and handstick are devices used by motor-disabled users to manipulate QWERTY keyboards.
² SUITEKeys models only the left keys for ALT, CONTROL, and SHIFT.

From: FLAIRS-01 Proceedings. Copyright © 2001, AAAI (www.aaai.org). All rights reserved.

Moreover, such applications do not necessarily model all 84 keys. Finally, they send recognized keystrokes directly to the active window, bypassing the operating system. This prohibits the operating system and other applications from attaching arbitrary semantics to sequences of keystrokes (e.g., ALT-CONTROL-DEL), and intercepting them.

OVERVIEW

SUITEKeys was originally developed as a testbed application for the SUITE speech-understanding-interface architecture at the University of Louisiana at Lafayette [8]. It has been finalized at the College of Charleston. It is distributed as freeware for non-profit use. An early prototype of the system, the underlying architecture, and theoretical foundations for this work were presented in ASSETS-98, FLAIRS-99, and elsewhere [1, 2, 9].

SUITEKeys has been implemented using MS Visual Studio, SAPI, and Lex and YACC. Minimum system requirements include: Pentium 120 processor (preferred Pentium 200 and up); 16MB RAM for Windows 95 & 98, 24MB RAM for Windows NT, 64MB RAM for Windows 2000; 18MB disk space; sound card and sound card device driver supported in Windows; microphone (preferred noise-reducing, close-talk headset); and 16 kHz/16 bit or 8 kHz/16 bit sampling rate for the input stream.

SUITEKeys is available at www.cs.cofc.edu/~manaris/SUITEKeys/.

Domain Limitations

Although the system incorporates a continuous, speaker-independent engine, this capability could not be exploited in all cases. This is due to speech ambiguities in the linguistic domain which normally confuse even human listeners. For instance, the linguistic domain includes several near-homophone letters, such as "b" and "p", or "d" and "t".³ For this reason, a few tasks have been modeled only through discrete speech. This requires that the user pause between words. Although this may make interaction less natural in some cases, it improves understanding accuracy⁴ considerably.

Specifically, we use discrete speech mainly when transcribing regular letters (e.g., "a", "b"). We also use it in a few other cases, where our usability studies showed that users used it anyway, such as entering function keys (e.g., "Function Eleven", "Function Twelve") and certain special keys (e.g., "Insert", "Home", "Print Screen"). We use continuous speech in all other cases, such as entering letters using the military alphabet, numbers, repeatable keys (e.g., "Up Arrow", "Tab", "Back Space"), mouse commands, and system commands.

³ Hence the invention of the military alphabet.
⁴ We use the term "understanding" as opposed to "recognition," since the system, in addition to translating an utterance to its ASCII representation (recognition), also carries out the intended meaning (understanding), such as type a letter, move the mouse, and switch speakers.

SYSTEM ARCHITECTURE

The architecture of SUITEKeys integrates speech recognition and natural language processing components (see Figure 1). It processes input that originates as speech events and converts it to operating system keyboard and mouse events. The complete system runs on top of the operating system like any other application. Other applications are unaware that keyboard/mouse events originated as speech input.

Fig. 1. SUITEKeys Architecture.

The architecture is subdivided into (a) the language model and (b) language processing components. Speech events are processed in a pipelined fashion to enable real-time response [2, 8]. Processing is performed by the following components:

Dialog Management

The SUITEKeys architecture incorporates a stack-based dialog manager. This subsystem utilizes a dialog grammar containing dialog states and actions. Each top-level state corresponds to a specific language model (e.g., lexicon, grammar). The overall linguistic domain has been divided into thirteen such states. The dialog manager loads, in real time, the appropriate dialog state based on the current state of the interaction. Given the ambiguity of the linguistic domain, this allows processing components to focus on specific subsets of the overall domain, and thus improve recognition accuracy and performance.

Speech Processing

SUITEKeys evolved through several architectures following the rapid developments in speech recognition within the last six years. Currently, it implements the Microsoft Speech API (SAPI). The latest version of SUITEKeys comes bundled with the MS Speech Recognition engine (freeware), but could also work with other SAPI-compliant engines, such as Dragon Systems NaturallySpeaking [6].
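The stack-based dialog management scheme described above can be sketched as follows. This is an illustrative reconstruction, not the actual SUITEKeys source: the names DialogState, DialogManager, and handle are invented, and the real system uses thirteen states with full lexicons and grammars.

```python
# Illustrative sketch of a stack-based dialog manager. Each top-level
# dialog state carries its own small language model, so the recognizer
# only ever considers one subset of the overall linguistic domain.

class DialogState:
    def __init__(self, name, lexicon, grammar):
        self.name = name
        self.lexicon = lexicon    # words the recognizer listens for in this state
        self.grammar = grammar    # maps a recognized phrase to an action

class DialogManager:
    def __init__(self, initial_state):
        self.stack = [initial_state]   # active state is the top of the stack

    @property
    def active(self):
        return self.stack[-1]

    def handle(self, phrase):
        """Dispatch a recognized phrase using the active state's grammar."""
        action = self.active.grammar.get(phrase)
        if action is None:
            return "???"              # unrecognized in this context
        new_state = action()          # an action may return a state to push
        if new_state is not None:
            self.stack.append(new_state)
        return phrase

# Example: a tiny "letters" sub-state entered by saying "spell".
letters = DialogState("letters", ["alpha", "bravo"],
                      {"alpha": lambda: None, "bravo": lambda: None})
top = DialogState("top", ["spell"], {"spell": lambda: letters})

dm = DialogManager(top)
dm.handle("spell")          # pushes the letters state
print(dm.active.name)       # -> letters
```

Restricting the grammar to the active state is what lets near-homophone letters be disambiguated: "bravo" only competes with the other military-alphabet words, not the whole command set.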

Natural Language Processing

The architecture incorporates a left-to-right, top-down, non-deterministic parser. This parser supports multiple parse trees, thus providing for ambiguity handling at the semantic level. It incorporates semantic actions for constructing semantic interpretations of user input.

Code Generator

This module converts semantic interpretations, the output of the parser, to low-level code understood by the operating system.

Other Components

Other components of the SUITEKeys architecture include a knowledge-base manager, speech generator, lexical analyzer, and pragmatic analyzer. The application domain (as implemented in the final release of SUITEKeys) does not have significant lexical, pragmatic, and speech generation requirements. A detailed discussion of these components appears in [2].

GRAPHICAL USER INTERFACE

Although SUITEKeys is primarily a speech user interface, it includes a graphical user interface for visibility and feedback purposes. Specifically, the user interface identifies available functionality and how to access it (visibility); it also provides information about the effects of user actions (feedback). The SUITEKeys graphical user interface has two views: maximized and minimized.

Maximized View

This view is designed to provide maximum feedback to the user. As shown in Figure 2, this view allows the user to focus on SUITEKeys and its operation. This view provides several forms of feedback (left-to-right, top-to-bottom):

• Command History: a list of recognized commands in reverse chronological order. Unrecognized commands are denoted by "???".
• Speech Processing: superimposed animations provide feedback for two independent yet related events: voice input level corresponds to raising and dropping rings in green, yellow, and red (Figure 3); speech engine processing corresponds to a revolving "SK" (Figure 4).
• Toggle Keys: on/off indicators for CAPS LOCK and SCROLL LOCK.
• Generated Keystrokes: a keyboard that visually "echoes" generated keystrokes.

Fig. 2: SUITEKeys maximized view. Fig. 3: Input level. Fig. 4: Input processing.

Additionally, this view identifies all system commands through its menu structure. Since some system commands disconnect the speech engine from the microphone (e.g., Microphone Calibration), such commands are not accessible through speech.⁵ This is to prevent users from entering system states from which they may not be able to exit.

Minimized View

Effective user interfaces should be invisible to the user. As Mark Weiser states, "[a] good tool is an invisible tool. By invisible, I mean that the tool does not intrude on your consciousness; you focus on the task, not the tool. Eyeglasses are a good tool -- you look at the world, not the eyeglasses." [10]

For this reason, SUITEKeys provides a minimized view. This view is intended for when the user wants to concentrate on the task at hand, i.e., controlling the PC, as opposed to SUITEKeys itself (see Figure 5). This view always stays on top and provides minimal feedback, namely the last-recognized input, which modifier/toggle keys are pressed, the volume indicator, and the speech engine processing status.

Fig. 5: SUITEKeys minimized view.

⁵ These commands are only accessible to users with some motor control via the physical keyboard/mouse interface.
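The pipeline described in the System Architecture section, from spoken token to semantic interpretation to low-level events, can be sketched as follows. This is a hypothetical illustration: the MILITARY table, interpret, and generate_events are invented names, and the real Code Generator emits operating-system keyboard events rather than returning tuples.

```python
# Illustrative sketch of the code-generation step. A semantic
# interpretation such as ("tap", "a") is turned into low-level
# press/release events that downstream applications cannot
# distinguish from physical keystrokes.

MILITARY = {"alpha": "a", "bravo": "b", "charlie": "c"}  # ... through "zulu"

def interpret(token):
    """Map a spoken token to the character it stands for."""
    return MILITARY.get(token, token)

def generate_events(interpretation):
    """Expand a (action, key) interpretation into low-level key events."""
    action, key = interpretation
    if action == "tap":
        return [("keydown", key), ("keyup", key)]
    if action == "press":                 # e.g., holding a modifier
        return [("keydown", key)]
    if action == "release":
        return [("keyup", key)]
    raise ValueError(f"unknown action: {action}")

events = generate_events(("tap", interpret("bravo")))
print(events)   # -> [('keydown', 'b'), ('keyup', 'b')]
```

Separating "press" and "release" from "tap" is what lets spoken modifier sequences (e.g., "Press Control" ... "t" ... "Release Control") reproduce chorded keystrokes.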

KEYBOARD SPEECH COMMANDS

SUITEKeys allows users to input sequences of keystrokes using spoken language, much as if they were transcribing to someone how to perform the same actions with a physical keyboard. Specifically, keys may be tapped, pressed, or released. Typing errors may be handled similarly to physical keyboard-entry errors, i.e., via the "Back Space" key.

In the following sections, unless otherwise specified, users may use continuous speech.

Typing Alphanumeric Keys

Given the difficulty of recognizing certain letters of the alphabet, the system supports both regular and military pronunciations.

Saying the name of a letter (e.g., "a", "b") sends the corresponding letter to the active window. To improve accuracy, only discrete speech may be used for sequences of alphanumeric keystrokes.

Saying the military equivalent of a letter (e.g., "alpha", "bravo") sends the corresponding letter (e.g., "a", "b") to the active window.

Saying a number (e.g., "one", "two") sends the corresponding number to the active window.

Word Auto-Completion

Studies show that word prediction techniques improve user performance considerably. This is accomplished by giving the user a choice of the most-likely words, based on previous input. This reduces the overall number of keystrokes a user has to enter by as much as 44% [11].

SUITEKeys incorporates a word auto-completion mechanism. A word prediction box is displayed when the user begins typing a new word. This box is attached to the bottom of the mouse pointer for easy positioning. Word auto-completion allows the user to reduce the number of "keystrokes" needed to enter a previously used word. Instead of typing the entire word, the user types enough letters to make it appear in the prediction box. Saying "Select" types the rest of the word. Words are maintained using a most-recently-used scheme (identical to the paging scheme in operating system literature).

Typing Modifier Keys

Modifier keys are held pressed while typing alphanumeric keys to provide semantic variations.

Saying "Press" and a modifier key (i.e., "Alt", "Control", "Shift") holds the key pressed. Saying "Release" and a modifier key releases the key. The graphical user interface provides feedback as to which modifier keys are pressed at any time.

Typing Toggle Keys

Saying "Caps Lock" or "Scroll Lock" toggles the corresponding key. The graphical user interface identifies toggle key status (on/off).

Typing Other Keys

Saying the name of other keys (e.g., "Function Twelve", "Escape", "Space", "Ampersand", "Tab", "Left Arrow") types the corresponding key.

MOUSE SPEECH COMMANDS

SUITEKeys allows users to manipulate a two-button mouse using spoken language, much as if they were transcribing to someone how to perform the same actions with a physical mouse. Specifically, the mouse may be moved in different directions; its buttons may be clicked, double-clicked, pressed, or released.

SUITEKeys provides three different techniques for setting the position of the mouse pointer.

Continuous Motion

Users may smoothly move the mouse pointer towards a specific direction. Movement continues until a command is spoken (e.g., "Stop"), or until the pointer reaches a screen border. This is intended for rough positioning of the pointer, when the target is far from the original position.

For example, saying "move mouse down" ... "stop" will move the pointer downward some distance. Saying "move mouse two o'clock" ... "stop" will move the pointer towards⁶ two o'clock.

Relative Positioning

Users may reposition the mouse pointer a number of units from its original position. This is intended for fine-tuning the mouse position, when the mouse pointer is near the target.

For example, saying "move mouse down three" will reposition the pointer 3 units below its original position. Saying "move mouse three o'clock forty" will reposition the pointer 40 units to the right.

SUITEKeys subdivides the screen into 100 units, "zero" to "ninety-nine", on both horizontal and vertical axes. This makes relative positioning independent of screen resolution.

Absolute Positioning

Users may reposition the mouse pointer to a specific screen location. This is intended for targets whose approximate screen location is known.

For example, saying "set mouse position three five" will place the pointer 3 units away from the left screen border and 5 units away from the top. In other words, "zero zero" corresponds to the top-left corner and "ninety-nine ninety-nine" to the bottom-right corner.

Controlling Mouse Buttons

Mouse buttons may be controlled similarly to keyboard keys. For example, saying "Click left button" or "Double-click right button" will perform the corresponding actions.

⁶ Twelve o'clock is the same as up, three o'clock is right, six o'clock is down, etc.
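The two coordinate conventions above, a resolution-independent 100-unit grid for absolute positioning and clock directions for motion (twelve o'clock is up, three o'clock is right), can be illustrated as follows. The function names are invented for this sketch and do not come from SUITEKeys itself.

```python
# Illustrative sketch of the mouse-positioning arithmetic.
import math

def units_to_pixels(ux, uy, screen_w, screen_h):
    """Absolute positioning: "set mouse position three five" -> units (3, 5).
    Units run "zero" to "ninety-nine" on each axis, so the mapping is
    independent of screen resolution."""
    return (ux * screen_w // 100, uy * screen_h // 100)

def clock_vector(hour):
    """Unit direction vector for a clock direction; y grows downward,
    as in screen coordinates, so twelve o'clock maps to (0, -1)."""
    angle = math.radians(hour * 30 - 90)   # 12 o'clock points up
    return (math.cos(angle), math.sin(angle))

print(units_to_pixels(3, 5, 1024, 768))    # -> (30, 38)
dx, dy = clock_vector(3)                   # three o'clock = right
print(round(dx), round(dy))                # -> 1 0
```

On a 1024x768 screen, "set mouse position three five" thus lands 30 pixels from the left border and 38 pixels from the top, and the same spoken command produces the same relative location at any resolution.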

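The most-recently-used word prediction scheme described under Word Auto-Completion can be sketched as follows. This is an illustrative sketch; WordPredictor is an invented name, and the real mechanism displays its candidate in a prediction box rather than returning it.

```python
# Illustrative sketch of MRU-based word auto-completion: candidates are
# kept most-recently-used first (like page replacement in the operating
# systems literature), and the first match for the typed prefix is the
# word that "Select" would complete.

class WordPredictor:
    def __init__(self):
        self.words = []                 # most recently used first

    def add(self, word):
        """Record a completed word, promoting it to the front."""
        if word in self.words:
            self.words.remove(word)
        self.words.insert(0, word)

    def predict(self, prefix):
        """Return the most recently used word starting with prefix, if any."""
        for word in self.words:
            if word.startswith(prefix):
                return word
        return None

wp = WordPredictor()
for w in ["keyboard", "keystroke", "mouse"]:
    wp.add(w)
print(wp.predict("key"))    # -> keystroke  (more recent than "keyboard")
wp.add("keyboard")          # typing a word again promotes it
print(wp.predict("key"))    # -> keyboard
```

The MRU ordering matters for speech input: recently dictated words are the most likely to recur, so promoting them shortens the prefix the user must spell out before saying "Select".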
Dragging is achieved by pressing the left mouse button (e.g., "Press left button") and moving the mouse. Dropping is achieved by releasing the left mouse button (e.g., "Release left button").

OTHER FUNCTIONALITY

SUITEKeys provides additional functionality through the menu structure in maximized view. This functionality is also accessible through speech, as long as it does not disconnect the microphone from the speech engine.

Context-Sensitive Help: "What Can I Say"

"Studies show that there are many names possible for any object, many ways to say the same thing about it, and many different things to say. Any one person thinks of only one or a few of the possibilities." [12]

For this reason, every dialog state includes the command "What Can I Say". This command brings up a box with available commands in this context. This box also appears after a number of unrecognized inputs (currently 4). For example, Figure 6 shows the corresponding box for the Preferences dialog state. Once a correct command is spoken, the box disappears.

Fig. 6: Context-sensitive help.

Progressive Disclosure: "More Help"

SUITEKeys incorporates a two-step, progressive-disclosure mechanism for help. Specifically, in most dialog states, one of the commands displayed as a result of "What Can I Say" is "More Help". When spoken, this command brings up the corresponding topic in the Help file.

Quick Training

SUITEKeys incorporates training "cards" to teach system essentials to the user. These cards are displayed automatically for every new speaker. Each card corresponds to a training exercise on a subset of SUITEKeys functionality. Each training exercise is associated with a very small linguistic model. Training cards are designed so that a user may proceed to the next lesson only by mastering the concept being presented (e.g., mouse control).

USABILITY STUDIES

We have conducted several usability studies related to SUITEKeys. Most of them were of formative nature; they provided feedback and direction during development [1, 2, 3]. These results indicate that interaction with a speech-accessible keyboard and mouse is a very effective input modality, in terms of user data entry, task completion, and error rates. Specifically, users performed much better using a Wizard-of-Oz model of SUITEKeys as opposed to handstick. Handstick is a good approximation of alternate input modalities such as mouthstick, miniaturized keyboards, stylus "soft" keyboards, and handwriting recognition software. Specifically, the average results on our particular benchmark task were as follows:

• Time to completion: 201.5 secs for speech; 321.1 secs for handstick.
• Amount completed: 99.8% for speech; 99.1% for handstick.
• Data entry speed: 1.336 characters/sec for speech; 0.766 characters/sec for handstick.
• Error rate: 0.033 for speech; 0.09 for handstick.

These results suggest that this speech modality is far better than alternative modalities used in mobile devices that require manual dexterity for alphanumeric data entry. Such modalities are characterized by one or more of the following: decreased physical input area, increased visual scan time, and increased character entry time (e.g., handwriting recognition).

A speech user interface similar to SUITEKeys appears to be relatively easy to learn and to use, particularly for motor challenged and/or computer illiterate users. Anecdotal evidence from the novice subjects of the study suggests that this system is far less intimidating than other interfaces.

In terms of summative evaluation, we have conducted one small-scale experiment using our latest implementation of SUITEKeys. We are also in the process of conducting a large-scale study at the University of Louisiana at Lafayette. The results of the small-scale experiment apply only to speech and are approximate. However, they provide an idea as to how closely the current implementation of SUITEKeys matches the ideal Wizard-of-Oz model. These results are as follows:

• Time to completion: 346.8 secs.
• Amount completed: 100%.
• Data entry speed: 0.695 characters/sec.
• Error rate: 0.057.

Discussion

These preliminary summative results suggest that, in its current implementation, SUITEKeys is a useful application for motor-impaired users. Although it does not appear to be as effective as handstick (at least in terms of data entry speed and task completion time), it is a viable alternative for users who are unable or unwilling to use means that require manual dexterity for alphanumeric data entry. Of course, this assumes a user with no significant speech impediments and a noise-free environment.

As mentioned above, although the system incorporates a continuous speech engine, this capability was not exploited in all cases, in order to improve recognition accuracy.

Obviously, discrete speech input slows down data entry. Research suggests that, in the near future, computer speech-recognition may surpass human speech-recognition [14]. Therefore, the above domain limitations may be specific to the current state-of-the-art in speech recognition. The modular design of SUITEKeys makes it straightforward to improve its speech recognition accuracy by attaching it to newer speech engines, as they become available.

CONCLUSION

SUITEKeys is a continuous-speech user interface for motor-disabled computer users. This interface provides access to all available functionality of a computer by modeling interaction at the level of the physical keyboard and mouse. By following this "metaphor as model" approach, SUITEKeys facilitates the formation of a conceptual model of the system and its linguistic domain [13]. Given that the linguistic primitives of speech user interfaces are invisible (as opposed to the primitives of well-designed graphical user interfaces), this significantly improves the system's usability.

Additional functionality, that is, functionality beyond the interface metaphor, is incorporated by making it visible through the SUITEKeys graphical user interface. The user may simply read aloud menu entries and dialog-box elements to select them. Further support for learning and retaining how to use the system is provided through context-sensitive help (i.e., "What Can I Say") and progressive disclosure (i.e., "More Help").

This relatively low-level approach has a potential drawback compared to speech user interfaces employing a higher-level approach, such as NaturallySpeaking. In many cases, a user may be required to "synthesize" higher-level actions (e.g., Drag & Drop) through a sequence of low-level actions. This is especially "painful" in the context of dictation. Although SUITEKeys does not "claim" to be a dictation package, it addresses this drawback through (a) its word auto-completion mechanism and (b) its capability to relinquish control to other speech-enabled applications.

Although speech is not the best modality for all human-computer interaction tasks, when delivered at the level of keyboard and mouse it allows for universal access to computing devices, functionally similar to the access enjoyed through a standard QWERTY keyboard and mouse. The presented solution, in a future incarnation, might complement or even replace existing mobile computing modalities in many application domains. Since it does not require much physical device area for alphanumeric data entry (only a microphone and perhaps a speaker, for feedback), the physical device may shrink as much as advances in microelectronics allow.

ACKNOWLEDGMENTS

This research has been partially supported by the Louisiana Board of Regents grant BoRSF-(1997-00)-RD-A-31. The authors acknowledge the contribution of Rao Adaviknlanu, Huong Do, Corey Gaudin, Garrick Hall, William Jacobs, Jonathan Laughery, Michail Lagoudakis, Eric Li, Adi Sakala, and the Spring 1999 Human-Computer Interaction class at the University of Louisiana at Lafayette.

REFERENCES

1. Manaris, B. and Harkreader, A. SUITEKeys: A speech understanding interface for the motor-control challenged, in Proceedings of ASSETS '98 (Marina del Rey, CA, April 1998), ACM, 108-115.
2. Manaris, B., MacGyvers, V., and Lagoudakis, M. Speech input for mobile computing devices, in Proceedings of the 12th International Florida AI Research Symposium (FLAIRS-99) (Orlando, FL, May 1999), 286-292.
3. Manaris, B. and MacGyvers, V. Speech for alphanumeric data entry: a usability study, submitted to the International Journal of Speech Technology.
4. Goldstein, M., Book, R., Alsio, G., and Tessa, S. Ubiquitous input for wearable computing: QWERTY keyboard without a board, in Proceedings of the First Workshop on Human Computer Interaction with Mobile Devices (Glasgow, Scotland, 1998), GIST Technical Report 98/1. Available at www.dcs.gla.ac.uk/~johnson/papers/mobile/HCIMD1.html.
5. MacKenzie, I.S., Zhang, S.X., and Soukoreff, R.W. Text entry using soft keyboards. Behaviour & Information Technology 18 (1999), 235-244. Available at www.yorku.ca/faculty/academic/mack/BIT3.html.
6. Dragon Systems, Inc., NaturallySpeaking. Available at www.dragonsys.com.
7. Microsoft, MS Speech Engine 4.0. Available at www.microsoft.com/IIT/download/.
8. Manaris, B. and Harkreader, A. SUITE: speech understanding interface tools and environments, in Proceedings of FLAIRS '97 (Daytona Beach, FL, May 1997), 247-252.
9. Manaris, B. and Dominick, W.D. NALIGE: A user interface management system for the development of natural language interfaces, International Journal of Man-Machine Studies 38, 6 (1993), 891-921.
10. Weiser, M. The world is not a desktop. Interactions, January 1994, p. 7.
11. Copestake, A. Applying natural language processing techniques to speech prostheses, in 1996 AAAI Fall Symposium on Developing Assistive Technology for People with Disabilities. Available at www-csli.stanford.edu/~aac/papers.html.
12. Furnas, G.W., Landauer, T.K., Gomez, L.M., and Dumais, S.T. Statistical semantics: analysis of the potential performance of key-word information systems, The Bell System Technical Journal 6 (1983), p. 1796.
13. Preece, J., Rogers, Y., Sharp, H., Benyon, D., Holland, S., and Carey, T. Human-Computer Interaction. Addison Wesley, Reading, Massachusetts, 1994, p. 146.
14. USC News Service, Machine demonstrates superhuman speech recognition abilities, news release 0999025, Sep. 30, 1999. Available at http://uscnews.usc.edu/newsreleases/.
