<<

Improved Touch-screen Inputting Using Sequence-level Prediction Generation

Xin Wang, Xu Li, Jinxing Yu, Mingming Sun, Ping Li
Cognitive Computing Lab, Baidu Research
No.10 Xibeiwang East Road, Beijing, China
10900 NE 8th St., Bellevue, WA 98004, USA
{wangxin60,lixu13,yujinxing,sunmingming01,liping11}@baidu.com

ABSTRACT

Recent years have witnessed the continuing growth of people's dependence on touchscreen devices. As a result, input speed with the onscreen keyboard has become crucial to communication efficiency and user experience. In this work, we formally discuss the general problem of input expectation prediction with a touch-screen input method editor (IME). Taking input efficiency as the optimization target, we propose a neural end-to-end candidate generation solution that handles automatic correction, reordering, insertion, deletion as well as completion. Evaluation metrics are also discussed based on real use scenarios. For a more thorough comparison, we also provide a statistical strategy for mapping touch coordinate sequences to text input candidates. The proposed model and baselines are evaluated on a real-world dataset. The experiment (conducted on the PaddlePaddle deep learning platform¹) shows that the proposed model outperforms the baselines.

¹ https://www.paddlepaddle.org.cn

ACM Reference Format:
Xin Wang, Xu Li, Jinxing Yu, Mingming Sun, Ping Li. 2020. Improved Touch-screen Inputting Using Sequence-level Prediction Generation. In Proceedings of The Web Conference 2020 (WWW '20), April 20–24, 2020, Taipei. ACM, New York, NY, USA, 7 pages. https://doi.org/10.1145/3366423.3380080

[Figure 1: two keyboard screenshots, showing the candidate list "g, get, good" after the first touch and "gi, go, got" after the second touch.]
Figure 1: An illustration of the input process and the corresponding candidate lists. The first touch falls into the area of 'g', so the candidate list shows high-probability words beginning with 'g'. Then the second touch falls on the board of 'i', which is close to 'o', so the candidate list contains both possible strings 'gi' and 'go'.

1 INTRODUCTION

In most mobile devices, the soft-keys are small and lack the tactile feedback of physical key boundaries. It is hence fairly easy for touches to fall out of the visible soft-button areas of expected characters. Once an unexpected character sequence has been typed, it is difficult to move the cursor to the precise location. Thus, one has to delete most of the input sequence from the end and re-input it, which decreases input efficiency and leads to bad user experiences.

To minimize the need for manual correction, touch-screen input methods nowadays generate several possible candidates for the input sequence from the touch coordinates in real time. As shown in Figure 1, when the user touches the soft keyboard, the application continuously updates a list of candidate words until the user selects one word from the list. This has become the common practice in mobile devices. There are, however, at least three major challenges in such a candidate generation and display process.

First of all, there are various mistaken inputs in the typing process. (i): The touch coordinates may fall outside the soft-button areas of target characters [26], and the distributions vary according to the different characters. Besides, for the same character, different context also leads to inconsistent coordinate distributions. (ii): Similar pronunciation also causes touch distribution diffusion [8]. For instance, people misuse 'c' and 'k' because they may have the same syllable. (iii): Moreover, when typing with two hands, people sometimes input the sequence in the wrong order. For example, one can input 'doing' as 'doign' because of the over-rapid typing of the left hand. Besides, (iv): missing typing or (v): redundant typing operations also make it challenging to cover the expected input with candidates whose character numbers are equal to the number of touch coordinates.

Secondly, a coordinate sequence may be an incomplete input with respect to the expectation. For example, when the input method receives the coordinate sequence of 'sat', the user's expected word may be 'Saturday' or 'satisfied'. Ideally, an algorithm should automatically complete the input and update the candidate list to help users finish typing with one selection. Therefore, it should be obvious that candidates with only character-level corrections are far from enough to satisfy user expectations.

Last but not least, the sizes of mobile device screens are small, and hence only limited candidates can be displayed on the list. The generated candidates should be scored and ranked properly.

To address the above challenges, in this paper we introduce the neural encoder-decoder framework with a beam search. It takes noisy coordinates as input to generate scored candidates. Neural sequence-to-sequence models have the promising potential to handle symbol mapping, insertion, deletion and reordering, which has been proved in translation tasks [2]. These properties are helpful for correcting mistaken input. Moreover, the encoder-decoder framework has achieved remarkable performance on image captioning and automatic reply [24, 31]. It shows that such a framework has the capacity to learn sequence-level representation tasks that do not rely on one-to-one symbol mapping. Thus, it is a good choice to encode the coordinate sequence of the first few characters (a relatively incomplete input) and generate the completed sequence. During a beam search decoding process, we obtain the scores of possible sequences according to the softmax probabilities, which can be used to select the limited candidates to be displayed. In this way, the input efficiency of users can be improved by avoiding manual correction and completion.
The contributions of this work can be summarized as follows:

• We formally describe the task of candidate list prediction on touch-screen devices and the optimization objective for improving input efficiency. The proposed framework is trained incrementally and validated with different measures on a real-world dataset. The experimental results show that the proposed methods significantly improve the top-1 accuracy and the character-per-touch metric over baseline strategies.
• For a simplified prediction problem, we introduce a neural sequence-to-sequence model that takes the touch coordinate sequence as input and generates words by handling replacing, reordering, inserting, deleting and completing in an end-to-end manner.
• Two evaluation metrics are explained based on real-world application scenarios. We also discuss the inconsistency of the two metrics and the underlying difference in assumptions about user preferences.
• For a more thorough comparison, we provide a statistical strategy for mapping touch coordinate sequences to word-level input candidates as a baseline.

2 PROBLEM STATEMENT

We examine the user behavior of text input on mobile devices through a commercial input method application, Facemoji. The task of continuously updating the candidate list can be formally described as follows. The core function of the algorithm can be viewed as an agent who predicts a list of candidate words $W_{i,t} = \langle w^1_{i,t}, w^2_{i,t}, \ldots, w^n_{i,t} \rangle$ after the user inputs the $t$-th character of the $i$-th word, based on the user inputs ($P_{i-1}$ and $C_i$), what the user can see in the interface ($L$ and $\mathcal{W}_i$), and the user's habits ($U$):

$W_{i,t} = f_\theta(P_{i-1}, C_i, L, \mathcal{W}_i, U)$  (1)

where $P_{i-1} = (p_1, p_2, \ldots, p_{i-1})$ is the sequence of words previously input by the user, $C_i = (c^1_i, c^2_i, \ldots, c^t_i)$ is the sequence of touch coordinates of the current word, $L$ represents the keyboard layout that maps the coordinates to characters, $\mathcal{W}_i = \{W_{i,j} \mid 1 \leqslant j \leqslant t-1\}$ indicates the candidate lists of words that have been shown to the user right after she/he inputs the former $j$-th coordinates, and $U$ represents user-specific features.

There are multiple objectives for developing input method algorithms. For example, some applications focus on entertaining aspects and put emojis at the top of the list. In this work, we focus on input efficiency. If a character sequence in the candidate list is selected by the user, we call such a sequence a user expectation $X_i$. Our objective is to optimize the parameters $\theta$ of function $f_\theta$ to minimize the position of $X_i$ in list $W_{i,t}$:

$\arg\min_\theta \; pos(X_i, W_{i,t})$  (2)

where $pos(A, B)$ represents $A$'s rank position in list $B$.

2.1 Practical Concerns for the Problem

Due to increasing privacy concerns, we only randomly sample limited information during data collection. Firstly, we only extract the touch coordinates within sampled words and collect no inter-word information. In the absence of $P_{i-1}$, we cannot make use of a language model as previous works do [13, 32]. Secondly, the stream data is anonymized and randomly shuffled, and no temporal or geographical information or passwords are included. Thus, no personalized information $U$ is used, although it has been proved to be helpful [13, 30]. Also, limited by the size of the log stream, we cannot save the candidate lists for all time steps $\mathcal{W}_i$. We hence only collect the coordinate sequence $C_i$ and the user expectation $X_i$. The word prediction problem is thus simplified as:

$W_{i,t} = f_\theta(C_i, L)$  (3)

Note that the result is independent of the context words. From now on, we will omit the subscript $i$ in the notation.

2.2 Online Solution and Data Collection

Given the coordinate sequence $C$, there are many methods to predict the candidate word list. The online solution is a rule-based one designed by experts. It gets the original user input based on which rectangle areas of characters the coordinates fall into. Then the solution generates candidates by enumerating the possible character replacing, deleting, inserting, reordering and completing operations. Finally, the candidates are sorted based on the coordinate deviation level (including the Bayesian touch distance [5]) and word frequency. Users update the candidate list by typing, deleting and re-typing until they get their expectations. During this process, we can collect the original input coordinates $C$ and the corresponding expectation $X$. The rule-based solution can be iteratively updated based on the collected data sampled from all English keyboard users.

3 FRAMEWORK FOR CONVERTING COORDINATES TO CANDIDATES

It is expensive to design different refined hand-crafted rules or develop different rules for countries with different user behaviors. Data-driven methods are promising to outperform rule-based ones and can be easily trained on datasets of different distributions (e.g., datasets from non-English speaking countries).

A recurrent neural network (RNN) is ideal for handling sequence input of variable length. Meanwhile, an RNN decoder with a beam search can naturally generate several character sequences as candidates. We train a sequence-to-sequence framework with attention using $C$ and $X$ as the input and target sequence respectively. In this way, the model learns to generate $X$ as the top one candidate and minimize $pos(X, W_t)$. If $pos(X, W_t) > 1$, the loss will not be zero and the parameters will continue to be updated.
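To make the data collection of Section 2.2 and the simplified interface of Eq. (3) concrete, the following Python sketch shows one possible layout of a logged $(C, X)$ pair and of the prediction call. The class, field and method names (including the normalization step and `beam_decode`) are illustrative assumptions, not the production code.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Sample:
    """One logged sample: the touch coordinates of a single word (C) and the
    word the user finally selected (X). No context words P, candidate-list
    history, or user features U are collected (Section 2.1)."""
    coords: List[Tuple[float, float]]   # C: one (x, y) pair per touch
    expectation: str                    # X: the word the user selected

def normalize(coords, width, height):
    """Illustrative preprocessing: scale raw pixel coordinates to [0, 1]
    relative to the keyboard layout L, so the model does not depend on the
    concrete screen size (the paper only states that L maps coordinates to
    characters)."""
    return [(x / width, y / height) for x, y in coords]

def predict_candidates(model, coords, layout, k=3):
    """The simplified interface of Eq. (3), W_t = f_theta(C, L): given the
    touches of the current word, return a ranked list of k candidate words.
    `model.beam_decode` is a hypothetical method name."""
    features = normalize(coords, layout["width"], layout["height"])
    return model.beam_decode(features, beam_size=k)
```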

[Figure 2: left panel "RNN Encoding" shows each touch coordinate (the coordinate of 'g', close to 'h', and the coordinate of 'i', close to 'o') passing through Normalize, Mapping and GRU steps; right panel "RNN Decoding with Beam Search" shows the scored candidates 'go' 0.71, 'gi' 0.23, 'got' 0.05 and 'hi' 0.01.]
Figure 2: The illustration of the proposed framework handling the coordinate input of 'gi'. Some secondary components, such as softmax layers, are omitted. Both '#' and 'eos' represent the end of the sequence.
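As a concrete reference for the components sketched in Figure 2 and formalized in Section 3.1 below, the following is a minimal PyTorch-style sketch of the coordinate encoder and one attention-decoder step. The layer names, the concatenation-based attention score, and the use of the previous decoder state are our reading of Eqs. (4)-(10); the authors' implementation is built on PaddlePaddle and may differ in detail.

```python
import torch
import torch.nn as nn

class CoordinateEncoder(nn.Module):
    """Eqs. (4)-(5): map each (x, y) touch to a dense vector, then run a
    bidirectional GRU over the touch sequence."""
    def __init__(self, coord_dim=2, embed_dim=512, hidden=256):
        super().__init__()
        self.proj = nn.Linear(coord_dim, embed_dim)          # W c^t + b
        self.rnn = nn.GRU(embed_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, coords):                               # coords: (B, T, 2)
        e = torch.sigmoid(self.proj(coords))                 # f is the sigmoid in Eq. (4)
        h, _ = self.rnn(e)                                   # h: (B, T, 2*hidden)
        return h

class AttnDecoderStep(nn.Module):
    """One decoding step of Eqs. (6)-(10): attend over encoder states,
    update the decoder GRU, and emit a character distribution."""
    def __init__(self, vocab, char_dim=256, enc_dim=512, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab, char_dim)
        self.attn_W = nn.Linear(enc_dim + hidden, hidden)
        self.attn_v = nn.Linear(hidden, 1, bias=False)
        self.cell = nn.GRUCell(char_dim + enc_dim, hidden)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, prev_char, d_prev, enc_h):             # enc_h: (B, T, enc_dim)
        d_exp = d_prev.unsqueeze(1).expand(-1, enc_h.size(1), -1)
        u = self.attn_v(torch.relu(self.attn_W(torch.cat([enc_h, d_exp], dim=-1))))
        a = torch.softmax(u, dim=1)                          # Eq. (8)
        w = (a * enc_h).sum(dim=1)                           # Eq. (9) context vector
        v = self.embed(prev_char)                            # embedding of last character
        d = self.cell(torch.cat([v, w], dim=-1), d_prev)     # Eq. (6)
        return torch.log_softmax(self.out(d), dim=-1), d     # Eq. (10)
```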

3.1 Encoder and Decoder Network

Deep neural networks often take high-dimensional dense vectors as input [3]. Here we use a fully connected layer to map the two-dimensional points to a high-dimensional space:

$e_t = f(W c^t + b)$  (4)

where $f$ represents the sigmoid function, and $W$ and $b$ represent the weight and bias respectively. An RNN structure makes it possible to encode the information of context touches step by step. We leverage the gated recurrent unit (GRU) to encode the arbitrary-length touch coordinates [10]. The hidden state vector of the encoder at time $t$ can be computed as:

$h_t = \mathrm{GRU}(h_{t-1}, e_t)$  (5)

The length of the predicted word is not always equal to the number of input coordinates. Therefore, besides automatic correction, the generator is supposed to have the capacity of deleting redundant inputs, inserting missing inputs and completing the input sequence. The RNN decoder can generate a flexible-length sequence to cover the above situations. Thus, an RNN decoder with an attention mechanism [2, 27] is introduced to learn weights for the crucial inputs and handle possible alignment:

$d_t = \mathrm{GRU}(d_{t-1}, v_{t-1}, w_t)$  (6)

where $v_{t-1}$ is the embedding of the character predicted in the last time step, and the attention vector $w_t$ can be computed as:

$u_{tj} = v^T \sigma_h\big(W \cdot (h_j, d_{t-1})\big)$  (7)
$a_{tj} = \mathrm{softmax}(u_{tj})$  (8)
$w_t = \sum_{j=1}^{T_x} a_{tj} h_j$  (9)

where $\sigma_h$ represents the ReLU function [14], while $W$ and $v^T$ are learnable parameters. The output characters are selected based on the softmax probabilities [7]:

$y_t = \mathrm{softmax}(d_t)$  (10)

And the cross-entropy loss is used to measure the difference between the predicted distributions and the gold standard ones [11].

3.2 Learning and Predicting Details

During the training process, the decoder greedily selects the character with the largest probability. The teacher-forcing technique is used to improve the training process [15]: the current RNN input is decided based on the ground-truth character of the last time step, instead of the predicted one. During prediction, a beam search strategy [16] is used to i): get more accurate predictions than greedy search by enlarging the search breadth, and ii): generate multiple scored candidates.

4 EVALUATION METRICS

In this section, we discuss two perspectives on evaluating an inputting algorithm and their possible inconsistency.

4.1 Ranking Metric and Accuracy

Selecting the candidates to be displayed on the user interface can be treated as a ranking problem, which can be evaluated with discounted cumulative gain (DCG) [17, 22]. In the real-world scene, when users finish typing a word, they are used to touching the spacebar and expecting the top one candidate, with a space, to be shown on the text field. Therefore, $DCG_1$ is used to address the importance of the top one prediction. The measurement can be written as:

$DCG_1 = 2^{rel_1} - 1$  (11)

where the graded relevance $rel_1$ of a sample is set to 1 if and only if the top one candidate covers the expectation:

$rel_1 = \begin{cases} 1, & \text{if the top candidate is expected} \\ 0, & \text{otherwise} \end{cases}$  (12)

In this condition, the average top one ranking metric is the same as accuracy or word prediction rate [32]. We see that the existing word-level matching evaluation is a specialization of the list-level ranking metric.
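For concreteness, a small sketch of how Eqs. (11)-(12) reduce to word-prediction accuracy when averaged over samples is given below; the function and variable names are ours.

```python
def dcg1(candidates, expectation):
    """Eqs. (11)-(12): the graded relevance rel_1 is 1 iff the top candidate
    equals the expectation, so DCG_1 = 2**rel_1 - 1 is simply 0 or 1."""
    rel1 = 1 if candidates and candidates[0] == expectation else 0
    return 2 ** rel1 - 1

def top1_accuracy(samples):
    """Averaging DCG_1 over samples gives the word prediction rate."""
    return sum(dcg1(c, x) for c, x in samples) / len(samples)

# Toy check: one hit at rank 1 and one miss give an accuracy of 0.5.
print(top1_accuracy([(["go", "got", "gi"], "go"),
                     (["fine", "gone"], "give")]))   # -> 0.5
```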

4.2 Character per Touch

Evaluating at the character level, the input efficiency can be represented as the ratio of the number of expected characters to the number of times the user touches the screen. Previous work uses Keystroke Savings (KS) to evaluate the character-level performance of automatic completion [13, 32]. Such a metric can be regarded as a micro-averaged evaluation over all characters in the test set. However, in practice, we try to find every extreme bad case that hurts the user experience. Therefore, we evaluate the character-level performance word by word and then compute the macro-averaged result over all words.

Moreover, simulations in previous works assume that (a) the user will notice immediately if the target word is among the top candidate list, and (b) the characters that have been input are exactly the same as the user expectation. However, the collected log shows that i) users select the expected word after some redundant touches, and ii) all existing candidates may contain a character different from the user expectation (but the same as the user's mistouch).

Instead of using character-level simulation, we perform an evaluation based on the word-log data. Formally, given the user expectation $X$ and coordinate sequence $C_t$, if the algorithm returns a $K$-length candidate list $W_t$, then the character per touch of candidate $w^k_t \in W_t$ can be computed as:

$\mathrm{CPT}_{cand}(C_t, X, w^k_t) = \dfrac{len(X)}{t + update(X, w^k_t)}$  (13)

where $update(X, w^k_t)$ denotes the number of operations the user needs to delete the wrongly typed characters in $w^k_t$ and retype the correct ones in $X$. So the denominator is the number of times the user touches the screen to get the expected sequence, and the numerator indicates the length of the user expectation.

We assume that among all candidates $w^k_t \in W_t$, the user will choose the one with the largest $\mathrm{CPT}_{cand}$ to avoid deleting or retyping operations. So the character per touch of the candidate list, $\mathrm{CPT}_{list}$, can be computed as:

$\mathrm{CPT}(C_t, X, W_t) = \max_k\big(\mathrm{CPT}_{cand}(C_t, X, w^k_t)\big)$  (14)

4.3 Inconsistency between the Metrics

In practice, the above ranking metric and character-touch ratio are not always consistent. $CPT_{list}$ relies not only on whether the candidate is correct but also on the number of characters that need to be modified if it is wrong. Therefore some candidates may increase the mathematical expectation of accuracy but decrease the mathematical expectation of $CPT_{list}$ at the same time.

If a user prefers one whole word-level correction rather than several tiny character-level corrections in different words, the sequence-level accuracy works better. Conversely, the $CPT_{list}$ measurement can bring a better user experience. Therefore, in evaluation for real-world applications, only those algorithms that work better on both metrics can be considered sufficient improvements.

5 EXPERIMENT

5.1 Experimental Settings

We conduct incremental training because it is friendly to the data streams of online applications. The training process stops when the number of samples reaches 50 million, and the model with the minimum log perplexity on the streaming validation data is selected for testing. The validation set and testing set both contain 10 million random samples. We believe the training and testing sets provide more than the needed range and resolution of the input and output variable values, and thus cross-validation is skipped.

We set the hidden layer size to 256 for both the bidirectional GRU encoder and the decoder. The input coordinates are mapped to a 512-dimensional space through the fully connected layer. In the decoding process, each character is mapped to an embedding of 256 dimensions.

We set the batch size to 128 for the training, validation and testing processes. The gradient is clipped to a number no larger than 5.0, and the learning rate is fixed to 0.0001 during Adam optimization [18]. The above hyper-parameters are selected with a grid search on an appropriate scale or log scale on the validation set. Other hyper-parameter combinations in the frequently used scales could bring at most about a 3% accuracy reduction. It is an obvious impact, but such a non-optimal accuracy is still higher than the baselines.

During the evaluation, the beam size is set to 3, which is enough to cover all mainstream settings of the exclusive choice list on touchscreen mobile devices.

5.2 Baseline Methods and Comparison

To show the effectiveness of the proposed framework, we compare it with several baseline systems. The basic Rectangle baseline contains only a simple detection rule. Therefore, besides the Online solution described in Section 2, we propose a Statistical Model that implements correcting, reordering and completing as an enhanced baseline for a more thorough comparison.

5.2.1 Rectangle. Given a touch coordinate, we simply output the expected character according to whose rectangle area it falls in. All the characters together make up the candidate.
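A minimal sketch of this Rectangle rule is given below, assuming the layout $L$ is available as axis-aligned key rectangles; the data structure and the coordinate values are illustrative, not the production layout.

```python
# Keyboard layout L as axis-aligned rectangles: char -> (x0, y0, x1, y1).
# The numbers below are made-up placeholders for two neighbouring keys.
QWERTY = {"g": (240.0, 300.0, 280.0, 360.0),
          "h": (280.0, 300.0, 320.0, 360.0)}

def rectangle_decode(coords, layout=QWERTY):
    """Map every touch to the character whose rectangle contains it and
    concatenate the results; this is the whole Rectangle baseline."""
    chars = []
    for x, y in coords:
        for ch, (x0, y0, x1, y1) in layout.items():
            if x0 <= x < x1 and y0 <= y < y1:
                chars.append(ch)
                break
    return "".join(chars)
```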

5.2.2 Statistical Model. Given the result of the rectangle-based decision $G = \langle g_{1:t} \rangle$, we can generate possible candidates through a series of operations. Firstly, we replace the rectangle-decided character $g_k$ with a possible expected character $m_k$ ($1 \leqslant k \leqslant t$) and get refined candidates $M = \langle m_{1:t} \rangle$. Then, for each adjacent character pair $m_k$ and $m_{k+1}$ ($1 \leqslant k \leqslant t-1$), we obtain both the original and the reordered sequences, denoted as $R$. For each $R$, we gain candidates by looking up the high-frequency words that contain the prefix $R$. The final candidate set $\mathcal{F}(R)$ can be represented as:

$\mathcal{F}(R) = \{F \mid prefix(F) = R\}$  (15)

The probability of a final candidate $F \in \mathcal{F}(R)$ can be estimated according to the above generation processes:

$p(F) = p(F|R) \sum_{M \in \mathcal{M}} p(R|M)\, p(M|G)$  (16)

$p(F|R)$ can be computed with maximum likelihood estimation on a dictionary with word frequencies.

We assume that the mistouch and same-pronunciation misuse of characters are independent of the context inputs. Given the original sequence $G$, the probability of a modified sequence $M$ can be computed as the product of the probabilities of character-to-character replacing:

$p(M|G) = \prod_{n=1}^{N} p(m_n | g_n)$  (17)

Similarly, with the independence assumption, the reorder probability can be described as:

$p(R|M) = \prod_{n=1}^{N-1} p(r_n r_{n+1} | m_n m_{n+1})$  (18)

The above model can generate candidates as well as obtain their probabilities. However, a more general score can be estimated by a log-linear model without assuming that the probabilities are equally important [25]. In this work, we train a log-linear model to score the generated candidates by using the replacing (modifying), reordering and completing probabilities as features. Such a strategy implements automatic correction as well as automatic completion, which forms a strong baseline for the neural model.
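To make the generation-and-scoring pipeline of Eqs. (15)-(18) concrete, a toy sketch is given below. The confusion table, the single-swap reordering, the placeholder lexicon and the fixed feature weights are simplifying assumptions of ours; the actual baseline learns the log-linear weights from data.

```python
import math
from itertools import product

def statistical_candidates(g, confusion, swap_p, lexicon, w=(1.0, 1.0, 1.0)):
    """Toy version of Section 5.2.2 (Eqs. 15-18). `g` is the rectangle-decoded
    string; `confusion[c]` lists (replacement, prob) pairs including keeping c;
    `swap_p` is a probability of swapping one adjacent pair (a simplification
    of Eq. 18); `lexicon` maps words to relative frequencies."""
    scored = {}
    # Eq. (17): independent character-to-character replacement, p(M|G).
    for repl in product(*(confusion[c] for c in g)):
        m = "".join(c for c, _ in repl)
        log_p_m = sum(math.log(p) for _, p in repl)
        # Simplified Eq. (18): keep the order, or swap one adjacent pair.
        reorders = [(m, 1.0 - swap_p)]
        if swap_p > 0:
            reorders += [(m[:i] + m[i + 1] + m[i] + m[i + 2:], swap_p)
                         for i in range(len(m) - 1)]
        for r, p_r in reorders:
            # Eq. (15): complete the prefix R to full dictionary words.
            for word, freq in lexicon.items():
                if word.startswith(r):
                    score = (w[0] * log_p_m + w[1] * math.log(p_r)
                             + w[2] * math.log(freq))   # log-linear scoring
                    scored[word] = max(scored.get(word, float("-inf")), score)
    return sorted(scored, key=scored.get, reverse=True)
```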

Table 1: Accuracy of different methods.

Method               Accuracy (DCG_1)   CPT_list
Rectangle                 0.693           0.938
Online                    0.803           1.020
Statistical Model         0.814           1.022
Neural Model              0.859           1.026

5.2.3 Experimental Results. The experimental results of the ranking metric and the character-per-touch metric are shown in Table 1. The online solution, the statistical model and the neural model outperform the Rectangle baseline. This is because these three methods alleviate the "fat finger problem" [26] by automatic correction. The online, statistical and neural approaches can reach a $CPT_{list}$ larger than 1, which indicates that the automatic completions improve input efficiency. Nevertheless, the performance improvement of the online solution is limited. We will see that data-driven models make better use of the comprehensive information in the data than expert-designed rules.

The neural model outperforms the statistical model on both the accuracy and $CPT_{list}$ measures. The statistical approach can generate candidates with out-of-rectangle characters by enumerating the character-replacing operation and achieve good coverage of the expected word. However, the model cannot represent the real probabilities because of a lack of coordinate-to-character (mapping) probabilities. These probabilities are memory consuming for mobile devices. Therefore, with only character-level features, the statistical strategy is mainly dominated by word frequency. In contrast, the neural model generates candidates by considering various factors. Both word frequency and coordinate-based probability are encoded in the network parameters during training.

The difference in $CPT_{list}$ among the online solution, the statistical model and the neural model is not as evident as the difference in accuracy. In the online solution or the statistical model, although the expected inputs do not match the top one candidate, they are probably covered by other top candidates. Therefore, the maximal value of the multiple $CPT_{cand}$ results of these models is the same most of the time. The Rectangle method fails to generate more candidates out of the touching areas and needs more touches to delete, re-input and complete the expected characters, which results in a $CPT_{list}$ value lower than 1.

According to the statistics, the major operations needed in the data are correction, completion, and insertion (of the apostrophe), whose proportions are 14.18%, 8.86% and 4.04% respectively. In comparison, only 0.28% of samples need reordering and 0.08% of samples need deleting. We can see that having the capacity to insert a character into the original input is also a major advantage of the neural model over the statistical model.

6 CASE STUDY

Table 2 demonstrates cases where the auto-correction predictions are different from the Rectangle decision. The Rectangle baseline fails both at making use of character-level context information and at completing the character sequences.

Table 2: Examples of the top one candidate of different methods.

#   Rectangle   Online   Statistical   Neural   Expected
1   toda        today    today         today    today
2   perosn      person   person        person   person
3   fone        gone     fine          fine     fine
4   live        love     love          live     live
5   bit         but      but           bit      bit

Both the rule-based online solution and the statistical method can generate candidates covering different situations, including completing (#1), reordering (#2) and replacing (#3-#5). However, we find that the results of these two models are dominated by word frequency, while the neural model can avoid some frequent words that have a character far from the coordinate.

As discussed in Section 3, the neural framework has the capacity of generating candidates whose lengths may be unequal to the number of input touches. The input coordinate sequence is encoded in the bidirectional GRU, and the decoder generates the candidate characters without necessarily being aligned to a particular touch.

Table 3: Examples of cases where the length of the predicted sequence is not equal to the number of input touches.

Operation    Rectangle   Neural
Insertion    theyre      they're
             dont        don't
Deletion     readyy      ready
             feell       feel
Completion   youn        young
             stopp       stopped

Table 3 shows some cases where the proposed framework successfully predicts expected words whose lengths are different from the numbers of input touches. The learned patterns can be summarised as the following functions.
i): Insertion: The sequence-to-sequence model can effectively learn to insert an apostrophe before an abbreviation. However, without the sentence-level context, it is still difficult to decide whether an apostrophe should be added to the sequence 'were'. ii): Deletion: The EOS symbol is predicted when the already predicted characters compose a frequent word. The redundant touch coordinate rarely influences the sequence representation. iii): Completion: Un-inputted characters can be predicted uninterruptedly until the decoding result forms a frequent word. The network has learned to map the first few touches to several possible longer candidates.

7 COMPATIBILITY ON MOBILE DEVICES

In a real-world input method application, the memory cost of an algorithm should also be taken into consideration. In terms of the space occupied on the device, the distributable version of the trained neural model requires only 8 megabytes of storage space and is hence easy to transmit. When loaded by a mobile device, an 8-megabyte memory cost is fairly small compared to the gigabyte-level random access memory (RAM) of nowadays mobile devices. It is more memory-efficient than dictionary-based strategies such as the statistical model. The prefix-to-word dictionary in the direct translation strategy costs at least 25 megabytes of memory to achieve adequate candidate coverage, let alone the additional reverse dictionary if the noisy-channel model is used.

The computational resources needed by the statistical model and the neural model are similar. The prefix-candidates size in the statistical model and the square of the hidden size in the neural model are both at the level of 10 thousand. Therefore, both methods achieve reasonable average latency (less than 20 ms per touch).

Finally, the framework can be deployed on devices with mobile versions of deep learning frameworks, and the installation or updating package (including the trained models) is shippable through the cellular network.
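As a rough back-of-envelope check of the storage figure under the hyper-parameters of Section 5.1, the sketch below counts parameters for one plausible layer decomposition. The decomposition, the assumed character vocabulary size, and the float32 storage are our own assumptions, so the result is only indicative.

```python
# Rough parameter count with the Section 5.1 sizes (our own decomposition):
# 512-d coordinate projection, bidirectional 256-d GRU encoder, 256-d
# character embeddings, a 256-d GRU decoder over [embedding; context], and a
# small character-level output layer (about 60 symbols assumed).
def gru_params(inp, hid):
    return 3 * (inp * hid + hid * hid + hid)

vocab = 60
params = (2 * 512                       # coordinate projection (2 -> 512)
          + 2 * gru_params(512, 256)    # bidirectional encoder
          + vocab * 256                 # character embeddings
          + gru_params(256 + 512, 256)  # decoder GRU over [char; context]
          + (512 + 256) * 256 + 256     # attention score layers
          + 256 * vocab)                # output projection
# Prints roughly 2.2M parameters, about 8.4 MB at float32, which is in the
# same ballpark as the 8-megabyte figure reported above.
print(params, "params ~", params * 4 / 2 ** 20, "MB at float32")
```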
8 RELATED WORKS

8.1 Input Efficiency Improving

Some recent works made use of context (by introducing n-gram or other language models) or personalized information [13, 30, 32]. These features are proven to be helpful for improving user experience. In this work, we focus on the situation with no word-level context and no personal information available. In this way, this paper has discussed a framework that can be used under growing privacy concerns. Hand postures and the speed or pressure along a touch trail can also be important features [30]. The usage of such features is a valuable direction for further improving input efficiency.

Some works try to improve the user experience by generating candidates with expert-edited correction operations and hand-crafted features [19]. Instead of enumerating the correction operations and setting thresholds, the proposed models learn to score the candidates in a data-driven way. There are also works that choose characters with statistical criteria. The online system has made use of the Bayesian touch distance [5], while Zhang et al. [33] generate candidates with a beam search on predicted character probabilities and choose words with learning-to-rank models. In this work, we take a different approach by adopting a unified neural end-to-end framework to handle insertion, deletion, completion, etc. It is also obvious that one might be able to combine this work with [33] in a real production system.

In addition, there is also research focusing on the general automatic correction problem, including but not limited to touch-screen devices [12, 23]. Our work is different in that we implement automatic completion in both the statistical model and the neural model, which is rarely addressed in existing correction works. Besides, we show that coordinate features bring more information than characters and can be used to improve correction performance on mobile devices. We should also mention that improving the interactive mode is another main research route [4, 6, 29]. Our proposed framework is orthogonal to the new interactive modes and can be used to further improve user experience.

8.2 Neural Networks

Wang et al. [28] address the principle of making minimal assumptions by end-to-end learning with a deep neural network, and the recurrent encoder-decoder with attention has been validated on different tasks in various fields [2, 9], including translation, image captioning and automatic reply generation [2, 24, 31]. This work maps the two-dimensional coordinates to a high-dimensional space and introduces such a framework to predict final candidate words based on one original input of touch coordinates.

Alsharif et al. [1] improve gesture typing with an LSTM, which brings inspiration to our work. Gesture typing is popular in many countries/regions. On the other hand, languages whose characters are not on the keyboard (e.g., Chinese) are unfriendly to swiping. Thus, we explore improving the typing experience for the many related users accustomed to typing whatever they input.

9 CONCLUDING REMARKS

In this paper, we have formally described the general task of mapping touch-screen coordinates to the candidate list. Setting the input efficiency as the optimization objective, we have introduced the neural sequence-to-sequence model with a beam search for this task. The proposed framework takes the coordinate sequence as input and generates scored candidates by handling replacing, reordering, inserting, deleting and completing. We have presented two metrics, their corresponding specializations, as well as the inconsistency of the metrics according to realistic scenes. To show the effectiveness of the proposed framework, we provide a strong baseline by implementing a statistical model with automatic correction and completion. The proposed framework is evaluated with the streaming input touches of real-world data on both metrics. The experimental results show improvement over the baseline approaches and the effectiveness of the proposed model in increasing input efficiency. In future work, we will address topics like combination cases, personalization and swiping-based input for further improvement.

Baidu's IME products have very high DAUs (daily active users). The research of this paper was inspired while we collaborated with Baidu's product teams, including the Chinese IME product and the Facemoji multilingual IME product. Our prior work [33] has been successfully deployed in the products. We are actively working on deploying various research works, including this paper and our new research on learning to rank based on "abc-boost" tree algorithms [20, 21].

REFERENCES
[1] Ouais Alsharif, Tom Ouyang, Françoise Beaufays, Shumin Zhai, Thomas Breuel, and Johan Schalkwyk. 2015. Long short term memory neural network for keyboard gesture decoding. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2076–2080.
[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural Machine Translation by Jointly Learning to Align and Translate. In Proceedings of the 3rd International Conference on Learning Representations (ICLR). San Diego, CA.
[3] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. 2003. A Neural Probabilistic Language Model. J. Mach. Learn. Res. 3 (2003), 1137–1155.
[4] Xiaojun Bi, Tom Ouyang, and Shumin Zhai. 2014. Both complete and correct?: multi-objective optimization of touchscreen keyboard. In CHI Conference on Human Factors in Computing Systems (CHI). Toronto, Canada, 2297–2306.
[5] Xiaojun Bi and Shumin Zhai. 2013. Bayesian touch: a statistical criterion of target selection with finger touch. In The 26th Annual ACM Symposium on User Interface Software and Technology (UIST). St. Andrews, UK, 51–60.
[6] Xiaojun Bi and Shumin Zhai. 2013. Touchscreen text input. US Patent 8,405,630.
[7] Christopher M. Bishop. 2006. Pattern Recognition and Machine Learning. Springer.
[8] Eric Brill and Robert C. Moore. 2000. An Improved Error Model for Noisy Channel Spelling Correction. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics (ACL). Hong Kong, China.
[9] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 1724–1734.
[10] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv:1412.3555
[11] Pieter-Tjerk de Boer, Dirk P. Kroese, Shie Mannor, and Reuven Y. Rubinstein. 2005. A Tutorial on the Cross-Entropy Method. Annals OR 134, 1 (2005), 19–67.
[12] Noura Farra, Nadi Tomeh, Alla Rozovskaya, and Nizar Habash. 2014. Generalized Character-Level Spelling Error Correction. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL). Baltimore, MD, 161–167.
[13] Andrew Fowler, Kurt Partridge, Ciprian Chelba, Xiaojun Bi, Tom Ouyang, and Shumin Zhai. 2015. Effects of Language Modeling and its Personalization on Touchscreen Typing Performance. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems (CHI). Seoul, Korea, 649–658.
[14] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. 2011. Deep Sparse Rectifier Neural Networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (AISTATS). Fort Lauderdale, FL, 315–323.
[15] Anirudh Goyal, Alex Lamb, Ying Zhang, Saizheng Zhang, Aaron C. Courville, and Yoshua Bengio. 2016. Professor Forcing: A New Algorithm for Training Recurrent Networks. In Advances in Neural Information Processing Systems (NIPS). Barcelona, Spain, 4601–4609.
[16] Alex Graves. 2012. Sequence transduction with recurrent neural networks. Technical Report. arXiv:1211.3711
[17] Kalervo Järvelin and Jaana Kekäläinen. 2000. IR evaluation methods for retrieving highly relevant documents. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR). Athens, Greece, 41–48.
[18] Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR). San Diego, CA.
[19] Kenneth Kocienda, Greg Christie, Bas Ording, Scott Forstall, Richard Williamson, and Jerome René Bellegarda. 2012. Method, device, and graphical user interface providing word recommendations for text input. US Patent 8,232,973.
[20] Ping Li. 2009. ABC-Boost: Adaptive Base Class Boost for Multi-Class Classification. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML). Montreal, Canada, 625–632.
[21] Ping Li. 2010. Robust LogitBoost and Adaptive Base Class (ABC) LogitBoost. In Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence (UAI). Catalina Island, CA, 302–311.
[22] Ping Li, Christopher J. C. Burges, and Qiang Wu. 2007. McRank: Learning to Rank Using Multiple Classification and Gradient Boosting. In Advances in Neural Information Processing Systems (NIPS). Vancouver, Canada, 897–904.
[23] Christian Liensberger. 2015. Context sensitive auto-correction. US Patent 9,218,333.
[24] Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, and Alan L. Yuille. 2015. Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN). In Proceedings of the 3rd International Conference on Learning Representations (ICLR). San Diego, CA.
[25] Franz Josef Och and Hermann Ney. 2003. A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics 29, 1 (2003), 19–51.
[26] Katie A. Siek, Yvonne Rogers, and Kay H. Connelly. 2005. Fat Finger Worries: How Older and Younger Users Physically Interact with PDAs. In Proceedings of the IFIP TC13 International Conference on Human-Computer Interaction (INTERACT). 267–280.
[27] Oriol Vinyals, Lukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey E. Hinton. 2015. Grammar as a Foreign Language. In Advances in Neural Information Processing Systems (NIPS). Montreal, Canada, 2773–2781.
[28] Tao Wang, David J. Wu, Adam Coates, and Andrew Y. Ng. 2012. End-to-end text recognition with convolutional neural networks. In Proceedings of the 21st International Conference on Pattern Recognition (ICPR). Tsukuba, Japan, 3304–3308.
[29] Xiang Xiao, Teng Han, and Jingtao Wang. 2013. LensGesture: augmenting mobile interactions with back-of-device finger gestures. In Proceedings of the 15th ACM International Conference on Multimodal Interaction (ICMI). ACM, 287–294.
[30] Ying Yin, Tom Ouyang, Kurt Partridge, and Shumin Zhai. 2013. Making touchscreen keyboards adaptive to keys, hand postures, and individuals: a hierarchical spatial backoff model approach. In 2013 ACM SIGCHI Conference on Human Factors in Computing Systems (CHI). Paris, France, 2775–2784.
[31] Quanzeng You, Hailin Jin, Zhaowen Wang, Chen Fang, and Jiebo Luo. 2016. Image Captioning with Semantic Attention. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, NV, 4651–4659.
[32] Seunghak Yu, Nilesh Kulkarni, Haejun Lee, and Jihie Kim. 2018. On-Device Neural Language Model Based Word Prediction. In The 27th International Conference on Computational Linguistics (COLING): System Demonstrations. Santa Fe, NM, 128–131.
[33] Jingyuan Zhang, Xin Wang, Yue Feng, Mingming Sun, and Ping Li. 2018. FastInput: Improving Input Efficiency on Mobile Devices. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management (CIKM). Torino, Italy, 2057–2065.