Design of Urdu Virtual Keyboard
Proceedings of the Conference on Language & Technology 2009

M. Aamir Khan, M. Abid Khan and M. Naveed Ali
Department of Computer Science
University of Peshawar, Pakistan

Abstract

This paper presents the first virtual keyboard layout based on character frequency analysis of an Urdu corpus. The layout is optimized using Monte Carlo simulation with simulated annealing. Furthermore, the proposed keyboard layout is augmented with a word prediction list derived from the Urdu corpus to speed up text entry. A performance analysis of the keyboard layout is presented to justify the design.

1. Introduction

Virtual (soft) keyboards allow users to input text using a touch screen and stylus. For the English language, several virtual keyboard layouts have been proposed. These include MacKenzie and Zhang's OPTI layout, an improved OPTI layout arranged in a 5x6 grid (OPTI II) with 38 wpm (words per minute), the FITALY keyboard and the Chubon keyboard [6]. Evaluation of the performance of virtual keyboards involves the use of Fitts' Law [1][2].

Keyboard input speed is measured in wpm. The mean time (MT) to move to a key on a virtual keyboard is computed in terms of moving to a target key K of width W lying at distance A from the current position of the pointing device [3]. The keys on a virtual keyboard should be laid out so as to minimize the mean time over all digraph movements. The digraph frequencies are a natural feature of a language. MacKenzie and Zhang evaluated the performance of their virtual keyboard by computing 27x27 digraph frequencies from a corpus [5]. The distances (amplitudes) for all the 27x27 digraph movements in a given keyboard layout were computed, and for each movement Fitts' Law was used to compute the MT [5]. The following equation was used to compute the MT [5].

\[ MT = \frac{1}{9.4} \log_2\left(\frac{A}{W} + 1\right) \]

Here, W is the width of the key and A is the distance to move to the target key K. Each mean time was weighted by the digraph probability. The text entry speed in wpm was then obtained from the time per word, that is, MT multiplied by the average number of characters per word. The computed wpm is an "upper limit" on the text entry speed, because the "visual scan" time to find a key was assumed to be zero. The keyboard layout can be designed manually [5] or using optimization techniques such as Monte Carlo simulation [6].

2. Urdu virtual keyboard design

Urdu has 37 base characters. The character set used for designing the keyboard proposed in this paper also contains Arabic characters. This allows the keypad to be used for entering Arabic text as well, although it is not optimized for the Arabic language. Table 1 shows the set of Urdu characters.

Table 1: Urdu language characters

ا ب پ ت ٹ
ث ج چ ح خ
د ڈ ذ ر ڑ
ز ژ س ش ص
ض ط ظ ع غ
ف ق ک گ ل
م ن و ہ ھ
ے ں ئ
ؤ ء ۀ ة ۓ

The first step in designing a virtual keyboard is to determine the digraph frequencies. Computing digraph frequencies for Urdu requires a corpus. A raw corpus consisting of 16,638,852 words was collected from newspaper articles, books and magazines. Table 2 shows the frequencies of individual Urdu characters in descending order.

Table 2: Urdu character frequencies

Unicode (hex)   Character   Frequency   Percentage
627             ا           6733610     12.23570
6cc             ی           5752357     10.45266
6a9             ک           3911143      7.10697
631             ر           3669392      6.66768
648             و           3327481      6.04639
6c1             ہ           2994305      5.44098
6d2             ے           2857846      5.19302
646             ن           2773651      5.04003
645             م           2684946      4.87884
62a             ت           2117669      3.84803
633             س           1987451      3.61141
644             ل           1915841      3.48129
628             ب           1492997      2.71294
6ba             ں           1469466      2.67018
62f             د           1431230      2.60070
67e             پ            914273      1.66133
62c             ج            844670      1.53486
6be             ھ            800600      1.45478
626             ئ            664594      1.20764
6af             گ            643263      1.16888
639             ع            636166      1.15598
641             ف            546973      0.99391
642             ق            544460      0.98934
634             ش            532262      0.96718
62d             ح            501602      0.91147
632             ز            454158      0.82525
679             ٹ            420666      0.76440
686             چ            358159      0.65081
62e             خ            352729      0.64095
635             ص            327434      0.59498
622             آ            259879      0.47223
637             ط            220613      0.40088
688             ڈ            183081      0.33268
691             ڑ            143244      0.26029
636             ض            142813      0.25951
638             ظ            104163      0.18928
63a             غ            100331      0.18231
630             ذ             79372      0.14423
62b             ث             69641      0.12655
624             ؤ             32355      0.05879
621             ء             24930      0.04530
6c2             ۀ              4390      0.00798
698             ژ              2522      0.00458
629             ة              2275      0.00413
6d3             ۓ              1479      0.00269
Total                       55032482    100.00000

The 46x46 digraph frequency table was computed from the corpus. The digraph frequencies are shown as a color chart in Figure 1. Dark shaded cells represent higher frequency digraphs, whereas light shaded cells represent lower frequency digraphs. In Figure 1, the character on the Y axis is the first character of a digraph and the one on the X axis is the second. The order of characters in the columns and rows of the color chart is the same as in Table 1. The last column and the last row represent the space character. The digraph frequency table was used for computing the wpm performance of the keyboard.

Figure 1: Urdu digraph color chart

To compute the performance of a given keyboard layout, the following equation was used [5].

\[ MT = \sum_{i=0}^{k} \sum_{j=0}^{k} \frac{1}{9.4} \log_2\left(\frac{A_{i,j}}{W} + 1\right) \times d(i,j) \]

Here, A_{i,j} is the distance from key i to key j, W is the width of a key, and d(i,j) is the digraph frequency of character i followed by character j. The variable k is the number of characters in a given language. The diagonal entries of the digraph frequency table, where i = j, denote repeated characters, for which no stylus movement is involved. For repeated characters, the stylus tapping time was set to 0.127 seconds, as in Zhai et al. [6].

Designing a keyboard layout is a combinatorial task and requires O(n!) searches [6]. The layout should be arranged such that the MT is minimized. For optimizing the keyboard layout, 700 runs of Monte Carlo simulation with simulated annealing were executed. The best layout was found in the 53rd run. In each run, 100,000 to 200,000 random moves were tried on the keyboard layout. The annealing schedule was adjusted by trial and error. The width of each key was set at 50 pixels. The shape of the keys is based on the work of Zhai et al. [6]. Figure 2 shows the optimized layout of the Urdu keyboard.

Figure 2: Corpus based optimized Urdu virtual keyboard layout arranged in 7x7 cells

The speed of entering text using the layout in Figure 2 was computed using the following equation from Zhai et al. [6].

\[ wpm = \frac{60}{AWL \times MT} \]

where AWL is the average word length in the language and MT is the mean time. The constant 60 is the number of seconds in a minute. In Urdu, the average word length was found to be 7 characters; when the space after each word is added, it becomes 8 characters.

For the optimized keyboard, MT was found to be 0.20609985 seconds, which gives

\[ wpm = \frac{60}{8 \times 0.20609985} = 36.3901 \]

The predicted speed of the keyboard is thus 36.3901 words per minute.

To utilize the space between circular keys, the shape of the keys was changed to hexagonal. Figure 3 shows the improved design of the optimized layout.

Figure 3: Improved Urdu virtual keyboard to utilize the empty gaps between keys

A prototype of the keyboard was implemented using Microsoft Visual C++ 6.0 for Microsoft Windows. The program helps the user by highlighting the next probable keys and drawing rings around them. A darker color indicates a higher probability of occurrence, while a lighter color indicates a lower probability. The ring around the most probable next character blinks. Figure 4 shows the typing of the word ﮦ on the keyboard. After the first three characters have been entered, the next set of probable characters is highlighted in different shades of red.

Figure 4: Predictive input of the word رﮦ

To further improve input speed, a prediction list was added. Figure 5 shows the use of the prediction list.

Figure 5: Input with prediction list assistance

For analysis, the performance of the keyboard was determined on various words. Table 3 shows the computed distances for typing the 38 most frequently occurring words [2] in the corpus, each followed by a space character. The distances are measured as the number of keys traversed to type a given word. The distance covered depends on the positions of the keys for the characters in the word.

Table 3: Distances for 38 frequent words along with a space character

Word    Characters    Frequency    Distance
ﮯ       ے             618958       3
...
اﮯ      ا پ ن ے       41801        10
ﮩ       ﮦ ا           41399        5
پ       پ             40080        6
        گ ا           39963        5
        ج س           38483        5
ﮯ       ت ه ے         37790        6
        ت             37160        4

3. Evaluation

The proposed layout presented in Figure 2 was evaluated with 20 students of a computer science program. After an initial two-hour training session prior to the evaluation, the average text entry speed was found to be 13.47 wpm.
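The digraph counting step described in Section 2 could be sketched as follows. This is only an illustration under stated assumptions: the function names `digraph_counts` and `digraph_probabilities` are hypothetical, every character outside the chosen alphabet is treated as a word separator, and runs of separators are collapsed into a single space. The paper does not describe how the corpus was cleaned.

```python
from collections import Counter

def digraph_counts(text, alphabet):
    """Count ordered character pairs (digraphs) in `text`.

    Assumption: any character not in `alphabet` is treated as a space,
    and consecutive spaces are collapsed into one.
    """
    cleaned = "".join(ch if ch in alphabet else " " for ch in text)
    cleaned = " ".join(cleaned.split())          # collapse separators, trim ends
    return Counter(zip(cleaned, cleaned[1:]))    # (first, second) -> count

def digraph_probabilities(counts):
    """Normalize digraph counts so that they sum to 1."""
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}

# Toy usage with a two-letter alphabet; a real run would pass the
# characters of Table 1 and the full corpus text.
counts = digraph_counts("ab  ab", {"a", "b"})
probs = digraph_probabilities(counts)
```

With the space included as an extra symbol, the resulting table has one more row and column than the alphabet, which matches the 46x46 size reported for the Urdu character set.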
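The performance model of Sections 1 and 2 (Fitts' Law per movement, weighting by digraph probability, and conversion to wpm) can be sketched in a few lines. The constant 9.4 and the repeat time of 0.127 seconds are copied from the text; the function names, the use of key centres as coordinates, and the Euclidean distance are assumptions made for illustration, not details confirmed by the paper.

```python
import math

FITTS_CONSTANT = 9.4   # constant as it appears in the MT equation of the text
REPEAT_TIME = 0.127    # seconds for tapping the same key again (from the text)

def movement_time(distance, width):
    """Time in seconds to move `distance` to a key of size `width`."""
    if distance == 0:
        return REPEAT_TIME                      # repeated character, no movement
    return (1.0 / FITTS_CONSTANT) * math.log2(distance / width + 1.0)

def mean_time(positions, digraph_prob, width):
    """Digraph-weighted mean time per character for one layout.

    positions:    dict mapping a character to its (x, y) key centre (assumed)
    digraph_prob: dict mapping (first, second) to a probability
    """
    total = 0.0
    for (first, second), prob in digraph_prob.items():
        (x1, y1), (x2, y2) = positions[first], positions[second]
        total += prob * movement_time(math.hypot(x2 - x1, y2 - y1), width)
    return total

def words_per_minute(mt, chars_per_word):
    """wpm = 60 / (AWL x MT), with the space counted in chars_per_word."""
    return 60.0 / (chars_per_word * mt)

# Reproduces the figure reported for the optimized layout.
print(round(words_per_minute(0.20609985, 8), 4))  # 36.3901
```

Because the visual scan time is taken as zero, the value returned by `words_per_minute` is an upper bound, which is consistent with the lower speed measured in the evaluation.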
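The layout search described in Section 2 can be illustrated with a minimal simulated annealing sketch. Several choices here are assumptions rather than the authors' method: keys sit on a square grid with unit spacing (the paper uses circular and later hexagonal keys), a move swaps two randomly chosen keys, and the temperature follows a geometric cooling schedule with arbitrary start and end values (the paper only says the schedule was tuned by trial and error). All names are hypothetical.

```python
import math
import random

FITTS_CONSTANT = 9.4   # constant as it appears in the text
REPEAT_TIME = 0.127    # seconds, repeated character (from the text)

def layout_cost(layout, digraph_prob, cols):
    """Mean time of a layout; layout[slot] is the character in that grid slot."""
    pos = {ch: (slot % cols, slot // cols) for slot, ch in enumerate(layout)}
    cost = 0.0
    for (a, b), prob in digraph_prob.items():
        if a == b:
            cost += prob * REPEAT_TIME
        else:
            d = math.hypot(pos[a][0] - pos[b][0], pos[a][1] - pos[b][1])
            cost += prob * math.log2(d + 1.0) / FITTS_CONSTANT  # key width = 1
    return cost

def anneal(chars, digraph_prob, cols, steps=2000,
           t_start=0.05, t_end=1e-4, seed=0):
    """One annealing run; returns the best layout seen and its cost."""
    rng = random.Random(seed)
    layout = list(chars)
    rng.shuffle(layout)
    cost = layout_cost(layout, digraph_prob, cols)
    best, best_cost = layout[:], cost
    for step in range(steps):
        temp = t_start * (t_end / t_start) ** (step / steps)  # assumed schedule
        i, j = rng.sample(range(len(layout)), 2)
        layout[i], layout[j] = layout[j], layout[i]           # try a swap
        new_cost = layout_cost(layout, digraph_prob, cols)
        if new_cost <= cost or rng.random() < math.exp((cost - new_cost) / temp):
            cost = new_cost                                   # accept the move
            if cost < best_cost:
                best, best_cost = layout[:], cost
        else:
            layout[i], layout[j] = layout[j], layout[i]       # reject: undo swap
    return best, best_cost

# Toy usage: six keys in a 3-column grid where only the digraph "af" occurs.
best, best_cost = anneal("abcdef", {("a", "f"): 1.0}, cols=3)
```

Running many such runs with different seeds and keeping the overall best layout mirrors the 700 runs reported in the text; occasionally accepting worse layouts is what lets the search escape local minima in the O(n!) space.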