Portable Translator Capable of Recognizing Characters on Signboard and Menu Captured by Built-in Camera

Hideharu Nakajima, Yoshihiro Matsuo, Masaaki Nagata, Kuniko Saito
NTT Cyber Space Laboratories, NTT Corporation
Yokosuka, 239-0847, Japan
{nakajima.hideharu, matsuo.yoshihiro, nagata.masaaki, saito.kuniko}@lab.ntt.co.jp

Abstract

We present a portable translator that recognizes and translates phrases on signboards and menus as captured by a built-in camera. The system can be used on PDAs or mobile phones and resolves the difficulty of inputting character sets such as Japanese and Chinese when the user does not know their readings. Through the high-speed mobile network, small images of signboards can be quickly sent to the recognition and translation server. Since the server runs state-of-the-art recognition and translation technology with huge dictionaries, the proposed system offers more accurate character recognition and machine translation.

1 Introduction

Our world contains many signboards whose phrases provide useful information. These include destinations and notices in transportation facilities, names of buildings and shops, explanations at sightseeing spots, and the names and prices of dishes in restaurants. They are often written only in the mother tongue of the host country and are not always accompanied by pictures. Therefore, tourists must be provided with translations.

Electronic dictionaries might be helpful in translating words written in European characters, because key-input is easy. However, character sets such as Japanese and Chinese are hard to input if the user does not know their readings, such as kana and pinyin. This is a significant barrier to any translation service. It is therefore essential to replace keyword entry with some other input approach that supports the user when character readings are not known.

One solution is the use of optical character recognition (OCR) (Watanabe et al., 1998; Haritaoglu, 2001; Yang et al., 2002). The basic idea is the connection of OCR and machine translation (MT) (Watanabe et al., 1998), and implementations on personal digital assistants (PDAs) have been proposed (Haritaoglu, 2001; Yang et al., 2002). These systems are based on document OCR, which first tries to extract character regions, so their performance is weak under the varied lighting conditions of natural scenes. Although the system we propose also uses OCR, it is characterized by a more robust OCR technology that does not first extract character regions, by language processing that offsets the OCR's shortcomings, and by a client-server architecture over the high-speed mobile network (the third-generation (3G) network).

2 System design

Figure 1 overviews the system architecture. After the user takes a picture with the built-in camera of a PDA, the picture is sent to a controller on a remote server. At the server side, the picture is passed to the OCR module, which usually outputs many character candidates. Next, the word recognizer identifies word sequences among the candidates, up to the number specified by the user. Recognized words are then sent to the language translator.

Figure 1: System architecture. The HTTP protocol is used between the PDAs and the controller.

The PDA is linked to the server via wireless communication. The current OCR software is Windows-based, while the other components are Linux programs; the PDA itself runs Windows.

We also implemented the system for mobile phones, using the i-mode and FOMA devices provided by NTT-DoCoMo.
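The round trip between the client and the server can be sketched as follows, illustrated in Python for brevity. This is a minimal sketch only: the endpoint URL, field names, and JSON response shape are assumptions for the example, not the system's published protocol.

```python
# Hypothetical client-side exchange with the recognition/translation
# server (Figure 1). Endpoint, field names, and response format are
# assumptions; only the overall flow follows the architecture above.
import requests

SERVER = "http://translation-server.example.com/translate"  # hypothetical

def translate_signboard(image_path, start_xy, end_xy, n_best=5):
    """Send a camera image plus the two user-tapped end points of the
    target string; receive recognized words and their translations."""
    with open(image_path, "rb") as f:
        response = requests.post(
            SERVER,
            files={"image": f},
            data={
                "start_x": start_xy[0], "start_y": start_xy[1],
                "end_x": end_xy[0], "end_y": end_xy[1],
                "n_best": n_best,  # number of word sequences the user asked for
            },
            timeout=30,
        )
    response.raise_for_status()
    # e.g. [{"source": "...", "translation": "Emergency telephones"}]
    return response.json()
```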
3 Each component

3.1 Appearance-based full search OCR

Research into the recognition of characters in natural scenes has only just begun (Watanabe et al., 1998; Haritaoglu, 2001; Yang et al., 2002; Wu et al., 2004). Many conventional approaches first extract character regions and then classify them into character categories. However, these approaches often fail at the extraction stage, because many pictures are taken under less than desirable conditions such as poor lighting, shading, stains, and distortion in the natural scene. Unless the recognition target is limited to a specific kind of signboard (Wu et al., 2004), it is hard for conventional OCR techniques to achieve sufficient accuracy over a broad range of recognition targets.

To solve this difficulty, Kusachi et al. proposed a robust character classifier (Kusachi et al., 2004). The classifier uses appearance-based character reference patterns for robust matching even under poor capture conditions, and searches the most probable region to identify candidates. As full details are given in their paper (Kusachi et al., 2004), we focus here on just its characteristic performance.

Because this classifier identifies character candidates anywhere in the picture, its precision rate is quite low, i.e., it lists many wrong candidates. Figure 2 shows a typical result of this OCR: rectangles indicate candidates, and erroneous ones appear even in background regions. On the other hand, as the classifier identifies multiple candidates at the same location, it achieves high recall rates at each character position (over 80%) (Kusachi et al., 2004). Hence, if character positions are known, we can expect true characters to be ranked above wrong ones, and greater word recognition accuracy can be achieved by connecting highly ranked characters at each character position. This means that location estimation becomes important.

Figure 2: Many character candidates raised by appearance-based full search OCR. Rectangles denote regions of candidates; the picture shows that candidates are identified in background regions too.

3.2 Word recognition

Modern PDAs are equipped with styluses. The direct approach to obtaining character locations is for the user to indicate them with the stylus. However, pointing at all the locations is tiresome, so automatic estimation is needed. Completely automatic recognition leads to extraction errors, so we take a middle approach: the user specifies the beginning and end of the character string to be recognized and translated. In Figure 3, circles on both ends of the string denote the user-specified points. All the locations of characters along the target string are estimated from these two points (as shown in Figure 3) and from all the candidates (as shown in Figure 2).

Figure 3: Two circles at the ends of the string are specified by the user with the stylus. All the character locations (four locations) are automatically estimated.

3.2.1 Character locations

Once the user has input the end points, assumed to lie close to the centers of the end characters, the automatic location module determines the size and position of the characters in the string. Since the candidates have their own regions delineated by rectangles with x,y coordinates (as shown in Figure 2), the module considers all candidates and rates each arrangement of rectangles according to the differences in size and separation along the sequence of rectangles between the two ends of the string. The sequences can be identified by any of the search algorithms used in Natural Language Processing, such as forward Dynamic Programming and backward A* search (adopted in this work). The sequence with the highest score, i.e., the least total difference, is selected as the true rectangle (candidate) sequence, and the centers of its rectangles are taken as the locations of the characters in the string.
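The arrangement scoring of Section 3.2.1 can be illustrated schematically as follows. The concrete cost terms and their equal weighting are assumptions: the paper specifies only that differences in size and separation are penalized and that forward Dynamic Programming with backward A* search is used, so a plain forward DP with back-pointers stands in here for brevity.

```python
# Schematic search for character locations (Section 3.2.1): pick one
# candidate rectangle per slot so that sizes and separations vary as
# little as possible along the string. Cost terms and weights are
# assumptions, not the paper's exact formulation.
import math

def locate_characters(positions):
    """positions: one list of candidate rectangles per estimated slot
    between the two user-tapped end points; each candidate is a dict
    {"x": ..., "y": ..., "size": ...}. Returns one candidate per slot."""
    # best[i][j] = (accumulated cost, back-pointer into slot i-1)
    best = [[(0.0, None)] * len(positions[0])]
    for i in range(1, len(positions)):
        row = []
        for cur in positions[i]:
            scores = []
            for j, prev in enumerate(positions[i - 1]):
                gap = math.hypot(cur["x"] - prev["x"], cur["y"] - prev["y"])
                # penalize size changes and uneven spacing; comparing the
                # gap with the previous size is an assumption standing in
                # for the paper's separation term
                cost = (best[i - 1][j][0]
                        + abs(cur["size"] - prev["size"])
                        + abs(gap - prev["size"]))
                scores.append((cost, j))
            row.append(min(scores))
        best.append(row)
    # pick the cheapest final candidate and follow back-pointers
    i = len(positions) - 1
    j = min(range(len(best[i])), key=lambda k: best[i][k][0])
    path = []
    while j is not None:
        path.append(positions[i][j])
        j = best[i][j][1]
        i -= 1
    return list(reversed(path))
```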
3.2.2 Word search

The character locations output by the automatic location module are not taken as identifying the correct characters, because multiple character candidates remain possible at each location. We therefore identify the words in the string by the probabilities of character combinations. To increase accuracy, we consider all candidates around each estimated location and create a character matrix, an example of which is shown in Figure 4. At each location, we rank the candidates according to their OCR scores; the highest scores occupy the top row.

Figure 4: A character matrix. Character candidates are bound to each estimated location to make the matrix; bold characters are the true ones.

Next, we apply an algorithm that consists of similar-character matching, similar-word retrieval, and word-sequence search using language model scores (Nagata, 1998). The algorithm is applied from the start to the end of the string and examines all possible combinations of the characters in the matrix. At each location, the algorithm finds all words, listed in a word dictionary, that are possible given the location; that is, the first location restricts the word candidates to those that start with this character. Moreover, to counter the case in which the true character is not present in the matrix, the algorithm also retrieves words in the dictionary that contain characters similar to those in the matrix and outputs them as word candidates. The connectivity of neighboring words is represented by the probability defined by the language model. Finally, forward Dynamic Programming and backward A* search are used to find the word sequence with the highest probability. The string in Figure 3 is recognized as "非常電話." (A schematic sketch of this search is given after Section 3.3.)

3.3 Language translation

Our system currently uses the ALT-J/E translation system, a rule-based system that employs the multi-level translation method based on constructive process theory (Ikehara et al., 1991). The string in Figure 3 is translated into "Emergency telephones."

As the number of target language pairs will increase in the future, the translation component will be replaced by statistical or corpus-based translators, since they offer quicker development. By using this client-server architecture on the network, we can place many task-specific translation modules on server machines and flexibly select them task by task.
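As a schematic illustration of the word search in Section 3.2.2, the sketch below matches dictionary words against the character matrix and finds the most probable word sequence by dynamic programming over word-bigram scores. The dictionary, the bigram function, and the log-probability scoring are assumed inputs, and the similar-character fallback and the backward A* pass are omitted for brevity.

```python
# Schematic word search over the character matrix (Section 3.2.2).
# `dictionary` and `bigram` are assumed inputs; similar-character
# retrieval and the backward A* pass are omitted for brevity.
import math

def word_search(matrix, dictionary, bigram):
    """matrix: one list of candidate characters per estimated location
    (top OCR-ranked first); dictionary: set of known words;
    bigram(w1, w2): smoothed word-bigram probability (> 0), where w1
    may be the sentence-start marker "<s>".
    Returns the most probable word sequence covering the whole string."""
    n = len(matrix)
    best = {0: (0.0, [])}  # location -> (log-probability, words so far)
    for i in range(n):
        if i not in best:
            continue  # no word sequence reaches location i
        logp, words = best[i]
        prev = words[-1] if words else "<s>"
        for w in dictionary:
            # a word starting at location i must be spelled out by
            # candidates at locations i, i+1, ..., i+len(w)-1
            if i + len(w) > n:
                continue
            if all(w[k] in matrix[i + k] for k in range(len(w))):
                score = logp + math.log(bigram(prev, w))
                j = i + len(w)
                if j not in best or score > best[j][0]:
                    best[j] = (score, words + [w])
    return best.get(n, (float("-inf"), []))[1]
```

In the full system, the OCR score of each matrix entry can be added to the path score, and the backward A* pass recovers the N-best word sequences, up to the number specified by the user.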