THE PHONE READER

Submitted in partial fulfillment of the requirements of the degree of

BACHELOR OF SCIENCE (HONOURS) of Rhodes University

Michèle Marilyn Bihina Bihina

Grahamstown, South Africa

November, 2012

Abstract

The Phone Reader is an Android application that reads aloud text extracted from a photo taken with an Android mobile phone. It uses OCR to provide accurate character recognition in the image, and the Apertium translation engine to translate the extracted text. It aims to help people with reading disabilities, illiterate users and non-native speakers to hear the content of text they have difficulty reading. The system provides a user-friendly client interface that communicates with a remote server; the server processes uploaded images to extract the text contained in them.

ACM Computing Classification System

Thesis classification under the ACM Computing Classification System (1998 version, valid through 2012):

B.4.1 [Data Communications Devices]: Receivers (voice, data, image)
I.2.7 [Natural Language Processing]: Speech recognition and synthesis
I.4 [Image Processing And Computer Vision]: Image processing software
I.7.5 [Document Capture]: Graphics Recognition And Interpretation — Optical character recognition (OCR), Scanning

General Terms: Image processing, Android application, OCR (Optical Character Recognition), Text recognition, Text-To-Speech, Text translation

Acknowledgments

I would like to thank God for all the strength He has given me during this year. I would like to give my deep and sincere thanks to all the members of my family who have helped and supported me this year, especially my mother and my uncle J.V. Nkolo. I also want to thank the following persons for their support:

* My supervisor Mr James Connan for all the help he has offered me during the development of this project.

* Rhodes University and the Department of Computer Science for the opportunity to pursue my Honours degree.

* All my friends who have assisted and given me advice this year.

And finally, I would like to acknowledge the financial and technical support of Telkom, Tellabs, Stortech, Genband, Easttel, Bright Ideas 39 and THRIP through the Telkom Centre of Excellence in the Department of Computer Science at Rhodes University.

Contents

Abstract
ACM Computing Classification System
Acknowledgments

1 Introduction
  1.1 Problem statement
  1.2 Objectives
  1.3 Methodology
  1.4 Progression
  1.5 Structure of the thesis

2 Literature Review
  2.1 Introduction
  2.2 Image processing
    2.2.1 Definitions
    2.2.2 Image processing methods
  2.3 Text reading systems
    2.3.1 Mobile text reading applications requiring OCR
    2.3.2 Mobile text reading applications not requiring OCR
  2.4 Object recognition systems
    2.4.1 Systems using crowd-sourcing for object recognition
    2.4.2 Mobile applications using visual search on specific types of objects
  2.5 Text and object recognition systems offering extended functionalities
  2.6 Common tools used by object recognition and text reading mobile applications
    2.6.1 Mobile operating systems
    2.6.2 OCR
    2.6.3 Text-To-Speech Engine
  2.7 Plan of Action
  2.8 Conclusion

3 Design of the system
  3.1 Introduction
  3.2 Textual description
  3.3 System Design
    3.3.1 System architecture
    3.3.2 UML approach
  3.4 Conclusion

4 Implementation
  4.1 Introduction
  4.2 System requirements
  4.3 Description of the tools used for the system
  4.4 Code documentation
    4.4.1 Image processing techniques used with Imagemagick
    4.4.2 Java implementation of the classes of the system
  4.5 Conclusion

5 Tests and Results
  5.1 Different font sizes for the same text
  5.2 Lighting conditions
  5.3 Testing the translation accuracy

6 Conclusion
  6.1 Goals achieved by the system
  6.2 Limits of the system
  6.3 Future work
  6.4 Conclusion

A User's guide

List of Tables

1.1 Progression of the Phone Reader project
4.1 System specifications

List of Figures

3.1 Flowchart diagram of the system
3.2 Architecture of the system
3.3 Use case diagram
3.4 Class diagram on the client side
3.5 Class diagram on the server side
4.1 Function to apply Unsharp method to image
4.2 Initialization of the Camera Activity class
4.3 Function to open camera mode
4.4 Display bitmap image on phone screen
4.5 Snippet of code for the event ACTION-UP
4.6 Map a language to its code
4.7 PHP script to upload image on server
4.8 URL to download the text file
4.9 Calling the image processing methods from the main class
4.10 Calling the OCR function
4.11 Calling the translation function
4.12 Function to perform Text-To-Speech
5.1 OCR results of a text with font size 12
5.2 OCR results of a text with font size 14
5.3 OCR results of a text with font size 16
5.4 OCR results under low light
5.5 OCR results under low light using the camera flash
5.6 Result of the pre-processing of an image taken with flash activated
5.7 OCR result of a pre-processed image taken with flash activated
5.8 Representation of accurately translated words in a text
A.1 Screen 1
A.2 Screen 2
A.3 Screen 3
A.4 Screen 4
A.5 Screen 5
A.6 Screen 6
A.7 Screen 7

Chapter 1

Introduction

In today's society, mobile phones offer a wide variety of functionalities that are not always related to calling or sending messages. Those functionalities include web browsing, playing games or music, banking, taking photos and much more. The Phone Reader is an Android application that aims to allow the user to hear a text contained in a picture taken with a mobile phone. It is an application meant to help those who cannot read a text they encounter, such as non-native speakers, the visually impaired and the blind, estimated at 285 million people in 2010 by the World Health Organization [22]. This project is mainly related to image processing, used to recognize characters in an image.

1.1 Problem statement

Reading or understanding a text can at times be a challenge if it is written in a foreign language, if the reader is illiterate, or if one has reading disabilities. Providing a solution to this problem is the goal of the Phone Reader project, which aims to develop a mobile application that can read a text to the user through an Android device. To use it, the user photographs the text with his phone, chooses a language for the translation if necessary, and sends the photo to the server, which extracts the text from the photo and produces its speech.

1.2 Objectives

The Phone Reader is meant to help different types of people who are unable to read a text. The following list presents cases in which the Phone Reader can be used:

• Blind people can use it when they have a text to read.

• Non-native speakers (like tourists) can use it when they do not understand a text written in a foreign language, or when they are not sure of the right pronunciation of words.

• Illiterate or dyslexic users can use it when they have difficulty reading a text.

1.3 Methodology

The system was programmed using Android [26], which is a Linux-based operating system for mobile devices developed by the Open Handset Alliance. The phone sends requests to the Apache server by uploading photos to it. The Apache server processes the client's request by pre-processing the image sent, before extracting its text with an OCR (Optical Character Recognition) program. A TTS (Text-To-Speech) engine produces the speech on the phone after any required translation of the extracted text has been performed. The programming languages used are Java and PHP.

1.4 Progression

The following table presents the different steps that need to be accomplished in order to develop the Phone Reader:

Table 1.1: Progression of the Phone Reader project

Steps      Tasks
Step 1     Review existing technologies
Step 2     Determine system requirements
Step 3     Configure web service
Step 4     Implement and evaluate preprocessing
Step 5     Implement OCR
Step 6     Implement translation
Step 7     Implement user interface / phone client
Step 8     Implement TTS functionality on the phone
Step 9     Testing the system
Step 10    Documentation

1.5 Structure of the thesis

This thesis has six chapters and an appendix. The first chapter is the introduction. Chapter 2, the literature review, examines work related to this project: which tools have been used and which tools we could use for our system. Chapter 3 describes how the system has been designed and presents an overview of its structure. Chapter 4, the implementation, describes all the technical aspects of the Phone Reader: the system requirements and the programming aspects. Chapter 5 presents and discusses the results obtained from different tests performed with the system. Chapter 6, the conclusion, presents the system's performance and how it can be improved. The appendix contains the user's guide for the Phone Reader, which describes to the end user how to use the system.

Chapter 2

Literature Review

2.1 Introduction

In this literature review, we study and analyze different mobile applications that were designed to read text, recognize text or recognize objects in a picture and inform the user of the result of the request. This project is mainly related to image processing, in order to extract text from images. Section 2.2 of this chapter investigates image processing, and Section 2.3 presents applications that can read text from an image. Section 2.4 looks at systems that recognize objects in an image. Section 2.5 presents systems recognizing objects with extended functionalities, followed by Section 2.6, which describes the tools needed and used to create the applications presented previously; the last section, before we conclude, defines a plan of action.

2.2 Image processing

Image processing is the field of study related to the Phone Reader project. In this section, we give a brief overview of image processing.

2.2.1 Definitions

Image processing consists of converting an image into a digital form [6], and then performing operations on it such as extracting its content, or the information in it. It is also used for

object recognition. A digital image is an array of square picture elements, or pixels, arranged in columns and rows [25]. There are colour images, grayscale images and binary images. Colour images can be converted to grayscale in order to facilitate the extraction of some information within the image. A grayscale image is an 8-bit image, in which each pixel has an assigned intensity between 0 (black) and 255 (white). A binary image is an image in which pixels can only have two values: black (0) or white (1). The most common image formats are GIF, JPEG, TIFF, PNG, PS, and PSD.

2.2.2 Image processing methods

There are two types of methods used for image processing [6]:

- Analog image processing or visual techniques of image processing: used for printouts and photographs.

- Digital image processing: processing digital images by using a computer. This technique includes three phases for processing images: pre-processing, enhancement and display, and information extraction. Let us briefly define each of these phases [5]:

* Image pre-processing, or image restoration, consists of correcting the image for different errors, noise and geometric distortions.

* Image enhancement improves the visual aspect of the image, after the correction of errors, to facilitate the perception or interpretability of information in the image.

* Information extraction utilizes the computer's decision-making capability to identify and extract specific pieces of information or pixels.

The different image processing techniques used in the Phone Reader Project help in extracting the text contained in the image taken by the user.

2.3 Text reading systems

This section presents several mobile applications able to read text from an image. Some of them require OCR (Optical Character Recognition), and others do not.

2.3.1 Mobile text reading applications requiring OCR

The Phone Reader Application, a system designed for the blind and visually impaired, was implemented in 2009 by Computer Science Honours students from the University of the Western Cape [13]. This application [14] runs on Android mobile phones with embedded wireless Internet connections, and requires a wireless router to manage the session between the phone and the server, as the Phone Reader server application resides on a server computer. A client-server architecture is used for this system; it is considered to allow faster processing and to produce better translation results through the use of online dictionaries. When the user takes a photo of a text, that photo is processed with MODI (Microsoft Office Document Imaging) on the remote server. Afterwards, the TTS engine reads the extracted text from the server through the phone's speaker. However, the quality of the image has to be very good in order for the system to produce accurate results. This means that before the image is sent to the Optical Character Recognition (OCR) program, it must already be of good quality, otherwise the OCR will not be able to extract the content efficiently. This is a weakness of the system because, as it was created for the blind and visually impaired, it should take into consideration the fact that those users cannot always manage to take good images: taking good images means being aware of lighting conditions and of the best position for the camera. In her paper [17], Agnes Kukulska highlights the possibility of using this type of "phone reader application" to learn a language. She discusses how mobile applications can contribute to language learning with systems like Captura Talk (www.capturatalk.com), an Android application designed in the UK in 2008. She explains that this application enables the user to take a photo of a text and hear the text being read back to him; it is therefore the type of application that can be used by people learning English. This application uses a commercial


OCR (the ABBYY Mobile OCR Engine 3.00) and a Text-To-Speech (TTS) engine. It can also recognize and translate text in more than twenty languages, and includes a talking word processor. Similarly, the National Federation of the Blind (USA) and Kurzweil Technologies created a mobile application for the blind and dyslexic, the knfbREADER [29], in 2008. However, this application and Captura Talk are not identical; the knfbREADER does not include the talking word processor, but enables the user to enlarge the text for a better view. It is integrated in the Nokia N82 cell phone and runs on the Symbian operating system. The images are processed with Kurzweil image processing software. According to Dr. Marc Maurer, President of the National Federation of the Blind, this was the first cell phone able to read text from a picture taken by the user, and it will promote equal opportunity for the blind. These two systems have many features and provide good quality images, but are both very expensive. Contrary to the software described previously, R-MAP [28] is a low-cost Android system especially designed for the blind and visually impaired that facilitates their use of mobile phones to read text. This application does not require an Internet connection because it uses an OCR program and a text-to-speech engine integrated in the Android mobile phone designed for blind people, and it does not offer translation of texts. The paper [28] contains a very rich list of related works, ranging from Braille to the knfbREADER. It also lists a large number of OCRs: open source OCRs such as GOCR, OCRAD, or Tesseract, and commercial OCRs like ABBYY FineReader or Microsoft Office Document Imaging, also used for the Phone Reader Application [14] mentioned earlier in this section. The cost of very powerful mobile applications for the visually impaired, like the knfbREADER, is a concern raised in this paper: not all visually impaired people can afford such an expensive application. The R-MAP system provides a user-friendly interface that is easy to handle for blind users: the buttons to access the different functionalities of the software are located at the corners of the screen and are therefore easy to access. Also, if the text in the picture taken is not readable, the system informs the user, who can make another attempt. The system uses the Tesseract OCR and the TTS software available on the Android mobile phone. However, this application does not seem to be widely known or used, despite its low cost.

In this subsection, we have reviewed mobile systems that can produce accurate audible results from an image containing text by using an OCR program. There are different OCR programs: some are free and open source; others are commercial and can recognize text in a wide variety of languages. Using a commercial OCR program results in a costly application.

2.3.2 Mobile text reading applications not requiring OCR

The Trinetra Grocery shopping assistant [20] assists blind users when they are doing their grocery shopping, so that their phone can tell them which item is on the shelf in the store, based on product-level identification. The product is identified with the Barcoda Pencil, then its barcode is sent via Bluetooth to the Trinetra system, which reads out loud the information found on the product (brand and name) to the user. The Trinetra system is meant to be cost-effective for the blind user. The user needs a Barcoda Pencil, a Bluetooth headset, and the Nokia 6620 cell phone. This system does not use an OCR program because no pictures are taken; it uses the TALKS TTS program to produce the speech for the user. One limit of this system is the fact that the user has to carry two accessories besides his cell phone to use the system, and also has to manage to locate the barcode on each product. Moreover, Trinetra also offers a currency identifier system [20]. It consists of taking a picture of a US currency note and hearing its value. This system uses the phone's camera to take the photo of the currency note and Microsoft's online Lincoln object-recognition technology to identify it. Many different images of US currency notes need to be stored in the system's database for it to compare the image sent by the user to the images contained in the database. When a currency note is identified, the TALKS TTS reads the result out loud to the blind user. Once again, no OCR is required. In this subsection, we can see that extracting readable information from images does not necessarily require the use of an Optical Character Recognition (OCR) program. Instead, data about an object can be stored in and retrieved from a database.

2.4 Object recognition systems

This section presents systems that provide object recognition with different techniques.

2.4.1 Systems using crowd-sourcing for object recognition

Some systems like Vizwiz [3] or CrowdSearch [31] use crowd-sourcing to improve object recognition from images:

• Vizwiz is a mobile application for blind users that runs on the iPhone. It sends a picture taken with the phone to a remote server along with a question about the content of the image; the answers are given by people specifically employed for that task. The user can then hear the answer to his question. Sometimes receiving that answer can take a few minutes, which can be annoying for the user. However, this method provides more accurate results, as the answers are not produced by a computer but are verified by humans.

• In the paper about CrowdSearch [31], the author states that Google Goggles does not work with all types of images, because of the high error rate that can result from a poor quality image taken with a phone, and that it was mainly designed to recognize building landmarks. CrowdSearch is an image search system whose results are validated by humans working on the query image. According to the author, local image search on mobile phones can only be efficient energy-wise, but it limits the accuracy of the results, whereas remote processing on powerful servers is more accurate and fast. This application was designed for the Apple iPhone, and each query sent by the user has a cost and a defined deadline, and is validated by human validators after the automated image search. As this author explains, it might be hard to obtain very accurate results if the search is done locally because of the limited size of the database located on the phone. The author does emphasize that CrowdSearch is a costly system and that human validation may increase the delay of the results, but it gives more accurate results.

Crowd-sourcing is a method that requires human intervention to process a request from a user, but it provides better results because the answers are not machine-generated. The user is under the impression that he is in a conversation with another person.

2.4.2 Mobile applications using visual search on specific types of objects

There are systems that offer visual search on some specific types of objects. This means that the user takes a photo of an object and the system will find information on it. But the results are not always audible outputs.

• Kooaba (www.kooaba.com) is an image recognition system that runs on the iPhone. With it, a user can obtain information on a picture he has taken with his camera phone. It is mostly designed to recognize prints (newspapers, brochures), paintings, products (books, CDs), and places. A huge database is used, containing over 28 million items for object recognition.

• Google Goggles is an image search system [30] that can scan for text contained in a picture using OCR. It can translate text in a few languages [11], but it does not automatically read the extracted text out to the user. Google Goggles' limited range of images searched is mentioned in papers like the one about CrowdSearch [31] and the one about mobile image recognition [30], which both explain that this application was not designed to recognize any type of object, but mostly landmarks or products.

• SnapTell (www.snaptell.com) is a system mentioned in the article about mobile image recognition [15] that runs on Android and iPhone. It was designed for book and CD cover recognition, and is based on the same principle as Kooaba.

Visual search techniques can be used by mobile applications and can provide accurate results for a user request on a specific object. However, those applications are not always well suited for people with reading disabilities, and without the help of a translation program they can be useless for non-native speakers, although still useful for illiterate users.

2.5 Text and object recognition systems offering extended functionalities

Some systems allow the recognition of text and objects from images, and provide further functionalities beyond translation. As mentioned earlier, the paper [15] highlights the fact that Google Goggles can mostly recognize landmarks and logos, and that camera phones generally produce low quality images, contrary to the view taken in the Mobile Product Recognition paper [30]. The mobile image recognition system presented in this paper captures and recognizes video frames of a document (mostly newspapers), not snapshots of images, enabling the system to process different frames of the same video until one is recognized. The video concerns the same document captured from different angles. It does not use a client-server architecture, because the author believes that that type of architecture causes delays in responding to the user; instead it uses a client-only architecture in which the database remains on the phone and is periodically updated from the database generator. The size of the database is computed by estimating the number of newspaper pages a user might want to read daily, and in this case it reached about 10.5 MB. Once an image is recognized, a thumbnail of the document is displayed with "hot spots", each linked to some electronic content. The user can then access that electronic content by clicking on a "hot spot". The image recognition algorithm used in this system processes the image to improve its low quality. The recognition process takes about 6 seconds. This idea sounds original, but does not seem very useful, as the output is not read to the user and there are no translations available. That makes it unusable for visually impaired people or for non-native speakers, so it does not seem to be useful for a wide range of users. Only people who can see and read can use this system, and if a user has access to the Internet, as this system requires, and needs information on a newspaper, the easiest way would be to open that newspaper's website and click on each interesting link. The necessity of this system is not clearly explained in the paper. Like the Trinetra Grocery shopping assistant [20], the research article [19] presents a system that only provides an audible output, and it describes the knfbREADER as a system having a high level of accuracy for text recognition only for documents without many

colours or images, and the AdvantEdge Reader as an efficient portable scanning and reading device only for flat documents. The device presented in this paper is called the Multifunctional assistant for the blind; it is a PDA phone (a personal organizer assistant and cellphone combined in one device [23]) that uses a multi-functionality system including an in-house optical character recognition (OCR) engine, audio message recording, object recognition, banknote recognition, colour recognition and the ability to listen to audio speeches or recordings. Depending on the user's level of vision, the system offers different designs of user interfaces. The only output is a synthetic voice used to guide the user while he is using the system. To recognize objects, the authors propose tag-based object recognition, which enables the user to stick a dedicated tag on an object, take a photo of the tag, and record a description of that object. Hence, each time a photo of this object is taken, the system plays the previously recorded message to the blind user. The banknote recognition in this system is made for the Euro currency, but the author mentions the possibility of easily extending the system to recognize other currencies. Once a picture of a banknote is taken, the system starts by locating the "region of interest", which is the zone on the banknote where its value can be read; this region is the tag in the case of object recognition. The image is then processed and only the numbers representing the value of the banknote are extracted and read to the user. The colour recognition is also done by processing a picture taken and identifying the main colour of the object. But the problem of the quality of pictures taken with a camera phone remains, because OCR engines are meant to work with scanned documents; moreover, this system can only process 1.3-megapixel images. The approach aims to use specific tags for similar-by-touch objects, but this can cause a problem when the system needs to differentiate them and verify whether a tag already has an accompanying description. The number of tags required for the classification of similar-by-touch objects can be very large, which means that the user might have to buy them regularly. Contrary to the previous authors, the author of the Mobile Product Recognition (MPR) paper [30] considers the camera phone a very efficient tool for image processing and describes image retrieval systems like Google Goggles, SnapTell, or Kooaba as too slow to process a query compared to his system. It is a system similar to Google Goggles, a mobile visual search system as defined in the paper, meaning that when an image is taken by

the user, the system seeks information about it on a remote server using a wireless Internet connection. The MPR system uses a set of functionalities such as image feature compression to accelerate the query processing time, and completes queries in less than 2 seconds. The system does not send the whole query image to the server, but just its smaller compressed features. The image processing is implemented on the phone and the query is processed by a search in the database. However, this author's view of the camera phone as an efficient tool for image processing is not shared by the other authors; all the previous authors take the opposite view, and even a Science Daily article [27] describes the camera phone as a tool that is not efficient enough to take good quality images. In conclusion, reading a text from an image can be extended to seeking information about that image: it can be the colour of an object, the value of a banknote or even a special description assigned by the user to the object or image. Some applications do not work with snapshots from a camera, but with video frames, in order to find the best frame to process.

2.6 Common tools used by object recognition and text reading mobile applications

To develop a mobile application that will process an image and then read the content to the user, we have noticed that certain essential tools are required.

2.6.1 Mobile operating systems

• Android: Android [26] is an operating system for mobile devices developed by Google. It is an open, Linux-based development platform. It was originally designed to support applications written in Java, but with the Android NDK (Native Development Kit) [1], applications can also be written in C or C++. A Text-to-Speech engine is also provided with Android mobile phones [28]. Android applications can be developed using Windows, Mac OS and Linux systems [10], so no special hardware is required. The

most recommended IDE for developing Android apps is Eclipse. Dalvik is the virtual machine environment for mobile devices created by Google, and when projects are compiled, each application runs on its own VM, not on the Java VM.

• iOS: iOS is the mobile operating system created by Apple that has run on the iPhone since 2007. Its development requires Macintosh computers running Mac OS X [10]. iOS apps are written in Objective-C with the modern IDE Xcode, which is quite similar to NetBeans, Eclipse, or Visual Studio.

2.6.2 OCR

An Optical Character Recognition (OCR) program is a program that converts an image containing text into an editable format [21]. It can take as input different document formats such as PDF, PNG or TIF, and produces an editable document as output in various formats such as TXT, DOC, or PDF. In order to extract the text from a scanned document or a digital image, the OCR performs the following steps: after loading the input image, the OCR detects features such as the resolution, inversion, and font size. Each OCR expects the image to have a predefined background or foreground colour (most often black and white). A deskewing algorithm can also be applied when necessary. Finally, a page layout analysis is performed to detect the position of important areas in the image.

2.6.3 Text-To-Speech Engine

Text-to-speech software, or a speech synthesis program, is used to read a text out loud [2] to a user. The engine included in Android 1.6 can read texts in English, French, German, Italian and Spanish. A TTS engine uses natural human voices to avoid sounding like a robot while reading a text to a user. There are also many different free TTS engines available online with which the user can choose the voice he prefers the text to be read with.

2.7 Plan of Action

To implement this Phone Reader mobile application, we will perform the following tasks:

1. Develop a web server on which a user can upload images that need to be processed.

2. Pre-process the images that have been uploaded in order for the OCR to recognize words more easily.

3. Evaluate the pre-processing of images.

4. Send the image to the OCR for text extraction.

5. Implement a user interface for the client.

6. Perform the translation of the extracted text, if required by the user.

7. Connect the application to Text-To-Speech software to read the obtained text to the user.

8. Test and improve the application.

9. Write the documentation of the project.

2.8 Conclusion

We have reviewed different mobile applications designed to help users with reading disabilities when they come across a text they cannot read. The Phone Reader project is therefore not a brand new topic; related systems have been implemented since 2008 by major companies whose software is expensive. A few other systems have been implemented by researchers outside big companies, but they are not widely used. Nevertheless, those applications designed for the visually impaired can also be used by non-native speakers when they include a translation functionality, and by illiterate users if they generate audible results. Furthermore, it has been noticed that camera phones often take low quality images, which are hard for an OCR to process efficiently. This is the reason why, to implement the Phone Reader application, we will mainly have to work on the image processing aspect, so that the OCR

will easily recognize characters from the text in the processed image. Thus the TTS engine will be able to give more accurate results.

Chapter 3

Design of the system

3.1 Introduction

In this chapter, we analyze the Phone Reader in order to produce the final design of the system. This design is used for the implementation of the system. We have used an object-oriented approach.

3.2 Textual description

The Phone Reader application is composed of a client side and a server side. The client side runs on an Android device; the server side is an HTTP remote server that processes images sent by the client and generates a readable text file from them.

1. Client side: The user takes a picture with his mobile device. He can save or erase the image, or take another photo. Once the image is saved, the system allows the user to look at the image he just took and highlight a desired region of interest: the user makes a rectangular selection of the particular region in the image that he wants to be read. This step is followed by the selection of a language for the translation, and finally the user sends the image to the server.

2. Server side: Once an image is uploaded to the server, it is processed using Imagemagick commands. After the processing of the image, the text can be extracted from it by performing OCR. The OCR engine used is Tesseract OCR, which generates a text file. That text file contains the text extracted from the image, and it is then used for translation. The translation program used is Apertium, which produces a final text file ready to be read and downloaded by the client. The translation is skipped if the user has not selected a language for the translation; in this case, only the text produced by the OCR is downloaded.

3. Client side: After the text has been translated on the server side, the client automatically downloads it, saves it and displays its content on the phone screen. Simultaneously, the text is read by the Text-To-Speech program, Android Text-To-Speech.

Each of the steps described above is represented in Figure 3.1.

Figure 3.1: Flowchart diagram of the system

3.3 System Design

This section presents the design of the system.

3.3.1 System architecture

The architecture of the system is represented in Figure 3.2, and presents the succession of processes in the system.

Figure 3.2: Architecture of the system

Legend

1. The picture is taken and stored in the phone.

2. The language is chosen by the user.

3. The photo is uploaded to the server and the language is saved in the server’s database.

4. The server retrieves the uploaded image.

19 5. The server retrieves the language saved in the database.

6. The image is sent for pre-processing with Imagemagick processing tool.

7. The processed image goes through the Optical Character Recognition (OCR) program, Tesseract OCR, which puts the extracted text in a text file.

8. The text file produced by the OCR step is translated into the specified language; if no language was specified, no translation is performed.

9. The final text, with or without translation, is produced and stored on the server.

10. The text is made available for download.

11. The client downloads the final text from the server.

12. The device reads the downloaded text out loud via Android Text To Speech in the selected language.

3.3.2 UML approach

Some UML diagrams were designed to give a better overview of the different parts of the system.

Use case diagram

Figure 3.3 represents the use case diagram of the Phone Reader. The system offers two main functionalities to the end user: take a photo and listen to the text. The first functionality triggers all the others and requires the user to also select a language; it offers the option to select a region of interest within the image, but this option can be ignored.

Class diagram

There are two class diagrams for this application, one for the client (Figure 3.4) and one for the server (Figure 3.5). Each of them represents different operations.

Figure 3.3: Use case diagram

• Client side

In this diagram (Figure 3.4), five classes are designed: TakePhoto, SeeImage, ChooseLanguage, Validation, and TextToSpeech. Each of them represents a different screen of the client interface. The initial event take a photo in the class TakePhoto triggers the process of uploading the image and sending its data (language, region of interest) to the remote server. Class SeeImage displays on the phone the image that the user has taken, and allows him to select a region of interest. Class ChooseLanguage contains the list of languages that the user can select. Class Validation retrieves all the selections made by the user at the previous stages and communicates with the server: it uploads the image and its data and waits for the final text to be ready before downloading it. Class TextToSpeech reads the content of the file downloaded at the previous step.

Figure 3.4: Class diagram on the client side

• Server side

There are six classes in the following diagram: Image, FileManager, Processing, DatabaseConnector, Translation, and Tesseract. Each of them interacts with the class Image, except the class Translation, which is called by the Tesseract class. The following is an explanation of the diagram:

* Class Image is an abstract class; it represents one record in the database. Each image created is referenced by its path.

* FileManager moves the image between directories. Each process works in a specific directory.

Figure 3.5: Class diagram on the server side

* DatabaseConnector saves the data associated with the uploaded image, and allows the interaction between the different classes and the database.

* Processing processes the image and produces a new one.

* Tesseract performs the OCR on the new image and generates a text file.

* Translation translates the text produced by the OCR.

3.4 Conclusion

The design of the Phone Reader application has been presented in this chapter. The object oriented approach has allowed us to design all the necessary classes that will be implemented for the interaction of the different components of the system.

Chapter 4

Implementation

4.1 Introduction

In this chapter, we present the technical and programming aspects of the Phone Reader application. The system requires a client interface and a server to process the uploaded images sent by the phone.

4.2 System requirements

The PhR (Phone Reader) system was developed on Ubuntu 11.04 and requires at least 8 GB of disk space and a minimum of 1.5 GB of RAM. Table 4.1 presents the other system specifications.

Table 4.1: System specifications

Operating systems           Ubuntu 11.04; Android 2.1
Programming languages       PHP 5.0; Java (OpenJDK 6 Runtime)
HTTP server                 LAMP server
Image processing software   Imagemagick 6.8.0-2
OCR program                 Tesseract OCR 2.04-2.1
Translation program         Apertium 3.1.0-1.2
Text To Speech program      Android Text To Speech
Database                    MySQL
Mobile devices              HTC Desire C with Android 2.1; Samsung Galaxy Tab GT-P1000 with Android 2.2

4.3 Description of the tools used for the system

The tools used for this system are Linux-based software, free and open source.

• Apertium: Apertium is used for the translation part of the system because of its translation speed. The translation is performed directly on the server, not on a remote server as with some web-based translation APIs. Apertium is a free and open source machine translation platform [7], also described as a rule-based machine translation program, available in Ubuntu. The project was launched in 2004. It was initially designed to handle pairs of closely related languages; the original two language pairs were Spanish-Catalan and Galician-Spanish. The rest of the language pairs have been developed with academia, as research projects by students; some of them are French-Catalan, English-Spanish, English-Esperanto, English-Catalan, Swedish-Danish and Breton-French.

• Imagemagick

Imagemagick [16] is free and open source software designed to process bitmap images. It is distributed under the Apache 2.0 license. It is very easy to use, even for beginners with very little knowledge of image processing: the user of an Imagemagick command does not have to know all its details in order to execute it, as low-level command details are hidden from the end user. Commands are simple to type and easy to understand. A bitmap image is an image made of dots or pixels [24], each representing a specific intensity value (a pixel value), in other words a colour. Imagemagick can read and write over 100 image formats such as GIF, JPEG, PDF, PNG, PostScript, SVG, or TIF.

• Android: As mentioned in the second chapter, Android is a free and open source operating system for mobile devices, running on a Linux kernel and owned by Google. Android provides various applications written in the Java programming language. This operating system includes a set of core libraries [26] that provides most of the functionality available in the core libraries of the Java programming language. In order to develop Android applications, developers use the Android Software Development Kit (SDK). It provides all the necessary tools to write, compile and run an Android application with or without a connected mobile device, as the emulator emulates an Android mobile phone. Once the SDK is installed, it is simple to use with the Eclipse IDE; in our case, we used Eclipse Indigo.

• Tesseract OCR: Tesseract OCR is an open source OCR engine [18], first developed as a PhD research project at HP Labs between 1984 and 1994. Its appearance in the 1995 University of Nevada, Las Vegas Annual Test of OCR Accuracy showed good results, and it was released by HP as open source in 2005. Tesseract assumes that its input is a binary image with optional polygonal text regions defined. It is able to detect inverse text and recognize it as easily as black-on-white text; it was also one of the first engines able to handle white-on-black text. It uses a line finding algorithm to recognize skewed pages and another one to resolve fuzzy spaces. It also uses an adaptive classifier to

recognize different fonts. Once installed from the Ubuntu repository, Tesseract is very easy to use. The current version can support 40 languages [4] such as: Arabic, English, Bulgarian, French, Italian, Spanish, Russian, Japanese and Korean.

• Android Text To Speech: This is the text-to-speech engine [2] provided by Android. It converts an input text to an output speech in different languages. The following are some of the languages available in Android Text To Speech: English, French, Spanish, Arabic, Italian, Japanese, Chinese, and Korean. Thus the text-to-speech is performed directly on the client side, once the final text produced on the server has been downloaded to the phone.

4.4 Code documentation

This section explains how the system was programmed by describing which methods were used for the pre-processing of images, and how the client interacts with the server.

4.4.1 Image processing techniques used with Imagemagick

The following operations [12] were applied to the original image to improve the OCR accuracy:

• Crop: This option is used when the user provides new dimensions for the image that needs to be processed. He does this by selecting a region of interest, on the mobile device screen, in the image he has just taken. To highlight the region of interest, the user slides his finger on the screen of the device from the top left corner of the region of interest to its bottom right corner; this creates a red rectangle on the phone screen. The coordinates of this rectangle are scaled to the real image size and sent to the server. Cropping an image consists of reducing its size by removing unwanted parts of it. In our case, the red rectangle produces the new image, which is just a part of the original image; the processing operations are then performed on this new, reduced image. With Imagemagick, cropping an image consists of using the method -crop followed by

a geometry, which contains the dimensions of the new image. The geometry used is widthxheight+x+y, where width and height are the dimensions of the cropped image, and x and y are the coordinates of the top left corner of the rectangle on the original image.

• Despeckle: In a scanned image, there can be some extra pixels present due to the imperfection of the scanner or of the camera. The process of removing them in order to improve the sharpness of the image is called despeckling. This technique is also known as noise removal. This method was easy to use as it did not require any geometry, but only the name and path of the image that needed to be cleaned.

• Grayscale: This technique consists of converting a colour image to a black and white image with different shades of grey. In our system, the text in the image should be easily identified; therefore, grayscaling the image facilitates edge detection to identify characters within the image. To grayscale the image, we only had to use the method -colorspace followed by Gray, and Imagemagick automatically converts the image to grayscale.

• Threshold: This technique consists of assigning only two colours to a grayscale image in order to eliminate all the grey shades and obtain a monochrome black and white image. It is also known as average dithering. A threshold value is chosen, and any pixel that has a value smaller than the threshold becomes black, while all pixels with a value greater than the threshold become white. With this method, the background is set to a unique colour different from the foreground, which allows better object detection within the image. The threshold chosen in our system is set to 45%, which corresponds to 115 on the 0-255 scale. This means that all pixels with a value less than 115 on the grayscale image get a value of 0 (black) and the rest get a value of 255 (white). This value was chosen after applying many tests with different threshold values to the grayscale image. We noticed that the quality of the images taken with the camera was very often the same and that the grayscaling technique produced similar results; therefore, a fixed threshold could be used for all images.

• Deskew: This method consists of realigning the objects in the image by removing skew. Skew can occur if the text or the camera was misaligned when the photo was taken. Holding the camera at exactly the same position or angle as the text we want to capture can be difficult; therefore, processing the image must include a step to realign the picture. We used a threshold value of 40% to straighten the image.

• Converting to TIF image: After performing these various transformations on the original image, its original JPEG (Joint Photographic Experts Group) format is changed to TIFF (Tagged Image File Format) because the OCR engine Tesseract processes TIF images. It is a versatile format that supports many compression options and allows the exchange of bitmaps between applications. Converting the original image to TIF only required using the convert command with the name of the original image and the name of the output image with a TIF extension.

• Unsharp: This technique consists of sharpening an image without increasing noise or blemishes [9]. The edges of the image are sharpened to make them easier to detect. Our system sharpens and enhances the edges of the characters in the image, so that words can easily be identified. The values assigned to the unsharp mask filter affect the quality of the output image. We used -unsharp with the following parameters, chosen after performing several tests: 1.5x1.2+1.5+0.1. The radius of the Gaussian operator is 1.5; 1.2 is the sigma, the standard deviation of the Gaussian; 1.5 is the amount, the fraction of the difference between the input image and the blurred one that is added back to the input image; and the last parameter, 0.1, is the threshold.

• Brightness-contrast: This option [8] is used to increase or decrease the brightness and the contrast within an image. The values chosen for this method are relatively low, so that the image is not totally transformed. The modification is noticeable if the image is too dark or too bright, but for an already clear image it may seem nonexistent, which is the point: there is no need to adjust an image that does not require adjustment. This method requires two parameters specifying how much brightness and contrast to apply, in the form brightnessxcontrast,

so we used -brightness-contrast 5x12. These values can vary between -100 and 100; if the contrast value is 0, no contrast change is applied to the image. A condensed sketch of how this whole pre-processing pipeline could be invoked from the server code is given after Figure 4.1.

Figure 4.1: Function to apply Unsharp method to image
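The code in Figure 4.1 is not reproduced in this text. As an illustration only, the following is a minimal sketch of how the pre-processing steps described above could be chained from the server-side Java code by invoking ImageMagick's convert command. The class name ImageProcessor, the method preprocess and the working paths are hypothetical; the geometry values are the ones quoted in this section, and the ordering simply follows the list above.

    import java.io.IOException;
    import java.util.Arrays;

    // Hypothetical helper illustrating the ImageMagick pipeline of Section 4.4.1.
    public class ImageProcessor {

        // Runs a single ImageMagick "convert" invocation and waits for it to finish.
        private static void run(String... args) throws IOException, InterruptedException {
            Process p = new ProcessBuilder(args).inheritIO().start();
            if (p.waitFor() != 0) {
                throw new IOException("convert failed: " + Arrays.toString(args));
            }
        }

        // Applies crop (only if a region of interest was given), despeckle, grayscale,
        // threshold, deskew, unsharp and brightness-contrast, writing a TIF for Tesseract.
        public static void preprocess(String input, String outputTif,
                                      int w, int h, int x, int y) throws Exception {
            String source = input;
            if (w > 0 && h > 0) {                          // crop to the region of interest
                String cropped = "/tmp/phr_cropped.jpg";   // assumed working path
                run("convert", input, "-crop", w + "x" + h + "+" + x + "+" + y, cropped);
                source = cropped;
            }
            run("convert", source,
                    "-despeckle",                          // remove isolated noise pixels
                    "-colorspace", "Gray",                 // grayscale conversion
                    "-threshold", "45%",                   // binarise around intensity 115
                    "-deskew", "40%",                      // straighten a misaligned photo
                    "-unsharp", "1.5x1.2+1.5+0.1",         // sharpen character edges
                    "-brightness-contrast", "5x12",        // mild brightness/contrast boost
                    outputTif);                            // .tif output expected by Tesseract
        }
    }

For example, a call such as ImageProcessor.preprocess("/var/phr/photoPHR.jpeg", "/var/phr/processed.tif", 0, 0, 0, 0) would skip the crop step because no region of interest was selected.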

4.4.2 Java implementation of the classes of the system

This section presents the programming aspect of the Phone Reader. It uses an object-oriented programming model in which an object is an instance of a class; and those objects interact with each other to perform a given task.

• Take a photo: The Activity class TakePhotoPHR implements the photo capture. Once the button Click to take a photo is clicked, the system opens the camera mode of the mobile phone and creates a specific directory on the SD card, named "PHR", where the photo taken will be stored. All the files in this directory are removed each time a new photo is saved. The photo taken is given a specific name, "photoPHR.jpeg", when it is stored. Figure 4.2 shows the initialization of the Camera Activity class, and Figure 4.3 shows the function to open the camera mode; an illustrative sketch of this flow follows the two figure captions below.

Figure 4.2: Initialization of the Camera Activity class

Figure 4.3: Function to open camera mode
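The code in Figures 4.2 and 4.3 is not reproduced in this text. As an illustration only, the following is a minimal sketch of launching the camera and saving the photo under the PHR folder described above. The request code constant and the use of the standard ACTION_IMAGE_CAPTURE intent are assumptions; the original class may capture the photo differently.

    import java.io.File;
    import android.app.Activity;
    import android.content.Intent;
    import android.net.Uri;
    import android.os.Bundle;
    import android.os.Environment;
    import android.provider.MediaStore;

    // Hypothetical sketch of the TakePhotoPHR behaviour described above.
    public class TakePhotoPHR extends Activity {
        private static final int TAKE_PHOTO_REQUEST = 1;   // assumed request code

        @Override
        protected void onCreate(Bundle savedInstanceState) {
            super.onCreate(savedInstanceState);
            openCamera();
        }

        // Creates the PHR directory on the SD card, clears it, and opens the camera.
        private void openCamera() {
            File dir = new File(Environment.getExternalStorageDirectory(), "PHR");
            dir.mkdirs();
            File[] previous = dir.listFiles();
            if (previous != null) {
                for (File old : previous) {
                    old.delete();                           // remove previous photos
                }
            }
            File photo = new File(dir, "photoPHR.jpeg");
            Intent intent = new Intent(MediaStore.ACTION_IMAGE_CAPTURE);
            intent.putExtra(MediaStore.EXTRA_OUTPUT, Uri.fromFile(photo));
            startActivityForResult(intent, TAKE_PHOTO_REQUEST);
        }
    }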

• Select Region of Interest: Once the photo has been stored, it is automatically displayed on the phone screen, through an ImageView, as a bitmap image in the Activity class SeeImageActivity (source code in Figure 4.4). A region of interest (ROI) can then be highlighted on the image: the user draws half a rectangle on the phone screen by sliding his finger from the top left corner of the ROI to its bottom right corner. To implement this option, the Android function onTouch was used with the events ACTION_DOWN and ACTION_UP. ACTION_DOWN saves the coordinates of the point, within the ImageView, in a variable when the user puts his finger on the screen: it represents the top left corner of the ROI. ACTION_UP saves the coordinates of the point when the user lifts his finger from the screen: it represents the

bottom right corner of the ROI. These two points are used to compute the width and height of the ROI and to draw the rectangle on the screen with the Android function onDraw. A very important aspect to consider in this step is the orientation of the camera: the rectangle height and width differ depending on the orientation of the camera. The coordinates on the phone screen (the ImageView size) do not match the coordinates on the real-size image (the image resolution), so they have to be mapped by computing a ratio: dividing the image resolution by the ImageView size. That ratio is used to compute the size of the ROI on the real-size image. The dimensions of the image and the coordinates of the top left corner of the ROI are sent to the server to crop the image (an illustrative sketch follows Figure 4.5 below).

Figure 4.4: Display bitmap image on phone screen

Figure 4.5: Snippet of code for the event ACTION-UP
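The snippet in Figure 4.5 is not reproduced in this text. The following is a minimal sketch, under assumed names, of the touch handling and coordinate scaling described above. The original SeeImageActivity draws the red rectangle through onDraw in a similar way; the field names and the scaling helper here are illustrative only.

    import android.content.Context;
    import android.graphics.Canvas;
    import android.graphics.Color;
    import android.graphics.Paint;
    import android.view.MotionEvent;
    import android.view.View;

    // Hypothetical sketch of the region-of-interest selection described above.
    public class RoiOverlay extends View implements View.OnTouchListener {
        private float startX, startY, endX, endY;          // ROI corners on the screen
        private final Paint paint = new Paint();

        public RoiOverlay(Context context) {
            super(context);
            paint.setColor(Color.RED);
            paint.setStyle(Paint.Style.STROKE);
            setOnTouchListener(this);
        }

        @Override
        public boolean onTouch(View v, MotionEvent event) {
            switch (event.getAction()) {
                case MotionEvent.ACTION_DOWN:               // top left corner of the ROI
                    startX = event.getX();
                    startY = event.getY();
                    return true;
                case MotionEvent.ACTION_UP:                 // bottom right corner of the ROI
                    endX = event.getX();
                    endY = event.getY();
                    invalidate();                           // triggers onDraw below
                    return true;
            }
            return false;
        }

        @Override
        protected void onDraw(Canvas canvas) {
            super.onDraw(canvas);
            canvas.drawRect(startX, startY, endX, endY, paint);   // red selection rectangle
        }

        // Maps screen coordinates to real-image coordinates using the resolution ratio.
        public int[] toImageCoordinates(int imageWidth, int imageHeight) {
            float ratioX = (float) imageWidth / getWidth();
            float ratioY = (float) imageHeight / getHeight();
            return new int[] {
                    Math.round(startX * ratioX),            // x of top left corner
                    Math.round(startY * ratioY),            // y of top left corner
                    Math.round((endX - startX) * ratioX),   // ROI width on the real image
                    Math.round((endY - startY) * ratioY)    // ROI height on the real image
            };
        }
    }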

• Choose language: The Activity class ChooseLangActivity implements the selection of the language by the user. Language pairs are items of a list displayed on the screen. The first item of the list is "None"; if it is selected, it means that the user does not want any translation, and the text will be read in the default language, English. Each item selected by the user is sent to the server's database. If a language pair contains two identical languages, it also means that no translation is required, and the system will read the text in the selected language. But if the languages in a pair are different, the first language is the language of the original text, and the second one is the language for the translation; for example, the language pair "French-Spanish" means that the text needs to be translated from French to Spanish. Once a user clicks on a language pair, the item is split and each language is kept in a different variable, one that stores the original language and one that stores the translation language. The function to split the item is not called for the first item of the list. The values saved in the database are language codes that map to Apertium language codes, because each language in Apertium is referenced by its code; for example, the

code for English is "en". Figure 4.6 presents the function to map a language to its code; an illustrative sketch follows the figure caption below.

Figure 4.6: Map a language to its code
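The function in Figure 4.6 is not reproduced in this text. As an illustration only, a minimal mapping of language names to Apertium codes, and the splitting of a language pair item, could look as follows. Only a few codes are listed; the real class may cover more languages.

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical sketch of the language-to-code mapping described above.
    public class LanguageCodes {
        private static final Map<String, String> CODES = new HashMap<String, String>();
        static {
            CODES.put("English", "en");
            CODES.put("French", "fr");
            CODES.put("Spanish", "es");
            CODES.put("Catalan", "ca");
            CODES.put("Esperanto", "eo");
        }

        // Returns the Apertium code for a language name, e.g. "English" -> "en".
        public static String codeFor(String language) {
            return CODES.get(language);
        }

        // Splits a list item such as "French-Spanish" into source and target codes.
        public static String[] splitPair(String pair) {
            String[] parts = pair.split("-");
            return new String[] { codeFor(parts[0]), codeFor(parts[1]) };
        }
    }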

• Interaction between the client and the server: After the language has been selected, the system allows the user to send the request to the Apache server. This step is implemented by the Activity class ValidateActivity. In this step, all the selected data and the image are uploaded to the server via a wireless Internet connection. The image is uploaded to a specific folder on the server, and its data are saved in the database. Each time the server receives a new request, all the data related to a previous request are removed. The server performs the pre-processing of the image, the text extraction and, if necessary, the translation. The job is finished when the server produces the final text file that contains the text that will be read to the user. As soon as the final text is available on the server, the client downloads it to the same folder where the image was saved. To upload a photo, the client uses the URL of the PHP script residing on the server that uploads the image (see Figure 4.7). To download the text file, the client uses the URL that contains the path of the text file residing on the server (see Figure 4.8). An illustrative client-side sketch of this exchange follows the Figure 4.7 caption below.

Figure 4.7: PHP script to upload image on server
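The scripts in Figures 4.7 and 4.8 are not reproduced in this text. As an illustration only, the following sketch shows how a Java client could perform the two sides of this exchange: a multipart upload handled by a PHP script and a plain download of the final text file. The endpoint name (upload.php), the form field name ("uploadedfile") and the file paths are assumptions, not the project's real values.

    import java.io.DataOutputStream;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    // Hypothetical sketch of the client-server exchange described above.
    public class PhrClient {

        // Uploads the photo as a multipart/form-data POST to the PHP script.
        public static void uploadImage(File image, String uploadUrl) throws IOException {
            String boundary = "----PhoneReaderBoundary";
            HttpURLConnection conn = (HttpURLConnection) new URL(uploadUrl).openConnection();
            conn.setDoOutput(true);
            conn.setRequestMethod("POST");
            conn.setRequestProperty("Content-Type", "multipart/form-data; boundary=" + boundary);

            DataOutputStream out = new DataOutputStream(conn.getOutputStream());
            out.writeBytes("--" + boundary + "\r\n");
            out.writeBytes("Content-Disposition: form-data; name=\"uploadedfile\"; filename=\""
                    + image.getName() + "\"\r\n");
            out.writeBytes("Content-Type: image/jpeg\r\n\r\n");

            FileInputStream in = new FileInputStream(image);     // raw JPEG bytes
            byte[] buffer = new byte[4096];
            int read;
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
            in.close();
            out.writeBytes("\r\n--" + boundary + "--\r\n");       // closing boundary
            out.close();

            if (conn.getResponseCode() != HttpURLConnection.HTTP_OK) {
                throw new IOException("Upload failed: HTTP " + conn.getResponseCode());
            }
        }

        // Downloads the final text file produced on the server into a local file.
        public static void downloadText(String textUrl, File destination) throws IOException {
            InputStream in = new URL(textUrl).openStream();
            FileOutputStream out = new FileOutputStream(destination);
            byte[] buffer = new byte[4096];
            int read;
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
            out.close();
            in.close();
        }
    }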

• Main class on the server that calls the processing, OCR and translation functions: On the server, one main class (named mainClass) calls all the functions of the different objects inside the Java project to process a request, performing the following steps (a condensed sketch of this flow is given after step 5):

1. The image is moved from the directory in the PHP project to one in the Java project. This new directory inside the Java project will contain the processed image.

Figure 4.8: URL to download the text file

2. To start the processing of the image, the values saved in the database are retrieved and stored in different variables in the main class. Those values include the coordinates of the top left corner of the region of interest selected by the user, its width and height, the language of the text and the language for the translation. The image name is also saved in the database, but as it is a constant, it does not need to be retrieved. If the values for the region of interest are equal to zero, no cropping is performed on the image. If the text language is equal to "None", there is no translation.

3. An instance of the class processImages is created to crop the image, if necessary, before continuing with the other processing methods. If the image is cropped, the new image created represents the region of interest, and the original image remains inside the same directory (see Figure 4.9).

Figure 4.9: Calling the image processing methods from the main class

4. After the image has gone through pre-processing, the OCR function (from the class tesseract-code) is called. The OCR function takes as parameters the name and path of the processed image and the language of the text, and produces a new text file containing the extracted text. Depending on the image size, the OCR typically takes between 1 and 3 seconds.

5. The next step is the translation of the extracted text by the class translation. An instance of the class is created, and the translation function takes as parameters the text language, the translation language, and the name of the text file produced by the OCR; it produces another text file that is saved in the same directory as the OCR text file. A simplified sketch of how the OCR and translation calls could be made is given after this list.
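As referenced in steps 4 and 5, the fragment below is a simplified sketch of how the OCR and translation steps could be invoked as external commands from the main class. It assumes the standard tesseract and apertium command-line tools and illustrative file names; the project’s tesseract-code and translation classes may wrap these calls differently.

import java.io.IOException;

public class PipelineSketch {

    /** Runs Tesseract on the processed image; produces "output.txt" ("output" plus ".txt"). */
    public static void runOcr(String imagePath, String langCode)
            throws IOException, InterruptedException {
        // tesseract <image> <output base> -l <language code>
        new ProcessBuilder("tesseract", imagePath, "output", "-l", langCode)
                .inheritIO().start().waitFor();
    }

    /** Translates the OCR output with Apertium, e.g. direction "es-en". */
    public static void runTranslation(String srcCode, String dstCode)
            throws IOException, InterruptedException {
        // apertium <source>-<target> <input file> <output file>
        new ProcessBuilder("apertium", srcCode + "-" + dstCode, "output.txt", "translated.txt")
                .inheritIO().start().waitFor();
    }
}

Each call blocks until the external tool finishes, which matches the sequential order of the steps described above.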

• The client waits for the main class to finish executing its job before downloading the text file.

• Once the text file is downloaded to the phone’s SD card, in the same folder as the original image, it is opened and read by the Activity class TTSActivity. The file is opened so that its content can be displayed on the phone screen. When this class is initialized, the translation language chosen by the user is passed to it and stored in a variable. If the user did not select any language pair, the TTS reads the text in the default language, English; otherwise, the Android TTS reads the text in the second language of the selected language pair. For example, with the language pair “Spanish-English”, the text is read in English, not in Spanish. The function speak in the TTS class takes as a parameter a String containing the content of the opened file. The text is then read after the user clicks on Hear the text.

Figure 4.10: Calling the OCR function

Figure 4.11: Calling the translation function

Figure 4.12: Function to perform Text-To-Speech
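The project’s own Text-To-Speech function is the one shown in Figure 4.12. The fragment below is only a minimal sketch of the standard android.speech.tts.TextToSpeech usage described above, with illustrative class and method names:

import android.speech.tts.TextToSpeech;
import java.util.Locale;

// Minimal sketch of an Activity-like class holding a TextToSpeech instance.
public class TtsSketch implements TextToSpeech.OnInitListener {

    private TextToSpeech tts;   // created elsewhere as new TextToSpeech(context, this)

    @Override
    public void onInit(int status) {
        if (status == TextToSpeech.SUCCESS) {
            // In the real application, the locale would be derived from the
            // second language of the selected language pair.
            tts.setLanguage(Locale.ENGLISH);
        }
    }

    /** Reads the content of the downloaded text file aloud. */
    public void hearTheText(String fileContent) {
        tts.speak(fileContent, TextToSpeech.QUEUE_FLUSH, null);
    }
}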

4.5 Conclusion

In this chapter, we have described the tools and methods that were used to develop the Phone Reader application. The Apache server processes the requests sent by the client, using Java and PHP as programming languages. Once the job is done on the server, the extracted text is downloaded and read aloud by the phone. All the tools used are free and available on Ubuntu.

Chapter 5

Tests and Results

Several tests were performed on an English text containing four sentences and 100 words. The OCR accuracy was measured as the percentage of recognized words out of the total number of words in the text. Ten pictures of the same text (set in Times New Roman) were taken, and on each we evaluated the accuracy of the OCR. It takes about 10 seconds for the server to process a request; if a request takes much longer, it means that there is an internal error on the server.
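Expressed as a formula (our own restatement of the metric just described, not taken from the original text):

\[
\text{OCR accuracy} = \frac{\text{number of correctly recognized words}}{\text{total number of words in the text}} \times 100\%
\]

Since the test text contains exactly 100 words, the accuracy value equals the number of recognized words.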

5.1 Different font sizes for the same text

To perform the following tests, we switched on all the lights inside a room to have bright light projected directly onto the text.

• Font size 12

Figure 5.1: OCR results of a text with font size 12

The average OCR accuracy in this chart is 93%. This means that 93 of the 100 words were recognized by the OCR for a text of font size 12 in a bright room.

• Font size 14

Figure 5.2: OCR results of a text with font size 14

The average OCR accuracy in this chart is 96%. This means that 96 of the 100 words were recognized by the OCR for a text of font size 14 in a bright room.

• Font size 16

Figure 5.3: OCR results of a text with font size 16

The average OCR accuracy in this chart is 98%. This means that 98 of the 100 words were recognized by the OCR for a text of font size 16 in a bright room.

We can see that the best results were obtained when the font size was 16; in that case, the OCR accuracy even reached 100%.

5.2 Lighting conditions

To test the system under different lighting conditions, we used the text that produced the best OCR results under bright light: the one with font size 16.

• Low light
To test our system under low light, we switched off the lights located above the text and left on the lights in the corner of the room, so that no light was projected directly onto the text; however, the text was still readable.

Figure 5.4: OCR results under low light

The average trendline is 92%. This means that 92 of the 100 words were recognized by the OCR for a text of font size 16 in a room with low light.

• Using the flash of the camera

Figure 5.5: OCR results under low light using the camera flash

We used only five photos in this example because the number of recognized words in each photo was almost the same. The average trendline is 90%. This means that 90 of the 100 words were recognized by the OCR for a text of font size 16, in a room with low light, when the photo was taken using the flash of the camera. In the following image we can see a processed image taken with the flash activated:

Figure 5.6: Result of the pre-processing of an image taken with flash activated

The flash illuminates certain regions of the image. In this picture, we see that the illuminated region forms a circle; all the words outside the circle are left in the dark and are more difficult for the OCR to recognize (see Figure 5.7). Using the flash did not help provide more accurate results; it actually produced worse results.

Figure 5.7: OCR result of a pre-processed image taken with flash activated

5.3 Testing the translation accuracy

We used a Spanish text of 35 words, in font size 16 and in a bright room, and translated it into English. The translation produced an English text of 35 words. The chart presents the percentage of words that were properly translated.

Figure 5.8: Representation of accurately translated words in a text

An average translation accuracy of 89.8% means that roughly 32 of the 35 words were correctly translated. We can conclude that the translation did not provide very accurate results, mainly because of the OCR results: if a word or a character is not correctly identified, it cannot be correctly translated.

Chapter 6

Conclusion

In this chapter, we look at the goals that were achieved by the Phone Reader and the aspects that could be improved.

6.1 Goals achieved by the system

The following list presents the major goals that have been achieved by the system:

• The Phone Reader allows the user to take a photo with any Android mobile phone, and send it to a remote server using a wireless Internet connection.

• The system allows the text to be read in a variety of languages, with or without translation.

• Even though the accuracy of the OCR does not always reach 100%, it can recognize more than 80% of the words in a text.

• It takes a few seconds (about 10 sec) for the server to process a request.

• It is a low-cost system developed with free tools.

6.2 Limits of the system

The following list presents some limits of the system:

• The mobile device must provide an auto-focus function to take good-quality pictures.

• If a person with good eyesight cannot identify words in the original picture, it will also be difficult for the OCR to identify those words.

• The lighting of the text plays a very important role in the recognition of characters within the image; low light prevents good OCR results.

6.3 Future work

To improve the system, some more work can be done:

• Increase the number of available languages for character recognition and for translation.

• Include a spell checker to correct poorly recognized words.

• Improve the representation of special characters after a translation.

• Implement a dialog to let the user know when the system has not been able to extract the text because the image quality is too poor.

6.4 Conclusion

Overall, the Phone Reader is a low-cost, working mobile application that runs on Android phones. The main goal of the application has been reached: enabling a user to listen to a text, contained in a picture taken with a phone, in a chosen language. The application works best on mobile phones with a camera that takes good-quality pictures. However, not all users have Android phones with a good camera, which is a limitation of the system. The system offers a user-friendly interface and is lightweight. The main challenge in using the system remains the use of the camera, because the picture needs to be correctly taken. Therefore a blind user might not always hear very accurate speech, as they cannot know whether the text inside the image is readable.

Appendix A

User’s guide

The Phone Reader application allows the user to take a photo of an image containing text and hear the content of the text. This appendix describes how to use the Phone Reader. To use the system, follow these steps:

1. Click on the application icon; the name of the application is “Welcome to Phone Reader”.

2. First Screen A button to launch the camera is placed at the top of the first screen: see Screen 1.

Figure A.1: Screen 1

3. Take a photo To take a photo, adjust the settings of the camera:

(a) We recommend choosing low-light mode if your device offers that option.

(b) Select a resolution with integer values, not decimal values.

(c) Check that the orientation of the icon on the phone, if there is one, corresponds to the actual position of the device.

(d) Zoom in on the text as much as you can to avoid having too many unnecessary objects in the photo.

(e) Make sure you can read the text in the picture; otherwise, it might be difficult for the system to read it too.

4. View Image Once the image is taken, you can save it, or discard it and take another photo. If you save the image, the next screen will present the image you have taken. Click Next directly if you have no region of interest to highlight: see Screen 2.

Figure A.2: Screen 2

5. Select region of interest With your finger, you can draw half a rectangle, starting at the top-left corner of the first word you choose and sliding to the bottom-right corner of the last word you choose on the screen. This creates a red rectangle representing the region of interest in your image. If you are not satisfied, you can make as many selections as you want, and only the last one will be taken into account when you click Next: see Screen 3.

Figure A.3: Screen 3

6. Select a language Clicking Next on the previous screen opens the list of languages available for reading or translating the text. On this screen, click only once on the language pair you want, or click None if you do not want any translation; in that case, the text will be read in English: see Screen 4.

Figure A.4: Screen 4

7. Upload the photo The button Click to send photo sends the image and the language pair you have selected to the server for processing and translation: see Screen 5. A dialog appears while the server is processing the request (Screen 6).

8. Hear the text As soon as the server has finished executing the request, the phone downloads the final text, and the text-to-speech screen displays the downloaded text file. You can see the content of the text and click on the button to hear the text. You can listen to the text as many times as you wish. The button Quit at the bottom of the page allows you to close the Phone Reader application: see Screen 7.

Figure A.5: Screen 5

Figure A.6: Screen 6

Figure A.7: Screen 7

Bibliography

[1] Android. What is the NDK? Online. Available from: http://developer.android.com/tools/sdk/ndk/overview.html.

[2] Android. TextToSpeech. Online, November 2012. Available from: http://developer.android.com/reference/android/speech/tts/TextToSpeech.html.

[3] Bigham, J. P., Jayant, C., Ji, H., Little, G., Miller, A., Miller, R. C., Miller, R., Tatarowicz, A., White, B., White, S., and Yeh, T. VizWiz: Nearly real-time answers to visual questions. UIST (October 2010).

[4] Documentation, U. OCR - Optical Character Recognition. Online, June 2012. Avail- able from: https://help.ubuntu.com/community/OCR.

[5] Dr. Kumar, N. Digital image processing techniques for image enhancement and infor- mation extraction. Proceedings of Workshop on Remote sensing and GIS applications in water resources engineering, Bangalore IV (1997), 33–43.

[6] Dr Rao, K. Overview of image processing. Readings in Image Processing (25-26 September 2004), 1–7.

[7] Forcada, M. L., Tyers, F. M., and Ramirez-Sanchez, G. The apertium machine translation platform: Five years on. Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation (November 2009), 3–10.

[8] GIMP. Brightness-Contrast. Online, 2012. Available from: http://gimp.open-source-solution.org/manual/gimp-tool-brightness-contrast.html.

[9] GIMP. Unsharp Mask. Online, 2012. Available from: http://docs.gimp.org/en/plug-in-unsharp-mask.html.

[10] Goadrich, M. H., and Rogers, M. P. Smart smartphone development: ios versus android. SIGCSE (March 2011).

[11] Google. Introducing Google Play. Online, 2012. Available from: https://play.google.com/store/apps/details?id=com.google.android.apps.unveil&hl=en.

[12] GraphicsAcademy.com. Graphics Glossary. Online, 2012. Available from: http://www.graphicsacademy.com/what.php.

[13] Hossein, and Shayesteh, H. The Phone Reader. Online, 2009. Available from: http://www.cs.uwc.ac.za/~hashayesteh/.

[14] Hossein, and Shayesteh, H. Phone reader application. Tech. rep., University of Western Cape, Faculty of Computer Science, November 2009.

[15] Hull, J. J., Liu, X., Erol, B., Graham, J., and Moraleda, J. Mobile image recognition: Architectures and tradeoffs. HotMobile (February 2010).

[16] ImageMagick. Convert, Edit, and Compose Images. Online, 2012. Available from: http://www.imagemagick.org/script/index.php.

[17] Kukulska-Hulme, A. Learning cultures on the move: where are we heading? Journal of Educational Technology and Society 13(4) (2010), p4–14.

[18] Language Technologies Unit (Canolfan Bedwyr), B. U. An overview of the tesseract ocr (optical character recognition) engine, and its possible enhancement for use in wales in a pre-competitive research stage. SALT Cymru project (April 2008).

[19] Mancas-Thillou, C., Ferreira, S., Demeyer, J., Minetti, C., and Gosselin, B. A multifunctional reading assistant for the visually impaired. EURASIP Journal on Image and Video Processing 2007, Article ID 64295 (September 2007), 1–11.

[20] Narasimhan, P., Gandhi, R., and Rossi, D. Smartphone-based assistive technolo- gies for the blind. CASES (October 2009).

[21] Nicomsoft. Optical Character Recognition (OCR): How it works. Online, 2012. Available from: http://www.nicomsoft.com/optical-character-recognition-ocr-how-it-works/.

[22] Organization, W. H. New estimates of visual impairment and blindness: 2010. Online, 2012. Available from: www.who.int/blindness/en/index.html.

[23] pcmag.com. Definition of: PDA phone. Online, 2012. Available from: http://www.pcmag.com/encyclopedia_term/0,1237,t=PDA+phone&i=56917,00.asp.

[24] PC.net. Bitmap. Online, 2012. Available from: http://pc.net/glossary/definition/bitmap.

[25] Sachs, J. Digital image basics. Digital Light & Color (1999), 1–14.

[26] Saha, A. K. A developer’s first look at android. Linux for you (January 2008), 48–50.

[27] ScienceDaily. Better pictures with mobile devices. ScienceDaily (12 October 2011).

[28] Shaik, A. S., Hossain, G., and Yeasin, M. Design, development and performance evaluation of reconfigured mobile android phone for people who are blind or visually impaired. SIGDOC (September 2010).

[29] Technologies, K., and the National Federation of the Blind. First cell phone that reads to the blind and dyslexic. Voice of the Nation’s Blind (27 January 2008), 1–3.

[30] Tsai, S. S., Chen, D., Chandrasekhar, V., Takacs, G., Cheung, N., Vedantham, R., Grzeszczuk, R., and Girod, B. Mobile product recognition. MM (October 2010), 1–4.

[31] Yan, T., Kumar, V., and Ganesan, D. Crowdsearch: Exploiting crowds for accu- rate real-time image search on mobile phones. MobiSys (October 2010).
