Camera Based Equation Solver for Android Devices
Aman Sikka, Department of Electrical Engineering, Stanford University, Stanford, California 94305. Email: [email protected]
Benny Wu, Department of Applied Physics, Stanford University, Stanford, California 94305. Email: [email protected]

Abstract—In this paper we demonstrate an Android based mobile application equation solver that takes a camera image of an equation and displays the solution. The program is capable of solving simple arithmetic equations (addition, subtraction, and multiplication) and systems of linear equations of up to two variables, written by hand or in computer font.

I. INTRODUCTION

Mathematical equations are a ubiquitous part of student life. Generally, one has to make use of a calculator or computer to solve complex equations. Depending on the assistive device, the equations need to be entered in a specific format. Our goal is to develop a mobile application that bridges the gap between technology and the traditional pen and paper approach in an intuitive manner. The user captures a camera image of a mathematical expression, either in computer generated font or in handwritten text, and a solution is calculated and displayed on the mobile device.

Initially, we limit the scope of the task to solving simple arithmetic equations, quadratic equations, and systems of linear equations of up to two variables. Mobile applications for solving such equations do exist. However, these apps are essentially templates for pre-defined equations in which users enter numbers into static text boxes. While the constrained interface ensures robust capture of the input, it is cumbersome for the user, who must click multiple times to enter an equation. A more intuitive approach is for the user to simply snap a picture of the written expression.

II. SYSTEM OVERVIEW

The general approach for solving an equation contained in an image is as follows:
1) Capture image.
2) Binarize image.
3) Segment expressions.
4) Correct perspective distortion.
5) Recognize text.
6) Prompt for user confirmation.
7) Solve equation.

Due to limitations in the text detection accuracy, a prompt for confirmation from the user was added between steps 5 and 7 to ensure that the correct equation is passed to the solver.

The overall system consists of two subsystems. The first is an Android application that captures the input image, displays the detected equation, prompts for confirmation, and displays the solution. The second is a server running the image processing, text recognition, and equation solving algorithms. The overall system flow is shown in Figure 1.

Fig. 1. Overall system flow showing processes running on the Android platform and on the server. Printed path: Camera Image, Binarize, Segment Line, Correct Skew, OCR (Tesseract). Handwritten path: Camera Image, Binarize, Segment Char, OCR (SVM). Both paths: Recognized Text, Confirm/Edit, MATLAB Solver, Solution.

III. IMAGE CAPTURE

Within the mobile application, the user is given two options, "Solve Printed" or "Solve Hand Written", which capture an image and initiate two different PHP scripts on the server. The image is downsized to 640 x 480 pixels from its original size to speed up transmission to the server.

IV. BINARIZATION AND LINE SEGMENTATION

Binarization and segmentation are crucial steps in identifying the regions of interest. Otsu's adaptive thresholding was initially used. However, the processing time was fairly slow and the binarization was unsatisfactory under certain lighting conditions. Instead, the image is binarized using Maximally Stable Extremal Regions (MSER) from the VLFeat toolbox [2]. Regions that are either too small or too large are excluded from the candidate list. Regions in the binarized image are labeled, and their centroids and bounding boxes calculated.

Individual lines (a single equation or expression) are segmented out by plotting the y-centroids of the regions in a histogram with a bin size of 20 pixels, as shown in Figure 2(b). The average y-centroid of each line is assumed to be the maximum of the well-separated peaks. The midpoint between each pair of adjacent lines is then calculated and used to segment the binarized image into images of individual lines.

Fig. 2. (a) Camera image taken by Android phone. (b) After binarization with MSER and region labeling, a histogram of the y-centroids of the regions is plotted. Regions from separate expressions form distinct peaks. (c) Using the histogram results, the binarized image is segmented into separate expressions.
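To make the binarization and line segmentation stage concrete, the sketch below follows the same recipe in Python with OpenCV: MSER binarization, size filtering, a y-centroid histogram with 20-pixel bins, and cuts at the midpoints between line centers. The paper's implementation uses the MATLAB VLFeat toolbox, so this is only an approximation; the function name, area thresholds, and peak-grouping heuristic are our assumptions, not values from the paper.

    # Illustrative Python/OpenCV sketch of Section IV (the paper uses
    # MATLAB + VLFeat); area thresholds and peak grouping are assumptions.
    import cv2
    import numpy as np

    def segment_lines(gray, bin_size=20, min_area=30, max_area=5000):
        """Binarize with MSER, then cut the image into single-line images."""
        regions, _ = cv2.MSER_create().detectRegions(gray)

        mask = np.zeros(gray.shape, np.uint8)
        ys = []
        for pts in regions:                       # pts: (N, 2) array of (x, y)
            if min_area <= len(pts) <= max_area:  # drop too-small/too-large regions
                mask[pts[:, 1], pts[:, 0]] = 255  # paint region into binary mask
                ys.append(pts[:, 1].mean())       # y-centroid of this region

        # Histogram of y-centroids; each text line shows up as a
        # well-separated run of non-empty bins.
        edges = np.arange(0, gray.shape[0] + bin_size, bin_size)
        hist, _ = np.histogram(ys, bins=edges)
        groups, cur = [], []
        for i, count in enumerate(hist):
            if count:
                cur.append(i)
            elif cur:
                groups.append(cur)
                cur = []
        if cur:
            groups.append(cur)

        # Average y-centroid per line, then cut at midpoints between lines.
        centers = [np.mean([y for y in ys
                            if edges[g[0]] <= y <= edges[g[-1] + 1]])
                   for g in groups]
        cuts = [0] + [int((a + b) // 2) for a, b in
                      zip(centers, centers[1:])] + [gray.shape[0]]
        return [mask[cuts[i]:cuts[i + 1], :] for i in range(len(cuts) - 1)]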
V. TEXT RECOGNITION

Text recognition is divided into two separate classes: computer printed expressions and handwritten expressions. Computer printed expressions are processed with Tesseract [3], a free Optical Character Recognition (OCR) engine originally developed by HP Labs and now maintained by Google. Handwritten expressions are processed with a Support Vector Machine (SVM) prediction model.

In order to improve detection accuracy, the character dictionary is limited to the digits (0 - 9), two variables (a, b), and a subset of mathematical operators (+, -, x, and =). These characters are sufficient for composing both quadratic equations and most systems of linear equations in two variables. The dictionary can be easily expanded to include more characters in the future.

A. Computer Text Recognition

Recognizing computer generated text is considerably easier than recognizing handwritten text. Due to the predictable structure (font style) of the characters, a template matching algorithm can be used. The free OCR engine Tesseract was chosen for printed text recognition.

As demonstrated in [4], the accuracy of optical text recognition algorithms improves significantly if perspective distortion is first corrected. Following a method similar to [4], the topline and baseline of the regions in the segmented expression are found. The segmented line image is shown in Figure 3(a). The RANSAC MATLAB Toolbox from the Graphics and Media Lab of Lomonosov Moscow State University [5] is used to compute RANdom SAmple Consensus (RANSAC) fits to the topline and baseline, as shown in Figure 3(b). A bounding box for the distorted text is found, and its four corner points identified. Using the leading edge of this bounding box, the desired, corrected bounding box is then calculated, as shown in Figure 3(c). A projective transformation that maps the four corner points of the distorted box to the desired box is then used to correct for the perspective distortion. The corrected image, shown in Figure 3(d), is then passed to the Tesseract OCR engine.

Fig. 3. (a) Segmented line image showing perspective distortion. (b) RANSAC fits to the topline and the baseline. (c) Top: the detected, distorted bounding box. Bottom: the desired, corrected bounding box. (d) Corrected image after the projective transformation is applied to the distorted text.
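The projective mapping itself is a standard four-point homography. The sketch below is a Python/OpenCV rendering rather than the paper's MATLAB; the corner points are assumed to have already been derived from the RANSAC topline and baseline fits, and the coordinates in the usage example are hypothetical.

    # Sketch of the perspective-correction step; corners are assumed to come
    # from the RANSAC topline/baseline fits described above.
    import cv2
    import numpy as np

    def correct_perspective(line_img, corners, width, height):
        """Warp the distorted text quadrilateral onto an upright box.

        corners: four (x, y) points, ordered TL, TR, BR, BL.
        """
        src = np.float32(corners)
        dst = np.float32([[0, 0], [width, 0], [width, height], [0, height]])
        H = cv2.getPerspectiveTransform(src, dst)   # 3x3 homography matrix
        return cv2.warpPerspective(line_img, H, (width, height))

    # Usage with hypothetical corners for a slanted line of text:
    # upright = correct_perspective(line_img,
    #                               [(210, 255), (460, 240),
    #                                (462, 310), (212, 330)],
    #                               width=250, height=80)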
B. Handwritten Text Recognition

Tesseract performed well for printed text, with a detection rate better than 85% for fonts within its database. However, its detection rate for handwritten text is below 50%, likely due to size variations in the writing and a lack of matching fonts in its database. A machine learning algorithm based on support vector machines (SVM) was applied instead. An SVM is a supervised learning method that analyzes data and recognizes patterns, and is often used for classification and regression analysis. The prediction model was created using libSVM, a set of tools for training and modeling SVMs developed at National Taiwan University [6].

In order to create a training dataset, character images must first be converted into vector form. After line segmentation, region labels are used to determine the bounding box for individual characters. A small amount of padding is added to the border, as shown in Figure 4(a). The segmented character is downsampled to 32 x 32 pixels and further divided into 64 regions of 4 x 4 pixels, with region 1 at the top left and region 64 at the bottom right. The count of foreground pixels in each region is the corresponding vector value, as shown in Figure 4(b). This conversion thus results in a 64 dimensional vector for each character image.

Fig. 4. (a) Segmented character from input image. (b) Character downsampled to 32 x 32 pixels. The image is then further divided into 64 regions of 4 x 4 pixels, with region 1 at the top left and region 64 at the bottom right. The count in each region is the vector value.

Initially, the Optical Recognition of Handwritten Digits Data Set from the Machine Learning Repository at the University of California at Irvine [7] was used as the training set. The 64 dimensional vector conversion of this data set was kindly provided by Sam Tsai. After training libSVM, a test on our own handwritten digits resulted in a low prediction rate of between 20-30%. We thus decided to create a new training data set based on our own handwriting. A simple application was written for taking a camera image of a single character, segmenting and downsampling the character, calculating the corresponding 64 dimensional vector, and writing it to an output text file.

Detecting each segmented line as a whole proved more accurate than segmenting and detecting individual characters. Tesseract performs well for fonts that exist in its database, with an accuracy in the range of 85-90%. The notable exceptions were the characters '0', '-', '=', and 'a'. The character '0' was at times mistakenly identified as '()' (prior to the removal of '(' and ')' from the dictionary). The character 'a' was at times mistakenly identified as '3' or '8'. The characters '-' and '=' were at times completely absent from the detection, likely due to the down-sampling of the camera image to 640 x 480 pixels.
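For concreteness, the sketch below reproduces the 64-dimensional conversion described in Section V-B and trains a classifier with scikit-learn's SVC, which wraps libSVM internally; the paper itself used the libSVM command-line tools, and the function name, kernel choice, and training images/labels here are placeholders of ours, not the authors'.

    # Sketch of the 64-dimensional character feature: downsample to 32 x 32,
    # split into an 8 x 8 grid of 4 x 4 blocks, count ink pixels per block.
    # scikit-learn's SVC (a wrapper around libSVM) stands in for the paper's
    # libSVM command-line tools; images and labels below are placeholders.
    import cv2
    import numpy as np
    from sklearn.svm import SVC

    def char_to_vector(char_img):
        """char_img: binary character image (255 = ink). Returns shape (64,)."""
        small = cv2.resize(char_img, (32, 32), interpolation=cv2.INTER_AREA)
        binary = (small > 127).astype(np.uint8)  # re-binarize after resize
        # Region 1 is the top-left 4x4 block, region 64 the bottom-right.
        blocks = binary.reshape(8, 4, 8, 4).swapaxes(1, 2)  # -> (8, 8, 4, 4)
        return blocks.reshape(64, 16).sum(axis=1)           # counts in 0..16

    # Training and prediction on a labeled set of character images:
    # X = np.array([char_to_vector(img) for img in training_images])
    # model = SVC(kernel="rbf").fit(X, training_labels)
    # predicted = model.predict(char_to_vector(test_img).reshape(1, -1))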