Analyzing and Experimenting Open Source OCR Engines in RPA With

www.rspsciencehub.com Volume 02 Issue 12S December 2020

Special Issue of First International Conference on Science, Technology & Management (ICSTM-2020) Analyzing and Experimenting Open Source OCR Engines in RPA with Levenshtein Distance Algorithm Malathi T1, Diwaan Chandar C S2, Nithish S3, Niranjan V4, Swashthika A K5 1Assistant Professor, Department of Computer Science & Engineering, Bannari Amman Institute of Technology, Sathyamangalam, Erode, Tamilnadu - 638401 2,3 Second Year, Department of Information Science & Engineering, Bannari Amman Institute of Technology, Sathyamangalam, Erode, Tamilnadu - 638401 4, 5 Second Year, Department of Computer Technology, Bannari Amman Institute of Technology, Sathyamangalam, Erode, Tamilnadu - 638401 [email protected] Abstract Robotic Process Automation is a platform used to automate boring and repetitive computer processes using software bots so that humans could involve in tasks which include creativity and decision making which could not be done by robots. Optical Character Recognition takes out printed characters in an image and converts it to text. Google Tesseract OCR and Microsoft OCR were the commonly used OCR engines available in UiPath, a tool for Robotic Process Automation. In Previous, research on comparing those two open source OCR engine, there we made comparison on basic factors which included speed, hardware requirements, accuracy ,but in that case, accuracy was been calculated manually which gave us results but with less precise, as it was a manual process to substitute scraped data to that formulas, In this research we’ve made results with more precision by performing a String comparison algorithm named, “Levenshtein Distance Algorithm” which is deployed in UiPath. Keywords: Optical Character Recognition(OCR); Robotics Process Automation(RPA); Google Tesseract OCR; Microsoft OCR 1. Introduction Artificial Intelligence) that they’ll overrule the Robotics and automation has stepped into reality humans by making their own decisions and works a few years ago and is evolving so rapidly around which involve their own creativity. [4]But these the world in areas such as industrial automation, theories clearly states that machines can’t stand a space engineering, stellar space engineering, even chance against human intelligence cause they’re in urban and rural areas all over the world. I still competing with our skills,[3]but there are always wonder how these programs work many tasks which humans seek for technology seamlessly 24 by 7 hours,[1]as it was designed to help like repetitive tasks as mentioned above to do that actually, As we use this technology to make the result more efficient and with more overcome bored repetitive tasks and human risky precision, and there’s many investors currently jobs, we often deploy them daily to do that kinda investing billions of dollars in these kind of tasks, However they depend on humans to get technologies to seek more compound profit in engaged with that technology,[7] there are many future. Microsoft OCR, which is a built in OCR myths in this century regarding future predictions engine in Microsoft windows 10 and Tesseract about these technologies (mainly robots or OCR,[2]an open source OCR engine developed

International Research Journal on Advanced Science Hub (IRJASH) 98

www.rspsciencehub.com Volume 02 Issue 12S December 2020 by Google were the two available open source comparison methods would not operate in such OCR engines in UiPath, a tool for Robotic instances, but the operation will give a percentage Process Automation. In the previous of similarity. paper[1]research made is by checking the 3. Methodology accuracy of Tesseract OCR and Microsoft OCR In this research proposal, an string comparison by using some manual methods, which is not algorithm plays an important role to give more precise. Also, we had used different sets of accurate results than our previous study, the images for testing the accuracy and had also used whole sequence of the execution will be like; systems of different specifications for this unzipping and feeding the data from our local research which may result in error for time taken machine storage to the workflow ocr and accuracy percentage. Hence, to propose a engines(either Microsoft OCR engine or more valid result, we had planned to improve our Tesseract OCR Engine first) ; and redirecting the results by using a string comparison algorithm extracted data to the string algorithm analysing named, “Levenshtein algorithm”, which is used to container as show at figure 2.1 and there the calculate the similarity between two input strings major part plays on comparing the extracted data and returns it's accuracy in percentage. We had with the original data from the images which was also used the same set of images for testing both used to feed the OCR engines, and eventually the OCR engines. And executed the workflow on the accuracy(data) will be saved to local storage or same system in order to calculate the time taken can also use cloud storage for purpose, if we’re for the execution error-free. deploying this workflow to the UiPath 2. String comparison - levenshtein distance Orchestrator. algorithm The main upgrades from the previous paper[1] is The operation accepts two strings and returns the about: percentage of similarity between two strings Same set of source data is supplied to both ocr using the Levenshtein Algorithm in the machines to expect better comparison results System.Single form. The Levenshtein algorithm Both workflows for ocr has been executed with (also referred to as Edit-Distance) calculates the same hardware equipments whereas used minimum number of editing operations needed to different hardwares for previous comparison[1] change one string in order to obtain another Images with lighten backgrounds are used to string. The dynamic programming approach is the extract more data to obtain more precise most prevalent way of measuring this. A matrix is information. initialized and the Levenshtein distance between 3.1 What does container refer to? the m-character prefix of one is calculated in the Images with fancy or decorable fonts are used to (m,n)-cell with the n-prefix of the other term. test the algorithm [1]Containers or Blocks which From the upper left to the lower right corner, the is often used in uipath studio to classify set of matrix can be filled. An insert or a delete, activities or program in an order to execute in a respectively, corresponds to each hop horizontally sequential manner, if a container is set to top level or vertically. The cost is typically set at 1 for each node, the activities which is under that container of the tasks. If the two characters in the row and executes first and then the workflow further column do not match or 0, if they do the diagonal moves to next container which holds set of jump will cost either one. The expense is often instructions readily to run followed by the reduced locally by each cell. This way, the previous block. Levenshtein gap between both terms is the 4. Architecture and workflow number in the lower right corner. 4.1 Architecture If it is needed to compare a string to some sample Some people might be good at programming by data, this could effectively be used. For example, nature. But it's not necessary to be everyone. the status of an application needs to be updated on There are many people, who always struggle to the basis of a statement from Approvers. "The grasp the basics of programming or simply they request is approved "Application is approved"The can't program. [1]Hence UiPath studio offers a request raised is approved yesterday"Yesterday's no-code environment where we can enter with request is approved. The traditional string

International Research Journal on Advanced Science Hub (IRJASH) 99

www.rspsciencehub.com Volume 02 Issue 12S December 2020 minimal or zero programming background using that are predefined. To model the workflow for some visual code blocks.[1]There were several the automation process, users can drag and drop boring and repetitive tasks in Information activities. In simple terms, UiPath Studio is a Technology and Business Process which may be method used to model the automation workflow mundane for many employees. Hence, we could to automate repetitive processes using predefined deploy a bot so that humans could involve in activities and libraries. And, UiPath Robots is a some other creative activities and other activities software that hosts the process installed in the which include human decisions in order to make UiPath studio that allows us to carry out our an accurate process in a very low time.[1]The projects with or without human monitoring (or) primary architecture of UiPath software consists supervision on any computer.UiPath Orchestrator of three components; UiPath Studio, UiPath is a web application that allows us to manage the Robot, UiPath Orchestrator which plays a vital development and deployment of our resources in role in automating a task: our machine. It allows us to launch and schedule [1]UiPath Studio is a design tool that enables a our bots on our or other desktops, and also control user to create programs.[1]It has many the bot's status and evaluate the effects of their activities(pre-defined functions) and repositories work.

Fig.1. OCR Flow chart

4.2 Workflow OCR were fed into the columns C and D of the excel sheet.[1]And the similarity between the This research consists of two workflows, one for string 1 and 2 (String 1 is same for both OCR and Tesseract OCR and another for Microsoft OCR. String 2 is the extracted text respected to the OCR A folder which consists of 100 images is given as engine) were calculated and the percentage is input. An excel sheet which consists of Original returned in columns E and F respectively. Text data of the image is used for the calculation Similarly the time taken for the execution has of the results. The original text data is read as been calculated on the UiPath Studio for both string 1 for comparison on both workflows and it OCRs. Finally, the average of accuracy for the extracts the text data using Tesseract OCR on 100 samples and average time taken for the Workflow 1 and Microsoft OCR on Workflow 2. extraction of a single image of each OCR has The extracted data is given for String 2 for the been calculated. comparison on respected workflows. Now, the extracted data by Tesseract OCR and Microsoft 5. Result analysis

International Research Journal on Advanced Science Hub (IRJASH) 100

www.rspsciencehub.com Volume 02 Issue 12S December 2020 a.) Mean Accuracy- There is a measure of the extracted including the time taken for the string similarities between the original text and the comparison is calculated and tabulated. c.) Mean extracted text calculated by using Levenshtein time taken(per images) - The average time taken to Distance Algorithm-String Comparison in uipath, extract each picture is calculated and calculated and the value is determined between them, and and tabulated from the total time taken, and mean finally the mean accuracy is calculated for the time taken value will be calculated from the total determined values b.) Overall Execution Time (or) time taken divided by total no of images fed. Time taken - The time taken for 100 images to be Table 1. Characteristics/ Engine

Characteristics/ Engine Tesseract OCR Engine Microsoft OCR Engine

Mean accuracy(in percentage) 71.04549606 76.68340455

Time taken(in hour format) 00:19:11.00 00:27:08.00

Mean time taken(s)per image 00:00:11.51 00:00:16.28 6. Error analysis Some datas has been classified below acc to their accuracy percentage, Table 2. Tesseract OCR Data/accuracy Original Extracted text

0 - 20% "STAY "m HOME xﬁom STAY SAFE" ? smv gm"

20 - 40% "Never forget "Never forget 3 types of people 3 types of p_eople in your life: In your life: 1. Who helped you in your difficult times. LVMNVIMJMEOMM 2. Who left you in your difficult times. 2.V\m mmhmdmmm 3. Who put you in your difficult times." mummywindrrmnm"

40 - 60% "FASTER "FASTER THAN A THAN A MERCEDES MERCEDES . MY HEART BEATS MY HEART BEATS FOR YOU FOR Loesje YOU P.O.-BOX 1045 6801 BA ARNHEM HOLAND" 1922/;"

60 - 80% HEAVEN, """YOUR EYES YOUR HANDS PROMISEO ME TOOK ME HEAVEN, THERE."" YOUR HANDS -Lenon Hodson" TOOK ME THERE."""

100% BLACKLANE BLACKLANE

International Research Journal on Advanced Science Hub (IRJASH) 101

www.rspsciencehub.com Volume 02 Issue 12S December 2020

Table.3 Microsoft OCR Data/accuracy Original Extracted Text

0 - 20% "STAY HOME STAY SAFE"

20 - 40% "Never forget "Never forget 3 types of people 3 types of people in your life: un your life: 1. Who helped you in your difficult CCu times. I. In attest b•tm 2. Who left you in your difficult 3. in dTÄE Sa•" times. 3. Who put you in your difficult times."

40 - 60% "To heal a "To heal wound woun you need yoti to stop toucl?inæ touching it." it."

60 - 80% Hello World! t lollo World!

100% Blacklane Blacklane

Note: All images were taken in Joint Photographic group format (.JPG or JPEG).

6.1 Decorative font outcomes

For these types of fonts, OCR Engines returned null values and also random alphabets such as, Microsoft ocr returned null and Tesseract returned “ﬁ‘lﬂl/” for the value “love” as in the above image. 6.2 Images with grey background Images with grey background provide (attached above) an accuracy of nearly 80 - 90% in both Google Tesseract and Microsoft OCR.

International Research Journal on Advanced Science Hub (IRJASH) 102

www.rspsciencehub.com Volume 02 Issue 12S December 2020 7. Future works Technical Report 95-03, Information Science Research Institute, University of Nevada, Las [1]So, the overalls analysis and this comparison Vegas, July 1995. research has been done under three major factors which includes, velocity, accuracy, time taken, [3] R. Smith, “A Simple and Efficient Skew using distance algorithm, majorly this research was Detection Algorithm via Text Row the improvised version of the previous research Accumulation”, Proc. of the 3rd Int. Conf. on which gave more precise results over the first one, Document Analysis and Recognition (Vol. 2), [1]but there are still some portions of this research IEEE 1995, pp. 1145-1148. can be upgraded for better purposes such as like [4] A.C. Kak e M. Slaney. Principles of deploying storage area from local storage to cloud Computerized Tomography. IEEE Press, 1988. storage which will be used to manipulate data over worldwide collaborators for further analysis,[1] [5] S.V. Rice, G. Nagy, T.A. Nartker, Optical so,moreover the future work will be on storing Character Recognition: An Illustrated Guide to data with cloud support, Which’ll be also used to the Frontier, Kluwer Academic Publishers, run other OCR Engines available in the UiPath USA 1999, pp. 57-60. framework? [6] P.J. Schneider, “An Algorithm for Conclusion Automatically Fitting Digitized Curves”, in By making calculations at different factors, such as A.S. Glassner, Graphics Gems I, Morgan precision, time taken, to put the experiment to an Kaufmann, 1990, pp. 612-626. end.[1] In certain cases the results of the Microsoft OCR are more reliable compared to the results of [7] L.R. Franca Neto, C.A.B. Mello and R.D. Lins. the Tesseract OCR. But Tesseract OCR also gives Filtering Techniques for Digital Images of better results in certain cases as well.This provides Historical Documents. XV Brazilian different accuracy values.But when considering Symposium of Telecommunications, Recife, the time it took to identify the characters in the Brazil,September, 1997. images compared to Microsoft OCR, Tesseract got [8] B.A. Blesser, T.T. Kuklinski, R.J. Shillman, efficient result,[1]So by performing these comparison with the use of String comparison “Empirical Tests for Feature Selection Based algorithm on a Psychological Theory of Character Recognition”, Pattern Recognition 8(2), Acknowledgements Elsevier, New York, 1976. The Authors sincerely thank Ms.Malathi T, [9] R.D. Lins, M.S. Guimar~aes Neto, L.R. Franca Assistant Professor department of Computer Neto and L.G. Rosa. An Environment for science and Engineering, Bannari Amman Institute Processing Images of Historical Documents. of technology, Sathyamangalam, Erode- 638 401 Microprocessing & Microprogramming, pp. for guiding us for the successful completion of this 111-121,North-Holland, January, 1995. research. [10] R.J. Shillman, Character Recognition Based 10. References on Phenomenological Attributes: Theory and Methods, PhD. Thesis, Massachusetts Institute [1] Malathi T, Diwaan Chandar C S, Nithish S, of Technology. 1974. Niranjan V, Swashthika A K , An experimental [11] C.A.B.Mello, L.R.Franca Neto e R.D.Lins. A performance analysis on robotics process New Technique for Compressing Static automation(RPA) with open source OCR Images. Proceedings of the XVI Computation engines: microsoft OCR and tesseract OCR. Brazilian Society Congress, Brazil, 1996. ICMMM2.0, 2020. [12] R.W. Smith, The Extraction and Recognition [2] S.V. Rice, F.R. Jenkins, T.A. Nartker, The of Text from Multimedia Document Images, Fourth Annual Test of OCR Accuracy,

International Research Journal on Advanced Science Hub (IRJASH) 103

www.rspsciencehub.com Volume 02 Issue 12S December 2020 PhD Thesis, University of Bristol, November 1987. [13] P.J. Rousseeuw, A.M. Leroy, Robust Regression and Outlier Detection, Wiley- IEEE, 2003. [14] K.Sayood. Introduction to Data Compression. Morgan Kau man Publishers, Inc., 1996. [15] I.H. Witten, A.Mo at and T.C. Bell. Managing Gigabytes - Compressing and Indexing Documents and Images. Van Nostrand Reinhold, 1994.

International Research Journal on Advanced Science Hub (IRJASH) 104