A Study on the Accuracy of OCR Engines for Source Code Transcription from Programming Screencasts
Abdulkarim Khormi^1,2, Mohammad Alahmadi^1,3, Sonia Haiduc^1
^1 Florida State University, Tallahassee, FL, United States, {khormi, alahmadi, shaiduc}@cs.fsu.edu
^2 Jazan University, Jizan, Saudi Arabia
^3 University of Jeddah, Jeddah, Saudi Arabia

ABSTRACT
Programming screencasts can be a rich source of documentation for developers. However, despite the availability of such videos, the information available in them, and especially the source code being displayed, is not easy to find, search, or reuse by programmers. Recent work has identified this challenge and proposed solutions that identify and extract source code from video tutorials in order to make it readily available to developers or other tools. A crucial component in these approaches is the Optical Character Recognition (OCR) engine used to transcribe the source code shown on screen. Previous work has simply chosen one OCR engine, without consideration for its accuracy or that of other engines on source code recognition. In this paper, we present an empirical study on the accuracy of six OCR engines for the extraction of source code from screencasts and code images. Our results show that the transcription accuracy varies greatly from one OCR engine to another and that the most widely chosen OCR engine in previous studies is by far not the best choice. We also show how other factors, such as font type and size, can impact the results of some of the engines. We conclude by offering guidelines for programming screencast creators on which fonts to use to enable a better OCR recognition of their source code, as well as advice on OCR choice for researchers aiming to analyze source code in screencasts.

CCS CONCEPTS
• Software and its engineering → Documentation; • Computer vision → Image recognition;

KEYWORDS
Optical Character Recognition, Code Extraction, Programming video tutorials, Software documentation, Video mining

ACM Reference Format:
Abdulkarim Khormi, Mohammad Alahmadi, Sonia Haiduc. 2020. A Study on the Accuracy of OCR Engines for Source Code Transcription from Programming Screencasts. In 17th International Conference on Mining Software Repositories (MSR '20), October 5–6, 2020, Seoul, Republic of Korea. ACM, New York, NY, USA, 11 pages. https://doi.org/10.1145/3379597.3387468

MSR '20, October 5–6, 2020, Seoul, Republic of Korea
© 2020 Association for Computing Machinery. ACM ISBN 978-1-4503-7517-7/20/05...$15.00
https://doi.org/10.1145/3379597.3387468

1 INTRODUCTION
Nowadays, programmers spend 20-30% of their time online seeking information to support their daily tasks [6, 11]. Among the plethora of online documentation sources available to them, programming screencasts are one source that has been rapidly growing in popularity in recent years [17]. For example, YouTube alone hosts vast amounts of videos that cover a large variety of programming topics and have been watched billions of times. However, despite their widespread availability, programming screencasts present some challenges that need to be addressed before they can be used by developers to their full potential. One important limitation is the lack of support for code reuse. Given that copy-pasting code is the most common action performed by developers online [6], programming screencasts should also strive to support this developer behavior. However, very few programming screencast creators make their source code available alongside the video, and online platforms like YouTube do not currently have support for automatic source code transcription. Therefore, developers often have no choice but to manually transcribe the code they see on the screen into their project, which is not only time consuming, but also cumbersome, as it requires pausing and playing the video numerous times [23].

Researchers have recently recognized this problem and proposed novel techniques to automatically detect, locate, and extract source code from programming tutorials [1, 2, 14, 19, 20, 24, 32, 34]. While different approaches to achieve these goals were proposed, they all have one component in common: an Optical Character Recognition (OCR) engine, which is used to transcribe the source code found in a frame of the screencast into text. To ensure the most accurate extraction of source code from programming screencasts, one needs to carefully choose the OCR engine that performs the best specifically on images containing code. However, no empirical evaluation currently exists on the accuracy of OCR engines for code transcription. Previous work that analyzed programming screencasts [14, 25, 32] simply picked one OCR engine, namely Tesseract^1, and applied it to programming screencast frames. However, without validating its accuracy or comparing it to other OCR engines on the task of code transcription, there is no evidence that Tesseract is the best choice for transcribing source code. Choosing a suboptimal OCR engine can lead to noise that is difficult to correct, and therefore code that is hard to reuse without expensive post-processing steps or manual fixing. To this end, there is a need for an empirical evaluation that examines the accuracy of different OCR engines for extracting source code from images, with the goal of finding the engine with the lowest risk of code transcription error.

^1 https://github.com/tesseract-ocr

In this paper, we address this need and present a set of experiments evaluating six OCR engines (Google Drive OCR, ABBYY FineReader, GOCR, Tesseract, OCRAD, and Cuneiform) applied to images of code written in three programming languages: Java, Python, and C#. We evaluate the OCR engines on two code image datasets. The first one contains 300 frames extracted from 300 YouTube programming screencasts (100 videos per programming language, one frame per video). The evaluation on this dataset is relevant to researchers working on the analysis of programming screencasts and software developers wanting to reuse code from screencasts. The experiments on this dataset allow us to determine the best OCR engine for source code extraction from screencasts, therefore reducing the risk of error in the OCR results and increasing the potential for reuse of the extracted code. The second dataset is composed of 3,750 screenshots of code (1,250 per programming language) written in an IDE using different font types and font sizes. The code was extracted from a variety of GitHub projects, then locally opened in a popular IDE, where a screenshot of the code region was taken. This dataset, unlike the first one, allowed us to manipulate the font used to write the code in the IDE and change its size and type, in order to observe how these factors impact the accuracy of OCR engines. The results on this dataset are relevant for creators of programming screencasts, revealing font types and sizes that enable better OCR on the code in their screencasts. At the same time, researchers can use these results as guidelines for identifying screencasts that lend themselves to OCR the best.

Our results show that there are significant differences in accuracy between OCR engines. Google Drive OCR (Gdrive) and ABBYY FineReader performed the best, while the default choice in previous work, Tesseract, was never in the top three best performing OCR engines. We also found that applying OCR on code written using the DejaVu Sans Mono font produced the best results compared to other

OCR has also been recently adopted by researchers in software engineering for extracting source code from programming video tutorials [14, 24, 32]. It brings great promise in this area of research, allowing the text found in programming tutorials, including the source code, to be extracted, indexed, searched, and reused. While researchers in software engineering have only used one OCR engine thus far (i.e., Tesseract), there are several other OCR engines that use different computer vision algorithms, leading to high-quality text extraction and real-time performance. In this paper, we investigate and compare six well-known OCR engines: Google Docs OCR, ABBYY FineReader, GOCR, Tesseract, OCRAD, and Cuneiform.

Google Drive^2 is a file storage and synchronization service developed by Google, which interacts with Google Docs, a cloud service allowing users to view, edit, and share documents. Google Drive provides an API, which allows converting images into text using the Google Docs OCR engine. The text extraction process is performed through a cloud vision API that Google uses in the back-end with a custom machine learning model in order to yield high accuracy results.

ABBYY FineReader^3 is commercial OCR software that offers intelligent text detection for digital documents. It can process digital documents of many types, such as PDF or, in our case, a video frame containing source code. FineReader uses an advanced algorithm to preserve the layout of the image during the text extraction process. It offers a paid plan, in which users can create a new pattern to train the engine. However, we used the built-in default patterns of the free trial version of FineReader 12 in our work. FineReader offers a software development kit (SDK)^4 to facilitate the OCR extraction process for developers. We used the SDK in our study to enable the efficient processing of our datasets.

GOCR^5 is a free OCR engine that converts images of several types into text.