
Extracting Code Segments and Their Descriptions from Research Articles

Preetha Chatterjee, Benjamin Gause, Hunter Hedinger, and Lori Pollock
Computer and Information Sciences, University of Delaware, Newark, DE 19716 USA
Email: {preethac, bengause, hedinger, pollock}@udel.edu

Abstract—The availability of large corpora of online software-related documents today presents an opportunity to use machine learning to improve integrated development environments by first automatically collecting code examples along with associated descriptions. Digital libraries of computer science research and education conference and journal articles can be a rich source for code examples that are used to motivate or explain particular concepts or issues. Because they are used as examples in an article, these code examples are accompanied by descriptions of their functionality, properties, or other associated information expressed in natural language text. Identifying code segments in these documents is relatively straightforward, thus this paper tackles the problem of extracting the natural language text that is associated with each code segment in an article. We present and evaluate a set of heuristics that address the challenge that the text is often not colocated with the code segment, as it is in developer communications such as online forums.

Index Terms—mining software repositories, information extraction, code snippet description, text analysis

I. INTRODUCTION

With the increased online sharing of software-related information, software engineers often look beyond documentation and their local resources, seeking examples and advice from experiences of other developers not geographically nearby. The examples are more useful if there is an explanation of their functionality and of the properties that they exhibit. These code descriptions are often not found in others' source code, but are instead in other software-related artifacts such as Q&A forums, blog posts, and emails. The vast availability of online resources has also motivated researchers to develop techniques to help developers more efficiently locate code examples with descriptions by automatically mining code examples from various sources, including emails [1], [2], [3], [4], Q&A forums [5], [6], [7], [8], API documentation [9], bug reports [10], and stack traces.

Digital libraries for computer science research and education articles could potentially provide a large number of code examples with descriptions. The ACM Digital Library contains an archive of every article and publication published by ACM from the 1950s to present [11]. The IEEE Xplore DL includes over 180 journals, over 1,400 conference proceedings, more than 3,800 technical standards, over 1,800 eBooks and over 400 educational courses. Each month, 20,000 new documents are added to IEEE Xplore on average [12]. The publication count of the top conference in the field of software engineering alone, ICSE, is 8,459 at present [13]. In total, the IEEE Xplore digital library provides web access to more than 3.5 million full-text documents of publications in the fields of electrical engineering, computer science and electronics [12].

This paper explores the potential for digital libraries of computer science research and education conference and journal articles to serve as another resource for good code examples with descriptions. To investigate the availability of code examples in computer science digital libraries, we manually counted the number of code segments in 100 randomly selected research articles from ICSE, FSE, and ICSME proceedings. 70% of the selected articles contained one or more code segments, with an average of 3-4 code segments per article. The examples always have some associated descriptions of their functionality, properties, or other associated information expressed in natural language text.

As an example of the kind of information that can be extracted from descriptions in research literature, consider a code snippet and its description in Figure 1, extracted from a paper published in ICSE 2014. The description of the code snippet provides useful information about the source code, including (1) the programming language it was written in, (2) the intent of the overall code that the programmer is implementing, of which this code segment is a part (i.e., a web application), (3) some of the APIs it uses, and (4) the sub-steps being implemented by the code segment, i.e., a description of its functionality.

(a) Code segment as a figure

    $("#addphoto").on('click',
        function() { useGetPicture();}
    );
    function useGetPicture() {
        var cameraOptions = { ... };
        navigator.camera.getPicture(onCameraSuccess,
            onCameraError, cameraOptions);
    }
    function onCameraSuccess(imageData) {
        var image = document.getElementById("..");
        image.src = "data:image/jpeg" + imageData;
    }
    function onCameraError(message) {
        alert("Failed: " + message);
    }

    Figure 2: A JavaScript code snippet containing Cordova, JQuery and JavaScript DOM API usage. Each of the bolded elements can be linked back to the relevant API documentation.

(b) Code-related descriptive text

    Next, consider the JavaScript snippet in Figure 2, where a developer is trying to make a web app that can take a photo and inject it into an element in an HTML document. This example interacts with the JavaScript DOM (getElementById), takes a photo using the Cordova project (getPicture), and uses JQuery to detect when the photo should be taken ($ and on). For each of these method references Baker can identify the API that it is from.

Fig. 1: A sample code snippet with description extracted from the article “Live API Documentation” (ICSE 2014)

Mining code segments and their descriptions from research articles presents challenges beyond those faced in mining from unstructured documents such as forums, bug reports, emails, and issue tracking. In all of these unstructured documents, including research articles, the code segments are intermixed with natural language text, sometimes separated by blank lines and sometimes appearing as single code statements within paragraphs or even individual identifiers within sentences. In all of these documents, the code segments are embedded in the mainstream text. In contrast, code segments in research articles are sometimes embedded within the text, but often separated as figures, which are rarely positioned in the flow consecutively with the text that describes them. The figure could be located in a different section or on a different page. This physical separation of code segment from description makes the description identification problem, i.e., the problem of identifying all the text that contains description of the functionality or property of an embedded code segment, more difficult. The problem is further complicated by the common situation in which the research article contains multiple code segments, in which case the textual descriptions that are identified need to be mapped to the corresponding code segment. Lastly, the description identification problem would be simpler when there is less text to analyze, but unfortunately, research conference articles in general contain 400 lines of natural language text, considerably longer than emails, bug reports, and forum entries. If one wants to scale a code segment and description mining technique, the technique cannot analyze every line of text.

Researchers have developed techniques to automatically extract code segments from emails, bug reports, Q&A sites and tutorials [1], [2], [10], [5]. These techniques are also applicable to extracting code segments from research articles. Less work has focused on the code description identification problem. Traceability analyses can use code terms in a sentence or paragraph as an indicator of which API is being described [5], [14]. Text preceding code segments in Stack Overflow can be extracted as potential comments for similar code segments in an application [6]. Similarly, method descriptions can be extracted using clues in the text [15], [16].

To the best of our knowledge, this is the first paper to address the code description identification problem for research articles and other similar documents such as dissertations. The main contributions of this paper are:

• a set of heuristics to automatically identify and map text that is describing code segments in research articles, including segments embedded as figures
• a set of heuristics to expand the neighborhood of identified descriptions to include informative, yet less obviously related text
• a tool that takes research articles as input and outputs an XML-based representation with markups to associate identified code segments with their corresponding descriptions
• an evaluation study that evaluates the effectiveness of the presented code description identification techniques

II. MOTIVATING EXAMPLES

In addition to the example in Figure 1, we present three additional code snippets and their descriptive text extracted from research articles and discuss how they could be used to further motivate extracting code segments with descriptions from research articles.

The description of the code snippet shown in Figure 2 explains this code's inefficiency and provides useful information including (1) it is a method used for testing more than one test scenario, (2) the specific test coverage of the method, (3) it contains redundancy in the source code, (4) it contains example usage of API methods, and (5) a proposed solution to remove the redundancy. Beyond using the code and its description to show usage of particular APIs in test methods, this example code segment could be used to learn about redundancies in test methods: an IDE could be designed to detect redundancy in lines of code, prompting the user to make separate methods for different test scenarios and congregate the ones with similar functionalities.

(a) Code segment as a figure

    1  public void testKeySetByValue() {
    2    BinaryTree m = new BinaryTree();
    3    LocalTestNode nodes[] = makeLocalNodes();
    4    Collection c1 = new LinkedList();
       ...
    5    m = BinaryTree.initial();
    6    c1.clear();
    7    for (int k = 0; k < nodes.length; k++){
    8      m.put(nodes[ k ].getKey(), nodes[ k ]);
    9      if (k % 2 == 1)
    10       c1.add(nodes[ k ].getKey());
    11   }
    12   assertTrue(m.keySetByValue().retainAll(c1));
    13   assertEquals(nodes.length / 2, m.size());
       ...
    14   m = BinaryTree.initial();
    15   c1.clear();
    16   for (int k = 0; k < nodes.length; k++){
    17     m.put(nodes[ k ].getKey(), nodes[ k ]);
    18     if (k % 2 == 0)
    19       c1.add(nodes[ k ].getKey());
    20   }
    21   assertTrue(m.keySetByValue().removeAll(c1));
    22   assertEquals(nodes.length / 2, m.size());
       ...
    23 }

    Fig. 1. A test method with multiple test scenarios.

(b) Code-related descriptive text

    A major obstacle to extracting API examples from test code is the multiple test scenarios in a test method. Fig. 1 depicts such a test method. Lines 2-4 are the declaration of some data objects. Lines 5-13 depict a test scenario that contains the usage of some API methods, such as keySetByValue, put, and getKey. Lines 14-22 depict another test scenario, which contains a similar usage to the previous one. Such multiple test scenarios are quite reasonable when aiming at covering testing input domains. But they bring redundant code for API users to read. In fact, there are actually 200+ code lines containing similar test scenarios in the test method in Fig. 1. It is necessary to separate different test scenarios from one test method and cluster the similar usages to remove redundancy.

Fig. 2: A sample code snippet with description. Excerpt from the paper “Mining API Usage Examples from Test Code” (ICSME ’14)

The next description in Figure 3 indicates that the code segment demonstrates a common memory leak pattern in the C programming language. The description of the code snippet provides useful information, including (1) the programming language of the source code, (2) a comment description of the pattern of recurring memory leaks, (3) the functionality of the procedure shown, (4) functionalities of individual method calls within the procedure, (5) the type of database used by the program, and (6) the reason for the memory leak. By mining such code segments and their descriptions, a C programming language tutorial could include this code in a lesson on fixing recurring memory leaks, or an IDE could be extended to identify such memory leak patterns as they are implemented. From this example, the consequences of using unsuitable loop exit statements also can be evaluated and avoided.

(a) Code segment as a figure

    1  record *p;
    2  int bad_record_id;
    3  while (has_next()) {
    4    if (search_condition != null)
    5      p = get_next();
    6    else
    7      p = search_for_next(search_condition);
    8    if (is_broken(p)) {
    9      bad_record_id = p->id;
    10     break;
    11   }
    12   free(p);
    13 }
    14 ... // operations on bad_record_id
    15 return;

    Fig. 1. The code of procedure check_records

(b) Code-related descriptive text

    To understand the difficulty of fixing a memory leak, let us take a look at an example program in Fig. 1. This is a contrived example mimicking recurring leak patterns we found in real C programs. Procedure check_records checks whether there is any bad record in a large file, and the caller could either check all records, or specify a search condition to check only part of records. In this example, both get_next and search_for_next will allocate and return a heap structure, which is expected to be freed at line 12. However, the execution may break out the loop at line 10, causing a memory leak.

Fig. 3: A sample code snippet with description. Excerpt from the paper “Safe Memory-leak Fixing for C Programs” (ICSE ’15)

The last example code description, shown in Figure 4, indicates that the code snippet depicts a security vulnerability in a typical SQL injection. The description of the code snippet specifies (1) the functionalities of specific SQL queries, (2) the security issues inherent in the vulnerable code, and (3) the vulnerability in the code. Without the corresponding description, a reader would not be able to understand the security vulnerability in this example code without looking carefully. This code and its description could serve as a good lesson in a SQL tutorial or be used as a pattern in a vulnerability detection tool that provides feedback about the vulnerabilities found, based on this description text.

[Fig. 4: excerpt showing a SQL injection code snippet and its description]

(a) Code segment

    HttpServletRequest request = ...;
    String userName = request.getParameter("name");
    Connection con = ...
    String query = "SELECT * FROM Users " +
                   " WHERE name = '" + userName + "'";
    con.execute(query);

(b) Code-related descriptive text

    This code snippet obtains a user name (userName) by invoking request.getParameter("name") and uses it to construct a query to be passed to a database for execution (con.execute(query)). This seemingly innocent piece of code may allow an attacker to gain access to unauthorized information: if an attacker has full control of string userName obtained from an HTTP request, he can for example set it to ' OR 1 = 1;--. Two dashes are used to indicate comments in the Oracle dialect of SQL, [...]

These examples are just a sampling of the kind of information that can be learned about mined code snippets in research articles if the descriptive text can be mined along with the code snippets to provide the writer's perspective on the properties and functionalities of the mined code segments. Because they originate in research papers, the descriptions typically include the functionality and properties, and are used commonly as examples of those properties, which is a rich kind of information for many purposes.

III. APPROACH

A. Overview of CoDesNPub Miner

Figure 5 presents the phases of our whole process for automatic preprocessing, classification, identification and mark-up of research articles. The input document is the research article, which may be a conference or a journal publication, in pdf format. The preprocessing phase renders the entire input document into plain text. Existing pdf-to-text converters can be used for this phase, such as pdftotext [17], convertmypdf [18], convertpdftotext [19], etc., making sure they can handle both the single and double column formatting of articles. Since some code segments are embedded into articles as images, this phase leverages existing image-to-text converters such as ocrconvert [20], abbyyfinereader [21], etc. to process the images into text so all code segments can be identified.

The first text analysis phase partitions the content such that each chunk is classified into a single category of either source code or natural language text. Various tools are already developed by researchers to do similar content classification tasks [10], [1], [4], [5]. In this paper, we use a tool that takes the previously created plaintext as input, and outputs a tagged XML file with separate tags for natural language text and code segments.

The main focus of this paper is the code description miner, which comprises two phases. The first phase identifies code-related seeds, which are natural language sentences that are directly related to an embedded code segment through either the content or location of the natural language text. The second phase identifies natural language text neighboring code-related seeds that is highly likely to also contain useful information about the code, but is not directly identifiable without the seed text. We refer to these two kinds of code-related text as seeds and neighbors. We use linguistic and [...]

The ReferencesCodeFigure heuristic identifies sentences that contain the word “figure” or “listing”, and uses the figure or listing number reference to check whether the referenced figure has been classified as code. For example in Figure 1, ReferencesCodeFigure would identify the sentence, “Next, consider the JavaScript snippet in Figure 2, where a developer is trying to make a web app that can take a photo and inject it into an element in an HTML document.”

Located Immediately Before or After Inlined Code. Sometimes, authors of research articles place their code examples directly inlined within the running text, similar to the use of code segments in online forums and emails. When this occurs, it is most likely that they are discussing the code segment in sentences just before or just after (or both before and after) the code segment itself. The TextBefore and TextAfter heuristics identify the sentences immediately before and immediately after any inlined code segment as potential code descriptions, respectively.
querytology effectivelyname = ’’ OR becomes 1 = 1. the This tautology allows the attacker name = ’ ’ OR 1 = 1. This allows the attacker For example, in Figure 2, TextAfter would extract “A major 2 Overview of Vulnerabilities toto circumvent circumvent the the name name check check and and get access get access to all user obstacle to extracting API examples from test code is the In this section we focus on a variety of security torecords all user in the records database. in2 the database. multiple test scenarios in a test method.”, since this is the vulnerabilities in Web applications that are caused by SQL injection is but one of the vulnerabilities that sentence that occurs immediately after the code segment in unchecked input. According to an influential sur- Fig.can 4: be A sample formulated code as snippettainted with object description propagationExcerptprob- from the paper the document. vey performed by the Open Web Application Security “Findinglems. Security In this Vulnerabilities case, the in input Java Applications variable withuserName Static Analysis”is con- (SSYM’05) Without combining with other heuristics, these heuristics Project [41], unvalidated input is the number one secu- sidered tainted. If a tainted object (the source or any can be inaccurate if the author always describes code segments rity problem in Web applications. Many such security other object derived from it) is passed as a parameter to structural features of the text and embedded code segments to before or after and not both locations. These heuristics capture identify and map the code descriptions to the associated code the relative location only of sentences surrounding inlined snippets in the document. The system can be implemented to code. Additional sentences in the nearby location will be con- output either as an XML version of the original article with sidered by the neighborhood sentence identification heuristics. 
the code segments and related seeds and neighbors marked Contains Code Identifiers. Sentences describing code seg- up, or a set of extracted code segments and related seeds ments in a research article often contain code identifiers from and neighbors for a database of mined code examples with the associated code segment. The use of code identifiers is descriptive information. particularly common when describing the steps comprising the code, explaining the functionality of each statement or B. Code Description Identification block. Thus, the ContainsCodeIdentifiers heuristic identifies To develop our automatic code description identification all the code segments in a research article that contain a word technique, we analyzed the text of randomly selected computer that also appears in any of the code segments as a user- science research articles from ACM and IEEE digital libraries, defined identifier. This heuristic requires tokenization of the which collectively included over 200 code examples. Based on code segments within the document, creation of a dictionary of our manual inspection of both text related to the code segments variable, method, and class names for each code segment, and and text not related to the code segments, we developed a removal of keywords from those dictionaries. The heuristic set of heuristics that focus on features of sentences, including identifies and maps sentences to associated code segments location and lexical and phrasal information. We first describe based on occurrences of the dictionary names in the sentence. our individual heuristics for identifying sentences as code For example in Figure 3, ContainsCodeIdentifiers would description and then describe how we combine the heuristics identify the sentence “In this example, both get next and to perform code description identification. search for next will allocate and return a heap structure, References Figure Containing Code. 
While some code which is expected to be freed at line 12.”, based on the pres- segments in research articles are embedded within the run- ence of the code identifiers “get next” and “search for next” ning text, many are included as separate figures or listings. in the code segment described by this sentence. This heuristic Typically, when code appears as a figure or listing, authors has the potential to be inaccurate when code segments use will refer the reader to the code they are discussing by using identifiers that are commonly used as regular words in sen- phrases such as “In Figure 1, ...” or “Listing 1 ...”. These tences. references are very accurate cues for an automatic system to References Code By Position. Authors somtimes use specific identify sentences related to the code segment, when we are cue words or phrases pertaining to software engineering and able to identify that the figure being referenced is indeed a development when describing code segments in research ar- code segment. ticles. ReferencesCodeByPosition identifies the sentences that Fig. 5: Overview of CoDesNPub Miner have specific cue words or phrases that suggest that a sentence scheme are a score of 3 for ReferencesCodeFigure, a is describing a code segment in the document. This heuristic score of 2 each for ContainsCodeIdentifiers and Ref- looks for phrases such as “...in the following code...”, “in the erencesCodeByPosition, and a score of 1 for each of running example”, etc. Specifically, this heuristic aims to iden- TextBefore and TextAfter. tify sentences containing code-indicating words such as ‘code’, Each heuristic is applied to the document resulting in a ‘method’, ‘loop’, ‘Javascript’, etc. Based on the adjective in score for each sentence. The heuristics can be applied in the phrase, it searches either before or after the sentence for a any order as their application order does not affect the final code segment located in the designated relative position near scoring. 
A threshold is used with the final scores to classify the neighboring two paragraphs to the sentence, to confirm a given sentence as code description. In our experimental that the sentence is referring to a code segment. For example, study, we perform a threshold analysis, including a threshold in Figure 4, ReferencesCodeByPosition would identify the that requires only one heuristic to be triggered to consider sentence, “ obtains a user name (user- This code snippet a sentence as a code description, up through requiring some Name) by invoking request.getParameter(“name”)and uses it combination of heuristics that achieves a high score. We to construct a query to be passed to a database for execution evaluate the various precisions and numbers of sentences (con.execute (query)).” identified with different thresholds and scoring schemes. 1) Putting It All Together: A given sentence may be iden- tified as a potential code description sentence by more than 2) Identifying Neighboring Code-related Text: Our obser- one heuristic. For instance, a given sentence might say “In the vations during development revealed that there may be neigh- code below,” and also be located immediately before a code boring text to the code description sentences identified by the segment, or a sentence might mention a code identifier and heuristics that is also related, but the heuristics do not indicate include the phrase “In our example code.” A given sentence that directly. The additional sentences often describe finer might contain more than one identifier which is a stronger details such as the intuition for implementing the code, etc. and indicator than just one identifier that might occur in more than are important for better understanding and reuse of the same one segment. code example. 
Figure 4 shows an example where the code- We combine the heuristics by assigning a score to a sentence related neighboring text would be the sentence “This allows each time a heuristic indicates that it is potentially a code the attacker to circumvent the name check and get access to description. We pose two scoring schemes as follows: all user records in the database.”. Identifying this sentence is important since it describes the consequence of the security • All cues are treated as equally contributing Equal Scores: vulnerability threat in that example code, but the heuristics to the potential for a sentence to be a code description. with the textual cues would not identify this sentence. Each instance of any heuristic being triggered for a given Our manual analysis suggested that that if a part of a sentence results in adding a score of 1 to the total score paragraph of text contains sentences identified by the heuristics for that sentence. to describe a code segment, then the entire paragraph often • Accuracy-based Scores: Some heuristics such as Refer- describes the code segment extensively. However, not every encesCodeFigure are highly likely to accurately identify paragraph with an identified seed sentence was entirely a a sentence as a code description whereas others could be code description. Therefore, we explored several percentages less accurate, such as TextBefore and TextAfter. Thus, this of paragraph sentences as minimum numbers of seed sentences scoring scheme assigns different scores to each instance needed to consider the whole paragraph as code description of different heuristics depending on the basis of our text. observations of relative accuracy during our work with the development set. Based on our development set analysis, • At least one sentence in the paragraph matches one or the final scores for each heuristic for the best scoring more heuristics to identify text directly related to code. 
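To make the seed scoring and paragraph-level expansion described above concrete, the following is a minimal Python sketch, not the CoDesNPub Miner implementation. The weights are the accuracy-based scores reported in the text. Everything else is an assumption made for illustration: the function names, the tiny keyword list, the regular-expression tokenizer, the rule that a sentence is a seed when its score is greater than or equal to the threshold, and the hypothetical six-sentence paragraph.

```python
import re

# Accuracy-based weights reported in the text; the equal scheme gives 1 to all.
ACCURACY_WEIGHTS = {
    "ReferencesCodeFigure": 3,
    "ContainsCodeIdentifiers": 2,
    "ReferencesCodeByPosition": 2,
    "TextBefore": 1,
    "TextAfter": 1,
}
EQUAL_WEIGHTS = {name: 1 for name in ACCURACY_WEIGHTS}

# Illustrative keyword list (assumption); a real tool would use the full
# keyword set of every supported programming language.
KEYWORDS = {"if", "else", "for", "while", "return", "new", "int", "void",
            "String", "class", "public", "private", "static"}
IDENTIFIER_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def identifier_dictionary(code_segment):
    """Tokenize a code segment and drop keywords, keeping candidate names."""
    return set(IDENTIFIER_RE.findall(code_segment)) - KEYWORDS


def contains_code_identifiers(sentence, identifiers):
    """True when any word of the sentence is in the segment's dictionary."""
    return any(word in identifiers for word in IDENTIFIER_RE.findall(sentence))


def score_sentence(triggers, weights):
    """Sum the weight of every heuristic listed as triggered on a sentence."""
    return sum(weights[name] for name in triggers)


def find_seeds(paragraph_triggers, weights, threshold):
    """Indices of sentences whose total score reaches the threshold."""
    return [i for i, triggers in enumerate(paragraph_triggers)
            if score_sentence(triggers, weights) >= threshold]


def expand_to_paragraph(paragraph_triggers, weights, threshold, min_seed_fraction):
    """Indices marked as code description after neighbor expansion.

    If the share of seed sentences reaches min_seed_fraction, every sentence
    of the paragraph is marked; otherwise only the seeds are. A fraction of
    0.0 stands for the "at least one seed sentence" criterion.
    """
    seeds = find_seeds(paragraph_triggers, weights, threshold)
    if not seeds:
        return []
    if len(seeds) / len(paragraph_triggers) >= min_seed_fraction:
        return list(range(len(paragraph_triggers)))
    return seeds


# Hypothetical paragraph: one list of triggered heuristics per sentence.
paragraph = [
    ["ReferencesCodeFigure"],
    ["ContainsCodeIdentifiers"],
    [],
    ["ContainsCodeIdentifiers"],
    ["ReferencesCodeByPosition"],
    ["ContainsCodeIdentifiers", "TextBefore"],
]
```

With the accuracy-based weights and a threshold of 2, five of the six hypothetical sentences are seeds, so a 50% minimum marks the whole paragraph. Raising the threshold to 3 leaves only the first and last sentences as seeds, and the paragraph is no longer expanded. Note that the regular-expression tokenizer also picks up words inside string literals, which is one way the common-word inaccuracy mentioned for ContainsCodeIdentifiers can arise.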
The stricter criteria require that at least 25%, 50%, or 75%, respectively, of the total number of sentences in the paragraph match one or more heuristics to identify text directly related to code.

C. Code Description Identification Example

Consider a paragraph extracted from a paper published in ICSME 2014. We consider each heuristic on each sentence of this paragraph as our code description identification process works at sentence granularity. We describe the example using the accuracy-based scoring scheme.

    Fig.5 shows a typical test method of this pattern. The method tests a set of basic functionality of API class BasicAuthCache, including the method put, get, remove and clear. There are three test scenarios in the method: line 4-5, line 6-7, line 8-10. They share two data objects, cache and authScheme. Their method invocation sequences are not same and there is no unified test target method. But there is a common subsequence among three method invocation sequences, i.e., the invocations of get and HttpHost.

    Excerpt from the paper “Mining API Usage Examples from Test Code” (ICSME ’14)

ReferencesCodeFigure would identify the first sentence as code description due to the presence of the word “Fig.5”, which is the figure number for the code segment described in this paragraph. We would assign a score of 3 to this sentence. The next sentence contains code identifiers “BasicAuthCache” and “get”, found in the described code segment of the research article. So ContainsCodeIdentifiers would indicate the second sentence is code description and assign a score of 2. The third sentence would not be identified as a seed of a code description by any of our heuristics. The fourth sentence would be identified by ContainsCodeIdentifiers since it contains the code identifiers “cache” and “authScheme”, found in the described code segment. Hence, this sentence would be assigned a score of 2. The fifth sentence would be identified by ReferencesCodeByPosition due to the presence of the phrase “method invocation”, and assigned a score of 2. The next sentence would also be identified by ContainsCodeIdentifiers since it contains the code identifiers “get” and “HttpHost”. However, the same sentence would also get an additional score by TextBefore since this sentence is found immediately above the code segment that it describes. Hence, the last sentence would be assigned a score of 2 by ContainsCodeIdentifiers, plus a score of 1 for TextBefore, making the total score of this sentence 3. At this point, all the seeds are identified. Five of the six sentences are seeds, so the criterion that at least 50% of the sentences in a paragraph must be seeds would mark the whole paragraph as code description text.

IV. EVALUATION

We designed our evaluation to answer the research question:

RQ1: How effective is our approach to automatically identify code descriptions in natural language text of research articles?

In addition, we also collected data to answer two questions about how code segments are described in research articles. Namely, we collect data to answer:

RQ2: What kinds of information are available in natural language text describing code segments in research articles?

RQ3: How do authors typically reference code segments within their code description text in research articles (i.e., What cues are most prominent?)

A. Evaluation Design

1) Implementation: Our code description identification process is fully automatic. It takes XML with markup classification of code and natural language text of a single article as input and outputs XML with additional markup for code description text, as well as data for our evaluation study. Due to the inaccuracies of the tools for pdf-to-text conversion and OCR-to-text conversion that we experienced, the preprocessing is currently semi-automatic. That is, we apply current state of the art tools to convert to text, but then manually clean up the inaccuracies, so that our evaluation study is not affected by the inaccuracies of the preprocessing.

2) Subjects and Measures: The subjects in our study are research articles (disjoint from our development set) that contain in total 100 code segments, selected from ACM DL and IEEE Xplore in the domain of software engineering. Because many articles have more than one code segment, our final evaluation set consists of 4 journal papers and 4 conference papers published between 2011 and 2015.

To answer RQ1, we measure the overall precision and recall of the code description identification, and also the precision of the seed identification. We do not compute recall of the seed identification because we did not want to reveal details of our approach to the human annotators in creation of the gold set. Precision is calculated by determining the percentage of automatically identified code description sentences that are marked as code-related descriptions by human judges. Precision of the seed heuristics is computed similarly, instead focusing on only those automatically identified seed sentences. Recall of the overall code description identification is computed by determining the percentage of all the sentences that describe the code segments in the study (as identified by human judges) that are also identified as code descriptions by the automatic technique.

To answer RQ2, one of the authors used the results from a previous study [22] and manual analysis of the human annotated sentences to develop a labeling scheme to code the annotated sentences. We defined six major categories of labels, or codes, and twenty sub-labels for the observed code properties as described in Table IV. RQ2 is addressed by coding each annotated sentence and computing the frequency of occurrence of each label in the subject set of research articles. To answer RQ3, we computed the frequency that each seed heuristic was triggered, including counts for each time a given heuristic is triggered more than once on a given sentence.

3) Methodology: We created a gold set for our evaluation by recruiting human annotators. Our human annotators consisted of 10 computer science students - 9 graduate students and 1 senior undergraduate researcher. These participants had no knowledge of our techniques, are not authors on this paper, and are equipped with prior computer science and programming experience.

We designed a set of instructions and had two of the participants test the annotation procedure while keeping a note of the time they required for each code segment. Based on the timing results, each of the ten judges was assigned research papers for 20-30 of the randomly selected code segments. To account for potential subjectivity of human opinion, each of the 100 code segments was analyzed by two judges separately. Therefore, in total, we collected 200 annotated objects for this evaluation study. Since there were inconsistencies in some of the human annotations, we considered any sentence that either annotator highlighted in our evaluation as a code description.

Specifically, the judges were instructed to annotate natural language text in the papers with the following instructions, “Your task is to review several assigned research papers and highlight any text in the entire paper that you think is describing an embedded code segment or any property of the code segment (highlighted in yellow in the document), and label each highlighted text with the related code segment number”.

For our evaluation data set, the humans annotated 745 sentences as code descriptions. The gold set does not include any captions. We did not ask the human judges to highlight the captions of the figures containing code segments in the evaluation set, since we assume that a caption to a figure containing code is always relevant to that code, and we did not want the captions to bias the results.

Scoring Scheme          Threshold
                        1        2        3
Equal score (=1)        62.69    80.26    71.42
Accuracy-based score    62.69    69.33    72.89

TABLE I: Precision (%) of seed heuristics (Accuracy-based scoring: References Figure Containing Code: 3; Contains Code Identifiers and References Code By Position: 2; Located Immediately Before or After Inlined Code: 1)

Minimum # of Seeds    Precision (%)    Recall (%)
1-24%                 39.05            70.20
≥ 25%                 53.41            50.33
≥ 50%                 66.04            28.45
≥ 75%                 68.30            20.53

TABLE II: Effectiveness of code description identification with different schemes to identify neighboring code-related text

B. Results and Discussion

We organize our evaluation results by research questions.

RQ1: How effective is our approach to automatically identify code descriptions in natural language text of research articles?

As part of evaluating the effectiveness, we considered several configurations for code description identification: (1) Should all seed heuristics be treated equally or with different scores reflecting their perceived accuracy? (i.e., What scoring approach provides better precision?) (2) For each of the seed heuristic scoring schemes, which threshold provides higher precision? (3) How does the minimum number of seed sentences used to identify neighboring code-related text affect the precision and recall of CoDesNPub Miner? Note that we are more interested in precision than recall because we want the identified descriptions to indeed be descriptive, whereas missing some descriptions is not critical.

Table I presents results to answer the first two research subquestions, by reporting the precision for the two seed scoring schemes under three thresholds. The precision is the same for a threshold of 1 because it indicates that only one heuristic is needed to identify a seed, in either scoring scheme. Higher thresholds with equal scores mean at least 2 or 3 heuristics, respectively, need to indicate a seed. In the equal scoring scheme, requiring two heuristics for a seed identification provides the highest precision, in fact, higher than any of the thresholds with the accuracy-based scheme.

Table II addresses the last research sub-question by reporting precision and recall for different neighboring code-related text identification using the best scoring and threshold combination for identifying seeds. Table II shows results for four minimum numbers of seed sentences needed to consider the whole paragraph as code description text. As expected, the precision is higher at the higher minimums (≥ 50% and ≥ 75%), with a tradeoff of reduced recall. With higher precision in the identification as a priority over missing descriptions, the higher minimums would be used.

Our qualitative analysis focuses on: When our system is not effective, what is the breakdown between, and the characteristics of, incorrectly identified code descriptions and missed code descriptions? We examined the evaluation set where our best configuration either missed code descriptions or incorrectly identified code descriptions. There exist 71 out of 224 sentences (31%) that were identified incorrectly as code descriptions, and 592 out of 816 sentences (72%) that the human judges indicated described code examples, but the system missed them. Table III shows examples from each of these categories along with some correctly identified code description sentences.

Analysis of the sentences incorrectly identified as code descriptions indicates that these sentences were either describing an algorithm (or pseudo code) or referring to figures with statistical analysis from experiments in the article. In the third example in Table III, the author is explaining the results of an experiment using figures containing charts. This sentence is identified using our seed heuristic ReferencesCodeFigure. Our tool is currently not able to discard a sentence that describes figures containing statistical analysis such as tables or charts. In the fourth example, the author explains the intuition behind implementing a functionality. Although this sentence does not describe a code segment specifically, it gives us some information about the implementation, which might be useful for building code recommendation systems.

Our analysis of the sentences where CoDesNPub Miner missed sentences describing code segments revealed some limitations of using a system based only on features of phrases contained in sentences. The fifth example in Table III contains assumptions of a specific code implementation, explaining the pre-conditions needed before implementing an algorithmic step in the code. The absence of phrases explicitly mentioning a code implementation caused the tool to miss such sentences. Lastly, in the sixth example, the figure referred to in this sentence does not contain real code examples, but rules for an implementation. This sentence contains information about the rationale for implementing a code, which our tool fails to identify, again due to absence of code specific phrases in the sentence.

Identified correctly as code descriptions
  1. First, we notice that EVOSUITE uses the method toString rather than getRootElementName in the assertion.
  2. Listing 9 shows an example of three statements that were single statement blocks after the first phases, but can be merged into a single block because they have similar RHSs.
Identified incorrectly as code descriptions
  3. The results of our initial study are summarized in the form of boxplots in Figure 2, and detailed statistical analysis is presented in Table (a) for Option, Table III (b) for Rational, and finally Table III (c) for DocType.
  4. Since our choice of a particular algorithm may not match what the user needs, having the ability to add user-defined functions was important.
Missed code descriptions
  5. Meanwhile, if it appears in a requires clause (i.e., the precondition of the updated version), E should be evaluated in the pre-state of the previous version (i.e., (σ1, h1)).
  6. Such a difference is captured in the two topmost rules in Figure 5 (c) where notations “ensures” and “requires” designate the clause in which a prev expression appears.

TABLE III: Examples of Analyzed Code Description Sentences

Label         Sub-Label               Description
Design        Programming Language    Programming language
Design        Framework               Framework used
Design        Time/Space Complexity   Code complexity
Structure     Data Structure          Data structures or variable types
Structure     Control Flow            Types of control statements used
Structure     Data Flow               Data flow chains included
Structure     Lines of code           Length of code
Explanatory   Rationale               Why being implemented in this way
Explanatory   Functionality           What is being implemented
Explanatory   Methodology             How functionality is implemented
Explanatory   Output of code          Results of running code
Explanatory   Similarity              Syntactic or semantically similar code blocks
Explanatory   Modification            Change(s) to existing code
Clarity       High                    Code is clean and understandable
Clarity       Low                     Code is unclear or overly complex
Efficiency    Efficient               Better/efficient code example
Efficiency    Inefficient             Inefficient code example
              Assumptions             Conditions to be met to ensure correctness
Erroneous     Compilation             Code that fails to compile
Erroneous     Runtime                 Contains runtime errors or exceptions thrown

TABLE IV: Description of labels and sublabels

RQ2: What kinds of information are available in natural language text describing code segments in research articles?

Figure 6a and Figure 6b depict our frequency distribution of the kinds of information described in the natural language text that was annotated in our gold set, as coded by our labeling scheme and sub-labeling, respectively. Figure 6b indicates that Methodology information is the most prevalent kind of information, which shows that the main purpose of mining code examples with descriptions from articles can be to explain the aspects of their implementation. The second most prevalent kind of information is Rationale, which shows that authors also explain why a code segment is implemented in a particular way, which could be valuable meta-data for a mined code example for learning.

These results also suggest that research articles rarely contain overly complex code examples, since they mostly describe novel ways to address a problem rather than going into the details of code complexity. Looking beyond these two categories, we see that a wide variety of information can be gained from descriptions associated with code segments in digital libraries.

Fig. 6: Kinds of information in research articles. (a) Frequency of labels (b) Frequency of sub-labels

RQ3: How do authors typically reference code segments within their code description text in research articles (i.e., What cues are most prominent?)

The relative frequency of each feature used to indicate code description text is shown in Figure 7, which depicts that References Figure Containing Code is the most prevalent heuristic. The next most prevalent heuristic is Neighboring Code-related Text, which helps in identifying sentences that describe less obvious details about code segments.

Fig. 7: Relative frequency of each feature indicating code description text

C. Threats to Validity

Our subjects are selected from both journal and conference papers in software engineering, across different years from ACM DL and IEEE Xplore digital libraries, which contain millions of full-text documents of publications. The results may not transfer to papers from different disciplines in computer science; we chose publications in the field of software engineering as we believe these contain a large number of analyzable code segments.

One possible threat could be programming language dependence. The technique we used to identify code segments in unstructured documents is capable of identifying code segments in different programming languages from documents containing code segments and natural language. All of our heuristics for extracting code descriptions are also programming language independent. Our heuristic ReferencesCodeByPosition uses a manually created dictionary of words implying description of code segments. To create this dictionary, we have selected papers in our evaluation set that contain code examples in various programming languages such as Java, C++, C, Python, etc.

Research papers often interleave pseudocode and code fragments; however, CoDesNPub Miner is not able to distinguish between pseudocode and code fragments. It identifies all code fragments in a research paper, and also cannot differentiate between novel code contributions and code segments that are used only as examples in an empirical study. Our datasets for development and evaluation consist of papers that do not contain pseudocode. We plan to extend our approach to identify both code fragments and pseudocode in our future work.

As with any study based on human annotators for establishing the ground truth, there might be some cases where the humans may not have correctly annotated the descriptions for the code segments. To limit this threat, we ensured that the human judges had considerable programming experience and research paper reading experience. We also ensured that each code segment was judged by at least two judges, and when they disagreed, we considered any sentence that either annotator highlighted in our evaluation.

The dataset used for evaluating CoDesNPub consisted of a total of 8 papers including both journal and conference publications. Considering the amount of research work produced in IEEE and ACM publications for the period of 4 years (2011-2015), it is possible that scaling to more than 100 code segments in our evaluation set might lead to different results. However, we needed to make the human annotation work reasonable to recruit judges. We will expand the evaluation study in the near future with more participants, and research papers containing more code segments.

V. RELATED WORK

The most related work to this research is in collecting and analyzing information from sets of research articles, identification of code snippets from unstructured documents, and identification of any textual descriptions associated with embedded code snippets.

Analyzing Collections of Research Articles. Cruzes et al. [14] ran an entity recognition tool called Site Content Analyzer on software engineering papers to analyze the linguistic features of the documents such as word density and frequency. Based on the results, they claim that information extraction techniques like text mining can support systematic reviews and creation of repositories of SE empirical evidence. Researchers have analyzed repositories of research articles to support their evaluations. For example, Siegmund et al. [23] discussed the tradeoff between internal and external validity and replication, complemented with a literature review about the status of empirical research in software engineering. Kampenes et al. [24], [25] reported systematic reviews of controlled and quasi-experiments published in major software engineering proceedings. They investigated the selection bias and the practice of effect size reporting, summarized standardized effect sizes detected in the experiments, and provided advice for improvements based on the results. Tichy et al. [26] discussed the lack of experimentally validated results and quantitative evaluations in computer science journals, supported by a survey of 400 research articles.

Code Segment Extraction. Bacchelli et al. [1] used lightweight regular expression-based techniques to identify code blocks in emails. Their features were lexical, focusing on programming language specific characteristics such as special characters and keywords and end of line markers. Their evaluation suggests that lightweight methods are to be preferred over heavyweight techniques for source code extraction from emails. Tang et al.

Code Description Identification. Panichella et al. [4] developed a feature-based approach to automatically extract method descriptions from developer communications in bug tracking systems and mailing lists. Evaluation on two open source systems indicated that the approach is able to extract method descriptions with a precision up to 79% for Eclipse and 87% for Lucene. Vassallo et al. [27] built on their previous work [4], to design a tool that extracts candidate method documentation from StackOverflow discussions, and creates Javadoc descriptions. Their tool is able to extract descriptions for 20% and 28% of the Lucene and Hibernate methods with a precision of 84% and 91% respectively.

Wong et al. [7] proposed an automatic comment generation approach, which mines comments from Stack Overflow, and uses the code-description mappings in the posts to automatically generate descriptive comments for similar code segments matched in open-source projects. For Java and Android tagged Q&A posts, they extracted 132,767 code-description mappings, to generate 102 comments automatically for 23 Java and Android projects. Rahman et al. [28] developed a heuristic-
[2] filtered out non-NL text including based technique for mining comments from Stack Overflow email headers, signatures, and code-related content (stack Q & A site for a given code segment. Evaluation on 292 traces, patches, and source code snippets) before cleaning the Stack Overflow code segments and 5,039 discussion comments remaining text with paragraph and sentence detection. They showed that their approach has a recall of 85.42%. Most of manually labeled data sets and then used SVM classification these systems focused on identifying source code descriptions with specific features for each filtering target. Cerulo et. al from Stack Overflow posts, where the text describing the [3] introduced an approach, based on Hidden Markov Models code is always found next to the code snippet. StackOverflow (HMMs), to extract coded information islands, such as source posts also have specific XML-tagged formats, which makes code, stack traces, and patches, from emails. They trained a the extraction of the information straightforward. HMM for each category of information contained in the text of the emails, and used the Viterbi algorithm to recognize whether VI.CONCLUSION AND FUTURE WORK the sequence of tokens observed in a text switches among those HMMs. Evaluation showed an accuracy of 82%-99%. This paper takes a first step towards unleashing the potential This approach does not require manual definition of regular to mine the vast number of computer science articles in digital expressions or parsers. libraries for code segments that come with useful descrip- Bettenburg et al. [10] developed a tool called InfoZilla that tive information about their functionality and properties. We identifies and classifies code patches, stack traces, source code, present and evaluate the first technique to automatically iden- and enumerated lists in bug reports. 
They apply specific filters tify natural language descriptions of code segments embedded for each category, using island parsing for identifying source within articles, where code segments can be separated as code. Evaluation showed almost perfect accuracy for each kind figures that are not located next to their descriptive text. Our of structure. The approach focuses on bug reports, all with evaluation study indicates that we can achieve precision of the same programming language used in the code snippets, 68.30% with recall of 20.53% with a single configuration of patches, and stack traces, and would require developing new scoring and threshold scheme, which is promising. Analysis parsers to handle a broader class of developer documentation. of the information available in the descriptions shows that a Subramanian et. al. [16] performed analyses of source code variety of information about code segments could be learned. snippets found in Stack Overflow, constructing an Abstract Future work includes fully automating the front-end pre- Syntax Tree (AST) for each code snippet and then parsing processing of articles, more extensive evaluation and study to effectively identify specific API usage. Building on their with other types of articles and different domains, and more previous work [16], Subramanian et. al. [6] developed an research to improve the precision and recall of the automated iterative, deductive method of linking source code examples to description identification. API documentation. Rigby et. al [5] developed a tool that uses an island parser to identify code elements in a Stack Overflow ACKNOWLEDGMENT post. Evaluation on documents that contain over 7058 distinct tags on StackOverflow showed an average precision and recall This research is supported by the National Science Founda- of 0.92 and 0.90, respectively. These techniques are also tion under Grant No.1422184 and the DARPA MUSE program applicable to extract code segments from research articles. 
under Air Force Research Lab contract no. FA8750-16-2-0288.

REFERENCES

[1] A. Bacchelli, M. D’Ambros, and M. Lanza, “Extracting source code from e-mails,” in Program Comprehension (ICPC), 2010 IEEE 18th International Conference on, June 2010, pp. 24–33.
[2] J. Tang, H. Li, Y. Cao, and Z. Tang, “Email data cleaning,” in Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, ser. KDD ’05. New York, NY, USA: ACM, 2005, pp. 489–498. [Online]. Available: http://doi.acm.org/10.1145/1081870.1081926
[3] L. Cerulo, M. Ceccarelli, M. Di Penta, and G. Canfora, “A hidden markov model to detect coded information islands in free text,” in Source Code Analysis and Manipulation (SCAM), 2013 IEEE 13th International Working Conference on, Sept 2013, pp. 157–166.
[4] S. Panichella, J. Aponte, M. D. Penta, A. Marcus, and G. Canfora, “Mining source code descriptions from developer communications,” in Program Comprehension (ICPC), 2012 IEEE 20th International Conference on, June 2012, pp. 63–72.
[5] P. C. Rigby and M. P. Robillard, “Discovering essential code elements in informal documentation,” in Proceedings of the 2013 International Conference on Software Engineering, ser. ICSE ’13. Piscataway, NJ, USA: IEEE Press, 2013, pp. 832–841. [Online]. Available: http://dl.acm.org/citation.cfm?id=2486788.2486897
[6] S. Subramanian, L. Inozemtseva, and R. Holmes, “Live api documentation,” in Proceedings of the 36th International Conference on Software Engineering, ser. ICSE 2014. New York, NY, USA: ACM, 2014, pp. 643–652. [Online]. Available: http://doi.acm.org/10.1145/2568225.2568313
[7] E. Wong, J. Yang, and L. Tan, “Autocomment: Mining question and answer sites for automatic comment generation,” in Automated Software Engineering (ASE), 2013 IEEE/ACM 28th International Conference on, Nov 2013, pp. 562–567.
[8] C. Treude and M. P. Robillard, “Augmenting api documentation with insights from stack overflow,” in Proceedings of the 38th International Conference on Software Engineering, ser. ICSE ’16. New York, NY, USA: ACM, 2016, pp. 392–403. [Online]. Available: http://doi.acm.org/10.1145/2884781.2884800
[9] J. Montandon, H. Borges, D. Felix, and M. Valente, “Documenting apis with examples: Lessons learned with the apiminer platform,” in Reverse Engineering (WCRE), 2013 20th Working Conference on, Oct 2013, pp. 401–408.
[10] N. Bettenburg, R. Premraj, T. Zimmermann, and S. Kim, “Extracting structural information from bug reports,” in Proceedings of the 2008 International Working Conference on Mining Software Repositories, ser. MSR ’08. New York, NY, USA: ACM, 2008, pp. 27–30. [Online]. Available: http://doi.acm.org/10.1145/1370750.1370757
[11] “ACM wiki page,” https://en.wikipedia.org/wiki/Association_for_Computing_Machinery.
[12] “IEEEXplore wiki page,” https://en.wikipedia.org/wiki/IEEE_Xplore.
[13] “ICSE publication history,” http://dl.acm.org/event.cfm?id=RE228&tab=pubs&CFID=723067040&CFTOKEN=52119863.
[14] D. Cruzes, M. Mendonça, V. Basili, F. Shull, and M. Jino, “Automated information extraction from empirical software engineering literature: Is that possible?” in Proceedings of the First International Symposium on Empirical Software Engineering and Measurement, ser. ESEM ’07. Washington, DC, USA: IEEE Computer Society, 2007, pp. 491–493. [Online]. Available: http://dl.acm.org/citation.cfm?id=1302496.1302980
[15] G. Petrosyan, M. P. Robillard, and R. De Mori, “Discovering information explaining api types using text classification,” in Proceedings of the 37th International Conference on Software Engineering - Volume 1, ser. ICSE ’15. Piscataway, NJ, USA: IEEE Press, 2015, pp. 869–879. [Online]. Available: http://dl.acm.org/citation.cfm?id=2818754.2818859
[16] S. Subramanian and R. Holmes, “Making sense of online code snippets,” in Mining Software Repositories (MSR), 2013 10th IEEE Working Conference on, May 2013, pp. 85–88.
[17] “pdftotext online tool,” http://pdftotext.com.
[18] “convertmypdf online tool,” http://www.convertmypdf.net/.
[19] “convertpdftotext online tool,” http://www.convertpdftotext.net/.
[20] “ocrconvert ocr tool,” http://www.ocrconvert.com/.
[21] “abbyyfinereader ocr tool,” https://www.abbyy.com/en-us/finereader/.
[22] P. Chatterjee, M. Nishi, K. Damevski, V. Augustine, L. Pollock, and N. Kraft, “What information about code snippets is available in different software-related documents? an exploratory study,” in Proceedings of the 24th IEEE International Conference on Software Analysis, Evolution, and Reengineering (SANER’17), Feb. 2017.
[23] J. Siegmund, N. Siegmund, and S. Apel, “Views on internal and external validity in empirical software engineering,” in Proceedings of the 37th International Conference on Software Engineering - Volume 1, ser. ICSE ’15. Piscataway, NJ, USA: IEEE Press, 2015, pp. 9–19. [Online]. Available: http://dl.acm.org/citation.cfm?id=2818754.2818759
[24] V. B. Kampenes, T. Dybå, J. E. Hannay, and D. I. K. Sjøberg, “Systematic review: A systematic review of effect size in software engineering experiments,” Inf. Softw. Technol., vol. 49, no. 11-12, pp. 1073–1086, Nov. 2007. [Online]. Available: http://dx.doi.org/10.1016/j.infsof.2007.02.015
[25] V. B. Kampenes, T. Dybå, J. E. Hannay, and D. I. K. Sjøberg, “A systematic review of quasi-experiments in software engineering,” Inf. Softw. Technol., vol. 51, no. 1, pp. 71–82, Jan. 2009. [Online]. Available: http://dx.doi.org/10.1016/j.infsof.2008.04.006
[26] W. F. Tichy, P. Lukowicz, L. Prechelt, and E. A. Heinz, “Experimental evaluation in computer science: A quantitative study,” J. Syst. Softw., vol. 28, no. 1, pp. 9–18, Jan. 1995. [Online]. Available: http://dx.doi.org/10.1016/0164-1212(94)00111-Y
[27] C. Vassallo, S. Panichella, M. Di Penta, and G. Canfora, “Codes: Mining source code descriptions from developers discussions,” in Proceedings of the 22nd International Conference on Program Comprehension, ser. ICPC 2014. New York, NY, USA: ACM, 2014, pp. 106–109. [Online]. Available: http://doi.acm.org/10.1145/2597008.2597799
[28] M. Rahman, C. Roy, and I. Keivanloo, “Recommending insightful comments for source code using crowdsourced knowledge,” in Source Code Analysis and Manipulation (SCAM), 2015 IEEE 15th International Working Conference on, Sept 2015, pp. 81–90.