Extracting Code Segments and Their Descriptions from Research Articles
Preetha Chatterjee, Benjamin Gause, Hunter Hedinger, and Lori Pollock
Computer and Information Sciences
University of Delaware
Newark, DE 19716 USA
Email: {preethac, bengause, hedinger, pollock}@udel.edu

Abstract—The availability of large corpora of online software-related documents today presents an opportunity to use machine learning to improve integrated development environments by first automatically collecting code examples along with associated descriptions. Digital libraries of computer science research and education conference and journal articles can be a rich source for code examples that are used to motivate or explain particular concepts or issues. Because they are used as examples in an article, these code examples are accompanied by descriptions of their functionality, properties, or other associated information expressed in natural language text. Identifying code segments in these documents is relatively straightforward; thus, this paper tackles the problem of extracting the natural language text that is associated with each code segment in an article. We present and evaluate a set of heuristics that address the challenge that the text is often not colocated with the code segment, as it is in developer communications such as online forums.

Index Terms—mining software repositories, information extraction, code snippet description, text analysis

I. INTRODUCTION

With the increased online sharing of software-related information, software engineers often look beyond documentation and their local resources, seeking examples and advice from experiences of other developers not geographically nearby. The examples are more useful when accompanied by an explanation of their functionality and the properties they exhibit. These code descriptions are often not found in others' source code, but are instead in other software-related artifacts such as Q&A forums, blog posts, and emails. The vast availability of online resources has also motivated researchers to develop techniques to help developers more efficiently locate code examples with descriptions by automatically mining code examples from various sources, including emails [1], [2], [3], [4], Q&A forums [5], [6], [7], [8], API documentation [9], bug reports [10], and stack traces.

Digital libraries for computer science research and education articles could potentially provide a large number of code examples with descriptions. The ACM Digital Library contains an archive of every article and publication published by ACM from the 1950s to present [11]. The IEEE Xplore DL includes over 180 journals, over 1,400 conference proceedings, more than 3,800 technical standards, over 1,800 eBooks and over 400 educational courses. Each month, 20,000 new documents are added to IEEE Xplore on average [12]. The publication count of the top conference in the field of software engineering alone, ICSE, is 8,459 at present [13]. In total, the IEEE Xplore digital library provides web access to more than 3.5 million full-text documents of publications in the fields of electrical engineering, computer science and electronics [12].

This paper explores the potential for digital libraries of computer science research and education conference and journal articles to serve as another resource for good code examples with descriptions. To investigate the availability of code examples in computer science digital libraries, we manually counted the number of code segments in 100 randomly selected research articles from ICSE, FSE, and ICSME proceedings. 70% of the selected articles contained one or more code segments, with an average of 3-4 code segments per article. The examples always have some associated descriptions of their functionality, properties, or other associated information expressed in natural language text.

As an example of the kind of information that can be extracted from source code descriptions in research literature, consider a code snippet and its description in Figure 1, extracted from a paper published in ICSE 2014. The description of the code snippet provides useful information about the source code, including (1) the programming language it was written in, (2) the intent of the overall code that the programmer is implementing, of which this code segment is a part (i.e., a web application), (3) some of the APIs it uses, and (4) the sub-steps being implemented by the code segment, i.e., a description of its functionality.

Mining code segments and their descriptions from research articles presents challenges beyond those faced in mining from unstructured documents such as forums, bug reports, emails, and issue tracking. In all of these unstructured documents, including research articles, the code segments are intermixed with natural language text, sometimes separated by blank lines and sometimes as single code statements within paragraphs or even individual identifiers within sentences. In all of these documents, the code segments are embedded in the mainstream text. In contrast, code segments in research articles are sometimes embedded within the text, but often separated as figures, which are rarely positioned in the flow consecutively with the text that describes them. The figure could be located in a different section or on a different page. This physical separation of code segment from description makes the description identification problem, i.e., the problem of identifying all the [...] articles and other similar documents such as dissertations.
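To make the separation problem concrete, the sketch below collects, for each figure number, the sentences in an article's text that mention that figure. This is only an illustration of why figure-embedded code needs its description located elsewhere in the text; it is not one of the heuristics presented in this paper. The names `FIGURE_REF`, `SENTENCE_SPLIT`, `split_sentences`, and `sentences_referencing_figures` are hypothetical, and the sentence splitter is deliberately naive.

```python
import re

# Hypothetical pattern: matches in-text references such as "Figure 2" or "Fig. 2".
FIGURE_REF = re.compile(r"\b(?:Figure|Fig\.)\s*(\d+)", re.IGNORECASE)

# Naive sentence boundary (an assumption): ., ! or ? followed by whitespace
# and an uppercase letter. Real article text needs a more robust splitter.
SENTENCE_SPLIT = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def split_sentences(text: str) -> list[str]:
    """Split text into sentences using the naive boundary rule above."""
    return [s.strip() for s in SENTENCE_SPLIT.split(text) if s.strip()]


def sentences_referencing_figures(text: str) -> dict[int, list[str]]:
    """Map each figure number to the sentences that mention it, in text order."""
    refs: dict[int, list[str]] = {}
    for sentence in split_sentences(text):
        # Use a set so a sentence naming the same figure twice is added once.
        numbers = {int(m.group(1)) for m in FIGURE_REF.finditer(sentence)}
        for number in numbers:
            refs.setdefault(number, []).append(sentence)
    return refs


if __name__ == "__main__":
    sample = (
        "Consider the snippet in Figure 1. It uses several APIs. "
        "The description of the snippet shown in Figure 2 explains its inefficiency. "
        "Figure 1 also shows a web application."
    )
    for number, sentences in sorted(sentences_referencing_figures(sample).items()):
        print(number, sentences)
```

Sentences that mention a figure explicitly are only a starting point under this sketch: descriptive text that never names the figure would be missed, which is the gap a neighborhood-expansion step would need to address.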
The main contributions of this paper are:
• a set of heuristics to automatically identify and map text that is describing code segments in research articles, including segments embedded as figures
• a set of heuristics to expand the neighborhood of identified descriptions to include informative, yet less obviously related text
• a tool that takes research articles as input and outputs an XML-based representation with markups to associate identified code segments with their corresponding descriptions
• an evaluation study that evaluates the effectiveness of the presented code description identification techniques

[Begin figure panel: page excerpt from another article, with its own section and figure numbering]

Section 2 presents a scenario that motivates our approach and demonstrates the kinds of links it can identify. The approach and our implementation of it are explained in detail in Sections 3 and 4. Section 5 then presents our evaluation of Baker. The documentation linking prototype is described in Section 6, followed by discussion in Section 7. Related work is described in Section 8; Section 9 concludes the paper.

2. SCENARIO

Consider the Java code snippet shown in Figure 1. This snippet (pertaining to a library called GWT) was posted to Stack Overflow to assist a developer who did not understand how to manipulate the state of History objects. The figure contains a number of bolded elements. These are the types and methods that our tool, Baker, can uniquely link to the API; i.e., the elements for which it can determine a fully-qualified name. With this information we can automatically augment the HTML version of the official API documentation for History by dynamically injecting the code example into the web page. We can also inject the links to the official API into the Stack Overflow post; these two additions to the documentation would make it easier for developers to learn how to use this class.

    public FirstPanel() {
      History.addHistoryListener(this);
      String token = History.getToken();
      if (token.length() == 0) {
      [...]

Next, consider the JavaScript snippet in Figure 2, where a developer is trying to make a web app that can take a photo and inject it into an element in an HTML document. This example interacts with the JavaScript DOM (getElementById), takes a photo using the Cordova project (getPicture), and uses JQuery to detect when the photo should be taken ($ and on). For each of these method references Baker can identify the API that it is [...]

    $("#addphoto").on('click',
      function() { useGetPicture();}
    );
    function useGetPicture() {
      var cameraOptions = { ... };
      navigator.camera.getPicture(onCameraSuccess,
          onCameraError, cameraOptions);
    }
    function onCameraSuccess(imageData) {
      var image = document.getElementById("..");
      image.src = "data:image/jpeg" + imageData;
    }
    function onCameraError(message) {
      alert("Failed: " + message);
    }

Figure 2: A JavaScript code snippet containing Cordova, JQuery and JavaScript DOM API usage. Each of the bolded elements can be linked back to the relevant API documentation.

3. APPROACH

Identifying API elements in code snippets requires the ability to parse these snippets. This is more difficult than parsing full files because code snippets can be ambiguous. Dagenais and Robillard highlighted four kinds of ambiguity that can hamper the identification of elements [9]; two of these were specific to the plain-text analysis they were performing, while the other two were more generally relevant. These two were declaration ambiguity and external reference ambiguity.

Declaration Ambiguity. Snippets are, by definition, [...]

[End figure panel]
(a) Code segment as a figure

II. MOTIVATING EXAMPLES

In addition to the example in Figure 1, we present three additional code snippets and their descriptive text extracted from research articles and discuss how they could be used to further motivate extracting code segments with descriptions from research articles. The description of the code snippet shown in Figure 2 explains this code's inefficiency and provides useful information including (1) it is a method used for testing more than one test scenario, (2) the specific test