Fine-Grained Api Usage Extractor
Total Page:16
File Type:pdf, Size:1020Kb
Noname manuscript No. (will be inserted by the editor) fine-GRAPE: fine-Grained APi usage Extractor – An Approach and Dataset to Investigate API Usage Anand Ashok Sawant Alberto Bacchelli · the date of receipt and acceptance should be inserted later Abstract An Application Programming Interface (API) provides a set of functionalities to a developer with the aim of enabling reuse. APIs have been investigated from di↵erent angles such as popularity usage and evolution to get a better understanding of their various characteristics. For such studies, software repositories are mined for API usage examples. However, many of the mining algorithms used for such purposes do not take type information into account. Thus making the results unreliable. In this paper, we aim to rectify this by introducing fine-GRAPE, an approach that produces fine-grained API usage information by taking advantage of type information while mining API method invocations and annotation. By means of fine-GRAPE, we investigate API usages from Java projects hosted on GitHub. We select five of the most popular APIs across GitHub Java projects and collect historical API usage information by mining both the release history of these APIs and the code history of every project that uses them. We perform two case studies on the resulting dataset. The first measures the lag time of each client. The second investigates the percentage of used API features. In the first case we find that for APIs that release more frequently clients are far less likely to upgrade to a more recent version of the API as opposed to clients of APIs that release infrequently. The second case study shows us that for most APIs there is a small number of features that is actually used and most of these features relate to those that have been introduced early in the APIs lifecycle. Keywords Application Programming Interface API usage API popularity Dataset · · · A.A. Sawant Mekelweg 4, 2628 CD, Delft Tel.: +31 (0)15 27 89803 E-mail: [email protected] A. Bacchelli E-mail: [email protected] 2AnandAshokSawant,AlbertoBacchelli 1 Introduction An Application Programming Interface (API) is a set of functionalities pro- vided by a third-party component (e.g., library and framework) that is made available to software developers. APIs are extremely popular as they promote reuse of existing software systems [1]. The research community has used API usage data for various purposes such as measuring of popularity trends [2], charting API evolution [3], and API usage recommendation systems [4]. For example, Xie et al. have developed a tool called MAPO wherein they have attempted to mine API usage for the purpose of providing developers API usage patterns [5,6]. Based on a developers’ need MAPO recommends various code snippets mined from other open source projects. This is one of the first systems wherein API usage recommendation leveraged open source projects to provide code samples. Another example is the work by L¨ammel et al. wherein they mined data from Sourceforge and performed an API usage analysis of Java clients. Based on the data that they collected they present statistics on the percentage of an API that is used by clients. One of the major drawbacks of the current approaches that investigate APIs is that they heavily rely on API usage information (for example to derive popularity, evolution, and usage patterns) that is approximate. In fact, one of the modern techniques considers as “usage” information what can be gathered from file imports (e.g., import in Java) and the occurrence of method names in files. This data is an approximation as there is no type checking to verify that a method invocation truly does belong to the API in question and that the imported libraries are used. Furthermore, information related to the version of the API is not taken into account. Finally, previous work was based on small sample sizes in terms of number of projects analyzed. This could result in an inaccurate representation of the real world situation. With the current work, we try to overcome the aforementioned issues by devising fine-GRAPE (fine-GRained APi usage Extractor), an approach to extract type-checked API method invocation information from Java programs and we use it to collect detailed historical information on five APIs and how their public methods are used over the course of their entire lifetime by 20,263 client projects. In particular, we collect data from the open source software (OSS) reposito- ries on GitHub. GitHub in recent years has become the most popular platform for OSS developers, as it o↵ers distributed version control, a pull-based devel- opment model, and social features [7]. We consider Java projects hosted on GitHub that o↵er APIs and quantify their popularity among other projects hosted on the same platform. We select 5 representative projects (from now on, we call them only API s to avoid confusion with client projects) and analyze their entire history to collect information on their usage. We get fine-grained information about method calls using a custom type resolution that does not require to compile the projects. Title Suppressed Due to Excessive Length 3 The result is an extensive dataset for research on API usage. It is our hope that our data collection approach and dataset not only will trigger further research based on finer-grained and vast information, but also make it easier to replicate studies and share analyses. For example, with our dataset the following two studies can be conducted: First, the evolution of the features of the API can be studied. An analysis of the evolution can give an indication as to what has made the API popular. This can be used to design and carry out studies on understanding what precisely makes a certain API more popular than other APIs that o↵er a similar service. Moreover, API evolution information gives an indication as to exactly at what point of time the API became popular, thus it can be studied in coordination with other events occurring to the project. Second, a large set of API usage examples is a solid base for recommenda- tion systems. One of the most e↵ective ways to learn about an API is by seeing samples [8] of the code in actual use. By having a set of accurate API usages at ones’ disposal, this task can be simplified and useful recommendations can be made to the developer; similarly to what has been done, for example, with Stack Overflow posts [9]. In our previous work titled “A dataset for API Usage” [10], we presented our dataset along with a few details on the methodology used to mine the data. In this paper, we go into more detail into the methodology of our mining process and conduct two case studies on the collected data which make no use of additional information. The first case is used to display the wide range of version information that we have at our disposal. This data is used to analyze the amount of time by which a client of an API lags behind the latest version of the API. Also, the version information is used to calculate as to what the most popular version of an API is. This study can help us gain insights into the API upgrading behavior of clients. The second case showcases the type resolved method invocation data that is present in our database. We use this to measure the popularity of the various features provided by an API and based on this mark the parts of an API that are used and those that are not. With this information an API developer can see what parts of the API to focus on for maintenance and extension. The first study provided initial evidence of a possible distinction between upgrade behavior of clients of APIs that release frequently compared to those that release infrequently. In the former case, we found that clients tend to hang back and not upgrade immediately; whereas, in the latter case, clients tend to upgrade to the latest version. The results of the second case study highlight that only a small part of an API is used by clients. This finding requires further investigation as there is a case to be made that many new features that are being added to an API are not really being adopted by the clients themselves. This paper is organized as follows: Section 2 presents the approach that has been applied to mine this data. For the ease of future users of this dataset an overview of the dataset and some introductory statistics of it can be found 4AnandAshokSawant,AlbertoBacchelli in section 3. Section 4 presents the two case studies that we performed on this dataset. In section 5 we describe the limitations of our approach and the dataset itself. Section 6 concludes this article. 2 Approach We present the 2-step approach that we use to collect fine-grained type- resolved API usage information. (1) We collect data on project level API usage from projects mining open source code hosting platforms (we target such platforms due to the large number of projects they hosted) and use it to rank APIs according to their popularity to select an interesting sample of APIs to form our dataset; (2) apply our technique, fine-GRAPE, to gather fine-grained type-based information on API usages and collect historical usage data by traversing the history of each file of each API client. 2.1 Mining of coarse grained usage In the construction of this dataset, we limit ourselves to the Java program- ming language, one of the most popular programming languages currently in use [11].