Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI-17)

Enhancing the Unified Features to Locate Buggy Files by Exploiting the Sequential Nature of Source Code∗

Xuan Huo and Ming Li
National Key Laboratory for Novel Software Technology, Nanjing University
Collaborative Innovation Center of Novel Software Technology and Industrialization
Nanjing 210023, China
fhuox, [email protected]

∗This research was supported by NSFC (61422304, 61272217).

Abstract

Bug reports provide an effective way for end-users to disclose potential bugs hidden in a software system, while automatically locating the potentially buggy source files according to a bug report remains a great challenge in software maintenance. Many previous approaches represent bug reports and source code using lexical and structural information and correlate their relevance by measuring their similarity. Recently, a CNN-based model was proposed to learn unified features for bug localization, which overcomes the difficulty of modeling natural and programming languages with different structural semantics. However, previous studies fail to capture the sequential nature of source code, which carries additional semantics beyond the lexical and structural terms; such information is vital in modeling program functionalities and behaviors. In this paper, we propose a novel model, LS-CNN, which enhances the unified features by exploiting the sequential nature of source code. LS-CNN combines CNN and LSTM to extract semantic features for automatically identifying potentially buggy source code according to a bug report. Experimental results on widely-used software projects indicate that LS-CNN significantly outperforms the state-of-the-art methods in locating buggy files.

1 Introduction

Software quality assurance is vital to the success of a software system. As software systems become larger and more complex, it is extremely difficult to identify every software defect before formal release, owing to inadequate software testing resources and tight development schedules. Thus, software systems are often plagued with bugs.

To facilitate fast and efficient identification and fixing of the bugs in a released software system, developers often allow users to submit bug reports to bug tracking systems. Bug reports are documents written in natural language that specify the situations in which the software fails to behave as expected or to follow the technical requirements of the system. They are generated by the end-users of the software and then submitted to the software maintenance team. Once a bug report is received and verified, the software maintenance team reads its textual description to locate the potentially buggy source files in the source code, and assigns appropriate developers to fix the bug accordingly. Unfortunately, for large and evolving software, the maintenance team may receive a large number of bug reports over a period of time, and it is costly to manually locate the potentially buggy source files based on bug reports.

Therefore, to alleviate the burden on the software maintenance team, effective models that automatically identify potentially buggy source files for a given bug report are highly desirable, and the problem has drawn significant attention in the software engineering community [Gay et al., 2009; Zhou et al., 2012; Ye et al., 2014; Huo et al., 2016]. The key to identifying buggy source files is to correlate the abnormal program behaviors written in natural language with the source code written in programming languages that implements the corresponding functionality. Some of the existing methods treat the source code as natural language by representing both bug reports and source files with bag-of-words features, and correlate their relevance by measuring similarity in the same feature space. For example, [Gay et al., 2009] represented both source code and bug reports using the vector space model (VSM), based on which the similarities between the source files and a bug report are computed to localize the corresponding buggy files. [Zhou et al., 2012] proposed a revised vector space model (rVSM), in which similar historical bug reports and their corresponding buggy files are further exploited to improve the bug localization results obtained by measuring the similarity between bug reports and source files. Recently, Huo et al. [2016] argued that bug reports in natural language and source code in programming language should be processed in different ways. They employed a particular model based on convolutional neural network (CNN) to learn unified features from both bug reports and source code, which is shown to be effective in modeling source code and improving the performance of locating buggy source files.

Although learning unified features from bug reports and source code by CNN [Huo et al., 2016] is able to overcome the difficulty of modeling natural and programming languages with different structural semantics, the sequential nature of programming language, especially the long-term dependencies between statements, has not been well modeled, which causes a loss of semantic information in the source code. The sequence of statements in source code specifies how statements interact with previous ones along the execution path and data stream transmission, and hence provides additional semantics for program functionality aside from lexical and structural terms. An example makes this more concrete: assume that a private string variable path is initialized with a default value DEFAULT_PATH, and the following code “if path==DEFAULT_PATH {path=getNewPath();} File f=File.open(path)” may result in different behaviors if the previous default value has not been properly considered. Although the CNN model proposed in [Huo et al., 2016] employs filters sliding over statements along the execution path and is thus able to reflect the sequential nature of adjacent statements, it is also very important to consider the sequential nature of statements with long-term dependencies in order to better represent the program functionality and behaviors.

In this paper, we propose a novel unified framework based on deep neural networks called LS-CNN (Long Short-term memory based on Convolutional Neural Network), which exploits the sequential nature of source code to enhance the unified features for bug localization. The method combines the LSTM and CNN models such that the functional semantics of the program and the correlations between bug reports and source code for identifying buggy files are carefully embedded. The key part of LS-CNN is the intra-language feature extraction network that combines LSTM and CNN for source code processing, where LSTM is designed to extract semantic features reflecting the sequential nature of source code and to handle long-term dependencies between statements, and CNN is designed to capture the local and structural information within statements. Experimental results on widely-used software projects indicate that exploiting sequential features is beneficial for bug localization

2 Related Work

To maintain software quality assurance, many bug localization approaches have been studied in recent years. Bug localization, which identifies and locates source files potentially responsible for the bug described in a bug report, is an extremely vital but costly task in software maintenance. Most existing approaches treat the source code as documents and formalize bug localization as a document retrieval problem, calculating the relevancy between a bug report and a source file to identify buggy source code [Poshyvanyk et al., 2007; Lukins et al., 2008]. For example, Gay et al. [2009] employed the Vector Space Model (VSM) based on concept localization to represent bug reports and source code as feature vectors, which are used to measure the similarity between bug reports and source files. Zhou et al. [2012] proposed the BugLocator approach using a revised Vector Space Model, which uses document length and similar bugs that have been solved before as new features. Recently, more information and features from bug reports and source code have been investigated for identifying bugs. Saha et al. [2013] utilized structured information from source code, such as classes and methods, to enable more accurate bug localization. Wang et al. [2014] proposed AmaLgam, which combines version history, similar reports and structure to further improve bug localization performance.

Recently, deep learning models have become very popular and have achieved enormous success in many natural language processing tasks. For example, Johnson and Zhang [2015] utilized convolutional neural network (CNN) to provide an alternative mechanism for effective use of word order in text categorization through direct embedding of small text regions. To overcome the difficulty of learning long-term dynamics with Recurrent Neural Networks (RNNs) [Mikolov et al., 2010] for text processing, [Sepp and Juergen, 1977; Graves, 2012; Donahue et al., 2015] proposed long short-term memory (LSTM), which incorporates memory cells to learn when to forget previous states and when to update current states given new information. Furthermore, several studies have tried deep learning models on software engineering research re-
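
The bag-of-words retrieval idea discussed above, and its blindness to statement order, can be illustrated with a minimal sketch. It assumes plain term-frequency vectors and cosine similarity; it is not the exact weighting of Gay et al. [2009] or the rVSM of Zhou et al. [2012]. The file names, the bug report text, the tokenizer, and the reordered snippet are all hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Term-frequency vector from a crude tokenizer (an assumption of this sketch)."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank_files(report, files):
    """Rank source files by similarity to the bug report, most similar first."""
    query = bag_of_words(report)
    scores = {name: cosine(query, bag_of_words(src)) for name, src in files.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical snippet modeled on the path example from the introduction.
ORIGINAL = "if (path == DEFAULT_PATH) { path = getNewPath(); } File f = File.open(path);"
# Same statements in a different order: the file is opened before the check.
REORDERED = "File f = File.open(path); if (path == DEFAULT_PATH) { path = getNewPath(); }"

# Hypothetical project files and bug report.
FILES = {
    "Loader.java": ORIGINAL,
    "Parser.java": "Token t = lexer.next(); return parse(t);",
}
REPORT = "File open fails when path is the default path"

ranking = rank_files(REPORT, FILES)
same_vector = bag_of_words(ORIGINAL) == bag_of_words(REORDERED)

print(ranking)
print("order-insensitive:", same_vector)
```

In this sketch the two orderings of the snippet map to the same vector even though they would behave differently at run time, which is the kind of sequential information that a purely lexical representation cannot express.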
