Feature-Based Software Design Pattern Detection a ∗ B C Najam Nazar , , Aldeida Aleti and Yaokun Zheng
Total Page:16
File Type:pdf, Size:1020Kb
Feature-Based Software Design Pattern Detection a < b c Najam Nazar , , Aldeida Aleti and Yaokun Zheng Faculty of IT, Monash University, Australia ARTICLEINFO ABSTRACT Keywords: Software design patterns are standard solutions to common problems in software design and archi- Software design patterns tecture. Knowing that a particular module implements a design pattern is a shortcut to design com- code features prehension. Manually detecting design patterns is a time consuming and challenging task; therefore, word-space-model researchers have proposed automatic design patterns detection techniques to facilitate software de- machine learning velopers. However, these techniques use complex procedures and show low performance for certain design patterns. In this work, we introduce an approach that improves the performance over the state- of-the-art by using code features with machine learning classifiers to automatically train a design pattern detection. We create a semantic representation of Java source code using the code features and the call graph, and apply the Word2Vec algorithm on the semantic representation to construct the word-space geometric model of the Java source code. DPDF then uses a Machine Learning approach trained using the word-space model and identifies software design patterns with 74% Precision and 71% Recall. Additionally, we have compared our results with two existing design pattern detection approaches namely FeatureMaps & MARPLE-DPD. Empirical results demonstrate that our approach outperforms the state-of-the-art approaches by 30% and 10% respectively in terms of Precision. The runtime performance also supports its practical applicability of our classifier. 1. Introduction graphs or other representations, they always show low ac- curacy and fail to effectively predict the majority of design Design pattern detection is an active research field and patterns (Yu et al., 2018). At the same time, capturing se- in recent years gained enormous attention by software engi- mantic (lexical) information from the source code is chal- neering professionals (Mayvan and Rasoolzadegan, 2017). lenging and has not been attempted yet in identify design pat- (Kuchana, 2004) defines software design patterns as “recur- ring solutions to common problems in a given context and terns effectively. Translating source code to natural language text has been effectively used in generating source code sum- system of forces”. Since their popularisation by the ‘Gang of Four’ maries with high accuracy (Hu et al., 2018; McBurney and (GoF) (Gamma et al., 1995) in the 1990s, many McMillan, 2015, 2016; Moreno et al., 2013), including our identified design patterns have been widely adopted by soft- own work on identifying summary sentences from code frag- ware professionals to improve the quality of software de- ments (Nazar et al., 2016). Therefore, we hypothesise that sign, architecture, and to facilitate code reuse and refactor- lexical-based (basic) code features extracted from the source ing. Recognising that a particular software module imple- code will increase the accuracy of design pattern detection. ments a design pattern can greatly assist in program compre- In this paper, we introduce Feature-Based Design Pat- hension, and consequently improve software maintenance (Prechelt tern Detection (DPDF ) that uses source code features - both et al., 2002). While design patterns can be easily detected structural and lexical, and employs machine learning classi- by looking at the source code, the complexity of software fier to predict a wider range of design patterns, with higher projects and the differences in coding styles of software de- accuracy compared to the state-of-the-art. Our approach builds velopers sometimes make it hard to find the exact locations a call graph and extract 15 source code features to gener- in code where the patterns have been implemented. ate a Software Syntactic and Lexical Representation (SSLR). arXiv:2012.01708v2 [cs.SE] 29 May 2021 Automatic detection of design patterns has shown use- The SSLR provides the lexical and syntactic information of fulness in assisting software developers to quickly and cor- the Java files as well as the relationships between the files’ rectly comprehend and maintain unfamiliar source code, ul- classes, methods etc., in the natural language form. Using timately leading to higher developer productivity (Walter and SSLR, we build a word-space geometrical model of Java Alkhaeir, 2016; Scanniello et al., 2015; Gaitani et al., 2015; files by applying the Word2Vec algorithm. We train a super- Christopoulou et al., 2012). The majority of existing meth- vised machine learning classifier, DPDF on design patterns ods reverse engineer the source code to identify the design using the labelled set and a geometrical model for recognis- patterns (Detten et al., 2010; Lucia et al., 2011) or build tools ing twelve commonly used object-oriented software design to detect design patterns in the source code e.g, (Hautamäki, patterns. 2005; Moreno and Marcus, 2012), and utilise code metrics To evaluate our approach, we label a corpus of 1,400 e.g, (Uchiyama et al., 2011). Although it is relatively easy Java files, which we refer to as ’DPDF Corpus’. DPDF cor- to obtain structural elements from the source code such as pus is extracted from a publicly available ’Github Java Cor- classes, attributes, methods etc. and transform them into pus’ (Allamanis and Sutton, 2013). To facilitate the labelling, < Corresponding author. we use an online tool ’CodeLabeller’, where expert raters [email protected] (N. Nazar) were asked to label design patterns. We statistically evalu- ORCID(s): ate the performance of DPDF by calculating the Precision, N. Nazar et al.: Preprint submitted to Elsevier Page 1 of 15 Feature based sw design pattern detect Recall and F1-Score measures and compare our approach to design patterns, code features, word-space embeddings/models two existing software design pattern detection approaches, and machine learning. ’FeatureMaps’ and ’MARPLE-DPD’. Empirical results show that our approach is effective in recognising twelve object- 2.1. Design Patterns oriented software design patterns with high Precision (74%) We consider the following twelve design patterns: Adapter, and low error rate (26%). Furthermore DPDF outperforms Builder, Decorator, Factory, Façade, Memento, Observer, the state-of-the-art methods FeatureMaps & MARPLE-DPD Prototype, Proxy, Singleton, Wrapper and Visitor that are by 30% and 10% respectively in terms of Precision. analysed by using the proposed approach. These patterns - Design pattern detection is a subjective process and in except wrapper cover all three categories of GoF patterns i.e. most cases, the decision of selecting a design pattern is based creational, structural and behavioural patterns. Creational on the context it is used. We apply machine learning tech- patterns include Builder, Factory, Prototype and Singleton niques to allow developers to pick examples of what is a patterns whereas, structural patterns include Adapter, Deco- good or bad instance of a design pattern. We foresee two po- rator, Façade and Proxy patterns. We have considered three tential advantages of using machine learning approaches for behavioural patterns for this study and that are Memento, design pattern detection over existing approaches. Firstly, a Observer and Visitor. machine learning approach is potentially much more easily 2.1.1. Adapter vs Wrapper extensible to new patterns than hand-built detectors, as the detector can be trained to recognise a new pattern merely by Though the wrapper pattern is not part of GOF patterns, providing it with sufficient exemplars of code implementing we use it as a separate pattern because a wrapper encapsu- the pattern. Secondly, a machine learning approach may be lates and hides the complexity of other code. It is intended to more robust than a hand-built classifier that may look for one simplify an interface to an external library and contains an- or two superficial features of code implementing a pattern other object and wraps around it. In other words, it has the (for instance, simply looking for name suffixes). This also sole responsibility of moving data to and from the wrapped relates to extensibility, as non-experts attempting to extend object and fulfills the need of a simplified and specific pro- a classifier may be particularly prone to writing oversimpli- gramming interface. Whereas, the Adapter allows code that fied rules for identification. has been designed for compatibility with one interface to be Contributions: In summary, this paper makes the fol- compatible with another - transforming input to match the lowing contributions: input required by another interface. It solves a problem of incompatibility the its extensive use in the corpus. In addi- • We introduce a novel approach called Feature-Based Design Pattern Detector tion, Looking at the corpus, we can see that both adapters (DPDF ) that uses 15 source and wrappers are treated differently. For example, Prepared- code features to detect software design patterns. StatementWrapper1 wraps the error messages to the stata- • We build a large corpus i.e. DPDF Corpus compris- ments. ing of 1,400 Java files, which are labelled from expert 2.1.2. Abstract Factory vs Factory Method software engineers using our online tool Codelabeller. After thoroughly observing the corpus and collecting the • We demonstrate that our approach outperforms two desired instances for experimentation we have noticed that state-of-the-art