Feature-Based Software Design Pattern Detection

Najam Nazar, Aldeida Aleti and Yaokun Zheng

Faculty of IT, Monash University, Australia

ARTICLE INFO

Keywords: software design patterns, code features, word-space model

ABSTRACT

Software design patterns are standard solutions to common problems in software design and architecture. Knowing that a particular module implements a design pattern is a shortcut to design comprehension. Manually detecting design patterns is a time-consuming and challenging task; therefore, researchers have proposed automatic design pattern detection techniques to assist software developers. However, these techniques use complex procedures and show low performance for certain design patterns. In this work, we introduce an approach that improves the performance over the state-of-the-art by using code features with machine learning classifiers to automatically train a design pattern detector. We create a semantic representation of Java source code using the code features and the call graph, and apply the Word2Vec algorithm to the semantic representation to construct a word-space geometric model of the Java source code. DPDF then uses a machine learning approach trained on the word-space model and identifies software design patterns with 74% Precision and 71% Recall. Additionally, we compare our results with two existing design pattern detection approaches, namely FeatureMaps and MARPLE-DPD. Empirical results demonstrate that our approach outperforms these state-of-the-art approaches by 30% and 10% respectively in terms of Precision. The runtime performance of our classifier also supports its practical applicability.

1. Introduction

Design pattern detection is an active research field that has in recent years gained enormous attention from software engineering professionals (Mayvan and Rasoolzadegan, 2017). Kuchana (2004) defines software design patterns as "recurring solutions to common problems in a given context and system of forces". Since their popularisation by the 'Gang of Four' (GoF) (Gamma et al., 1995) in the 1990s, many identified design patterns have been widely adopted by software professionals to improve the quality of software design and architecture, and to facilitate maintenance and refactoring. Recognising that a particular software module implements a design pattern can greatly assist in program comprehension, and consequently improve maintainability (Prechelt et al., 2002). While design patterns can often be detected by looking at the source code, the complexity of software projects and the differences in the coding styles of software developers sometimes make it hard to find the exact locations in code where the patterns have been implemented.

Automatic detection of design patterns has proven useful in assisting software developers to quickly and correctly comprehend and maintain unfamiliar source code, ultimately leading to higher developer productivity (Walter and Alkhaeir, 2016; Scanniello et al., 2015; Gaitani et al., 2015; Christopoulou et al., 2012). The majority of existing methods reverse engineer the source code to identify the design patterns (Detten et al., 2010; Lucia et al., 2011), build tools to detect design patterns in the source code, e.g., (Hautamäki, 2005; Moreno and Marcus, 2012), or utilise code metrics, e.g., (Uchiyama et al., 2011). Although it is relatively easy to obtain structural elements from the source code, such as classes, attributes and methods, and transform them into graphs or other representations, such techniques show low accuracy and fail to effectively predict the majority of design patterns (Yu et al., 2018). At the same time, capturing semantic (lexical) information from the source code is challenging and has not yet been exploited to identify design patterns effectively. Translating source code to natural language text has been used effectively to generate source code summaries with high accuracy (Hu et al., 2018; McBurney and McMillan, 2015, 2016; Moreno et al., 2013), including our own work on identifying summary sentences for code fragments (Nazar et al., 2016). Therefore, we hypothesise that lexical (basic) code features extracted from the source code will increase the accuracy of design pattern detection.

In this paper, we introduce Feature-Based Design Pattern Detection (DPDF), which uses source code features, both structural and lexical, and employs a machine learning classifier to predict a wider range of design patterns with higher accuracy than the state-of-the-art. Our approach builds a call graph and extracts 15 source code features to generate a Software Syntactic and Lexical Representation (SSLR). The SSLR provides the lexical and syntactic information of the Java files, as well as the relationships between the files' classes, methods, etc., in natural language form. Using the SSLR, we build a word-space geometrical model of the Java files by applying the Word2Vec algorithm. We then train a supervised machine learning classifier, DPDF, on design patterns using the labelled set and the geometrical model to recognise twelve commonly used object-oriented software design patterns.

To evaluate our approach, we label a corpus of 1,400 Java files, which we refer to as the 'DPDF Corpus'. The DPDF Corpus is extracted from the publicly available 'Github Java Corpus' (Allamanis and Sutton, 2013). To facilitate the labelling, we use an online tool, 'CodeLabeller', where expert raters were asked to label design patterns. We statistically evaluate the performance of DPDF by calculating the Precision,


Recall and F1-Score measures, and compare our approach to two existing detection approaches, 'FeatureMaps' and 'MARPLE-DPD'. Empirical results show that our approach is effective in recognising twelve object-oriented software design patterns, with high Precision (74%) and a low error rate (26%). Furthermore, DPDF outperforms the state-of-the-art methods FeatureMaps and MARPLE-DPD by 30% and 10% respectively in terms of Precision.

Design pattern detection is a subjective process: in most cases, the decision to select a design pattern is based on the context in which it is used. We apply machine learning techniques to allow developers to pick examples of what is a good or bad instance of a design pattern. We foresee two potential advantages of using machine learning approaches for design pattern detection over existing approaches. Firstly, a machine learning approach is potentially much more easily extensible to new patterns than hand-built detectors, as the detector can be trained to recognise a new pattern merely by providing it with sufficient exemplars of code implementing the pattern. Secondly, a machine learning approach may be more robust than a hand-built classifier that may look for one or two superficial features of code implementing a pattern (for instance, simply looking for name suffixes). This also relates to extensibility, as non-experts attempting to extend a classifier may be particularly prone to writing oversimplified rules for identification.

Contributions: In summary, this paper makes the following contributions:

• We introduce a novel approach called Feature-Based Design Pattern Detector (DPDF) that uses 15 source code features to detect software design patterns.
• We build a large corpus, the DPDF Corpus, comprising 1,400 Java files, which are labelled by expert software engineers using our online tool CodeLabeller.
• We demonstrate that our approach outperforms two state-of-the-art approaches by substantial margins in terms of Precision, Recall and F1-Score.

Paper Organisation: Our paper begins with an introduction in Section 1 and concludes in Section 7. We briefly discuss the preliminaries and relevant background of the related technologies, in particular design patterns, code features, word space models and machine learning, in Section 2. Section 3 discusses our study design, including the research questions, how the corpora are selected and labelled, the tool used to label the data, which source code features are used and how they are selected, and Word2Vec for building the n-gram model, ending with a discussion of our machine learning classifier. Section 4 presents results with respect to our research questions and discusses our results as well as the overall approach. We discuss the threats to the validity of our study in Section 5, and the related work is presented in Section 6.

2. Preliminaries

In the following subsections, we briefly introduce some basic concepts used in this study, related to design patterns, code features, word-space embeddings/models and machine learning.

2.1. Design Patterns

We consider the following twelve design patterns, analysed using the proposed approach: Adapter, Builder, Decorator, Factory, Façade, Memento, Observer, Prototype, Proxy, Singleton, Wrapper and Visitor. These patterns (except Wrapper) cover all three categories of GoF patterns, i.e., creational, structural and behavioural patterns. The creational patterns include the Builder, Factory, Prototype and Singleton patterns, whereas the structural patterns include the Adapter, Decorator, Façade and Proxy patterns. We consider three behavioural patterns in this study: Memento, Observer and Visitor.

2.1.1. Adapter vs Wrapper

Though the Wrapper pattern is not part of the GoF patterns, we use it as a separate pattern because a wrapper encapsulates and hides the complexity of other code. It is intended to simplify an interface to an external library; it contains another object and wraps around it. In other words, it has the sole responsibility of moving data to and from the wrapped object, and fulfils the need for a simplified and specific programming interface. The Adapter, in contrast, allows code that has been designed for compatibility with one interface to be compatible with another, transforming input to match the input required by another interface. It solves a problem of incompatibility, hence its extensive use in the corpus. In addition, looking at the corpus, we can see that adapters and wrappers are treated differently. For example, PreparedStatementWrapper¹ wraps the error messages for the statements.

2.1.2. Abstract Factory vs Factory Method

After thoroughly observing the corpus and collecting the desired instances for experimentation, we noticed that the majority of the Abstract Factory and Factory Method patterns use the 'factory' keyword instead of having creators, producers or any other words. In addition, our code features, such as inheritance and keywords, can pick up the keywords and the relationships between classes, which are quite similar for both the Abstract Factory and Factory Method patterns. Both patterns deal with creating families of factory objects, and thus have similar usage. Therefore, to simplify our detection process, we treat both patterns as a single Factory pattern.

2.2. Code Features

Code features are static source code attributes that are extracted by examining the source code (McBurney et al., 2018), such as the size of the code elements (e.g., lines of code), the complexity of the code (such as if/else blocks), object-oriented attributes in code (such as inheritance), and source code constructs that are uniquely identifiable names for a construct, for example, class names and method names.

¹https://github.com/ms969/ChenSun/blob/master/CS5430/jars/mysql-connector-java-5.1.18/src/com/mysql/jdbc/jdbc2/optional/PreparedStatementWrapper.java

Zanoni et al. (2015) refer to features as code entities, i.e., the names given to any code construct that is uniquely identifiable by its name and the name of its containers. In the object-oriented paradigm, code entities are classes, interfaces, enums, annotations, etc., as well as methods and attributes. Previously, Nazar et al. (2016) used 21 textual features to identify summary lines for code fragments.

Building on the aforementioned studies, we decided to use a mixture of structural and lexical object-oriented code constructs to investigate whether they can be useful in identifying design patterns from the source code. These features capture the behavioural, structural and creational aspects of the code, as needed for design patterns (discussed in Section 2.1), and are described further in Section 3.3. In this paper, we use the term features, as we believe they are more relevant to design patterns and depict the code structure and semantics better than the low-level entities discussed in the related studies.

2.3. Word Embeddings

Word space models are abstract representations of the meaning of words, encoded as vectors in a high-dimensional space (Salton et al., 1975). A word vector space is constructed by counting co-occurrences of pairs of words in a text corpus, building a large square n-by-n matrix, where n is the size of the vocabulary and the cell (i, j) contains the number of times the word i has been observed in co-occurrence with the word j in the corpus. The i-th row in a co-occurrence matrix is an n-dimensional vector that acts as a distributional representation of the i-th word in the vocabulary.

The key to using a vector representation to compute the semantic relatedness of words lies in the Distributional Hypothesis (DH) (Harris, 1954). The DH states that "words that occur in the same contexts tend to have similar meaning", therefore allowing the computational linguist to approximate a measure of how much two words are related in their meaning by a numeric value drawn from a vector space model.

The similarity between two words is geometrically measurable with any distance metric. The most widespread metric for this purpose is the cosine similarity, defined as the cosine of the angle between two vectors:

    S(x, y) = (x · y) / (‖x‖ ‖y‖)
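To make the cosine measure concrete, here is a minimal sketch in Python (our illustration, not part of the paper's tooling; the two vectors are made-up toy values):

```python
import numpy as np

def cosine_similarity(x, y):
    """S(x, y) = (x . y) / (||x|| ||y||): cosine of the angle between two word vectors."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

# Toy 4-dimensional "word vectors" (illustrative values only).
x = np.array([0.2, 0.8, 0.1, 0.4])
y = np.array([0.3, 0.7, 0.0, 0.5])
print(cosine_similarity(x, y))  # near 1.0 for similar words, near 0.0 for unrelated ones
```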
Several techniques can be applied to reduce the dimensionality of the co-occurrence matrix. Latent Semantic Analysis (LSA), for instance, uses Singular Value Decomposition (SVD) to prune the less informative elements while preserving most of the topology of the vector space, reducing the number of dimensions to the order of hundreds (Landauer and Dumais, 1997; Mikolov et al., 2013).

Recently, neural network based models have received increasing attention for their ability to compute dense, low-dimensional representations of words. To compute such representations, i.e., the word embeddings, several models rely on huge amounts of natural language text from which a vector representation for each word is learned by a neural network. These representations of words are based on prediction, as opposed to counting. Embedded vectors created using predictive models such as Word2Vec have many advantages compared to LSA (Baroni et al., 2014), for instance, their ability to compute dense, low-dimensional predictive representations of words. Vector spaces built on word distributional representations have been successfully shown to encode word similarity and relatedness relations (Radinsky et al., 2011; Reisinger and Mooney, 2010; Ciobanu and Dinu, 2013; Collobert et al., 2011), and word embeddings have proven to be a useful feature in many natural language processing tasks (Collobert et al., 2011; Le and Mikolov, 2014; dos Santos and Zadrozny, 2014), in that they often encode semantically meaningful information about a word.

2.4. Machine Learning

Machine learning (ML) is the study of computer programmes that learn from data and improve automatically through experience (Mitchell, 1997). The essential elements in ML are the data (structured such as text, or unstructured or semi-structured such as source code), the model, e.g., Support Vector Machines (SVM) and Random Forest (RF) (Cortes and Vapnik, 1995; Ho, 1995), and the evaluation procedure, e.g., cross-validation. Machine learning algorithms differ in their approach, the type of data they input and output, and the type of task or problem that they intend to solve. In short, there are three major types of machine learning approaches: supervised, unsupervised and semi-supervised. Supervised learning builds a mathematical model based on labelled data to predict future results (Manning et al., 2008; Bishop, 2006). Unsupervised learning, on the other hand, infers a function to describe a hidden structure from unlabelled data (Bishop, 2006). Semi-supervised learning falls between the supervised and unsupervised approaches, since it uses both labelled and unlabelled data for training, typically a small amount of labelled data and a large amount of unlabelled data (Manning et al., 2008; Bishop, 2006).

In this paper, we use supervised learning. There are two major types of supervised learning, namely classification and regression. Classification methods usually use two sets, a training set and a test set. The training set is used for learning the classifiers and requires a primary group of labelled individuals, in which the category of each individual is known from its label. The test set is used to measure the efficiency of the learned classifiers and includes labelled individuals that do not participate in learning the classifiers. Regression, on the other hand, allows us to predict a continuous outcome for the variable of interest.

Design pattern detection can be categorised as a classification problem, where classes containing a pattern can be labelled by expert raters. Thus, supervised classification learning against the ground truth, i.e., whether a class does or does not contain a design pattern, can be used to predict design patterns. In this study, RF- and SVM-based classifiers are considered from the benchmark studies, whereas our DPDF classifier builds a design pattern detection model that classifies given classes.

3. Study Design

In this section, we describe our research questions and justify how they serve the overall goal of detecting design patterns from source code. Additionally, we lay out our methodology for answering these questions, including the code features we use.

3.1. Research Questions

This study seeks to identify design patterns through code features and the Word2Vec algorithm, along with the application of supervised machine learning. In doing so, we aim to examine the relationship between code features and design patterns. Therefore, we pose the following three Research Questions (RQs):

1. RQ1: How well does the proposed approach recognise software design patterns?
   The rationale behind RQ1 is to determine whether our approach identifies design patterns accurately. For this purpose, we statistically evaluate our classifiers using the standard statistical measures Precision, Recall and F1-Score.
2. RQ2: What is the misclassification rate of the classifier?
   For addressing RQ2, we build a confusion matrix for the classifier to find what percentage of pattern instances from the labelled data it identifies and what percentage of instances it misses.
3. RQ3: How well does our approach perform compared to existing approaches?
   It is important to compare our approach with existing studies to further evaluate its efficacy. To do so, we select two studies, FeatureMaps proposed by Thaller et al. (2019) and MARPLE-DPD proposed by Zanoni et al. (2015), as benchmark studies for comparing our approach. These studies, as well as the details of the benchmark corpus, are discussed in detail in Section 3.2.

3.2. Methodology

Our methodology to address the research questions is as follows. First, we collect a corpus containing Java projects. Next, we discuss how the code features are selected and applied to extract the syntactic and semantic representation (SSLR) of the corpus. After that, we apply the Word2Vec algorithm to the SSLR to create a geometric representation that can be read by the machine. In the end, we train supervised classifiers on the labelled corpus to predict design pattern instances from the geometric representation of the Java files. In the subsequent sections, we discuss these steps one by one.

3.2.1. Data Collection

Existing datasets labelled with design patterns are either too small or not fully publicly available. We found one publicly available corpus of Java projects called P-MART. On exploring P-MART, we found that it contains an uneven number of design pattern instances and requires substantial curation to fit our purpose. Therefore, we create a new corpus, the DPDF-Corpus, which we label with the respective design patterns. The aim is to increase the size of the existing corpora and make it publicly available for future researchers.

We build the new corpus based on the Github Java Corpus (GJC) (Allamanis and Sutton, 2013), which is the largest publicly available corpus consisting exclusively of open source Java projects collected from GitHub across multiple domains. In total, the GJC contains 2,127,357 Java files in 14,436 Java projects. As the first step, we remove unnecessary files, such as unit test cases (JUnit) or files such as HTML and CSS, from the GJC, as these files do not normally implement design patterns. Next, we select a subset of the GJC projects using the correction-of-finite-population approach² to determine the size of the final corpus that can be trained by the machine learning classifier. The final DPDF corpus has 1,400 files. To ensure that sufficient instances are available for training and testing the machine learning algorithms, we select at least 100 instances for each of the 12 design patterns, as shown in Table 1.

Table 1
The number of instances of each pattern in the labelled DPDF-Corpus

Patterns    #     Patterns    #
Adapter     112   Observer    102
Builder     104   Prototype   102
Decorator   104   Proxy       111
Factory     123   Singleton   102
Facade      102   Wrapper     112
Memento     100   Visitor     104
None        122

²https://www.surveysystem.com/sscalc.htm
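The finite-population correction applied via the online calculator² follows a standard formula; the sketch below is our illustration of that formula, with an assumed 95% confidence level and 5% margin of error (the authors' exact parameters are not reported):

```python
import math

def corrected_sample_size(population, z=1.96, margin_of_error=0.05, p=0.5):
    """Sample size with the finite-population correction used by common survey
    calculators. z (95% confidence), margin_of_error and p are assumed values."""
    n0 = (z ** 2) * p * (1 - p) / (margin_of_error ** 2)  # infinite-population estimate
    return math.ceil(n0 / (1 + (n0 - 1) / population))    # finite-population correction

# e.g. sampling from the 14,436 GJC projects (illustrative call, not the authors' numbers)
print(corrected_sample_size(14436))
```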
Benchmark Corpus: In addition to the DPDF corpus, we employ an existing dataset, P-MART³, used by the benchmark studies (Zanoni et al., 2015; Thaller et al., 2019). It contains 4,242 files from 9 projects: QuickUML, Lexi, JRefactory, Netbeans, JUnit, JHotDraw, MapperXML, Apache Nutch and PMD. P-MART contains 1,039 files that are labelled by the corpus authors with the design patterns we plan to identify (except the Wrapper). Table 2 shows the distribution of design patterns in the P-MART Corpus. As the table shows, there is a large imbalance between the patterns in P-MART.

3.2.2. Data Labelling

We built an online tool using Node/Angular JS, called CodeLabeller⁴, to label both corpora via crowd knowledge.

³http://www.ptidej.net/tools/designpatterns/
⁴www.codelabeller.org


Table 2
The number of instances of each pattern in the P-MART corpus

Patterns    #     Patterns    #
Adapter     241   Observer    137
Builder     43    Prototype   32
Decorator   63    Proxy       3
Factory     342   Singleton   13
Facade      11    Visitor     139
Memento     15

Each file is labelled by at least three annotators. We ensure that the annotators have at least two years of programming experience in the Java programming language, have applied design patterns in the projects they have worked on, and have taught software design at the university level. Thus, we asked faculty members and postgraduate honours and doctoral students at the Faculty of IT, Monash University, to perform the labelling of the DPDF-Corpus through the online CodeLabeller tool. To further simplify the annotation process, we labelled both Abstract Factory and Factory Method patterns as the 'Factory' pattern in the reference set. 'None'-labelled files are those that do not implement or contain any standard design patterns. If a pattern comprises more than one file, for example, a Façade pattern comprising both interface and implementation, the raters are asked to label both files as the Façade pattern.

Raters Agreement: Labelling (or annotation) is a subjective process and there is a possibility that a file is labelled differently by different annotators. Following a similar approach to Nazar et al. (2016), we perform a Cohen's Kappa Test (Cohen, 1960) to measure the level of agreement among annotators. For the DPDF corpus, the kappa K-value is 0.74, showing a medium to high level of agreement among raters (Cohen, 1960; Carletta, 1996). We have not calculated the Kappa value for the benchmark corpus.

3.3. Feature Extraction

Design patterns describe classes and their loose arrangements and communication paths. Consequently, the extracted features should ideally capture these arrangements and their relationships to improve reasoning in later stages. As discussed in Section 2.2, source code features are high-level code constructs that are used to extract information from the source code. Following a similar approach to McBurney et al. (2018) and Nazar et al. (2016), we selected 15 features that are based on the syntactic and semantic (linguistic) constructs of Java source code. As listed in Table 3, these features include class names, method names, modifiers, attributes, method return types, the number of parameters, the number of lines in a method, incoming methods and outgoing methods. We may informally categorise these features into two major groups: class-level features (4 features in total) and method-level features (11 features in total).

3.3.1. Class-Level Features

A class is a user-defined blueprint or prototype from which objects are created, and in the Java language the class name begins with a capital letter; the first code feature we select is therefore 'ClassName', i.e., the name of the Java class. The second feature concerns the access modifiers in the Java language, i.e., public, private, protected and default. The next two features, features 3 and 4, implements and extends, are Java keywords related to the inheritance principle of object-oriented languages. Inheritance is key to many patterns, such as Observer and Factory; therefore, it is important to capture it. In the Java language, there are two keywords for inheritance: extends, when a class inherits from a class, and implements, when a class inherits from an interface. An interface can extend another interface in the same way that a class can extend another class; therefore, the extends keyword can also be used to extend an interface.

3.3.2. Method-Level Features

A Java method is a collection of statements that are grouped together to perform an operation. In general, a method in Java has a name, which begins with a lowercase letter in camelCase format, one or more parameters (sometimes known as attributes) enclosed in parentheses, and a return type; these are features 5, 6 and 7 respectively in our case. Feature 8 measures properties of the statements inside the method's body; these statements can be local attributes, conditional statements such as if, else or switch, or assignment statements. Feature 9 measures the number of variables or attributes in the method; some methods have locally scoped variables while others do not. Feature 10 is the number of methods called within a method, whereas feature 11 counts the number of lines within a method's body. 'MethodIncomingMethod', feature 12, counts the methods a given method calls; for example, if Method A calls Methods B, C and D, then the value of 'MethodIncomingMethod' is 3. The difference between features 10 and 12 is that feature 10 counts all the methods called within a method, whether they belong to the same class or to other classes, whereas feature 12 only counts the methods that belong to external classes. For example, if method A in class A calls method B from class B and method C from class A, then the value of 'MethodIncomingMethod' is 1. Feature 13 is related to feature 12, as it records the names of the methods involved. Similarly, features 14 and 15 are related to each other, as they record the number of methods from which a given method, say method A, has been called, and their names.
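To illustrate how such features can be read off a parse tree, the following sketch uses the javalang library (the paper itself uses Plyj; javalang is swapped in here purely for illustration, and the feature names mirror Table 3 while the extraction logic is our own):

```python
import javalang  # the paper uses Plyj; javalang stands in here for illustration

def extract_features(source):
    """Extract a handful of Table 3's class- and method-level features."""
    tree = javalang.parse.parse(source)
    rows = []
    for cls in tree.types:
        if not isinstance(cls, javalang.tree.ClassDeclaration):
            continue
        for method in cls.methods:
            rows.append({
                'ClassName': cls.name,                                    # feature 1
                'ClassModifiers': sorted(cls.modifiers),                  # feature 2
                'ClassImplements': int(bool(cls.implements)),             # feature 3
                'ClassExtends': int(cls.extends is not None),             # feature 4
                'MethodName': method.name,                                # feature 5
                'MethodParam': len(method.parameters),                    # feature 6
                'MethodReturnType': (method.return_type.name
                                     if method.return_type else 'void'),  # feature 7
            })
    return rows

src = "class Shape extends Base { public int area(int w, int h) { return w * h; } }"
print(extract_features(src))
```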


Table 3
The 15 source code features and their descriptions

No. Feature               Description
1   ClassName             Name of the Java class.
2   ClassModifiers        Public, protected, private keywords etc.
3   ClassImplements       A binary feature (0/1): whether a class implements an interface.
4   ClassExtends          A binary feature (0/1): whether a class extends another class.
5   MethodName            Method name in a class.
6   MethodParam           Method parameters (arguments).
7   MethodReturnType      A method that returns something (void, int etc.) or a method having a return keyword.
8   MethodBodyLineType    Type of code in a method's body, e.g., assignment statements, conditional statements.
9   MethodNumVariables    Number of variables/attributes in a method.
10  MethodNumMethods      Number of method calls in a class.
11  MethodNumLine         Number of lines in a method.
12  MethodIncomingMethod  Number of methods a method calls.
13  MethodIncomingName    Names of methods a method calls.
14  MethodOutgoingMethod  Number of outgoing methods.
15  MethodOutgoingName    Names of outgoing methods.

3.4. Feature-Based Design Pattern Detector

As shown in Figure 2, the DPDF approach has three main steps, namely Preprocessing, Model Building and Machine Learning Classification. The Preprocessing step includes building a call graph and generating the SSLR representation by parsing the Java source code with the code features. Once the code is parsed into the natural language representation (SSLR), we apply the Word2Vec algorithm to the SSLR to generate a Java Embedded Model (JEM). The JEM is used as input to a supervised machine learning classifier trained on design patterns to predict design patterns. In the subsequent sections, we discuss these steps in more detail.

3.4.1. Preprocessing

Translation between source code and natural language is challenging due to the structure of the source code. One simple way to model source code is to just view it as plain text. However, in this way the structural and semantic information is omitted, which may cause inaccuracies in the generated natural language text (Hu et al., 2020). The structural information can be extracted either through Abstract Syntax Trees (ASTs) or Static Call Graphs (SCGs). We decided to use the SCG instead of the AST because the call graph encodes how methods and functions call each other (Musco et al., 2017) and builds a graph (edges and nodes) of the flow of code, showing the relationships between classes and methods in the source code. The AST, on the other hand, represents the hierarchical syntactic structure of the source code sequentially. Furthermore, building a call graph is useful because we need to encode the caller-callee classes and methods in the corpus, which is the basis of generating the natural language text from the source code; simply having a hierarchy is not useful in this case. For these reasons, the SCG helps in extracting the structural, behavioural and creational aspects of design patterns. Thus, we generate a natural language representation of the source code, the SSLR, by parsing the source code with the code features and the SCG to encode the relationships between the code elements, i.e., the classes, methods, interfaces, etc. in the DPDF-Corpus. Figure 1 illustrates a small subset (example) of the SSLR file generated for the abdera project⁵.

Figure 1: An example (subset) of the SSLR file.

Implementation: We have implemented a parser in the Python programming language that parses Java files using the Plyj⁶ library; it uses a call graph generator and a feature extractor to build the SSLR representation of the corpus. Plyj is a Java 7 parser written in Python using Ply⁷. As the DPDF corpus contains files from projects written before 2013, the files are pre-Java 8, so using Plyj fulfils our requirement.

⁵https://github.com/apache/abdera, last accessed on 09-04-2020
⁶https://github.com/musiKk/plyj, last accessed on 24-08-2020
⁷https://github.com/dabeaz/ply, last accessed on 24-08-2020

3.4.2. Model Building

We build a Java embedded n-gram model representation by applying the Word2Vec algorithm (Baroni et al., 2014) to the SSLR representation of the DPDF corpus generated in the preprocessing step. Word2Vec is a method to build distributed representations of words based on their contexts, and it comes in two alternative architectures: the Continuous Bag-of-Words (CBOW) model, which predicts the target word from the source context words, and the Skip-Gram model, which does the inverse and predicts the source context words from the target word. The CBOW model (also called the CBOW architecture) is meant for learning relationships between pairs of words; context is represented by multiple words for a given target word (Mitchell and Lapata, 2008). For example, we could use 'cat' and 'tree' as context words and 'climbed' as the target word.
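A minimal sketch of this step with the Gensim Word2Vec implementation the paper relies on is shown below; the SSLR token streams are invented examples, and apart from the CBOW architecture and the 100-dimensional vectors described below, the hyperparameters are assumptions:

```python
from gensim.models import Word2Vec

# Each "sentence" is the token stream of one SSLR file; these two are invented.
sslr_sentences = [
    ['class', 'ShapeFactory', 'extends', 'AbstractFactory', 'method', 'createShape'],
    ['class', 'Circle', 'implements', 'Shape', 'method', 'draw'],
]

# sg=0 selects the CBOW architecture; vector_size=100 matches the paper's
# 100-dimensional model (the parameter is named `size` in the Gensim 3.8
# release the paper links to). window/min_count are illustrative defaults.
model = Word2Vec(sentences=sslr_sentences, vector_size=100, sg=0,
                 window=5, min_count=1, workers=4)

print(model.wv['createShape'].shape)  # (100,) - one dense vector per n-gram
```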


Figure 2: Design Pattern Detection with Features (DPDF).

Hence, we create a word space model of the n-grams extracted from the content of the Java source code files using the source code features. We treat these files as if they were natural language documents, extracting the n-grams by segmenting the names of the most salient elements, i.e., classes and methods, and considering them as words in the document. We then run the Word2Vec algorithm on the dataset to construct a high-dimensional representation of these n-grams, i.e., each n-gram is paired with a high-dimensional vector in a continuous dense space. The vectors representing the n-grams occurring in a Java class are composed by averaging them, and the resulting vector is concatenated to the features extracted from the Java class with the previous methods.

We use the CBOW architecture, which is meant for learning relationships between pairs of words, i.e., mainly classes and methods in our case. We set a matrix size of 100 to build a 100-dimensional embedding model, which is trained with Word2Vec, resulting in a vector representation of each n-gram in our collection. The vector representation of a Java file is therefore a function of the vector representations of its n-grams. Inspired by Mitchell and Lapata (2008), we produce embeddings for the Java files as a uniform linear combination of the embeddings of their constituent n-grams using Word2Vec.

In this 100-dimensional vector representation, each instance is associated with its project id, the class name, a 100-dimensional feature vector derived from the word embedding model built from the n-grams in the Java file, and a design pattern label, where the design pattern is the target label to predict. These word embeddings provide a compact yet expressive feature representation for the Java classes in a project's source code. Despite being written in a programming language, and therefore a formal one as opposed to natural language, several elements of the source code are natural language, including the class name, the names of its methods and variables, and the comments.

In short, we extract a bag-of-words, i.e., an unordered set of character sequences, from the Java source code. This method uses the camel-case notation and other features to split the names of methods and arguments, therefore providing as output a lexicon of basic language elements associated with the input Java source code.

Implementation: For model generation we use the Word2Vec implementation provided by Řehůřek and Sojka (2010) in the form of the Python Gensim library⁸. The generated SSLR representation is passed as input, and a Java Embedded Model is the output of this step.

⁸https://radimrehurek.com/gensim_3.8.3/index.html

3.4.3. Machine Learning Classification

By parsing the files in the labelled corpus, we are able to build a large dataset of these files paired with the bags of n-grams relevant to each Java file. This structure is comparable to a natural language corpus, drawing parallels between Java source code and a text document, and between source code n-grams and natural language words. The augmented feature representation (Section 3.4.2) is used to train our DPDF classifier.

The DPDF classifier is an ensemble classifier that uses randomised decision trees as its base estimator. It implements a meta-estimator that fits a number of randomised decision trees on various sub-samples of the dataset and uses ensembling to improve the predictive accuracy and control overfitting. As we perform multi-class classification, we use SAMME.R as the boosting algorithm (Hastie et al., 2009).

Table 4
The DPDF classifier's learning parameters

Parameters         Values
Base Estimator     Random Forest
No. of Estimators  100
Learning Rate      1
Algorithm          SAMME.R

Implementation: We use Python's Scikit-learn library (Pedregosa et al., 2011) to build our classifier; the efficacy of the classification is tested using standard statistical measures, as discussed in Section 4.1. Table 4 provides a brief summary of our classifier's learning parameters.
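Table 4's configuration corresponds roughly to the following scikit-learn construction; this is our sketch, with synthetic stand-in data, and the settings of the Random Forest base estimator itself are assumptions, since the paper does not report them:

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier

# Table 4: Random Forest base estimator, 100 estimators, learning rate 1, SAMME.R.
# (scikit-learn >= 1.2 renames `base_estimator` to `estimator`.)
clf = AdaBoostClassifier(
    base_estimator=RandomForestClassifier(n_estimators=10),  # forest size: our assumption
    n_estimators=100,
    learning_rate=1.0,
    algorithm='SAMME.R',
)

# Stand-ins for the real inputs: 100-dimensional file embeddings and pattern labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 100))
y = rng.integers(0, 3, size=60)  # three pattern classes, purely illustrative

clf.fit(X, y)
print(clf.predict(X[:5]))
```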
Cross-Validation: Cross-validation is a resampling procedure used to evaluate machine learning models on a data sample.


Figure 3: Stratified K-Fold Cross-Validation

To evaluate our machine learning model, we used the Stratified K-Fold Cross-Validation (SK-Fold CV) procedure. K-fold cross-validation involves randomly dividing the set of observations into K groups, or folds, of approximately equal size (James et al., 2013). Typically, the first fold is treated as a validation set, and the method is fit on the remaining K-1 folds. A value of K=10 is the recommended value for k-fold cross-validation in the field of applied machine learning (Kuhn and Johnson, 2013); thus, we used 10 folds per cross-validation to validate our machine learning model. As the sample size varies across the design patterns in our corpus, it is important that each fold contains the same percentage of each design pattern's instances to obtain a fair prediction. Stratification is a variation of traditional K-Fold CV, defined by Kuhn and Johnson (2013) as 'the splitting of data into folds ensuring that each fold has the same proportion of observations with a given categorical value', that is, pattern instances in our model. Figure 3 illustrates our K-fold stratification procedure using 90/10 train-test splits on the DPDF corpus.

Implementation: Our classifier uses the sklearn StratifiedKFold⁹ implementation of cross-validation.

Resolving Class Imbalance: An imbalanced dataset is one where the number of data points per class differs, resulting in a heavily biased machine learning model that is not able to learn the minority class. When the imbalance ratio is not heavily skewed toward one class, many machine learning models can handle it. Looking at the distribution of design pattern instances in our DPDF corpus in Section 3.2.1, the differences are minimal and our DPDF machine learning classifier handled them well. However, to reduce the bias further and improve the results, we applied a data oversampling technique using the SMOTE algorithm. SMOTE is an oversampling algorithm that relies on the concept of nearest neighbours to create synthetic data (Chawla et al., 2002).

Implementation: Python's imblearn.over_sampling.SMOTE¹⁰ implementation is used to apply SMOTE in our classifier.

⁹https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html
¹⁰https://imbalanced-learn.readthedocs.io/en/stable/generated/imblearn.over_sampling.SMOTE.html
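The stratified cross-validation and SMOTE oversampling steps combine roughly as in the sketch below (our illustration on synthetic data; here SMOTE is applied to each training fold):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(0)
X = rng.normal(size=(130, 100))     # stand-in for the 100-dimensional file embeddings
y = np.array([0] * 100 + [1] * 30)  # deliberately imbalanced toy labels

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(X, y):
    # Each fold preserves the class proportions of the full dataset.
    X_train, y_train = X[train_idx], y[train_idx]
    # SMOTE synthesises minority-class points from nearest neighbours.
    X_res, y_res = SMOTE(random_state=0).fit_resample(X_train, y_train)
    # ... fit the classifier on (X_res, y_res) and evaluate on X[test_idx] ...
```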


Figure 4: Precision & Recall for the DPDF classifier.

4. Results and Evaluations

This section presents the results of our study. Our answers to each research question are presented, evaluated and supported by our data and interpretation.

4.1. Evaluation Criteria

We report Precision, Recall and F1-Score, the standard measures used to statistically evaluate the efficacy of classifiers. These measures are defined as follows.

Precision (P): Precision is defined as the fraction of instances of a classification that are correct, calculated as in equation (1), where TP stands for true positives and FP stands for false positives:

    P = TP / (TP + FP)    (1)

Recall (R): Recall is defined as the proportion of actual instances of a classification that are classified as such by the classifier, as shown in equation (2), where FN stands for false negatives:

    R = TP / (TP + FN)    (2)

F1-Score: The F1-Score (F1-S) is the harmonic mean of P and R, computed as follows:

    F1-S = 2 * (P * R) / (P + R)    (3)

We use these measures in the subsequent sections to discuss our research questions.
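Equations (1)-(3) reduce to a few lines of code; the sketch below restates them with made-up counts for a single pattern label:

```python
def precision_recall_f1(tp, fp, fn):
    """Equations (1)-(3): Precision, Recall and their harmonic mean, the F1-Score."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)

# Illustrative counts for one pattern label (not the study's exact figures).
print(precision_recall_f1(tp=96, fp=3, fn=8))
```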


Table 5
Precision, Recall and F1-Score values for every label returned by the DPDF classifier.

Design Patterns  Precision (%)  Recall (%)  F1-Score (%)
Adapter          71.45          55.45       61.65
Builder          70.21          81.63       75.30
Decorator        89.94          73.18       80.02
Facade           69.26          73.36       69.77
Factory          57.04          52.57       53.92
Memento          89.00          83.88       85.79
Observer         76.98          90.54       82.23
Prototype        82.05          69.00       73.73
Proxy            56.00          68.18       60.58
Singleton        72.79          54.33       60.16
Visitor          96.93          92.36       94.48
Wrapper          72.93          63.18       67.16
None             59.73          72.69       64.67

RQ1: Is the proposed approach effective in detecting software design patterns?

Since the initialisation of the classifier is random, the results vary slightly at each run, although the differences are insignificant: there are no striking differences in performance, and the overall performance is very similar for each iteration. Therefore, we calculate the weighted average of each run to determine the final Precision, Recall and F1-Score for each label. Using the DPDF corpus, the DPDF classifier is able to predict most of the labels accurately, reaching a Precision of more than 74%. Figure 4 and Table 5 show the results obtained from the DPDF classifier, broken down by label.

As is evident from the results, most of the patterns are easily recognised by the DPDF classifier, with Visitor having the highest precision of approximately 97%. Some patterns, such as Memento and Decorator, are very well recognised, with approximately 89% precision. Interestingly, Factory and Proxy are poorly classified (in the mid-fifties with respect to precision). Adapter and Wrapper are identified equally well, with almost the same level of accuracy; this is perhaps explicable by the fact that the Adapter and Wrapper patterns quite closely resemble each other, which may mean that some instances are mislabelled in the training set, and also that it is relatively difficult to distinguish them in some cases. We investigate the reasons for this further in the second research question. In conclusion, the answer to the first research question is as follows:

DPDF is effective in detecting design patterns, with an average Precision and Recall of over 74% and 71.5% respectively.

RQ2: What is the misclassification rate of DPDF?

Misclassification, also known as the error rate, is the ratio of how often the classifier is wrong, i.e., predicts instances incorrectly. It is investigated by computing the confusion matrix for DPDF, shown in Figure 5. The results show that DPDF performs very well in detecting the absence of a design pattern, i.e., 'None': 88 out of 122 instances are true values. Similarly, 96 out of 104 instances of Visitor and 94 out of 102 instances of Observer are correctly identified by DPDF. The error rate for Singleton and Factory is slightly higher than for the other patterns, as only about half of these instances are correctly predicted by the classifier. Overall, the misclassification rate is less than 26%, and DPDF has correctly identified most of the instances of the design patterns with high Precision. Hence, in conclusion:

The error rate of DPDF is low, at approximately 26% on average.
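The per-pattern counts above are read off a confusion matrix, which can be computed as in the following sketch (the label arrays are placeholders, not the study's data):

```python
from sklearn.metrics import confusion_matrix

# Placeholder ground-truth and predicted labels for three of the classes.
y_true = ['Visitor', 'Visitor', 'None', 'Observer', 'None', 'Visitor']
y_pred = ['Visitor', 'Observer', 'None', 'Observer', 'Visitor', 'Visitor']

labels = ['None', 'Observer', 'Visitor']
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)  # cm[i, j]: instances of true class i predicted as class j

# The misclassification (error) rate is the off-diagonal mass: 1 - accuracy.
print(1 - cm.trace() / cm.sum())
```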
RQ3: How well does our approach perform compared to existing approaches?

Due to the lack of a publicly available standard benchmark, evaluating and validating the accuracy of a classifier is difficult: existing approaches either do not share their corpus, or their source code is publicly unavailable for replicating the results.

Selection Criteria: Under these limitations, we select two studies from the literature that are relevant to our study. The relevance criterion is that they utilise code features, machine learning models, or a combination of both, to predict design patterns. Though they have not shared their implementations for regenerating the results, their approaches are clearly described, which helped us regenerate the results for comparison purposes.

Benchmark DP detection approaches: In this study, an evaluation of performance is presented by comparing the results of the proposed approach with the existing approaches developed by Thaller et al. (2019) and Fontana et al. (2011). Both studies utilised low-level code features with machine learning models to predict design patterns; thus, they fulfil the selection criteria. Unfortunately, since the source code and (part of) the corpus are not publicly available, we replicate the results using the information provided in the studies.

Comparison Strategy: We apply the following strategy to compare our approach with the state-of-the-art:

1. We apply our approach to the corpus provided by the studies and compare results. This corpus is referred to as the Benchmark corpus.
2. We apply the selected studies to our labelled corpus and compare results. We refer to this corpus as the DPDF Corpus.

Labelling the P-MART Corpus: We selected in total 260 files from the P-MART corpus and labelled them using the same approach we used for labelling the DPDF-Corpus. Table 7 lists the instances of the design patterns in the labelled P-MART corpus. The red-coloured patterns are not identified by the classifier.

Discussion

We employ two state-of-the-art DP detection methods for comparison with the proposed method.

FeatureMaps: Thaller et al. (2019) apply feature maps as input to random forests and convolutional neural networks (LeCun et al., 1998) to determine whether a given set of classes implements a particular pattern role. Their features are high-level conceptual classes such as micro-structures. We compare our DPDF classifier's results with their RF classifier only. Though they only identified the Decorator and Singleton patterns, we replicate their study with all patterns labelled in the benchmark corpus, creating feature maps for all patterns as per the criteria discussed in Section 3.2.2.


Figure 5: Confusion Matrix of the DPDF classifier.

Table 6
Comparison of DPDF with the state-of-the-art. The best results are presented in bold font. Precision (P), Recall (R) and F1-Score (F1-S) values of the benchmark approaches are generated for both the benchmark corpus and our DPDF corpus. The benchmark corpus contained only some of the patterns, so results are reported for the patterns given below only.

Corpus: Labelled P-MART

Design Patterns   FeatureMap                MARPLE-DPD                DPDF
                  P (%)  R (%)  F1-S (%)   P (%)  R (%)  F1-S (%)   P (%)  R (%)  F1-S (%)
Adapter           15     20     17.14      78.14  75.62  76.86      83.33  83.33  83.33
Builder           55     45     49.5       53.45  48.8   51.02      65.66  71.66  68.53
Decorator         13.22  12.8   13.01      54.18  66     59.51      96.66  86.9   91.35
Factory           50.23  40.35  44.75      78.23  80.1   79.15      93.33  75     83.17
Observer          46.12  44.12  45.1       57.21  55.23  56.20      71.78  76.66  74.14
Singleton         63     59     60.93      74.23  70.18  72.15      23.33  40.00  29.47
Visitor           30.3   35.3   32.61      45.74  50.25  47.89      92.5   90     91.23
None              57.23  70.02  62.98      51.3   51.67  61.48      71.83  90     91.23
Overall           41.26  40.82  41.04      61.56  62.23  61.89      74.8   76.66  75.72

Corpus: DPDF-Corpus

Design Patterns   FeatureMap                MARPLE-DPD                DPDF
                  P (%)  R (%)  F1-S (%)   P (%)  R (%)  F1-S (%)   P (%)  R (%)  F1-S (%)
Adapter           35     31.5   33.16      85.16  78.25  81.56      71.45  55.45  62.44
Builder           62.2   60.1   61.13      58.52  51.23  54.63      70.21  81.63  75.49
Decorator         21.28  24.5   22.78      60.15  58.23  59.17      90     73     80.61
Factory           61.3   50.45  55.35      82.15  80.8   81.47      69.26  73.36  71.25
Observer          50.1   47.65  48.84      53.25  48.26  50.63      76.98  90.54  83.21
Singleton         65     67     65.98      74.24  69.23  71.65      72.79  54.33  62.22
Visitor           55.1   80.1   65.29      60.1   66.25  63.03      96.93  92.36  94.59
None              65.25  79     71.47      51.3   56.24  53.67      59.73  72.69  65.58
Overall           51.9   55.04  53.42      65.61  63.56  64.57      75.92  74.17  75.03


As shown in Table 6, our DPDF classifier outperforms the RF classifier in FeatureMaps in terms of Precision, with approximately 30% improvement on the benchmark corpus and approximately 22% on the DPDF corpus.

MARPLE-DPD: In this study, Fontana et al. (2011) applied basic elements and metrics to mechanically extract design patterns from the source code, and the work was further extended in (Zanoni et al., 2015). These approaches use different classifiers for the identification of design patterns; thus, we replicate their study only with the RF classifier and compare the results with the DPDF classifier. We calculate the true and false positives for each pattern and compute the weighted Precision and Recall for both classifiers. The results in Table 6 show that, overall, our approach works far better than MARPLE-DPD in terms of Precision, Recall and F1-Score on both corpora.

To avoid the class imbalance problem discussed in Section 3.2.1, we ensure that the labelled benchmark corpus has at least 30 instances for each pattern, which resulted in seven patterns being identified in the end from the benchmark corpus. The Facade, Memento and Prototype patterns are missed because their labelled instances (after removing duplicates) number 10 or fewer; thus, they cannot be cross-validated while training the classifiers. In the P-MART dataset many instances are labelled as both Decorators and Prototypes, so we keep only the Decorator in the final labelled benchmark corpus and removed the Prototypes. Both MARPLE-DPD and FeatureMaps identified Singleton better than DPDF on the labelled P-MART corpus; this is because very few instances of Singleton are found in the corpus, so our classifier is not fully trained on them. We also generated results by setting the threshold value to 20, and the results were quite poor. Thus, the larger the dataset, the better the classifier is trained and the more accurate the generated results.

In summary, FeatureMaps works poorly on both corpora, whereas MARPLE-DPD identifies most of the patterns reasonably well, with over 61% Precision on the benchmark corpus and 66% on the DPDF corpus; nevertheless, it still lags behind our approach by approximately 10% on the benchmark corpus and approximately 8% on our corpus in terms of Precision.

Table 7
Labelled instances of the benchmark corpus trained by the benchmark classifiers. The red-coloured instances are not identified by the DPDF classifier.

Patterns    #    Patterns    #
Adapter     30   Observer    30
Builder     30   Prototype   26
Decorator   23   Proxy       0
Factory     30   Singleton   12
Facade      9    Visitor     30
Memento     10   None        30

Empirical results show that the DPDF classifier not only outperforms existing classifiers in terms of Precision, Recall and F1-Score, it also identifies more instances of the 12 design patterns compared to existing studies.

5. Threats to validity

We have identified the following threats to the validity of our study that may affect our conclusions.

Bugs: It is possible that there are bugs in the processing code that affect the SSLR file generation and the training data, as well as bugs in the Word2Vec model and the classifiers we have used, though this is less likely to be a problem given the wide use these tools have already received. We have spent some time debugging our code and found and fixed minor issues, which tended to improve the classification accuracy, so it is more likely than not that bugs will lead to lower Precision and/or Recall than would otherwise be the case.

Data Labelling: Though we labelled the data using an online tool, there is a possibility of disagreement between the labellers, which may lead to incorrect labelling of the data. To mitigate this issue, we hire at least three labellers to label the corpus, as discussed in Section 3.2.1. This process has substantially reduced the disagreement among raters, as shown by the high kappa score. To further reduce the disagreement, we intend to hire more labellers in future.

Diversity of Reference Set: The reference set may not reflect the totality of Java source code that may be of interest to developers, and it is unknown whether the classifier will work effectively on source code of interest outside the reference set. Obviously, the best way to mitigate this risk is to further increase the size and diversity of the reference set, which we have identified as future work.

Data Imbalance: As discussed earlier, our dataset is partly imbalanced. This was primarily due to the public corpus we selected and our aim of achieving a bigger dataset for experimentation and clear predictions. We have reduced this threat by applying the SMOTE algorithm while training the model. In future, we plan to add more projects and balance the number of instances.

6. Related Work

In the past years, with the growing amount of electronically available information, there has been substantial interest and a substantial body of work among software engineers and academic researchers in design pattern detection.

A majority of the approaches to the detection of design patterns transform the source code and design patterns into intermediate representations such as rules, models, graphs, productions and languages (Yu et al., 2018). For example, Bernardi et al. (2013) exploited a meta-model which contains a set of properties characterising the structures and behaviours of the source code and design patterns.

6. Related Work

During the past years, with the growing amount of electronically available information, there has been substantial interest and a substantial body of work among software engineers and academic researchers in design pattern detection.

A majority of the approaches to design pattern detection transform the source code and design patterns into some intermediate representation, such as rules, models, graphs, productions, and languages (Yu et al., 2018). For example, (Bernardi et al., 2013) exploited a meta-model which contains a set of properties characterising the structures and behaviours of the source code and design patterns. After the modelling step, a matching algorithm is performed to identify the implemented patterns: during the detection phase, each pattern instance is mapped to its corresponding design pattern graph. (Alnusair et al., 2014) employed semantic rules to capture the structures and behaviours of the design patterns, based on which hidden design patterns in open-source libraries are discovered. Recently, (Xiong and Li, 2019) applied an ontology-based parser with idiomatic phrases to identify design patterns, achieving high Precision.

A good number of studies developed tools that use source code or an intermediate representation to identify patterns, as well as machine learning models to predict patterns. (Lucia et al., 2011), (Tosi et al., 2009) and (Zhang and Liu, 2013) used static and dynamic analysis (or a combination of both) to develop Eclipse plugins that detect design patterns from source code. (Moreno and Marcus, 2012) developed a tool, JStereoCode, for detecting low-level patterns (classes, interfaces, etc.) to find the design intent of the source code. (Zanoni et al., 2015) exploited a combination of graph matching and machine learning techniques, implemented in a tool called MARPLE-DPD. (Niere et al., 2002) designed the FUJABA Tool Suite, which provides developers with support for detecting design patterns (including their variants) and smells. (Hautamäki, 2005) used a pattern-based solution and tool to teach software developers how to use development solutions in a project.

Some techniques, including tool generation techniques, applied reverse engineering to identify design patterns from source code and UML artefacts (or other intermediate representations). For instance, (Lucia et al., 2011) built a tool using static analysis and, in the same study, also applied reverse engineering through visual parsing of diagrams. Other studies, such as (Thongrak and Vatanawood, 2014; Panich and Vatanawood, 2016; Shi and Olsson, 2006), also applied reverse engineering techniques on UML class and sequence diagrams to extract design patterns using ontologies. In another study, (Couto et al., 2012) applied a pattern-classification reverse engineering approach on Java source code, creating UML diagrams that indicate whether or not each GoF pattern (Gamma et al., 1995) is present, and (Brown, 1996) proposed a method for automatic detection of design patterns by reverse-engineering the code.

Very few studies utilised code metrics or variants in the identification of design patterns. (Uchiyama et al., 2011) presented a pattern detection approach using software metrics and machine learning: they identified candidates for the roles that compose the design patterns by combining machine learning and software metrics. The approach of (Lanza and Marinescu, 2007) learns from the information extracted from design pattern instances, which normally include variant implementations, such as the number of accessor methods. (Fontana et al., 2011) introduced micro-structures that are regarded as the building blocks of design patterns, and evaluated whether detecting these building blocks is relevant for detecting occurrences of design patterns. (Thaller et al., 2019) built a feature map for pattern instances using neural networks.

Several other approaches exploit machine learning techniques to address the issue of pattern variants. For example, (Chihada et al., 2015) mapped the design pattern detection problem into a learning problem. Their proposed detector is based on learning from the information extracted from design pattern instances, which normally include variant implementations. (Ferenc et al., 2005) applied machine learning to filter false positives out of the results of a graph matching phase, thus providing better precision in the overall output while considering variants. A recent work by (Hussain et al., 2018) leverages deep learning algorithms for the organisation and selection of design patterns based on text categorisation. To reduce the size of training examples for design pattern detection, a clustering algorithm based on decision tree learning was proposed by (Dong et al., 2008). However, the applicability of machine learning approaches suffers from the drawbacks and limitations of collecting datasets and training machine learning models (Dong et al., 2008). Hence, we built a large dataset, which is publicly available, and trained a machine learning model with high accuracy.

Though we have used machine learning based classifiers to test the efficacy of our approach, our approach is substantially different from the aforementioned studies in many ways. We summarise the differences with the benchmark studies as follows:

• We use 15 source code features, whereas the benchmark studies of (Zanoni et al., 2015) used code metrics and (Thaller et al., 2019) used feature maps.

• Our corpus is large, comprising 1,400 files extracted from around 100 projects from the GJC, whereas (Zanoni et al., 2015) and (Thaller et al., 2019) used the P-MART corpus of 1,039 files, in which the design patterns we intend to identify occur in a highly imbalanced manner.

• Our machine learning classifier achieved approximately 74% Precision, whereas the existing studies achieved 34% and 63% Precision, respectively.

• Our classifier successfully identified 12 design patterns, whereas the existing studies recognised only two and six design patterns, respectively.

7. Conclusion & Future Work

In this paper, we introduce a novel approach to detecting software design patterns using machine learning methods trained on source code features extracted from Java source code. We build an SSLR representation by applying a call graph and 15 source code features to the DPDF corpus extracted from the publicly available 'The Java Github Corpus'. Next, we construct a Java word-space model by applying the Word2Vec algorithm on the SSLR file. Finally, we train a supervised machine learning classifier, DPDF, to identify design patterns.
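As a rough end-to-end illustration of this pipeline, the minimal sketch below builds a small word-space model from SSLR-style token sequences and trains a classifier on the resulting vectors. It assumes gensim's Word2Vec and a scikit-learn Random Forest; the toy token sequences, the pattern labels, and the per-file averaging of token vectors are hypothetical placeholders, not our exact SSLR encoding.

```python
import numpy as np
from gensim.models import Word2Vec
from sklearn.ensemble import RandomForestClassifier

# Toy SSLR-style token sequences, one per Java file (the real SSLR
# encodes 15 structural/lexical features plus call-graph relations).
sslr_sentences = [
    ["class", "Logger", "private", "constructor", "static", "getInstance"],
    ["class", "ShapeFactory", "method", "createShape", "returns", "Shape"],
    ["class", "LegacyAdapter", "implements", "Target", "wraps", "Adaptee"],
]
labels = ["singleton", "factory", "adapter"]  # illustrative pattern labels

# Word-space model over the SSLR tokens (skip-gram Word2Vec).
w2v = Word2Vec(sslr_sentences, vector_size=50, window=5,
               min_count=1, sg=1, seed=42)

def file_vector(tokens, model):
    """Represent a file as the mean of its token embeddings."""
    return np.mean([model.wv[t] for t in tokens if t in model.wv], axis=0)

X = np.stack([file_vector(s, w2v) for s in sslr_sentences])

# Supervised classifier over the word-space vectors.
clf = RandomForestClassifier(random_state=42).fit(X, labels)
print(clf.predict(X[:1]))  # e.g. ['singleton']
```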


To statistically evaluate the efficacy of the proposed approach, we apply four commonly used statistical measures, namely Precision, Recall, F1-Score, and error rate, and we also build a confusion matrix to expose any potential class imbalance problem. Empirical results show that our proposed approach, DPDF, detects software design patterns with approximately 74% Precision. Comparing our approach with the existing studies reveals that it performs significantly better than the two existing approaches, outperforming them by 30% and 10% respectively in terms of Precision, and that it identifies more patterns than the existing studies.

In future, we plan to extend our online tool CodeLabeller using crowdsourcing to obtain a larger reference set to train the model, as well as to automatically predict the patterns, which should improve the Precision score of the classifier. Furthermore, this will provide additional assurance that the model is robust enough to cope with diverse coding styles. We also plan to investigate whether other useful information for software maintainers can be extracted using our methods. While our classifier's performance is promising, further improvements are clearly desirable for practical use. While increasing the size of the training set will almost certainly improve accuracy, it is likely that adding further code features to the SSLR files would also help, and this could be targeted at improving performance on those patterns where accuracy is relatively low.
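This evaluation is straightforward to reproduce with standard tooling. The following minimal sketch, assuming scikit-learn and placeholder label arrays, computes the four measures and the confusion matrix; macro-averaging is one common choice for multi-class pattern detection, though the exact averaging used in a given study may differ.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)

# Placeholder gold labels and classifier predictions for three patterns.
y_true = ["singleton", "factory", "adapter", "factory", "singleton"]
y_pred = ["singleton", "factory", "factory", "factory", "adapter"]

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)
error_rate = 1 - accuracy_score(y_true, y_pred)

print(f"Precision={precision:.2f} Recall={recall:.2f} "
      f"F1={f1:.2f} ErrorRate={error_rate:.2f}")

# Per-pattern confusion matrix exposes which patterns are confused.
print(confusion_matrix(y_true, y_pred,
                       labels=["adapter", "factory", "singleton"]))
```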
A. Reproducibility

For the purpose of reproducibility of our results, we have released our complete implementation, together with the annotated reference set and the result files, to the public as an open-source project via our online appendix at https://figshare.com/s/6d04193f45d724327e43.

Acknowledgement

We would like to extend our gratitude to the individuals who dedicated their time and effort to annotating design patterns through the online tool. We would also like to thank all anonymous reviewers for their valuable comments.

References

Allamanis, M., Sutton, C., 2013. Mining source code repositories at massive scale using language modeling, in: The 10th Working Conference on Mining Software Repositories, IEEE. pp. 207–216.
Alnusair, A., Zhao, T., Yan, G., 2014. Rule-based detection of design patterns in program code. International Journal on Software Tools for Technology Transfer 16, 315–334.
Baroni, M., Dinu, G., Kruszewski, G., 2014. Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors, in: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL), pp. 238–247.
Bernardi, M.L., Cimitile, M., Di Lucca, G.A., 2013. A model-driven graph-matching approach for design pattern detection, in: 2013 20th Working Conference on Reverse Engineering (WCRE), pp. 172–181.
Bishop, C.M., 2006. Pattern Recognition and Machine Learning. Springer.
Brown, K., 1996. Design reverse-engineering and automated design-pattern detection in Smalltalk. Technical Report. North Carolina State University, Dept. of Computer Science.
Carletta, J., 1996. Assessing agreement on classification tasks: the kappa statistic. Computational Linguistics 22, 249–254.
Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P., 2002. SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16, 321–357.
Chihada, A., Jalili, S., Hasheminejad, S.M.H., Zangooei, M.H., 2015. Source code and design conformance, design pattern detection from source code by classification approach. Applied Soft Computing 26, 357–367.
Christopoulou, A., Giakoumakis, E.A., Zafeiris, V.E., Soukara, V., 2012. Automated refactoring to the strategy design pattern. Information and Software Technology 54, 1202–1214.
Ciobanu, A.M., Dinu, A., 2013. Alternative measures of word relatedness in distributional semantics, in: Joint Symposium on Semantic Processing, p. 80.
Cohen, J., 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement 20, 37–46.
Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P., 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research 12, 2493–2537.
Cortes, C., Vapnik, V., 1995. Support-vector networks. Machine Learning 20, 273–297.
Couto, R., Ribeiro, A.N., Campos, J.C., 2012. A patterns based reverse engineering approach for Java source code, in: Software Engineering Workshop (SEW), 2012 35th Annual IEEE, IEEE. pp. 140–147.
Detten, M.V., Meyer, M., Travkin, D., 2010. Reverse engineering with the Reclipse tool suite, in: Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering - Volume 2, pp. 299–300.
Dong, J., Sun, Y., Zhao, Y., 2008. Compound record clustering algorithm for design pattern detection by decision tree learning, in: 2008 IEEE International Conference on Information Reuse and Integration, IEEE. pp. 226–231.
Ferenc, R., Beszedes, A., Fulop, L., Lele, J., 2005. Design pattern mining enhanced by machine learning, in: 21st IEEE International Conference on Software Maintenance (ICSM'05), IEEE. pp. 295–304.
Fontana, F.A., Maggioni, S., Raibulet, C., 2011. Understanding the relevance of micro-structures for design patterns detection. Journal of Systems and Software 84, 2334–2347.
Gaitani, M.A.G., Zafeiris, V.E., Diamantidis, N., Giakoumakis, E.A., 2015. Automated refactoring to the null object design pattern. Information and Software Technology 59, 33–52.
Gamma, E., Helm, R., Johnson, R., Vlissides, J., 1995. Design Patterns: Elements of Reusable Object-oriented Software. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA.
Harris, Z.S., 1954. Distributional structure. Word 10, 146–162.
Hastie, T., Rosset, S., Zhu, J., Zou, H., 2009. Multi-class AdaBoost. Statistics and its Interface 2, 349–360.
Hautamäki, J., 2005. Pattern-based tool support for frameworks: Towards architecture-oriented environment.
Ho, T.K., 1995. Random decision forests, in: Proceedings of the Third International Conference on Document Analysis and Recognition (Volume 1), pp. 278–.
Hu, X., Li, G., Xia, X., Lo, D., Jin, Z., 2018. Deep code comment generation, in: 2018 IEEE/ACM 26th International Conference on Program Comprehension (ICPC), IEEE. pp. 200–210.
Hu, X., Li, G., Xia, X., Lo, D., Jin, Z., 2020. Deep code comment generation with hybrid lexical and syntactical information. Empirical Software Engineering 25, 2179–2217.
Hussain, S., Keung, J., Khan, A.A., Ahmad, A., Cuomo, S., Piccialli, F., Jeon, G., Akhunzada, A., 2018. Implications of deep learning for the automation of design patterns organization. Journal of Parallel and Distributed Computing 117, 256–266.
James, G., Witten, D., Hastie, T., Tibshirani, R., 2013. An Introduction to Statistical Learning. Volume 112. Springer.
Kuchana, P., 2004. Software Architecture Design Patterns in Java. CRC Press.


Kuhn, M., Johnson, K., 2013. Applied Predictive Modeling. Volume 26. Springer.
Landauer, T.K., Dumais, S.T., 1997. A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review 104, 211–240. DOI: 10.1037/0033-295X.104.2.211.
Lanza, M., Marinescu, R., 2007. Object-Oriented Metrics in Practice: Using Software Metrics to Characterize, Evaluate, and Improve the Design of Object-Oriented Systems. Springer Science & Business Media.
Le, Q.V., Mikolov, T., 2014. Distributed representations of sentences and documents, in: Proceedings of the 31st International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, pp. 1188–1196.
LeCun, Y., Bottou, L., Bengio, Y., Haffner, P., et al., 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86, 2278–2324.
Lucia, A.D., Deufemia, V., Gravino, C., Risi, M., Tortora, G., 2011. An Eclipse plug-in for the identification of design pattern variants, in: Sixth Workshop of the Italian Eclipse Community (Eclipse-IT 2011), Milan, Italy, pp. 40–51.
Manning, C.D., Raghavan, P., Schütze, H., 2008. Introduction to Information Retrieval. Cambridge University Press, USA.
Mayvan, B.B., Rasoolzadegan, A., 2017. Design pattern detection based on the graph theory. Knowledge-Based Systems 120, 211–225.
McBurney, P.W., Jiang, S., Kessentini, M., Kraft, N.A., Armaly, A., Mkaouer, M.W., McMillan, C., 2018. Towards prioritizing documentation effort. IEEE Transactions on Software Engineering 44, 897–913.
McBurney, P.W., McMillan, C., 2015. Automatic source code summarization of context for Java methods. IEEE Transactions on Software Engineering 42, 103–119.
McBurney, P.W., McMillan, C., 2016. An empirical study of the textual similarity between source code and source code summaries. Empirical Software Engineering 21, 17–42.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J., 2013. Distributed representations of words and phrases and their compositionality, in: Advances in Neural Information Processing Systems, pp. 3111–3119.
Mitchell, J., Lapata, M., 2008. Vector-based models of semantic composition, in: Proceedings of ACL-08: HLT, pp. 236–244.
Mitchell, T., 1997. Machine Learning. McGraw-Hill International Editions, McGraw-Hill.
Moreno, L., Marcus, A., 2012. JStereoCode: automatically identifying method and class stereotypes in Java code, in: Proceedings of the 27th IEEE/ACM International Conference on Automated Software Engineering, pp. 358–361.
Moreno, L., Marcus, A., Pollock, L., Vijay-Shanker, K., 2013. JSummarizer: An automatic generator of natural language summaries for Java classes, in: 2013 21st International Conference on Program Comprehension (ICPC), IEEE. pp. 230–232.
Musco, V., Monperrus, M., Preux, P., 2017. A large-scale study of call graph-based impact prediction using mutation testing. Software Quality Journal 25, 921–950.
Nazar, N., Jiang, H., Gao, G., Zhang, T., Li, X., Ren, Z., 2016. Source code fragment summarization with small-scale crowdsourcing based features. Frontiers of Computer Science 10, 504–517.
Niere, J., Schäfer, W., Wadsack, J.P., Wendehals, L., Welsh, J., 2002. Towards pattern-based design recovery, in: Proceedings of the 24th International Conference on Software Engineering, ACM. pp. 338–348.
Panich, A., Vatanawood, W., 2016. Detection of design patterns from class diagram and sequence diagrams using ontology, in: Computer and Information Science (ICIS), 2016 IEEE/ACIS 15th International Conference on, IEEE. pp. 1–6.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E., 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12, 2825–2830.
Prechelt, L., Unger-Lamprecht, B., Philippsen, M., Tichy, W.F., 2002. Two controlled experiments assessing the usefulness of design pattern documentation in program maintenance. IEEE Transactions on Software Engineering 28, 595–606.
Radinsky, K., Agichtein, E., Gabrilovich, E., Markovitch, S., 2011. A word at a time: computing word relatedness using temporal semantic analysis, in: Proceedings of the 20th International Conference on World Wide Web, WWW 2011, Hyderabad, India, March 28 - April 1, 2011, pp. 337–346.
Řehůřek, R., Sojka, P., 2010. Software framework for topic modelling with large corpora, in: Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pp. 45–50.
Reisinger, J., Mooney, R.J., 2010. Multi-prototype vector-space models of word meaning, in: Human Language Technologies: Conference of the North American Chapter of the Association of Computational Linguistics, Proceedings, June 2-4, 2010, Los Angeles, California, USA, pp. 109–117.
Salton, G., Wong, A., Yang, C.S., 1975. A vector space model for automatic indexing. Communications of the ACM 18, 613–620.
dos Santos, C.N., Zadrozny, B., 2014. Learning character-level representations for part-of-speech tagging, in: Proceedings of the 31st International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, pp. 1818–1826.
Scanniello, G., Gravino, C., Risi, M., Tortora, G., Dodero, G., 2015. Documenting design-pattern instances: A family of experiments on source-code comprehensibility. ACM Transactions on Software Engineering and Methodology (TOSEM) 24, 1–35.
Shi, N., Olsson, R.A., 2006. Reverse engineering of design patterns from Java source code, in: Automated Software Engineering, 2006. ASE'06. 21st IEEE/ACM International Conference on, IEEE. pp. 123–134.
Thaller, H., Linsbauer, L., Egyed, A., 2019. Feature maps: A comprehensible software representation for design pattern detection, in: 2019 IEEE 26th International Conference on Software Analysis, Evolution and Reengineering (SANER), IEEE. pp. 207–217.
Thongrak, M., Vatanawood, W., 2014. Detection of design pattern in class diagram using ontology, in: Computer Science and Engineering Conference (ICSEC), 2014 International, IEEE. pp. 97–102.
Tosi, C., Zanoni, M., Maggioni, S., 2009. A design pattern detection plugin for Eclipse. Eclipse-IT 2009, 89.
Uchiyama, S., Washizaki, H., Fukazawa, Y., Kubo, A., 2011. Design pattern detection using software metrics and machine learning, in: First International Workshop on Model-Driven Software Migration (MDSM 2011), p. 38.
Walter, B., Alkhaeir, T., 2016. The relationship between design patterns and code smells: An exploratory study. Information and Software Technology 74, 127–142.
Xiong, R., Li, B., 2019. Accurate design pattern detection based on idiomatic implementation matching in Java language context, in: 2019 IEEE 26th International Conference on Software Analysis, Evolution and Reengineering (SANER), pp. 163–174.
Yu, D., Zhang, P., Yang, J., Chen, Z., Liu, C., Chen, J., 2018. Efficiently detecting structural design pattern instances based on ordered sequences. Journal of Systems and Software 142, 35–56.
Zanoni, M., Fontana, F.A., Stella, F., 2015. On applying machine learning techniques for design pattern detection. Journal of Systems and Software 103, 102–117.
Zhang, H., Liu, S., 2013. Java source code static check Eclipse plug-in based on common design pattern, in: Software Engineering (WCSE), 2013 Fourth World Congress on, IEEE. pp. 165–170.
