ABSTRACT

SEMANTIC INCONSISTENCY AND COMPUTATIONAL INTRACTABILITY IN TRANSITIVE ABSTRACTION RULES

By Cihan Kaynak

Class diagrams in UML capture the conceptual design view of a software system. Abstracting them therefore has important implications for reverse engineering and program comprehension. Building abstractions of class diagrams that are generated directly from source code is crucial for providing a maintenance engineer with the ontology of a legacy system. Egyed proposed 121 transitive abstraction rules to discover UML relationships among classes that are related to domain concepts. In this work, we propose a number of modifications to Egyed’s rules that address some semantic inconsistencies. Furthermore, we prove that serial application of Egyed’s rules is inherently ambiguous in some cases and that the identification of a semantically consistent abstraction is computationally intractable. Finally, we introduce a methodology that simplifies the set of abstraction rules introduced by Egyed.

SEMANTIC INCONSISTENCY AND COMPUTATIONAL INTRACTABILITY IN TRANSITIVE ABSTRACTION RULES

A Thesis

Submitted to the Faculty of Miami University

in partial fulfillment of

the requirements for the degree of

Masters of Computer Science

Department of Computer Science and System Analysis

by

Cihan Kaynak

Miami University

Oxford, Ohio

2008

Advisor______Dr. Gerald C. Gannod

Reader______Dr. James Kiper

Reader______Dr. Valerie Cross

TABLE OF CONTENTS

1 Introduction
2 Background and Related Work
2.1 Unified Modeling Language (UML)
2.2 Reverse Engineering and Design Recovery
2.3 Concept Identification
2.4 Transitive Abstraction Rules
2.5 Related Work
3 Abstraction Rules
3.1 Analysis of Abstraction Rules
3.1.1 Generalization
3.1.2 Dependency
3.1.3 Association
3.2 Ambiguity in Serial Abstraction
3.3 Reducing the Set of Abstraction Rules
4 Tool Support: XARules
5 Conclusion and Future Investigations
References
Appendix A: The Proposed Abstraction Rules
Appendix B: Ambiguous Paths with four Classes

LIST OF FIGURES

Figure 1. Generalization relationship.
Figure 2. Association relationship.
Figure 3. Aggregation relationship.
Figure 4. Dependency relationship.
Figure 5. Input-Output Schema of the Abstraction Function
Figure 6. A counterexample where ...
Figure 7. An example where does not depend on T.
Figure 8. An example where there is no association between and
Figure 9. A path that contains four classes.
Figure 10. The first way of abstracting the path in Figure 9.
Figure 11. The second way of abstracting the path in Figure 9.
Figure 12. Structure of the path that contains classes.
Figure 13. The abstraction of a path according to Rule #3 defined by Egyed.
Figure 14. The abstraction of a path according to Rule #60 defined by Egyed.
Figure 15. New syntax to represent an abstraction rule.
Figure 16. The high-level workflow of XARules.
Figure 17. An example from XARules.
Figure 18. Abstraction of the low-level model in Figure 17.

Acknowledgements

Firstly, I would like to thank Dr. Gerald C. Gannod for his continuous support of and commitment to my thesis. This work would not have been realized without his valuable feedback. I would also like to thank Dr. James Kiper and Dr. Valerie Cross for participating in my thesis committee and reviewing this work.

Secondly, I can never thank my parents and brother enough for their constant encouragement. The inspiration they have been providing me has always been my principal guide while achieving my goals in my life.

I would also like to thank all of my friends for their sincerity. Enduring friendships that I made at Miami University let me enjoy and be grateful for every single moment in my last two years.

Finally, I would like to extend my love to a beautiful woman in my life. Words cannot express how much I appreciate her strong support at every single step. She is not only the one whom I love to death, but also my best friend, whose encouragement has always helped me to overcome any challenge that I encountered.

1 Introduction

The Unified Modeling Language (UML) is a well-accepted standard that is used in the software industry to model structural, behavioral, and architectural views of a software system [1]. Class diagrams specified in UML address the design view of a program, so that a developer can conceptually model the software by abstracting away implementation details during the development phase. During reverse engineering, class diagrams can also be used by software maintainers to understand the program architecture and to capture domain-level concepts that are embedded in the source code.

Today, many UML tools can construct class diagrams directly from source code [2]. Since the model elements in these diagrams are recovered straightforwardly from the original source code, they are often called “as-built” diagrams. However, these diagrams are too detailed to present the high-level, structural view of a program, because implementation details are also captured during the construction. Therefore, there is a need for methods that build abstract class diagrams, so that a reverse engineer can better understand the domain concepts of a given legacy software system.

Carey and Gannod proposed an approach that identifies the classes in a software system that are related to domain concepts and filters out the others, which are considered implementation details [3]. However, their approach primarily focuses on class abstraction, not on the issues related to UML relationships. Egyed, on the other hand, proposed a method with 121 transitive abstraction rules whose main goal is to discover UML relationships among domain-level classes [4]. Even though Egyed’s rules are complete in terms of coverage, we observed several semantic inconsistencies that cause ambiguities in abstracted class diagrams. As a result, we make the following claim.

Thesis Statement:

An ontology of a legacy software system can be constructed if and only if domain-level classes that are embedded into the program source code are identified and semantically valid relationships among them are captured.

In this work, we first analyze the semantics of each transitive abstraction rule proposed by Egyed. During the discussion, we show that there are some circumstances under which resulting outputs are either invalid or incomplete. Furthermore, we introduce a number of modifications to Egyed’s rules that address these semantic inconsistencies.

Having examined each rule, we demonstrate that the methodology for serial application of Egyed’s rules is inherently ambiguous in some cases such that it may produce different abstract class diagrams for a given system. In addition to this, we also prove that identifying these different results is computationally intractable as there are exponentially many reductions for abstracting a given set of inputs.

Finally, we show that Egyed proposes two different rules for two input cases that are syntactically different, but semantically equivalent. In order to eliminate these redundancies, we propose a new syntax to define a rule so that the total number of rules is reduced to 66. We conclude our work by introducing a tool we have developed as an alternative to support the construction of abstract class diagrams.

As a result, this work provides five major contributions to the field: (1) a semantic analysis of transitive abstraction rules proposed by Egyed, (2) a proof that shows that serial application of Egyed’s rules on class diagrams may result in ambiguous cases where different abstract class diagrams of a program can be obtained, (3) a proof that shows that identifying semantically valid abstractions is computationally expensive, (4) a method that proposes a new syntax to define abstraction rules by which redundant rules in Egyed’s set are eliminated, and (5) a tool for constructing abstract class diagrams.

The remainder of this paper is organized as follows. Section 2 presents background information and related work in the reverse engineering and program understanding literature. Section 3 discusses the semantic inconsistencies in Egyed’s rules, proves the intractability of serial abstraction, and presents the methodology that yields a reduced set of abstraction rules. Section 4 describes the tool developed to build abstract class diagrams. Finally, conclusions and suggested future investigations are presented in Section 5.

2 Background and Related Work

In this section, we discuss the background material and related work in the areas of concept identification, reverse engineering and design recovery, and the abstraction of UML class diagrams through the discovery of transitive relationships among classes.

2.1 Unified Modeling Language (UML)

The Unified Modeling Language (UML) is a general-purpose visual modeling language that is used to specify, visualize, construct, and document the artifacts of a software system [5]. UML has several model elements that are used in requirements analysis and system design in a typical software development life cycle. The Class Diagram is one of the components of UML that can model the overall system structure. It may contain classes, interfaces, and basic relationships among them. In object-oriented modeling, there are two kinds of class diagrams: the conceptual class diagram and the design class diagram [6]. A Conceptual Class Diagram models the conceptual domain knowledge (ontology) of a solution, whereas a Design Class Diagram adds detailed implementation concerns to the domain knowledge so that the system design becomes ready to be realized. During development, a designer first models the concepts of a domain in a conceptual class diagram and then refines that high-level system model to generate a design class diagram.

Class diagrams not only represent concepts in a domain, but also illustrate relationships among them. In the UML notation, there are four main types of relationships: generalization (inheritance), association, aggregation (part-of), and dependency [7]. A generalization is a relationship between a general thing (called the superclass or parent) and a more specific kind of that thing (called the subclass or child) [7]. Such a relationship is also called an is-a relationship. An association is a structural relationship that specifies that objects of one thing (class) are connected to objects of another [7]. Given an association, we can navigate from an object of one class to an object of the other class and vice versa. An aggregation, on the other hand, is an association that models a “whole/part” relationship rather than a structural connection between peers; the open diamond marks the “whole” [7]. An aggregation is also called a has-a relationship. In addition, composition is a special form of aggregation with strong ownership and coincident lifetime of the part with the whole [7]. Unlike aggregation, composition is drawn with a filled diamond at the “whole” end. Since every composition is an aggregation, we focus on aggregation for the sake of generality. Finally, a dependency is a using relationship that states that a change in the specification of one thing may affect another thing that uses it, but not necessarily the reverse [7]. Dependency is weaker than association in the sense that the client in a dependency does not necessarily access resources defined in the supplier (e.g., class A does not access an attribute of class B, but class A has a method that expects an instance of class B as a parameter). The following figures illustrate the four major types of UML relationships described above.

Figure 1. Generalization relationship.

Figure 2. Association relationship.

Figure 3. Aggregation relationship.

Figure 4. Dependency relationship.
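As a rough illustration of how the four relationship types tend to surface in source code, the following Python sketch uses hypothetical banking classes. The names Account, SavingsAccount, Customer, Branch, and AuditLog are invented for this example, and the mapping from code to UML relationships given in the comments is one common convention rather than a definitive rule.

```python
class Account:
    """A hypothetical domain class."""

    def __init__(self, number):
        self.number = number
        self.balance = 0.0


class SavingsAccount(Account):
    """Generalization: SavingsAccount is-a Account (child and parent)."""

    def __init__(self, number, rate):
        super().__init__(number)
        self.rate = rate


class Customer:
    """Association: a Customer keeps a reference to an Account (a peer)."""

    def __init__(self, name, account):
        self.name = name
        self.account = account


class Branch:
    """Aggregation: a Branch (the whole) has-a collection of Accounts (parts)
    that are created elsewhere and may outlive the Branch."""

    def __init__(self):
        self.accounts = []

    def add(self, account):
        self.accounts.append(account)


class AuditLog:
    """Dependency: AuditLog only accepts an Account as a parameter; it keeps
    no reference to it and reads none of its attributes."""

    def __init__(self):
        self.entries = 0

    def record(self, account):
        self.entries += 1
        return self.entries
```

A reverse engineering tool that builds an as-built diagram from such code would typically report all four relationships, including the weak dependency from AuditLog to Account.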

In a typical class diagram that represents the structural view of a software system, some classes are more important than others. In other words, some concepts in the system design fully represent the domain entities, whereas others do not. For instance, the concept of “account” is likely to be more important than the concept of “contact details” in a typical banking automation system. Concepts that are more essential than the others are called core concepts, while the others are called peripheral concepts [8]. In our work, we use the term concept class to refer to a core concept and non-concept class to refer to a peripheral concept.

Generally speaking, the class diagram generated from the source code of an application contains both concept and non-concept classes. Therefore, an abstraction mechanism is needed that filters out non-concept classes and preserves concept classes within the ontology. Metzger defines abstraction as a transformation that allows model users to concentrate on the significant system aspects, thus allowing them to handle complexity [9]. In this context, the abstraction of a UML class diagram can be rephrased as a process that removes non-concept classes and keeps the concept classes, while preserving or generating the relationships among them.

2.2 Reverse Engineering and Design Recovery

The term “reverse engineering” has its origin in the analysis of hardware systems, well before it evolved into a major link in the software life cycle. Rekoff defines reverse engineering as the process of developing a set of specifications for a complex hardware system by an orderly examination of specimens of that system [10]. According to Rekoff, reverse engineering in the hardware sense is conducted by someone other than the developer, without access to any of the initial design decisions, for the purpose of creating an exact duplicate of the original hardware system [10].

With the development of software solutions in hardware systems, reverse engineering in software systems began to evolve. Chikofsky and Cross redefine reverse engineering for the domain of software life cycle as the process of analyzing a subject software solution to identify the system's components with their interrelations and to create representations of the system in another form or at a higher level of abstraction [11]. In this sense, the main objective of a reverse engineering process in a software solution is to gain a sufficient design-level understanding to help maintenance, while it is to clone the system in the case of hardware [11].

On the other hand, design recovery is a subset of reverse engineering which recreates design abstractions from a combination of code, existing documentation, and domain knowledge to fully understand what a software solution does, how it does it, why it does it, and so forth [11].

2.3 Concept Identification

The concept assignment problem is the problem of recognizing concepts in software systems and building a model or human-level understanding of those concepts [12]. This well-known problem is still commonly solved by analyzing source code line by line. Such a technique requires intensive human involvement and therefore does not scale to large software systems.

One solution technique for the concept assignment problem is based on graph theory. In object-oriented software engineering, a class diagram can be viewed as a graph in which classes are nodes and relationships are edges. Therefore, graph analysis methods can be applied to a class diagram once it is converted to an equivalent graph. In addition, a class diagram can be considered a social network in which nodes (classes) are connected to each other by relationships. In social networks, the importance of a node is measured by a centrality metric, such that a significant vertex in the network is expected to have a higher centrality value [13]. As a result, a concept in an existing software system can be identified according to its centrality value. Although several centrality metrics can be used in the analysis of a graph, five of them are mainly used for recognizing core concepts: (1) degree centrality (measures the number of edges on a node), (2) closeness centrality (measures the average distance from that node to all other nodes), (3) betweenness centrality (measures the number of shortest paths between all pairs of nodes in the graph that use a particular node), (4) information centrality (measures the information contained in all paths originating with a specific node), and (5) eigenvector centrality (measures the centrality of a node relative to the importance of its surrounding nodes) [8].
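The first two metrics can be sketched in a few lines of Python. The example below is only an illustration under stated assumptions: the graph is a hypothetical, connected, undirected class graph invented for this sketch, and closeness is reported as the plain average shortest-path distance described above (so a smaller value means a more central node).

```python
from collections import deque

# Hypothetical class diagram as an undirected graph: class -> related classes.
GRAPH = {
    "Account": {"Customer", "Transaction", "Branch"},
    "Customer": {"Account", "ContactDetails"},
    "Transaction": {"Account"},
    "Branch": {"Account"},
    "ContactDetails": {"Customer"},
}


def degree_centrality(graph):
    """Number of edges incident to each node."""
    return {node: len(neighbors) for node, neighbors in graph.items()}


def average_distance(graph, start):
    """Average shortest-path distance from start to every other node.

    Uses breadth-first search and assumes the graph is connected.
    """
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return sum(dist.values()) / (len(dist) - 1)


degrees = degree_centrality(GRAPH)
most_connected = max(degrees, key=degrees.get)
```

In this toy graph, Account has the highest degree and the smallest average distance, which is the kind of signal that would mark it as a candidate core concept.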

Another way of identifying concepts is based on machine learning techniques. The solution mainly involves collecting object-oriented metric data from existing systems, which are then used by machine learning methods to create classifiers, so that a given class in a legacy software system is classified as either concept or non-concept [3]. The object-oriented metrics used to create the feature vector of a given class capture information about the size and complexity of the software at the class level. The metrics and classifiers used in this approach are discussed in detail in Section 2.5.

2.4 Transitive Abstraction Rules

Egyed proposes a set of abstraction rules that capture relationships among concept classes in a class diagram [4]. Egyed exhaustively creates rules for all possible combinations of UML relationships that can exist in a short path. A short path is a sequence of classes consisting of two concept classes and one non-concept class that has a UML relationship with each concept class in the path. Egyed introduces a set of named relations to denote Association, Dependency, and Generalization type relationships in UML, as shown in Table 1. We call this set $\mathcal{R}$. Note that the cardinality of $\mathcal{R}$ is equal to 11.

Table 1. The Mapping between Egyed’s Named Relations ($\mathcal{R}$) and UML Relationships.

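Egyed’s Named Relation | UML Relationship in a Class Diagram
GeneralizationRight | (diagram)
GeneralizationLeft | (diagram)
Association | (diagram)
AssociationLeft | (diagram)
AssociationRight | (diagram)
DependencyRight | (diagram)
DependencyLeft | (diagram)
[Agg]Association | (diagram)
Association[Agg] | (diagram)
[Agg]AssociationRight | (diagram)
AssociationLeft[Agg] | (diagram)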

Input: [Source <<concept>>] --R1-- [Intermediate <<non-concept>>] --R2-- [Target <<concept>>]

Output: [Source <<concept>>] --R3-- [Target <<concept>>]

Figure 5. Input-Output Schema of the Abstraction Function

The diagram in Figure 5 illustrates the input-output schema of the abstraction rules defined by Egyed. A rectangle represents a class and a line between two classes represents a relationship, as in a UML class diagram. Furthermore, each class in the figure is tagged with a stereotype, which indicates whether the class is a concept or not (e.g., Source is a concept, whereas Intermediate is a non-concept). An input of the abstraction function is a short path with three classes (two concepts and one non-concept) and two UML relationships. The function abstracts an input path by filtering out the non-concept class and inferring the relationship between the two concept classes, if one exists. Both R1 and R2 are elements of $\mathcal{R}$, the set of named relations in Table 1. If the function determines that a relationship exists between the two concept classes after the abstraction, then it creates R3, which is also in $\mathcal{R}$. Otherwise, it concludes that there is no relationship between Source and Target. In Egyed’s notation, the schema described in Figure 5 would be represented as R1 – R2 – R3. For instance, a concrete rule is obtained when R1, R2, and R3 are each replaced by a named relation from Table 1.

The abstraction function outputs R3 by evaluating R1 and R2 independently of the Source concept class, the non-concept class, and the Target concept class. It is therefore relationship-oriented rather than class-oriented. Thus, given the set $\mathcal{R}$ of all named relations listed in Table 1, the abstraction function is formally defined as $f : \mathcal{R} \times \mathcal{R} \rightarrow \mathcal{R} \cup \{\varnothing\}$, where $\varnothing$ denotes the absence of a relationship. Since the size of $\mathcal{R}$ is equal to 11, the total number of tuples in the domain of $f$ is equal to 121 (i.e., $|\mathcal{R} \times \mathcal{R}| = 11 \times 11 = 121$). This shows that the set of abstraction rules that Egyed proposes is complete. However, as we will show in Section 3.3, almost every rule has a dual (i.e., two rules are duals of each other if they are syntactically different but semantically equivalent), and the total number of abstraction rules can be reduced to 66 by introducing a new notation.
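The two counts can be reproduced with a small Python sketch. It enumerates the 121 input pairs and then groups each pair with its dual. Note the assumptions: the MIRROR table and the dual() function below are our own guess at how a dual arises (reading the same short path from right to left), and they are not taken from Egyed’s definitions or from the notation introduced in Section 3.3.

```python
from itertools import product

# The 11 named relations listed in Table 1.
RELATIONS = [
    "GeneralizationRight", "GeneralizationLeft",
    "Association", "AssociationLeft", "AssociationRight",
    "DependencyRight", "DependencyLeft",
    "[Agg]Association", "Association[Agg]",
    "[Agg]AssociationRight", "AssociationLeft[Agg]",
]

# Assumption: the relation seen when the same edge is read right-to-left.
MIRROR = {
    "GeneralizationRight": "GeneralizationLeft",
    "GeneralizationLeft": "GeneralizationRight",
    "Association": "Association",
    "AssociationLeft": "AssociationRight",
    "AssociationRight": "AssociationLeft",
    "DependencyRight": "DependencyLeft",
    "DependencyLeft": "DependencyRight",
    "[Agg]Association": "Association[Agg]",
    "Association[Agg]": "[Agg]Association",
    "[Agg]AssociationRight": "AssociationLeft[Agg]",
    "AssociationLeft[Agg]": "[Agg]AssociationRight",
}


def dual(pair):
    """Assumed dual of an input (R1, R2): the same path read backwards."""
    r1, r2 = pair
    return (MIRROR[r2], MIRROR[r1])


# Domain of the abstraction function: every ordered pair (R1, R2).
domain = list(product(RELATIONS, repeat=2))

# Pairs that are their own dual, and one representative per dual pair.
self_dual = [pair for pair in domain if dual(pair) == pair]
canonical = {min(pair, dual(pair)) for pair in domain}

print(len(domain), len(self_dual), len(canonical))  # 121 11 66
```

Under this reading, 11 inputs are their own dual and the remaining 110 fall into 55 dual pairs, which gives 11 + 55 = 66 and agrees with the reduced rule count stated above.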

2.5 Related Work

Hsi, Potts, and Moore proposed a reverse engineering method that aims to provide the ontology of a given software system [8]. They use social network analysis methods (described in Section 2.3) to identify concepts in a legacy software system. They claim that domain concepts are illustrated by the graphical user interface of a system. Thus, they follow a black-box reverse engineering methodology in the sense that they create a graph based on the graphical user interface of a software system. A node in the graph represents a graphical widget (e.g., button, check box, text box) in the user interface, whereas an edge represents a containment or activation relationship that may exist between two widgets. They claim that a node with a high centrality value in the social network (i.e., the initially created isomorphic graph) is possibly a core concept. As a result, they eliminate nodes with low centrality values (i.e., they remove peripheral concepts) and present the remaining concepts and the relationships among them as the ontology of the software. Our work differs from theirs in three ways. First, we create the initial low-level model from the source code (the graphical user interface of a software system may not illustrate all concepts in the domain). Second, our concept identification method depends on machine learning techniques. Lastly, we not only filter non-concept classes, but also focus on transitive relationships among concept classes.

Carey and Gannod introduced a supervised machine learning technique that facilitates filtering non-concept classes from existing systems at the source level based on their features in the software system [3]. The proposed solution involves collecting object-oriented metric data from existing systems (training phase), which are then used to identify concept classes in a given system (testing phase). Several object-oriented metrics are used to create a feature vector for each class in a particular legacy system, so that a class can be identified as either concept or non-concept. Their concept identification is fully automated at the source level, in that no human intervention is needed. Support Vector Machines (SVM) and k-nearest neighbors (KNN) are used as machine learning techniques. The feature vector of a given class is based on object-oriented metrics (e.g., number of attributes, number of methods, depth of inheritance tree, lack of cohesion of methods, McCabe cyclomatic complexity). The metrics used in their work mainly capture information about the size and the complexity of a software system at the class level. According to their hypothesis, the feature vector of a concept class is significantly different from the vector of a non-concept class. In our work, we use their approach to identify concept classes in a particular software system.

Milanova proposed a methodology that recovers composition relationships from the source code to bridge the gap between the design class diagram and the reverse engineered diagram (i.e., a class diagram that is naively created from the source code by a typical UML design tool) [14]. Milanova uses points-to analysis and the owners-as-dominators model to identify composition relationships. In that work, an approximate object graph of a given program is created by points-to analysis, where nodes denote objects and edges represent the “may-access” relationship. The ownership analysis is then conducted by following the owners-as-dominators model, which states that a node m dominates a node n if every path from the root of the graph that reaches n passes through m [15]. In Milanova’s work, the focus is on creating a semantically valid as-built class diagram of a software system. In contrast, we primarily focus on abstracting as-built class diagrams to provide the domain model of a given legacy system.
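The dominance relation itself can be illustrated with the standard iterative data-flow computation of dominator sets. The sketch below is not Milanova’s analysis; the object graph and its node names are hypothetical, and the algorithm is the textbook fixed-point formulation.

```python
def dominators(graph, root):
    """Return {node: set of its dominators} for nodes reachable from root.

    graph maps each node to an iterable of its successors.
    """
    # Collect the nodes reachable from the root.
    reachable, stack = set(), [root]
    while stack:
        node = stack.pop()
        if node not in reachable:
            reachable.add(node)
            stack.extend(graph.get(node, ()))

    # Build predecessor sets restricted to reachable nodes.
    preds = {node: set() for node in reachable}
    for node in reachable:
        for succ in graph.get(node, ()):
            preds[succ].add(node)

    # Start from "everything dominates everything" and shrink to a fixed point.
    dom = {node: set(reachable) for node in reachable}
    dom[root] = {root}
    changed = True
    while changed:
        changed = False
        for node in reachable - {root}:
            new = {node} | set.intersection(*(dom[p] for p in preds[node]))
            if new != dom[node]:
                dom[node] = new
                changed = True
    return dom


# Hypothetical object graph: c is reachable through a and through b.
OBJECT_GRAPH = {"root": ["a", "b"], "a": ["c"], "b": ["c"], "c": ["d"], "d": []}
dom = dominators(OBJECT_GRAPH, "root")
```

Here neither a nor b dominates c, because c can be reached through either of them, whereas c dominates d, since every path from the root to d passes through c.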

Gueheneuc presented a reverse engineering tool suite to create as-built class diagrams from static (i.e., source code or class files) and dynamic (i.e., history of execution events) models of Java programs [16]. Classes, interfaces, and inheritance and instantiation relationships are recovered from static models of a given program. Other possible relationships among classes, such as use, association, aggregation, and composition, are identified by using dynamic models. First, consensual definitions of dynamically recovered relationships are restated in the work. Then, four properties are introduced: exclusivity (whether an instance of a class participating in a relationship can be in another relationship at a given time), invocation site (the way messages are sent from an instance of one class to an instance of another class), lifetime (a constraint on the lifetime of all objects of a class with respect to the lifetime of all objects of another class), and multiplicity (the number of instances allowed from each class participating in the relationship). Finally, the definitions of the relationships (i.e., use, association, aggregation, and composition) are formalized with these four properties. The history of execution events of a program is then inspected according to these formal definitions, and relationships among classes are identified. Gueheneuc’s work is similar to Milanova’s work in the sense that the focus is on producing the most precise as-built class diagram from the source code.

Sutton and Maletic introduced an automated approach that recovers UML class models from C++ code [17]. The semantic gap between UML and C++ is the main motivation for the work; therefore, a set of mappings for the reverse engineering of UML class models from C++ source code is proposed. Several rules and constraints are introduced to capture fundamental model elements (e.g., classes, interfaces, data types, attributes, associations) from the source code. In addition, a heuristic to detect design-level attributes is presented in the context of the abstraction of class diagrams. Instead of capturing all member variables of a given class, only class-level member attributes that are associated with a set of accessor and mutator methods are recovered from the source code. In order to identify attributes, formal concept analysis is used, such that the set of member variables is defined as the attribute set and the set of member functions is defined as the object set in the formal context of a given class. From the formal context of a given class, the corresponding concept lattice is created, where a concept in the lattice consists of the maximal collection of member functions that use common variables. As a result of analyzing the lattice, a concept with one member variable and one or more member functions is presented as a design-level property. Even though their approach aims to construct high-level models, the main emphasis of their work is on abstracting the attributes of a given class, not the classes captured from the source code. Again, our work primarily focuses on abstracting classes and the relationships among them.

As discussed in Section 2.4, Egyed proposed an abstraction method that discovers transitive UML relationships among concept classes after filtering non-concept classes from the class diagram of a particular software system [4]. Egyed introduces 121 rules by considering all possible combinations of UML relationships that can exist in a short path of classes (two concept classes and one non-concept class). These 121 cases are presented as base cases in the work. Egyed claims that a longer path (i.e., a path that contains more than one non-concept class) can be abstracted recursively, so that a given class diagram can be fully abstracted. Our work is directly related to Egyed’s work. However, there are two primary points where our work differs. First, our research also focuses on concept identification in addition to the abstraction of relationships. Second, as we will discuss in Section 3, we have investigated the semantic inconsistencies in the abstraction rules as well as the computational complexity of applying them recursively on longer paths.

3 Abstraction Rules

In this section, we discuss the approach for constructing abstractions of class diagrams through the application of transitive relationship abstractions. First, a detailed semantic analysis of Egyed’s abstraction rules is presented. We show that there are circumstances under which some of the abstraction rules proposed by Egyed are not semantically valid. Next, we prove that Egyed’s rules are inherently ambiguous: there exist multiple ways of abstracting a given set of inputs, and not all of them produce the same result. We also prove that determining whether all possible orders of applying abstraction rules to a given input produce the same result is computationally expensive. Finally, we propose a set of abstraction rules that is semantically equivalent to Egyed’s original rules but contains 66 rules, whereas Egyed proposes 121.

3.1 Analysis of Abstraction Rules

In this section, we provide a semantic analysis of abstraction rules introduced by Egyed. We categorize the abstraction rules into three major groups: Generalization, Association, and Dependency. During the discussion, we will mainly stress rules for which there are semantic inconsistencies that possibly affect the abstraction of a given class diagram.

3.1.1 Generalization

As we discussed earlier, Egyed uses GeneralizationRight and GeneralizationLeft to represent the Generalization relationship in UML. Depending on the position of the participating classes, the appropriate notation is selected to represent the Generalization relationship. However, we argue that using “Right” or “Left” as a direction identifier in a two-dimensional plane is ambiguous, because the concepts of “Right” and “Left” are valid only on the horizontal axis. Instead, we propose a more generic way of representing the relationship by using the child and parent concepts. In our representation, the child denotes the derived class (i.e., the specialization of the concept that is represented by the parent) in a given Generalization relationship. Table 2 shows the mapping between Egyed’s notation and the corresponding parent/child denotations.

Table 2. The Mapping between Egyed’s Notation and Our Approach to model Generalization.

Egyed’s Notation | Our Approach
A (diagram) B | child: A, parent: B
A (diagram) B | child: B, parent: A

Valid Generalization

Let $A$ and $B$ be two sets in the universe. The definition of the subset states that

$A \subseteq B \iff \forall x \, (x \in A \rightarrow x \in B)$.

Specifically, $A$ is a subset of $B$ if and only if every element of $A$ is also an element of $B$. On the other hand, the semantics of the class construct in UML defines a set of instances in some universe. That is, an instance of a class is considered a member of the set of objects characterized by the class. So, let $C$ and $P$ be two classes where $C$ is the child and $P$ is the parent of a generalization $g$. Then, we can claim that $g$ is a valid generalization ($C \subseteq P$) if and only if every instance of $C$ is also an instance of $P$.

On the other hand, let us say there exists a relationship between and any other class . Since every property and relationship associated to is inherited by , then the relationship between and is also inherited by at the class level. However, as Egyed states that there might be exceptional cases at the implementation level where the assumption we made above is violated [4]. For instance, there might be a method in where the relationship between and is instantiated. On the other hand, might override the same method in a way that the code section related to the relationship between and is not implemented. Thus, there would not be a relationship between and . Regarding the automation of abstraction rules, one possible

14 solution for those cases in generalization is providing the user an informative feedback so that he can inspect the abstraction after the process.

As a result, based on the definition of valid generalization, we conclude that Egyed’s abstraction rules, which are proposed for short paths where the non-concept class is the parent, are semantically valid at the class level. Table 3 below shows abstraction rules, which are proposed by Egyed for short paths where the non-concept class is the parent [4]. The field Related Rule # indicates the identification number of an abstraction rule in Egyed’s work.

Table 3. Egyed’s abstraction of Generalization in short paths where the non-concept class is the parent.

Short Path before the Abstraction    Abstraction Proposed by Egyed    Related Rule #

S X T S T 1

S X T S T 2

3

S X T S T 4

S X T S T 5


6

S X T S T 7

S X T S T 8

S X T S T 9

S X T S T 10

S X T S T 11

S X T S T 16

38

S X T S T 60


S X T S T 82

27

S X T S T 49

S X T S T 71

S X T S T 115

S X T S T 93

S X T S T 104

In Table 3, classes S and T denote concept classes; X represents a non-concept class. Except in Rule #5, the relationship between the parent and the other class is inherited by the child. Thus, none of those rules violates the definition of valid generalization. On the other hand, Rule #5 is a special case of generalization. The fact that S is a specialization of X and T is a specialization of X does not necessarily imply that S is a specialization of T. Again, if we make an analogy from set theory and consider S, X, and T as sets, we can provide the Venn diagram in Figure 6 as a counterexample that supports our claim.


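[Figure 6: Venn diagram in which the set X contains the sets S and T.]

Figure 6. A counterexample where $S \subseteq X$ and $T \subseteq X$, but $S \not\subseteq T$.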
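The set-theoretic reading of a valid generalization can be checked mechanically. The following Python sketch treats each class as the set of its instances; the instance names are made up for illustration and the helper name `is_valid_generalization` is an assumption, not part of Egyed's rules.

```python
# Each class is modeled as the set of its instances (hypothetical instance names).
X = {"x1", "s1", "s2", "t1", "t2"}  # non-concept parent class
S = {"s1", "s2"}                     # concept class, specialization of X
T = {"t1", "t2"}                     # concept class, specialization of X


def is_valid_generalization(child, parent):
    """A generalization is valid iff every instance of the child
    is also an instance of the parent (child is a subset of parent)."""
    return child <= parent


print(is_valid_generalization(S, X))  # S specializes X
print(is_valid_generalization(T, X))  # T specializes X
print(is_valid_generalization(S, T))  # S does not specialize T
```

Both S and T pass the check against X, yet the check fails between S and T, which is the situation the Rule #5 discussion describes.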

The other possible short path in which there exists a generalization relationship between the non-concept class and one of the two concept classes is the case where the non-concept class is the child and one of the two concept classes is the parent. Egyed proposes a similar approach such that the relationship between the non-concept class and the other concept class, which does not participate in the Generalization, can be transitively associated to the concept classes after the abstraction process. However, as we will demonstrate shortly, there are some cases where this approach does not produce semantically valid abstractions. Table 4 lists all cases where the non-concept class is the child and the concept class is the parent, with the abstraction of each case proposed by Egyed. Note that, except for the path in Case #2, every short path in each case is semantically related to exactly two abstraction rules in Egyed’s work [4].


Table 4. Egyed’s abstraction of Generalization in short paths where the non-concept class is the child.

Short Path before the Abstraction    Abstraction Proposed by Egyed    Case #    Related Rule #

S X T S T 1 1, 16

S X T S T 2 12

3 23, 17

4 34, 13

S X T S T 5 45, 18

S X T 6 56, 14

S X T 7 11, 20

S X T S T 8 89, 22


S X T S T 9 100, 21

S X T S T 10 67, 19

S X T S T 11 78, 15

As in Table 3, classes S and T represent concepts, and the class X denotes the non-concept class. We observe that the abstractions proposed by Egyed for cases #1, #2, #4, #6 and #11 in Table 4 are semantically valid. However, the abstractions proposed for the rest of the cases in the table are not always correct.

The abstraction proposed by Egyed in Case #3 says that if X is a subclass of T and S is dependent on X, then S must also be dependent on T. However, there are cases for which this claim does not provide consistent results. For instance, if S is dependent on a property of X that is specifically implemented in X, but does not exist in T, then we cannot claim that S is also dependent on T. Specifically, there might be a base class with the implementation of some methods. On the other hand, there might be a subclass, which not only inherits all implementation details of that base class, but also adds some additional method definitions, which are independent from the definitions in the base class. Then, if there exists a third class, which is dependent on the definitions of those additional methods in that subclass, that dependency would remain specific to the client and the supplier and would not be propagated to the base class. The model in Figure 7 is an example that supports our claim.


Figure 7. An example where S does not depend on T.
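The situation discussed for Case #3 can be mirrored in code. The following is a minimal Python sketch, not the model shown in Figure 7; the method names `base_operation` and `extra_operation` are hypothetical. The subclass X adds an operation that the base class T does not have, and the client S uses only that added operation.

```python
class T:
    """Concept class acting as the parent in the generalization."""

    def base_operation(self):
        return "defined in T"


class X(T):
    """Non-concept class, a child of T that adds its own operation."""

    def extra_operation(self):
        # This operation exists only in X, not in T.
        return "defined only in X"


class S:
    """Concept class that depends on X through the added operation."""

    def use(self, x):
        return x.extra_operation()


print(S().use(X()))
print(hasattr(T, "extra_operation"))
```

Changing or removing `extra_operation` affects S but leaves T untouched, so replacing the dependency on X with a dependency on T would misrepresent what S actually relies on.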

We can follow a similar approach in order to show that the abstraction rules introduced by Egyed for Case #5 and #7 in Table 4 are not always semantically true. For each of these two cases, if S is associated to a definition in X that emerges as a result of the specialization between X and T, then we cannot claim that there exists an association between S and T. For instance, let X have the definition of an attribute called p, which only exists in X, but not in T. In addition, suppose an instance of S uses p through a method call. This configuration eventually results in an association between S and X. However, since p is only defined in X, but not in T, there is no association between S and T because of the usage of p. The model in Figure 8 illustrates the example we motivated above.

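[Figure 8: class diagram with the classes T, X, and S, where X declares the attribute +p and the operation +getP(). Notes in the diagram: “p is specifically defined in X” and “an instance of S accesses the attribute p through getP()”.]

Figure 8. An example where there is no association between S and T.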

Both Case #8 and #10 in Table 4 involve the aggregation relationship. The aggregation relationship describes a has-a relationship between classes. According to Egyed, cases #8 and #10 imply that if X is a subclass of T and there is an aggregation between S and X, then the same aggregation must also hold between S and T, except that the aggregation in Case #8 is a two-way association, whereas the one in Case #10 is one-way. However, this argument is not valid. Note that the aggregation between S and X is specific to instances of X, and does not extend to every instance of T.

Finally, consider Case #9 in Table 4. The aggregation relationship (Association[Agg]) between S and X is the union of AssociationLeft[Agg] and AssociationRight. Thus, Case #9 can be expressed by the composition of Case #11 and Case #5 in Table 4. Since we already showed that the abstraction proposed for Case #5 is not always semantically valid, we can infer that the abstraction proposed for Case #9 is not always semantically valid either.

This concludes our discussion about the abstraction of short paths that involve the Generalization relation. We first analyzed short paths where the non-concept class is a generalization of one of the two concept classes and demonstrated that the abstraction rules proposed by Egyed for those cases are valid. Second, we moved to another type of short path where the non-concept class is a specialization of one of the two concept classes. Then, we showed that there are some cases where Egyed’s rules do not output semantically valid abstractions.

3.1.2 Dependency

In this section, we will semantically analyze short paths that contain the Dependency relationship and their abstractions proposed by Egyed. Before starting our discussion, we will first divide these short paths into three groups and then analyze them individually. Short paths that contain the Dependency relationship are grouped as follows:

Dependency – Dependency: Group of short paths that contain exactly two Dependency relationships,

Dependency – Generalization: Group of short paths that contain exactly one Dependency relationship, and exactly one Generalization relationship,

Dependency – Association: Group of short paths that contain exactly one Dependency relationship, and exactly one Association relationship.


The set of short paths in the Dependency – Generalization group is, in fact, a subset of the paths analyzed in Section 3.1.1. Therefore, we will not analyze them here again. Instead, we will only focus on short paths in the other two groups.

Booch, Rumbaugh, and Jacobson define a dependency as a using relationship, specifying that a change in the specification of one thing (supplier) may affect another thing (client) that uses it [7]. From this definition, we can infer that dependency is a relationship at the modeling-level, but not at the runtime-level, because the definition of a model element is finalized before the runtime. Since we only focus on class diagrams in this work, we will consider a class as a model element.

Dependency – Dependency

In this section, we analyze short paths and their abstractions proposed by Egyed (as shown in Table 5) in the Dependency – Dependency group. Again, the field Related Rule # indicates the identification number of an abstraction rule in Egyed’s work.

Table 5. Egyed’s abstraction of short paths in the Dependency – Dependency group.

Short Path before the Abstraction (Dependency – Dependency)    Abstraction Proposed by Egyed    Related Rule #

24

28

39

35


As in Section 3.1.1, we will denote a short path with three classes and two relationships where S and T are concept classes, whereas X is the non-concept class in the path. Rule #28 and Rule #35 are similar in terms of their abstraction results. We conclude that both of those rules proposed by Egyed are semantically valid.

However, we observe that there are cases where Rule #24 and Rule #39 do not produce the right abstraction. Rule #24 states that S must depend on T, if S depends on X and X depends on T. However, if those two dependencies are independent, then we cannot claim that there exists a dependency between S and T. For instance, let m1 be a method in S that takes an argument of type X. Moreover, let m1 use a property p1 defined in X, which causes a dependency between S and X such that a change in the definition of p1 may cause a modification in the definition of m1. On the other hand, let m2 be a method in X that takes an argument of type T to use a property p2 defined in T. So, it basically creates another dependency between X and T like the one between S and X. Additionally, let us also assume that m2 does not access p1 so that it cannot change the definition of p1. In this case, any change in m2 that is due to another change in the definition of p2 would not affect the definition of m1. In other words, a change in the specification of T would not affect the specification of S.

Finally, consider Rule #39 and its abstraction in Table 5. The argument used for Rule #24 also applies to Rule #39, because both cases have a similar structure in which the path is a chain of dependencies. If the dependency between T and X is independent from the dependency between X and S, then we cannot always claim that T is dependent on S. Otherwise, if those two dependencies are also dependent on each other, then the abstraction proposed by Egyed is applicable.
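The argument about independent dependencies can be made concrete by tracking dependencies at the member level rather than at the class level. The sketch below is only an illustration under that assumption; the member names (`m1`, `m2`, `p1`, `p2`) and the `USES` table are hypothetical and do not come from Egyed's work.

```python
# Member-level "uses" edges: (class, member) -> set of (class, member) it uses.
USES = {
    ("S", "m1"): {("X", "p1")},  # S.m1 uses property p1 of X
    ("X", "m2"): {("T", "p2")},  # X.m2 uses property p2 of T
    ("X", "p1"): set(),          # p1 itself uses nothing
}


def reaches_class(member, target_class, uses, seen=None):
    """True if `member` transitively uses some member of `target_class`."""
    seen = set() if seen is None else seen
    if member in seen:
        return False
    seen.add(member)
    for used in uses.get(member, set()):
        if used[0] == target_class:
            return True
        if reaches_class(used, target_class, uses, seen):
            return True
    return False


def class_depends_on(client, supplier, uses):
    """True if any member of `client` transitively uses a member of `supplier`."""
    return any(
        reaches_class(member, supplier, uses)
        for member in uses
        if member[0] == client
    )


print(class_depends_on("S", "X", USES))  # S depends on X
print(class_depends_on("X", "T", USES))  # X depends on T
print(class_depends_on("S", "T", USES))  # the chain does not carry over to T
```

If `("X", "p1")` were changed to use `("T", "p2")`, the last check would succeed, which corresponds to the case where the two dependencies depend on each other.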

Dependency – Association

Booch, Rumbaugh, and Jacobson state that association is semantically a special kind of dependency [7]. In other words, if there is an association between two classes, then there is also a dependency between them. This implication allows us to reduce an association to a dependency so that we can map a short path in the Dependency – Association group to its corresponding form that contains only dependencies. Table 6 lists the abstraction of each short path in the Dependency – Association group as proposed by Egyed. Moreover, it also shows the application of our reduction technique on each short path, which will be used while evaluating the validity of its abstraction.

Table 6. Short paths in Dependency – Association group.

Short Path before the Abstraction (Dependency – Association)    Abstraction Proposed by Egyed    Reduction of Short Path    Related Rule #

31

25

29

33

32

30

26


42

36

40

44

43

41

37

112

46


57

101

90

79

68

116

50

61

105


94

83

72

As demonstrated in Table 6, the reduction of a short path is carried out by considering the direction of the association. If the association in a short path is a one-way association, then we reduce the association to a dependency whose direction is the same as the direction of the reduced association. For example, the direction of the association in Rule #25 in Table 6 is from X to T. Then, the dependency, which is the reduction of the association in Rule #25, is also from X to T. On the other hand, if the association in a short path is a two-way association, then we create a dependency for each direction. For instance, we reduce the association in Rule #31 into two dependencies such that X and T are dependent on each other.

We claim that the abstraction of the resulting reduction implies the same abstraction as the original path. If the association in the original path is a one-way association, then the reduction provides another short path that contains exactly two dependencies, as we replace that one-way association with a dependency. This implies that the path created from the reduction of the original path is an element of the set of paths defined in Table 5. Thus, we can abstract the reduction of the original path according to the same arguments we provided while discussing the Dependency – Dependency group. At the end, the abstraction of the reduction (the path created by reducing the original path) will be the abstraction of the path before the reduction. For instance, the reduction of the path in Rule #25 is the same path as the one defined in Rule #24 shown in Table 5. While discussing Rule #24 in Table 5, we claimed that there exists a dependency between S and T if and only if the dependency between S and X is dependent on the dependency between X and T. So, this argument is also valid for the path (let us call it P'), which is created from the reduction of the original path (let us call it P) in Rule #25 shown in Table 6. As a result, the abstraction of P is equal to the abstraction of P'. However, Egyed claims that there always exists a dependency between S and T after the abstraction, which is not always true, as shown by our reduction method.

On the other hand, if the association in the original path is a two-way association, then we know that the reduction creates a dependency for each direction as we discussed earlier. This actually results in two short paths, which are both elements of the set of paths defined in Table 5. Moreover, one of those two paths is definitely equal to either the path in Rule #28 or the path in Rule #35 shown in Table 5. As a result, the abstraction of the original path (i.e. the short path, which is reduced) is equal to the union of abstractions of those two paths, where one of them is eventually cancelled out (both Rule #28 and Rule #35 produce no relationship after the abstraction). For instance, there are two paths in the reduction of Rule #94 as shown in Table 6:

1. S-[DependencyRight]-X-[DependencyLeft]-T, and

2. S-[DependencyLeft]-X-[DependencyLeft]-T.

The first path (S-[DependencyRight]-X-[DependencyLeft]-T) is equal to the path in Rule #28 in Table 5, whose abstraction does not produce any relationship between S and T. As a result, the abstraction of the path in Rule #94 in Table 6 is, in fact, equal to the abstraction of the path S-[DependencyLeft]-X-[DependencyLeft]-T, which should be computed according to the arguments we made while discussing short paths in Table 5 (i.e., T depends on S if and only if the dependency between T and X is dependent on the dependency between X and S). However, Egyed claims that T must always depend on S after the abstraction. Again, our reduction technique shows that the abstraction proposed by Egyed is not semantically valid for every case.
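The reduction step itself is mechanical and can be sketched in a few lines of Python. The representation below (a relationship as a `(kind, direction)` pair, with `"Both"` standing for a two-way association) is an assumption made for illustration, as are the function names. Treating paths whose two dependencies point in opposite directions as producing no relationship is our reading of Rules #28 and #35.

```python
def to_dependencies(kind, direction):
    """Reduce one relationship to the dependency name(s) that replace it."""
    if kind == "Dependency" and direction in ("Left", "Right"):
        return ["Dependency" + direction]
    if kind == "Association" and direction in ("Left", "Right"):
        # A one-way association becomes one dependency in the same direction.
        return ["Dependency" + direction]
    if kind == "Association" and direction == "Both":
        # A two-way association becomes one dependency per direction.
        return ["DependencyRight", "DependencyLeft"]
    raise ValueError(f"unsupported relationship: {kind} {direction}")


def reduce_short_path(first, second):
    """Return every Dependency-Dependency path obtained from the short path."""
    return [
        (a, b)
        for a in to_dependencies(*first)
        for b in to_dependencies(*second)
    ]


def is_chain(path):
    """True if both dependencies point the same way along the path."""
    return path[0] == path[1]


# Rule #94 as described in the text: a two-way association followed by DependencyLeft.
rule_94 = reduce_short_path(("Association", "Both"), ("Dependency", "Left"))
print(rule_94)
print([path for path in rule_94 if is_chain(path)])
```

Only the chained path survives the filter, and whether it yields a dependency between the concept classes still depends on the condition discussed for Table 5.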

This concludes our discussion about the abstraction of short paths that contain the Dependency relationship. We first examined short paths in the Dependency – Dependency group by showing that there exists a dependency between S and T if and only if the given dependencies in the short path are dependent on each other. Then, we moved on to the abstraction of short paths in the Dependency – Association group and demonstrated a reduction technique, which helps us to evaluate the validity of the abstraction of a path in the Dependency – Association group by reducing it to another path defined in Table 5.


3.1.3 Association

In this section, we will analyze the effect of the Association relationship on the transformation of short paths. During the discussion, we will group short paths that contain any Association relationship as follows:

Association – Association: Group of short paths that contain exactly two Association relationships,

Association – Generalization: Group of short paths that contain exactly one Association relationship, and exactly one Generalization relationship,

Dependency – Association: Group of short paths that contain exactly one Dependency relationship, and exactly one Association relationship.

Paths in the Dependency – Association group are already discussed in Section 3.1.2. Similarly, the analysis we carried out in Section 3.1.1 covers all paths in the Association – Generalization group. Thus, in this section, we will only focus on short paths in the Association – Association group and conclude that the rules proposed by Egyed on short paths in this group are semantically valid at the class level. As we did previously, we will denote a short path with three classes and two relationships where S and T are concept classes, whereas X is the non-concept class in the path.

Suppose that there is an association A1 between S and X. For the sake of simplicity, assume that A1 is a two-way association and the cardinality of A1 is many-to-many. Similarly, assume that there is another association A2 between X and T such that A2 is a two-way association and the cardinality of A2 is many-to-many, too. Given those facts, we can infer the following:

S is associated to T at the class level through A1 and A2.

The association between S and T is the abstraction of A1 and A2.

The inference above is the main motivation behind the abstraction, which is proposed by Egyed for short paths that contain associations.

We initially assumed that both associations in the path have a cardinality of many-to-many, but we also know that this is not the case for every association. However, this would not cause any problem, because the cardinality of the resulting abstraction between S and T can be computed by the multiplication rule (e.g., if the cardinality of the first association A1 is 1-to-10 and the cardinality of the second association A2 is 1-to-5, then the cardinality of the association between S and T is 1-to-50).
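The multiplication rule for cardinalities amounts to multiplying the upper bounds of the two associations. A minimal Python sketch follows; representing an unbounded ("many") end as `None` is an assumption made here for illustration.

```python
def compose_cardinality(first_upper, second_upper):
    """Upper bound of the abstracted association, given the upper bounds
    of the two associations in the short path. None stands for 'many'."""
    if first_upper is None or second_upper is None:
        return None  # 'many' combined with anything stays 'many'
    return first_upper * second_upper


# The example from the text: 1-to-10 combined with 1-to-5 gives 1-to-50.
print(compose_cardinality(10, 5))
```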

On the other hand, we also initially assumed that both associations in the path, A1 (between S and X) and A2 (between X and T), are two-way associations. Thus, the resulting association between S and T is also a two-way association. In general, the directions of A1 and A2 alone determine the direction of the resulting relationship. For instance, if both A1 and A2 are one-way associations with the same direction, then the resulting association is also a one-way association with the same direction as the associations in the path. If both A1 and A2 are one-way associations, but with opposite directions, then we can infer that S and T are not associated through A1 and A2, which means the abstraction does not produce any relationship between S and T. Finally, if one of A1 and A2 is a one-way association and the other one is a two-way association, then we can still infer that S and T are associated through A1 and A2. In addition, we can also claim that the resulting association is a one-way association with the same direction as the one-way association in the original path.
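The direction cases above form a small decision table, sketched below in Python. The direction labels `"Both"`, `"Right"`, and `"Left"` and the function name are assumptions chosen for this illustration; `None` stands for "no relationship is produced".

```python
DIRECTIONS = ("Both", "Right", "Left")


def compose_direction(first, second):
    """Direction of the abstracted association between the two concept
    classes, or None when the two one-way directions conflict."""
    if first not in DIRECTIONS or second not in DIRECTIONS:
        raise ValueError("direction must be 'Both', 'Right' or 'Left'")
    if first == "Both":
        return second  # the other association decides the direction
    if second == "Both":
        return first
    return first if first == second else None  # opposite one-way: no result


for pair in [("Both", "Both"), ("Right", "Right"), ("Right", "Left"), ("Left", "Both")]:
    print(pair, "->", compose_direction(*pair))
```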

So far, we have shown that S and T are associated through the associations A1 (between S and X) and A2 (between X and T) if the directions of those two associations do not conflict. However, we have not yet stated the type of the resulting association between S and T. Booch, Rumbaugh, and Jacobson define association as a structural relationship that specifies that objects of one class are connected to objects of another [7]. They also claim that an association is created either to navigate from objects of one class to objects of another (data-driven association) or to represent the interaction between objects of different classes (behavior-driven association) [7]. On the other hand, they define aggregation as a special kind of association that models a “whole/part” relationship where one class represents the “whole” that consists of another class representing the “part” [7]. According to those definitions, we can infer the type of the association between S and T from the given associations A1 and A2. If both A1 and A2 are aggregations and there is a chain of “whole/part” relationships (e.g., S is a part of X and X is a part of T), then the resulting association between S and T must also be an aggregation (e.g., S is a part of T). On the other hand, if both A1 and A2 are aggregations, but there is no chain of “whole/part” relationships (e.g., S and T are both parts of X), then the association between S and T is not an aggregation, but either a data-driven or a behavior-driven association depending on A1 and A2. In all other possible combinations of A1 and A2, the resulting association between S and T is again either a data-driven or a behavior-driven association.

As a result, the abstraction method, which is proposed by Egyed for short paths in the Association-Association group, matches with arguments that we provided in this section. Thus, we can conclude that Egyed’s abstraction approach is semantically valid.

This ends the semantic analysis of the abstraction rules proposed by Egyed. In Section 3.1.1, we analyzed the abstraction of short paths that contain Generalization by showing that not all of his abstraction rules for those paths are semantically valid. Then, we moved on to short paths that involve Dependency. During our discussion in Section 3.1.2, we provided an approach, which shows that his abstraction rules involving Dependency are applicable under certain conditions. Finally, we analyzed cases consisting of Association and concluded that the abstraction rules proposed by Egyed for short paths containing only association relations are semantically valid.


3.2 Ambiguity in Serial Abstraction

The set of abstraction rules discussed in Section 3.1 is defined only on short paths, which contain two concept classes and one non-concept class. Egyed claims that any path that consists of more than three classes can also be abstracted by applying the corresponding abstraction rules serially [4]. However, as Egyed also states, different orders of applying rules on a longer path may produce different abstraction results.

S X1 X2 T

Figure 9. A path that contains four classes.

Consider the path in Figure 9 where classes S and T are concept classes, while X1 and X2 are non-concept classes. The goal in this case is to abstract the path by removing X1 and X2, and discovering the relationship between S and T. According to how Egyed defines abstraction rules, there are two different ways of performing the serial abstraction on the path in Figure 9 with two different final abstraction results. The first way starts by abstracting the sub-path that contains S, X1, and X2 by applying the corresponding rule (i.e., Association-Class-GeneralizationRight -> AssociationRight). At the end of this process, X1 is removed and a new relationship between S and X2 is created as shown in Figure 10(b). Then, the path containing S, X2, and T is abstracted by applying another corresponding rule (i.e., AssociationRight-Class-Association -> AssociationRight) so that the overall process is completed as shown in Figure 10(c).


S X1 X2 T

(a) The original path in Figure 9

S X2 T

(b) Intermediate path after removing X1

S T

(c) Abstraction of the original path

Figure 10. The first way of abstracting the path in Figure 9.

The second way starts with another sub-path containing X1, X2, and T by applying the rule GeneralizationRight-Class-Association -> Association. After removing X2 and creating the resulting relationship between X1 and T, the path containing S, X1, and T is abstracted by applying the rule Association-Class-Association -> Association as shown in Figure 11(b) and (c) respectively.

S X1 X2 T

(a) The original path in Figure 9

S X1 T

(b) Intermediate path after removing X2

S T

(c) Abstraction of the original path

Figure 11. The second way of abstracting the path in Figure 9.


As a result, we have shown that the path in Figure 9 can be abstracted in two different ways, yielding two different abstraction results. Since there is no order of operations proposed for abstraction rules, the abstraction of the path in Figure 9 causes a semantic ambiguity that has to be resolved in order to build valid abstractions.
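The two abstraction orders for the path in Figure 9 can be replayed with a small Python sketch. The `RULES` table contains only the four rules quoted in this section, not Egyed's full rule set, and the function name `abstract` and the list-based path representation are assumptions made for illustration.

```python
# Only the four rules quoted in the text: (left relation, right relation) -> result.
RULES = {
    ("Association", "GeneralizationRight"): "AssociationRight",
    ("AssociationRight", "Association"): "AssociationRight",
    ("GeneralizationRight", "Association"): "Association",
    ("Association", "Association"): "Association",
}


def abstract(classes, relations, removal_order):
    """Remove the non-concept classes in the given order and return the
    relationship that remains between the two concept classes."""
    classes, relations = list(classes), list(relations)
    for name in removal_order:
        i = classes.index(name)  # position of the class to remove
        merged = RULES[(relations[i - 1], relations[i])]
        relations[i - 1:i + 1] = [merged]  # replace two relations by one
        del classes[i]
    return relations[0]


path_classes = ["S", "X1", "X2", "T"]
path_relations = ["Association", "GeneralizationRight", "Association"]

print(abstract(path_classes, path_relations, ["X1", "X2"]))  # first way
print(abstract(path_classes, path_relations, ["X2", "X1"]))  # second way
```

Removing X1 first yields AssociationRight, whereas removing X2 first yields Association, which reproduces the ambiguity described above.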

The path in Figure 9 is an example that contains four classes and can be abstracted in two different ways with two different results. A path containing four classes is the shortest case where an ambiguity can be observed, because there exists only one way of abstracting a path that has three classes. In our work, we exhaustively analyzed every possible path with four classes in order to check whether there exists an ambiguous case as in Figure 9. Since there are 11 different relations introduced in Table 1 and a path with four classes contains three relationships, there are 1331 (i.e., $11^3$) different settings of a path with four classes in total. In order to carry out the experiment, we implemented a program that abstracts a path with four classes by applying rules in both ways. After the automated analysis, we observed 96 cases among those 1331 paths where a semantic ambiguity exists as in Figure 9. The complete set of those 96 paths and their ambiguous abstractions can be seen in Appendix B. As an interesting observation, we noticed that there is a generalization relationship between the two non-concept classes in every path where we observed ambiguity. The fact that rules involving generalization contain some semantic inconsistencies (see Section 3.1.1) is one of the possible explanations behind this observation.

In the context of reverse engineering and program understanding, it is crucial to determine whether a given path can be abstracted unambiguously. In our experiment, we focused on paths that contain four classes. However, in practice, it is quite possible to encounter a path that contains more than four classes. Furthermore, there is a need to check whether all possible orders of applying rules on a given path produce the same abstraction or not. Otherwise, the semantically right abstraction of a given path cannot be generated. However, as we will show next, the process of determining whether all possible abstractions are the same is computationally expensive, because the number of different orders of applying rules on a given path is exponential in the size of the input (i.e., the number of classes in the given path).


Claim: Let P be a path that contains $n$ classes (i.e., 2 concepts and $n-2$ non-concepts). Let $f(n)$ be a function that denotes the number of different orders of applying rules on a given path that contains $n$ classes. Then, $f(n) = \Omega(2^n)$.

Proof: Let P be a path that contains $n$ classes. Let S and T denote the concept classes in P, while $X_i$ denotes a non-concept class for all $1 \le i \le n-2$. So, in total there are $n$ classes as claimed. By the definition of P, there is a UML relationship between S and $X_1$, and another one between $X_{n-2}$ and T. In addition, every pair $X_i$ and $X_{i+1}$ is connected with a UML relationship where $1 \le i < n-2$. Figure 12 illustrates the structure of the path P. In the figure, a rounded rectangle represents a class; a solid line represents an arbitrary relationship defined in Table 1. Basically, the goal is to obtain an abstraction result that only contains S and T with a UML relationship, if one exists according to the abstraction process.

Figure 12. Structure of the path P that contains $n$ classes.

At the first step, since there is no precedence order for abstraction rules, we can start the abstraction with the sub-path containing S, $X_1$, and $X_2$, or the sub-path containing $X_1$, $X_2$, and $X_3$, or another sub-path containing $X_2$, $X_3$, and $X_4$, and so on. In other words, any sub-path can be initially picked from $n-2$ possibilities. Once a sub-path is selected, the corresponding abstraction rule is applied so that the sub-path is removed by adding the resulting abstraction to P. In other words, one non-concept class is filtered and the original sub-path is replaced with its abstraction. Then, the second step starts where there are now $n-1$ classes and $n-3$ possible sub-paths to abstract in the path. This process goes on until a base case is reached where only S and T remain. This implies that there are $n-2$ steps in total, since one non-concept class is removed at the end of each step. As a result, by the multiplication rule, the total number of different sub-path abstraction sequences, which is equal to the value of $f(n)$, can be calculated as follows:

$$f(n) = (n-2)\cdot(n-3)\cdots 2\cdot 1 = (n-2)!$$

Since $(n-2)! = \Omega(2^n)$, we can state that $f(n) = \Omega(2^n)$. This concludes our proof.
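The counting argument in the proof can be checked numerically. The sketch below assumes that the input size counts all classes on the path (two concept classes plus the non-concept classes) and that every step may pick any remaining non-concept class; the function name `count_orders` is hypothetical.

```python
from math import factorial


def count_orders(num_classes):
    """Number of different orders in which the non-concept classes of a
    path with `num_classes` classes can be removed one at a time."""
    non_concepts = num_classes - 2
    if non_concepts <= 1:
        return 1  # zero or one non-concept class: a single order
    # Pick any of the remaining non-concept classes, then recurse
    # on the path that is one class shorter.
    return non_concepts * count_orders(num_classes - 1)


for n in range(3, 11):
    print(n, count_orders(n), factorial(n - 2))
```

For a path with four classes the count is 2, matching the two ways of abstracting the path in Figure 9.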

In order to reduce the computational complexity of abstracting longer paths, Egyed introduced two optimization methods: (1) Reuse, and (2) Merging [4]. During the exhaustive abstraction of a given path (i.e., trying all possible orders of applying rules), there are intermediate abstraction results (i.e., abstractions of sub-paths) that have been generated previously. Thus, Egyed proposes to reuse these intermediate results to simplify subsequent steps of abstracting the input. Egyed indicates that the method of reusing reduces the number of intermediate abstractions by two-thirds [4]. Even though the method of reusing optimizes the abstraction process, it must be noted that the improvement is only a constant factor, which is not asymptotically significant. Merging is the second technique introduced by Egyed in order to minimize the potentially large number of intermediate abstraction steps [4]. According to Egyed, two abstraction results of a sub-path with four classes in the original input path must be combined if both of these abstraction results are exactly the same (i.e., the same UML relationship with the same direction). As a result of the complexity analysis, Egyed states that merging reduces the computational overhead significantly such that the abstraction of a path can be computed in polynomial time [4]. However, it must also be noted that two abstraction results of a sub-path can be merged if and only if they are exactly the same. In our work, we showed that there are paths with four classes where we obtain different results after the exhaustive abstraction. Therefore, in the worst case, merging would not be applicable to the abstraction of a long path.

As a result, abstracting long paths is a computationally expensive process as shown in this section. Although Egyed introduced two strong optimization techniques that reduce the number of intermediate results significantly under some conditions, there is still a need to handle exponentially many abstraction steps unless a precedence rule for the set of abstraction rules is proposed.

3.3 Reducing the Set of Abstraction Rules

As we mentioned in Section 2.4, Egyed exhaustively creates the set of rules for all possible combinations of UML relationships as shown in Table 1. As a result, Egyed states that some abstraction rules have mirror images in a way that there are duplicates, which actually imply the same abstraction [4]. Specifically, Egyed proposes two different rules for two paths, which are syntactically different, but semantically equivalent. For instance, Egyed defines Rule #3 such that there must be an instance of AssociationRight between classes S and T after the abstraction as shown in Figure 13 [4].

Rule #3: GeneralizationRight – Class – AssociationRight -> AssociationRight

S X T

(a) A path to be abstracted

S T

(b) The abstraction of the path in (a)

Figure 13. The abstraction of a path according to Rule #3 defined by Egyed.

On the other hand, as illustrated in Figure 14, Egyed proposes Rule #60 for a path that is syntactically different from, but semantically equivalent to, the input path in Figure 13 [4].

Rule #60: AssociationLeft – Class – GeneralizationLeft -> AssociationLeft

(a) A path to be abstracted: T, X, S

(b) The abstraction of the path in (a): T, S

Figure 14. The abstraction of a path according to Rule #60 defined by Egyed.


As demonstrated in both Figure 13 and Figure 14, either Rule #3 or Rule #60 can be applied depending on the orientation of the path, although input paths in both figures are semantically equivalent. Note that the abstraction result obtained by Rule #3 is also equivalent to the abstraction computed by Rule #60.

As a result, we propose a different notation to represent a short path so that duplicates in Egyed’s rule set are eliminated. In other words, each short path is abstracted by a unique rule regardless of its orientation.

Specifically, we indicate the direction of a relationship, which is associated with the non-concept class in a given path, by using Incoming and Outgoing keywords where the non-concept class is the reference point. Likewise, the direction of the resulting relationship, which is created after filtering the non-concept class, is specified in the same way. Table 7 demonstrates how we represent a UML relationship with respect to the non-concept class in a given short path.

Table 7. Representation of a UML relationship with respect to the non-concept class.

Our Named Relation            UML Relationship in a Class Diagram
IncomingGeneralization        (diagram)
OutgoingGeneralization        (diagram)
Association                   (diagram)
IncomingAssociation           (diagram)
OutgoingAssociation           (diagram)
IncomingDependency            (diagram)
OutgoingDependency            (diagram)
[Agg]Association              (diagram)
Association[Agg]              (diagram)
[Agg]IncomingAssociation      (diagram)
[Agg]OutgoingAssociation      (diagram)

The illustration in Figure 15 shows the structure of the abstraction rule that we propose to abstract the input path in Figure 13.

Our Rule: IncomingGeneralization – Class – OutgoingAssociation -> OutgoingAssociation

(a) A path to be abstracted: S, X, T

(b) The abstraction of the path in (a): S, T

Figure 15. New syntax to represent an abstraction rule.

In Figure 15, X is the parent of S; therefore, the generalization between S and X is represented as IncomingGeneralization with respect to X. On the other hand, the association between X and T is a one-way association whose direction of navigation is from X to T. Thus, it is represented as OutgoingAssociation with respect to X. Note that the association created between S and T after the abstraction is a one-way association represented as OutgoingAssociation. Since the original reference class (X) in the input path is removed, we set S as the new reference class. The direction of the resulting relationship is determined with respect to the other end of the incoming relation of the non-concept class (i.e., S is the other end of the generalization (IncomingGeneralization) between S and X, so the direction of the association (OutgoingAssociation) between S and T after the abstraction is set with respect to S).

Note that there is no need for a secondary rule to abstract the path in Figure 14, because the rule defined in Figure 15 can be used for the abstraction. The reason is that the generalization between S and X in Figure 14 is an instance of IncomingGeneralization with respect to X. Similarly, the association between X and T is an instance of OutgoingAssociation, again with respect to X. As a result, the structure of the path in Figure 14 matches the rule IncomingGeneralization – Class – OutgoingAssociation -> OutgoingAssociation, and there is no need to introduce an additional one. Note that the association between S and T in Figure 14 is an instance of OutgoingAssociation, which is the same as the output computed by the rule introduced in Figure 15.

As a result, the notation we use to represent short paths helps us simplify the set of rules proposed by Egyed. Specifically, we introduce 66 abstraction rules that cover all possible combinations of the UML relationships defined in Table 1 in a short path. Note that not every rule in Egyed’s set has a dual (e.g., Rule #5: GeneralizationRight – Class – GeneralizationRight -> ). The complete list of newly introduced rules can be seen in Appendix A.
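The way the new notation collapses mirror-image rules can be illustrated with a short Python sketch. It is a hedged illustration rather than the XARules implementation: the function names relative_name and canonical_path are hypothetical, only the directed Generalization and Association names quoted in this section are handled, and sorting the two relation names alphabetically is an assumed convention for picking one representative per pair. The sketch covers only the input side of a rule; the direction of the resulting relationship is determined as described in the discussion of Figure 15.

```python
# Relation kinds whose Left/Right variants are quoted in this section.
DIRECTED_KINDS = ("Generalization", "Association")


def relative_name(egyed_name, side):
    """Rename a Left/Right relation relative to the non-concept class.

    side is "left" when the relation sits to the left of the non-concept
    class in the path, and "right" when it sits to its right.
    """
    for kind in DIRECTED_KINDS:
        for arrow in ("Left", "Right"):
            if egyed_name == kind + arrow:
                # The arrow reaches the middle class when it points right on
                # the left side of the path, or left on the right side.
                points_at_middle = (arrow == "Right") == (side == "left")
                prefix = "Incoming" if points_at_middle else "Outgoing"
                return prefix + kind
    return egyed_name  # assumption: other names are left unchanged


def canonical_path(left_relation, right_relation):
    """Return one orientation-independent key for a short path."""
    names = (
        relative_name(left_relation, "left"),
        relative_name(right_relation, "right"),
    )
    return tuple(sorted(names))  # assumed ordering convention


rule_3 = canonical_path("GeneralizationRight", "AssociationRight")
rule_60 = canonical_path("AssociationLeft", "GeneralizationLeft")
```

Under these assumptions, rule_3 and rule_60 are the same key, ("IncomingGeneralization", "OutgoingAssociation"), so a single rule covers both orientations of the path.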


4 Tool Support: XARules

To support our approach, we implemented an automated tool named XARules. XARules is a Java-based plug-in built for Poseidon Professional Edition 6.02. Poseidon is a commercial software tool that allows a developer to create UML models [18]. The main functionality of XARules is to provide a developer with the domain model of a software system in the form of a UML class diagram. Figure 16 shows the high-level workflow of XARules.

Workflow steps shown in Figure 16: import the source code; identify each concept and non-concept class; create the initial class diagram (model); select abstraction rules to be applied; discover all possible paths in the model; apply the heuristic to discover ambiguous paths; output the high-level class diagram.

Figure 16. The high-level workflow of XARules.

The whole process starts with importing the source code of a legacy software system. Once Poseidon parses the source code, it generates a meta-model that contains all classes (both concept and non-concept classes) with the relationships among them as defined in the imported software system. Having created the initial model, XARules marks each class in the initial class diagram as either concept or non-concept according to its classification result, which is produced by the classifier specified in Carey and Gannod’s work [3]. Upon completing the identification of concept and non-concept classes, XARules creates the initial low-level class diagram of the given legacy software system.

Having obtained the initial class diagram, a developer can analyze and modify it if needed (e.g., by manually changing the classification result of a given class). The developer can then move on to the abstraction process by first selecting a subset of rules from the complete list (i.e., the 66 rules defined in Section 3.3). In other words, XARules allows a developer to focus on the application of certain rules only. Once the set of rules to be applied is selected, XARules discovers all paths between every two concept classes in the model. This is done with the classic breadth-first search technique, implemented so as to limit the computational overhead.
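The path discovery step can be sketched in Python as a breadth-first search over partial paths. This is a minimal illustration, not the XARules source: it assumes the model is an undirected adjacency dictionary, that a path of interest starts and ends at a concept class and passes only through non-concept classes, and that the function name paths_between_concepts is hypothetical. The sample graph uses the class names of the University Management System example discussed later in this section; its edges are an assumption inferred from that description.

```python
from collections import deque


def paths_between_concepts(graph, concepts, source):
    """Return simple paths from `source` to other concept classes.

    graph    -- dict mapping a class name to the list of its neighbours
    concepts -- set of class names marked as concept classes
    Interior classes of every returned path are non-concept classes.
    """
    found = []
    queue = deque([(source,)])
    while queue:
        path = queue.popleft()
        for neighbour in graph[path[-1]]:
            if neighbour in path:
                continue  # keep paths simple (no repeated class)
            if neighbour in concepts:
                found.append(path + (neighbour,))  # reached another concept
            else:
                queue.append(path + (neighbour,))  # extend through a non-concept
    return found


# Assumed edges, inferred from the description of the example model.
GRAPH = {
    "Instructor": ["Department"],
    "Department": ["Instructor", "School"],
    "School": ["Department", "Student"],
    "Student": ["School", "Enrollment"],
    "Enrollment": ["Student", "Course"],
    "Course": ["Enrollment"],
}
CONCEPTS = {"Instructor", "Course", "Student"}

instructor_paths = paths_between_concepts(GRAPH, CONCEPTS, "Instructor")
```

For this sample graph, the only path found from Instructor is Instructor, Department, School, Student.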

After discovering all possible paths, XARules applies a heuristic to each path to determine whether an ambiguity exists (i.e., whether all possible ways of abstracting the path produce the same result), so that the developer can abstract an ambiguous path manually. As discussed in Section 3.2, detecting ambiguity in a serial abstraction is computationally intractable for long paths. However, as also mentioned in Section 3.2, we observed a generalization relationship between two non-concept classes in every path with four classes that has two different abstraction results after applying rules serially. This leads us to claim that any path containing a generalization relationship between two non-concept classes is a potential candidate for having different abstraction results. The rationale is that there is at least one way of abstracting such a path that leads to a sub-path with four classes in which a generalization relationship exists between two non-concept classes (recall from Section 3.2 that the length of the path is reduced by one after every iteration of the serial abstraction). This heuristic can determine only a subset of the paths in the model whose abstractions result in an ambiguity. Even so, it can be considered an improvement from the user’s point of view. After applying the proposed heuristic, XARules abstracts the remaining paths linearly: it applies abstraction rules to a given path starting from the source concept class and removes non-concept classes sequentially until all of them are filtered.
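The heuristic and the linear abstraction of the remaining paths can be sketched in Python as follows. This is an illustration under stated assumptions, not the XARules source: a path is a list of class names together with a list of relation names between consecutive classes, the function names are hypothetical, and the sample rule table contains only Rule #3 as quoted in Section 3.3.

```python
def potentially_ambiguous(classes, relations, concepts):
    """Flag a path that has a generalization between two non-concept classes.

    classes   -- class names along the path, e.g. ["S", "X1", "X2", "T"]
    relations -- relation names; relations[i] links classes[i] and classes[i + 1]
    concepts  -- set of class names marked as concept classes
    """
    for i, relation in enumerate(relations):
        both_non_concept = (
            classes[i] not in concepts and classes[i + 1] not in concepts
        )
        if "Generalization" in relation and both_non_concept:
            return True
    return False


def abstract_linearly(relations, rules):
    """Remove non-concept classes one by one, starting from the source class.

    rules maps (left_relation, right_relation) to the resulting relation.
    A missing rule raises KeyError, since no abstraction is defined for it.
    """
    current = relations[0]
    for following in relations[1:]:
        current = rules[(current, following)]
    return current


# Only Rule #3 from Section 3.3; a full table would list every rule.
SAMPLE_RULES = {("GeneralizationRight", "AssociationRight"): "AssociationRight"}

flagged = potentially_ambiguous(
    ["S", "X1", "X2", "T"],
    ["Association", "GeneralizationRight", "Association"],
    {"S", "T"},
)
result = abstract_linearly(["GeneralizationRight", "AssociationRight"], SAMPLE_RULES)
```

In this sketch, the first path is flagged because its generalization links X1 and X2, both non-concept classes, while the short path for Rule #3 is abstracted directly to AssociationRight. A flagged path would be handed to the developer instead of being abstracted linearly.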

Finally, XARules outputs a class diagram that contains only concept classes and relationships among them. Then, the developer can analyze the final output and inspect the proposed domain model to see whether resulting relationships among concept classes are reasonable compared to the ones in the original low-level model.


The model in Figure 17 was created using XARules. As shown in the figure, the class diagram partially models a simple, hypothetical University Management System. A concept class in the diagram is marked with the stereotype <<CONCEPT>>, whereas a non-concept class is marked with <<NON_CONCEPT>>. In this example, the developer aims to abstract the initial model so as to focus only on the concepts named Instructor, Course, and Student and the relationships among them.

Figure 17. An example from XARules.

Application of selected abstraction rules (see Abstraction Rules tab in Figure 17) on the initial low-level class diagram in Figure 17 results in the high-level model shown in Figure 18. Note that all non-concept classes (i.e., Department, School, and Enrollment) are eliminated. In addition to this, a new, direct relationship is created between Instructor and Student due to the abstraction of the path containing classes Instructor, Department, School, and Student. Similarly, another new relationship is created between Course and Student after abstracting Enrollment and its associated relationships.


Figure 18. Abstraction of the low-level model in Figure 17.


5 Conclusion and Future Investigations

In this work, we focused mainly on the semantics of the abstraction rules proposed by Egyed to abstract UML class diagrams and on the inconsistencies in some of them. First, we analyzed each rule individually. As discussed in Section 3.1, we demonstrated that some of Egyed’s rules (related to dependency and generalization) do not produce the right abstraction of a given short path. Furthermore, we proposed a semantically valid abstraction for each of those cases.

After the semantic analysis, we showed that there exists an ambiguity in the serial abstraction (see Section 3.2). In addition, we proved that this ambiguity causes inconsistent abstraction results for class diagrams. Thus, we claimed that the valid abstraction of a given class diagram can be achieved by eliminating the ambiguous ways of applying abstraction rules. However, as proved in Section 3.2, the identification of a semantically valid abstraction is computationally expensive in the worst case.

In Section 3.3, we showed that Egyed proposes two different abstraction rules for two paths that are syntactically different but semantically equivalent. As a result, we introduced a new notation for representing an abstraction rule, with which we reduced the total number of rules to 66, whereas Egyed originally proposed 121 rules. Finally, we concluded our work by demonstrating XARules, a Java-based plug-in implemented for Poseidon Professional Edition 6.02.

As mentioned in our work, the computational overhead of the serial abstraction is the biggest problem to be handled in order to provide the correct abstraction of a given class diagram. One possible way of solving that problem is to introduce an order of precedence for the abstraction rules, so that we could avoid both the large number of intermediate abstraction steps and differing final abstraction results. Thus, we consider extending our current work by discovering a set of precedence rules.

Another possible direction for future work is to introduce a way of visualizing large-scale class diagrams. As class diagrams become more complex, it becomes harder to illustrate them in a two-dimensional plane, even if the initial, low-level class diagram is abstracted. Thus, there is a need for an adequate visualization technique that lays out a given concept class with its associated relationships according to its information content.


References

[1] Object Management Group. Unified Modeling Language Specification (Action Semantics) – UML 1.4 with Action Semantics, Final Adopted Specification, January 2002. On-line at http://www.omg.org/uml.
[2] Kollmann, R., Selonen, P., Stroulia, E., Systa, T., and Zundorf, A. A Study on the Current State of the Art in Tool-Supported UML-Based Static Reverse Engineering. In Proceedings of the 9th Working Conference on Reverse Engineering (WCRE’02), 2002.
[3] Carey, M., and Gannod, G. C. Recovering Concepts from Source Code with Automated Concept Identification. In Proceedings of the 15th IEEE International Conference on Program Comprehension, June 2007, pp. 27-36.
[4] Egyed, A. Automated Abstraction of Class Diagrams. ACM Transactions on Software Engineering and Methodology (TOSEM), 11:4, October 2002, pp. 449-491.
[5] Rumbaugh, J., Jacobson, I., and Booch, G. The Unified Modeling Language Reference Manual. Addison Wesley, 1999.
[6] Ambler, S. W. UML 2 Class Diagrams. AgileModeling, April 3rd, 2006. On-line at http://www.agilemodeling.com/artifacts/classDiagram.htm.
[7] Booch, G., Rumbaugh, J., and Jacobson, I. The Unified Modeling Language User Guide. Addison Wesley, 1999.
[8] Hsi, I., Potts, C., and Moore, M. Ontological excavation: unearthing the core concepts of the application. In Proceedings of the 10th Reverse Engineering Working Conference, 2003, pp. 345-353.
[9] Metzger, A. A Systematic Look at Model Transformation. In Model Driven Software Development (MDSD2), ed. Beydeda, S., Book, M., and Gruhn, V., Springer Verlag, 2005, pp. 21-33 (book chapter).
[10] Rekoff, M. G. On Reverse Engineering. IEEE Trans. Systems, Man, and Cybernetics, March-April 1985, pp. 244-252.
[11] Chikofsky, E. J., and Cross, J. H. Reverse Engineering and Design Recovery: A Taxonomy. IEEE Software, 7(1):13-17, 1990.
[12] Biggerstaff, T. J., Mitbander, B. G., and Webster, D. E. Program understanding and the concept assignment problem. Communications of the ACM, 37(5):72-82, 1994.
[13] Wasserman, S., and Faust, K. Social Network Analysis. Cambridge: Cambridge University Press, 1994.
[14] Milanova, A. Precise Identification of Composition Relationships for UML Class Diagrams. In Int’l Conference on Automated Software Engineering, 2005.
[15] Potter, J., Noble, J., and Clarke, D. The Ins and Outs of Objects. In Australian Software Engineering Conference, 1998.
[16] Gueheneuc, Y. A Reverse Engineering Tool for Precise Class Diagrams. In Proceedings of the Annual International Conference on Computer Science and Software Engineering (CASCON), 2004.
[17] Sutton, A., and Maletic, J. I. Recovering UML class models from C++: A detailed explanation. Information and Software Technology, 49:212-229, 2007.
[18] Gentleware. Poseidon for UML. On-line at http://www.gentleware.com.


Appendix A: The Proposed Abstraction Rules

Table 8. The reduced set of abstraction rules with the new syntax discussed in Section 3.3.

Rule #   Rule Structure
1    Association – Class – Association -> Association
2    Association – Class – [Agg]OutgoingAssociation -> OutgoingAssociation
3    Association – Class – OutgoingAssociation -> OutgoingAssociation
4    Association – Class – OutgoingDependency -> OutgoingDependency or
5    Association – Class – OutgoingGeneralization
6    Association[Agg] – Class – Association -> Association
7    Association[Agg] – Class – [Agg]Association -> Association
8    Association[Agg] – Class – [Agg]OutgoingAssociation -> OutgoingAssociation
9    Association[Agg] – Class – OutgoingAssociation -> OutgoingAssociation
10   Association[Agg] – Class – OutgoingDependency -> OutgoingDependency or
11   Association[Agg] – Class – OutgoingGeneralization
12   [Agg]Association – Class – Association -> Association
13   [Agg]Association – Class – Association[Agg] -> Association
14   [Agg]Association – Class – [Agg]Association -> [Agg]Association
15   [Agg]Association – Class – [Agg]OutgoingAssociation -> [Agg]OutgoingAssociation
16   [Agg]Association – Class – OutgoingAssociation -> OutgoingAssociation
17   [Agg]Association – Class – OutgoingDependency -> OutgoingDependency or
18   [Agg]Association – Class – OutgoingGeneralization
19   [Agg]IncomingAssociation – Class – Association -> OutgoingAssociation
20   [Agg]IncomingAssociation – Class – Association[Agg] -> OutgoingAssociation
21   [Agg]IncomingAssociation – Class – [Agg]Association -> [Agg]OutgoingAssociation
22   [Agg]IncomingAssociation – Class – [Agg]OutgoingAssociation -> [Agg]OutgoingAssociation
23   [Agg]IncomingAssociation – Class – [Agg]IncomingAssociation
24   [Agg]IncomingAssociation – Class – OutgoingAssociation -> OutgoingAssociation
25   [Agg]IncomingAssociation – Class – OutgoingDependency -> OutgoingDependency or
26   [Agg]IncomingAssociation – Class – OutgoingGeneralization
27   IncomingAssociation – Class – Association -> OutgoingAssociation
28   IncomingAssociation – Class – Association[Agg] -> OutgoingAssociation
29   IncomingAssociation – Class – [Agg]Association -> OutgoingAssociation
30   IncomingAssociation – Class – [Agg]OutgoingAssociation -> OutgoingAssociation
31   IncomingAssociation – Class – IncomingAssociation
32   IncomingAssociation – Class – [Agg]IncomingAssociation
33   IncomingAssociation – Class – OutgoingAssociation -> OutgoingAssociation
34   IncomingAssociation – Class – OutgoingDependency -> OutgoingDependency or
35   IncomingAssociation – Class – OutgoingGeneralization
36   IncomingDependency – Class – Association -> OutgoingDependency or
37   IncomingDependency – Class – Association[Agg] -> OutgoingDependency or
38   IncomingDependency – Class – [Agg]Association -> OutgoingDependency or
39   IncomingDependency – Class – [Agg]OutgoingAssociation -> OutgoingDependency or
40   IncomingDependency – Class – IncomingAssociation
41   IncomingDependency – Class – [Agg]IncomingAssociation
42   IncomingDependency – Class – IncomingDependency
43   IncomingDependency – Class – OutgoingAssociation -> OutgoingDependency or
44   IncomingDependency – Class – OutgoingDependency -> OutgoingDependency or
45   IncomingDependency – Class – OutgoingGeneralization
46   IncomingGeneralization – Class – Association -> Association
47   IncomingGeneralization – Class – Association[Agg] -> Association[Agg]
48   IncomingGeneralization – Class – [Agg]Association -> [Agg]Association
49   IncomingGeneralization – Class – [Agg]OutgoingAssociation -> [Agg]OutgoingAssociation
50   IncomingGeneralization – Class – IncomingAssociation -> IncomingAssociation
51   IncomingGeneralization – Class – [Agg]IncomingAssociation -> [Agg]OutgoingAssociation
52   IncomingGeneralization – Class – IncomingDependency -> IncomingDependency
53   IncomingGeneralization – Class – IncomingGeneralization
54   IncomingGeneralization – Class – OutgoingAssociation -> OutgoingAssociation
55   IncomingGeneralization – Class – OutgoingDependency -> OutgoingDependency
56   IncomingGeneralization – Class – OutgoingGeneralization -> OutgoingGeneralization
57   OutgoingAssociation – Class – [Agg]OutgoingAssociation
58   OutgoingAssociation – Class – OutgoingAssociation
59   [Agg]OutgoingAssociation – Class – [Agg]OutgoingAssociation
60   OutgoingDependency – Class – [Agg]OutgoingAssociation
61   OutgoingDependency – Class – OutgoingAssociation
62   OutgoingDependency – Class – OutgoingDependency
63   OutgoingGeneralization – Class – [Agg]OutgoingAssociation
64   OutgoingGeneralization – Class – OutgoingAssociation
65   OutgoingGeneralization – Class – OutgoingDependency
66   OutgoingGeneralization – Class – OutgoingGeneralization


Appendix B: Ambiguous Paths with four Classes

Table 9. Ambiguous paths that are discovered in the experiment discussed in Section 3.2.

Each ambiguously abstracted path has the form S, X1, X2, T, and each abstraction result relates S and T. The path and result diagrams are not reproduced; for each case, the two columns list the rules applied to obtain the two different abstraction results (1).

Case   First abstraction result   Second abstraction result
1      Rule #17, Rule #6          Rule #5
2      Rule #18, Rule #7          Rule #5
3      Rule #20, Rule #7          Rule #5
4      Rule #21, Rule #7          Rule #5
5      Rule #22, Rule #8          Rule #5
6      Rule #19, Rule #8          Rule #5
7      Rule #6, Rule #17          Rule #12
8      Rule #7, Rule #18          Rule #12
9      Rule #9, Rule #20          Rule #12
10     Rule #10, Rule #21         Rule #12
11     Rule #11, Rule #22         Rule #12
12     Rule #8, Rule #19          Rule #12
13     Rule #5                    Rule #23, Rule #27
14     Rule #12                   Rule #27, Rule #23
15     Rule #13                   Rule #27, Rule #24
16     Rule #14                   Rule #27, Rule #25
17     Rule #20, Rule #29         Rule #27, Rule #31
18     Rule #21, Rule #29         Rule #27, Rule #32
19     Rule #22, Rule #30         Rule #27, Rule #33
20     Rule #15                   Rule #27, Rule #26
21     Rule #6, Rule #39          Rule #34
22     Rule #7, Rule #40          Rule #34
23     Rule #9, Rule #42          Rule #34
24     Rule #10, Rule #43         Rule #34
25     Rule #11, Rule #44         Rule #34
26     Rule #8, Rule #41          Rule #34
27     Rule #5                    Rule #45, Rule #49
28     Rule #12                   Rule #49, Rule #45
29     Rule #13                   Rule #49, Rule #46
30     Rule #14                   Rule #49, Rule #47
31     Rule #20, Rule #51         Rule #49, Rule #53
32     Rule #21, Rule #51         Rule #49, Rule #54
33     Rule #22, Rule #52         Rule #49, Rule #55
34     Rule #15                   Rule #49, Rule #48
35     Rule #6, Rule #61          Rule #56
36     Rule #7, Rule #62          Rule #56
37     Rule #9, Rule #64          Rule #56
38     Rule #10, Rule #65         Rule #56
39     Rule #11, Rule #66         Rule #56
40     Rule #38, Rule #63         Rule #56
41     Rule #5                    Rule #67, Rule #71
42     Rule #12                   Rule #71, Rule #67
43     Rule #13                   Rule #71, Rule #68
44     Rule #14                   Rule #71, Rule #69
45     Rule #20, Rule #73         Rule #71, Rule #75
46     Rule #21, Rule #73         Rule #71, Rule #76
47     Rule #22, Rule #74         Rule #71, Rule #77
48     Rule #15                   Rule #71, Rule #70
49     Rule #6, Rule #83          Rule #78
50     Rule #7, Rule #84          Rule #78
51     Rule #9, Rule #86          Rule #78
52     Rule #10, Rule #87         Rule #78
53     Rule #11, Rule #88         Rule #78
54     Rule #8, Rule #85          Rule #78
55     Rule #5                    Rule #89, Rule #71
56     Rule #6, Rule #94          Rule #89, Rule #72
57     Rule #7, Rule #95          Rule #89, Rule #73
58     Rule #9, Rule #97          Rule #89, Rule #75
59     Rule #10, Rule #98         Rule #89, Rule #76
60     Rule #11, Rule #99         Rule #89, Rule #77
61     Rule #8, Rule #96          Rule #89, Rule #74
62     Rule #12                   Rule #93, Rule #89
63     Rule #13                   Rule #93, Rule #90
64     Rule #14                   Rule #93, Rule #91
65     Rule #20, Rule #95         Rule #93, Rule #97
66     Rule #21, Rule #95         Rule #93, Rule #98
67     Rule #22, Rule #96         Rule #93, Rule #99
68     Rule #15                   Rule #93, Rule #92
69     Rule #5                    Rule #100, Rule #49
70     Rule #6, Rule #105         Rule #100, Rule #50
71     Rule #7, Rule #106         Rule #100, Rule #51
72     Rule #9, Rule #108         Rule #100, Rule #53
73     Rule #10, Rule #109        Rule #100, Rule #54
74     Rule #11, Rule #110        Rule #100, Rule #55
75     Rule #8, Rule #107         Rule #100, Rule #52
76     Rule #12                   Rule #104, Rule #100
77     Rule #13                   Rule #104, Rule #101
78     Rule #14                   Rule #104, Rule #102
79     Rule #20, Rule #106        Rule #104, Rule #108
80     Rule #21, Rule #106        Rule #104, Rule #109
81     Rule #22, Rule #107        Rule #104, Rule #110
82     Rule #15                   Rule #104, Rule #103
83     Rule #5                    Rule #111, Rule #49
84     Rule #6, Rule #116         Rule #111, Rule #50
85     Rule #7, Rule #117         Rule #111, Rule #51
86     Rule #9, Rule #119         Rule #111, Rule #53
87     Rule #10, Rule #120        Rule #111, Rule #54
88     Rule #11, Rule #121        Rule #111, Rule #55
89     Rule #8, Rule #118         Rule #111, Rule #52
90     Rule #12                   Rule #115, Rule #111
91     Rule #13                   Rule #115, Rule #112
92     Rule #14                   Rule #115, Rule #113
93     Rule #20, Rule #117        Rule #115, Rule #119
94     Rule #21, Rule #117        Rule #115, Rule #120
95     Rule #22, Rule #118        Rule #115, Rule #121
96     Rule #15                   Rule #115, Rule #114

(1) Two different abstraction results are shown with the order of applying rules that are defined in Egyed’s work [4].