
Visualizing Legacy Systems with UML

Abstract

Understanding a system is of critical importance to a developer. A developer must be able to understand the business processes being modelled by the system along with the system’s functionality, structure, events, and interactions with external entities. This understanding is even more important in reverse engineering. Although developers have the advantage of having the source code available, system documentation is often missing or incomplete, and the original users, whose requirements were used to design the system, are often long gone.

0. Introduction

A developer requires a model of the system in order to understand not only the business processes being modelled but also the structure and dynamics of the system. Visualization of this system model is often necessary in order to clearly depict the complex relationships among its model elements. This chapter explains why visualization of a system, through a series of UML diagrams, is necessary and why relying on source code as the sole form of system documentation is often inadequate. A detailed outline of the process of deriving information from an analysis of a legacy system is given, along with the rules for converting this information to a UML model of the system. This chapter also investigates whether it is possible to visualize a system and which of the nine possible UML diagrams are needed to represent this visualization. A brief overview is provided of the methods used to extract UML diagrams from a legacy system, along with their inherent difficulties. Finally, a tool, TAGDUR, is introduced, which has automated some of these analysis processes and which models some aspects of the system in UML.

0.1. Why Not Rely on Source Code for System Documentation?

Relying solely on source code to obtain an understanding of the system has many disadvantages, particularly in legacy systems. In many legacy systems, the original design has been obscured by the many incremental changes made during the system’s long maintenance history. Furthermore, the end-users, whose requirements originally helped shape the system to meet their business needs, have usually long gone, and the documentation outlining these requirements is often missing. Without these original end-users and without documentation, it is often difficult to determine the exact business processes that these systems model. Source code is very programming-language-dependent (Yang, 1999): in order to understand the code, the developer must be fully proficient in the programming language used to develop the system. The function and role of each section of source code within the system may be obvious to a developer but meaningless to a non-technical end-user. End-users want to see how the business processes that the system represents are modelled, and they want to ensure that all of their business requirements are met by the system; they are not concerned with its internal design and details. It is difficult for developers to see parallel data and control flows by reading the code. The control logic, especially if it is heavily nested, is difficult to visualize from the source code; in particular, it is hard to identify quickly which control constructs affect which parts of the code. It is also difficult to visualize events occurring in various parts of the code and how these events interact with various objects in the system. Relying on source code as the only documentation source makes it difficult to view the interaction of system objects with other objects and external actors. Source code encompasses many perspectives (such as objects, deployment of components, and timing of object interactions) within itself.
These multiple perspectives are confusing because it is difficult to view the source code from each perspective separately. Source code has the additional disadvantage that it is difficult to derive abstract concepts and behaviour from low-level, detailed code.

0.2. UML

One way to overcome this problem of requiring multiple perspectives of the same system is to visualize the system using a graphical notation. Each perspective is given its own diagram type, which is specialized to best represent that perspective. One of the most common graphical notations is UML (Unified Modelling Language). UML provides multiple perspectives of the system. Use case diagrams model the business processes embodied in the system from a user’s perspective. Statecharts and class diagrams model the behaviour and structure of the system respectively; these would be of most interest to developers.

0.3. Why Visualization of the System is Necessary for Reverse Engineering?

Program understanding can be defined as the process of developing an accurate mental model of a software system’s intended architecture, purpose, and behaviour. This model is developed through the use of pattern matching, visualisation, and knowledge-based techniques. The field of program understanding involves fundamental issues of code representation, structural system representation, data and control flow, quality and complexity metrics, localisation of and plans, identification of abstract data types and generic operations, and multiple view system analysis using visualisation and analysis tools. Reverse engineering involves analysing data bindings, common reference analysis, similarity analysis, and subsystem and redundancy analysis (Whitney,1995). Program understanding involves the use of tools that perform complex reasoning about how behaviours and properties arise from particular combinations of language primitives within the program. One method of program understanding is to use visitors, small reusable classes whose function is to parse the source system and evaluate the combinations of language primitives discovered during parsing (Wills,1993). Other methods try to evaluate and understand a system by taking, as input, a specification of the goals and purpose of the system (Johnson, 1986). Another method is to use clichés: commonly used data structures and algorithms are recognised and then matched to higher-level abstractions. Examples of clichés are the structures and algorithms associated with hash tables and priority queues. The degree of accuracy required of the matching varies with the goal of program understanding. For example, an exact match is needed for program verification, while only a reasonably close match is needed for documentation purposes.
Software visualisation is a technique that enables humans to make analogies and to link a visual software representation with the ideas that this representation portrays. This link would be much more difficult to make if the software representations were purely textual. Software visualisation relies on crafts such as animation, graphic design, and cinematography (Price,1992). It has been used for decades to help developers understand programs. These techniques vary from [Goldstein] to animated graphical representations of data structures (Baecker, 1981). However, many of these software visualisation systems are limited to displaying one type of data or one level of abstraction. Few have the ability to suppress lower-level detail in order to depict higher-level concepts of the system. Program visualisation systems or tools can be characterised according to their scope, content, form, method, interaction, and effectiveness. Scope refers to the visualisation system’s general characteristics, such as whether it models concurrent programs or whether there are any limits on the size of the system being depicted. Content refers to what is being visualised: some visualisation systems can model both data and code, while others model algorithms only. Form refers to the elements used in the visualisation: some systems use animated graphics, while others provide multiple views of different parts of the system. Method refers to how the tool specifies the visualisation, for example whether the program source code must be modified in order to be visualised. Some tools require the user to insert special statements into the code of interest in order for this code to be properly visualised. Interaction refers to how the user interacts with and controls the visualisation.
How does the user navigate through the visualisation of a large system in order to see how different parts of the system are being modelled? Effectiveness refers to how well the visualisation communicates information regarding the system being visualised (Price,1992). Using this taxonomy outlined by Price, a number of program visualisation systems can be objectively evaluated. The film Sorting Out Sorting is an animated visualisation that explains algorithms. Balsa, another visualisation tool, generates animations of Pascal programs. LogoMotion allows users to indicate which aspects of a Logo program they wish to have visualised.
Chikofsky and Cross define reverse engineering to be “the process of analyzing a subject system to identify the system’s components and their interrelationships and create representations of the system in another form or at a higher level of abstraction”. They state six goals of reverse engineering (Chikofsky,1990):
• controlling complexity
• generating alternative views
• recovering lost information
• detecting side effects
• synthesizing higher abstractions
• facilitating reuse
A reverse engineering tool, TAGDUR, which we outline later in this chapter, tries to accomplish at least three of these goals. TAGDUR tries to control complexity by encapsulating formerly globally-scoped variables and procedures into classes. It tries to synthesize higher abstractions by incorporating classes into various abstract representations, such as sequence diagrams, to model system behaviour. It generates alternative views by producing UML diagrams, such as sequence diagrams or statecharts, to model the structure and behaviour of the system. Creating a model of a system is a useful way to understand the system for several reasons. One reason is that the system itself is a model of an external business process; because this model already exists, creating further models is less problematic.
Another reason is that programs are constructed from executable and data structures; these existing entities can be extracted and analyzed in order to produce a model of the structure and dynamics of the system (Hall, 1992; Rugaber,1993). Rugaber identifies the understanding of a software system, through the construction of a mental model of the system, as an important part of any task involving a software system. Much work in reverse engineering is devoted to explicit knowledge representation of a software system (Van,1992; Rajlich,1992). This explicit knowledge representation results in a set of well-defined system models, invaluable to those undertaking any task on the software system, and in better prediction of the expected results of reverse engineering. This knowledge representation may be in multiple formats, such as textual or graphical notation. Graphical notations have the advantage over textual formats that they can more clearly depict complex relationships between model elements, such as classes (Rugaber,1993). Rumbaugh et al. identify three viewpoints necessary for understanding a software system: the objects manipulated by the computation (the data model), the manipulations themselves (the functional model), and how the manipulations are organized and synchronized (the dynamic model) (Rumbaugh,1991). Rugaber states that most of the representation techniques outlined below emphasize one of these views.

Representation                 References
Object-Oriented Frameworks     (Johnson,1988)
Category Theory                (Srinivas,1991)
Concept Hierarchies            (Biggerstaff,1989; Lubara,1991)
Mini-languages                 (Neighbors,1984; Arango,1986)
Database languages             (Chen,1986)
Narrow spectrum languages      (Webster,1987)
Wide Spectrum languages        (Wile,1987; Ward,1989)
Knowledge Representation       (Barstow,1985)
Text

UML, through its various diagrams, encompasses all of Rumbaugh’s views. The data model is represented by UML’s class and object diagrams. The functional model is represented by and state diagrams.
The dynamic model is represented by sequence and collaboration diagrams.

0.4. Advantages of UML

UML has many advantages as a graphical modelling notation. UML encompasses many earlier modelling notations that are well understood and well accepted by the software development community. For example, UML uses statecharts, a modification of traditional state transition diagrams, to model system behaviour in terms of states, transitions, and events. Consequently, unlike with many other graphical notations, the learning curve for users who are already familiar with earlier notations, such as state transition diagrams, may be much less steep. UML has a core set of concepts that remain unchanged, but it provides a mechanism to extend this core set with new concepts and notations. UML allows concepts, notations, and constraints to be specialized for a particular domain. UML is independent of implementation and development process (D’Souza,1999). In other words, UML does not depend on any particular programming language, nor on a particular development process such as the waterfall model. UML addresses recurring architectural complexity problems using component technology, visual programming, patterns, and frameworks. These problems include physical distribution, concurrency, replication, security, load balancing, and fault tolerance. In order for developers to reach the same understanding and interpretation of a model when it is exchanged among different technologies and tools, a common model exchange format is needed. XML is often used for the canonical mapping of the UML model to the data interchange format because of XML’s hierarchical structure and linking abilities, which enable it to represent a wide variety of information structures. Schema conversion rules dictate how an instance of the input data structure, in this case a UML model, is converted into instances of the output data structure, in this case an XML document.
These rules are used in generating a DTD file and, by following them, the UML model is incorporated into an XML document. One set of standards for these schema conversion rules is XMI. XMI is an XML standard for exchanging UML models and is used by tools such as Rational Rose. Because XMI is cumbersome, in that it often requires a large XML document to represent a small UML model, other standards have been proposed, such as UXF. UXF (UML eXchange Format) is an XML-based model interchange format for UML models developed by Junichi Suzuki and Yoshikazu Yamamoto (Taentzer, 2001). Because UML is general purpose, expressive, simple, widely accepted, and extensible, it has wide applicability and usability (Muller,1997).
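The idea of a schema conversion rule, mapping each element of a UML model to an element of an XML document, can be illustrated with a minimal Python sketch. The element and attribute names used here ("Model", "Class", "Attribute", "Operation") and the sample "Payroll" model are assumptions made for illustration only; they follow neither the XMI nor the UXF schema.

```python
import xml.etree.ElementTree as ET

def class_to_xml(name, attributes, operations):
    """Rule: one UML class becomes one <Class> element (hypothetical schema)."""
    cls = ET.Element("Class", {"name": name})
    for attr_name, attr_type in attributes:
        # Rule: each attribute becomes a child <Attribute> element
        ET.SubElement(cls, "Attribute", {"name": attr_name, "type": attr_type})
    for op_name in operations:
        # Rule: each operation becomes a child <Operation> element
        ET.SubElement(cls, "Operation", {"name": op_name})
    return cls

def model_to_xml(model_name, classes):
    """Apply the class rule to every class and wrap the result in <Model>."""
    root = ET.Element("Model", {"name": model_name})
    for name, attributes, operations in classes:
        root.append(class_to_xml(name, attributes, operations))
    return ET.tostring(root, encoding="unicode")

# Hypothetical one-class model used only to exercise the rules
xml_text = model_to_xml(
    "Payroll",
    [("Employee", [("id", "Integer"), ("name", "String")], ["raiseSalary"])],
)
print(xml_text)
```

Even this single small class expands into several XML elements, which gives some feel for why a full interchange format can become verbose for small models.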

0.5. UML Diagrams

UML has nine different types of diagrams:
• use case
• class
• object
• sequence
• collaboration
• statechart
• activity
• component
• deployment
Use case diagrams emphasize the services and operations that a system offers to entities outside the system. They are often used to model the business processes that the system represents. Class diagrams emphasize the static structure of the system. Object diagrams model the static structure of a system at a particular point in its execution. Class and object diagrams differ in that classes model the structure of a system while objects are specific instances of that structure. Sequence diagrams model the messages exchanged over time among the objects within a system. Collaboration diagrams model the messages exchanged over time among the objects and their links within a system. Statecharts model how an object changes or responds to external stimuli within a system, that is, the changes of state that occur within a system due to messages. Activity diagrams model how an object changes or responds to internal processing within a society of objects; they model the flow of control and of information within a system. Component diagrams model the packaging of an object as a solution. Deployment diagrams model the deployment of objects within a society as a solution within an environment (Muller,1997).

[Figure: Different UML diagrams model different views. User View: use case diagrams. Structured View: class and object diagrams. Behavioral View: sequence, collaboration, statechart, and activity diagrams. Implementation View: component diagrams. Deployment View: deployment diagrams.]

[Figure: Different UML diagrams model different views at different levels of abstraction (Muller,1997). Panels: User Model View (use case diagrams); Behavioral Model View (sequence diagrams, scenarios); Structured Model View; Component Model View (component diagrams); Deployment Model View (deployment diagrams). Labels: Levels (System, SubSystem, Class, Method); Service or Operation; Actor or Subsystem or Class.]

0.6. Which UML Diagrams Are Needed to Fully Represent a System?

The nine diagrams of UML represent several different perspectives and views. Each diagram is designed to portray the information deemed to be of most interest to a particular set of users. Use case diagrams, which depict the external business environment and business processes modelled by the system, are of most interest to end-users, who are not interested in the system’s structure, behaviour, or deployment but want to ensure that their business processes are fully represented in the modelled system. Deployment and component diagrams are of most interest to system architects, who are typically concerned with the deployment of components within a specific architectural framework rather than with the lower-level details of structure and behaviour. Developers, on the other hand, are most interested in the internal structure and behaviour of the system. How many of these nine UML diagrams are needed to properly represent a system depends on several factors, including the type and size of the system. Large systems, which are composed of multiple software components, require component and deployment diagrams to reduce complexity and to depict the information most useful to system architects. In smaller systems, with only a few software components, these components can more easily be kept track of, and a separate diagram, in the form of a deployment diagram, is not needed to depict their particular deployment. Systems with few external events, such as batch-oriented systems, do not need statecharts to describe their behaviour. Activity diagrams, designed to describe a system’s internal processing, are adequate to model the behaviour of batch-oriented systems. However, because activity diagrams do not model external events as well as statecharts do, they are inadequate to describe the behaviour of reactive systems; statecharts must be used as well to model these systems. Class diagrams are necessary to describe the structure of a system.
Object diagrams may not be necessary for systems with statically-created objects; in a system with dynamically-created objects, however, object diagrams may be needed to reduce complexity by depicting which objects are active at a particular point in time. Sequence and collaboration diagrams are used to help describe the dynamic behaviour of a system. These diagrams are most useful for reducing complexity in highly-interactive systems with complex behaviour and interactions among objects. In non-interactive, strictly batch-oriented systems, however, the interactions among objects are much less complex and can often be adequately modelled using activity and class diagrams, without the need for sequence or collaboration diagrams. Use case diagrams are very useful for clarifying system requirements with end-users. However, many other methods, such as textual representation or formal specification methods such as Z, exist to depict the information contained within a use case diagram. Thus, while defining system requirements is a necessary part of the software development process, it is not strictly necessary to utilize use case diagrams to model this information.
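The rules of thumb in this section can be condensed into a small decision function. The Python sketch below is one possible reading of the discussion above, not a definitive procedure; the function name, the four boolean parameters, and the two example systems are assumptions introduced for illustration. Use case diagrams are left out because, as noted above, requirements can be captured by other means.

```python
def suggest_diagrams(multi_component, reactive, dynamic_objects, highly_interactive):
    """Return the UML diagram types suggested by the rules of thumb above."""
    # Class diagrams describe structure; activity diagrams describe internal processing
    diagrams = ["class", "activity"]
    if reactive:
        # External events are modelled with statecharts in addition to activity diagrams
        diagrams.append("statechart")
    if dynamic_objects:
        # Shows which dynamically created objects are active at a point in time
        diagrams.append("object")
    if highly_interactive:
        diagrams += ["sequence", "collaboration"]
    if multi_component:
        diagrams += ["component", "deployment"]
    return diagrams

# Hypothetical systems: a small batch job and a large reactive system
batch_payroll = suggest_diagrams(
    multi_component=False, reactive=False,
    dynamic_objects=False, highly_interactive=False)
online_booking = suggest_diagrams(
    multi_component=True, reactive=True,
    dynamic_objects=True, highly_interactive=True)
print(batch_payroll)
print(online_booking)
```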

1. Acquiring UML Class Diagrams from Legacy System

1.1. Cleaning Source Code

Reverse engineering a legacy system means going all the way back from the legacy source code to the design stage. The original code may contain unstructured statements, such as “GO TO” statements in programs written in COBOL, that are neither structured nor clear and that turn the source code into “spaghetti code” (Hutty, 1997). It is necessary to clean the original code and eliminate the dead code. At the completion of each increment, the percentage of legacy code decreases while the percentage of cleaned code increases, until eventually the legacy code is completely cleaned. A legacy system may contain dead code that is useless to the execution of its tasks. That dead code may have existed since the original development. Alternatively, with changes in the environment, especially improvements in hardware, some ways of inputting, outputting, or storing data may have been modified or improved, so that the corresponding code has become useless; on some occasions it even results in the failure of the system. Consequently, dead code must be recognised and removed from the legacy system. The process of cleaning legacy code can provide a clear description of the program in a readily understandable format, forming a solid base for further work. It involves migrating the legacy system onto a modern hardware or operating system platform, migrating from the existing database to a relational database management system, and converting the system to a modern programming language. The legacy system is improved over a series of increments, making functionality available to the user sooner than is possible with a big bang deployment strategy. The goal is to break up the migration to the future system into small manageable steps, where the objectives of each increment are well defined. Incremental plans are driven foremost by complexity and technical feasibility.
It is critical to ensure that the functionality, reliability, and performance of the system are not diminished once the development of the clean code has been completed. Incremental deployment also offers the organisation an opportunity to begin sustaining modernised components gradually, easing the transition from legacy to modern technologies. Cleaning the code means obtaining, from the original code, an equivalent but different design or representation that has clearer and simpler semantics than the original programming language. The representation is structured so that each line of the source code is a meaningful fragment of a program specification.
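One common way to recognise a simple form of dead code is a reachability check over the program's control flow: any block that cannot be reached from the entry point is a candidate for removal. The Python sketch below illustrates this technique under stated assumptions. The chapter does not prescribe this algorithm, and the paragraph labels and the flow table are hypothetical, standing in for the GO TO and PERFORM targets that a real parser would extract.

```python
from collections import deque

def unreachable_blocks(flow, entry):
    """Return the blocks in `flow` that cannot be reached from `entry`.

    flow maps each block label to the labels it may transfer control to.
    """
    seen = {entry}
    queue = deque([entry])
    while queue:
        block = queue.popleft()
        for target in flow.get(block, ()):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return sorted(set(flow) - seen)

# Hypothetical control flow of a small COBOL-like program
flow = {
    "MAIN": ["READ-INPUT", "FINISH"],
    "READ-INPUT": ["UPDATE-TOTALS"],
    "UPDATE-TOTALS": ["READ-INPUT", "FINISH"],
    "PRINT-TO-TAPE": ["FINISH"],  # no longer referenced by any other block
    "FINISH": [],
}
dead = unreachable_blocks(flow, "MAIN")
print(dead)
```

A check of this kind finds only code that is never executed. Code that still runs but whose results are no longer used requires data flow analysis and, usually, domain knowledge.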

1.2. Gathering Parameters

In a legacy system, all parameters are gathered together, and important information about the main data structures is recorded in files or tables. Basically, the program is the set of all parameters and the operations on them; it is a semantic set within a specified operational environment. The input parameters therefore serve as the raw material, the programming environment and the execution of the program lines act as the machines that change the characteristics of those parameters, and the output parameters are the products of executing the whole program. In some cases the input parameters are the same as the output parameters; in other cases they are different (Ben-Menachem,1997). All parameters and their associated operations are collected together. These operations include system operations, which may include system calls.

1.3. Classifying Several Groups

All the parameters and the operations on them are classified into several groups. Each group has one nucleus. The members of each group are closely related to one another and together describe a common core. Although that core is not always obvious, every parameter and related operation is part of the specification of the nucleus of its group. In every group, each parameter depicts one characteristic of the nucleus, such as its name, age, weight, height, or ID, and every operation changes, assesses, or detects those characteristics, such as increasing or decreasing its weight, confirming whether it has reached a certain age, or determining whether it has another name. It is possible that one operation is involved in several groups. The best way of dealing with that problem is to separate the operation into several operations, each of which concentrates on only one group.
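The grouping step can be sketched in a few lines of Python. The sketch assumes that the mapping from each parameter to its nucleus has already been decided, which in practice requires domain knowledge. The function name, the parameter names, and the naming scheme for split operations are all hypothetical.

```python
from collections import defaultdict

def classify(param_nucleus, operations):
    """Group parameters and operations around their nucleus.

    param_nucleus: parameter name -> nucleus it describes (assumed given)
    operations:    operation name -> parameters the operation uses
    """
    groups = defaultdict(lambda: {"parameters": set(), "operations": set()})
    for param, nucleus in param_nucleus.items():
        groups[nucleus]["parameters"].add(param)
    for op, used in operations.items():
        touched = sorted({param_nucleus[p] for p in used})
        for nucleus in touched:
            # An operation involved in several groups is split into one piece per group
            name = op if len(touched) == 1 else f"{op}_{nucleus}"
            groups[nucleus]["operations"].add(name)
    return dict(groups)

# Hypothetical parameters and operations gathered from a legacy program
param_nucleus = {"EMP-NAME": "Employee", "EMP-AGE": "Employee", "DEPT-ID": "Department"}
operations = {"CHECK-AGE": ["EMP-AGE"], "TRANSFER": ["EMP-NAME", "DEPT-ID"]}
groups = classify(param_nucleus, operations)
print(groups)
```

Here the TRANSFER operation touches both groups, so it is split into one operation per group. Deciding what each piece should actually do remains a manual step.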

1.4. Extracting the Classes

The nucleus contained in each group is regarded as one class. Sometimes the name of the nucleus is not mentioned in the program, in which case it is necessary to define a name for it. All the parameters and operations in the group describe the attributes of the nucleus, the operations on it, and its relationships with other groups. They can be distilled into the attributes and operations of the class (D’Souza,1999). It is important to note that the name of the class, its attributes, and its operations are domain-related. The domain is the area in which the problem is located. All definitions and extractions must be based on domain knowledge (Yang,1991; Pu, 2003).
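As a minimal sketch of this extraction step, the Python code below turns one group into a class description. The names ExtractedClass and extract_class, the group identifiers, and the placeholder naming scheme are assumptions made for illustration; the class name itself is supplied from domain knowledge whenever the program does not name the nucleus.

```python
from dataclasses import dataclass, field

@dataclass
class ExtractedClass:
    name: str
    attributes: list = field(default_factory=list)
    operations: list = field(default_factory=list)

def extract_class(group_id, parameters, operations, domain_names):
    """Distil one group into a class; parameters become attributes."""
    # The nucleus may be unnamed in the program, so a name comes from domain
    # knowledge; a placeholder marks groups that still need a domain name.
    name = domain_names.get(group_id, f"UnnamedGroup{group_id}")
    return ExtractedClass(name, sorted(parameters), sorted(operations))

# Hypothetical group 1 has been identified by a domain expert as "Employee"
employee = extract_class(1, {"EMP-AGE", "EMP-NAME"}, {"CHECK-AGE"}, {1: "Employee"})
unnamed = extract_class(2, {"RATE"}, set(), {})
print(employee)
print(unnamed)
```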

1.5. Defining Relationships of Classes

An association shows a relationship between two or more classes. Associations have several properties:
• A name that describes the association between the two classes. Association names are optional and need not be globally unique.
• A role at each end that identifies the function of each class with respect to the association.
• A cardinality at each end that identifies the possible number of instances.
Initially, the associations between classes are the most important relationships because they reveal more information about the application domain. At this stage, every association should be named and a role assigned to each end. The associations between classes are also domain-related. It is also necessary to model generalization relationships between classes. Generalization is used to eliminate redundancy from the analysis model: if two or more classes share attributes or behaviour, the similarities are consolidated into a superclass (Dorfman,1997).
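The properties of an association and the consolidation of shared attributes into a superclass can be sketched in Python as follows. The type names, the example classes, and the treatment of each class as a plain set of attribute names are simplifying assumptions for illustration only.

```python
from dataclasses import dataclass

@dataclass
class AssociationEnd:
    class_name: str
    role: str          # function of the class with respect to the association
    cardinality: str   # possible number of instances, e.g. "1" or "0..*"

@dataclass
class Association:
    name: str
    source: AssociationEnd
    target: AssociationEnd

def generalize(classes):
    """Move attributes shared by all classes into a common superclass.

    classes maps a class name to its set of attribute names.
    Returns (superclass attributes, remaining attributes per subclass).
    """
    shared = set.intersection(*classes.values())
    remaining = {name: attrs - shared for name, attrs in classes.items()}
    return shared, remaining

# Hypothetical association between two extracted classes
works_in = Association(
    "works in",
    AssociationEnd("Employee", "member", "0..*"),
    AssociationEnd("Department", "employer", "1"),
)

# Hypothetical classes that share attributes
shared, remaining = generalize({
    "Clerk": {"name", "id", "desk"},
    "Manager": {"name", "id", "budget"},
})
print(works_in)
print(sorted(shared), remaining)
```

Whether the shared attributes really justify a superclass is a domain decision. The set intersection only identifies candidates.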

1.6. Pretty Printing Class Diagrams

A class diagram is a diagram that shows a set of classes, interfaces, and collaborations and their relationships. Class diagrams are typically used to explore domain concepts in the form of a domain model, to analyse requirements in the form of an analysis model, and to depict the detailed design of object-oriented software. Class diagrams are essential to modelling a legacy system. During analysis, models should focus on the responsibilities of classes rather than on their attributes or operations, and associations should be modelled as association classes. When association classes are modelled on analysis class diagrams, the dashed line of the association class should be attached to the centre of the association, and the association should not be named. The type of an attribute should be shown on an analysis class diagram only when that specific type is a requirement of the system; otherwise, the types of attributes and operations should be modelled on design class diagrams. During design, the visibility of each attribute or operation is shown as one of four types: public “+”, protected “#”, private “-”, and package “~”. Visibility is a design detail because it defines the level of access that objects have to the attribute or operation. A class name is a singular noun based on common terminology. An attribute name is a domain-based noun, and an operation name begins with a strong verb. Graphically, a class is rendered as a rectangle with three compartments that contain the name, attributes, and operations respectively. All three compartments are drawn even if the attribute or operation compartment is blank. A class may contain additional, less common compartments, which should be labelled at the top centre of the compartment. A compartment with an incomplete list should be marked with an ellipsis at its end. Within a class, attributes and operations should be listed in order of decreasing visibility, with static ones before instance ones.
When operation signatures are too long for the class symbol, only the types of the objects passed as parameters are listed, to save space. Consistency in attributes, operations, parameters, and their order across classes is essential. Following language conventions, if the names of attributes and operations imply their stereotypes, these stereotypes should be omitted. Exceptions thrown by operations can be indicated with a property string. Inheritance is modelled vertically and other relationships horizontally. If two classes interact with each other, some kind of relationship may be needed between them. A transitory relationship is modelled as a dependency. In class diagrams, multiplicity should always be shown, and the multiplicity “*” should usually be replaced by the more explicit “1..*” or “0..*”. In some cases, attribute types can replace relationships. It is not always necessary to model implied relationships or every single dependency. It is a common convention to centre the name of an association above the association path and to use one or two descriptive words to name it. The direction of an association name can be indicated with a filled triangle; the default reading direction is left to right. If multiple associations exist between two classes, role names should be introduced to clarify the diagram. Role names can also be used to describe recursive associations that involve the same class on both ends. Associations are bi-directional only when collaboration occurs in both directions. Associations are inherited by implication, and when changes occur the associations may need to be redrawn. In general modelling, specific minimums and maximums in multiplicities can be relaxed to “1..*” or even “0..*” to make class diagrams more flexible. Inheritance models the “is a” or “is like” relationship between the subclass (child) and the superclass (parent). One of the following sentences should make sense: “A subclass is a superclass” or “A subclass is like a superclass”.
It is common to place a subclass below its superclass. A subclass inherits the attributes and operations of its superclass. If it inherits only some data, without the basic attributes or operations, the relationship between the two classes may need to be re-examined. The last set of guidelines concerns aggregation and composition. An aggregation is a specialisation of association that depicts a whole-part relationship. Composition is a stronger form of aggregation in which the whole and its parts have coincident lifetimes and the whole commonly manages the lifecycle of its parts. For an aggregation, it should make sense to say “the part is part of the whole”. The whole is commonly drawn to the left of its parts. When the parts of the whole are physical items, the relationship is composition. When the lifecycle of the parts is the same as that of the whole, the relationship is also composition. Because class diagrams are vital in modelling the system, represent many of its characteristics, and describe it at the class level, it is important to follow these guidelines when modelling the system with class diagrams.

2. Expressing UML Activity Diagrams from Legacy Systems

2.1. UML Activity Diagram

A UML activity diagram describes the dynamic aspect of a system. It is essentially a flowchart, showing the flow of control from activity to activity. An activity is an ongoing, non-atomic execution within a state machine. Activities ultimately result in some action, which is made up of executable atomic computations that result in a change in the state of the system or the return of a value. Actions encompass calling another operation, sending a signal, creating or destroying an object, or some pure computation. An activity diagram models the sequential and possibly concurrent steps in a computational process. It also models the flow of an object as it moves from state to state at different points in the flow of control. Activity diagrams may stand alone to visualise, specify, construct, and document the dynamics of a society of objects, or they may be used to model the flow of control of an operation (Booch, 1999; Rumbaugh, 1999; Larman, 1998). UML activity diagrams are a useful aid to comprehending a legacy system. Software systems are becoming larger and more complicated as technology develops rapidly, and at the same time they quickly become outdated and turn into legacy systems. The term legacy system denotes large, complicated, old, heavily modified, difficult-to-maintain, and old-fashioned software that is still vital to the organisation (Yang, 1999). A legacy system is a computer system or application program that continues to be used because of the cost of replacing or redesigning it, despite its poor competitiveness and compatibility with modern equivalents (Howe, 2002). The implication is that the system is large, monolithic, and difficult to modify. If legacy software runs only on antiquated hardware, the cost of maintaining it may eventually outweigh the cost of replacing both the software and the hardware, unless some form of emulation or backward compatibility allows the software to run on new hardware. 
It is important to note that the term “legacy” refers to the state of a system before a strategic change. Legacy is a function of change: it is the result of a change in the system’s environment, and without change there would be no legacy. It is essential to realise that a legacy system is not useless. In most situations, a legacy system is important, valuable, and vital to the business organisation. Because an activity diagram shows control flow and models sequential and concurrent procedures, it is well suited to representing a legacy system.

2.2. Getting Along with Domain Knowledge

A legacy system is closely tied to a specific commercial area. The concern here is the program boundary, which is determined by the domain in which the legacy system performs its tasks. In practice, the same program is used in different software packages, and the same software is used in different domains. As a result, the software reflects the objective reality of its domain. Domain knowledge permeates all software, including the names of parameters, the attributes of procedures and parameters, and the operational constraints on execution. The software itself is even a description of the business rules (Yang, 1991). Business competition in a global market is vigorous. Businesses face ongoing business and technological change, as well as ever-increasing complexity arising from their evolutionary response to rapid change. Within this environment, businesses are tasked with solving convoluted and ever-changing problems by applying their knowledge. The problem-solving process involves understanding a problem and evolving a solution from it through the application of domain knowledge. The domain knowledge that a business is able to capture, communicate, and leverage to solve problems therefore becomes the deciding factor in its success or failure. If such domain knowledge is captured and reapplied, a business organisation becomes more competitive and proactive, rather than reactive, in the face of change and complexity, thus increasing its probability of success. Domain engineering takes advantage of the similarities inherent in systems within one domain and builds a domain model that defines the basic structure for the existing and future requirements of those systems. A domain can also denote a specific phase of the software life cycle in which a developer works. Domains define developers’ and users’ areas of responsibility and the scope of possible relationships between products (Howe, 2002). 
In practice, comprehending a legacy system involves domain knowledge (Yang, 1991). When a legacy system is targeted, a precondition for describing it is to establish where it was used and where it will be used. The names of parameters, variables, procedures, and functions are closely related to the application domain. The names, attributes, and operations of classes are also determined by the domain. Domain knowledge is conducive to comprehending the source code. Legacy systems were designed to deal with business processes, and they model the business rules of the problem domain in which they reside. Furthermore, every legacy system is located within a specific commercial domain. If this domain is not uncovered, the system has to be understood in isolation from its domain, which is difficult. Consequently, the domain knowledge of a legacy system must be addressed during its reverse engineering. Because legacy code was designed and developed with regard to a specific business area, the code is domain-related and not isolated. Domain knowledge is obtained from the operating system on which the legacy system runs, its data environment, the programming language in which the legacy code was written, the input and output data of the legacy system, the hardware that executes it, its documents, its stakeholders, and the legacy code itself. This acquired domain knowledge is applied when modelling the static and dynamic aspects of a legacy system.

2.3. Classifying Calls

A call here means a procedure or function call in the programming language. The starting point in analysing the structure of a legacy system is to develop a call graph. Examining the calling structure can identify program elements with minimal dependencies that could be migrated easily. Four kinds of program element are distinguished: root program elements, which call other program elements but are not called by any; leaf program elements, which are called by other program elements but do not call any; node program elements, which both call and are called by other program elements; and isolated program elements, which neither call nor are called by other program elements. Root program elements are typically programs that are invoked directly by the user or by some external process, since otherwise there would be no way to execute them. By themselves, these elements may not be good candidates for the starting point of understanding a legacy system, because they call other program elements. A migrated root element would have to call back into the legacy system. This should be avoided, because execution control would pass back and forth between the newly derived code and the original code. Node program elements are even more difficult to migrate than root program elements. They share the difficulty of root program elements, and in addition they must be re-created from the legacy system in such a way that the remainder of the legacy system continues to function in the same manner and can still use them. Isolated program elements can be migrated easily. They can be handled in any given increment, since converting them neither increases nor decreases the number of elements that remain to be developed from the legacy system. Leaf program elements are the best candidates for the starting point of comprehending a legacy system. 
They do not call back into the legacy source code, and although they require development work, the number of such elements can be minimised by transferring an entire subsystem in a single iteration.
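
The four-way classification described in this section can be sketched in a few lines of Python. This is a minimal illustration only, assuming the call graph has already been extracted as a list of (caller, callee) pairs; the function name classify_elements and the program element names in the example are hypothetical and do not come from any particular legacy system or tool.

```python
from collections import defaultdict


def classify_elements(elements, calls):
    """Classify program elements by their role in the call graph.

    elements: iterable of element names (hypothetical example data).
    calls: iterable of (caller, callee) pairs taken from the call graph.
    """
    callees = defaultdict(set)  # caller -> elements it calls
    callers = defaultdict(set)  # callee -> elements that call it
    for caller, callee in calls:
        callees[caller].add(callee)
        callers[callee].add(caller)

    kinds = {}
    for name in elements:
        makes_calls = bool(callees[name])
        is_called = bool(callers[name])
        if makes_calls and is_called:
            kinds[name] = "node"
        elif makes_calls:
            kinds[name] = "root"
        elif is_called:
            kinds[name] = "leaf"
        else:
            kinds[name] = "isolated"
    return kinds


# Hypothetical legacy program used only for illustration.
elements = ["MAIN", "REPORT", "READ_FILE", "PRINT_LINE", "OLD_BACKUP"]
calls = [("MAIN", "REPORT"), ("MAIN", "READ_FILE"), ("REPORT", "PRINT_LINE")]
kinds = classify_elements(elements, calls)
```

In this example MAIN is a root, REPORT is a node, READ_FILE and PRINT_LINE are leaves, and OLD_BACKUP is isolated, so the leaves would be the suggested starting points for comprehension.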

2.4. Removing Dead Code

A legacy system contains some code that is useless, known as “dead code”. Once identified, such code can be eliminated. The structure of the legacy system is extracted and restructured, via an intermediate programming language, into structured code. Dead code consists of routines that can never be accessed because all calls to them have been removed, or code that cannot be reached because it is guarded by a control structure that provably must always transfer control somewhere else. The presence of dead code may reveal either logical errors due to alterations in the program or significant changes in the assumptions and environment of the program; a good compiler should report dead code so that a maintainer can think about what it means (Ramage, 1998). In short, dead code is unnecessary, inoperative code that can be removed. It includes functions and sub-programs that are never called, properties that are never read or written, and constants and enumerations that are never referenced. Variables are dead if they are never read, even if they have been given a value. User-defined types can also be dead, and a project may contain redundant API declarations. Sometimes even whole files are completely unnecessary. Some routines may look live even though they are not operational at run time. This happens when dead procedures call other procedures: there is a call in the code, but it is never executed. Dead code is very common. It can account for 30-40% of a project’s size and increase the size of the executable file by hundreds of kilobytes. This means excessive memory use and slower execution, so resources are wasted. Dead code also means more source code to read and maintain, which leads to higher costs, especially during maintenance: if the code is there, programmers must spend time understanding it. Leaving dead code in a completed project often means carrying untested code around. Dead code may inadvertently become operative later, making it a possible source of errors. 
Dead code inflates system size measures such as line counts, and may even lead to the purchase of unnecessary code at a high price. In a legacy system, dead code contributes nothing to the execution of the system’s tasks. Such code may have existed since the development stage. Alternatively, as the environment changed, and especially as hardware improved, some ways of inputting, outputting, or storing data were modified, so the corresponding code became useless; on some occasions it even causes the system to fail. Consequently, dead code must be recognised and removed from a legacy system. The isolated program elements of a legacy system are likely candidates for dead code.
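
One way to find routines that only look live is a reachability search over the call graph. The sketch below is a simplified illustration, not a full dead-code analysis: it assumes that the entry points (the routines invoked by users or external processes) are known, it considers routine-level calls only, and it ignores dead variables, constants, and types. The function and routine names are hypothetical.

```python
def find_dead_routines(routines, calls, entry_points):
    """Return the routines that cannot be reached from any entry point.

    routines: iterable of routine names.
    calls: iterable of (caller, callee) pairs.
    entry_points: routines invoked directly by users or external processes.
    """
    graph = {name: set() for name in routines}
    for caller, callee in calls:
        graph.setdefault(caller, set()).add(callee)
        graph.setdefault(callee, set())

    live = set()
    stack = [name for name in entry_points if name in graph]
    while stack:
        current = stack.pop()
        if current in live:
            continue
        live.add(current)
        stack.extend(graph[current] - live)

    return sorted(set(graph) - live)


# Hypothetical example: OLD_EXPORT is never called, so the routine it
# calls (FORMAT_TAPE) is dead as well, although a call to it exists.
routines = ["MAIN", "READ_FILE", "OLD_EXPORT", "FORMAT_TAPE"]
calls = [("MAIN", "READ_FILE"), ("OLD_EXPORT", "FORMAT_TAPE")]
dead = find_dead_routines(routines, calls, entry_points=["MAIN"])
```

Here FORMAT_TAPE is reported as dead even though the code contains a call to it, because that call sits inside a routine that is itself never executed.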

2.5. Specifying Behaviour of Legacy System

Because a legacy system is large, unstructured, complicated, and old-fashioned, it is difficult to read and understand. A specification therefore needs to be derived from the source code. Initially this specification includes all the information in the source code and is as complex in structure as the source code itself. The next task is therefore to present the specification graphically, which means extracting a high-level representation that is more understandable than the legacy system. This process discards a great deal of detailed information and produces a description of the legacy system in the form of UML activity diagrams, which describe the behavioural aspect of the legacy system being modelled (Yang, 1997). It is necessary to deal with data structures and to gather code into operations on those data structures, rather than trying to recognise simple operations in the code and to limit the number of program variables involved in those operations. The specification of the system contains detailed information: the interactions of variables and classes, the detailed code structure, and a complete description of the operations on variables. Reasoning about equivalence at a higher level is essential. Once a simplified specification has been obtained, the data structures it contains are eliminated to obtain a clearer description. Behavioural structures implement the dynamic aspect through internal constraints, and hidden behaviour is made explicit. These techniques are usefully applied within the object abstraction technique where a section of code implements a mathematical function, or where a simpler description of a higher-level or auxiliary behaviour can be obtained (Breuer, 1991). A behavioural specification consists of a set of top-level behaviour definitions, an explicit definition of the valid domain, and a set of lower-level local specifications. 
After the code has been restructured, the specification of the legacy system is generated entirely by a line-by-line translation mechanism. Although many new variables and operations on those variables are produced, and the first version of the specification is not very readable, applying simple transformations yields an acceptable form. This serves as the basis for the further transformations employed in equivalence reasoning. Specifications are presented so as to describe the legacy software using real-world names, such as “student”, “university”, or “bank”, and program notes, such as “This is to input data”, “This is to print the file”, or “This is to stop the program”, so that users of the legacy system know clearly what is done or will be done. For large, complicated software with millions of lines of code, it is essential to identify the detailed behaviour of the calls (Li, 2001).

2.6. Realising Activity Diagrams

UML activity diagrams model complex operations, business rules, business processes, and software processes. They contain activity states and action states. An action state represents an executable, atomic computation: an operation on an object, the sending of a signal to an object, or the creation or destruction of an object. An action state cannot be decomposed, which means that events may occur but the work of the action state is not interrupted. The work of an action state is generally considered to take insignificant execution time. In contrast, activity states can be further decomposed, their activity being represented by other activity diagrams. Furthermore, activity states are not atomic: they may be interrupted, and they are considered to take some time to complete. An action state is a special case of an activity state that cannot be further decomposed. Conversely, an activity state can be thought of as a composite whose flow of control is made up of other activity states and action states. An activity state can be expanded into an activity diagram (Booch, 1999; Bennett, 1999; D’Souza, 1999). Action states and activity states are just special kinds of states. An activity state is semantically equivalent to expanding its activity graph in place until only actions remain. However, activity states are important because they help to break complex computations into parts. This is helpful when comprehending a legacy system that is large, complicated, and difficult to understand. Activity diagrams are used to model the dynamic aspects of a legacy system (Systa, 2000). These dynamic aspects may involve the activity of any kind of abstraction in any view of a system’s architecture. An activity diagram can model a workflow or an operation of a system. An activity diagram may be attached to any modelling element for the purpose of visualising, specifying, constructing, and documenting that element’s behaviour. 
When realising an activity diagram, the start point is placed at the top-left corner of the diagram and the end point at the bottom-right corner. The start point is modelled with a filled-in circle, and the end point with a filled-in circle with a border around it. Every UML activity diagram has a starting point. An activity on a UML activity diagram typically represents the invocation of an operation, a step in a business process, or an entire business process. When an activity other than the start or end point has a transition into it but none out, or a transition out but none into it, one or more transitions have probably been missed. A decision point is modelled as a diamond. Guards, written in the format “[description]”, are conditions that must be true in order to traverse a transition; placed on the transitions leaving a decision point, they help to describe that decision point. Each transition leaving a decision point must have a guard. The guards on the transitions leaving a decision point, or an activity, must be consistent with one another and should not overlap. The guards on a decision point should form a complete set, as should exit transition guards together with activity invariants. Activities can occur in parallel, which is modelled with forks and joins. A fork should have a corresponding join. A fork has exactly one entry transition and a join has exactly one exit transition (Ambler, 2002). Activity diagrams allow a great deal of freedom. They should use the right level of detail to describe the system functionality. A model is a communication device, so it requires an adequate level of detail to address the problem to be solved. Clarity and brevity are important to avoid visual overload, and a model should present the key features of the control flows. It is important to limit the complexity of each activity diagram. 
If there are more than three possible paths (alternate or exceptional), additional activity diagrams should be used to promote understanding. Additional activity diagrams are also needed if the processing requires specific data elements.
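
Two of the guidelines in this section lend themselves to a mechanical check: activities with a missing incoming or outgoing transition, and transitions that leave a decision point without a guard. The sketch below assumes a very simple in-memory representation of an activity diagram; the Transition class, the function name, and the example diagram are hypothetical. It does not attempt to check that guards are non-overlapping or complete, which would require reasoning about the guard conditions themselves.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transition:
    source: str
    target: str
    guard: str = ""  # e.g. "[balance >= amount]"; empty means no guard


def check_activity_diagram(nodes, transitions, start, end, decisions):
    """Report likely modelling mistakes in a simple activity diagram."""
    problems = []
    has_incoming = {t.target for t in transitions}
    has_outgoing = {t.source for t in transitions}

    # Every node except the start and end points needs a way in and a way out.
    for node in nodes:
        if node in (start, end):
            continue
        if node not in has_incoming:
            problems.append(f"{node}: no transition into it")
        if node not in has_outgoing:
            problems.append(f"{node}: no transition out of it")

    # Every transition leaving a decision point must carry a guard.
    for t in transitions:
        if t.source in decisions and not t.guard:
            problems.append(f"{t.source} -> {t.target}: missing guard")
    return problems


# Hypothetical withdrawal workflow with two deliberate mistakes:
# "Log" is disconnected, and one branch of "check" has no guard.
nodes = ["start", "Read request", "check", "Pay out", "Reject", "Log", "end"]
transitions = [
    Transition("start", "Read request"),
    Transition("Read request", "check"),
    Transition("check", "Pay out", "[balance >= amount]"),
    Transition("check", "Reject"),
    Transition("Pay out", "end"),
    Transition("Reject", "end"),
]
problems = check_activity_diagram(nodes, transitions, "start", "end", {"check"})
```

For this example the check reports that the activity Log has no transition into or out of it, and that the transition from the decision point check to Reject lacks a guard.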

3. Achieving UML Sequence Diagrams from Legacy System

3.1. Confirming Business Domain

The domain of a system is the set of demands placed on it. It is essential to analyse where the problem is located. Because a legacy system embodies business data and rules, problem analysis means understanding business problems, which is closely related to the business domain. The initial solution boundaries and constraints of the legacy code are defined from both technical and business perspectives. The concern here is the program boundary, which is determined by the domain in which the legacy system performs its tasks (Yang, 1991; Pu, 2003). In practice, the same program is used in different software packages, and different software packages are used in different domains. As a result, the software reflects the objective reality of its domain. Domain knowledge permeates all software, including the names of parameters, the attributes of procedures and parameters, the operational constraints on execution, and the associations between elements.

3.2. Discovering the Interactors

It is important to identify who is going to use the legacy system directly. This should be done from outside the legacy system. It involves human interaction and is closely related to domain knowledge. The candidate actors include the humans who interact with the code, the hardware that is external to the code, and the other systems that interact with the code. Software interacts with humans and other systems in the real world. The interactors of the code are its users and the other systems that interact with it. Users perform particular tasks with the code, and the code can exchange information with other software, sending data to it, receiving data from it, or both. When the code is executed, it may exchange messages with other systems: it may send data to and receive data from them, send signals to control them, receive signals by which it is controlled, or be executed by other hardware systems. All these systems are regarded as interactors of the legacy system. Different interactors view the legacy system from different perspectives, and each is interested in a particular aspect of it. All the interactors serve as candidate actors of the legacy system.

3.3. Distilling Interactors from Stakeholders

Stakeholders are closely related to a software system and are important to the analysis of legacy systems. Different stakeholders (end users, analysts, developers, system integrators, testers, technical writers, and project managers) each bring different agendas to a project, and each looks at the system in different ways at different times over the project’s life. System stakeholders are the people who are closely related to the system. They can be direct users, indirect users, managers of the users, senior managers, supporters who provide help, buyers who invest in the system, developers who work on other systems that interact with the system being developed, maintainers who will maintain the system, or system developers who work on it. Active stakeholder participation describes the need for on-site access to people, typically users or their representatives, who have the authority and ability to provide information pertaining to the system and to make pertinent and timely decisions regarding requirements and their prioritisation. System success requires a great deal of involvement by system stakeholders, since a system is accomplished through the contributions of all of them. Some stakeholders are regarded as candidate actors of the system, which is vital to identifying the objects of a legacy system and realising its sequence diagrams. Because humans are generally the most important candidate actors, it is important to take the stakeholders of a legacy system into account. Because a legacy system is critical to the business and the organisation, it continues to accomplish its specified tasks, although ineffectively, inefficiently, and with risk. The direct users, indirect users, company managers, supporters who provide help, and maintainers of a legacy system are all interactors.

3.4. Regarding Human Beings as the Most Important Actors

It is essential to regard one of those interactors as an actor. If there are more than two candidate actors, humans should be considered the most important. Because only humans can use the software, directly or indirectly, for their own purposes, it is best to take into account all the users, the maintainers, the managers of the company, and even the business organisations. When choosing actors, the relative importance of the candidates should be weighed. All candidate actors are important to a legacy system, but they may differ in importance. The direct end users are the most important of all the candidate actors. In most cases they use the legacy system and determine when it is executed. The operational results are reported directly to them, and they then decide what should be done next. Next in importance may be the company managers, because they not only use the legacy system indirectly but also manage its direct end users in the company. They may plan a business project on the basis of the output of the legacy system. Both the direct end users and the company managers use the legacy system for commercial purposes, and in some cases they may be the same people. Next may come the indirect users and the maintainers, who in most cases do not use the legacy system for business purposes.

3.5. Distilling Hardware and Other Systems as Actors

Interactors other than human beings are also considered and chosen as actors, including the hardware that executes the legacy system and other systems that interact with it. These actors describe how the legacy system is executed on the hardware and what messages are exchanged with other systems. Any system that invokes the legacy system is regarded as an actor.

3.6. Gathering Parameters

The original code statements are replaced with their equivalents in the intermediate language to generate a new representation. The original code, with its inputs and outputs fully identified and its complicated constructs, is used to recognise the data and the code of the program, in preparation for analysis that produces completely formed objects. At the beginning of data organisation and abstraction, a simple syntax for objects is defined, so that an object can be abstracted from the complex code by extracting the related program variables and treating the operations performed on them together as a single action on the object. In a legacy system, all parameters are gathered together, and important information about the main data structures is recorded in files or tables. Essentially, a program is the set of all its parameters and the operations on them; it is a semantic set within a specified operational environment. The input parameters can therefore be regarded as the raw material, the programming environment and the execution of program lines as the machinery that changes the characteristics of those parameters, and the output parameters as the products of executing the whole program. In some cases the input parameters are the same as the output parameters; in other cases they are different.

3.7. Classifying All Parameters into Several Groups

All the parameters and the operations on them are classified into several groups. Each group has one nucleus, and the members of a group are closely related to each other in describing that common core. Although the core is not always obvious, every parameter and related operation in a group is part of the specification of that group’s nucleus. Each parameter depicts one characteristic of the nucleus, such as its name, age, weight, height, or ID; and every operation changes, assesses, or detects those characteristics, for example by increasing or decreasing its weight, confirming whether it has reached a given age, or determining whether it has another name. One operation may be involved in several groups. The best way to deal with this is to separate the operation into several operations, each of which concentrates on only one group.
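
The grouping step, including the splitting of an operation that spans several groups, can be illustrated as follows. The sketch assumes that an analyst has already assigned each parameter to a nucleus using domain knowledge; that assignment is the hard part and is not automated here. All names (build_groups, the parameters, and the operations) are hypothetical, and the naming scheme for the split operations is just one possible convention.

```python
def build_groups(group_of, operations):
    """Collect parameters and operations around their nucleus.

    group_of: parameter name -> nucleus name (assigned by the analyst).
    operations: operation name -> list of parameters the operation touches.
    """
    groups = {
        nucleus: {"parameters": [], "operations": []}
        for nucleus in dict.fromkeys(group_of.values())
    }
    for param, nucleus in group_of.items():
        groups[nucleus]["parameters"].append(param)

    for op, params in operations.items():
        # Nuclei touched by this operation, in first-seen order.
        touched = list(dict.fromkeys(group_of[p] for p in params))
        for nucleus in touched:
            # An operation spanning several groups is split into one
            # part per group, so each part concentrates on one nucleus.
            name = op if len(touched) == 1 else f"{op}_{nucleus}"
            groups[nucleus]["operations"].append(name)
    return groups


# Hypothetical parameters and operations from a student-fees program.
group_of = {
    "stu_name": "student",
    "stu_age": "student",
    "acc_id": "account",
    "acc_balance": "account",
}
operations = {
    "check_age": ["stu_age"],
    "pay_fee": ["stu_name", "acc_balance"],
}
groups = build_groups(group_of, operations)
```

In this example pay_fee touches both the student and the account groups, so it is split into pay_fee_student and pay_fee_account, while check_age stays whole in the student group.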

3.8. Extracting Objects

The nucleus of each group is regarded as one object. Sometimes the name of the nucleus is not mentioned in the program, in which case a name must be defined for it. All the parameters and operations in the group describe the attributes of the nucleus, the operations on it, and its relationships with other groups. They can be distilled into the attributes and operations of the object. It is important to note that the name of the object, its attributes, and its operations are domain-related. The domain is where the problem is located, and all definitions and extractions must be based on domain knowledge. It is vital to check the validity of the objects and their operations and attributes. Objects should not overlap; if they do, the groups are redefined. If an operation on one object contains an operation on another object, it must be divided into parts, each of which focuses on only one object. If inconsistencies occur in the code where the objects are accessed, the code must be restructured at a high level to correct those logical flaws. Attributes are properties of individual objects. When identifying the properties of objects, only the attributes relevant to the system should be considered. Properties that are represented by objects are not attributes. An attribute has a name that identifies it within an object, a brief description, and a type that describes the legal values it can take. In the case of entity objects, any property that needs to be stored by the system is a candidate attribute. Attributes represent the least stable part of the object model. Unless added attributes are associated with additional functionality, they do not entail major changes to the object and system structure.

3.9. Defining Relationships among Objects

An association shows a relationship between two or more objects. Associations have several properties:

- A name that describes the association between the two objects. Association names are optional and need not be globally unique.
- A role at each end that identifies the function of each object with respect to the association.
- A cardinality at each end that identifies the possible number of instances.

Initially, the associations between objects are the most important relationships because they reveal more information about the application domain. Every association should be named and roles assigned to each end. The associations between objects are also domain-related. It is also necessary to comprehend generalisation relationships between objects. Generalisation is used to eliminate redundancy from the analysis model. A further kind of relationship is aggregation. Aggregation is an association between two classes in which one end of the association plays a more important role than the other. The following criteria are used to identify an aggregation:

- a class is part of another class;
- the attribute values of one class propagate to the attribute values of another class;
- an action on one class implies an action on another class;
- objects of one class are subordinates of objects of another class.

Composition is a specialisation of the aggregation relationship in which class A is composed of class B, so that class B is part of class A. Patterns are used to identify and define generalisation relationships among classes. A pattern is defined as a recurring combination of objects and classes (Alexander, 1977). The main purpose of patterns is to find abstractions, or higher-level designs, among objects that are very specific to the application domain. Patterns have the advantages of programming language independence, flexibility, extensibility, and portability. Modelling large systems entails many classes. 
These classes, often tightly coupled, make the model difficult to understand. Furthermore, the architectural features of the classes are obscured by the large number of intertwined classes. Patterns may organise these classes into a hierarchy of classes and sub-classes; however, patterns do not represent the general form of an application. To represent the implementation of targeted applications, frameworks, which are reusable infrastructure, are used. The architecture of a system is represented using categories, which are defined as stereotyped packages, in order to focus on their architectural function. Classes that are logically connected with larger elements are grouped into categories. With these categories, it is possible to build hierarchical layers that segment an application into levels of increasing abstraction. The categories are connected via stereotyped dependency relationships that represent imports between categories.

3.10. Classifying Calls

A call here means a procedure or function call in the programming language. Four kinds of program element are distinguished: root, leaf, node, and isolated program elements. Leaf program elements are the best candidates for the starting point of comprehending a legacy system. They do not call back into the legacy source code, and although they require development work, the number of such elements can be minimised by transferring an entire subsystem in a single iteration.

3.11. Refining Operations

By definition, root program elements call node or leaf program elements, leaf program elements are always called by root or node program elements, and isolated program elements neither call nor are called by any other program element. Therefore the isolated program elements are understood first, at the structural level. This reduces the complexity and size of the legacy system. Then the operations of the leaf and node program elements are understood. Finally, the root program elements are taken into account. The legacy system is thus understood over a series of increments, making functionality available to the user sooner than is possible with a big-bang deployment strategy. The goal is to break up the migration into small, manageable steps, where the objectives of each increment are well defined. Incremental plans are driven foremost by complexity and technical feasibility. It is critical to ensure that the functionality, reliability, and performance of the system are not diminished once the development of the clean code has been completed. Incremental deployment also gives the organisation the opportunity to begin sustaining modernised components gradually, easing the transition from legacy to modern technologies. When refining the operations of a legacy system, it is necessary to understand the preconditions and postconditions of each operation.
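
The comprehension order described above (isolated elements first, then leaf and node elements, then root elements) can be written down as a simple increment plan. This is a sketch under the assumption that each element has already been labelled with one of the four kinds; the function name `plan_increments` and the element names are hypothetical.

```python
def plan_increments(kinds):
    """Group program elements into comprehension increments.

    kinds: dict mapping element name -> "isolated", "leaf", "node" or "root".
    Returns a list of increments, each a sorted list of element names.
    """
    phases = [("isolated",), ("leaf", "node"), ("root",)]
    plan = []
    for phase in phases:
        members = sorted(name for name, kind in kinds.items() if kind in phase)
        if members:  # skip phases that have no elements
            plan.append(members)
    return plan


# Hypothetical classification of a small legacy program.
example_kinds = {
    "MAIN": "root",
    "CALC": "node",
    "READ_REC": "leaf",
    "ROUND": "leaf",
    "OLD_REPORT": "isolated",
}
for number, increment in enumerate(plan_increments(example_kinds), start=1):
    print(number, increment)
```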

3.12. Refining Messages from Operations

A message that goes from one object to another goes from one object’s lifeline to the other object’s lifeline. An object can also send a message to itself, that is, from its lifeline back to its own lifeline. It is necessary to recognise the timing sequence of the operations used in the legacy system. The operations of the legacy system are classified into different layers, especially in legacy systems containing node and leaf program elements, and they are described in the sequence diagrams of that legacy system. Operations that are similar and that work together sequentially are collected and presented as one message. For example, the three operations that send the values of the day, the month and the year to the date are combined into one message named “sending the date”. In a legacy system, some complex computations, especially mathematical formulas, are represented by messages in the application domain. For instance, the formula S = Length1 * Length2 is represented as the message “computing the rectangle area”. Meanwhile, it is fundamental to order the messages of the legacy system in sequence diagrams according to the time and sequence in which they are executed. The objects from which and to which each message is sent are recorded. It is important to concentrate on the critical operations when refining the messages of a legacy system. Not all operations are treated as messages, and operations do not map one-to-one onto messages (Li, 2001).
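
The merging of consecutive, related operations into a single message can be sketched as follows. The sketch assumes that the analyst supplies a mapping from operation names to message names (deciding which operations are “similar” is a human judgement, not something the code infers), and that operations are listed in execution order. The function name `refine_messages` and all object and operation names are hypothetical.

```python
from itertools import groupby


def refine_messages(operations, message_of):
    """Collapse runs of related operations into single messages.

    operations: list of (sender, receiver, operation_name) in execution order.
    message_of: analyst-supplied dict mapping operation name -> message name;
                operations not listed keep their own name.
    Returns a list of (sender, receiver, message_name) in execution order.
    """
    renamed = [
        (sender, receiver, message_of.get(operation, operation))
        for sender, receiver, operation in operations
    ]
    # groupby merges only consecutive identical entries, so time order is kept.
    return [message for message, _ in groupby(renamed)]


# Hypothetical trace: three operations set a date, one computes an area.
example_operations = [
    ("Form", "Date", "setDay"),
    ("Form", "Date", "setMonth"),
    ("Form", "Date", "setYear"),
    ("Form", "Rectangle", "computeArea"),
]
example_mapping = {"setDay": "sendDate", "setMonth": "sendDate", "setYear": "sendDate"}
print(refine_messages(example_operations, example_mapping))
```

Only adjacent operations between the same sender and receiver are merged, which preserves the ordering requirement stated above.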

3.13. Realising Sequence Diagrams

A sequence diagram is an interaction diagram that details how operations are carried out: what messages are sent and when. Sequence diagrams are organized according to time, and time normally proceeds down the page. A sequence diagram has two dimensions:
- the vertical dimension represents time;
- the horizontal dimension represents object interaction.

The vertical line is called the object’s lifeline. The lifeline represents the object’s life during the interaction. A message is shown as a horizontal solid arrow from the lifeline of the sender to the lifeline of the receiver. The arrow is labelled with the name of the operation to be invoked or the name of the signal. Its argument values or argument expressions may be shown as well. The arrow may also be labelled with a sequence number. Optionally, a message can be prefixed with an * (iteration marker), which shows that the message is sent many times. Sequence diagrams are used to demonstrate the flow of control for a certain part of a program. They show how objects in the system interact based on messages sent and returned.

Layering is a common way of organizing systems. It therefore makes sense to layer the sequence diagrams of a legacy system in a similar manner, based on the layers of the program calls in the legacy system. The root program element is regarded as the first and most important sequence diagram. Other program elements are included in that diagram. The node program elements are presented before the leaf and isolated program elements. The primary actor of the legacy system is placed at the top left of the sequence diagram. Other actors follow it in order of time and importance. The reactive actors, which react to the legacy system and are treated as the entities that the legacy system interacts with, are placed at the top right of the sequence diagram.

The message name is justified and aligned with the arrowhead. The receiver of the message implements the corresponding operation, so it makes sense for the message name to be close to that classifier. The syntax of the implementation language of the legacy system is used in naming the messages, which improves understandability and readability. In a sequence diagram, an object receives messages and invokes operations in time order: the next message can be performed only after the previous message has been executed. The time periods over which objects execute messages are therefore clearly shown on sequence diagrams. Return values are common in legacy systems. When an object finishes processing a message, control returns to the sender of the message. This marks the end of the activation corresponding to that message, and is shown by a dashed arrow going from the bottom of the activation rectangle back to the lifeline of the role that sent the message. Activations and return messages are optional on a sequence diagram. Return messages are optionally indicated using a dashed arrow with a label indicating the return value. When return values are referred to in a later part of the sequence diagram, it is necessary to model them. Otherwise, they are omitted in order to keep the sequence diagrams of the legacy system clear and simple.
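
The labelling rules above (sequence numbers, the * iteration marker, and return arrows drawn only when the return value is used later) can be illustrated with a small text rendering. This is a sketch, not the output format of any UML tool: the `->` and `-->` arrows stand in for solid and dashed arrows, and the `Message` class, the `render` function, and the example objects are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Message:
    sender: str
    receiver: str
    operation: str                      # named in the implementation language
    iterated: bool = False              # True when the message is sent many times
    return_value: Optional[str] = None
    return_used_later: bool = False     # model the return only if it is used later


def render(messages: List[Message]) -> List[str]:
    """Render messages top to bottom in time order, one line per arrow."""
    lines = []
    for number, m in enumerate(messages, start=1):
        marker = "*" if m.iterated else ""
        lines.append(f"{number}: {m.sender} -> {m.receiver} : {marker}{m.operation}")
        if m.return_value is not None and m.return_used_later:
            # Dashed return arrow, labelled with the return value.
            lines.append(f"   {m.receiver} --> {m.sender} : {m.return_value}")
    return lines


# Hypothetical interaction in a small accounting program.
example = [
    Message("Clerk", "Ledger", "readRecord()", iterated=True),
    Message("Ledger", "RateTable", "lookupRate(code)",
            return_value="rate", return_used_later=True),
    Message("Ledger", "Printer", "printTotal()", return_value="status"),
]
print("\n".join(render(example)))
```

The third message has a return value that is never used afterwards, so no return arrow is produced for it, which keeps the diagram simpler.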

4. Achieving Use Case Diagram from Legacy System

4.1. Confirming Business Domain

The domain of a system is the set of demands allocated to it. It is essential to analyse where the problem is located. Because a legacy system embodies business data and rules, problem analysis means understanding business problems, which is closely related to the business domain. The initial solution boundaries and constraints of the legacy code are defined from both technical and business perspectives. This focuses on the program boundary, which is determined by the domain in which the legacy system performs its tasks. In practice, the same program is used in different software, and different software packages are used in different domains. As a result, the software corresponds to objective reality. Domain knowledge permeates all of the software, including the names of the parameters, the attributes of the procedures and parameters, the operational constraints on execution, and the associations between the elements (Yang, 1991; Pu, 2003).

4.2. Eliminating Dead Code

Dead code has no effect on the behaviour of the program: with it removed, the resulting program has the same function as the original. When dead code is deleted, attention must be paid to the sequential execution order so that the semantics are preserved.
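
One simple form of dead code is a procedure that can never be reached from the program entry point. The sketch below finds such procedures by reachability over a call graph. It is a deliberately narrow illustration and an assumption of this example: it ignores dead statements inside live procedures and any calls made dynamically, and the function name `find_dead_procedures` and the procedure names are hypothetical.

```python
def find_dead_procedures(calls, entry):
    """Return the procedures that cannot be reached from the entry point.

    calls: dict mapping each procedure name to the names it calls.
    entry: name of the procedure where execution starts.
    """
    reachable = set()
    pending = [entry]
    while pending:
        procedure = pending.pop()
        if procedure in reachable:
            continue
        reachable.add(procedure)
        pending.extend(calls.get(procedure, ()))
    return sorted(set(calls) - reachable)


# Hypothetical call graph: OLD_TAX is never called from MAIN.
example_calls = {
    "MAIN": ["READ", "CALC"],
    "CALC": ["ROUND"],
    "READ": [],
    "ROUND": [],
    "OLD_TAX": ["ROUND"],
}
print(find_dead_procedures(example_calls, "MAIN"))
```

Removing only unreachable procedures leaves the order of the remaining statements untouched, so the sequential semantics of the live code are not affected.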

4.3. Specifying Clean Code

Specifying the clean code means transforming the clean code into a specification. Because a legacy system is large, unstructured, complicated, and old-fashioned, it is difficult to read and understand. Although the clean code is an improvement, it is still hard to read. The specification includes all the information in the clean code and is similar to it in structural complexity. Understanding a legacy system is an intelligent activity that involves understanding what the original programs were designed to do, how those legacy programs were derived from the original design, and why those programming approaches were chosen (Li, 2001). Because no method of code comprehension is perfect and every method relies on human involvement, all of these aspects must be understood before the source code is changed or enhanced. A well-structured specification supports consistency in modelling use cases, helps to achieve completeness, especially in identifying preconditions, constraints, and business rules, and is useful for defining use cases.

4.4. Finding Out the Interactors

It is important to identify who uses the legacy system directly. This should be done from outside the legacy system. It depends heavily on human interaction and is closely related to domain knowledge. The candidate actors include the humans who interact with the code, the hardware that is external to the code, and the other systems that interact with the code. Software interacts with humans and other systems in the real world. The interactors of the code are its users and the other systems that interact with it, and each interactor has a relationship with the legacy system. The users of the code perform particular tasks with it, and the code can exchange information with other software, sending data to it, receiving data from it, or both (Cotterell, 1995). When the code is executed, it may exchange messages with other systems: it may send data to and receive data from other systems, send signals to control other systems, receive signals by which it is controlled, or be executed by other hardware systems. All of these systems are regarded as interactors of the legacy system. Different interactors view the legacy system from different perspectives, and every interactor is interested in a particular aspect of it. All the interactors serve as candidate actors of the legacy system.

4.5. Distilling Interactors from the Stakeholders

The stakeholders of a legacy system may include the interactors. The stakeholders are end users, analysts, developers, system integrators, testers, technical writers, and managers; each brings a different perspective to a project, and each looks at the system in different ways at different times (Booch, 1999). The legacy system is critical to the business and its organisations, and it still accomplishes its specified tasks, although ineffectively, inefficiently and riskily. The direct users, the indirect users, the managers of the company, the supporters who provide help, and the maintainers of the legacy system are all interactors.

4.6. Regarding Human Beings as the Most Important Actors

It is essential to regard one of these interactors as an actor. If there are several candidates, humans should be considered the most important actor candidates. Because only humans can use the software, directly or indirectly, for their own purposes, it is best to take into account all the users, the maintainers, the managers of the company, and even the business organisations. When choosing actors, the relative importance of the candidates should be weighed. All candidate actors are important to the legacy system, but they are not equally important. The direct end users are the most important of all the candidate actors. In most cases, they use the legacy system and determine when it is executed. The operational results are reported directly to them, and they then decide what should be done next. Next in importance may be the company managers, because they not only use the legacy system indirectly but also manage its direct end users. They may plan a business project based on the output of the legacy system. Both the direct end users and the company managers use the legacy system for commercial purposes, and in some cases they may be the same people. After them come the indirect users and the maintainers, who in most cases do not use the legacy system for business purposes.

4.7. Refining Hardware and Other Systems as Actors

Interactors other than human beings are also considered and chosen as actors, including the hardware that executes the legacy system and the other systems that interact with it. These actors describe how the legacy system is executed on the hardware and what messages are exchanged with other systems. Any system that invokes the legacy system is regarded as an actor.

4.8. Defining the Main Tasks

The selection of the actors is based on the specification of the legacy system, which contains the functionality of the system. The legacy system performs the main tasks that represent its functionality, and the choice of actors is closely related to these tasks. The main purpose of the users of the legacy system, including the direct and indirect end users and the company managers, is the commercial success of the business. Other people also benefit from the operation of the legacy system. Other systems and the hardware support it. The legacy system is responsible for accomplishing the expected tasks for the selected actor. It is important to define what that actor wants to do with the system. The demands of that actor must be defined and analysed in order to resolve conflicts and remove ambiguities. The starting point is an idea that gives a vague notion of the main task that the actor wants the legacy software to do, which may contain ambiguities, inconsistencies, confusion, and incompleteness (Yang, 2000). It is necessary to gather all the demands of that actor; to identify the functional operations of the existing legacy system that serve that actor’s purpose, together with the quality attributes of each part, including performance, safety, portability, and environmental concerns; and to form viewpoints that represent all of the parts that interest that actor.

4.9. Introducing Other Tasks

The selected actor may have more than one purpose in using the legacy system. The actor may require the legacy software to input the original records, calculate the interest of the business, query the sequence of the records according to the rate, output the income to a file, and then print it. The demands of that actor on the legacy system should be prioritised in order of importance. Meanwhile, a collection of associated missions may be conflicting and ambiguous. If the specification is simple and the missions are largely unrelated, this works well. But when, during the specification of the legacy system, the missions are closely related or overlap, it is necessary to understand the overall tasks of the legacy system and to distinguish the basic, usual task from the others. In order to create one definitive task, the various demands of that actor on the system must be compared, and any conflicts, ambiguities or overlaps must be identified and removed. The purposes for which that actor runs the legacy system should be regarded as candidates for the functional behaviour of the legacy system in that actor’s use cases. Each of the things that the actor wants the system to do is a candidate use case.

4.10. Identifying Use Cases

A use case represents a task that the users of the legacy system, who are the selected actor, want the system to do (Howe, 2002). A use case contains a particular function of the legacy system, specified as a usage. As a user-centred analysis technique, the purpose of a use case is to yield a result of measurable value to the actor in response to that actor’s request in a commercial setting. The tasks related to the selected actor within a large, complex legacy system should lead to the development of a complete, detailed use case for that actor (Reed, 1998). Identifying use cases relies heavily on human judgement from the external users’ point of view, and is also carried out in the business domain. Once one use case of the selected actor has been extracted, the other tasks related to the same actor are extracted in turn and serve as further use cases of that actor. In this way all the use cases related to the same selected actor are extracted and defined. In a legacy system, system calls are regarded as use cases. These calls invoke the operating system to perform specific tasks. They pass parameters to the operating environment and obtain results from its feedback. The detail of how the operating environment performs the task is not shown. The procedures of the legacy system differ from these system calls, but they also serve as use cases. They present the way in which the tasks are performed, and include subprograms and functions.

4.11. Producing Primary Use Cases

Each use case represents one task of the legacy system. It is essential to depict the basic course in the description for the selected actor. The basic course represents the main task that the selected actor performs with the legacy system. The main task should be handled first, and the corresponding use case is the essential one. The most usual task, that is, the most frequently used use case, is regarded as the primary use case of the selected actor. It is therefore critical to decide which course is most usual when the selected actor uses the legacy system. The use cases of the selected actor differ in importance, so it is useful to classify them. Use cases are categorized as primary use cases, which are main functions, and secondary use cases, which are secondary to the system or rarely occur. A primary use case is presented as a brief description of the main processes used to accomplish the system function (Graham, 2000). Each of the major processes corresponds to a primary use case of the modelled system. In a legacy system, a system call is a primary use case if it is not a report of an execution failure and it is a main task. Likewise, a procedure designed to accomplish a specific task is a primary use case if it is not used to handle an execution failure and it is a main function. A set of sequential lines that together perform one main task is also regarded as a primary use case.

4.12. Yielding Secondary Use Cases

After the basic task of the selected actor has been produced as the primary use case, the other tasks, which seldom occur or which may result in failure of the legacy system, are refined and regarded as alternative use cases. These other tasks are accomplished alternatively, their courses are specified as alternative use cases, and they serve as the secondary use cases. In a legacy system, a system call is a secondary use case if it is a report of an execution failure, or it is a secondary task, or it seldom happens. A procedure is a secondary use case if it is used to handle an execution failure, or it is a secondary function, or it is not usually invoked or executed. A set of sequential lines that deals with an execution failure, performs a secondary task, or copes with an unusual occasion is regarded as a secondary use case.
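
The primary and secondary criteria can be stated as a single decision rule. The sketch below assumes that an analyst has already judged, for each task (a call, a procedure, or a set of sequential lines), whether it handles an execution failure, whether it is a main task, and whether it rarely occurs; the `Task` class, its field names, and the example tasks are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    handles_failure: bool = False   # reports or handles an execution failure
    is_main_task: bool = True       # one of the main functions of the system
    rarely_occurs: bool = False     # seldom invoked or executed


def use_case_type(task: Task) -> str:
    """Return "primary" or "secondary" according to the criteria in the text."""
    if task.handles_failure or not task.is_main_task or task.rarely_occurs:
        return "secondary"
    return "primary"


# Hypothetical tasks extracted from a legacy program.
for task in [
    Task("Calculate interest"),
    Task("Report file-open failure", handles_failure=True, is_main_task=False),
    Task("Rebuild index", rarely_occurs=True),
]:
    print(task.name, "->", use_case_type(task))
```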

4.13. Describing Use Cases

The documentation of a use case contains the name of the process (also called the use case name), the actors that interact with the system, including the initiating actor, the type of use case, and the description of the process. The syntax of a primary use case is:
- Name: system action
- Actors: Actor1 (Initiator), Actor2, other systems
- Type: primary
- Description: the use case begins when Actor1 interacts with the system.

Secondary use cases are described in the same way, except that the type is “secondary”. It is important to check each use case description against the descriptions of the other use cases, and to differentiate the mission of one use case from that of another. Overlap, incompleteness, inconsistency and ambiguity should be removed. It is essential to pick out any commonality and extract it as “used” use cases. This is an efficient way of finding “used” use cases (Rubin, 1998).
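
The four-field description format can be captured as a small record that renders itself in that layout. This is an illustrative sketch; the class name `UseCaseDescription`, its field names, and the example use case are assumptions, and the default description simply reuses the opening phrase from the template.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class UseCaseDescription:
    name: str                       # the system action
    initiator: str                  # the initiating actor
    other_actors: List[str] = field(default_factory=list)
    kind: str = "primary"           # "primary" or "secondary"
    description: str = ""

    def render(self) -> str:
        actors = ", ".join([f"{self.initiator} (Initiator)"] + self.other_actors)
        description = self.description or (
            f"the use case begins when {self.initiator} interacts with the system."
        )
        return "\n".join([
            f"Name: {self.name}",
            f"Actors: {actors}",
            f"Type: {self.kind}",
            f"Description: {description}",
        ])


# Hypothetical primary use case.
print(UseCaseDescription("Calculate Interest", "Clerk", ["Printer"]).render())
```

Storing descriptions in a uniform structure makes it easier to compare one use case against the others when looking for overlap or common behaviour.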

4.14. Generating All Actors and Use Cases

The descriptions above can help to find the other actors of the legacy system and their use cases. The relationships among the actors and the use cases of the legacy system are then identified (Graham, 2001). While use cases are being produced, one use case may turn out to be related to more than one actor: it may be invoked by two or more actors. Similarly, one procedure in the legacy system may be called by two or more other procedures and performed whenever any of them needs it. Therefore, when producing the use cases of different actors, it is important to record the interactions between actors and use cases, and to decide whether each use case is invoked by only one actor or by several.
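
Recording which actors invoke which use cases, and then finding the use cases shared by several actors, can be sketched as follows. The input format (a list of actor and use case pairs), the function names, and the example data are assumptions made for illustration.

```python
from collections import defaultdict


def actors_per_use_case(interactions):
    """Build a table of use case -> set of actors that invoke it.

    interactions: iterable of (actor, use_case) pairs recorded during analysis.
    """
    table = defaultdict(set)
    for actor, use_case in interactions:
        table[use_case].add(actor)
    return dict(table)


def shared_use_cases(interactions):
    """Return the use cases that are invoked by more than one actor."""
    table = actors_per_use_case(interactions)
    return sorted(uc for uc, actors in table.items() if len(actors) > 1)


# Hypothetical interactions recorded for a legacy accounting system.
recorded = [
    ("Clerk", "Enter Records"),
    ("Clerk", "Print Income"),
    ("Manager", "Print Income"),
    ("Manager", "Query Records"),
]
print(shared_use_cases(recorded))
```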

4.15. Building up the Relationships

Once all the use cases of the selected actor have been identified, the relationships among those use cases are built up as extend or include relationships. The relationship between the selected actor and each of its use cases is an association. An association between an actor and a use case means that the actor, that is, the user of the legacy system, initiates the use case or that messages are transferred between the actor and the use case. An extend relationship (Booch, 1999) between use cases means that the base use case implicitly incorporates the behaviour of another use case at a location specified indirectly by the extending use case. The base use case may stand alone, but under certain conditions its behaviour may be extended by the behaviour of another use case. An extend relationship can model a part of the use case that the user may regard as optional system behaviour. An include relationship between use cases means that the base use case explicitly incorporates the behaviour of another use case at a location specified in the base (Larman, 1998). Similarities between actors should be modelled as generalisations between those actors. A generalisation relationship means that the child actor inherits the behaviour and meaning of the parent actor. When a use case may be invoked across several use case steps, <<extend>> is applied. The stereotype <<include>> is adopted when one use case is invoked by another.
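
The relationship kinds described above constrain which model elements may be connected. The following sketch checks a proposed relationship against those constraints: an association joins an actor and a use case, include and extend join two use cases, and a generalisation joins two elements of the same kind. The function name `check_relationship` and the string labels are hypothetical, and allowing generalisation between two use cases as well as between two actors is an assumption of this sketch.

```python
def check_relationship(kind, source_kind, target_kind):
    """Return True if the relationship kind is allowed between the two elements.

    kind: "association", "include", "extend" or "generalisation".
    source_kind, target_kind: "actor" or "use case".
    """
    if kind == "association":
        # An association links an actor with a use case, in either direction.
        return {source_kind, target_kind} == {"actor", "use case"}
    if kind in ("include", "extend"):
        # Include and extend are relationships between use cases only.
        return source_kind == "use case" and target_kind == "use case"
    if kind == "generalisation":
        # A child inherits from a parent of the same kind.
        return source_kind == target_kind
    raise ValueError(f"unknown relationship kind: {kind}")


print(check_relationship("association", "actor", "use case"))
print(check_relationship("include", "actor", "use case"))
```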

4.16. Drawing Use Case Diagrams

Use case diagrams present the use cases and actors together with their relationships. In use case diagrams, use case names should begin with a strong verb and be consistent with the domain terminology. When a use case diagram is drawn, the importance and time sequence of the use cases should be taken into account (Ambler, 2002). An actor represents a coherent set of roles that users of the use cases play, not positions. An actor’s name should be a singular, business-relevant noun. Actors are drawn outside the system boundary of the use case diagram, and primary actors are placed in the top-left corner of the diagram. An actor can interact with one or more use cases, but it cannot interact with another actor. Relationships between actors other than generalisation should not be depicted in use case diagrams, even if the actors do in fact interact with each other. Such interactions are recorded in the text accompanying the use case diagrams. A system actor is stereotyped with “system” (Heywood, 2002). The use case diagrams should be consistent with the actors of the legacy system and their use cases. To keep a use case diagram clear and simple, more than four levels of use case associations should be avoided. When a use case diagram of a legacy system is drawn, an included use case is placed to the right of the invoking use case, an extending use case below the parent use case, and an inheriting use case below the base use case. Consistency between parent and child actors, and between parent and child use cases, is essential: the names, meanings and domains of these actors or use cases must be consistent with each other. Consistency among the use cases and use case diagrams of the same legacy system is essential to understanding those diagrams. The system boundary of the legacy system indicates the scope of the modelled programs.

A use case diagram shows only the interactions of the actors of the legacy system with the code. It should not include the boundaries of other software and programs. Many tools help to draw use case diagrams, including Rational Rose, UMLet, Artisan Real-time Modeller and Real-time Studio Professional, Atos Origin Delphia Object Modeller, Documentator, Logic Explorers Code Logic, MasterCraft Component Modeller, and others. Many tools support reverse engineering, forward engineering, and reengineering.

5. Achieving Collaboration Diagrams from Legacy System

Because a UML collaboration diagram is equivalent to a sequence diagram, it is produced using the same steps as the sequence diagram. In a collaboration diagram, the boxes represent roles, which are named as objects, and the solid lines are association paths representing association roles. Layering is a common way of organizing systems. It therefore makes sense to layer the collaboration diagrams of a legacy system in a similar manner, based on the layers of the program calls in the legacy system. The root program element is regarded as the first and most important collaboration diagram. Other program elements are included in that diagram. The node program elements are presented before the leaf and isolated program elements (Fowler, 2000).

The message name is justified and aligned with the arrowhead. The receiver of the message implements the corresponding operation, so it makes sense for the message name to be close to that classifier. The syntax of the implementation language of the legacy system is used in naming the messages, which improves understandability and readability. In a collaboration diagram, an object receives messages and invokes operations in time order: the next message can be performed only after the previous message has been executed. The order in which objects execute messages is therefore shown clearly by the numbering of the messages on collaboration diagrams. Return values are common in legacy systems. When an object finishes processing a message, control returns to the sender of the message. This marks the end of the activation corresponding to that message, and is shown by an arrow going from the receiving object back to the object that sent the message. Most messages do not have return values modelled for them. This is the common modelling convention because it makes it easy to determine visually the amount of message flow to a given object, and thus to judge the potential coupling it is involved in, which is often an important consideration when refactoring the design. Activations and return messages are optional on a collaboration diagram. Return messages are optionally indicated using an arrow with a label indicating the return value. When return values are referred to in a later part of the collaboration diagram, it is necessary to model them. Otherwise, they are omitted in order to keep the collaboration diagrams of the legacy system clear and simple.

The lines between the classifiers depicted on a UML collaboration diagram represent instances of the relationships between classifiers, including associations, aggregations, compositions, and dependencies. Relationship details, such as the multiplicities, the association roles, or the name of the relationship, are typically not modelled on links within collaboration diagrams. Roles are indicated using two styles: on links and within a class. The link-based approach is more common than the class-based role notation. The links on a collaboration diagram must reflect the relationships between classes in the UML class diagrams. The only way for an object to collaborate with another is for it to know about that other object. This implies that there must be an association, aggregation, or composition relationship between the two classes, a dependency relationship, or an implied relationship. Sometimes it is difficult to validate the consistency between these diagrams, particularly if the class diagrams do not model all of the dependencies or implied relationships.
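
The consistency requirement between collaboration links and class diagram relationships can be checked mechanically once both have been recorded. The sketch below reports every collaboration link for which no relationship between the two classes is known. The input formats, the function name `unsupported_links`, and the class names are assumptions; a reported link is not necessarily an error, since it may correspond to a dependency or implied relationship that the class diagram does not model.

```python
def unsupported_links(links, class_relationships):
    """Return collaboration links with no matching class relationship.

    links: iterable of (sender_class, receiver_class) pairs taken from a
           collaboration diagram.
    class_relationships: iterable of (class_a, class_b, kind) triples taken
           from the class diagram; kind is e.g. "association" or "dependency".
    """
    # Relationships are treated as undirected for the purpose of this check.
    known = {frozenset((a, b)) for a, b, _ in class_relationships}
    missing = {
        (sender, receiver)
        for sender, receiver in links
        if sender != receiver and frozenset((sender, receiver)) not in known
    }
    return sorted(missing)


# Hypothetical diagrams for a small accounting program.
example_links = [("Ledger", "RateTable"), ("Ledger", "Printer"), ("Ledger", "Ledger")]
example_relationships = [("Ledger", "RateTable", "association")]
print(unsupported_links(example_links, example_relationships))
```

Messages that an object sends to itself are skipped, because they need no relationship to another class.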

6. Acquiring Statecharts

Statecharts differ from activity diagrams in that statecharts must model events external to the system in addition to the system’s internal processing. The reverse engineering process has difficulty determining these external events because they are often not apparent from system artefacts such as source code. In order to determine external events, a run-time environment for the legacy system must be created and then, with a comprehensive set of test data, the event/condition sequence that the legacy system outputs in response to this test data is recorded and analysed. Systa describes a process of extracting state diagrams from legacy systems by first creating an event trace diagram that models the interaction of a set of objects and actors during a specific usage of a system, and then using this event trace diagram to create a state diagram. Given a comprehensive set of test cases that covers all possible behaviours of the system, the legacy system is run using these test cases as input. The running system is monitored for the event and condition sequences produced by its objects. Each event/condition sequence is sent to a tool, SCED, which constructs a scenario diagram that models the interactions of a set of objects implied by the sequence. SCED is then used to synthesize the general behaviour of an object as a state diagram, given the set of scenario diagrams in which the object participates (Systa, 1997; Booch, 1999).
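
The idea of merging several recorded traces into one state diagram can be illustrated with a deliberately simplified sketch. It is not the synthesis algorithm used by SCED. It assumes that each recorded step already pairs an event with the condition (treated here as the state) that holds afterwards, and that every trace begins in the same start state; the function name `merge_traces` and the example states and events are hypothetical.

```python
def merge_traces(traces, start):
    """Merge event traces of one object into a transition table.

    traces: list of traces; each trace is a list of (event, next_state) steps.
    start:  the state in which every trace is assumed to begin.
    Returns a dict mapping (state, event) -> set of observed next states.
    A set with more than one member signals conflicting observations.
    """
    transitions = {}
    for trace in traces:
        state = start
        for event, next_state in trace:
            transitions.setdefault((state, event), set()).add(next_state)
            state = next_state
    return transitions


# Hypothetical traces recorded from two test cases.
recorded_traces = [
    [("insert_card", "CardInserted"), ("enter_pin", "Authenticated"), ("eject", "Idle")],
    [("insert_card", "CardInserted"), ("cancel", "Idle")],
]
for (state, event), targets in sorted(merge_traces(recorded_traces, "Idle").items()):
    print(state, "--", event, "->", sorted(targets))
```

Each additional test case can only add transitions to the table, which is why comprehensive test coverage matters for the completeness of the resulting diagram.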

7. A Reverse Engineering Tool, TAGDUR

According to Hausi A. Müller, reengineering tools may be categorised in several different ways according to their function. Some tools function as analysis tools by extracting artefacts, such as call graphs and metrics, from the legacy system. Other tools function as an understanding environment which parses the legacy system and stores the extracted software artefacts in a repository for querying, behavioural pattern matching, and abstract representation. Still other tools offer an integrated forward and reverse engineering environment that combines analysis and understanding tools with the ability to generate code. Furthermore, tools can be designed for scale, extensibility, or applicability and can be integrated along control, data, and presentation lines (Müller, 2004).

Integrisoft’s Hindsight tool is designed for program understanding, with the ability to provide documentation of the program’s control flow, data structures, test coverage, and complexity. Viasoft’s Existing Program Workbench (EPW) tool is a parsing engine that also provides documentation of the program’s control and data flow. EPW decomposes a large COBOL program into smaller, more manageable units through program slicing and code extraction. Logiscope is a program analysis tool. IDE’s StP/SE and StP/RevC form an integrated forward and reverse engineering toolset for C. McCabe’s Visual Reengineering Toolset analyses systems written in multiple programming languages such as C, COBOL, and Fortran. Reasoning’s Software Refinery generates tools for reverse engineering. This tool has the features of executable program specifications and rule-based program transformations.

Modelling a system as UML diagrams during the reverse engineering process poses some particular problems. UML was designed to model object-oriented systems; legacy systems, the target of most reverse engineering efforts, tend to be procedural rather than object-oriented. Furthermore, these legacy systems tend to be designed to operate in a strictly sequential manner and to respond to procedural invocations rather than events. In order to enable this type of legacy system to be modelled in UML, it is important first to transform the original legacy system from a procedurally structured, strictly sequential design into an object-oriented, event-driven system. TAGDUR is a forward and reverse engineering toolset which, combined with FermaT, is capable of program analysis, dead code elimination, rule-based program analysis and transformation, and code generation. TAGDUR also documents the transformed system via a series of UML diagrams.

7.1 Transforming Procedural Legacy Systems into Object-Oriented Systems

Before the transformation, we first convert the legacy system from its original programming language into WSL, the Wide Spectrum Language, using a set of conversion rules particular to that original programming language. These rules were formulated using Martin Ward’s paper, The Syntax and Semantics of the Wide Spectrum Language, which defined the basis of WSL. WSL is a mathematical, intermediate language. It is called wide spectrum because it can represent both high-level and low-level constructs. Consequently, WSL is well suited to representing programs written in a wide range of programming languages. Furthermore, this conversion gives the converted system an implementation independence: any transformation of the program is independent of the programming language in which the system was originally developed.

WSL was chosen as the intermediate language for several reasons. WSL is programming-language and platform independent. Consequently, the transformations and modelling that TAGDUR performs on a WSL-represented system can be performed regardless of whether the original legacy system was written in COBOL or C; the original legacy system need only be converted into WSL first. WSL also has excellent tool support in the FermaT transformation system, which allows transformations and code simplification to be carried out automatically. It enables proof-of-correctness testing, and it was specifically designed to be easy to analyse and transform.

This transformation process involves three main steps. The first step is object identification, in which the degree of coupling and cohesion between variables and procedures is analysed. Closely related variables and procedures are grouped into objects, with the variables becoming the object’s attributes and the procedures becoming the object’s methods (Millham, 2002).

The next step of this transformation process is to analyse two or more normally sequential units (whether these units are procedures, program blocks, or individual code lines) for dependencies. If there are no dependencies, the units being evaluated are deemed able to execute independently; otherwise, the units are deemed to execute sequentially only. A dependency is defined as a simultaneous read and write operation on the same shared variable by two or more granular units of execution. Simultaneous reads and writes of the same shared variable result in an inconsistent state. For example, the value read by task A may differ depending on whether parallel-executing task B updates this shared variable before or after the variable is read by task A. Because such an inconsistent state cannot be allowed, these tasks must execute in their original sequential order rather than be allowed to execute independently (Millham, 2003a).

The third step in this process is to identify possible events in the system and to model these events as asynchronous or synchronous events in UML. Although some events, such as user input, occur outside the application domain, in many batch-oriented legacy systems most events occur within the application domain. In other words, in batch-oriented legacy systems, the only external event is the arrival of batch input; all other events occur within the application domain. These internal events include input/output operations, procedure invocations, and system interrupts and exceptions. This transformation step consists of parsing the source code in order to identify these possible events. Each event, once identified, is analysed in order to determine whether any dependency exists between the code line where the event occurs and the code lines that immediately follow it. If a dependency exists, which means that the system must wait for the event handler invoked by the event to complete its execution before resuming its normal post-event execution, the event is deemed to be synchronous. Otherwise, the event is deemed to be asynchronous. It is necessary to determine whether each event is asynchronous or synchronous before it can be properly depicted as such in the various UML diagrams that model these events.
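The dependency test in the second step and the synchronous/asynchronous classification in the third step can be illustrated with a minimal Python sketch. It assumes each unit of execution has already been reduced to the sets of variables it reads and writes; the `Unit` type, the function names, and the sample units are hypothetical and are not taken from TAGDUR.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Unit:
    """A granular unit of execution (procedure, block, or code line)."""
    name: str
    reads: frozenset = field(default_factory=frozenset)
    writes: frozenset = field(default_factory=frozenset)


def is_dependent(a: Unit, b: Unit) -> bool:
    """True if one unit writes a shared variable that the other reads.

    Write/write conflicts are not part of the definition given in the
    text, so they are deliberately not checked here.
    """
    return bool(a.writes & b.reads) or bool(a.reads & b.writes)


def classify_event(event: Unit, successor: Unit) -> str:
    """An event is synchronous if the unit that follows it depends on it."""
    return "synchronous" if is_dependent(event, successor) else "asynchronous"


# Hypothetical units: a file read followed by different successor lines.
fetch = Unit("Fetch D1, Y2, Z2",
             reads=frozenset({"Y2", "Z2"}), writes=frozenset({"D1"}))
use_d1 = Unit("J := D1 + X",
              reads=frozenset({"D1", "X"}), writes=frozenset({"J"}))
set_m = Unit("M := N + 2", reads=frozenset({"N"}), writes=frozenset({"M"}))
set_k = Unit("K := 3", writes=frozenset({"K"}))

print(is_dependent(set_m, set_k))     # False: the two lines share no variable
print(classify_event(fetch, use_d1))  # synchronous: J needs the fetched D1
print(classify_event(fetch, set_m))   # asynchronous: M does not use D1
```

In this sketch a unit is compared only with its immediate successor; a fuller analysis would presumably compare every pair of units in the range being considered.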

7.2 Creating UML Diagrams Using TAGDUR

In many legacy systems, finding system artefacts to use as a basis for modelling UML diagrams of the system is difficult. Documentation of the system often does not exist. The original developers and end-users, who would be most knowledgeable about the design of the system, have long since left the organization. Often these systems have been left in light maintenance mode for many years; consequently, current maintainers and end-users have minimal knowledge of the system. Because the source code, along with the associated data files, is the only available system artefact, any UML diagrams that are generated to model this system must be based primarily on source code.

7.2.1 Class Diagrams

Class diagrams represent the static structure of the system. Class diagrams convey information about the classes used in the system, such as their properties, their interfaces, and how these classes interact with one another. Associations are relationships between instances of two classes. After our reverse engineering process transforms the legacy system from its procedural structure to an object-oriented one, our tool extracts the class diagram from the transformed system. Class definitions from the system are modelled as classes in the UML class diagram; variables encapsulated within a class become class attributes, and procedures associated with a class become methods in the UML class diagram.

Classes in this system are grouped into UML packages. A package, in this legacy system example, corresponds to an original COBOL copybook. The assumption is that the original programmers divided the system into modules, in this case COBOL copybooks, according to some logical criteria. This logical modularization of the system is preserved in the form of packages in the UML diagrams. These packages, with their classes, can also be considered frameworks of classes, with each framework retaining the logical partitioning criteria that divided the original system.

Classes that have been identified within the legacy system may be aggregated into a super-class hierarchy using a number of criteria. Highly coupled groups of classes may be grouped into a super-class hierarchical structure. Since this legacy system makes much use of logical file names rather than physical file names, I/O operations may be modelled as interactions between the class whose method invokes the I/O operation and a class deemed to represent the logical file. Because many of these logical file classes may refer to the same physical file, the logical filename classes may be deemed to be sub-classes of the physical filename class. The physical filename class is, in turn, deemed to be a sub-class of the File class itself.

[Figure: class diagram showing the pre-defined object FILE; the physical file objects F101 and F103; the logical file classes TRANS-MASTER, RPT-MASTER, 1000-RPT, and 1500-TRANS; and the application classes Class A, Class B, Class C, and Class D. Association ends are labelled with the multiplicities 1 and *.]

In this example, four classes, whose methods each invoke an I/O operation, interact with a logical filename class. These logical filename classes, in turn, are grouped, as part of a compositional relationship, with their physical filename classes, F101 and F102. These physical filename classes are then grouped, as part of a compositional relationship, with the pre-defined super-class, File.

Our tool models accesses by one class of another class’s attributes or methods as static associations between the classes. Each end of an association carries a multiplicity; the multiplicity of an association end is the number of possible instances of the class associated with a single instance of the other end. Depending on the ratio of the classes accessing the attributes or methods to the classes accessed, the multiplicity of these association ends may be modelled as many-to-one, one-to-one, and so on. For example, if Class A accesses multiple methods and attributes of Class B, this association end would be modelled as many-to-one. If Class A accesses only one attribute of Class B, then this association end would be modelled with one-to-one multiplicity.

The information gained during the transformation process of this system is used in modelling these multiplicities. During the object identification process, TAGDUR constructs two matrices: a procedural usage grid, which records the number of times procedure A is called by procedure B, and a variable usage grid, which records the number of times variable A is accessed within procedure B. These matrices are used during the object identification process, where highly coupled procedures and variables are grouped into classes. They are also used when modelling UML diagrams. Variables or procedures that form attributes or operations of one class, class A, but are accessed by procedures that form operations of another class, class B, are modelled as an association between classes A and B. The two usage matrices are then used to model the multiplicity of these associations: the number of times that all variables and procedures of object A are accessed by procedures of object B determines the type of multiplicity relationship between classes A and B. For example, if the procedures of object B access the variables and procedures of object A ten times, the association between objects A and B is modelled as an association of 1:n multiplicity (Millham, 2003b).
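How the two usage grids might be combined into associations can be sketched in a few lines of Python. The grids, the class membership table, and the names below are hypothetical, and the multiplicity rule (one access gives 1:1, more than one gives 1:n) is an assumption made for illustration; the text does not state the exact thresholds that TAGDUR applies.

```python
from collections import Counter

# Hypothetical result of object identification: member -> owning class.
owner = {
    "MAIN": "A", "TOTAL": "A",
    "UPDATE-REC": "B", "BALANCE": "B",
    "PRINT-RPT": "C",
}

# Procedure usage grid: (calling procedure, called procedure) -> call count.
procedure_usage = {("MAIN", "UPDATE-REC"): 3, ("MAIN", "PRINT-RPT"): 1}
# Variable usage grid: (procedure, variable) -> access count.
variable_usage = {("MAIN", "BALANCE"): 7, ("MAIN", "TOTAL"): 4}


def association_counts(owner, procedure_usage, variable_usage):
    """Count cross-class accesses, keyed by (accessing class, accessed class)."""
    counts = Counter()
    for grid in (procedure_usage, variable_usage):
        for (user, used), n in grid.items():
            src, dst = owner[user], owner[used]
            if src != dst:  # accesses within one class are not associations
                counts[(src, dst)] += n
    return counts


def multiplicity(access_count):
    """Assumed rule: a single access is 1:1, repeated access is 1:n."""
    return "1:1" if access_count == 1 else "1:n"


counts = association_counts(owner, procedure_usage, variable_usage)
for (src, dst), n in sorted(counts.items()):
    print(f"{src} -> {dst}: {n} accesses, multiplicity {multiplicity(n)}")
```

Here class A accesses members of class B ten times in total (three calls plus seven variable accesses), which corresponds to the 1:n case in the example above, while the single call into class C yields 1:1.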

7.2.2 Sequence Diagrams

A sequence diagram is an interaction diagram that details how operations are carried out: what messages are sent and when. Sequence diagrams are organized according to time, which progresses down the page. The objects involved in the operation are listed from left to right according to when they take part in the message sequence. Our sequence diagrams are laid out as one line that contains all the objects in the system, with each object in its own swimlane. Because variables in COBOL are globally scoped, all objects have global scope; these objects are therefore shown as being activated immediately after their declaration in the swimlane. Messages are depicted with their origin at the source object and their endpoint at the target object. Procedure calls between objects are depicted as messages between the source, or caller, object and the target, or callee, object. Exceptions and system interrupts are depicted as messages originating in the object where the exception or interrupt occurred (the source object) and ending at the System object (the target object), which handles exceptions and interrupts. Procedure I/O calls, such as Put or Fetch statements, are modelled as messages between the source object, where the I/O call is invoked, and the File object, the target object.

The process that determines independent tasks assigns sequence numbers at both the procedural and the individual code line granularity. The order of the sequence numbers indicates the order of execution: the tasks with the lowest sequence number are executed first, followed by the tasks with the next lowest sequence number, and so on. Tasks with the same sequence number may be executed in parallel. Each message depicted in the sequence diagram is given a sequence number in the form <procedure task sequence number>.<individual code line task sequence number>. The procedure task sequence number is the sequence number of the procedure where the message is invoked, and the individual code line task sequence number is the sequence number of the code line where the message is invoked. Messages are depicted as synchronous or asynchronous: if the task immediately succeeding the message task depends on the message finishing execution first, as determined during the identification of independent tasks, the message is modelled as synchronous; if no such dependency exists, it is modelled as asynchronous. We model sequence diagrams as message passing between objects of the system (Millham, 2003b).

[Figure: sequence diagram with the swimlanes ObjectA, ObjectB, File Object, and System Object, showing four messages: a function call from A to B, the return value of the function call from A to B, an asynchronous procedure call from B to File Object, and an exception occurrence from ObjectA to System Object.]
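The sequence numbering described above can be sketched as follows. The text does not give the algorithm that TAGDUR uses, so this Python example assumes a simple levelling rule: a task receives a number one higher than the highest-numbered earlier task it depends on, and 1 if it depends on none. The task names and their read/write sets are hypothetical, and the label format assumes that the procedure-level and code-line-level numbers are joined by a dot.

```python
def assign_sequence_numbers(tasks):
    """Assign sequence numbers to tasks given in original sequential order.

    tasks: list of (name, reads, writes), where reads and writes are sets
    of variable names. Returns {name: number}; tasks that share a number
    have no read/write conflict on their path and may run in parallel.
    """
    numbers = {}
    placed = []  # (number, reads, writes) of tasks already numbered
    for name, reads, writes in tasks:
        number = 1
        for prev_number, prev_reads, prev_writes in placed:
            if (writes & prev_reads) or (reads & prev_writes):
                number = max(number, prev_number + 1)
        numbers[name] = number
        placed.append((number, reads, writes))
    return numbers


def message_label(procedure_seq, codeline_seq):
    """Build a two-level message label (assumed dot-separated)."""
    return f"{procedure_seq}.{codeline_seq}"


# Hypothetical tasks in their original order.
tasks = [
    ("READ-INPUT", set(), {"REC"}),
    ("CALC-TAX", {"REC"}, {"TAX"}),
    ("CALC-FEE", {"REC"}, {"FEE"}),
    ("WRITE-OUT", {"TAX", "FEE"}, {"OUT"}),
]
numbers = assign_sequence_numbers(tasks)
print(numbers)
print(message_label(2, 1))
```

CALC-TAX and CALC-FEE both depend only on READ-INPUT, so they receive the same number and could be drawn as parallel messages, while WRITE-OUT must wait for both.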

7.2.3 Component Diagrams

Component diagrams model software components and their relationships within the implementation environment. These components may be simple files or dynamic libraries. Relationships between components are modelled as dependency relationships; a relationship between two components is identified when one component offers services to other components. Generally, these dependencies represent compilation dependencies. Several components may be grouped into packages, or subsystems, according to some logical criteria. TAGDUR models component diagrams in its re-engineering process. Often a legacy system consists of a main program file that calls, or loads, several program sub-files which, in turn, load other program sub-files. This sub-file call hierarchy is identified by parsing the source code of each legacy file in order to determine which sub-files are loaded from which files; the resulting call graph is modelled as a component diagram (Millham, 2004).
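As a rough illustration of how a sub-file hierarchy could be recovered, the following Python sketch scans source text for COBOL COPY statements and records one dependency per included copybook. It assumes that sub-files are included through simple `COPY name.` statements; the file names and contents are hypothetical, and the regular expression ignores COPY options such as REPLACING. It is not TAGDUR's parser.

```python
import re

# Hypothetical source files, keyed by file name.
sources = {
    "MAIN.cbl": "COPY CUSTREC.\nCOPY RPTFMT.\nPROCEDURE DIVISION.",
    "CUSTREC": "COPY ADDRREC.",
    "RPTFMT": "01 RPT-LINE PIC X(80).",
    "ADDRREC": "01 ADDR PIC X(40).",
}

# Matches a COPY statement at the start of a line and captures the name.
COPY_RE = re.compile(r"^\s*COPY\s+([A-Z0-9-]+)", re.IGNORECASE | re.MULTILINE)


def component_dependencies(sources):
    """Map each file to the list of sub-files it includes."""
    return {name: COPY_RE.findall(text) for name, text in sources.items()}


deps = component_dependencies(sources)
for component, used in deps.items():
    for sub_file in used:
        print(f"{component} depends on {sub_file}")
```

Each printed pair corresponds to one dependency arrow between two components in the resulting diagram.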

7.2.4 Deployment Diagrams

Deployment diagrams show the physical configurations of software and hardware. Our tool derives deployment diagrams by parsing the source code to identify any relationships between the classes or packages that contain the source code and the external entities to which this source code refers, such as a file system and its physical files, peripheral devices such as a printer, or user interfaces. For example, a WSL statement in the source code might access a Terminal device. Parsing by the tool will reveal the relationship between the class containing this WSL statement and the Terminal object, and this relationship is then depicted in the deployment diagram. Physical devices, such as printers, are modelled as nodes in deployment diagrams. Program modules, such as the original COBOL copybooks, are modelled as component instances. These modules are composite parts of the legacy system (the original source code programs) (Millham, 2004).

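[Figure: deployment diagram showing MainCOBOLProgram, its component instances CopybookA and CopybookB, and the external entities File System and Database, with * multiplicities on the association ends.]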

7.2.5 Activity Diagrams

Activity diagrams describe the internal behaviour of a class method as a sequence of steps. This sequence of steps models the dynamic, or behavioural, view of a system, in contrast to class diagrams, which model the static, or structural, view of the system. An activity in UML represents a step in the execution of a business process. Activities are linked by connections, called transitions, which connect an activity to its next activity. The transitions between activities may be guarded by mutually exclusive Boolean conditions. These conditions determine which control flows and activities are selected. Activity diagrams may contain action states, which are states that model atomic actions or operations. Activity diagrams may also contain events. Activity diagrams can be partitioned into object swimlanes; each activity is placed in the swimlane of the object in which it occurs.

Tasks that the independent task evaluation step of the transformation process has determined can execute in parallel are modelled as parallel activities and flows in the activity diagram, while tasks that can execute only sequentially are modelled as sequential activities and flows. In activity diagrams, synchronization bars are used to synchronise the divergence of sequential activities into parallel tasks and the merging of parallel tasks into a sequential task. These bars enable the control flow to transition to several parallel activities simultaneously and ensure that all parallel tasks complete before the next sequential task is executed.

Our activity diagrams are code-based. Each activity represents an atomic WSL statement. Because the WSL code lines of a procedure form steps in the execution of this procedure, and because an individual WSL code line forms an atomic unit of execution, individual WSL code lines are a logical basis for the nodes of an activity diagram. Conditions within WSL control constructs, such as WSL’s if-then statements, form the guard conditions that govern the flow of control to the activities enclosed by the condition blocks of the control construct.

One might question why TAGDUR generates activity diagrams from WSL code rather than simply allowing developers to view the WSL or generated C++ code of the transformed system. Activity diagrams were chosen for several reasons. UML is widely understood by developers, whereas WSL and, to a much lesser extent, C++ are less universally understood. Furthermore, activity diagrams clearly represent the interaction among objects and the occurrences of events among activities; these would be much less apparent if the developer were simply perusing the code of the transformed system. Although our activity diagram is based on WSL code and individual WSL code lines are used to distinguish action states, any unfamiliarity with WSL is mitigated by attached comments which describe the WSL code line being modelled. For example, a file I/O event is described in terms of its type (File I/O), sub-type (Read), and destination, source, and index variables.

The decision to base our activity diagram on WSL, rather than C++, was made for several reasons, including the fact that WSL is programming-language and platform independent. For example, a file I/O operation is represented by one type of WSL statement, while representing the same operation in C++ may take several C++ statements, depending on the type of file being accessed. Consequently, many implementation-specific details that would clutter an activity diagram based on C++ code are avoided if the same activity diagram is based on WSL code instead. A small sample of WSL code is presented with a corresponding UML activity diagram based on this code.

WSL Code Sample:

    X := Y + 1
    If X > 4 Then
        Fetch D1, Y2, Z2
    Else
        Call B.UpdateRec(X)
    Fi
    J := D1 + X
    M := N + 2    /* potentially parallel operation */
    K := 3        /* potentially parallel operation */

The following diagram models the preceding WSL code sample as action states in an activity diagram. Each action state is labelled with the WSL statement whose entry action the state represents. Each action state is placed in the swimlane of the object that produces the action. For example, the invocation of Class B’s UpdateRec method is represented as an action state in the Object B swimlane. An if-then-else WSL construct is modelled in the activity diagram as a branch, with mutually exclusive guard conditions, to two action states. Using these guard conditions, the control flow in the activity diagram is governed in a similar manner to the system that it represents. Potentially parallel WSL code lines, such as “M := N + 2” and “K := 3”, are modelled as parallel control flows emanating from a fork synchronization bar (Millham, 2003b).

[Figure: Activity Diagram Representation of the WSL Code Sample. Swimlanes: ObjectA, ObjectB, ObjectFile. The action state "X := Y + 1" branches on the guards [X > 4] and [Not (X > 4)]. The [X > 4] branch leads to a File I/O event (sub-type: Read; destination variable: D1; source variable: Y2; index variable: Z2; source code: Fetch D1, Y2, Z2). The [Not (X > 4)] branch leads to a method invocation event (procedure: UpdateRec; parameter: X; source code: UpdateRec(X)). Both branches continue to "J := D1 + X", after which "M := N + 2" and "K := 3" appear as parallel flows.]
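The mapping from code lines to guarded transitions can be illustrated with a small Python sketch. It assumes a pre-parsed, flat list of statements in which an if-then-else carries exactly one action per branch; the representation and the function name are hypothetical, and fork and join synchronization bars are left out for brevity.

```python
def build_transitions(statements):
    """Turn a statement list into (source, guard, target) transitions.

    Each statement is either a string (one action state) or a tuple
    ("if", condition, then_action, else_action). A guard of None means
    the transition is unguarded.
    """
    transitions = []
    previous = ["start"]  # action states still waiting for a successor
    for stmt in statements:
        if isinstance(stmt, str):
            for src in previous:
                transitions.append((src, None, stmt))
            previous = [stmt]
        else:
            _, cond, then_action, else_action = stmt
            for src in previous:
                # The two guards are mutually exclusive by construction.
                transitions.append((src, f"[{cond}]", then_action))
                transitions.append((src, f"[Not ({cond})]", else_action))
            previous = [then_action, else_action]
    return transitions


sample = [
    "X := Y + 1",
    ("if", "X > 4", "Fetch D1, Y2, Z2", "Call B.UpdateRec(X)"),
    "J := D1 + X",
]
transitions = build_transitions(sample)
for src, guard, dst in transitions:
    print(f"{src} --{guard or ''}--> {dst}")
```

Both branch targets are joined back to the next statement, which mirrors the way the two guarded action states in the sample both lead to "J := D1 + X".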

7.2.6 Statecharts

TAGDUR does not extract statecharts (which are similar to state diagrams) from source code, but it does derive activity diagrams, which describe a system’s behaviour, from source code. Statecharts differ fundamentally from activity diagrams in that statecharts model external events interacting with system objects, while activity diagrams model the internal processing of the system. Although statechart modelling is a planned future feature of TAGDUR, statecharts may not be necessary to understand the behaviour of many legacy systems. While external events may be a critical part of legacy systems that are highly interactive and reactive, many legacy systems are batch-oriented. In a batch-oriented system, the only external event is the arrival of batch input, which invokes the batch application. Consequently, for batch-oriented legacy systems, activity diagrams are sufficient to describe the behaviour of the system.

7.2.7 Use Case Diagrams

Use case diagrams are valuable for clarifying system requirements with the end-users. However, many other methods, such as textual representation or formal specification methods such as Z, exist to depict the information contained within a use case diagram. Thus, while defining system requirements is a necessary part of the software development process, it is not strictly necessary to use use case diagrams to model this information.

Use case diagrams are very difficult to derive from source code during the reverse engineering process. While it is possible to extract system processes from activity diagrams and to label these processes using some type of artificial intelligence, it is very difficult to derive use cases without significant manual input from the users. For example, the purpose of a process may be derived through such means as natural language parsing and analysis of the programmer comments associated with this process and of the names of the associated procedures. If a procedure is named UPD-ACCT, the reverse engineering tool, through an analysis of this procedure name and concordance with programmer-defined abbreviation standards, may conclude that this procedure’s purpose is to update a bank account. However, this natural language parsing and analysis process is often faulty. Furthermore, a reverse engineering tool cannot easily know, without explicit user input, of actors external to the system. For example, the tool may detect places in the source code which accept user input, but from this source code alone it cannot easily determine what particular type of user might input this information. Consequently, a reverse engineering tool cannot produce a relatively error-free use case diagram from source code. Instead, users must participate heavily in this process, developing use cases from their own system knowledge and from the activity and class diagrams of the system. Because one of the goals of TAGDUR is to automate the derivation of UML diagrams from software architecture as much as possible, use case diagrams, which require so much manual user intervention, are not yet an implemented feature of TAGDUR.
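The kind of name analysis described for UPD-ACCT can be sketched in Python as a lookup against an abbreviation table. The table below is a hypothetical stand-in for programmer-defined abbreviation standards, and the function is an illustration rather than a feature of TAGDUR. Tokens that are not in the table are reported separately, which reflects why such analysis still needs human review.

```python
# Hypothetical abbreviation standard; a real one would come from the
# organization's naming conventions.
ABBREVIATIONS = {
    "UPD": "update",
    "ACCT": "account",
    "RPT": "report",
    "PRT": "print",
}


def describe_procedure(name, abbreviations=ABBREVIATIONS):
    """Expand a hyphenated procedure name into a tentative description.

    Returns (description, unknown_tokens); unknown tokens are kept in
    lower case in the description and listed for manual resolution.
    """
    words, unknown = [], []
    for token in name.upper().split("-"):
        if token in abbreviations:
            words.append(abbreviations[token])
        else:
            words.append(token.lower())
            unknown.append(token)
    return " ".join(words), unknown


print(describe_procedure("UPD-ACCT"))  # ('update account', [])
print(describe_procedure("UPD-XYZ"))   # ('update xyz', ['XYZ'])
```

Even when every token is recognised, the result names only the probable purpose of the procedure; it says nothing about which external actor triggers it, so the actors of a use case would still have to be supplied by users.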

7.3 A Small Case Study

A small case study of a reengineering effort using TAGDUR is presented. The original legacy system is a batch-oriented COBOL system of 1500 source code lines (Hutty, 1997). This COBOL program, after its translation into the intermediate language WSL, is transformed into an object-oriented, event-driven system. This transformed system is then analysed and various UML diagrams are derived from its code. The following statistics are produced:

1. There are 23 classes identified. Of these 23 classes, 20 classes are derived from the application domain and 2 are pre-defined objects (System, File).
2. These classes have a total of 887 attributes and 40 methods. There are 627 associations between different classes; an association exists where one class shares variables or methods of another class.
3. There are 337 events identified. Of these events, the following types occur:
   - 1 external event (the invocation of the application upon arrival of input)
   - 19 error
   - 19 system interrupts
   - 220 file I/O operations
   - 113 method invocations
4. In the activity diagram derived from the source code, there are 1275 action states identified and modelled. There are 41 conditions that govern the execution of these action states.

8. Conclusions

Creating a well-defined system model of a legacy system that is being reverse engineered is crucial in order to understand the structure and dynamics of that system. Visualization of this model through a graphical modelling notation such as UML has an advantage over textual notations in that graphical notations are better able to depict the complex relationships between model elements. UML was chosen as the graphical modelling notation for several reasons, including UML’s good tool support and its ability to support multiple perspectives of the same system. Deriving information from an analysis of a legacy system is a complex process. A theoretical framework to extract UML diagrams from a legacy system is provided. A partial implementation of this framework is available in the TAGDUR tool, which extracts seven of the possible nine UML diagrams from legacy systems. These seven diagrams represent the behavioural, static, dynamic, and architectural views of the system. However, TAGDUR does not extract statecharts or use case diagrams from legacy systems. Statecharts, because they model event-response actions precisely in terms of states, are important in modelling and understanding the behaviour of highly reactive systems. Activity diagrams, which are provided by TAGDUR, are often sufficient to model the behaviour of simpler batch-oriented legacy systems. Use case diagrams are crucial in modelling business processes and in representing the system from an end-user viewpoint. However, use case extraction from legacy systems requires a considerable amount of user intervention. Consequently, use case extraction has not been implemented in TAGDUR.

References

(Ambler,2002) Ambler, S.W. (2002). UML Class Diagramming Guidelines. http://www.modelingstyle.info/useCaseDiagram.html. (Arango, 1986) Arango, G., Baxter, I., Freeman P., & Pidgeon C. (1986). TMM: Software Maintenance by Transformation. IEEE Software, 3(3), 27-39. (Alexander,1977) Alexander, C., Ishikawa S., Silverstein M., Jacobson M., & Fiskdahl-King I. (1977). S Angel A Pattern Language. New York: Oxford University Press. (Baecker,1981) Baecker, R. (1981). Sorting out Sorting Dynamic Graphics Project. University of Toronto: ACM SIGGRAPH ’81 (distributed by Morgan Kaufmann, Los Altos, CA). (Barstow,1985) Barstow, D. (1985). On Convergence Toward A Database of Program Transformations. ACM Transactions on Programming Languages and Systems, 7(1),1-9. (Ben-Menachem,1997) Ben-Menachem, M., & Marliss, G.S. (1997). Software Quality Production Practical, Consistent Software. International Thomson Computer Press. (Bennett,1999) Bennett, S., McRobb, S., & Farmer, R. (1999). Object-Oriented Systems Analysis and Design Using UML. The McGraw-Hill Companies. (Biggerstaff,1989) Biggerstaff, T.J. (1989). Design Recovery for Maintenance and Reuse. IEEE Computer. 22(7). (Booch,1999) Booch, G., Rumbaugh, J., & Jacobson, I. (1999). The Unified Modeling Language User Guide. Addison-Wesley-Longman, Inc. (Breuer,1991) Breuer, P.T., & Lano, K. (1991). Creating Specification from Code: Reverse- Engineering Techniques. Journal of Software Maintenance: Research and Practice. John Wiley and Sons Limited. (Chifosky,1990) Chifosky, E.J., & Cross J.H.II (1990). Reverse Engineering and Design Recovery: A Taxonomy. IEEE Software. 7(1),13-17. (Chen,1986) Chen, Y.F., & Ramanoorthy C.V. (1986). The C Information Abstractor. COMSASC 86. 291-298. (Cotterell,1995) Cotterell, M., & Hughes, B. (1995). Software Project Management. International Thomson Computer Press. (Dorfman,1997) Dorfman, M., & Thayer, R. (1997). Software Engineering. Los Alamitos: IEEE Computer Society Press. 
(D’Souza,1999) D’Souza, D.F., & Wills, A.C. (1999). Objects, Components, and Frameworks with UML. Addison-Wesley. (Fowler, 2000) Fowler, M. (2000). UML Distilled. Second Edition. Addison-Wesley. (Goldstein, ) Goldstein, H.H., & Neumann, J.V. (). Planning and Coding Problems for an Electronic Computing Instrument. New York: McMillan. pp 80-151 (Graham, 2000) Graham, I. (2000). Requiremenst Engineering and Rapid Development. Addison- Wesley. (Graham, 2001) Graham, I. (2001). Object-Oriented Methods Principles & Practice. Third Edition. Addison-Wesley. (Hall, 1992) Hall, P.A.V. (1992). Software Reuse And Reverse Engineering In Practice. Chapman & Hall. (Heywood,2002) Heywood, R. (2002). UML Use Case Diagrams: Tips and FAQ. http://www.andrew.cmu.edu/course/90-754/umlucdfaq.html#actors. (Howe,2002) Howe, D. (2002). FOLDOC: Free On-Line Dictionary of Computing. http://foldoc.doc.ic.ac.uk/foldoc/index.html. (Hutty, 1997) Hutty, R., & Spence, M. (1997). Mastering COBOL Programming. Macmillan Press Ltd. (Jacobson,1992) Jacobson, I., Christerson, M., Jonsson, P., & Qvergaard, G. (1992). Object-Oriented Software Engineering, a Use Case Driven Approach. Addison-Wesley. (Jacobson,1999) Jacobson, I., Booch, G., & Rumbaugh, J. (1999) The Unified Software Development Process. Addison Wesley Longman. (Johnson, 1986) Johnson, W.L. (1986). Intention-Based Diagnosis of Novice Programming Errors. Los Altos, CA: Morgan Kaufmann Publishers. (Johnson,1988) Johnson, R.E., & Foote B. (1988). Designing Resusable Classes. Journal of Object- Oriented Programming. 1(2), 22-35. (Larman,1998) Larman, C. (1998). Applying UML and Patterns--an Introduction to Object-Oriented Analysis and Design. Prentice-Hall, Inc. (Li, 2001) Li, Y., & Yang, H. (2001). Simplicity: A Key Engineering Concept for Program Understanding. International Workshop on Program Comprehension (IWPC01). (Lubara,1991) Lubara, M.D. (1991). Domain Analysis and Domain Engineering in IdeA. 
Domain Analysis and Software Systems Modelling. IEEE Computer Society Press. 163-178. (Millham, 2002) Millham, R “An Investigation: Reengineering Sequential Procedure-Driven Software into Object-Oriented Event-Driven Software through UML Diagrams”. Proceedings of the International Computer Software and Applications Conference (COMPSAC), Oxford, 2002 (Millham, 2003a) Millham, R, H Yang, M P Ward “Determining Granularity of Independent Tasks for Reengineering a Legacy System into an OO System” , Proceedings of the International Computer Software and Applications Conference (COMPSAC), Dallas, Texas, 2003 (Millham, 2003b) Millham, R, H Yang “TAGDUR: A Tool for Producing UML Diagrams Through Reengineering of Legacy Systems”, Proceedings of the 7th IASTED International Conference on Software Engineering and Applications (SEA), Marina del Rey, USA, 2003 (Millham, 2004) Millham, R, J J Pu, H Yang “TAGDUR: A Tool for Producing UML Sequence, Deployment, and Component Diagrams Through Reengineering of Legacy Systems”, Proceedings of the 8th IASTED International Conference on Software Engineering and Applications (SEA), Innsbruck, Austria, 2004 (Muller,1997) Muller, P.A. (1997). Instant UML. Birmingham: Wrox Press. (Müller,2004) Müller, H.A. (2004). Understanding Software Systems Using Reverse Engineering Technologies Research and Practice. http://www.rigi.csc.uvic.ca/UVicRevTut/F6tools.html#Reengineering%20tool%20taxonomy. (Neighbors,1984) Neighbors, J.M. (1984). The Draco Approach to Constructing Software from Reusable Components. IEEE Transactions on Software Engineering. SE-10(5), 564-571. (Price,1992) Price, B., Small I., & Baecker R. (1992). A Taxonomy of Software Visualization. Proc. 25th Hawaii Int. Conf. System Sciences. (Pu, 2003) Pu, J., Millham, R., & Yang, H. (2003). Acquiring Domain Knowledge in Reverse Engineering Legacy Code into UML. Conference of IEEE Software Engineering and Application. (Rajlich,1992) Rajlich, V. (1992). 
Workshop Notes – Program Comprehension. IEEE Computer Society.
(Ramage, 1998) Ramage, M. (1998). Report on the First SEBPC Workshop on Legacy Systems. Durham University.
(Reed, 1998) Reed, P. (1998). The Unified Modelling Language Takes Shape. Jackson-Reed Inc.
(Rumbaugh, 1991) Rumbaugh, J., Blaha, M., Premerlani, W., Eddy, F., & Lorenson, W. (1991). Object-Oriented Modelling and Design. Prentice-Hall.
(Rumbaugh, 1999) Rumbaugh, J., Jacobson, I., & Booch, G. (1999). The Unified Modeling Language Reference Manual. Addison-Wesley-Longman, Inc.
(Rugaber, 1993) Rugaber, S., & Clayton, R. (1993). The Representation Problem in Reverse Engineering. Proceedings of the 1993 Working Conference on Reverse Engineering.
(Srinivas, 1991) Srinivas, Y. (1991). Pattern-Matching: A Sheaf-Theoretic Approach. PhD thesis, Dept of Information and , University of California at Irvine.
(Systa, 1997) Systa, T., & Koskimies, K. (1997). Extracting State Diagrams from Legacy Systems. ECOOP Workshops. 272-273.
(Systa, 2000) Systa, T. (2000). Static and Dynamic Reverse Engineering Techniques for Java Software Systems. University of Tampere, Finland.
(Taentzer, 2001) Taentzer, G. (2001). Towards Common Exchange Formats for Graphs and Graph Transformation Systems. Electronic Notes in Theoretical Computer Science. 44(4).
(Van, 1992) Van, S.L. (1992). Workshop Notes – AI and Program Understanding. AAAI.
(Ward, 1989) Ward, M., Calliss, F.W., & Munro, M. (1989). The Maintainer’s Assistant. Proceedings of Conference on Software Maintenance. 307-315.
(Webster, 1987) Webster, D.E. (1987). Mapping the Design Representation Terrain: A Survey. Technical Report. Micro-Electronics and Computer Technology Corporation. MCC STP-093-87.
(Whitney, 1995) Whitney, M. et al. (1995). Using an Integrated Toolset for Program Understanding. Proceedings of the CASCON '95. pp. 262-274.
(Wile, 1987) Wile, D.S. (1987). Local Formalisms: Widening the Spectrum of Wide-Spectrum Languages. Program Specification and Transformation.
165-195.
(Wills, 1993) Wills, L. (1993). Flexible Control for Program Recognition. Baltimore, Maryland: Proceedings of the 1993 Working Conference on Reverse Engineering.
(Yang, 1991) Yang, H. (1991). The Supporting Environment for A Reverse Engineering System --- The Maintainer's Assistant. IEEE Conference on Software Maintenance (ICSM 91).
(Yang, 1997) Yang, H., Luker, P., & Chu, W. (1997). Measuring Abstractness for Reverse Engineering in A Re-engineering Tool. IEEE International Conference on Software Maintenance – 1997. 48-57.
(Yang, 1999) Yang, H., et al. (1999). Acquisition of Entity Relationship Models for Maintenance - Dealing with Data Intensive Programs in A Transformation System. Journal of Information Science and Engineering. 15(2), 173-198.
(Yang, 2000) Yang, H., et al. (2000). Abstraction: A Key Notion for Reverse Engineering in A System Reengineering Approach. Journal of Software Maintenance: Research and Practice. 12(5), 197-228.