
Software Archaeology in Practice
Recovering lost behaviour from legacy code

Verum Software Tools BV

Abstract—Reengineering legacy software is an undesirable but nevertheless occasionally unavoidable necessity. The core challenge is to efficiently and effectively uncover the behaviour of legacy code, the origins of which have been lost in the mists of time, establishing a complete and correct foundation for further reengineering. In this paper, we present a technique by which such lost or poorly understood behaviour can be recovered and turned into formally verifiable models. These models then offer a solid foundation for the further development of a software system.

Keywords—software reengineering; legacy code; model driven software engineering; software engineering tools; software components

I. INTRODUCTION

An unpleasant but nevertheless unavoidable truth is that conventionally developed software "rots" over time. Rot occurs slowly and insidiously, driven by the very nature of source code itself and the human factors that impinge on developing and maintaining it. It often starts when changes to source code are not reflected in documentation, leading to a loss of readily accessible information. It accelerates when development teams change and knowledge of the code, which is by now poorly documented, is lost. It proliferates when new features are added by fresh software engineers, based on incomplete documentation and knowledge. It reaches its zenith as the law of diminishing returns bites and development progress grinds to a halt. It is at this point that reengineering the software becomes unavoidable.

From a business perspective, reengineering software is a nightmare. It involves spending a lot of time and money just to stand still. It is also highly risky, simply because the existing legacy software is so poorly understood. And in the worst case, history can repeat itself, with the newly reengineered codebase being no more resistant to rot than its predecessor. This risk, and the work involved in reengineering, can be greatly diminished if the essential functionality and behaviour of the software can be recovered from the legacy codebase. Future rot can be minimised by converting the legacy code into verifiably complete and correct models.

In an ideal world, it would be possible to automatically reverse engineer an existing codebase using tooling. But then, in an ideal world, water could flow uphill and time could be reversed. The simple fact is that the 2nd law of thermodynamics means that creating a more highly ordered system (reengineered software) from a less ordered system (legacy code) requires work. The trick is to perform that work as efficiently and effectively as possible.

II. MODEL DRIVEN SOFTWARE ENGINEERING

A generalised definition of Model Driven Software Engineering (MDSE) can be found on Wikipedia [1]. A more pragmatic description is provided by Küster [2]:

• Model-Driven Software Engineering (MDSE) is a software engineering paradigm.
• Models are considered as primary artefacts from which parts of a software system can be automatically generated.
• Models are usually more abstract representations of the system to be built.
• MDSE can improve productivity and communication.
• MDSE requires technologies and tools in order to be successfully applied.

As stated by Küster, MDSE can be used to improve productivity and communication. Further, models designed using formal verification techniques have the additional benefit of increased longevity. Specifically, by objectively establishing that a model is complete and correct for a range of properties, one ensures, amongst other benefits, that the model will not be subject to quite the same level of rot as conventionally developed source code. Therefore, when reengineering a software (sub) system, one can not only improve productivity, quality and understanding by restating the design of the (sub) system in verifiably complete and correct models, but one also greatly reduces the risk of the design deteriorating into chaos again.

In this paper, we will consider the application of Dezyne [3] to the problem of recovering lost or poorly understood behaviour from legacy software and turning it into verifiably complete and correct models. Dezyne is an MDSE tool that enables software engineers to create, explore and formally verify component-based software designs for embedded, technical and industrial software systems.

III. COMPONENTS AND COMPOSITIONALITY

The Dezyne modelling paradigm is analogous to that of the hardware world. Namely, Dezyne is based on the concept of a component and the inherent compositionality of components. Dezyne recognises three types of models: interface models, component models and system models.

A. Interface Models

In the hardware world, the externally visible behaviour of an IC is generally described by two things: a pin-out and a timing diagram, both usually found in a data sheet or technical manual for the component. This abstraction provides enough information for the IC to be used without revealing details of how it is implemented.

Dezyne interface models are analogous. They describe the API provided by a software component and the protocol - the sequence of allowed events and responses - that the API implements. An interface model is an abstraction that provides enough information for a software component to be used without revealing details of how the component is implemented. Thus, an interface model amounts to a specification of the externally visible behaviour of a software component. Interface models are a key Dezyne concept upon which the compositionality of Dezyne components rests. When an interface model is used in conjunction with a Dezyne component model, Dezyne's verification engine will assert that the component completely and correctly adheres to the protocol of any interface that it provides or requires (the Liskov Substitution Principle [4]). In this way, the structural integrity of entire systems composed from Dezyne components is established.

Figure 1: Dezyne Interface Model Example

B. Component Models

In the hardware world, components are implemented in silicon and it is in the silicon that the actual work of a component is done. It is essential to the basic utility of an IC that its implementation fully refines the interface that it specifies in its technical documentation.

Dezyne component models are analogous: the component does the actual work of a design. Every Dezyne component provides an interface model and thus it is essential to the basic utility of a component that it fully refines the specification that the interface provides.

Dezyne goes one step further than the hardware paradigm. Namely, any component that in turn requires the use of another component must also adhere to the protocol of the interface provided by that component. Dezyne's verification technology formally ensures that every component model adheres to the interface models that it provides and requires.

Figure 2: Dezyne Component Header
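To make these two model types concrete, the sketch below pairs a minimal interface model with the header and behaviour of a component that provides it. The sketch is purely illustrative: ILed, LedDriver and their events are hypothetical names invented for this example, the interface notation follows the interface models shown later in this paper, and the component notation is indicative of Dezyne's component syntax rather than quoted from the figures.

interface ILed
{
  in void TurnOn();
  in void TurnOff();

  behaviour
  {
    enum States {Off, On};
    States state = States.Off;

    [state.Off]
    {
      on TurnOn: state = States.On;
      on TurnOff: illegal;
    }
    [state.On]
    {
      on TurnOff: state = States.Off;
      on TurnOn: illegal;
    }
  }
}

component LedDriver
{
  // The component header: LedDriver promises to refine the ILed protocol.
  // Any required interfaces would also be declared here, with 'requires'.
  provides ILed led;

  behaviour
  {
    on led.TurnOn(): {}   // switch the hardware on (implementation detail omitted)
    on led.TurnOff(): {}  // switch the hardware off
  }
}

During verification, Dezyne checks that LedDriver handles every event sequence that ILed permits; any client of ILed is, in turn, checked against the same interface model, which is what makes the two sides composable.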

C. System Models

In the hardware world, systems are designed by composing components together.

The same is true in Dezyne. A Dezyne system model is used to declare instances of Dezyne components and to link them together through their port definitions, based on interface models. A Dezyne system model can itself provide and require interfaces, making it possible to abstract and hide the entire implementation of a (sub) system. The result is that Dezyne sub-systems can be nested and appear as components at a higher level in a system.

Figure 3: Dezyne System Model Example

D. Compositionality

Dezyne models are compositional. Specifically, if Component A requires interface I and is shown to adhere to interface I, and Component B provides interface I and is shown to refine interface I, then the mathematics underlying Dezyne asserts that, under all circumstances, Components A and B will behave correctly with respect to each other when bound together.

For example, referring to the previous figure, the component BarrierControl provides an interface model that is required by the component CrossingControl. During the design of BarrierControl, Dezyne's verification feature will assert that the BarrierControl component model refines its provided interface model. During the construction of CrossingControl, Dezyne will assert that the CrossingControl model adheres to the required BarrierControl interface model. The result is that, operationally, BarrierControl and CrossingControl will always behave correctly with respect to each other. This rule equally applies to every other binding shown in the RailCrossingSystem diagram.

IV. SOFTWARE ARCHAEOLOGY

Dezyne offers a means to reduce the cost and risk of the work involved in reengineering the behaviour of complex software systems. In a process that we have come to call "Software Archaeology", Dezyne can be used to rediscover the 'lost' behaviour of a system. The key to this process is the use of Dezyne's interface models to capture externally visible behaviour across interfaces to legacy software and to separate it into expected behaviour on the one hand and unexpected or erroneous behaviour on the other.

A. Interfacing to Legacy Software

Dezyne interface models are used to connect Dezyne components to legacy software and vice versa. In this case the interface model represents merely an assumption of how the legacy component behaves, against which the Dezyne component is verified. Dezyne's verification engine cannot be used to show that the legacy software complies with the interface model, meaning that the interface model offers no guarantee that the legacy component will stick to the protocol that the interface model defines.

With a small, simple or highly ordered legacy component it is possible to be confident that an interface model captures the exact protocol that the component implements. But when reengineering legacy software, the behaviour of a component is poorly understood and therefore it is highly likely that a Dezyne interface model will represent an approximation of the legacy component's visible behaviour. This can present an issue at runtime, because Dezyne-generated components assume that all other components they use or are used by, including legacy components, are correctly behaved. Dezyne's verification engine guarantees that other Dezyne components meet this requirement. However, errant behaviour by a legacy component may cause a protocol violation on an interface, with the consequence that the related Dezyne component will abort.
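As an indication of how such a connection is written down, the sketch below binds a verified Dezyne component to a hand-written wrapper around legacy code through a shared interface model. This is a hedged sketch only: all names are hypothetical, the protocols are deliberately trivial, and two assumptions are made, namely that a component declared without a behaviour is implemented by hand outside Dezyne, and that the system notation shown (instances plus port bindings) matches the Dezyne version in use.

// The assumed protocol of the legacy storage code: Write may be called at any time.
interface ILegacyStore
{
  in void Write();

  behaviour
  {
    on Write: {}
  }
}

// A hand-written wrapper around the legacy code. Declaring it without a
// behaviour indicates that its implementation lives outside Dezyne, so
// ILegacyStore is only an assumption about how that code behaves.
component LegacyStore
{
  provides ILegacyStore api;
}

// The API that the verified part of the system offers to its own clients.
interface IControl
{
  in void Save();

  behaviour
  {
    on Save: {}
  }
}

// A verified Dezyne component: the verification engine checks that Client
// respects the ILegacyStore protocol on its required port.
component Client
{
  provides IControl ctrl;
  requires ILegacyStore store;

  behaviour
  {
    on ctrl.Save(): store.Write();
  }
}

// The system model declares instances and binds required ports to provided ports.
component StorageSystem
{
  provides IControl ctrl;

  system
  {
    Client client;
    LegacyStore store;

    ctrl <=> client.ctrl;
    client.store <=> store.api;
  }
}

Dezyne verifies Client against both IControl and ILegacyStore; on the legacy side, however, ILegacyStore remains exactly the kind of unverified assumption described above.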

Figure 4: The Armouring Concept

B. Armoured Interfaces

Since practically every Dezyne (sub) system needs to interface with legacy software components at some point, Verum has developed a technique to deal with the potential for errant behaviour on these interfaces. This technique involves creating "armoured" interfaces between legacy and Dezyne components. An armoured interface is built between a legacy and a Dezyne component from two interface models that sandwich a Dezyne "protocol observer" component. The two interface models are syntactically identical but semantically different. The outer, legacy-facing, interface model is written to be robust, meaning that it defines 'weak' semantics that accept a wide range of behaviour from the legacy component, including potentially errant or erroneous behaviour. The inner interface that faces the Dezyne component is written with strict semantics that accept only known, verifiably complete and correct behaviour.

The Dezyne "protocol observer" component in the middle is written to deal with the difference between the two. This component passes intended behaviour from the legacy component inwards through the strict interface to a receiving Dezyne component. It filters out any errant behaviour by the legacy component and handles it appropriately. In the simplest case, it might take passive action by just swallowing the errant behaviour or logging it. But it could equally take affirmative action, perhaps by triggering an exception - whatever is possible and appropriate in the circumstances. In the course of time, armoured interfaces have become a standard pattern for interfacing to legacy code in the Dezyne community.

C. Recovering Lost Behaviour from Legacy Code

In practice, it turns out that armoured interfaces have uses beyond defending Dezyne components from errant behaviour in legacy code. Specifically, they can be used offensively to uncover lost legacy behaviour. In this case an armoured interface between the legacy component in question and a Dezyne component is constructed, with the protocol observer Dezyne component built to trap and log all errant behaviour. The resultant system is built, run and subjected to a wide range of tests.

Should the behaviour of the legacy component at the interface differ from the assumptions built into the strict interface model, the protocol observer component will capture and log the difference, providing detailed information about the true, real-world behaviour of the legacy code. Of course, the confidence level of this approach depends largely on the behavioural coverage of the test suite used to exercise the system. If that is in doubt then, as they generate little overhead, the armoured interfaces can be left in place in a final system, ready to catch the nasty sort of errant behaviour that only occurs in the field.

D. In Practice

Armoured interfaces have been in use by Verum's customers for many years [5] and have been shown to provide a useful starting point for the introduction of Dezyne-engineered components into a legacy codebase. One approach is to identify a small legacy component that can relatively easily be isolated. The interface that this component has with the rest of its system is modelled as an armoured interface and the legacy component replaced with a verified Dezyne component. The system is then subjected to test and any errant behaviour on the interface dealt with by improving the Dezyne interface and component models accordingly. Once completed for the first component, the process is repeated for the adjoining legacy component(s). In this way, the behaviour of entire (sub) systems can be systematically rediscovered and captured in verifiably complete and correct Dezyne models, which then provide a basis for further reengineering and a way to dramatically reduce the onset of software rot.

E. Example: Legacy Timer

As a simple example, consider interfacing to a legacy timer (LegacyTimerComponent) with potentially unreliable behaviour. The timer accepts Create and Cancel events and responds asynchronously with Timeout and Cancelled events. Such a timer might not guarantee that a Timeout does not occur during or after receiving a Cancel event. Therefore, there exists the possibility that a Timer client could receive unexpected Timeout events. Dezyne armouring can be used to prevent this problem.

Figure 5: Example armoured interface to a legacy timer

The interface to the legacy Timer is modelled with a robust interface model that includes the Timer's API but essentially imposes no protocol on the behaviour of the Timer.

interface ITimerRobust
{
  extern double $double$;

  in void Create(double tOuter);
  in void Cancel();
  out void Timeout();
  out void Cancelled();

  behaviour
  {
    // Create and Cancel are accepted at any time and impose no constraints.
    on Cancel: {}
    on Create: {}
    // Timeout and Cancelled may occur spontaneously at any moment.
    on inevitable: Timeout;
    on inevitable: Cancelled;
  }
}
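The armoured structure of Figure 5 then corresponds to a system model along the following lines. This is an indicative sketch only: the port names and the name ArmouredTimerSystem are assumptions, and the binding notation is indicative of Dezyne's system syntax, while TimerClientComponent, TimerArmourComponent and LegacyTimerComponent are the components discussed in the remainder of this section.

component ArmouredTimerSystem
{
  system
  {
    TimerClientComponent client;   // verified Dezyne client, programmed against the strict interface (given below)
    TimerArmourComponent armour;   // protocol observer mapping robust behaviour onto strict behaviour
    LegacyTimerComponent timer;    // hand-written wrapper around the legacy timer

    client.timer <=> armour.strict;   // ITimerStrict: verified on both sides of the binding
    armour.robust <=> timer.api;      // ITimerRobust: an assumption about the legacy code
  }
}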

The interface to the Dezyne client component is modelled using the same syntax (API) but with the addition of a strict protocol that specifies exactly how the Timer is expected to work.

interface ITimerStrict
{
  extern double $double$;

  in void Create(double tInner);
  in void Cancel();
  out void Timeout();
  out void Cancelled();

  behaviour
  {
    enum States {Inactive, Active, Cancelling};
    States state = States.Inactive;

    // Cancel is accepted in any state and always leads to a Cancelled response.
    on Cancel: state = States.Cancelling;

    [state.Cancelling]
    {
      on inevitable: { Cancelled; state = States.Inactive; }
      on Create: illegal;
    }
    [state.Inactive]
    {
      on Create: state = States.Active;
    }
    [state.Active]
    {
      // While the timer is running, the only possible response is a single Timeout.
      on inevitable: { Timeout; state = States.Inactive; }
      on Create: illegal;
    }
  }
}

In between the two interfaces a Dezyne armour component is constructed to map the weak semantics of the legacy interface onto the strict semantics of the client interface. The behaviour of the armour component is shown in the figure below.

Figure 6: Behaviour of the Timer armour component (TimerArmourComponent)

Note that in the Inactive or Cancelling states, the TimerArmourComponent ignores any spurious Timeout events from the legacy timer. Equally, any spurious Cancelled events are also thrown away. TimerArmourComponent and TimerClientComponent are both formally verified Dezyne components and are therefore guaranteed to work correctly.

This example is necessarily trivial for illustrative purposes. In practice, faced with an estimate of 85 man-years to conventionally reengineer a software subsystem with no guarantee of a successful outcome, one large producer of semiconductor manufacturing equipment applied the principles of software archaeology to the problem. They successfully completed their initial reengineering project in 25 man-years, with the resultant models forming the basis for a progressive, ongoing software reengineering program.

V. CONCLUSION

Legacy codebases represent the sum total of the software intellectual property of a business and are at once invaluable and irreplaceable. However, they are also prone to long-term deterioration, with the result that a vital business asset can become an engineering hazard and, ultimately, a business liability.

In this paper, we have shown how Dezyne can be used to recover lost or poorly understood behaviour from a legacy codebase, resulting in models of the (rediscovered) behaviour that are both formally complete and correct. These models provide a solid foundation for a further reengineering program.

By their very nature, Dezyne models are also far more resistant to deterioration than source code and therefore will not depreciate as rapidly as (reengineered) code. They therefore represent a far better investment in the future of a business.

Dezyne-based "Software Archaeology" is a technique in use by many of Verum's customers and has shown itself to be an efficient, effective and systematic way to regain control over a problematic legacy codebase.

VI. REFERENCES

[1] "Model Driven Engineering", Wikipedia.
[2] J. Küster, "Foundations of Model-Driven Software Engineering", IBM Research, 2011.
[3] "About Dezyne", Verum Software Tools BV.
[4] "The Liskov Substitution Principle", Wikipedia.
[5] R. Wester and J. Koster, "The Software behind Moore's Law", IEEE Software, vol. 32, no. 2, pp. 37-40, Mar.-Apr. 2015.
