
A Requirement-Based Approach to Data Modeling and Re-engineering

Alice H. Muntz and Christian T. Ramiller
Hughes Information Technology Corporation
PO Box 92919, Los Angeles, CA 90009
ahmuntz@hac2arpa.hac.com

Proceedings of the 20th VLDB Conference, Santiago, Chile, 1994

Abstract

This paper reports on the managerial experience, technical approach, and lessons learned from re-engineering eight departmental large-scale information systems. The key objective driving each project was to migrate these systems into a set of enterprise-wide systems which incorporate current and future requirements, drastically reduce operational and maintenance cost, and facilitate common understanding among stakeholders (i.e., policy makers, high-level management, and IS developers, maintainers, and end users). A logical data model, which contains requirements, rules, and physical data representation as well as logical data objects, clearly documents the baseline data requirements implemented by the legacy systems and is crucial to achieving this strategic goal. Re-engineering products are captured in the dictionaries of a CASE tool (i.e., as a decomposition hierarchy, as-is data model, normalized logical data model, and linkages among data objects) and are supplemented with traceability matrices. The re-engineered data products are used as follows: (1) migration of the legacy databases to relational database management systems, (2) automatic generation of databases and applications for migration from mainframes to client-server environments, (3) enterprise data standardization, (4) integration of disparate information systems, (5) re-documentation, (6) data quality assessment and assurance, and (7) baseline specifications for future systems.

1. Introduction

Re-engineering is the process of analyzing, upgrading, and integrating enterprise information systems to meet the expanded operating requirements of the present, as well as to prepare a sound base for effectively meeting future strategic needs. In this paper, we report our experience, technical approach, and lessons learned from applying our requirement-based approach ([MH94], [AHR94], [AHR93], [HI93a], [HI93b]) to eight large-scale legacy information systems, each containing several million lines of code and thousands of data elements.

Our requirement-based re-engineering approach differs from ([P&B94], [HDA87], [M&M90], [M&S89], [NEK94], [MNBBK94], [HCTJ93]) in that we couple our extended entity-relationship modeling methodology and tools with model integration processes to produce a normalized data model. In a re-engineered data model, the physical data objects, pertaining to the physical schemas, link to the relevant logical data objects, external data objects, business requirements, and system rules.

Why Requirement-Based Data Modeling and Re-engineering?

A requirement-based re-engineering approach should be used when the information infrastructure of an enterprise is out of control and is not meeting its goals. Typical symptoms are:

- people can't find or integrate data across the enterprise
- data can't be combined from multiple sources
- the information people need is in a system somewhere else, not accessible to them
- the information systems (IS) staff can't support the existing infrastructure because of obsolete platforms or programs designed, built, and supported by one person
- the IS department charges are skyrocketing, but people still can't get the information they need to conduct the business: what they want, when they want it, and in the form they want it
- the information systems don't support enterprise strategic planning or tactical decision support
- information and application requirements have not been mapped to business functions
- modifying one application causes errors, aborts, and/or erroneous information in a different application

Business Requirements
  IT tasks: Functional Economic Analysis; Business Process Improvement; Enterprise Data Model; Cross-functional Model Integration; Enterprise Model Integration; Enterprise Data Standardization; Enterprise Data Repository.
  Use of products: understanding requirements of as-is business processes and data models and how IS supports users and the business; inventory of baseline requirements implemented in the current IS; composition of the logical data dimension and the physical data dimension of the enterprise data architecture and enterprise data model; integration of tactical and operational data models among business functions; integration of strategic, tactical, and operational data models for all business functions; a uniform name for the same data object to facilitate reuse; the same name, meaning, and usage for sharable data instances and objects.

Information System Requirements Planning
  IT tasks: Enterprise IT Consolidation; Enterprise IS Security.
  Use of products: identification of duplicate or similar databases for the same business functions; understanding requirements of as-is business processes and data models and how IS supports users and the business.

Application/Database Requirements, Preliminary Design, and Detail Design
  IT tasks: Forward Engineering; Object Engineering; Data Quality Engineering.
  Use of products: elimination of obsolete requirements, modification of changed requirements, and addition of new requirements; data objects formulated from the data usage, processing rules, attributes, and entities of the logical data model; data quality requirements represented in the re-engineered logical data model.

Implementation
  IT tasks: Database Generation; Screen Generation; Report Generation.
  Use of products: databases and tables automatically generated by CASE tools using re-engineered data models; screen/report design and implementation via CASE tools using re-engineered data models.

Unit Test and Integration Test
  Use of products: linkage between the legacy data elements and the as-is normalized data model, used by the program; linkage between the to-be data elements and the as-is normalized data model, used to store the extracted legacy data in the to-be database tables.

Data Quality Assurance
  Use of products: data quality requirements specified in the re-engineered data model.

Data Administration
  Use of products: data requirements specified in the re-engineered data model.

Training and Maintenance
  Use of products: understanding of the requirements and how the IS implements the requirements.

Table 1. Use of Requirement-Based Re-Engineering Products.

A requirement-based data modeling and re-engineering approach should be used to re-document or map business requirements, functional requirements, and data requirements to architecture, design, and implementation. As depicted in Table 1, re-engineered data models enable identification of obsolete portions of applications, outstanding unfulfilled requirements, and applications that require changes or consolidation based on new functional requirements to meet current and future business needs. The re-engineered design also provides a basis for developing a plan to migrate reusable applications to the new application architecture and technology infrastructure.

As a part of our re-engineering approach, we resolve data conflicts (e.g., synonyms, homonyms) during data modeling and model integration. In section two, we describe our approach to identifying and classifying data conflicts. Depending on the scope of the re-engineering effort, various levels of integration will occur. We developed an integration taxonomy to guide our integration process for data modeling and re-engineering legacy information systems. Section 3 describes this integration taxonomy. Section 4 describes our requirement-based data modeling approach. Section 5 uses an example to illustrate our model integration approach. Finally, in the last section, we summarize lessons learned from these re-engineering experiences.

2. Five Data Conflict Types

When systems are to be re-engineered or consolidated, functional and data conflicts and data duplication need special attention in order to determine the impact of selecting one system's implementation over another. Data conflicts fall into one of five general types.

Synonyms are data element names which are different but which are used to describe the same thing in the various systems. The "same thing" refers to (a) the same critical attribute of a real-world thing, or (b) the same real-world thing. An example is shown in Figure 1. Here the term ACTIVITY within the pay system is used to identify what is called a UNIT or ORGANIZATION within the personnel system. This conflict was uncovered by noting the similarity in associations as well as in definition.

[Figure 1 contrasts the personnel model (ORGANIZATION - UNIT - POSITION) with the pay model (ORGANIZATION - ACTIVITY - POSITION), with definitions: ORGANIZATION -- any group of individuals that serve a function within the Enterprise. ACTIVITY -- an organization, identified by the major claimant code, for which a grouping of employees works. UNIT -- an organizational sub-group, identified by the personnel accounting symbol authorization number, used to track the owner of a position to which employees will be assigned.]

Figure 1. Example: UNIT vs. ACTIVITY.

Homonyms are data element names that are the same or almost the same but which are used to represent things which are different with respect to usage or other characteristics. An example of a homonym is shown in Figure 2. Here the term AGENCY in the personnel system designates the owner of a POSITION. Meanwhile, in the payroll system the term AGENCY (with a synonym -- SERVICING AGENCY -- in the payroll system) is related not to a POSITION but to an EMPLOYEE.

[Figure 2 shows the conflicting definitions: AGENCY -- a code designating the Federal government agency who owns the position, regardless of the servicing agency (Navy PDS). AGENCY -- a two-position code that designates the Federal Government agency to which an employee is assigned or is transferred (DCPS Manual, Appendix F). AGENCY -- reflects the owning agency code; used in conjunction with DIN JPV, which reflects the servicing agency (Table 102, APDS Table).]

Figure 2. Homonym Example: Agency.

Structural anomalies, which are tracked as Attribute/Entity conflicts, are an occurrence of a data element treated logically as an Entity in one system (i.e., many characteristics about it are tracked and stored) while in another system it is treated logically as an Attribute (one value which is used as a characteristic to describe the same entity).

[Figure 3 contrasts a high-level model containing APPOINTMENT, CANDIDATE, and EMPLOYEE with the personnel model, in which EMPLOYEE carries subtypes such as TEMPORARY, INTERMITTENT, and PART TIME.]

Figure 3. Type/Subtype Example: Employee.

Type/Subtype conflicts show up where one system keeps only basic characteristics, or attributes, about an entity, and thus the varying types are simply multiple subtypes of that entity, while another system keeps a more complete set of characteristics about real-world things, thus encompassing all the variations within one type. As a result, these real-world things can be categorized into multiple types, each described by its own set of distinguishing characteristics. This conflict can usually be resolved by making the multiple type entities of the second system subtypes and placing common characteristics into the single entity from the first. An example of this type of problem is shown in Figure 3. Here the high-level model is more abstract and has only an EMPLOYEE entity, while in the personnel system, separate characteristics are tracked for TEMPORARY EMPLOYEES, INTERMITTENT EMPLOYEES, and PART TIME EMPLOYEES. Further investigation may even find additional employee types. This conflict actually helps to highlight a normalization need, such that common attributes across the multiple types will not be duplicated in the consolidated implementation.

The last conflict type is Stored vs. Calculated Redundancy. An example of this kind of conflict is the DOB (date of birth) sent from the personnel system to the payroll system, and the EMPLOYEE AGE CATEGORY stored within the payroll system after calculating the age from the date of birth. This conflict was at first only suspected, but was later validated when the actual usage of DOB was noted by pay system experts--that is, the date of birth is important to them only in determining the appropriate age category for life insurance premium calculation.
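To make the five conflict types concrete, the sketch below shows one way a catalog of candidate conflicts might be represented during analysis. It is a minimal illustration using the UNIT/ACTIVITY example from Figure 1; the DataElement and Conflict structures, and the string-equality test for "same meaning", are simplifying assumptions (in practice that judgment comes from focus sessions with experts), not the project's actual tooling.

```python
from dataclasses import dataclass
from enum import Enum

class ConflictType(Enum):
    SYNONYM = "synonym"                          # different names, same real-world thing
    HOMONYM = "homonym"                          # same name, different meaning or usage
    ATTRIBUTE_ENTITY = "attribute/entity"        # entity in one system, attribute in another
    TYPE_SUBTYPE = "type/subtype"                # one generic type vs. several subtypes
    STORED_CALCULATED = "stored vs. calculated"  # e.g., DOB vs. derived AGE CATEGORY

@dataclass(frozen=True)
class DataElement:
    system: str      # owning legacy system
    name: str        # element name as used in that system
    definition: str  # definition gathered from documentation or focus sessions
    modeled_as: str  # "entity" or "attribute" in that system's model

@dataclass(frozen=True)
class Conflict:
    kind: ConflictType
    elements: tuple  # the DataElements involved
    note: str        # rationale, to be linked back to policies and rules

def candidate_conflicts(a, b):
    """Flag candidate synonym/homonym/structural conflicts for analyst review.
    Type/subtype and stored-vs-calculated conflicts require structural and
    derivation analysis, so they are not detected by this simple pass."""
    found = []
    same_name = a.name == b.name
    same_meaning = a.definition == b.definition  # in practice: analyst judgment
    if same_name and not same_meaning:
        found.append(Conflict(ConflictType.HOMONYM, (a, b), "same name, different usage"))
    if not same_name and same_meaning:
        found.append(Conflict(ConflictType.SYNONYM, (a, b), "different names, same thing"))
    if a.modeled_as != b.modeled_as:
        found.append(Conflict(ConflictType.ATTRIBUTE_ENTITY, (a, b), "structural anomaly"))
    return found

# The UNIT vs. ACTIVITY synonym of Figure 1:
unit = DataElement("personnel", "UNIT", "organizational sub-group owning a position", "entity")
activity = DataElement("pay", "ACTIVITY", "organizational sub-group owning a position", "entity")
print(candidate_conflicts(unit, activity))  # one SYNONYM conflict
```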
3. Integration Taxonomy

The process of integrating information can be categorized into a taxonomy which spans four different integration processes, as shown in Figure 4.

Intra-system integration is undertaken to resolve data conflicts that arise during the re-engineering of each system. Within each system, several physical data elements may logically reference the same information. When a logical model is generated, like elements are combined as one logical object. For example, in the payroll system there are two similar objects: AGENCY and SERVICING AGENCY. They represent the same data and the same data values. The difference is not in their content but in their usage. Within a logical model, both elements would be combined, and the latter would be indicated by associating the resulting object AGENCY with the real-world object served--in this case, the personnel office responsible for pay transactions of the employees.

Inter-system integration is undertaken to resolve data conflicts that exist among systems being consolidated.

Cross-functional integration is required when information models from two or more functional areas are combined, and where the same information is maintained in both areas. Inter-system integration may also occur concurrently with cross-functional integration if the systems support more than one functional area. Cross-functional integration occurs within our integration task, for example, by logically integrating information that supports personnel functions with information that supports pay functions. The overlap occurs in the process of calculating pay, where information about an employee's salary, entitlements, and deductions must be obtained from personnel to calculate civilian employees' pay, and information about time, attendance, deductions, and actual pay must be sent back to personnel to keep their records up-to-date.

Cross-level integration pertains to the level of conceptual detail being captured within the models. At the lowest level, details that are used for day-to-day operations are included within the models. At the mid level, details are only included if relevant to the decision-making about how to accomplish broad functional goals. At the highest level, only abstract concepts are captured, to provide a framework for decomposition of functional areas into specific systems or organizations which provide those functions. Thus, cross-level integration occurs when folding one or more lower level models into one of the mid or higher level models, or simply mapping the more detailed concepts of the lower level model to a higher level through aggregation. Finally, our integration task also includes cross-level integration between a mid-level model which, for example, reflects current policy for calculation of pay, and the lower level models.

[Figure 4 depicts the four integration processes -- intra-system, inter-system, cross-functional, and cross-level -- across the personnel and pay systems.]

Figure 4. Four Different Integration Processes.

[Figure 5 shows how physical evidence (file structures, data elements, screens, and reports) and statements of policies, business rules, procedures, and formal requirements are linked to model views, which contain logical objects; processes use data, and policies and business rules are linked to the logical model.]
Figure 5. Model Gestation.

4. Data Modeling and Re-Engineering Approach

With our re-engineering approach, we couple our extended entity-relationship modeling approach with the model integration processes. As shown in Figure 5, our extended entity-relationship modeling approach partitions data models into the following components: (1) model views, (2) plan dictionary, (3) physical dictionary, (4) data dictionary, and (5) linkages (i.e., links between plan dictionary objects and model view objects, links between plan dictionary objects and data dictionary objects, and links between data dictionary objects and design dictionary objects).

Business processes and data are inexorably linked. In our projects we have concentrated on the data requirements side, but we never forget about the underlying processes. Data does not appear suddenly out of the blue. Data is created by and is used by processes. Likewise, there are no business processes which don't use data. Failure to recognize these fundamental truths leads to confusion and artificial or theoretical results, not practical models and solutions.

As shown in Figure 5, we first construct an "as-is logical data model" to represent the requirements, the logical structures and semantics, and the physical implementation of databases, screens, and reports. We then normalize the "as-is logical data model" into an "as-is normalized data model", up to the fifth normal form when possible, by using up-to-date business strategies, business rules, and functional dependency information. The usage of each data object (i.e., entity, attribute, and association) is also documented by specifying the "how" (i.e., Create, Read, Update, and Delete), the "why" (i.e., links to the relevant policies, business rules, business procedures, practices, and system implementation rules), the "when" (i.e., the "triggering business processes" and the "to-be-triggered business processes"), and the "where" (i.e., the "business process view" where the entity resides). During the normalization process, synonyms and homonyms in the physical data structures are identified and resolved. The resolution of each such conflict is documented with linkages to/from the logical attributes and the conflicting data items.
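As an illustration of the usage documentation just described, the record below sketches how the "how/why/when/where" facts might be attached to a single data object. The field names and the example object are hypothetical; the paper prescribes the content of the documentation, not a storage format.

```python
from dataclasses import dataclass, field

@dataclass
class DataObjectUsage:
    """Usage documentation for one data object (entity, attribute, or association)."""
    object_name: str
    how: set = field(default_factory=set)     # subset of {"C", "R", "U", "D"}
    why: list = field(default_factory=list)   # links to policies, rules, procedures, practices
    when: list = field(default_factory=list)  # triggering / to-be-triggered business processes
    where: list = field(default_factory=list) # business process views where the object resides

pay_grade = DataObjectUsage(
    object_name="EMPLOYEE.Pay-Grade",
    how={"R", "U"},
    why=["pay-setting policy (plan dictionary link)"],
    when=["calculate pay"],
    where=["pay calculation process view"],
)
print(pay_grade)
```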

The As-Is Model

The as-is data model, once created, serves as a baseline representing the current configuration of the system. The process of creating the as-is data model is called reverse engineering. This model contains the existing definitions of processes, data structures, and data elements.

Typically, our experience is that a large number of homonyms cannot be identified by analyzing application code, the data dictionary, or schema definitions. Multiple external definitions result from differing usage of the data element between organizations within the enterprise using the application/database, or from "phased" usage of a data element during various stages of the process supported by the application, where the meaning of the element changes. For example, as illustrated in Figure 6, "Manufacturers cage part number indicator code" is used in four different ways by four organizations (i.e., DCSC, DESC, DGSC, and DISC). During focus sessions, we discovered this homonym and obtained definitions, purposes, and usage of data elements from policy experts, end users, system implementers, and system maintainers.

[Figure 6 shows the four conflicting definitions; the recoverable portions read: DCSC -- a code that indicates if the manufacturer's CAGE code and the part number can be used to acquire the item (codes: X-Yes; blank-No). DESC -- a code that indicates if the item ... DGSC -- a code that indicates if government ... (codes: Y-Yes; N-No).]

Figure 6. Homonym Detected from Usage.

Model views are created to partition the model into smaller pieces for analysis and to create meaningful groupings of business rules and data. Generally, these are process- or function-based, although we can also create "data" views based on clusters of related data.

As shown in Figure 7, functional and business rules, developed from policies and regulations, drive system rules, while system rules describe the processing logic and data relationships to be implemented by the application code. To facilitate the requirements-driven approach, business rules and policies are generated from external regulations and laws. Common practices and procedures within the business enterprise are documented in the plan dictionary of the data model, as shown in Figure 5. Additional rules are also captured in the plan dictionary through analysis of the application code (these are the system rules that implement the business rules). Rules are then linked to the model views to which they apply, as illustrated in Figure 5.

[Figure 7 diagrams how policies and regulations drive functional and business rules, which drive system rules, which are in turn implemented by application code.]

Figure 7. Relationships Among Requirements, System Rules, and Application Code.

Physical data is correlated with business processes (model views) via a create, read, update, delete (CRUD) matrix in spreadsheets. This matrix is also used to capture the operations performed on the data for a specific process, whether the data is optional or mandatory, and whether it is system-supplied or user-entered.

After the physical components of the model are defined (in the design dictionary) and validated, candidate objects are "extracted" from the design dictionary, placed in the logical data dictionary, and used to populate the model views. Preliminary linkages from physical to logical are established as part of this extraction process to establish the pedigree of the candidate logical objects. The as-is model is then frozen for future reference and is ready for use as a baseline for the development of the as-is normalized model.
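The CRUD matrix mentioned above can be pictured as a process-by-entity table. A small sketch follows, with invented process and entity names; the project kept these in spreadsheets with additional optional/mandatory and system-/user-supplied flags, which are omitted here, and the anomaly check is only one example of the integrity questions such a matrix supports.

```python
# Rows are business processes (model views); columns are logical entities.
# Cell values record the operations the process performs on the entity.
crud_matrix = {
    "hire employee":     {"EMPLOYEE": "C", "POSITION": "RU"},
    "calculate pay":     {"EMPLOYEE": "R", "PAY-RECORD": "CRU", "AGENCY": "R"},
    "separate employee": {"EMPLOYEE": "UD", "POSITION": "U"},
}

def processes_touching(entity):
    """Which processes create, read, update, or delete a given entity?"""
    return [p for p, cells in crud_matrix.items() if entity in cells]

def deletion_anomalies():
    """Entities that are created but never deleted (or deleted but never created)
    merit review for the lifecycle anomalies normalization is meant to remove."""
    ops = {}
    for cells in crud_matrix.values():
        for entity, letters in cells.items():
            ops.setdefault(entity, set()).update(letters)
    return [e for e, s in ops.items() if ("C" in s) != ("D" in s)]

print(processes_touching("EMPLOYEE"))  # ['hire employee', 'calculate pay', 'separate employee']
print(deletion_anomalies())            # ['PAY-RECORD']
```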

The As-Is Normalized Model

After the as-is data model is derived, we begin the forward engineering process, which includes normalization (through fifth normal form) of the candidate objects. Normalization reduces complex data structures to their basic, most natural form by removing redundancies, eliminating data conflicts, and assigning attributes to entities based on the essential meaning of data throughout the organization. The resulting data architecture is based on how data is related to other data, not on how the data is used by the applications or on who uses the data. This provides a more stable and flexible structure. This logical data architecture is used to drive the creation of physical data structures free of creation, update, and deletion anomalies that result in inconsistent data values and other data integrity problems.

As a result of normalization, new objects are created and refined relationships established that differ from the original as-is condition. Therefore, additional rules (requirements) must be written to support the normalization assumptions and rationale. New business rules are recorded as modeling issues are resolved.

Standard naming conventions (per enterprise standard) are applied to logical objects. In many instances, new standard logical names based on the essential meaning of the data will prove more transportable across the enterprise than the legacy physical data names.

A full set of trace links between logical objects and rules (planning statements) can now be completed. Trace links between logical and physical objects must be refined to account for logical objects eliminated to condense synonyms and additional logical objects created to resolve homonyms. In some cases, new objects (entities/attributes) are created which are not currently part of the application/database but are logically part of the business enterprise supported by the application being reverse-engineered.

The stage has now been set to achieve a model-driven development environment, as shown in Figure 8.

[Figure 8 shows the new development environment derived from the re-engineered data model.]

Figure 8. New Development Environment Derived from the Re-Engineered Data Model.

Forward Engineering and Model-Driven Development

Upon completion, the normalized model can be archived as a baseline for configuration management purposes, and a new model is generated from the normalized model to meet two possible scenarios. Scenario 1 is to upgrade the system implementation for current needs by migrating the existing database to a relational database, creating new physical components from the as-is normalized data model.

Scenario 2 is to expand system functionality. We would first modify the as-is normalized data model into the to-be normalized data model by capturing new business requirements. In addition, we integrate strategic, tactical, and operational level data models with the re-engineered as-is normalized data model. We first use the strategic level enterprise data and process models to isolate the relevant concepts and identify relationships between tactical data and process models. Next, we use the relevant tactical data and process models to identify groupings of business processes and possible entities and associations. Then, we identify identical data objects, synonyms, and homonyms in the related operational data models. Finally, we resolve conflicts and produce an integrated logical data model, and migrate the existing database to a relational database by creating new physical components from the to-be normalized data model.
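Creating new physical components from a normalized model, as in both scenarios, is the step CASE tools automate (see Table 1, Database Generation). The fragment below sketches the idea for a single entity; the entity, attribute names, and SQL types are hypothetical, and a real generator would also emit foreign keys, indexes, and constraints derived from the model's associations and rules.

```python
def create_table_ddl(entity, attributes, primary_key):
    """Emit a relational CREATE TABLE statement for one logical entity."""
    cols = [f"  {name} {sqltype}" for name, sqltype in attributes.items()]
    cols.append(f"  PRIMARY KEY ({', '.join(primary_key)})")
    return f"CREATE TABLE {entity} (\n" + ",\n".join(cols) + "\n);"

# A hypothetical entity from the normalized personnel/pay model:
print(create_table_ddl(
    "EMPLOYEE",
    {"employee_id": "CHAR(9)", "pay_grade": "CHAR(4)", "date_of_birth": "DATE"},
    ["employee_id"],
))
```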

5. Our Model Integration Approach

In this section, we use a payroll system, a personnel system, and the pay calculation application and related data elements from the two systems as a small example to illustrate integration complexity and our approach to integration. In this example, we have cross-level integration between a mid-level model, which reflects current policy for calculation of pay, and the lower level models; we have cross-functional integration between the personnel system representing personnel functions and the payroll system representing pay functions; we have intra-system integration within each migration system model as it relates to the calculation of pay, which is the part of the re-engineering effort that prepares it for integration; and finally we have inter-system integration, since we need to resolve data conflicts between the two re-engineered data models representing the data requirements of the two systems.

Integration Scope

The scope of this task is the information exchanged between the two systems for the pay calculation. As shown in Figure 9, there are three subsets of data that should be distinguished. The most critical items are those shown in the intersection, i.e., elements that are shared by the two systems and are directly related to pay. Examples of this type of data are an Employee-Id and the employee's associated Pay Grade. The second subset includes elements that are not shared by the systems but which are directly related to pay. Examples of this kind of data are such things as Time and Attendance data or Allowance Categories, which are kept only within the payroll system. The third subset includes elements that are not shared, but are common to both systems, and are either directly or indirectly related to pay calculation. The most obvious examples of this kind of data are reference tables, such as locality adjustment percentages which affect pay, or various benefit plan codes. The less obvious are elements such as salary limitation accumulators, which are officially kept only within personnel but are actually maintained within both pay and personnel. Stewardship (i.e., organizational authority and responsibility for a specific set of data elements) may be the most difficult to determine for this type of data.

[Figure 9 shows the two overlapping system scopes, distinguishing shared elements in the intersection, not-shared elements in direct support of the function scope, and elements that are common but not shared.]

Figure 9. Scope of Data Elements to be Integrated.

Bias Priority Approach

We use a bias priority approach, as discussed in [BLN86]. Integration tasks encountered in our projects involve all four integration processes (i.e., intra-system, inter-system, cross-functional, and cross-level integration). Therefore, the approach taken on this task is to merge each of the re-engineered data models, one at a time, into one medium-level logical model.

In Figure 10, DCPDS is a re-engineered data model for a personnel system, DCPS is a re-engineered data model for a payroll system, and MCTFS is a re-engineered data model for a combined pay and personnel system. In the process, conflicts of naming, structural representation, and semantics are uncovered. In addition, stewardship may be unclear. The goal of the task is to isolate each of these conflicts and document the alternatives with the data-related impact of each. The stewardship of the conflicting elements may also be discovered; if so, this is noted to help in establishing resolution authority.

[Figure 10 shows the priority sequence that biases the resulting standard elements.]

Figure 10. Priority Sequence Biases the Resulting Standard Elements.

The initial version of the mid-level data model is developed top-down, using a variety of policy-level input sources. In this specific example, the policy documents included DOD Directives, the Federal Personnel Manual and its supplements, and several high-level models that have been developed, including the DOD Enterprise Model. A similar mid-level data model in a non-government organization could be developed using similar documents, such as federal law, state law, corporate directives, or a corporate personnel manual. The mid-level model resulting from this policy review has been called the Joint Personnel and Pay Model, or JPPM.

The two re-engineered models, DCPDS and DCPS, developed bottom-up using existing data structures and in-depth system expert interviews as input sources, are merged one at a time. The first merge, DCPDS (i.e., the personnel system data model) into JPPM, creates an intermediate model called Joint Personnel, or JPERS. This model integrates personnel information from a bottom-up and top-down view. The second merge, DCPS into JPERS, creates a second intermediate model called the Joint Civilian Personnel and Pay Model, or JCPPM. This model uses the integrated top-down/bottom-up personnel model to help guide the integration of bottom-up pay information. The bias priority is inherent in the sequence of integration. In this example, the sequence of incorporating DCPDS, DCPS, and MCTFS was driven by the time at which each re-engineered data model became available. We did not intentionally impose preference on any of the systems, but the timing of model availability turns out to determine the bias priority.

As shown in Figure 10, the mid-level model (JPPM) has the highest priority; if there is a conflict between it and DCPDS, the mid-level model representation will take precedence. JPERS will therefore be biased towards this mid-level view. Then, JCPPM will be biased towards JPERS (i.e., the model integrated from the mid-level model and DCPDS) over DCPS. Finally, the standard will be biased toward JCPPM (i.e., the model integrated from the mid-level model, DCPDS, and DCPS) over MCTFS. As the integration process moves to the final stage, MCTFS will have the least impact on standardized terms, as compared to DCPDS and DCPS, when conflicts arise.

The bias priority supported by this approach is especially important when integrating a large number of existing systems over time.
The greatest advantage is in managing the complexity that arises when consulting the many system and functional experts, who may not previously have been aware that their data is also kept in other systems, perhaps used, as well as named, quite differently. Special data uses may also have to be compromised to accommodate the standards, and mediation without a pre-determined priority could be very costly in time, money, and delays due to stalemate.
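The effect of a pre-determined bias priority can be mimicked by merging models in sequence and letting the earlier (higher-priority) model win each term conflict. This is a deliberately reduced sketch: each "model" is collapsed to a name-to-definition mapping, and real integration documents every conflict and its resolution rather than resolving silently.

```python
def merge_with_bias(models):
    """Merge term dictionaries in priority order: the first listed wins conflicts."""
    merged = {}
    for model in models:  # e.g., [JPPM, DCPDS, DCPS, MCTFS], highest priority first
        for term, definition in model.items():
            # Conflicts are resolved toward the model integrated earlier; in
            # practice each conflict would be logged for the resolution document.
            merged.setdefault(term, definition)
    return merged

jppm  = {"ORGANIZATION": "policy-level definition"}
dcpds = {"UNIT": "personnel definition", "ORGANIZATION": "personnel definition"}
dcps  = {"ACTIVITY": "pay definition", "ORGANIZATION": "pay definition"}
standard = merge_with_bias([jppm, dcpds, dcps])
print(standard["ORGANIZATION"])  # 'policy-level definition' -- JPPM takes precedence
```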

The most effective sequence is established by weighing the following factors: (1) the breadth of use of the data elements in each system, to ensure the most complete understanding possible early on; (2) the dollar costs of changes to each system, or the time and cost of mapping the system's data to the standards resulting from earlier integration, to limit the ultimate costs; and (3) the number of employees or organizations supported by each system and the relative importance of the functions supported by the data being integrated, to consider the risk of change. More detailed advantages of this approach over others are discussed in [BLN86].

Since one of the strategic objectives is to produce sharable data among enterprise organizations performing various business functions (such as personnel, payroll, procurement, health care benefits, and inventory management), after the integration process we use a naming standardization procedure [DoD94] which ensures reusability of data objects. For example, we re-use the standard name if the data object is equivalent to a standard data element in the enterprise data repository; this is done by transforming the name of an attribute into the format "Modifier Name.Generic Data Element" such that the transformed attribute and its associated data model can be submitted as an enterprise standard.
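The renaming step can be illustrated with a trivial helper that assembles the "Modifier Name.Generic Data Element" format. The example legacy element and the chosen modifiers are hypothetical, and the full DoD 8320.1-M-1 procedure involves far more than string formatting (approved word lists, registration, and review).

```python
def standard_attribute_name(modifiers, generic_element):
    """Assemble a name in the 'Modifier Name.Generic Data Element' format."""
    return f"{' '.join(modifiers)}.{generic_element}"

# A legacy physical element such as SVC-AGCY-CD might be proposed as:
print(standard_attribute_name(["Servicing", "Agency"], "Code"))
# -> Servicing Agency.Code
```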
Lessons Learned

Establishment of Re-Engineering Strategic Objectives

Getting formal commitment and authorization from the system stakeholders (especially administrative and technical management) is a crucial factor in the success of reverse engineering projects. One problem we experienced was the lack of timely access to key system and functional personnel because they were simultaneously required in other activities. In addition, the costs of properly re-engineering systems and the value of the re-engineering products produced are consistently under-estimated by management (this is true in the commercial environment as well). It has been challenging to convince management that re-engineering is a substantially broader and more complex task than just "restructuring the code" by pumping the code through a CASE tool. Restructuring the code is useful for improving system maintainability, but it does not support the following critical tasks: (1) facilitation of data migration, (2) evolution of a departmental system to satisfy enterprise-wide system requirements, (3) data sharing among functional areas (e.g., Health Care, Personnel, Pay), and (4) incorporation of new or changed data requirements resulting from business process improvement activities. Nor does code restructuring consider deletion of obsolete functions or requirements that don't need to be included in the future systems, or provide for identification of incomplete coverage of business requirements by existing programs.

Establishment of Use of Re-Engineering Products

Throughout the initial phase of our projects, we have been continuously asked:

- What reverse engineering products are produced?
- How can these products be used?
- How is reverse engineering related to other system development activities?
- When will the re-engineering products be ready?
- Why do they need re-engineering in order to obtain standard data elements?

The matrix in Table 1 helps to answer these questions.

Data Standardization and Integration Misconceptions

A dangerous misconception is that data standardization is simply assigning related data elements the same name in each system. If implemented blindly, this will cause future unforeseen and disastrous results, the causes of which are difficult to isolate and correct. To correctly use information from data elements, the business rules, policies, and functional dependencies among elements must be identified and represented in a data model for each system. Before standardization can be achieved, model integration is an essential step to identify and resolve synonyms and homonyms based on the rules, policies, and functional dependencies represented in the models. Without the integration step, incorrect information continues to propagate throughout the enterprise.

The Need for a Re-Engineering Cost Model

The costs of re-engineering efforts are difficult to estimate. Cost models are important in order to clearly enumerate the contributing factors, tasks, and resources required for a project, based on the environment (technology, administrative, and operational) and the size of the system in question. One vendor we know estimates a flat rate per line of code in the system. We were unable to even estimate the lines of code in the medical portion of the project because of the unstructured nature of the MUMPS code. One measure of prowess among MUMPS programmers is how complex a program can be written with a single line of code. Perhaps this accounts for the various estimates of the number of lines of MUMPS code in one hospital information system, ranging between 1.3 million and 2.5 million depending on whom you ask.

Re-Engineering Tools and Tool Usage

The tools available are geared to specific target portions of the IS life cycle. Many tools on the market today exhibit varying degrees of tunnel vision, covering only the portion the vendor is interested in and not providing adequate links to other tools or common formats. Business rules and other information captured in logical or conceptual models (or phases) may not be passed easily to forward engineering tools of choice, and as a result it is not easy to implement the structures, rules, and business requirements with full traceability.

Selection of the proper tool set across the entire life cycle requires a comprehensive strategic vision and a re-engineering process model (across the entire IS life cycle) aligned with the objectives of the customer organization. Tool selections should align closely with the customer organization's goals, priorities, and resource constraints.

The current crop of CASE tools focuses on associating physical data structures and variables with segments of code. This is useful in identifying how physical schemas are created, updated, read, and deleted, but it is not sufficient to construct the logical and conceptual external views of data requirements. Even if the functional and technical experts help the information engineers understand the existing physical systems, the CASE tools do not provide, for instance, the facility to link to the physical evidence (e.g., software, data dictionary, focus session results) recorded in support of the functional requirements. More importantly, using such tools in a purely mechanical fashion (i.e., without the help of the technical and functional experts) will be more damaging, producing inadequate, inaccurate, or incomplete results. Recovering a normalized logical data model together with the associated business rules, policies, and physical data structures is hard, and requires a significant amount of human effort, but this extra effort is required in the long run. CASE tools may augment the analysis to provide an initial understanding of the physical implementation of the system, but CASE tools cannot replace human effort for recovering the conceptual and logical data requirements.

Integration Task Process/Methodology

Task Timing: Availability of Re-Engineered Models

If the legacy systems being integrated do not have existing and up-to-date models of the processes and information structures, the integration task should be delayed until they are close to readiness. While it is important for the integrator to attend, and perhaps participate in, some of the re-engineering modeling sessions in order to become familiarized with the terms used in discussion, this participation can occur towards the later stages of model review, as preparation for integration. The models should be at a relatively stable state, and complete with traceability, before integration begins.

Task Reviews

It is important to find a reviewer of the deliverables, especially of the conflict cross-references, the conflict resolution document, and the traceability of the model, who will concentrate on coverage and format, not necessarily on term resolutions. The question that must be answered by that reviewer is: "Will those who are required to resolve have enough information to make a decision?" If not, the task output cannot be relied on. However, there is a need to steer clear of too much information as well, in order to keep the documents readable. Some representations are good for working documents, some are good for technical input, some are good for formal review, and some can be made simpler for validation. It is important to adapt to the readers' needs, and to have a customer reviewer available to help guide the content.

Track Progress Using Metrics

It is possible to establish tangible metrics that can be used to gauge the progress of the integration task for status reporting and scheduling. For our integration project, the metrics were based on the intersection diagram shown in Figure 9. Each legacy system being integrated has a total number of objects (entities and attributes) within it. This number applies to each full circle in the diagram. The isolation of overlap areas in the systems, as they are prepared for integration, will uncover the number (and percent) of objects that fall within each band of the diagram (i.e., a--shared and direct functional relevance, b--shared and indirect functional relevance, and c--common). Once these numbers have been identified, progress can be monitored by reporting the number of objects in categories a, b, and c that have been merged; the number of objects in each category that show up in a conflict; the number of objects in each category that have resolutions; the number of objects that have had traceability noted upwards, and the number noted downwards; and so on.
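The band counts lend themselves to a one-screen status report. The numbers below are invented solely to show the shape of the report just described:

```python
# Objects per band of the Figure 9 diagram (invented figures):
# a -- shared, direct functional relevance; b -- shared, indirect
# functional relevance; c -- common but not shared (e.g., reference tables).
totals   = {"a": 120, "b": 340, "c": 85}
merged   = {"a": 90,  "b": 150, "c": 40}
conflict = {"a": 25,  "b": 30,  "c": 12}
resolved = {"a": 20,  "b": 18,  "c": 7}

for band in totals:
    pct = 100 * merged[band] / totals[band]
    print(f"band {band}: {merged[band]}/{totals[band]} merged ({pct:.0f}%), "
          f"{conflict[band]} in conflict, {resolved[band]} resolved")
```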
Draft Top-Down Model is Key to Familiarizing with Terms

In some of the first systems we worked on, the mid-level model which reflects policy (to compare to implementations) was not planned. Its development on the integration task was done as an exercise of the tools and procedures, while awaiting the re-engineered models of the personnel and payroll systems. It was found that this exercise greatly contributed to familiarity with terminology as used by each of the systems' experts. This familiarity helped guide the initial focus sessions, helping to quickly isolate terms that were being used differently in each system's context.

Integration Sequence

The sequence of integration will affect the bias of terminology usage that ends up in the ultimate consolidated standard. If the focus sessions do not proceed in this order, however, or if attendance by the integrator does not occur in this sequence, or if the sessions do not cover interchange elements right from the beginning, terminology bias will occur incorrectly.

In the case of our project, payroll system information was available significantly earlier than personnel system information. While the bias has for the most part been overcome through rigor in integration, it is possible that information provided by the payroll system contacts might have been questioned more thoroughly from a personnel usage standpoint than it was, simply by being introduced after personnel functions and data were more familiar. One suggested resolution is to be sure to schedule interview sessions according to the priority sequence. An additional suggestion is to be sure to obtain ongoing validation of the terms from each system being integrated, even after its integration has been completed. For this prototype effort, it was important to take the payroll system input back to the personnel system experts to validate the feedback obtained, and to reach a common understanding.

Use of Legacy System Documentation

Legacy system logical data models are required, but, in our case, models were not available at first, so legacy system documentation was researched as sample input for conflict isolation. Since this documentation often has more in-depth definitions than are found in model data dictionaries, conflicts were sometimes found that would not have been discovered simply by reviewing the input models. On the other hand, the models contain more semantics in the form of data dependencies and association text, as well as in the structural representation of the element (i.e., entity vs. attribute, primary vs. subtype entity). The models also represent current actual practice, something not easily gleaned from documentation. The recommendation is to try to use both as information sources to help in investigating, analyzing, and uncovering candidate conflicts. Along the same lines, the legacy system model representation should take precedence whenever contradictions occur between the models and the system documentation, since the models attempt to capture the actual as-is implementation as it has evolved over time, while documentation may not be up to date with actual practice.
Integration Task Output

Conflict Groups

The conflicts found between legacy systems were originally going to be documented only in cross-reference or matrix form. These show one element name from one system with one element name of the other system, noting the potential conflict. To serve the purposes of conflict resolution, however, the elements need to be logically grouped to include all of the several conflicts that involve those terms. The reason is that conflict resolutions impact each other when like terms are involved. For example, if there was a synonym conflict between ACTIVITY in one system and UNIT in another, it may be agreed that UNIT should be selected; if there is also a synonym conflict between UNIT and ORGANIZATION, and on its own the term ORGANIZATION is selected, does that now mean that what used to be called ACTIVITY should now also be changed to ORGANIZATION? Or would the existence of both conflicts as a group lead the integrator to suggest that all of ACTIVITY, ORGANIZATION, and UNIT be called UNIT? Or would the integrator find that there was a homonym use of UNIT going on? Grouping both conflicts together (or multiple conflicts if necessary) provides the highest level of semantics from which the integrator can make recommended resolutions, as the sketch below illustrates.
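Mechanically, forming conflict groups is a connected-components computation over the pairwise conflict list. A minimal sketch follows, assuming conflicts are reduced to term pairs (real records would carry the conflict type, definitions, and trace links):

```python
from collections import defaultdict

def conflict_groups(pairs):
    """Group pairwise term conflicts into clusters that share any term."""
    graph = defaultdict(set)
    for a, b in pairs:
        graph[a].add(b)
        graph[b].add(a)
    seen, groups = set(), []
    for term in graph:
        if term in seen:
            continue
        stack, group = [term], set()
        while stack:  # depth-first walk of one connected component
            t = stack.pop()
            if t in group:
                continue
            group.add(t)
            stack.extend(graph[t] - group)
        seen |= group
        groups.append(sorted(group))
    return groups

pairs = [("ACTIVITY", "UNIT"), ("UNIT", "ORGANIZATION"), ("AGENCY", "SERVICING AGENCY")]
print(conflict_groups(pairs))
# [['ACTIVITY', 'ORGANIZATION', 'UNIT'], ['AGENCY', 'SERVICING AGENCY']]
```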

Stewardship Conflicts

When identifying stewardship of the exchanged data elements, there may be a subset which is claimed by both (or multiple) system experts. This issue cannot be resolved by the integrator during integration. It should be noted within the Conflict Resolution Document, and the preliminary recommended resolutions can be used as input to the integrator's decision. The recommendations will require further review before actual implementation; this multiple claiming should be addressed at that time. The integrator can raise a flag within the document to make reviewers aware of these problems.

Use of Cross-Reference List for Customer Input

In our integration task, cross-reference lists were used heavily to obtain input from systems' functional area experts. Be very careful to include only information that is relevant to the given system when requesting feedback, if possible, to keep the expert focused on the input you need rather than on differences found in the other system(s). Along the same lines, be sure to provide separate lists for each system, with their terms listed first. This is not only an issue of pride of ownership but also a matter of providing information that will be most recognizable to them; the more recognizable, the better and faster their review can be.

Conclusion

Requirements-based re-engineering is not for solving minor information technology problems or providing minor improvements to legacy information systems. Re-engineering is a non-trivial task best directed toward formulating long-term solutions for the business enterprise.

A requirements-based approach to data modeling and re-engineering requires cooperation and resources from the policy makers and interpreters, the end users and functional experts, and the full spectrum of IS professionals. Therefore, it is of paramount importance to align project objectives, plans, methods, and tools with the organization's strategic plan in order to obtain buy-in and commitment from system stakeholders ([MH94], [AHR94], [AHR93], [HI93a], [HI93b]).

It is essential to understand that automated tools, while facilitating the process of re-engineering, cannot do the entire job. A suite of tools supporting the life cycle of the re-engineering effort that can exchange data would be ideal. No CASE tool in itself can do the entire job, and it may even prove to be a bottleneck. To ensure that a thorough job of re-engineering is performed, human intervention is required for analysis and problem solving. Indeed, a committed team is required to establish the continuity of expertise and purpose needed to successfully implement a model-driven system development and maintenance environment.

Acknowledgments

The work described here is based on the joint efforts of the Data Processing Systems Modernization Program. The authors are indebted to Carlo Zaniolo of UCLA for comments and discussions.

References

[DoD94] DOD 8320.1-M-1: Data Element Standardization Procedures.

[MH94] Muntz and Hobson. Lessons Learned: Re-engineering DOD Legacy Information Systems. Software Engineering Techniques Workshop on Software Reengineering, Pittsburgh, PA, May 3-5, 1994.

[AHR94] Aiken, Muntz, Richards. DOD Legacy Systems: Reverse Engineering Data Requirements. Communications of the ACM, May 1994, Volume 37, Number 5, pages 26-41.

[AHR93] Aiken, Muntz, Richards. A Framework for Reverse Engineering DOD Legacy Information Systems. Proceedings of the Working Conference on Reverse Engineering, May 21-23, 1993, Baltimore, Maryland, pages 180-191.

[HI93a] Hughes Information Technology Corporation. Approaches to Cross System Functional Integration: Phase I, Final Report for Data Processing Systems Modernization Program. Prepared for Defense Information Systems Agency, April 5, 1993.

[HI93b] Hughes Information Technology Corporation. Reverse Engineering DOD Legacy Information Systems, Phase I, Final Report for Data Processing Systems Modernization Program. Prepared for Defense Information Systems Agency, April 13, 1993.

[BLN86] Batini, Lenzerini, Navathe. A Comparative Analysis of Methodologies for Database Schema Integration. ACM Computing Surveys, Vol. 18, No. 4, December 1986, pages 323-364.

[P&B94] Premerlani, Blaha. An Approach for Reverse Engineering of Relational Databases. Communications of the ACM, May 1994, Volume 37, Number 5, pages 42-49.

[HDA87] Hogshead-Davis, Arora. Converting a Relational Database into an Entity-Relationship Model. Proceedings of the Sixth International Conference on Entity-Relationship Approach, 1987.

[M&M90] Markowitz, Makowsky. Identifying Extended Entity-Relationship Object Structures in Relational Schemas. IEEE Transactions on Software Engineering 16, 8 (August 1990), pages 777-790.

[M&S89] Markowitz, Shoshani. On the Correctness of Representing Extended Entity-Relationship Structures in the Relational Model. SIGMOD '89, ACM, New York, 1989.

[NEK94] Ning, Engberts, Kozaczynski. Automated Support for Legacy Code Understanding. Communications of the ACM, May 1994, Volume 37, Number 5, pages 50-57.

[MNBBK94] Markosian, Newcomb, Brand, Burson, Kitzmiller. Using an Enabling Technology to Reengineer Legacy Systems. Communications of the ACM, May 1994, Volume 37, Number 5, pages 58-71.

[HCTJ93] Hainaut, Chandelon, Tonneau, Joris. Contribution to a Theory of Database Reverse Engineering. Proceedings of the Working Conference on Reverse Engineering, May 21-23, 1993, Baltimore, Maryland, pages 161-170.
