
Size Based Estimation For Legacy Applications

Gerald Varghese / Venkatachalam Narayana Iyer

TABLE OF CONTENTS

ABSTRACT
INTRODUCTION
  SOFTWARE ESTIMATION - AN OVERVIEW
  INDUSTRY FOLLOWED ESTIMATION MECHANISMS
    Analogy Estimation
    Parametric Estimation
    Bottom Up
    Expert Judgment
  PREDICTION MODELS
  DATA COLLECTION
  LEGACY SYSTEMS
  ESTIMATION MODELS BASED ON PROJECT CLASSIFICATION
    Development Projects
    Maintenance Projects
    Testing Projects
  SNAPSHOT ON USEFUL TOOLS
    Microsoft Excel for statistical data analysis and prediction
    Minitab
CONCLUSION
REFERENCE


Abstract

This paper gives an overview of the different size-based estimation mechanisms best suited for legacy applications. It touches upon the industry-known estimation mechanisms, with focused details on the Analogy and Parametric models of estimation. Project experiences from development, maintenance and bug-fixing projects on legacy applications have been considered while preparing this paper. The estimation practices that gave the best results for legacy applications have been analyzed and implemented. The paper describes practical methodologies that have been piloted in various projects executed in our organization, and discusses the situations/projects best suited for each type of estimation. Statistical methodologies based on linear, quadratic and cubic models, which can be used for estimation, are detailed. Statistical features of Microsoft Excel and other statistical analysis tools like Minitab are also discussed. An insight into "predictive management" is also covered; predictive management deals with predicting the remaining time midway through the project execution period. Practicing the modus operandi mentioned in the paper helps in interacting with the different stakeholders in a more professional and robust manner, as the entire communication is based on quantitative data. The paper shares the estimation experience gained from executing legacy projects and supports the readers with some of the best practices to be adopted for estimation in legacy projects.

Keywords: Estimation, Legacy applications, Analogy, Parametric model, Predictive, Statistical Analysis


Introduction

Software Estimation - An Overview

During the initiation of any software project, one of the key challenges faced by project managers is estimating the effort required for the execution of the project. Especially in the highly competitive software market, the ability to come up with an accurate estimate for a project is one of the most desired competencies in any project manager.

One of the major inputs for the project manager in deciding the estimated effort is the SIZE of the work product coming out of the project. The accuracy of the project's effort estimate depends on how accurately the size of the deliverables is estimated. One of the contributing factors to the failure of software projects is that many projects get estimated based on the "personal working methods/experience" of the project managers. There is a lot of information available in the industry knowledge base on estimating projects. A few of the industry-standard practices and estimation mechanisms are listed below.

Industry Followed Estimation Mechanisms

During project planning, once the list of activities is identified, the next step is to estimate the duration required for each activity. This paper focuses on this part of the project management process. Different project managers follow different methodologies/processes for this, including:

• Analogy Estimation
• Parametric Estimation
  o Function Point (FP) based
  o Lines Of Code (LOC) based
• Bottom Up
• Expert Judgment

Analogy Estimation

This methodology is based on the "gut feeling" (or unaided intuition) of the person doing the estimation, with or without some unclear data from older projects. At the project level, a number is judiciously arrived at and distributed across all the activities involved in the project. The project is compared with another completed project of a similar type. The basis of estimation is to characterize the project for which the estimate is to be made and then use this characterization to find similar projects that have already been completed. The effort estimate is then derived from the known effort values for these completed projects. Previous experience in a similar assignment could be an added advantage and could contribute towards the accuracy of the estimate, but only in a subjective manner. Documented historical data as such is not used much when estimating with this mechanism. It is the most commonly used mechanism for effort-based maintenance estimation. To estimate by analogy, one needs to determine how best to characterize the project (define variables that might have to be measured on different scales), determine similarity and relate projects for comparison, and finally decide how much to depend on the effort values from the analogous projects when estimating the new project.

Parametric Estimation

The key focus in this type of estimation is on identifying the "parameters" that contribute towards the size of the work involved in the project's tasks.

Function Point (FP) based: A function point is a measure of the size of a software application from a functional or user point of view. It is independent of the programming language, but depends on the development methodology, the technology and the capability of the project team developing the application. Function points tend to be most useful in data-oriented applications, and are not restricted to estimation of code alone: FP can be used for all classes of software and any combination of languages, and supports quality and productivity measurement. It also enables value-added analysis and discussion with clients.

The FP count for a project cannot be calculated accurately until the analysis phase is complete (i.e., the Software Requirement Specifications are baselined). Estimation based on function points is derived from several counts. The approach is to identify the number of unique function types: external inputs, external outputs, external queries, external interface files and internal logical files. Each of these is then individually assessed for complexity and given a weight that varies from 3 to 15. The Unadjusted Function Point (UFP) count is calculated as the sum of the products of each function type count with its weight. The adjusted FP count is then calculated as the product of the UFP and a Technical Complexity Factor (TCF), which is derived from the environmental/general system characteristics according to their degree of influence on the application being measured.
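As a rough sketch of this arithmetic (not a full IFPUG counting implementation), the snippet below computes UFP and the adjusted FP count. The function-type counts, the per-type weights and the degree-of-influence ratings are assumed, and the value adjustment formula used (0.65 + 0.01 × sum of the ratings) is the commonly used IFPUG-style form.

```python
# Minimal sketch of unadjusted/adjusted function point arithmetic.
# Counts, weights and ratings below are illustrative assumptions; real
# weights are taken per function instance from the IFPUG tables (3..15).

function_counts = {          # assumed counts of each function type
    "external_inputs": 12,
    "external_outputs": 8,
    "external_queries": 5,
    "internal_files": 4,
    "external_interfaces": 2,
}

weights = {                  # assumed weight per type (varies with complexity)
    "external_inputs": 4,
    "external_outputs": 5,
    "external_queries": 4,
    "internal_files": 10,
    "external_interfaces": 7,
}

# Unadjusted Function Points: sum of (count x weight) over all function types
ufp = sum(function_counts[t] * weights[t] for t in function_counts)

# Technical Complexity Factor from 14 general system characteristics,
# each rated 0-5 for its degree of influence (ratings assumed here).
degrees_of_influence = [3, 2, 4, 1, 0, 2, 3, 1, 2, 0, 1, 3, 2, 1]
tcf = 0.65 + 0.01 * sum(degrees_of_influence)

adjusted_fp = ufp * tcf
print(f"UFP = {ufp}, TCF = {tcf:.2f}, Adjusted FP = {adjusted_fp:.1f}")
```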

Function Point, being an abstract unit, will never yield a perfect estimate on its own; it always depends on the definitions attached to it, and is therefore hard to automate and laborious to compute. The weights given to the function types are subjective and do not fit real-time systems well. The International Function Point Users Group (IFPUG) manages, regulates and issues updates to the counting standards, making function points documentable and traceable.

Lines Of Code (LOC) based: LOC-based estimation relates size directly to effort. LOC is easy to determine, automate and enforce; it is counted excluding comments and code created by application generators. It is a simple measure, more familiar and understandable than FP, but generally less accurate.

Stating the size of an application in LOC is not a good measure for understanding a system on its own, since the basic assumption is that all lines are of the same complexity. It is language dependent, and there is no way to express the difference between complexity and functionality. However, for projects where an existing piece of an application is being modified, it is easier to measure the size of the project using impacted lines of code than FP.

Bottom Up

For bottom-up estimation, a detailed activity list needs to be known beforehand. Each identified component is estimated by the person responsible for it, and the individual estimates are then summed to get the overall cost of the project. Estimating component by component consumes a lot of time, and the component breakdown needs to be known up front, which may not be practical for a maintenance task. It also requires the same kind of estimation to be performed on each of the different components by the responsible persons. This method introduces a very high risk of error, as there is no system-level focus during the estimation, and it is costly.

Expert Judgment

This method is also known as the 'Delphi technique' for software estimation. The estimated effort is essentially the decision made by a group of experts during estimation. It is mostly practiced when no historical data is available but people experienced in similar work are. Expert judgment can also be supported by historical data, process guidelines and checklists, but it ultimately depends on the persons involved in the estimation, their personal characteristics and their past experience. Ideally, judgment is based on the observable requirements, uninfluenced by emotions or personal prejudices. The whole process of estimation is usually a group discussion among the experts that ends with an estimate everyone can accept. The effectiveness of the estimate is therefore based on the skill of the people involved and their role in the process. The quality of the combined estimate depends on the number of experts, each individual's estimation accuracy, the degree of bias among the experts and the inter-correlation between the experts' estimates. Expert judgment can be more accurate than most other practiced estimation methods when environmental changes are not a factor and the estimate is needed for a simple system.
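As a minimal illustration of combining individual estimates (the names, figures and the simple averaging rule are assumptions, not something prescribed by the Delphi technique itself):

```python
# Minimal sketch: combining individual expert estimates into one figure.
# A real Wideband Delphi session would iterate until the spread narrows.
from statistics import mean, pstdev

estimates_hours = {"expert_a": 320, "expert_b": 280, "expert_c": 410}  # assumed

combined = mean(estimates_hours.values())
spread = pstdev(estimates_hours.values())

print(f"Combined estimate: {combined:.0f} h (spread +/- {spread:.0f} h)")
# A large spread is usually the trigger for another round of discussion.
```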


Prediction Models

Most organizations in this industry have processes in place to collect and document different kinds of project data. This data effectively becomes the historical data used when estimating future projects. During the estimation phase, different techniques have to be considered for making proper predictions. Prediction is basically done using statistical data analysis and statistical extrapolation techniques.

Statistical extrapolation is typically done by plotting the data points on a graph and fitting them with some form of mathematical curve/equation. The curve could be of first degree (a linear curve), second degree (a quadratic curve) or even third degree (a cubic curve). The fitted curve is then extended for future predictions: the unknown coordinate for any known parameter value can be read off the curve. This process is termed "defining the prediction model". A linear fit gives a first-degree prediction model, a quadratic fit a second-degree model, and likewise for higher degrees. A higher-degree model generally fits the historical data more closely, at the cost of more effort in defining it. These prediction models are used in parametric estimation.
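If historical size/effort data points are available, the fitting itself is straightforward. Below is a minimal sketch in Python (an alternative to the Excel approach used later in the paper); the sample data points and the new size value are assumptions for illustration.

```python
# Minimal sketch: fitting first-, second- and third-degree prediction curves
# to historical (size, effort) data points and extrapolating for a new size.
import numpy as np

size_kloc = np.array([5, 8, 12, 20, 27, 35], dtype=float)     # assumed sizes
effort_hours = np.array([400, 610, 980, 1650, 2300, 3100.0])  # assumed effort

new_size = 45.0  # size for which effort is to be predicted

for degree, label in [(1, "linear"), (2, "quadratic"), (3, "cubic")]:
    coeffs = np.polyfit(size_kloc, effort_hours, degree)  # least-squares fit
    predicted = np.polyval(coeffs, new_size)
    print(f"{label:>9} model: predicted effort ~ {predicted:.0f} h")
```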

The subsequent sections of the paper cover the different types of prediction models in detail, following a step-by-step approach to the size-based estimation of different types of projects. One of the key factors contributing to the accuracy of a prediction model is the DATA that goes into defining it.

Data Collection

During the execution of any project, focused thought needs to go into what data should be collected. The data collected needs to be detailed enough to be usable for similar projects; at the same time, the level of detail should not become an overhead for the project execution team. Project management needs to identify the optimum mechanisms for performing data collection.

From a future project estimation perspective, the size of the application developed and the effort taken to complete it must be captured and fed into the organization's historical database. The effort captured should also be categorized by project phase, where applicable.

If the project involves many repeated and similar activities, the activities need to be categorized and data collected at the lowest practical component level to facilitate future decision making. Consider a project involving similar changes to 1000 programs: the programs need to be classified into categories such as simple programs, programs of average complexity and highly complex programs. The categorization may be based on a predefined set of program attributes, such as total lines of code, number of database operations, number of screen/report interfaces, etc. The effort required and the size of the change for each program need to be collected and documented. After collecting data for, say, the first 100 programs, the data can be analyzed to make informed decisions about the remaining 900 programs, including the expected effort required to complete them. This example is detailed further in the "Maintenance Projects" section; a sketch of the data collection itself is shown below.
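A minimal sketch of such category-wise data collection is given below; the record fields and the sample values are assumptions for illustration.

```python
# Minimal sketch of per-program data collection rolled up by category.
# Field names and the sample records are assumptions for illustration.
from dataclasses import dataclass
from collections import defaultdict

@dataclass
class ProgramRecord:
    name: str
    category: str      # e.g. "simple", "average", "complex"
    loc: int           # total lines of code
    iloc: int          # impacted lines of code
    effort_hours: float

records = [
    ProgramRecord("PGM001", "simple", 800, 25, 3.5),
    ProgramRecord("PGM002", "complex", 4200, 180, 21.0),
    ProgramRecord("PGM003", "average", 1900, 70, 8.0),
]

by_category = defaultdict(list)
for r in records:
    by_category[r.category].append(r)

for category, recs in by_category.items():
    total_iloc = sum(r.iloc for r in recs)
    total_effort = sum(r.effort_hours for r in recs)
    print(f"{category}: {len(recs)} pgms, "
          f"productivity {total_iloc / total_effort:.1f} ILOC/h")
```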

A typical problem organizations in this industry face is collecting the required data accurately. The performing teams need to be oriented towards the importance of accurate data collection and made aware of how the collected data will benefit the organization in future projects. Once the team understands how important the data is and how it is going to be used, accurate data collection automatically becomes part of the project execution system.


Legacy Systems

During the evolution of computers, IBM focused heavily on Mainframe and AS400 systems, and most of the early adopters of computerization went for the then-available options like Mainframes and AS400. These systems were robust and served the needs of the corporate giants to a great extent. The applications running on these machines are quite old, many having run continuously for 20 to 30 years. Even in these days of technological revolution, there is reluctance to completely convert the massive systems running on legacy platforms to the latest high-end technologies. Though companies may not invest much in "developing" new applications from scratch on these systems, there is a lot of focus on "maintaining" the existing ones, which includes a fair amount of enhancement. A lot of focus is also given to "testing" the enhanced systems, to ensure the enhancements do not impact the execution of existing business processes.

Estimation Models Based on Project Classification

The projects considered here are classified as:

• Development projects
• Maintenance/Enhancement projects
• Testing projects

Development Projects

Typically in legacy systems, the size of an application being developed depends mainly on the number of functionalities to be implemented in it. Each application can be drilled down to program level for better analysis and monitoring. In Mainframe systems the programs will be in COBOL, and in AS400 they will be in RPG/RPGLE. These programs interface with maps/screens and reports, which are the primary mechanisms for user input and output in legacy systems.

With some technical effort, project performing organizations can come up with generalized "templates" for typical Inquiry/Update/Reporting programs. New programs can then be built on top of these templates, which acts as effective code re-use to an extent and saves a lot of productive time during project execution. The size of a new program primarily depends on the number of inputs/outputs/interfaces the program has. The project manager needs to identify the right technical parameters that contribute towards the size of the application; for example, introducing a new screen might add around 50 lines of code to a program, with the number of fields in the screen also mattering. The level of detail at which requirements are available determines how accurately the size of the application can be estimated.

Once a certain number of programs have been developed, their data needs to be collected and consolidated for further analysis. The analysis is oriented towards identifying the extent to which the identified parameters correlate with the size of the programs. So the first step in data analysis is calculating each individual parameter's correlation with the size of the program. This can be done with simple statistical analysis or with tools that help identify the correlation between two variables. A correlation of 0.7 and above is considered excellent.

There will be some parameters with correlation values much less than 0.7; the inference is that those parameters do not contribute significantly to the size of the program and can be ignored during subsequent data collection/estimation. A different set of new parameters could be added for further analysis. An overall correlation considering all the parameters together also needs to be calculated, to check whether the combined fit is strong; the overall correlation accounts for the effect of the parameters on one another as well. These are the preliminary steps in defining the prediction model. An illustration is provided below.


Some organizations use Function Points (FP) for estimating development projects. The illustration shown in Table 1 is a customized way of collecting and analyzing data using Lines Of Code to define the size of the application. Organizations need to put thought into coming up with similar customized mechanisms that best suit the types of projects they execute.

After identifying the factors which correlate well, individually and together with the other parameters, the next step is to define the prediction model. As discussed earlier, different types of curves (linear, quadratic or cubic) can be fitted to the data. As the degree of the curve increases, the accuracy increases, but more effort is required to come up with the prediction model. The prediction model then has to be judiciously baselined, and all further estimation based on it. Special care must be taken to keep appending real data to the model's data set as more programs are worked upon.

Illustration on data analysis and defining a prediction model

Table 1 documents the data of a few programs: the parameters contributing to the size of the application and the size of each program measured in lines of code (LOC).

Table 1

Update files  Input files  Windows  Screen  Report  Algorithm   Calculation  LOC
count         count        count    count   count   Complexity  Complexity
0             1            1        1       0       1           1            792
1             2            1        2       0       2           1            1292
0             2            0        1       0       1           1            671
0             10           1        2       1       1           1            2430
...           ...          ...      ...     ...     ...         ...          ...
0             4            1        1       1       2           2            1380

Individual correlations of each of the parameters with respect to LOC are listed in Table 2. These are calculated using the CORREL function in Excel, which is explained in the last section of this document.

Table 2

Update files  Input files  Windows  Screen  Report  Algorithm   Calculation
count         count        count    count   count   Complexity  Complexity
0.178         0.560        0.661    0.198   0.896   0.885       0.818
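For readers who prefer a scripted equivalent of the CORREL step, the sketch below computes the same per-parameter correlations in Python. Only the rows visible in Table 1 are included, so the resulting values will not match Table 2, which is based on the full data set.

```python
# Minimal sketch of the per-parameter correlation step (Excel CORREL equivalent).
# Only the sample rows shown in Table 1 are used here.
import numpy as np

columns = ["update_files", "input_files", "windows", "screen",
           "report", "algorithm_cx", "calculation_cx"]

data = np.array([  # parameter values per program, from the Table 1 subset
    [0, 1, 1, 1, 0, 1, 1],
    [1, 2, 1, 2, 0, 2, 1],
    [0, 2, 0, 1, 0, 1, 1],
    [0, 10, 1, 2, 1, 1, 1],
    [0, 4, 1, 1, 1, 2, 2],
], dtype=float)
loc = np.array([792, 1292, 671, 2430, 1380], dtype=float)

for name, values in zip(columns, data.T):
    r = np.corrcoef(values, loc)[0, 1]  # Pearson correlation, as CORREL gives
    print(f"{name:>15}: r = {r:.3f}")
```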

The prediction model definition (Table 3) in this illustration is done using the LINEST function, which gives the equation of a best-fit straight line in Excel. LINEST is an array function, returning a set of tabular values as its output. How to use this function is explained in the last section of the document.

Table 3

539.03       558.17     912.85   -41.69   -91.19   49.30   12.47   161.81
156.22       134.83     420.99   102.20   128.71   39.29   63.59   396.66
0.99         236.14     #N/A     #N/A     #N/A     #N/A    #N/A    #N/A
46.74        3.00       #N/A     #N/A     #N/A     #N/A    #N/A    #N/A
18245661.42  167291.49  #N/A     #N/A     #N/A     #N/A    #N/A    #N/A

The first element in the third row gives the coefficient of determination (R²), indicating the overall fit considering all the parameters together. In this example the value is 0.99, which shows an excellent overall fit (even though some of the individual correlation values are less than 0.7). The values in the first row are the coefficients to be used in defining the prediction curve. Using the first-row values, the equation of the curve (to be used for predicting LOC) is derived as:

LOC = 539.03 * calculation complexity + 558.17 * algorithm complexity + 912.85 * report count – 41.69 * screen count – 91.19 * windows count + 49.30 * input files count + 12.47 * update files count + 161.81
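As a usage sketch, the snippet below applies this derived equation to a hypothetical new program; the coefficient values are taken from the first row of Table 3, while the parameter values of the new program are assumptions.

```python
# Minimal sketch: applying the prediction equation derived from Table 3 to a
# hypothetical new program. The new program's parameter values are assumed.
coefficients = {            # first row of the LINEST output (Table 3)
    "calculation_cx": 539.03,
    "algorithm_cx": 558.17,
    "report": 912.85,
    "screen": -41.69,
    "windows": -91.19,
    "input_files": 49.30,
    "update_files": 12.47,
}
intercept = 161.81

new_program = {             # assumed characteristics of a program to be built
    "calculation_cx": 1,
    "algorithm_cx": 2,
    "report": 1,
    "screen": 2,
    "windows": 1,
    "input_files": 3,
    "update_files": 1,
}

predicted_loc = intercept + sum(coefficients[p] * new_program[p]
                                for p in coefficients)
print(f"Predicted size: ~{predicted_loc:.0f} LOC")
```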


Maintenance Projects

Now let us consider the next type of project, the most common on legacy systems. Maintenance work comes in different styles: field expansion projects, maintenance work similar to Y2K projects, file replacement projects, or application enhancements. In these projects, the amount of work involved can typically be measured by understanding how much of the application is impacted. At program level, this is measured in 'Impacted Lines Of Code' (ILOC).

One significant factor to be considered in these projects is the impact percentage: the ratio of the size of the application impacted to the total size of the application, i.e. the ratio of the total ILOC across all impacted programs to the total LOC of those programs.

Let us go back to the '1000 programs' example from the data collection section. Consider that a similar activity is to be performed in 1000 programs. The project team collects data for an initial set of, say, 100 programs, which can again be put into categories like simple, average and complex. For each of these programs, the ILOC, LOC and the effort required for the modifications are captured per category. With this information, the impact percentage for each category can be determined. Knowing the LOC of the remaining 900 programs, classified into the same categories, the impacted size of the remaining work can be predicted. From the data for the initial 100 programs, the productivity factor for each category is also known, and with this the effort required for the remaining 900 programs can be predicted with good accuracy.

Illustration on estimation for maintenance projects

Table 4

Total ILOC till date                         500         Assumed for 1st 100 pgms
Total LOC till date                          15000       Assumed for 1st 100 pgms
Total Hours till date                        250         Assumed for 1st 100 pgms
Impact % (ratio of ILOC to LOC)              3.33%       = 500 * 100 / 15000
Total No. of programs                        1000
Total No. of programs completed              100
Average LOC / program                        150.00      = 15000 / 100
Expected LOC for remaining pgms              135000.00   = (1000 – 100) * 150
Expected ILOC for remaining pgms             4500        = 135000 * 3.33%
Productivity (ILOC / Hr)                     2           = 500 / 250
Time required for remaining programs (Hrs)   2250        = 4500 / 2

The methodology explained in Table 4 is a simple prediction model; it could be enhanced by defining sub-categories and performing the same process at the sub-category level for more accurate results. For example, out of the 1000 programs, 300 might be classified as simple, 400 as medium complexity and 300 as highly complex. In that case, separate Impact % and productivity values are calculated for each category, the time required to complete the simple, medium and highly complex programs is calculated separately, and summing them gives the figure at project level. A sketch of this per-category calculation is shown below.
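This is a minimal sketch of the per-category re-estimation; all the sampled figures, remaining program counts and average LOC values per category are assumptions for illustration.

```python
# Minimal sketch of the per-category re-estimation described above.
# The sampled figures for each category are assumptions for illustration.
categories = {
    # name: (sampled_iloc, sampled_loc, sampled_hours, remaining_pgms, avg_loc_per_pgm)
    "simple":  (120, 6000, 40.0, 270, 100.0),
    "average": (200, 8000, 90.0, 360, 180.0),
    "complex": (180, 5000, 120.0, 270, 350.0),
}

total_remaining_hours = 0.0
for name, (iloc, loc, hours, remaining, avg_loc) in categories.items():
    impact_pct = iloc / loc              # ratio of ILOC to LOC in the sample
    productivity = iloc / hours          # ILOC completed per hour in the sample
    expected_iloc = remaining * avg_loc * impact_pct
    hours_needed = expected_iloc / productivity
    total_remaining_hours += hours_needed
    print(f"{name:>7}: impact {impact_pct:.1%}, {hours_needed:.0f} h remaining")

print(f"Project-level remaining effort: {total_remaining_hours:.0f} h")
```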


The mechanisms mentioned here are very helpful for effective communication with the different stakeholders, whether clients, project sponsors or any other affected party. The process can be performed at different stages of the project for more quantitative control over it. It gives a lot of information on how the project is performing (in terms of completion), how many more or fewer human resources are needed for the remaining phase of the project, and similar areas. Communication using quantitative values has a distinct effect compared to subjective statements. When performed during the execution phase of the project, this process can be termed 're-estimation'. The key point to remember is that estimation/re-estimation is fundamentally performed on the size of the work involved. Clients get the assurance that they are charged for a defined size of work rather than in vague terms. Such objective measurements add a lot of weight to negotiations with clients and also give clients a lot of confidence in the project performing organization.

Testing Projects

Testing any application involves different phases: understanding the business of the system, creating unit, integration and system test scripts, creating test data, performing the testing, documenting the results, etc. Some of these steps, such as test script creation and test execution, are directly dependent on the size of the application. During project execution, some of these steps are combined for easier management: for example, understanding the business and creating the test scripts could be clubbed together, and similarly creating test data, testing and documenting the test results. Data then needs to be collected only at this group level for proper prediction based on the size of the application.

The size of a testing project is primarily decided by the number of test cases to be executed, which in turn depends on the number of lines of code to be tested. If testing is performed on a newly developed application, all the lines of code have to be tested, whereas for enhancement/maintenance projects only the impacted lines of code need to be tested. In either case, the test scripts can be categorized as unit test scripts, integration test scripts and system test scripts.

The organization needs to collect and consolidate data on testing projects as well. Specifically with respect to estimation, the historical data should record how many test cases were created per how many lines of code (or impacted lines of code), along with the effort required to create them; similar data should be available for executing the tests. A linear prediction model can then be easily applied to this historical data to predict the project's estimate. Again, the estimation is based on the size of the work involved, such as the number of test cases created and executed. Using the prediction models, the effort required for creating and executing the different test cases can be estimated efficiently, as sketched below.
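The sketch below illustrates such a size-based testing estimate; the historical ratios (test cases per thousand impacted lines, effort per test case) and the ILOC in scope are assumptions, and would in practice come from the organization's historical database.

```python
# Minimal sketch of a linear prediction model for a testing project.
# The historical ratios and the ILOC figure below are assumptions.
iloc_to_test = 4500                 # impacted lines of code in scope for testing

test_cases_per_kiloc = 12.0         # from historical data: cases per 1000 ILOC
hours_per_case_creation = 1.5       # scripting + data preparation, per case
hours_per_case_execution = 0.8      # execution + result documentation, per case

expected_cases = (iloc_to_test / 1000.0) * test_cases_per_kiloc
creation_effort = expected_cases * hours_per_case_creation
execution_effort = expected_cases * hours_per_case_execution

print(f"Expected test cases: {expected_cases:.0f}")
print(f"Estimated effort: {creation_effort + execution_effort:.0f} h "
      f"({creation_effort:.0f} h creation, {execution_effort:.0f} h execution)")
```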

Summarizing the points discussed, the key steps to be performed for effective prediction are:

• Identify the parameters contributing to the size of applications
• Identify the level of detail at which data can practically be collected accurately and analyzed
• Statistically evaluate how much the identified parameters contribute to the size
• Define the prediction model based on the correlating parameters
• Apply the prediction model to predict future work

These methodologies hold good at both the organization level and the project level. Some organizations have a dedicated group of people to perform these practices; in other cases, project managers can apply them for effective project management.


Snapshot on Useful Tools

Microsoft Excel for statistical data analysis and prediction

Microsoft Excel comes with a lot of functions for statistical data analysis. It has many built-in functions facilitating statistical analysis, such as those for beta distributions, covariance, the Fisher transformation, etc. An overview of the functions used for the prediction model is given here.

CORREL function usage: The CORREL function calculates the correlation coefficient of any two data sets. Correlation indicates how strongly one factor varies with another. The parameters for the CORREL function are:
1. An array of known X's (e.g. the update files count or input files count values)
2. An array of known Y's (e.g. the LOC values)

After laying out the known X's and Y's in an Excel sheet, select any cell, click the 'Insert' menu and choose the 'Function' option. Select the CORREL function, then select the array of cells corresponding to the X's as the first parameter and the Y's as the second parameter. Pressing OK gives the correlation between X and Y in the selected cell.

LINEST function usage: The LINEST function is used to find the equation of the line (y = m1*x1 + m2*x2 + … + b) that best fits the data. It calculates the coefficients of the line using the "least squares" method. Here 'y' is the LOC to be predicted, the x's are the different parameter values (update files count, input files count, etc.) and the m's act as their weights. The LINEST function returns an array that describes the line (i.e. the linear curve).

The usage of the LINEST function is a bit different from that of CORREL. The parameters for the LINEST function are:
1. An array of known Y's (the known LOC values)
2. An array of all known parameters (the X1's, X2's: update files count, input files count, etc.)
3. A logical value. Enter TRUE to have the constant 'b' calculated normally rather than forced to zero.
4. A logical value. Enter TRUE to make the function return the additional regression statistics.

In Excel, select 5 rows and as many columns as one more than the total number of parameters used. Choose the LINEST function, select the array of cells corresponding to the LOC values (Y's) as the first parameter, the array of cells corresponding to all the X's as the second parameter, and TRUE for the third and fourth parameters. Setting the third parameter to TRUE is needed when a constant term in the equation of the curve is desired. The coefficients listed in the first row of the output appear in the reverse order of the parameters, as seen in the equation derived from Table 3 in the illustration.

Minitab

Minitab is a dedicated statistical analysis tool brought to market by MINITAB Inc. It has many features supporting basic statistical analysis, regression analysis, analysis of variance, statistical process control, measurement system analysis, design of experiments, reliability/survival analysis, multivariate analysis, time series and forecasting, distribution analysis and more (with reference to Release 14 of Minitab). For quantitative management at project level, knowing all the details of Minitab may not be required, but a few features like correlation analysis and probabilistic analysis are really handy. If the organization has a focused approach (with a dedicated team) to quantitative data management at the organizational level, expertise in such tools can add a lot of value in effectively managing the data that is collected and consolidated. Such an organization can present itself to the market with more predictive power and will have a distinct competitive edge over its competitors.


Conclusion

The different estimation mechanisms currently practiced in the software industry have been discussed, with the pros and cons of each. The significance of the prediction model has been outlined, with emphasis on the importance of accurate data collection and analysis, and on the role precise project data plays in statistical analysis. The paper has brought out size-based estimation mechanisms for development, maintenance and testing projects that are most suited to legacy applications. An attempt has been made to help project managers face some of the key challenges of project execution, such as project estimation and quantitative data analysis, in a better way. A prediction model based on project classification gives a greater grip on the project by managing it in a quantitative and predictive way. The methodologies explained (such as re-estimation) also facilitate effective communication of project progress to all stakeholders: the team can predict whether it will be able to complete the project on time with the current set of resources. A group competent to perform these practices at the organizational level is a real asset, allowing the organization to stand apart from its competitors.

Prediction using these models is more accurate because the data used to define the model has the project's constraints built into it. Using the models does not demand "expert opinion", thereby making the approach more scalable and usable in all scenarios at the current team's competency level. The only prerequisite is a good collection of data to start from. This approach makes effective management of a project on a legacy system much easier.

Reference

The contents of this document have been drafted primarily from the experience gained while working on different legacy projects; the knowledge acquired over a long period has been shared herewith. That is why a descriptive reference section is provided instead of specific reference entries.
