
COMPUTING PRACTICES

Using Metrics to Evaluate Software System Maintainability

Don Coleman and Dan Ash, Hewlett-Packard
Bruce Lowther, Micron Semiconductor
Paul Oman, University of Idaho

In this month's Computing Practices we offer a sneak preview of Computer's September issue on software metrics. Software metrics have been much criticized in the last few years, sometimes justly but more often unjustly, because critics misunderstand the intent behind the technology. Software complexity metrics, for example, rarely measure the "inherent complexity" embedded in software systems, but they do a very good job of comparing the relative complexity of one portion of a system with another. In essence, they are good modeling tools. Whether they are also good measuring tools depends on how consistently and appropriately they are applied. The two articles showcased here suggest ways of applying such metrics.

Our first article, by Don Coleman et al., sets forth maintainability metrics for gauging the effect of maintenance changes in software systems, rank ordering subsystem complexity, and comparing the "quality" of two different systems.

The second article, by Norman Schneidewind, describes an approach to validating metrics for large-scale projects such as the space shuttle flight software. The proposed metrics isolate specific quality factors that let us predict and control software quality.

Please feel free to contact me directly about articles you liked, didn't like, or would like to see in this section ([email protected]).
-Paul Oman

With the maturation of software development practices, software maintainability has become one of the most important concerns of the software industry. In his classic book on software engineering, Fred Brooks[1] claimed, "The total cost of maintaining a widely used program is typically 40 percent or more of the cost of developing it." Parikh[2] had a more pessimistic view, claiming that 45 to 60 percent is spent on maintenance. More recently, two recognized experts, Corbi[3] and Yourdon,[4] claimed that software maintainability is one of the major challenges for the 1990s.

These statements were validated recently by Dean Morton, executive vice president and chief operating officer of Hewlett-Packard, who gave the keynote address at the 1992 Hewlett-Packard Software Engineering Productivity Conference. Morton stated that Hewlett-Packard (HP) currently has between 40 and 50 million lines of code under maintenance and that 60 to 80 percent of research and development personnel are involved in maintenance activities. He went on to say that 40 to 60 percent of the cost of production is now maintenance expense.

The intent of this article is to demonstrate how automated software maintainability analysis can be used to guide software-related decision making. We have applied metrics-based software maintainability models to 11 industrial software systems and used the results for fact-finding and process-selection decisions. The results indicate that automated maintainability assessment can be used to support buy-versus-build decisions, pre- and post-reengineering analysis, subcomponent quality analysis, test resource allocation, and the prediction and targeting of defect-prone subcomponents.

Further, the analyses can be conducted at various levels of granularity. At the component level, we can use these models to monitor changes to the system as they occur and to predict fault-prone components. At the file level, we can use them to identify subsystems that are not well organized and should be targeted for perfective maintenance. The results can also be used to determine when a system should be reengineered. Finally, we can use these models to compare whole systems. Comparing a known-quality system to a third-party system can provide a basis for deciding whether to purchase the third-party system or develop a similar system internally.


Recent studies in metrics for software maintainability and quality assessment have demonstrated that the software's characteristics, history, and associated environment(s) are all useful in measuring the quality and maintainability of that software.[5-7] Hence, measurement of these characteristics can be incorporated into software maintainability assessment models, which can then be applied to evaluate industrial software systems. Successful models should identify and measure what most practitioners view as important components of software maintainability.

A comparison of five models

We recently analyzed five methods for quantifying software maintainability from software metrics. The definition, derivation, and validation of these five methods have been documented elsewhere.[7] Only a synopsis of the five methods is presented here:

• Hierarchical multidimensional assessment models view software maintainability as a hierarchical structure of the source code's attributes.[6]
• Polynomial regression models use regression analysis as a tool to explore the relationship between software maintainability and software metrics.[8]
• An aggregate complexity measure gauges software maintainability as a function of entropy.[5]
• Principal components analysis is a statistical technique to reduce collinearity between commonly used complexity metrics in order to identify and reduce the number of components used to construct regression models.[5]
• Factor analysis is another statistical technique wherein metrics are orthogonalized into unobservable underlying factors, which are then used to model system maintainability.[5]

Tests of the models indicate that all five compute reasonably accurate maintainability scores from calculations based on simple (existing) metrics. All five models and the validation data were presented to HP Corporate Engineering managers in the spring and summer of 1993. At that time it was decided that the hierarchical multidimensional assessment and the polynomial regression models would be pursued as simple mechanisms for maintainability assessment that could be used by maintenance engineers in a variety of locations. HP wanted quick, easy-to-calculate indices that "line" engineers could use at their desks. The following subsections explain how these methods were applied to industrial systems.

HPMAS: A hierarchical multidimensional assessment model. HPMAS is HP's software maintainability assessment system based on a hierarchical organization of a set of software metrics. For this particular type of maintainability problem, Oman and Hagemeister[6] have suggested a hierarchical model dividing maintainability into three underlying dimensions or attributes:

(1) The control structure, which includes characteristics pertaining to the way the program or system is decomposed into algorithms.
(2) The information structure, which includes characteristics pertaining to the choice and use of data structure and dataflow techniques.
(3) Typography, naming, and commenting, which includes characteristics pertaining to the typographic layout and the naming and commenting of code.

We can easily define or identify separate metrics that can measure each dimension's characteristics. Once the metrics have been defined and/or identified, an "index of maintainability" for each dimension can be defined as a function of those metrics. Finally, the three dimension scores can be combined into a total maintainability index for the system. For our work, we used existing metrics to calculate a deviation from acceptable ranges and then used the inverse of that deviation as an index of quality.

Most metrics have an optimum range of values within which the software is more easily maintained. A method called weight and trigger-point-range analysis is used to quantify maintainability by calculating a "degree of fit" from a table of acceptable metric ranges. When a metric value falls outside the optimum range, it indicates that maintainability is lower; hence, there is a deviation (or penalty) on the component's contribution to maintainability. The optimum range value, called the trigger point range, reflects the "goodness" of the program style. For example, if the acceptable range for average lines of code (aveLOC) is between 5 and 75, values falling below 5 and above 75 serve as the trigger points for what would be classified as poor style. If the measured average lines of code value lies within the acceptable range, there is no penalty. If the metric value falls outside the trigger point range but is close to the bounds (trigger points), we apply a proportional deviation, which can run up to 100 percent (the maximum penalty). The weighted deviation is computed by multiplying the calculated deviation by a weight value between zero and one, inclusive.

The metric attributes are combined based on the assumption that dimensional maintainability starts at 100 percent (highly maintainable); it is then reduced by the weighted deviation percentage of each metric in that dimension. The overall maintainability index is the product of the three dimension scores. Multiplying the three dimensions' maintainability gives a lower overall maintainability than averaging does, which underscores the fact that deviation in one aspect of maintainability will hinder other aspects of the maintenance effort, thus reducing maintainability of the entire system.

HPMAS was calibrated against HP engineers' subjective evaluation of 16 software systems, as measured by an abridged version of the AFOTEC (Air Force Operational Test and Evaluation Center) software quality assessment instrument.[9] HPMAS maintainability indices range from 0 to 100, with 100 representing excellent maintainability.
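To make the mechanics concrete, here is a minimal sketch (ours, not part of any HP tool) of a weight and trigger-point-range calculation. The article states only that the deviation is proportional near the trigger points, capped at 100 percent, and scaled by a weight; the ramp width and the averaging of penalties within a dimension are our assumptions, and all names are hypothetical.

```python
def weighted_deviation(value, low, high, weight, ramp=0.5):
    """Weighted penalty (0..1) for one metric against its trigger point range.

    low, high -- the acceptable range; its endpoints are the trigger points
    weight    -- importance of this metric, between zero and one inclusive
    ramp      -- fraction of the range width over which the penalty grows
                 to 100 percent past a trigger point (our assumption)
    """
    if low <= value <= high:
        return 0.0                      # inside the optimum range: no penalty
    overshoot = (low - value) if value < low else (value - high)
    deviation = min(overshoot / ((high - low) * ramp), 1.0)  # capped at 100%
    return weight * deviation

def dimension_index(weighted_deviations):
    """Dimension maintainability: start at 100 percent, subtract the
    (here, averaged) weighted deviations of the dimension's metrics."""
    return max(0.0, 100.0 * (1.0 - sum(weighted_deviations) / len(weighted_deviations)))

def overall_index(dimension_scores):
    """Overall HPMAS index: the product of the three dimension scores."""
    product = 1.0
    for score in dimension_scores:
        product *= score / 100.0
    return 100.0 * product

# Hypothetical single-metric example using the aveLOC range from the text:
# an average of 90 lines of code overshoots the 5..75 range.
penalty = weighted_deviation(90, 5, 75, weight=1.0)   # about 0.43 with ramp=0.5
print(overall_index([dimension_index([penalty]), 100.0, 100.0]))
```

Note how the multiplicative combination at the end penalizes a weak dimension more severely than averaging would, matching the rationale above.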

Polynomial assessment tools. Regression analysis is a statistical method for predicting values of one or more response (dependent) variables from a collection of predictor (independent) variables. For purposes of software maintainability assessment, we need to create a polynomial equation by which a system's maintainability is expressed as a function of the associated metric attributes. We have used this technique to develop a set of polynomial maintainability assessment models.[8] These models were developed as simple software maintainability assessment methods that could be calculated from existing metrics. Since these models were intended for use by maintenance practitioners "in the trenches," the models were again calibrated to HP engineers' subjective evaluation of the software as measured by the abridged version of the AFOTEC software quality assessment instrument.[9] That is, the independent variables used in our models were a host of complexity metrics, and the dependent variable was the (numeric) result of the abridged AFOTEC survey.

Approximately 50 regression models were constructed in an attempt to identify simple models that could be calculated from existing tools and still be generic enough to apply to a wide range of software systems. In spite of the current research trend away from the use of Halstead metrics, all tests clearly indicated that Halstead's volume and effort metrics were the best predictors of maintainability for the HP test data. The regression model that seemed most applicable was a four-metric polynomial based on Halstead's effort metric and on metrics measuring extended cyclomatic complexity, lines of code, and number of comments:

Maintainability = 171 − 3.42 × ln(aveE) − 0.23 × aveV(g') − 16.2 × ln(aveLOC) + aveCM

where aveE, aveV(g'), aveLOC, and aveCM are the average Halstead effort, average extended cyclomatic complexity V(g'), average lines of code, and average number of comments per submodule (function or procedure) in the software system.

Preliminary results indicated that this model was too sensitive to large numbers of comments. That is, large comment blocks, especially in small modules, unduly inflated the resulting maintainability indices. To rectify this, we replaced the aveCM component with percent comments (perCM), and a ceiling function was placed on the factor to limit its contribution to a maximum value of 50.[10] Also, because there has been much discussion of the nonmonotonicity of Halstead's effort metric (it is not a nondecreasing function under the concatenation operation), we reconstructed the model using Halstead's volume metric instead. Thus, the final four-metric polynomial now used in our work is

Maintainability = 171 − 5.2 × ln(aveVol) − 0.23 × aveV(g') − 16.2 × ln(aveLOC) + 50 × sin(√(2.46 × perCM))

This polynomial has been compared to the original model using the same validation data. The average residual between the effort-based model and the volume-based model is less than 1.4.
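The final polynomial is straightforward to compute once a metrics tool has produced the four averages. The sketch below implements it directly; the function name is ours. One caveat: the text gives perCM as percent comments but does not spell out whether it enters the formula as a percentage or a fraction, so the sketch assumes a fraction (0..1), which keeps the sine term on its rising arc; either way the term is bounded and contributes at most 50.

```python
import math

def maintainability_index(ave_vol, ave_vg, ave_loc, per_cm):
    """Final four-metric polynomial from the text.

    ave_vol -- average Halstead volume per submodule
    ave_vg  -- average extended cyclomatic complexity, aveV(g')
    ave_loc -- average lines of code per submodule
    per_cm  -- percent comments, passed here as a fraction (assumption)
    """
    return (171
            - 5.2 * math.log(ave_vol)    # natural log, not log10
            - 0.23 * ave_vg
            - 16.2 * math.log(ave_loc)
            + 50.0 * math.sin(math.sqrt(2.46 * per_cm)))

# Illustrative numbers only: average volume 1,000, V(g') of 8,
# 40 lines of code, and 20 percent comments.
print(round(maintainability_index(1000.0, 8.0, 40.0, 0.20), 1))  # about 105.8
```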
Di- lated from existing tools and still be mension maintainability is calculated as generic enough to apply to a wide range of software systems. In spite of the current research trend away from the use of Hal- Applying stead metrics, all tests clearly indicated the models to that Halstead‘s volume and effort metrics were the best predictors of maintainabil- industrial software ity for the HP test data. The regression A software maintainability model is The overall maintainability index is the model that seemed most applicable was a only useful if it can provide developers product of the three dimensions. Multi- four-metric polynomial based on Hal- and maintainers in an industrial setting plying the three dimensions’maintainabil- stead’s effort metric and on metrics mea- with more information about the system. ity gives a lower overall maintainability suring extended cyclomatic complexity, Hence, the data used to test and validate than averaging does, which underscores lines of code. and number of comments: our models consisted entirely of genuine the fact that deviation in one aspect of industrial systems provided by Hewlett- maintainability will hinder other aspects Maintainability = 171 Packard and Defense Department con- of the maintenance effort, thus reducing - 3.42 x In(aveE) tractors. The examples are presented maintainability of the entire system. - 0.23 x aveV(g’) here to show how these models can aid HPMAS was calibrated against HP en- - 16.2 x ln(aveL0C) + aveCM software maintainers in their decision gineers’ subjective evaluation of 16 soft- making. The data presented in the fol- ware systems, as measured by an where aveE, aveV(g’), aveLOC, and lowing subsections is real and unaltered, abridged version of the AFOTEC (Air aveCM are the average effort, extended except that proprietary information has Force Operational Test and Evaluation V(G), average lines of code, and number been removed. Center) software quality assessment in- of comments per submodule (function or ~trument.~HPMAS maintainability in- procedure) in the software system. Using HPMAS in a prelpostanalysis of dices range from 0 to 100, with 100 rep- Preliminary results indicated that this maintenance changes. Over several years resenting excellent maintainability. model was too sensitive to large numbers of , systems tend to


Table 1 contains an overall analysis of the changes made to the subsystem. The HPMAS maintainability index in Table 1 shows that the maintainability of the subsystem was essentially unchanged (a 0.4 percent increase) even though the perfective maintenance changes had actually increased the complexity of the system. Specifically, 149 lines of code, two modules, and 29 branches were added to the system. Although the maintenance engineer denied that functionality increased, a visual inspection of the source code revealed that increased error checking had, in fact, been added to the code. For example, the original version of module Function-F, shown in section 2 of Table 2, contained 12 error-screening checks, while the modified version contained 16 error checks. (Throughout this discussion, function names have been changed to protect Hewlett-Packard proprietary information.)

Table 1. Comparing pre- and post-test results shows how much maintenance modification changes a subsystem.

                               Pretest     Post-test   Percent change
Lines of code                  1,086.00    1,235.00    13.4
Number of modules                 13.00       15.00    15.4
Total V(g')                      226.00      255.00    12.8
HPMAS maintainability index       88.17       88.61     0.4

Table 2 contains a module-by-module comparison of the pre- and post-test maintainability indices for the subsystem. The table is divided into four sections to demonstrate the distribution of maintenance changes. The first section of the table contains the modules that were not modified during the maintenance task. The second section contains modules that were slightly modified but which retained their original module names. The third section contains modules that have been modified and renamed. (The modules in this section were matched by visually inspecting the post-test system to identify any reused comments, variables, or control flow from the pretest system.) The last section contains modules in the pretest system that could not be matched to any module in the post-test system. (Visual inspection of the code revealed that the post-test components contained reused code from the pretest system, but it could not be matched to any one post-test component.) Thus, the last section represents an area of the program where the subsystem was repartitioned, resulting in a new subsystem organization.

Table 2. Module-by-module comparison of pre- and postanalysis results.

Section   Pretest analysis         Post-test analysis       Percent change
          Name         Metric      Name          Metric
1         Function-A    93.83      Function-A     93.83        0.0
          Function-B    93.82      Function-B     93.82        0.0
          Function-C    92.96      Function-C     92.96        0.0
          Function-D    84.41      Function-D     84.41        0.0
2         Function-E    86.24      Function-E     89.00        3.2
          Function-F    65.58      Function-F     67.27        2.6
          Function-G    88.06      Function-G     85.83       -2.5
3         Function-H    78.41      Function-H'    83.05        5.9
          Function-I    72.85      Function-I'    63.15      -13.3
          Function-J    67.75      Function-J'    66.43       -1.9
          Function-K    68.83      Function-K'    66.67       -3.1
4         Function-L    80.68      --             --           --
          Function-M    78.78      --             --           --
          Function-N    85.08      --             --           --
          Function-O    80.75      --             --           --
          Function-P    79.68      --             --           --
          Function-Q    69.68      --             --           --

This type of postmaintenance analysis can provide the maintenance staff with a wealth of information about the target system. For example, section 1 of Table 2 consists of unchanged components with relatively high HPMAS maintainability scores. If these components remain unchanged over several maintenance modifications, they might be considered for a library. Components in the second section address the system goal but have not yet reached the refinement of those in the first section. Their HPMAS metrics are generally lower than those in the first section, and they have changed less than ±5 percent from the pre- to postanalysis.
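Mechanically, a pre/postanalysis like Table 2 reduces to comparing two maps of module indices. A minimal sketch, with module names and a few index values taken from Table 2; the function name is ours:

```python
def pre_post_report(pre, post):
    """Module-by-module comparison of pre- and post-test indices.

    pre, post -- dicts mapping module name to its maintainability index.
    Returns matched percent changes plus the unmatched names on either
    side (candidates for the renamed/repartitioned sections of Table 2).
    """
    changes = {}
    for name, before in pre.items():
        if name in post:
            changes[name] = round(100.0 * (post[name] - before) / before, 1)
    removed = sorted(set(pre) - set(post))   # pretest-only modules
    added = sorted(set(post) - set(pre))     # post-test-only modules
    return changes, removed, added

pre  = {"Function-A": 93.83, "Function-E": 86.24, "Function-L": 80.68}
post = {"Function-A": 93.83, "Function-E": 89.00, "Function-H'": 83.05}
print(pre_post_report(pre, post))
# -> ({'Function-A': 0.0, 'Function-E': 3.2}, ['Function-L'], ["Function-H'"])
```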

Authorized licensed use limited to: CALIF STATE UNIV NORTHRIDGE. Downloaded on April 26, 2009 at 01:38 from IEEE Xplore. Restrictions apply. 2oo f Highest MI 250 System A - - 150- - System 0 - 8 150 g 100- - 3 50- c g 0- Eo li! -50- -1 00 Eli:: 2 -100-150-50 2Lowest MI -+

The last two sections contain the most extensive changes to the subsystem. Components in these two sections represent a large burden to the maintainer, essentially representing a repartition of the problem. This is evidenced by the renaming of components, lower HPMAS metric values, and unmatchable pre- and post-test components. The maintenance engineer renamed all of the components in section 3 (presumably because he thought the original names did not adequately describe them) and substantively changed their functionality. Section 4 contains old components that could not be matched to components in the new system. They represent the largest burden to the maintenance effort because (1) the new components are untested, (2) the structure of the system has changed, requiring all documentation and diagrams for this system to be updated, and (3) all maintainers who were familiar with the pretest system are unfamiliar with the post-test system.

Using polynomials to rank-order module maintainability. To detect differences in subsystem maintainability, the four-metric polynomial was applied to a large third-party software application sold to HP. The system consists of 236,000 lines of C source code written for a Unix platform. The software complexity metrics were calculated on a file-by-file basis, and a maintainability index was calculated for each file.

The file-by-file analysis of the 714 files constituting the software system is shown in Figure 1. This histogram shows the maintainability (polynomial) index for each file, ordered from highest to lowest. The index for each file is represented by the top of each vertical bar; for negative indices, the value is represented by the bottom of the bar. The maintainability analysis for this system showed that the file maintainability scores (or indices) range from a high of 183 to a low of -91.

All components above the 85 maintainability index are highly maintainable, components between 85 and 65 are moderately maintainable, and components below 65 are "difficult to maintain." The dotted line indicates the quality cutoff established by Hewlett-Packard at index level 65.[10] Although these three quality categories are used by HP, they represent only a good "rule of thumb." The figure shows that 364 files, or roughly 50 percent of the system, fall below the quality-cutoff index, strongly suggesting that this system is difficult to modify and maintain. Prior to our analysis, the HP maintenance engineers had stated that the system was very difficult to maintain and modify. Further analysis proved that change-prone and defect-prone subsystem components (files) could be targeted using the ranked order of the maintainability indices.

In a subsequent study, a similar analysis was conducted on another third-party subsystem and compared against a maintainability index profile for a proprietary HP system (an example is shown in the next subsection). Based on that comparison, HP decided to purchase the third-party software.
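Given per-file indices, the rank ordering and the three quality bands reduce to a sort and two comparisons. A sketch using the 65/85 thresholds from the text (the function name and return shape are ours):

```python
def rank_and_classify(file_indices, cutoff=65.0, high=85.0):
    """Rank files by maintainability index and bucket them into the
    three HP quality categories described in the text.

    file_indices -- dict mapping file name to its polynomial index.
    """
    ranked = sorted(file_indices.items(), key=lambda kv: kv[1], reverse=True)
    bands = {"high": [], "moderate": [], "difficult": []}
    for name, mi in ranked:
        if mi >= high:
            bands["high"].append(name)
        elif mi >= cutoff:
            bands["moderate"].append(name)
        else:
            bands["difficult"].append(name)   # below the HP quality cutoff
    return ranked, bands

# The tail of `ranked` (everything in bands["difficult"]) is where
# change-prone and defect-prone files are most likely to be found.
```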
Using polynomials to compare software systems. The polynomial models can also be used to compare whole software systems. We analyzed two software systems that were similar in size, number of modules, platform, and language (see Table 3). The first system, A, is a third-party acquisition that had been difficult to maintain. (Again, the names of the two systems have been changed to protect proprietary information.) The second system, B, had been cited in internal Hewlett-Packard documentation as an excellent example of state-of-the-art software development. The four-metric polynomial model was used to compare the two systems to see the differences in their maintainability profiles. HP maintenance engineers, already experienced with the systems, were asked to comment on the maintainability of each system.

Table 3. A polynomial comparison of two systems corroborated an informal evaluation by engineers.

                                   A          B
HP evaluation                      Low        High
Platform                           Unix       Unix
Language                           C          C
Total LOC                          236,275    243,273
Number of modules                  3,176      3,097
Overall maintainability index      89         123

The results of the polynomial model shown in Table 3 corroborate the engineers' informal evaluation of the two software systems. The A system yielded a maintainability index of 89; while clearly above our acceptability criteria, it is considerably lower than the 123 maintainability index calculated for system B. This corresponds to the mediocre evaluation A received from the Hewlett-Packard engineers and the high praise B received from the engineers working on that system.


Authorized licensed use limited to: CALIF STATE UNIV NORTHRIDGE. Downloaded on April 26, 2009 at 01:38 from IEEE Xplore. Restrictions apply. Med/um MI (34.4%) o date we have conducted an au- 4. E.Yourdon, The Rise and Fall of the Amer- System A tomated software maintainabil- ican Programmer, Yourdon Press Com- T ity analysis on 11 software sys- puting Series, Trenton, N.J., 1992. Low MI tems. In each case, the results from our 5. J. Munson and T. Khoshgoftaar, “The De- analysis corresponded to the mainte- tection of Fault-Prone Programs,” ZEEE nance engineers’ “intuition” about the Trans. Software Eng., Vol. 18, No. 5, May maintainability of the (sub)system com- 1992, pp. 423-433. ponents. But in every case, the auto- 6. P. Oman and J. Hagemeister, “Metrics for mated analysis provided additional data Assessing Software System Maintainabil- that was useful in supporting or providing ity,” Proc. Conf Software Maintenance, CS System 13 (96.6%) credence for the experts’ opinions. IEEE Press, Los Alamitos, Calif., Or- Our analyses have assisted in buy-ver- der No. 2980-02T, 1992, pp. 337-344. Low MI: x e 65 sus-build decisions, targeting subcompo- Med MI: 65 5 xe 85 7. F. Zhuo et al., “Constructing and Testing High MI: 85 5 x nents for perfective maintenance, con- Software Maintainability Assessment trolling software quality and entropy over Models,” Proc. First Int’l Software Metrics several versions of the same software, Symp., IEEE CS Press, Los Alamitos, Figure 3. Comparison of two systems, identifying change-prone subcompo- Calif., Order No. 3740-02T, 1993, pp. 61-70. with high-, medium-, and low-mainte- nents, and assessing the effects of reengi- nance lines of code expressed in per- neering efforts. 8. P. Oman and J. Hagemeister, “Construc- centages. Software maintainability is going to be tion and Testing of Polynomials Predict- a considerable challenge for many years ing Software Maintainability,”J. of Sys- tems and Software, Vol. 24, No. 3, Mar. to come. The systems being maintained 1994, pp. 251-266. are becoming increasingly complex, and a granular analysis by calculating the poly- growing proportion of software develop- 9. Software Maintainability - Evaluation nomial on a module-by-module basis. ment staff is participating in the mainte- Guide, AFOTEC Pamphlet 800-2 (up- Figure 2 shows a plot of the ordered re- dated), HQ Air Force Operational Test nance of industrial software systems. Our and Evaluation Center, Kirkland Air sults. The B system (the thick line) con- results indicate that automated maintain- Force Base, N.M.,Vol. 3,1989. sistently scored higher than the A sys- ability analysis can be conducted at the tem for all but one module. The component level, the subsystem level, and 10. D. Coleman, “Assessing Maintainability,” significant gap between the two plots ac- the whole system level to evaluate and Proc. 1992 Software Eng. Productivity Conf:,Hewlett-Packard, Palo Alto, Calif., centuates the fact that the A system is compare software. By examining indus- 1992, pp. 525-532. less maintainable. trial systems at different levels, a wealth of Figure 3 contains two pie charts show- information about a system’s maintain- ing the distribution of lines of code in the ability can be obtained. Although these three maintainability classifications models are not perfect, they demonstrate Don Coleman is a project manager with (high, medium, and low). The upper pie the utility of such models. The point is Hewlett-Packard Corporate Engineering. 
He chart, representing the A system, illus- that a good model can help maintainers works in the area of software maintainability trates the nearly equal distribution of guide their efforts and provide them with assessment and defect analysis. code into the three classifications. The much needed feedback. Before develop- Dan Ash is a firmware engineer for Hewlett- lower pie chart, representing the B sys- ers can claim that they are building main- Packard Boise Printer Division. He special- tem, shows that a significant portion of tainable systems, there must be some way izes in font technology and embedded systems. this system falls in the high maintainabil- to measure maintainability. W ity classification. The B system contains Bruce Lowther is a software engineer at only 15 components, representing 2.8 Micron Semiconductor. He works in object- oriented development, focusing on software percent of the lines of code, that fall be- reusability and software quality. He is a mem- low the quality cutoff. The A system, on References ber of the IEEE Computer Society. the other hand, contains 228 components, representing 33.4 percent of the lines of 1. F.P. Brooks, The Mythical Man-Month: Paul Oman is an associate professor of com- Essays on Software Engineering, Addi- puter science at the University of Idaho where code, that fall below the quality cutoff. son-Wesley, Reading, Mass., 1982. he directs the Software Engineering Test Lab. Hence, using lines of code to compare the He is a member of the IEEE Computer Society. two systems reveals that although their 2. G. Parikh and N. Zvegintzov, Tutorial on overall maintainability index is adequate, Software Maintenance, IEEE CS Press, the B system is likely to be much easier to Los Alamitos, Calif., Order No. 453,1983. system. This result Readers can contact the authors through maintain than the A 3. T. Corbi, “Program Understanding: Chal- Paul Oman, Software Engineering Test Lab, corresponds to the Hewlett-Packard lenge for the 1990s,”IBM Systems J., Vol. University of Idaho, Moscow, Idaho 83843, evaluations. 28, NO.2,1989, pp. 294-306. e-mail [email protected].

