AN APPROACH IN SOFTWARE SIZING AND ITS ESTIMATION TOOL

Weider D. Yu Chi Ho Yiu*

Computer Engineering Department, San Jose State University, San Jose, CA 95192-0180, USA Email: [email protected]

Abstract—Measurement has been crucial to the success of software development. Metrics are widely used as standards for measuring software quality and as guidelines for software development. Companies require their employees to follow a set of standards in order to maintain software reliability, quality, completeness, accuracy, portability, consistency and testability. In this paper, sample metrics data collected from a large software and hardware company are assessed, and the resulting metrics are analyzed in order to derive a set of statistical methods that provide reasonable guidelines for estimating the expected lines of code (ELOC) from the expected number of modules.

Index Terms—Software Metrics, Quality, Lines of Code, Sizing Tool.

I. INTRODUCTION

One of the challenges for managers who lead teams building quality software is managing the time and resources available for development. Resource estimation is always a challenging question for managers. By estimating the resources needed to develop a component, a manager gains better control over team resources and time, so that overall progress stays under control. In fact, many projects overrun their expected budget and schedule.

As soon as the requirements for a project are defined, estimating the size of the project is very important, since it helps ensure a high-quality outcome, gives managers better control of the available resources and makes it easier for engineers to maintain the overall system. For measuring the size of a project, lines of code (LOC) is a metric widely known in the industry.

To study LOC, its characteristics are listed below, and graphs are presented to demonstrate the relationship of LOC with other software metrics [4], [6]. The analysis that follows is based on software metrics data collected from several components written in the C++ programming language [9].

Characteristics of LOC (lines of code):

• Objectives: Measuring the number of non-blank, non-comment lines of source code counted by the analyzer.
• Input/Output:
  ♦ Input: Source code.
  ♦ Output: Number of lines of code.
• Output Usage:
  ♦ Intuitive guide to the scale of the project.
  ♦ Predictor of inadequate decomposition per module.
• Relationship to Quality Model Attributes:
  ♦ Accuracy. (Non-comment, non-blank lines of code.)
  ♦ Structuredness. (LOC calculated on a per-module basis.)
  ♦ Structuredness. (LOC calculated on a per-function basis.)
• Thresholds: 24 lines for methods [1].
• Suggested Guideline: Review code for classes with large methods.

Figure 1. Lines of Code vs. Number of Modules. (x-axis: Number of Modules, 0 to 1800; y-axis: Lines of Code, 0 to 350000.)
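The LOC objective listed above (counting non-blank, non-comment lines) can be illustrated with a minimal sketch. This is not the analyzer used in the paper: the function name `count_loc` is hypothetical, only C/C++-style `//` and `/* ... */` comments are handled, and comment markers inside string literals are not recognised.

```python
def count_loc(source: str) -> int:
    """Count lines that still contain code after C/C++ comments are removed.

    Simplifying assumption: comment markers that appear inside string or
    character literals are treated as real comment markers.
    """
    count = 0
    in_block = False  # True while inside a /* ... */ comment
    for line in source.splitlines():
        code = []  # characters of this line that lie outside comments
        i = 0
        while i < len(line):
            if in_block:
                end = line.find("*/", i)
                if end == -1:
                    i = len(line)  # comment continues on the next line
                else:
                    in_block = False
                    i = end + 2
            elif line.startswith("//", i):
                break  # the rest of the line is a comment
            elif line.startswith("/*", i):
                in_block = True
                i += 2
            else:
                code.append(line[i])
                i += 1
        if "".join(code).strip():
            count += 1
    return count


sample = """\
// add two integers
int add(int a, int b) {

    /* block
       comment */
    return a + b;  // trailing comment
}
"""
print(count_loc(sample))  # 3
```

Only the function signature, the `return` statement and the closing brace are counted in the sample; the blank line and the comment-only lines are skipped.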

*Chi Ho Yiu graduated from San Jose State University and currently works at IBM Santa Teresa Laboratories, San Jose, California.


Figure 2. Lines of Code vs. McCabe Complexity [8]. (x-axis: McCabe Complexity, 0 to 70000; y-axis: Lines of Code, 0 to 350000.)

Figure 3. McCabe Complexity vs. Number of Modules. (x-axis: Number of Modules, 0 to 1800; y-axis: McCabe Complexity, -10000 to 70000.)

Figure 1 shows that the growth in the number of modules (NOM) does not correspond to the growth in LOC: the growth rate of LOC varies across different ranges of NOM. An increase in the number of modules does not guarantee a corresponding increase in lines of code. On the other hand, Figure 2 shows that the growth rate of LOC against McCabe Complexity (MVG) is more stable: an increase in McCabe Complexity leads to an increase in lines of code.

From a management point of view, Figure 1 and Figure 2 indicate that the system is inconsistent in terms of LOC. An increase in LOC should result in an increase in both MVG and NOM. However, Figure 1 shows that an increase in LOC does not result in a corresponding increase in NOM. Figure 3 further indicates that an increase in NOM does not result in a corresponding increase in MVG.

In this paper, several guidelines are established to calculate the expected lines of code for each component in a software system. The open source metrics measurement tool CCCC is used to calculate the necessary metrics attributes, and source data from various applications created by different companies are gathered for this research.

II. CALCULATING EXPECTED LINES OF CODE I

Before the expected lines of code (ELOC) are calculated, each component is classified into a category, because every component has different characteristics. Calculating the ELOC of components without classifying them would yield inaccurate results. Figure 1 and Figure 2 together demonstrate the relationship among LOC, NOM and MVG. A close comparison of the two figures indicates that LOC vs. MVG provides the more stable relationship, so this relationship can be used to organize components into categories using the following equation:

LPM = LOC / MVG … (1)

In addition, because the calculation of ELOC involves the NOM attribute, it is useful to calculate the LOC per module for each component: some components tend to have a high LOC per module and others a low one, and this inconsistency may cause the ELOC of each component to vary. The LOC per module is therefore a good indicator for further classifying components into categories.

LPC = LOC / NOM … (2)

By applying equation (1), the LOC/MVG ratio can be calculated for each component, and the result can be used to organize the components into categories.

In order to calculate ELOC, a range of expected lines of code per module is defined for different difficulty levels. Applying equations (1) and (2) to the metrics data collected from various sources generates a derived set of metrics data. Looking at the LOC/NOM data, it is easy to observe that as the value of LPM increases, the value of LPC decreases. Therefore, a conclusion can be drawn that if a component is not complicated and developers are familiar with the logic they need to implement, they tend to write most of the code in a single module. In contrast, if a component is complicated and developers are not familiar with the logic they need to implement, they tend to break the component into several modules so that each module contains less code and is easier to implement.

Based on this conclusion, ranges are defined that group components into three major categories based on their LPC: Easy, Average and Complicated. Each category is further divided into three subcategories, giving nine levels: Easiest, Easy, Less Easy, Below Average, Average, Over Average, Less Complicated, Complicated and Very Complicated. Moreover, based on empirical data computed by Professor Weider Yu of San Jose State University through statistical processing of various source data, a range of lines of code per module is defined for the various difficulty levels. Using these ranges, we can calculate a range of new ELOC values for each component at the different difficulty levels.

LOC Range (LOCR)              LOC per Module
1. (E1) Easiest               200
2. (E2) Easy                  185
3. (E3) Less Easy             170
4. (A1) Below Average         155
5. (A2) Average               140
6. (A3) Over Average          125
7. (C1) Less Complicated      110
8. (C2) Complicated            95
9. (C3) Very Complicated       80

To estimate the lines of code of a component, the number of modules in the component is needed. Once the number of modules is known, the estimated lines of code can be computed by multiplying the number of modules by each of the LOC ranges defined above, as in equation (3). The accuracy of the ELOC compared with the actual lines of code (ALOC) of the component is calculated using equation (4); a positive value means that the ELOC overestimates the ALOC, and a negative value means that it underestimates it.

ELOC = LOCR × NOM … (3)

Accuracy (%) = ((ELOC - ALOC) / ALOC) × 100 … (4)

III. EXAMPLES OF CALCULATING ELOC

Each component's estimated lines of code are calculated at nine difficulty levels: Easiest, Easy, Less Easy, Below Average, Average, Over Average, Less Complicated, Complicated and Very Complicated.

1. LIB 10:

LIB 10 has a total of 74 modules. The ELOCs of LIB 10 are calculated using formula (3), and the results are displayed in Figure 4. The accuracy of each ELOC compared with the actual LOC is calculated using formula (4), and the results are displayed in Figure 5. The following table lists the calculated ELOCs and their accuracy for LIB 10:

LOC Range    ELOC                 Accuracy
E1           200 × 74 = 14800      25.6%
E2           13690                 16.2%
E3           12580                  6.8%
A1           11470                 -2.6%
A2           10360                -12.1%
A3            9250                -21.5%
C1            8140                -30.9%
C2            7030                -40.3%
C3            5920                -49.8%

2. LIB 12:

LIB 12 has a total of 80 modules. The ELOCs of LIB 12 are calculated using formula (3), and the results are displayed in Figure 6. The accuracy of each ELOC compared with the actual LOC is calculated using formula (4), and the results are displayed in Figure 7. The following table lists the calculated ELOCs and their accuracy for LIB 12:

LOC Range    ELOC                 Accuracy
E1           200 × 80 = 16000      44.8%
E2           14800                 34.6%
E3           13600                 23.7%
A1           12400                 12.8%
A2           11200                  1.9%
A3           10000                 -9.0%
C1            8800                -19.9%
C2            7600                -30.9%
C3            6400                -41.8%

3. LIB 8:

LIB 8 has a total of 66 modules. The ELOCs of LIB 8 are calculated using formula (3), and the results are displayed in Figure 8. The accuracy of each ELOC compared with the actual LOC is calculated using formula (4), and the results are displayed in Figure 9. The following table lists the calculated ELOCs and their accuracy for LIB 8:

LOC Range    ELOC                 Accuracy
E1           200 × 66 = 13200      71.4%
E2           12210                 58.5%
E3           11220                 45.7%
A1           10230                 32.8%
A2            9240                 20.0%
A3            8250                  7.1%
C1            7260                 -5.7%
C2            6270                -18.6%
C3            5280                -31.5%
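Equations (1) to (3) and the accuracy figures can be sketched in a few lines of Python. The function names below are hypothetical, and the actual LOC passed to `deviation_pct` is an illustrative number rather than a value from the paper. The percentages tabulated above are consistent with the signed deviation (ELOC - ALOC) / ALOC × 100, which is what `deviation_pct` computes.

```python
# LOC per module for the nine difficulty levels, as listed in Section II.
LOC_RANGES = {
    "E1": 200, "E2": 185, "E3": 170,
    "A1": 155, "A2": 140, "A3": 125,
    "C1": 110, "C2": 95, "C3": 80,
}


def lpm(loc: int, mvg: int) -> float:
    """Equation (1): lines of code per unit of McCabe complexity."""
    return loc / mvg


def lpc(loc: int, nom: int) -> float:
    """Equation (2): lines of code per module."""
    return loc / nom


def eloc_table(nom: int) -> dict[str, int]:
    """Equation (3): ELOC = LOCR x NOM for every difficulty level."""
    return {level: locr * nom for level, locr in LOC_RANGES.items()}


def deviation_pct(eloc: float, aloc: float) -> float:
    """Signed deviation of ELOC from the actual LOC, in percent."""
    return (eloc - aloc) / aloc * 100


lib10 = eloc_table(74)  # LIB 10 has 74 modules
print(lib10["E1"], lib10["C3"])  # 14800 5920

# Hypothetical actual LOC, chosen only to keep the arithmetic easy to follow.
print(round(deviation_pct(11000, 10000), 1))  # 10.0
```

Calling `eloc_table` with 80 or 66 modules reproduces the ELOC columns for LIB 12 and LIB 8 in the same way.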


IV. VALIDATION OF EXPECTED LINES OF CODE II

Figure 4. Lib 10 Calculated ELOC. (Chart title: LIB 10; y-axis: LOC; bars: Actual, Very Comp., Comp., Less Comp., Over Avg., Average, Below Average, Less Easy, Easy, Easiest.)

Figure 5. Lib 10 Accuracy of ELOC. (Chart title: Accuracy of LIB 10; y-axis: Accuracy %; values from Easiest to Very Comp.: 25.63, 16.20, 6.78, -2.64, -12.06, -21.48, -30.91, -40.33, -49.75.)

Figure 6. Lib 12 Calculated ELOC. (Chart title: LIB 12; y-axis: LOC; bars: Actual, Very Comp., Comp., Less Comp., Over Avg., Average, Below Average, Less Easy, Easy, Easiest.)

Figure 7. Lib 12 Accuracy of ELOC. (Chart title: Accuracy of LIB 12; y-axis: Accuracy %; values from Easiest to Very Comp.: 44.83, 34.64, 23.73, 12.81, 1.89, -9.02, -19.94, -30.86, -41.78.)

Figure 8. Lib 8 Calculated ELOC. (Chart title: LIB 8; y-axis: LOC; bars: Actual, Very Comp., Comp., Less Comp., Over Avg., Average, Below Average, Less Easy, Easy, Easiest.)

Figure 9. Lib 8 Accuracy of ELOC. (Chart title: Accuracy of LIB 8; y-axis: Accuracy %; values from Easiest to Very Comp.: 71.38, 58.53, 45.68, 32.82, 19.97, 7.12, -5.74, -18.59, -31.45.)

Figures 4 through 9 show the estimated LOC and the accuracy for sample components drawn from three different categories (Easy, Average and Complicated) based on the LPM range.

In Figure 4, the actual LOC of Lib 10 is closest to the estimated lines of code for the “Below Average” category. The defined LPM range classifies Lib 10 in the “Easy and Average” category, and the actual LOC of Lib 10 also falls into that category. According to Figure 5, the ELOC calculated for the “Below Average” category has the lowest error percentage, an underestimate of 2.64%.

In Figure 6, the actual LOC of Lib 12 falls into the “Average” category, and, according to the LPM range, Lib 12 falls into the “Easy and Average” category, so the estimated LOC matches the LPM range with respect to the difficulty of the component. According to Figure 7, the ELOC calculated for the “Average” category has the lowest error percentage, an overestimate of 1.89%.

In Figure 8, the actual LOC of Lib 8 falls into the “Less Complicated” category, and Lib 8 falls into the “Complicated” category according to the LPM range. Therefore, the ELOC matches the LPM range in terms of the difficulty of the component. According to Figure 9, the ELOC calculated for the “Less Complicated” category has the lowest error percentage, an underestimate of 5.74%.

As these results show, estimating LOC from the defined LOC ranges is close when the matching difficulty level is used: the best estimate is within about 6% of the actual LOC for all three components. The calculation is done individually for each component, and the difficulty category into which the actual LOC falls matches the defined LPM ranges.

Figure 11. Collect Metrics 2.
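The selection made in this section, picking the difficulty level whose ELOC lies closest to the actual LOC, can be sketched as follows. The function name `closest_level` is hypothetical; the percentages are the LIB 10 values reported in Figure 5.

```python
# Deviations (%) reported for LIB 10, from Easiest (E1) to Very Complicated (C3).
LIB10_DEVIATION = {
    "E1": 25.63, "E2": 16.20, "E3": 6.78,
    "A1": -2.64, "A2": -12.06, "A3": -21.48,
    "C1": -30.91, "C2": -40.33, "C3": -49.75,
}


def closest_level(deviation_by_level: dict[str, float]) -> tuple[str, float]:
    """Return the difficulty level whose ELOC is nearest the actual LOC."""
    level = min(deviation_by_level, key=lambda k: abs(deviation_by_level[k]))
    return level, deviation_by_level[level]


print(closest_level(LIB10_DEVIATION))  # ('A1', -2.64)
```

The result, A1 (Below Average) with an underestimate of 2.64%, agrees with the reading of Figure 5 given above.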

V. SOFTWARE VALIDATION TOOL-DELTA SIZER

A software tool named “Delta Sizer” was developed for validating the calculation, scanning source code to gather software metrics [2], changing the defined lines of code ranges and estimating lines of code from the number of modules.

Collect Metrics: The Collect Metrics scenario describes how users collect software metrics data from source files written in C, C++ or Java. The user enters the path to the source file and presses the Run button to execute the backend utility program that collects the metrics data. When the process finishes, a summary is displayed in chart and table formats. Screen shots of the scenario are displayed in Figures 10 and 11.

Figure 10. Collect Metrics 1.

LOC Estimate: The LOC Estimate scenario describes how users estimate the expected lines of code from the expected number of modules. The user clicks the ‘LOC Estimate’ link to navigate to the page and enters the expected number of modules for the component. Once the user presses the Estimate button, the expected lines of code are displayed in a table for the various difficulty levels. A screen shot of the scenario is displayed in Figure 12.

Figure 12. LOC Estimate.

Validation: The Validation scenario describes how users validate the expected lines of code for components. The user clicks the ‘Chart’ link to navigate to the page and selects the component whose expected lines of code are to be validated. Once the user presses the ‘Draw Graph’ button, the corresponding bar chart for the component is displayed. A screen shot of the scenario is displayed in Figure 13.


Figure 13. Validate.

Define LOC Range: The Define LOC Range scenario describes how users change the defined lines of code ranges used to estimate the expected lines of code. The user clicks the ‘Properties’ link to navigate to the page and changes the defined lines of code values. Once the user clicks the Edit button, the values in the properties database are updated. At the same time, the expected lines of code of each existing component are updated as well. A screen shot of the scenario is displayed in Figure 14.

Figure 14. Define Lines of Code Range.

VI. CONCLUSION

This project focuses on learning what kinds of data are important for managing software projects, how different software code metrics differ from one another [5], how to collect the data using various open source programming languages, how to present the data meaningfully, and how these data help managers manage high-quality software projects. In conclusion, managers are able to estimate the expected lines of code for a component from the estimated number of modules. As a result, managers have better control of their resources once they know the expected LOC for the project, since they can decide efficiently how many engineers to allocate to each component and can better predict when the components will be done. Moreover, engineers will have better control of the quality of the overall system. As a result, it will be easier for engineers to maintain components, and the number of bugs can be minimized and kept under control.

REFERENCES

[1] Linda Rosenberg, Ted Hammer & Jack Shaw. Software Metrics and Reliability. Retrieved Oct 27, 2006, from http://satc.gsfc.nasa.gov/support/ISSRE_NOV98/software_metrics_and_reliability.html
[2] Wikipedia. (2006, Oct). Retrieved Oct 27, 2006, from http://en.wikipedia.org/wiki/Software_metric
[3] Wikipedia. Software Quality. (2007, June). Retrieved June 16, 2006, from http://en.wikipedia.org/wiki/Software_quality
[4] L. Rosenberg & L. Hyatt. Developing an Effective Metrics Program. (1996, Mar). Retrieved Oct 28, 2006, from http://satc.gsfc.nasa.gov/support/ESA_MAR96/metrics/esa_met.html
[5] Bill Venners. How Useful Are Code Metrics? (2006, Apr). Retrieved Oct 30, 2006, from http://www.artima.com/weblogs/viewpost.jsp?thread=157839
[6] Karl E. Wiegers. A Software Metrics Primer. (1999, July). Retrieved Oct 30, 2006, from http://www.processimpact.com/articles/metrics_primer.html
[7] Mark Lorenz & Jeff Kidd. Object-Oriented Software Metrics. Prentice-Hall, Inc., 1994.
[8] Stephen H. Kan. Metrics and Models in Software Quality Engineering. Addison Wesley Professional, 2002.
[9] Tim Littlefair. C and C++ Code Counter. 2001.
