Components of a Modern Quality Approach To Development

Dolores Zage, Wayne Zage Ball State University

Sponsor: iconectiv

Final Report October 2015


Table of Contents

Section 1: Process and Quality
  1.1 New Versus Traditional Process Overview
  1.2 DevOps
    1.2.1 Distinguishing Characteristics of High-Performing DevOps Development Cultures
    1.2.2 Zeroturnaround Survey
      1.2.2.1 Survey: Quality of Development
      1.2.2.2 Survey Results: Predictability
      1.2.2.3 Survey Results: Tool Type Usage
    1.2.3 DevOps Process Maturity Model
    1.2.4 DevOps Workflows
  1.3 Code Reviews
    1.3.1 Using Checklists
      1.3.1.1 Sample Checklist Items
    1.3.2 Checklists for Security
    1.3.3 Monitoring the Code Review Process
    1.3.4 Evolvability Defects
    1.3.5 Other Guidelines
    1.3.6 High Risk Code and High Risk Changes
  1.4 Testing
    1.4.1 Number of Test Cases
  1.5 Agile Process and QA
    1.5.1 Agile Quality Assessment (QA) on Scrum
  1.6 Product Backlog

Section 2: Software Product Quality and its Measurement
  2.1 Goal Question Metric Model
  2.2 Generic Models
  2.3 Comparing Traditional and Agile Metrics
  2.4 NESMA Agile Metrics
    2.4.1 Metrics for Planning and Forecasting
    2.4.2 Dividing the Work into Manageable Pieces
    2.4.3 Metrics for Monitoring and Control
    2.4.4 Metrics for Improvement (Product Quality and Process Improvement)
  2.5 Another State-of-the-Practice Survey on Agile Metrics
  2.6 Metric Trends are Important
  2.7 Defect Measurement
  2.8 Defects and Complexity Linked
  2.9 Performance, a Factor in Quality
  2.10 Security, a Factor in Quality
    2.10.1 Security Standards
    2.10.2 Shift of Security Responsibilities within Development
    2.10.3 List of Current Practices
    2.10.4 Risk of Third Party Applications
    2.10.5 Rate of Repairs
    2.10.6 Other Code Security Strategies
    2.10.7 Design Vulnerabilities

Section 3: Assessment of Development Methods and Project Data
  3.1 The Namcook Analytics Software Risk Master (SRM) tool
  3.2 Crosstalk Table
    3.2.1 Ranges of Software Development Quality
  3.3 Scoring Method of Methods, Tools and Practices in Software Development
  3.4 DevOps Self-Assessment by IBM
  3.5 Thirty Issues that have Stayed Constant for Thirty Years
  3.6 Quality and Defect Removal
    3.6.1 Error-Prone Modules (EPM)
    3.6.2 Inspection Metrics
    3.6.3 General Terms of Software Failure and Software Success

Section 4: Conclusions and Project Take-Aways
  4.1 Process
  4.2 Product Measurements

Acknowledgements

Appendix A – Namcook Analytics Estimation Report

Appendix B – Sample DevOps Self-Assessment

References


Section 1: Software Development Process and Quality

1.1 New Versus Traditional Process Overview

Enterprises first used Agile development techniques for pilot projects developed by small teams. Having realized the benefits of shorter delivery and release cycles and responsiveness to change while still delivering quality software, enterprises searched for ways to obtain similar benefits in their larger development efforts by scaling Agile, and many frameworks and methods were developed to satisfy this need. The Scaled Agile Framework (SAFe) is one of the most widely implemented scaled Agile frameworks. Most of its pieces are borrowed: existing Agile methods packaged and organized in a different way to accommodate the larger scale. Integrated within SAFe and other enterprise-scale methodologies are Agile methods such as Scrum, along with other techniques that foster the delivery of software.

Why evaluate the process? Developing good software is difficult, and a good process or method can make this difficult task a little easier and perhaps more predictable. In the past, researchers performed an analysis of software standards and determined that standards focus heavily on processes rather than products [PFLE]. They characterized software standards as prescribing “the recipe, the utensils, and the cooking techniques, and then assume that the pudding will taste good.” This corresponds with Deming, who argued that “the quality of a product is directly related to the quality of the process used to create it” [DEMI]. Watts Humphrey, the creator of the CMM, believed that high quality software can only be produced by a high quality process. Most would agree that the probability of producing high quality software is greater if the process is also of high quality. All recognize that possessing a good process in isolation is not enough: the process has to be filled with skilled, motivated people.

Can a process promote motivation? Agile methods are people-oriented rather than process-oriented. Agile methods assert that no process will ever make up for the skill of the development team; therefore, the role of a process is to support the development team in their work. Another movement, DevOps, encourages collaboration through integrating development and operations. Figure 1 compares the old and new way of delivering software. The catalyst for many of the enumerated changes is teamwork and cooperation: everyone should participate, and all share in responsibility and accountability. In true Agile, teams have the autonomy to choose their development tools. In the new world, disconnects should be removed, and the development tools and processes should be chosen so that people can receive feedback quickly and make necessary changes. Successful integration of DevOps and Agile development will play a key role in the delivery of rapid, continuous, high quality software. Institutions that can accomplish this merger at the enterprise scale will outpace those struggling to adapt.

Figure 1: Old and New Way of Developing Software

For this reason, identifying the characteristics of high-performing Agile and DevOps cultures is important to assist in outlining a new transformational technology.


1.2 DevOps

DevOps is the fusion of “Development” and “Operations” functions to reduce risk, liability, and time-to-market while increasing operational awareness. It has been one of the largest movements in Information Technology over the past decade. DevOps evolved from many previous ideas in software development, such as automation tooling, culture shifts, and iterative development models like Agile. During this fusion, DevOps was not provided with a central set of operational guidelines. Once an organization decided to use DevOps for its software development, it had free rein in deciding how to implement it, which produced its own challenges. Even Agile, many of whose attributes were adopted by the DevOps movement, falls into the same predicament. In 2006, Gilb and Brody wrote an article pointing out the same lack of quantification in Agile methods and arguing that it is a major weakness [GILB]. There is insufficient focus on quantified performance levels, such as metrics evaluating required qualities, resource savings, and workload capacities of the developed software. The inability to measure the change to DevOps does not mean that DevOps does not prescribe monitoring and measuring. The purpose of monitoring and measuring within DevOps is to compare the current state of a project to the same project at some point in the recent past, providing an answer about current project progress. However, this quantification does not help to benchmark the overall DevOps implementation.

1.2.1 Distinguishing Characteristics of High-Performing DevOps Development Cultures

A possible description of a high-performing DevOps environment is that it produces good quality systems on time. It is important to identify the characteristics of such high-performing cultures so that these practices are emulated and metrics can be identified that quantify these typical characteristics and track successful trends. Below are seven key points of a high-performing DevOps culture [DYNA]:

1. Deploy daily – decouple code deployment from feature releases.
2. Handle all non-functional requirements (performance, security) at every stage.
3. Exploit automated testing to catch errors early and quickly – 82% of high-performance DevOps organizations use automation [PUPP].
4. Employ strict version control – version control in operations has the strongest correlation with high-performing DevOps organizations [PUPP].
5. Implement end-to-end performance monitoring and metrics.
6. Perform peer code review.
7. Allocate more cycle time to reduction of technical debt.

Other attributes of high-performing DevOps environments are that operations are involved early on in the development process so that plans for any necessary changes to the production environment can be formulated, and a continuous feedback loop between development, testing and operations is present. Just as development has its daily stand-up meeting, development, QA and operations should meet frequently to jointly analyze issues in production.

1.2.2 Zeroturnaround Survey

Many of the above trends are highlighted in a survey conducted by Zeroturnaround of 1,006 developers on the practices they use during software development. The goal of the survey was to confirm or refute the effectiveness of the best quality practices, examining methodologies, tools, company size and industry within the context of these practices [KABA]. The report noted that the survey respondents had a disproportionate bias towards Java and were mostly employed in Software/Technology companies. Zeroturnaround divided the respondents into three groups based on their responses: the top 10% were identified as rock stars, the middle 80% as average, and the bottom 10% as laggards. For this report, Zeroturnaround concentrated on two aspects of software development, namely the quality of the software and the predictability of delivery. The report results are summarized in sections 1.2.2.1-1.2.2.3.

1.2.2.1 Survey: Quality of Development

The quality of the software was determined by the frequency of critical or blocker bugs discovered post-release. Zeroturnaround decided that a good measure of software quality is to ask the respondents, “How often do you find critical or blocker bugs after release?” This simple question is an easy way to judge whether at least minimum requirements are met, and the response reflects the quality issue most likely to negatively impact the largest group of end users. The survey responses to this question were converted into percentages.

Do you find critical or blocker bugs after a release?
A. No - 100%
B. Almost never - 75%
C. Sometimes - 50%
D. Often - 25%
E. Always - 0%

The analysis of the answers to this question, listed below, indicates that released software has a 50% probability of containing a critical bug. Most respondents did admit to “sometimes” releasing software with bugs. On average, 58% of releases go to production without critical bugs. It also demonstrates a distinct difference between developer respondents. The laggards deliver bug-free software only 25% of the time while the rock star respondents deliver bug-free software 75% of the time.

Average: 58%
Median: 50%
Mode: 50%
Standard Deviation: 19%
Laggards (Bottom 10%): 25%
Rock stars (Top 10%): 75%

1.2.2.2 Survey Results: Predictability

Predictability of delivery for this report is determined by delays in releases, execution of planned requirements, and in-process changes, also known as scope creep. The expression

Predictability = (1 / (1 + % late)) x (% of plans delivered) x (1 - % scope creep)

converts predictability of delivery into a mathematical formula. The example below demonstrates its use.

Example: Mike’s team is pretty good at delivering on time; they release late only 10% of the time. On average, they get 75% of their development plans done, and they have been able to limit scope creep to just 10% as well. Based on that, we can calculate that Mike’s team releases software with 61% predictability:


(1 / (1 + 0.10)) x 0.75 x (1 - 0.10) = 0.61 = 61%
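To make the formula concrete, the short Java sketch below reproduces the calculation for the hypothetical figures of Mike’s team; the class and method names are illustrative only and not part of the survey report.

public class Predictability {

    // lateRate, planDeliveredRate and scopeCreepRate are fractions (0.10 = 10%).
    static double predictability(double lateRate, double planDeliveredRate, double scopeCreepRate) {
        return (1.0 / (1.0 + lateRate)) * planDeliveredRate * (1.0 - scopeCreepRate);
    }

    public static void main(String[] args) {
        // Mike's team: 10% late, 75% of plans delivered, 10% scope creep.
        double p = predictability(0.10, 0.75, 0.10);
        System.out.printf("Predictability = %.0f%%%n", p * 100); // prints "Predictability = 61%"
    }
}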

The 61% is not the true probability because it is not normalized to 100%. The authors could have normalized over delivery, but chose not to, as it did not affect trends and only made the number harder to interpret. It was suggested that change in scope should be included in the formula, but this definitely affects the ability to predict delivery. The authors tested this suggestion by adding change in scope to the formula; the outcome was that omitting an estimate for change in scope does not change any trends. These trends are represented with concrete numbers, so statistical analysis may impact the accuracy of absolute numbers, but not the relative trends. The findings and observations on the predictability of software releases are that companies can predict delivery within a 60% margin; considering just the rock stars, they can attain an 80% margin. When predictability was categorized by industry, there was no significant relationship. Predictability by company size increases slightly (3%) for larger companies. The authors theorize that this is due to a greater level of organizational requirements, so more non-development staff are available to coordinate projects and release plans as teams scale up in size.

Noting the obvious difference in the probability of bug-free delivery between the rock stars and laggards, it is important to enumerate the differences identified from the survey. Based on the responses of over 1,000 engineers, half of the respondents do not fix code quality issues identified through static analysis (Sonar or SonarQube), and some do not even monitor code quality. Those that use static analysis see a 10% increase in delivering bug-free software. Automated tests exhibited the largest overall improvements in both the predictability and quality of software deliveries. The laggards did no automated testing (0%), while the rock stars covered 75% of the functionality with automated tests. Quality also increases most when developers are testing the code. More than half of the respondents have less than 25% test coverage, and there is a significant increase in both predictability and quality as test coverage increases. Code reviews significantly impact the predictability of releases, but only moderately affect software quality. A plausible explanation is that developers are poor at spotting bugs in code, but good at spotting issues and code smells which impact future development and maintenance. The majority of problems found by reviewers are not functional mistakes, but what the researchers call evolvability defects (discussed further in section 1.3.4). Close communication, such as daily standups and team chat, seems to be the best way to communicate and increase the predictability of releases. Most teams work on technical debt at least sometimes, but the survey results indicate no significant increases in quality and predictability of software releases from doing so. However, a negative trend appears when no technical debt maintenance is done.

1.2.2.3 Survey Results: Tool Type Usage

The article also contains a segment reporting tool and technology usage/popularity combined with the corresponding increases or decreases in predictability and quality. Developers were asked about the tools they used, and Figure 2 presents a bar chart of their responses. The report analyzed which technologies and tool types influence the quality and predictability of releases. For quality, there were no significant trends when releases were compared across technologies and tool types; it appears that quality is affected by development practices, not development tools. For predictability, using version control and an IDE significantly improves the predictability of deliveries, and there is a reasonable increase in predictability for users of code quality analysis, issue tracker, profiler and IaaS solutions. Use of a text editor has little or no effect on predictability.


Figure 2: Tool Type Usage Results from Zeroturnaround 2013 Survey

What are our takeaways from this survey and report? First of all, it appears that the majority of the survey respondents are our peers, developing mainly in Java and categorizing their company type as Software/Technology; therefore, comparisons can be made. Automation of testing is a leading indicator of quality, reiterated by many reports [4, 5]. As test coverage increases, both predictability and quality increase, and automation can promote greater coverage. Code reviews increase predictability and can improve the quality of the structure of the code, which is part of refactoring best practice in Agile development. Code quality analysis and fixing quality problems is another practice that increases both quality and predictability. Quality was not significantly affected by the tool set, underscoring that quality is based mainly on practices, not tools. However, good tools can make a team more productive. They can also serve as focal points to enforce the practices that further improve the ability to predictably deliver quality software. Identifying and implementing best practices is one key to improving software development. However, metrics need to be chosen carefully to measure the improvement or lack thereof. The importance of automated testing and assessment of coverage has been outlined by numerous sources. All of these practices should be considered important in our SDL.

1.2.3 DevOps Process Maturity Model

The reason many organizations adopt DevOps is rapid delivery that maximizes outcomes. One of the key attributes and best practices of DevOps is the integration of quality assurance into the workflow. The earlier errors are caught and fixed, the less rework is required and the more quickly the team converges on a stable product. Figure 3 provides a DevOps Process Maturity Model with five stages, from Initial to Optimize, aligned with the CMMI Maturity Model. Assuming that our target is level 4, the major keys to achieving this level are quantification and control.


Figure 3: DevOps Process Maturity Model

To reach the quantify level, and eventually the optimize level, within the DevOps maturity model, each workflow should be analyzed to determine how errors are introduced and to identify quality assurance techniques that can be inserted at those checkpoints.

1.2.4 DevOps Workflows

The first workflow is documentation. A best practice is good documentation of the process and other components of development. Documentation should be readily available and up-to-date; documentation that is out of date or incorrect can be very detrimental. The process should be documented along with the configuration of all the systems, and the documentation and configuration files should be placed in a Software Configuration Management (SCM) system. A configuration management system (CMS) should be used to describe the families and groups of systems in the configuration. A benefit of a configuration management system is that it checks in periodically with a centralized catalog to ensure that each system continues to comply and run with the approved configuration. There are many instances where a change in the configuration has a devastating effect. As seen in Figure 4, a single configuration error can have far-reaching impact across IT and the business. In chaos theory, a butterfly flapping its wings in one part of the world can result in a hurricane in another. A configuration error may not be as devastating, but it has a large impact in terms of time, money and risk. Many CMSs have a method to test or validate a set of configurations before pushing them to a machine. For example, many CMSs (Puppet or BCFG2) have validation commands that can be executed as part of a process before continuing and installing the configuration. Another option is to invoke a canary system as the first line of defense for catching configuration errors [CHIL].

Figure 4: Configuration Errors Impact [VERI]


Another best practice is to use an SCM for all of the products. An SCM allows multiple developers to contribute to a project or set of files at once, merging their contributions without overwriting previous work. An SCM also allows the rollback of changes in the event of an error making its way into the repository. However, rollbacks should be avoided, and a code review can be inserted as a quality assurance step.

1.3 Code Reviews

Code reviews employ someone other than the developer who wrote the code to check the work. Studies have shown that quick, lightweight reviews found nearly as many bugs as more formal code reviews [IBM]. At shops like Microsoft and Google, developers don’t attend formal code review meetings. Instead, they take advantage of collaborative code review platforms like Gerrit (open source), CodeFlow (Microsoft), Collaborator (Smartbear), ReviewBoard (open source) or Crucible (Atlassian, usually bundled with Fisheye code browser), or use e-mail to request reviews asynchronously and to exchange information with reviewers. These tools support a pre-commit method of code review. The code review occurs before the code/configuration is committed to an SCM.

Reviewing code against coding standards (see Google’s Java coding guide, http://google-styleguide.googlecode.com/svn/trunk/javaguide.html) is an inefficient way for a developer to spend their valuable time. Every developer should use the same coding style templates in their IDEs and use a tool like Checkstyle to ensure that code is formatted consistently. Highly configurable, Checkstyle can be made to support almost any coding standard. Configuration files supporting the Oracle Code Conventions and Google Java Style are supplied at the Checkstyle download site. An example of a report that can be produced using Checkstyle and Maven can be seen at http://checkstyle.sourceforge.net/. Coding style checker tools free up reviewers to focus on the things that matter, such as assisting developers to write better code and create code that works correctly and is easy to maintain.

Additionally, the use of static analysis tools upfront will make reviews more efficient. Free tools such as FindBugs and PMD for Java can catch common coding bugs, inconsistencies, sloppy or messy code, and dead code before the code is submitted for review. The latest static analysis tools go far beyond this and are capable of finding serious errors in programs such as null-pointer dereferences, buffer overruns, race conditions, resource leaks, and other errors. Static analysis can also assist testing: if unreachable code or redundant conditions are brought to the attention of the tester early, the tester does not waste time in a futile attempt to achieve the impossible. Static analysis frees the reviewer from searching for micro-problems and bad practices, so they can concentrate on higher-level mistakes. Static analysis is only a tool to help with code review, not a substitute for it; static analysis tools cannot find functional correctness problems, design inconsistencies, or errors of omission, nor can they help you find a better or simpler way to solve a problem.
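As a hypothetical illustration (not drawn from the report), the Java fragment below shows the kind of micro-problems a static analyzer such as FindBugs or PMD would typically flag, a possible null dereference and a resource leak, followed by a corrected version, leaving the human reviewer free to focus on higher-level concerns.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class FirstLine {

    // Version a static analyzer would flag: readLine() may return null (possible
    // NullPointerException) and the reader is not closed on every path (resource leak).
    static int firstLineLengthFlagged(String path) throws IOException {
        BufferedReader reader = new BufferedReader(new FileReader(path));
        String line = reader.readLine();
        return line.length();          // NPE if the file is empty; reader never closed
    }

    // Corrected version: try-with-resources closes the reader and the null case is handled.
    static int firstLineLength(String path) throws IOException {
        try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
            String line = reader.readLine();
            return (line == null) ? 0 : line.length();
        }
    }
}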

The reviewer should be concentrating on:

Correctness
 Functional correctness: does the code do what it is supposed to do? The reviewer needs to know the problem area, the requirements and usually something about this part of the code to be effective at finding functional correctness issues.
 Coding errors: low-level coding mistakes like using <= instead of <, off-by-one errors, using the wrong variable (like mixing up lessee and lessor), copy-and-paste errors, leaving code in by accident.
 Design mistakes: errors of omission, incorrect assumptions, messing up architectural and design patterns such as model-view-controller (MVC).
 Security: properly enforcing security and privacy controls (authentication, access control, auditing, encryption).

Maintainability
 Clarity: class, method and variable naming, comments.
 Consistency: using common routines or language/library features instead of rolling your own, following established conventions and patterns.
 Organization: poor structure, duplicate or unused/dead code.
 Approach: areas where the reviewer can see a simpler, cleaner or more efficient implementation.

1.3.1 Using Checklists

A checklist is an important component of any review. Checklists are most effective at detecting omissions, which are typically the most difficult types of errors to find. A reviewer does not require a checklist to look for algorithm errors or sensible documentation; the difficult task is to notice when something is missing, and reviewers are likely to forget it as well. The longer a checklist becomes, the less attention each item receives [SMAR], so limit the checklist to about 20 items. In fact, the SEI performed a study indicating that a person makes about 15-20 common mistakes in coding [SMAR]. For example, the checklist can remind the reviewer to confirm that all errors are handled, that function arguments are tested for invalid values, and that unit tests have been created. Below is a sample review checklist from Smartbear (http://smartbear.com/SmartBear/media/pdfs/best-kept-secrets-of-peer-code-review.pdf).

1.3.1.1 Sample Checklist Items

1. Documentation: All subroutines are commented in clear language.
2. Documentation: Describe what happens with corner-case input.
3. Documentation: Complex algorithms are explained and justified.
4. Documentation: Code that depends on non-obvious behavior in external libraries is documented with reference to external documentation.
5. Documentation: Units of measurement are documented for numeric values.
6. Documentation: Incomplete code is indicated with appropriate distinctive markers (e.g. “TODO” or “FIXME”).
7. Documentation: User-facing documentation is updated (online help, contextual help, tool-tips, version history).
8. Testing: Unit tests are added for new code paths or behaviors.
9. Testing: Unit tests cover errors and invalid parameter cases.
10. Testing: Unit tests demonstrate the algorithm is performing as documented.
11. Testing: Possible null pointers always checked before use.
12. Testing: Array indexes checked to avoid out-of-bound errors.
13. Testing: Don’t write new code that is already implemented in an existing, tested API.
14. Testing: New code fixes/implements the issue in question.
15. Error Handling: Invalid parameter values are handled properly early in the subroutine.
16. Error Handling: Error values of null pointers from subroutine invocations are checked.
17. Error Handling: Error handlers clean up state and resources no matter where an error occurs.
18. Error Handling: Memory is released, resources are closed, and reference counters are managed under both error and no error conditions.


19. Thread Safety: Global variables are protected by locks or locking subroutines.
20. Thread Safety: Objects accessed by multiple threads are accessed only through a lock.
21. Thread Safety: Locks must be acquired and released in the right order to prevent deadlocks, even in error-handling code.
22. Performance: Objects are duplicated only when necessary.
23. Performance: No busy-wait loops instead of proper thread synchronization methods.
24. Performance: Memory usage is acceptable even with large inputs.
25. Performance: Optimization that makes code harder to read should only be implemented if a profiler or other tool has indicated that the routine stands to gain from optimization. These kinds of optimizations should be well-documented, and code that performs the same task simply should be preserved somewhere.

An effective method to maintain the checklist is to match defects found during review to the associated checklist item. Items that turn up many defects should be kept. Defects that are not associated with any checklist item should be scanned periodically; usually there are categorical trends in your defects, and each type of defect can be turned into a checklist item that would cause the reviewer to find it. Over time, the team will become used to the more common checklist items and will adopt programming habits that prevent some of them altogether. The list can be shortened by reviewing the “Top 5 Most Violated” checklist items every month to determine whether anything can be done to help developers avoid the problem. For example, if a common problem is “not all methods are fully documented,” a feature in the IDE can be enabled that requires developers to have at least some sort of documentation on every method.
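A minimal sketch (assuming a simple in-house log of review findings, not any particular review tool) of how findings can be tallied against checklist items to produce the monthly “Top 5 Most Violated” list described above:

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class ChecklistReport {

    // One review finding tagged with the checklist item it violated
    // (or "UNLISTED" when it matches no item, a hint the checklist may need a new entry).
    record Finding(String checklistItem, String description) {}

    // Returns the five checklist items with the most findings in the period.
    static List<String> topFiveViolated(List<Finding> findings) {
        Map<String, Long> counts = findings.stream()
                .collect(Collectors.groupingBy(Finding::checklistItem, Collectors.counting()));
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .limit(5)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}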

1.3.2 Checklists for Security

Coding checklists are not specifically devoted to security reviews. Agnitio is an open-source code review tool that guides a reviewer through a security review using a detailed code and design review checklist and records the results of each review, removing the inconsistent nature of manual security code review documentation. It assists developers and security professionals in conducting manual security code reviews in a consistent and repeatable way. Code reviews are important for finding security vulnerabilities and are often the only way to find them, short of exhaustive and expensive pen testing. This is why code reviews are a fundamental part of secure SDLCs like Microsoft’s SDL.

1.3.3 Monitoring the Code Review Process

The code review process should be monitored for defect removal. For example, how do code reviews compare to the other methods of defect removal practices in predicting how many hours are required to finish a project? The minimal list of raw numbers collected is lines of code including comments (LOC), inspection time, and defect count. LOC and inspection time are obvious. A defect in a code review is something a reviewer wants changed in the code. A tool-assisted review process should be able to collect these automatically without manual intervention. From these data, other analytical metrics can be calculated and, if necessary, classified into reviews from a development group, reviews of a certain author, reviews performed by a certain reviewer, or on a set of files. The calculated ratios are inspection rate, defect rate and defect density.

The inspection rate is the rate at which a certain amount of code is reviewed. The ratio is LOC divided by inspection hours. An expected value for a meticulous inspection would be 100-200 LOC/hour; a normal inspection might be 200-500; above 800-1000 LOC/hour is so fast it can be concluded the reviewer performed only a perfunctory job.

The defect rate is the rate defects are uncovered by the reviewers. The ratio is defect count divided by inspection hours. A typical value for source code would be 5-20 defects per hour depending on the review technique. For example, formal inspections with both private code-reading phases and inspection meetings will be on the slow end, whereas the lightweight approaches, especially those without scheduled inspection meetings, will trend toward the high end. The time spent uncovering the defects in review is counted in the metric and not the time taken to actually fix those defects.

The defect density is the number of defects found in a given amount of source code. The ratio is defect count divided by kLOC (thousand lines of code). The higher the defect density, the more defects are uncovered indicating that the reviews are effective. That is, a high defect density is more likely to mean the reviewers did a great job than it is to mean the underlying source code is extremely bad. It is impossible to provide an expected value for defect density due to various factors. For example, a mature, stable code base with tight development controls might have a defect density as low as 5 defects/kLOC, whereas new code written by junior developers in an uncontrolled environment except for a strict review process might uncover 100-200 defects/kLOC.
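The three ratios can be computed directly from the raw numbers a tool-assisted review collects; the small Java sketch below (with illustrative sample values, not data from the report) shows the calculations:

public class ReviewMetrics {

    // Raw numbers collected per review: lines of code inspected (including comments),
    // total inspection time in hours, and number of defects recorded by reviewers.
    static double inspectionRate(int loc, double inspectionHours) {
        return loc / inspectionHours;                 // LOC per hour (roughly 100-500 expected)
    }

    static double defectRate(int defectCount, double inspectionHours) {
        return defectCount / inspectionHours;         // defects found per hour (roughly 5-20)
    }

    static double defectDensity(int defectCount, int loc) {
        return defectCount / (loc / 1000.0);          // defects per kLOC
    }

    public static void main(String[] args) {
        // Example: 600 LOC reviewed in 3 hours with 18 defects recorded.
        System.out.printf("Inspection rate: %.0f LOC/hour%n", inspectionRate(600, 3.0));
        System.out.printf("Defect rate:     %.1f defects/hour%n", defectRate(18, 3.0));
        System.out.printf("Defect density:  %.0f defects/kLOC%n", defectDensity(18, 600));
    }
}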

1.3.4 Evolvability Defects

However, there is even more to reviews than finding bugs and security vulnerabilities. A 2009 study by Mantyla and Lassenius revealed that the majority of problems found by reviewers are not functional mistakes, but what the researchers call evolvability defects [MANT]: issues that make code harder to understand and maintain, more fragile and more difficult to modify and fix. Between 60% and 75% of the defects found in code reviews fall into this class. Of these, approximately one third are simple code clarity issues, such as improving element naming and comments. The rest are organizational problems: code that is poorly structured, duplicated or unused, code that could be expressed with a much simpler and cleaner implementation, or hand-rolled code that could be replaced with built-in language features or library calls. Reviewers also find changes that do not belong or are not required, copy-and-paste mistakes and inconsistencies.

These defects or recommendations feed back into refactoring and are important for future maintenance of the software, reducing complexity and making it easier to change or fix the code in the future. However, it’s more than this: many of these changes also reduce the technical risk of implementation, offering simpler and safer ways to solve the problem, and isolating changes or reducing the scope of a change, which in turn will reduce the number of defects that could be found in testing or escape into the release.

1.3.5 Other Guidelines

An important aspect of enterprise architecture is the development of guidelines for addressing common concerns across IT delivery teams. An organization may develop security guidelines, connectivity guidelines, coding standards, and many others. By following common development guidelines, the delivery teams produce more consistent solutions, which in turn make them easier to operate and support once in production, thereby supporting the DevOps strategy.


1.3.6 High Risk Code and High Risk Changes

If possible, all code should be reviewed. However, what if this is not possible? One needs to ensure that high risk code and high risk changes are always reviewed. Listed below are candidates in each category.

High risk code:

 Network-facing APIs
 Plumbing (framework code, security libraries, ...)
 Critical business logic and workflows
 Command and control and root admin functions
 Safety-critical or performance-critical (especially real-time) sections
 Code that handles private or sensitive data
 Code that is complex
 Code developed by many different people
 Code that has had many defect reports – error-prone code

High risk changes:

 Code written by a developer who has just joined the team
 Big changes
 Large-scale refactoring (redesign disguised as refactoring)

1.4 Testing

Although static analysis and code review prevent many errors, these activities will not catch them all; mistakes can still creep into the production environment. Another best practice is that untested code must be exercised in a testbed, and all changes must be made in the testbed first. Testing, except at the unit level, is performed following a formal delivery of the application build to the QA team after most, if not all, construction is completed. Unit tests are just as much about improving productivity as they are about catching bugs, so proper unit testing can speed development rather than slow it down. Unit testing is not in the same class as integration testing, system testing, or any kind of adversarial "black-box" testing that attempts to exercise a system based solely on its interface contract. These types of tests can be automated in the same style as unit tests, perhaps even using the same tools and frameworks. However, unit tests codify the intent of a specific low-level unit of code; they are focused and they are fast. When an automated test breaks during development, the responsible code change is rapidly identified and addressed. This rapid feedback cycle generates a sense of flow during development, which is the ideal state of focus and motivation needed to solve complex problems.
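For illustration, a minimal JUnit 5 unit test is sketched below; the invoice calculation it exercises is a hypothetical example included only so the snippet is self-contained, not anything from the report.

import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assertions.assertThrows;

import org.junit.jupiter.api.Test;

// A focused, fast unit test that codifies the intent of a single low-level unit
// and fails immediately when a code change breaks that intent.
class InvoiceCalculatorTest {

    // Minimal unit under test, included here so the example is self-contained.
    static double total(double[] lineItems, double taxRate) {
        double sum = 0.0;
        for (double item : lineItems) {
            if (item < 0) {
                throw new IllegalArgumentException("line items must be non-negative");
            }
            sum += item;
        }
        return sum * (1.0 + taxRate);
    }

    @Test
    void totalIncludesLineItemsAndTax() {
        assertEquals(16.05, total(new double[] {10.00, 5.00}, 0.07), 0.001); // (10 + 5) * 1.07
    }

    @Test
    void rejectsNegativeLineItems() {
        assertThrows(IllegalArgumentException.class,
                () -> total(new double[] {-1.00}, 0.07));
    }
}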

As software grows, defect potential increases and defect removal efficiencies decrease. The defect density at release time increases and more defects are released to the end-user(s) of the software product. Larger software size increases the complexity of software and, thereby, the likelihood that more defects will be injected. For testing, a larger software size has two consequences:

 The number of tests required to achieve a given level of test coverage increases exponentially with software size.
 The time to find and remove a defect first increases linearly and then grows exponentially with software size.


As software size grows, software developers have to reduce defect potential and improve removal efficiency simply to maintain existing levels of released defect density. The raw metric is only relevant when too low and requires further analysis when high.

1.4.1 Number of Test Cases

Although there are many definitions of software quality, it is widely accepted that a project with many defects lacks quality. Testing is one of the most crucial tasks in software development for increasing software quality, and a large part of the testing effort goes into developing test cases. The motive for writing test cases should be the complete and correct coverage of a requirement, which could require five or fifty test cases. The number of test cases is basically irrelevant for this purpose and can even be a damaging distraction: a large number of test cases could artificially inflate confidence that the software has been adequately tested. There is also no standard on what constitutes one test case. A tester can create one large test case or 200 smaller ones. It is good practice to write a separate test case for each functionality, and some testers break test cases down further into discrete steps. Thus, the number of test cases cannot assure a requirement’s coverage; it is the content of the test cases that covers a requirement. The number of test cases also does not indicate the quality of the test cases. Choosing the right techniques and prioritizing the right test cases can provide significant economic benefits. Therefore, it is important to analyze test case quality. There are many facets to test case quality, such as the number of revealed faults and efficiency, the time spent to reveal a fault. The most common and oldest are coverage measures, used as a direct measure of test case quality [FRAN, Hutchins]. Interesting research results on test coverage are presented in a paper by Mockus, Nagappan, and Dinh-Trong [MOCK]. Key observations and conclusions from the paper are the following:

 "Despite dramatic differences between the two industrial projects under study we found that code coverage was associated with fewer field failures.” This strongly suggests that code coverage is a sensible and practical measure of test effectiveness."

 The authors state “an increase in coverage leads to a proportional decrease in fault potential."

 "Disappointingly, there is no indication of diminishing returns (when an additional increase in coverage brings smaller decrease in fault potential)."

 "What appears to be even more disappointing, is the finding that additional increases in coverage come with exponentially increasing effort. Therefore, for many projects it may be impractical to achieve complete coverage." From this paper, it can be concluded that more coverage means fewer bugs, but this comes with increasing cost. Although there are strong indications that there is a correlation between coverage and fault detection, only considering the number of faults may not be sufficient. Code coverage does not guarantee that the code is correct, and attaining 100 percent code coverage does not imply the system will have no failures. It means that bugs can be found outside the anticipated scenarios.


1.5 Agile Process and QA

The development process has been transformed to Agile and the code is being developed iteratively. The product owners maintain the backlog and the development team completes chunks of the product in two or three week increments, or sprints. The QA process follows the released sprint. The process just described is partially stuck in Waterfall mode (in the mindset of developers) if the QA department is lagging a sprint behind the development team. Often, the developers in this scenario consider their work done when they have deployed their changes to a QA environment for testing purposes. This is “throwing it over the wall” again, but in smaller increments. Many factors can work against this setup: loosely defined acceptance criteria, outdated quality standards, time-consuming regression tests, slow and error-prone deployments to the QA environment, and rigid organization charts can all derail the development. Note that the setup described is not unusual and has been attempted before, so it is important to revisit it and identify some of the common mistakes made by organizations during their transition to Agile.

There are two main ways that Agile provides a basis for system quality verification: the acceptance criteria and the definition of done. Acceptance criteria are normally expressed in Gherkin, the language used in Cucumber to write tests. This structured format defines the functionality, performance or other non-functional aspects required for the software to be accepted by the business and/or stakeholders. Naturally, the acceptance criteria should be defined before beginning the development effort on the functionality, and the team and testers should all be somewhat involved in producing the criteria. With the acceptance criteria in hand, the developers understand what needs to be built by knowing how it will be tested. Possessing the satisfaction criteria implies that the developers build software that yields only the desired outcomes.
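As a hypothetical illustration (the feature and values are invented for this sketch, not taken from the report), an acceptance criterion written in Gherkin might look like the following; each Given/When/Then step can later be bound to automated test code by Cucumber.

Feature: Release report export
  Scenario: Export completes for a standard monthly report
    Given a project with 25 completed user stories
    When the product owner exports the monthly release report
    Then the report is generated within 10 seconds
    And it lists all 25 completed stories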

The second way of checking quality is through the definition of done. Done is a list of quality checks that have to be satisfied before a piece of functionality can be considered done. The list also includes the non-functional quality requirements that the team must always adhere to when working through the backlog, sprint after sprint. The acceptance criteria ensure that the software is built to deliver the expected value, and the definition of done ensures that the software is built with quality in mind.

In all development models, testing constitutes a majority of the schedule; some suggest that if this is not the case, then quality is most likely suffering. In an Agile process, change occurs rapidly, so regression testing is a frequent occurrence to confirm that changes have not regressed the software. With every new release, as more functionality is added, the amount tested and retested expands. The only sensible approach is to automate the tasks that can be automated. Developers should be writing unit tests throughout development. Integration tests verify that the internal components work together properly and that the new code integrates well with external components; these are normally automated. Acceptance tests verify that the acceptance criteria are being satisfied, and automating these tests avoids manually running regression tests with each new release. The timing of test automation is also important. The ideal process is to automate the acceptance tests before the development effort has even begun, known as Acceptance Test-Driven Development (ATDD). There are tools that translate the acceptance criteria’s Gherkin statements into tests. Even if ATDD is not followed, automating the acceptance tests should be part of the definition of done. Manually triggering tests and manual deployments are also common pitfalls. The best practice is early visibility of quality issues: the developer is notified immediately after committing the offending code into source control. A continuous integration (CI) build that runs the suite of automated tests enables the developer to look at the issue with all the details of the problem fresh in his/her mind.

As others have experienced, combining acceptance testing and development into one gated check-in saves cost and rework: the code cannot be committed into source control without first passing all required tests of the CI build. Some of the software can be environment-specific, and the tests must be executed against an environment that mirrors production as closely as possible. Combining the CI build with automated deployment provides continuous deployment, at least to the test environment. The latest code is automatically deployed to the test environment after each successful commit to source control, enabling the test environment to continuously exercise the latest work by the developers. Bugs are made visible significantly sooner in the development cycle, including the fickle “it works on my machine” bugs. The best practice is to tightly integrate the development and QA efforts. Testing should be incorporated into the core development cycle such that no team can ever call anything “done” unless it is fully vetted by thorough testing. There are some aspects of quality assurance that are challenging for a development team primarily focused on delivering new functionality of high quality: security and load testing. These types of tests, although they can be automated, are so costly in time and processing power to run that they are not part of any automated test suite that executes as part of continuous integration. Another category of testing that is best suited to dedicated QA testers is exploratory testing. Automated tests can only catch bugs in the predictable and designed behavior of an application, while exploratory tests catch the rest. The QA department should coordinate and refine these practices, provided that the testers themselves are allocated to the development team whose code is being tested.

1.5.1 Agile Quality Assessment (QA) on Scrum

There are many challenges when applying an Agile quality assessment. The following questions must be assessed to determine the process:

 Is QA part of the development team?
 Can QA be part of the same iteration as development?
 Who performs QA? (A separate team?)
 Does QA cost more in Agile as the product fluctuates from sprint to sprint?
 Can Agile QA be scaled?
 Is a test plan required?
 Who defines the test cases?
 Are story acceptance tests enough?
 When is testing done?
 When and how are bugs tracked?

Much of QA is about testing to ensure the product is working right. Automation is QA’s best friend, providing repeatability, consistency and better test coverage. Since sprint cycles are very short, QA has little time to test the application. QA performs full functionality testing of the new features added for a particular sprint as well as full regression testing of all previously implemented functionality. As development proceeds, this responsibility grows, and any automation will greatly reduce the pressure. Early in the transition to Agile, the process may have less-than-optimal practices until the root cause can be addressed. For example, a sprint dedicated to regression testing is not reflective of an underlying Agile principle; this sprint is sometimes labeled a hardening sprint and is considered an anti-pattern. An Informit publication related the story of a company that struggled with testing everything in a sprint because of a large amount of legacy code with no automated tests and a hardware element that required manual testing. Until more automation could be implemented, a targeted regression testing sprint was initiated at the end of each sprint, and another sprint was added before each release for a more thorough regression testing session with participation by all groups. To erase the issue of no legacy test automation, an entire team was assigned to automate tests. Meanwhile, the other teams were trained in test automation techniques and tools to start creating automated tests during current sprints. Eventually, the test suites were automated, the dedicated test team was no longer required, and the Scrum teams were automating their own tests. The result was that the time required for regression testing was cut in half and the hardening sprints were greatly reduced.

A presentation, “Case: Testing in Large-Scale Agile Development”, by Ismo Paukamainen, senior specialist in test and verification at Ericsson, was given at the FiSTB Testing Assembly 2014 [PAUK]. In this presentation, he describes and outlines Ericsson’s transformation from RUP to an Agile process. Naturally, parts of the presentation focused on continuous integration, which provides continuous assurance of system quality. It appears that the process was very good in functional performance quality, but not as good in the areas labeled as non-functional. As he states: “Before Agile, the system test was a very late activity, having a long lead-time. It was often hard to convince management about the needs for system tests requiring resources, human and machine for many weeks. This was because the requirements for the product are most often for the new functionality, not for non-functional system functions which are in a scope of system tests. The fact is that only ~5% of the faults found after a release are in the new features, the rest are in the customer perceived quality area.” At the conclusion of the talk, he also had five insightful takeaways about the transition:

 Test competence: If not spread equally, think about other ways to support teams (e.g., in test analysis and sprint planning). Product owners should take responsibility to check that there are enough tests for a sprint. A dedicated testing professional position in a cross-functional team is recommended.

 Fast feedback: In Waterfall, the aim was to do as much testing as possible at a lower integration level; testing then happened earlier and it was easier to find (and fix) faults closer to the designer. In Agile, the aim is to get feedback as fast as possible, which means the strategy is no longer to run a mass of tests at the lower level, but to run tests at the level that gives the fastest feedback. So it might be that running tests on a target environment (i.e., production-like) serves feedback better, and the lower level is needed only to verify some special error cases that may not be possible to execute on target.

 Test automation is a must in Agile: Use common frameworks and test cases as much as possible. Try to avoid extra maintenance work around automation (for example, continuous integration).

 Independent test teams are a good way to support cross-functional teams, especially to cover agile testing. Making non-functional system tests in cross-functional teams would mean: i) possible overlapping testing, ii) a need for more test tools and test environments, iii) a competence issue, and iv) possibly too much to do within sprints. Independent test teams need to be co-located with cross-functional teams and have good communication with them. A sense of community!


 Raise Your Organizational Awareness of the Product Quality: Monitor the system quality (Robustness, Characteristics, Upgrade …) and make it visible through the whole organization.

The desired software is broken down into named features (requirements, stories), which are part of what it means to deliver the desired system. For each named feature, there are one or more automated acceptance tests which, when they pass, show that the feature in question is implemented. The running tested features (RTF) metric shows at every moment in the project how many features are passing all of their acceptance tests. Automated testing is also a factor in quality-producing environments; therefore, measuring automated unit and acceptance test results is another important measure.
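A minimal sketch (hypothetical feature names and types, not from the report) of how an RTF count can be derived from per-feature acceptance test results:

import java.util.List;
import java.util.Map;

public class RunningTestedFeatures {

    // A feature counts toward RTF only when every one of its automated acceptance tests passes.
    static long rtf(Map<String, List<Boolean>> acceptanceResultsByFeature) {
        return acceptanceResultsByFeature.values().stream()
                .filter(results -> !results.isEmpty() && results.stream().allMatch(Boolean::booleanValue))
                .count();
    }

    public static void main(String[] args) {
        Map<String, List<Boolean>> results = Map.of(
                "Export release report", List.of(true, true, true),
                "Bulk user import", List.of(true, false),   // one failing test, does not count
                "Audit log search", List.of(true));
        System.out.println("RTF = " + rtf(results));        // prints "RTF = 2"
    }
}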

1.6 Product Backlog

A common Agile approach to change management is a product backlog strategy. A foundational concept is that requirements, and defect reports, should be managed as an ordered queue called a "product backlog." The contents of the product backlog will vary to reflect evolving requirements, with the product owner responsible for prioritizing work on the backlog based on the business value of the work item. Just enough work to fit into the current iteration is selected from the top of the stack by the team at the start of each iteration as part of the iteration planning activity. This approach has several potential advantages. First, it is simple to understand and implement. Second, because the team is working in priority order, it is always focusing on the highest business value at the time, thereby maximizing potential return on investment (ROI). Third, it is very easy for stakeholders to define new requirements and refocus existing ones.

There are also potential disadvantages. The product backlog must be groomed throughout the project lifecycle to maintain priority order, and that effort can become a significant overhead if the requirements change rapidly. It also requires a supporting strategy to address non-functional requirements. With a product backlog strategy, practitioners will often adopt an overly simplistic approach that focuses only on managing functional requirements. Finally, this approach requires a product owner who is capable of managing the backlog in a timely and efficient manner.

Section 2: Software Product Quality and its Measurement

2.1 Goal Question Metric Model

Many software metrics exist that provide information about the resources, processes and products involved in software development. Introducing software metrics to provide quantitative information is necessary for a successful measurement program, but it is not enough; there are other important success factors that must be considered when selecting metrics. Foremost is that the metrics must quantify performance achievements towards measurement goals. Basili created the Goal Question Metric (GQM) interpretation model to assist with outlining the goals, subgoals and questions for a measurement program [BASI]. Table 1 consists of a GQM definition template employing a DevOps concept for software development. The development process affects the nature and timing of the metrics.

Table 1: Main Goal of Software Development

Analyze: Software Development
For the purpose of: Assessing and Improving Performance
With respect to: Software Quality
From the viewpoint of: Management, Scrum Master and Development Team
In the context of: DevOps Environment

2.2 Generic Software Quality Models

The main goal of assessing and improving software development with respect to quality can be broken down into three aspects. The sub-goals of functional, structural and process quality improvement form the basis for deriving the questions and metrics of the GQM. Dividing software quality into three sub-goals allows us to illuminate the trade-offs that exist among competing goals. In general, functional quality reflects the software’s conformance to the functional requirements or specifications; it is typically enforced and measured through testing. Software structural quality refers to achieving the non-functional requirements that support the delivery of the functional requirements, such as reliability, efficiency, security and maintainability. Process quality, just as important as the first two sub-goals, which receive the majority of the quality dialog, refers to a process that consistently delivers quality software on time and within budget. Table 2 breaks down the three sub-goals into more measurable components.

Table 2: Three Sub-Goals Broken Down into Measurable Components

Functional
 Does the system deliver the business value planned? How many user requirements were delivered in the sprint? (Property: enhancement rate)
 Does the solution do the right thing? How many bugs were removed in the sprint? (Property: defect removal rate)

Structure
 How modifiable (maintainable) is the software? (Property: maintainability)
 How modular is the software? (Property: duplication)
 How testable is the software? (Property: unit size / complexity)
 What is the performance efficiency of the software? (Property: efficiency – time-behavior, resource utilization, capacity)
 How secure is the software? (Property: security – confidentiality, integrity, non-repudiation, accountability, authenticity)
 How usable is the software? (Property: usability – learnability, operability, user error protection, user interface aesthetics, accessibility)
 How reliable is the software? (Property: reliability)

Process
 What is the capacity of the software development process? (Property: velocity)
 What is the cycle/lead time? (Property: cycle time / lead time)
 How many bugs were fixed before delivery? (Property: defect removal effectiveness)
 How can we improve the delivery of business value?

Structural quality is determined through analysis of the system, its components and its source code. Software quality measurement is about quantifying the extent to which a system or piece of software possesses desirable characteristics. This can be performed through qualitative or quantitative means, or a mix of both. In either case, for each desirable characteristic there is a set of measurable attributes whose presence in a piece of software or system tends to be associated with that characteristic. Historically, many of these attributes have been extracted from ISO 9126-3 and the subsequent ISO 25000:2005 quality model, also known as SQuaRE. Based on these models, the Consortium for IT Software Quality (CISQ) has defined five major desirable structural characteristics needed for a piece of software to provide business value: reliability, efficiency, security, maintainability and (adequate) size. In Figure 5, the five characteristics on the right side that matter to the user or owner of the business system depend on the measurable attributes on the left side. Other quality models have been created, such as the one from Fenton depicted in Figure 6. Understanding the professional meaning of code quality requires a complete study of these concepts. However, these models do not lend themselves naturally to practical development environments, and we need to explore more deeply what impacts business value.

Figure 5: Relationship Between Desirable Software Characteristics (right) and Measurable Attributes (left) [WIKI]


Figure 6: Software Quality Model [FENT]

2.3 Comparing Traditional and Agile Metrics

Traditional software development and Agile methods actually have the same starting point. Each process plans to develop a product of acceptable quality, applying a specific amount of effort within a certain timeframe. The approaches and processes differ, but the goal stated above is the same. Traditional software methods apply metrics to plan and forecast, monitor and control, and integrate performance improvement within the process, and Agile also requires metrics with these same capabilities. Agile clearly differs from the traditional approach in that traditional software development metrics track against a plan by evaluating cost expenditures, whereas Agile development metrics do not track against a plan. Agile metrics attempt to measure the value delivered or the avoidance of future costs. Another difference between Agile and traditional is the units of measure. Table 3 is a matrix comparing the core metric units of Agile to traditional software development [NESM].

Table 3: Comparison of Core Metrics for Agile and Traditional Development Methods

Core Metric | Agile | Traditional
Product (size) | Features, stories | Function points (FP), COSMIC function points, use case points
Quality | Defects/iteration, defects, MTTD | Defects/release, defects, MTTD
Effort | Story points | Person months
Time | Duration (months) | Duration (months)
Productivity | Velocity, story points/iteration | Hours/FP


In Table 3, Agile uses a subjective unit, the story point, to measure effort, making comparisons between teams, projects and organizations impossible. Traditional methods use the standardized units of measure, function points (FP) and COSMIC function points (CFP). Both FP and CFP are objective and are recognized international standards. Several estimation and metric tools use the metric hours/FP for benchmarking purposes. A noticeable characteristic of Agile is the absence of benchmarking metrics or any other form of external comparison. The units of measure used for product (size) and productivity are subjective and apply exclusively to the project and team in question. There is no possibility to compare development teams or tendering contractors on productivity, so selecting a development team based on productivity is virtually impossible [NESM].

2.4 NESMA Agile Metrics

The Netherlands Software Metrics User Association (NESMA) began in 1989 as a reaction to the counting guidelines of the International Function Point Users Group (IFPUG) and became one of the first FPA user organizations in the world. The NESMA standard for functional size measurement became an ISO/IEC standard. In 2011, the organization shifted from an FPA user group to an organization that provides information about the applied use of software metrics: estimation, benchmarking, productivity measurement, outsourcing and project control. NESMA conducted a search of the web for recommended Agile metrics, a so-called state-of-the-practice survey. They divided the survey into three main areas of interest: planning and forecasting, monitoring and control, and performance improvement. The following three tables, Tables 4, 5 and 6, transcribe information from the website (http://nesma.org/2015/04/Agile-metrics/).

2.4.1 Metrics for Planning and Forecasting

Table 4: Metrics for Planning and Forecasting

Metric | Purpose | How to measure
Number of features | Insight into size of product (and entire release). Insight into progress. | The product comprises features that in turn comprise stories. Features are grouped as "to do", "in progress" and "accepted".
Number of planned stories per iteration/release | Same as number of features. | The work is described in stories which are quantified in story points.
Number of accepted stories per iteration/release | To track progress of the iteration/release. | Formal registration of accepted stories.
Team velocity | See monitoring and control. |
LOC | Indicates amount of completed work (progress); input for calculation of other metrics, i.e. defect density. | According to the rules agreed upon.


2.4.2 Dividing the Work into Manageable Pieces

In order to plan and forecast, the development process requires the work to be divided into manageable pieces. In larger organizations it is essential that these pieces are organized, scaled and of a consistent hierarchy if they are going to be used for measurement. There are two important abstractions used to build software: features and components. Features are system behaviors useful to the customer. Components are distinguishable software parts that encapsulate the functions needed to implement features. Agile delivery focuses on features (stories). Large-scale systems are built out of components that provide separation of concerns and improved testability, providing a base for fast system evolution. In Agile, should the teams be organized around features, components or both? Getting it wrong can lead to a brittle system (all feature teams) or a great design whose value arrives only in the future (all component teams). Previously, large-scale developments followed the component organization depicted in Figure 7. The problem with this organization is that most new features create dependencies that require cooperation between teams, thereby creating a drag on velocity because the teams spend time discussing and analyzing dependencies. Sometimes component organization may be desired, for example when one component has higher criticality, requires rare or unique skills and technologies, or is heavily used by other components or systems. Feature team organization, pictured in Figure 8, operates through user stories and refactoring.

Figure 7: An Agile Program Comprised of Component Teams [SCAL]

Figure 8: An Agile Program Comprised of Feature Teams

For large developments, the organization is not as clear cut. Some features are large and are split into multiple user stories. It is overly simplistic to think of all teams being either component- or feature-based. To ensure the highest feature throughput, the SAFe (Scaled Agile Framework) guidelines suggest a mix of feature and component teams, with feature teams making up the highest percentage at about 75-80%. The split is dictated by the number of specialized technologies or skills required to develop the product. Depending on the hierarchy, features or stories, the unit of work used in Table 5 may mask important details if not counted uniformly.

2.4.3 Metrics for Monitoring and Control

Table 5: Metrics for Monitoring and Control

Metric | Purpose | How to measure
Iteration burn-down | Performance per iteration; are we on track? | Effort remaining (in hrs) for the current iteration (effort spent/planned expresses performance).
Team velocity per iteration | To learn the historical velocity for a certain team. Cannot be used to compare different teams. | Number of realized story points per iteration within this release. Velocity is team and project-specific.
Release burn-down | To track progress of a release from iteration to iteration. Are we on track for the entire release? | Number of story points "to do" after completion of an iteration within the release (extrapolation with a certain velocity shows the end date).
Release burn-up | How much 'product' can be delivered within the given time frame? | Number of story points realized after completion of an iteration.
Number of test cases per iteration | To identify the amount of testing effort per iteration. To track progress of testing. | Number of test cases per iteration recorded as sustained, failed, and to do.
Cycle time (team's capacity) | To determine bottlenecks of the process; the discipline with the lowest capacity is the bottleneck. | Number of stories that can be handled per discipline within an iteration (i.e. analysis, UI design, coding, unit test, system test).
Little's Law: cycle times are proportional to queue length | Insight into duration; we can predict completion time based on queue length. | Work in progress (# stories) divided by the capacity of the process step.

One metric not mentioned previously is Little's Law, which states that the more items in the queue, the longer the average time each item takes to travel through the system. Therefore, managing the queue (backlog) is a powerful mechanism for reducing wait time, since long queues result in delays, waste, unpredictable outcomes, disappointed customers and low employee morale. (See Section 1.6, Product Backlog.) However, everyone realizes that variability exists in technology work. Some companies limit utilization to less than 100% so that a development effort has some flexibility, which is counterintuitive to most models that suggest setting resources to 100% utilization. Also, there is the well-known observation that work expands to fill the time allotted. To make use of the slack left by less than 100% utilization, some have instituted a Hardening, Innovation and Planning (HIP) sprint to promote a new innovation or technology, find a solution to a nagging defect or identify a fantastic new feature.
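As a small worked illustration of Little's Law described above (the numbers are invented, not project data), a queue of 30 stories feeding a process step that completes 5 stories per iteration implies an average wait of 6 iterations, and halving the queue halves the predicted wait:

// Illustrative only: predicted completion time from Little's Law,
// cycle time = work in progress / capacity of the process step.
public class LittlesLawExample {
    static double predictedCycleTime(int workInProgressStories, double storiesPerIteration) {
        return workInProgressStories / storiesPerIteration;
    }

    public static void main(String[] args) {
        int backlog = 30;          // stories queued for a process step (hypothetical)
        double capacity = 5.0;     // stories the step completes per iteration (hypothetical)
        System.out.printf("Predicted wait: %.1f iterations%n",
                predictedCycleTime(backlog, capacity));       // 6.0
        System.out.printf("With the queue halved: %.1f iterations%n",
                predictedCycleTime(backlog / 2, capacity));    // 3.0
    }
}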

2.4.4 Metrics for Improvement (Product Quality and Process Improvement)

Table 6: Metrics for Improvement (Product Quality and Process Improvement)

Metric | Purpose | How to measure
Cumulative number of defects | To track effectiveness of testing. | Logging each defect in a defect management system.
Number of test sessions | To track testing effort and compare it to the cumulative number of defects. | Extraction of data from the defect repository.
Defect density | To determine the quality of software in terms of "lack of defects". | The cumulative number of defects divided by KLOC.
Defect distribution per origin | To decide where to allocate quality assurance resources. | By logging the origin of defects in the defect repository and extracting the data by means of an automated tool.
Defect distribution per type | To learn what types of defects are the most common and help avoid them in the future. | By logging the type of defects in the defect repository and extracting the data by means of an automated tool.
Defect cycle time | Insight into the time to solve a defect (speed of defect resolution). | Resolution date of the defect (usually the closing date in the defect repository) minus its opening date.
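As a small illustration of two of the measures in Table 6 (the defect records and code size below are invented), defect density is the cumulative defect count divided by size in KLOC, and defect cycle time is the interval between a defect's opening and resolution dates:

import java.time.LocalDate;
import java.time.temporal.ChronoUnit;
import java.util.Arrays;
import java.util.List;

// Illustrative only: defect density (defects per KLOC) and average defect cycle
// time computed from a handful of invented defect records.
public class DefectMetrics {

    static class Defect {
        final LocalDate opened;
        final LocalDate closed;
        Defect(LocalDate opened, LocalDate closed) { this.opened = opened; this.closed = closed; }
    }

    public static void main(String[] args) {
        List<Defect> defects = Arrays.asList(
                new Defect(LocalDate.of(2015, 3, 2), LocalDate.of(2015, 3, 6)),
                new Defect(LocalDate.of(2015, 3, 9), LocalDate.of(2015, 3, 10)),
                new Defect(LocalDate.of(2015, 3, 12), LocalDate.of(2015, 3, 19)));

        double kloc = 12.5;                         // hypothetical code base size in KLOC
        double density = defects.size() / kloc;     // defects per KLOC

        double totalDays = 0;
        for (Defect d : defects) {
            totalDays += ChronoUnit.DAYS.between(d.opened, d.closed);   // resolution minus opening
        }
        double avgCycleDays = totalDays / defects.size();

        System.out.printf("Defect density: %.2f defects/KLOC%n", density);          // 0.24
        System.out.printf("Average defect cycle time: %.1f days%n", avgCycleDays);  // 4.0
    }
}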

As seen from Tables 4, 5 and 6, Agile metrics are essentially the same metrics as in traditional development, except that they use Agile units (features, stories) and concepts.

2.5 Another State-of-the-Practice Survey on Agile Metrics

Another state-of-the-practice survey divided Agile metrics into three main areas of interest: planning and forecasting, monitoring and control, and performance improvement [GALE]. Most researchers and consultants claim that there are four distinct areas of interest to measure in Agile development:

• Predictability
• Value
• Quality
• Team health (can be based on an Agile maturity survey)

How do the categories of predictability (see Section 1.2.2.2, Survey Results: Predictability), value, quality and team health overlap with the NESMA categories? Predictability maps to the planning and forecasting category, value maps to monitoring and control, and quality maps to performance improvement. It appears that the Agile community has not come to a consensus on what team health is. However, since development teams are the foundation of production, it does appear that the more organized and focused teams are, the better the outcome.

Based on the three distinct areas, predictability, value and quality, a list of what to measure during Agile development was compiled. This “What to measure?” list consists of twelve categories:


1. Events that halt a release, such as continuous integration or continuous deployment stops (quality-based metric of type outcome)
2. Number and types of corrective actions per team or across teams (quality, outcome)
3. Number of stories delivered with zero bugs
4. Number of stories reworked (value, output)
5. Percentage of technical debt addressed, with a target >30% (value or quality, outcome):
   • Coding standards violations
   • Code violations
   • Dead code
   • Code dependencies (coupling)
6. Velocity per team, where trending is most important. Velocity is not used to measure productivity but to derive the duration of a set of features (predictability, output).
7. Delivery predictability per story point, average variance improving across teams (predictability, output)
8. Release burn-down charts, displaying both story points completed and those added by iteration
9. Percentage of test automation, including UI level, component/middle tier and unit level coverage, where trending is most important (quality, output)
10. Organizational commit levels; the more that participate, the better the value (predictability, output)
11. New test cases added per release (quality, outcome)
12. Defect cycle time is useful. We want to reduce the time from defect detection to defect fix. This not only improves the business experience, but also reduces the code written on top of faulty code and ensures issues are fresher in developer minds and faster to fix.

How does NESMA's state-of-the-practice survey relate to the "What to measure?" list? The state-of-the-practice is a general set of metrics used by Agile environments. The "What to measure?" list depends on the same general measures, such as number of stories, but then qualifies them by a particular attribute or event, such as the number of stories reworked.

2.6 Metric Trends are Important

Almost every article reinforces the fact that trending is much more important than any specific data point [FOWL]. Used as a target, a metric becomes the only means of communicating a goal. In most cases it is an arbitrary number, and excessive amounts of time are spent determining its value or working to move toward it. When an attribute such as quality is turned into a number, the result is highly interpretive, and any figure is relative and arbitrary. As Martin Fowler points out, there is a significant difference between code coverage at 5% and at 95%, but what about between 94% and 95%? A target value such as 95% tells developers when to stop, but what if that additional 1% requires a significant amount of effort to achieve? Should the extra effort be expended, and does it bring additional value to the product? Focusing on trends provides feedback on real data and creates an opportunity for a reaction. Moving in either direction, a team can ask what is causing the change. Trend analysis produces actions earlier than concentrating on an individual number. Arbitrary absolute numbers can create a feeling of helplessness, especially when events outside of a team's control prevent progress. Trends focus attention on moving in the right direction rather than being hostage to external barriers. Since trends are important, Agile reporting should use shorter reporting periods to have more opportunity to react and change.

With any type of Agile methodology, it is important to reinforce lean and Agile principles, such as concentrating on working software, where numbers are not as important as trends. The project is already collecting velocity and burn-down numbers. It is simple: the more user requirements delivered to the customer, the greater the functional completeness (enhancement rate). The benefit which users receive from software usage increases with the degree of the software's functional completeness. The delivery rate of user requirements is also considered to be the throughput of the software development process.

2.7 Defect Measurement

In software, the narrowest sense of product quality is commonly recognized as a lack of defects or bugs in the product. The number of delivered defects is known to be an excellent predictor of customer satisfaction, so it is important to uncover trends in the defect removal processes. Using this viewpoint, or scope, three important measures of software quality are: the defect potential, defined as the number of injected defects in software systems, per size attribute; the defect removal efficiency, the percentage of defects removed before releasing the software to intended users; and the delivered defects, the number of released defects in the software, per size attribute. Defect potential refers to the total quantity of bugs or defects that will be found in five software artifacts: requirements, design, code, documents, and "bad fixes" or secondary defects. Defect potentials vary with application size, and they also vary with the class and type of software. Defect potentials are calibrated through function points. Organizations with combined defect removal efficiency levels of 75% or less can be viewed as exhibiting professional malpractice. In other words, they are below acceptable levels for professional software organizations.

Testing alone is insufficient for optimal defect removal efficiency. Most forms of testing can only achieve about 35% defect removal efficiency (DRE), and seldom top 50%. DRE is defined as

DRE = E / (E + D), where E is the number of errors found before delivery of the software to the end user and D is the number of defects found after delivery. To achieve a high level of cumulative defect removal, many forms of defect removal need to be combined. In a blog post (https://www.linkedin.com/grp/post/2191046-105467445), Capers Jones provides an analysis in which he revisited 21 famous software problems, such as the Therac-25 radiation overdoses, the Wall Street crash of 2008, the McAfee anti-virus bug of 2010, the Knight Capital stock trade problem of August 2012, and others. All of these systems had been tested, yet none of the famous software problems had been found through testing alone. He suggests that pre-test inspections and static analysis would have found most of them. Below are the defect removal efficiency rates for various methods, based on commercial applications.

Measuring Defect Removal Efficiency [BLAC]
Patterns of Defect Prevention and Removal Activities

Prevention Activities
Prototypes | 20.00%

Pretest Removal
Desk checking | 15.00%
Requirements review | 25.00%
Design review | 45.00%
Document review | 20.00%
Code inspections | 50.00%
Usability labs | 25.00%
Subtotal | 89.48%

Testing Activities
Unit test | 25.00%
New function test | 30.00%
Regression test | 20.00%
Integration test | 30.00%
Performance test | 15.00%
System test | 35.00%
Field test | 50.00%
Subtotal | 91.88%

Overall Efficiency | 99.32%
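The subtotals and the overall figure in the table above are consistent with treating each activity as an independent filter that removes its stated percentage of whatever defects remain after the preceding activities. The sketch below is a simple model of that assumption, not a measurement tool; it reproduces the 89.48%, 91.88% and 99.32% values and also shows the basic DRE calculation with hypothetical counts:

// Illustrative model: cumulative defect removal efficiency of a sequence of
// activities, assuming each removes its stated fraction of the defects that
// survive the previous activities. Reproduces the subtotals in the table above.
public class CumulativeDre {

    static double cumulative(double... efficiencies) {
        double remaining = 1.0;                 // fraction of defects still present
        for (double e : efficiencies) {
            remaining *= (1.0 - e);             // each activity filters what is left
        }
        return 1.0 - remaining;                 // fraction removed overall
    }

    public static void main(String[] args) {
        double[] pretest = {0.15, 0.25, 0.45, 0.20, 0.50, 0.25}; // desk checking ... usability labs
        double[] testing = {0.25, 0.30, 0.20, 0.30, 0.15, 0.35, 0.50};

        System.out.printf("Pretest subtotal: %.2f%%%n", 100 * cumulative(pretest));   // ~89.48
        System.out.printf("Testing subtotal: %.2f%%%n", 100 * cumulative(testing));   // ~91.88

        double overallRemaining = (1 - 0.20)                      // prevention (prototypes)
                * (1 - cumulative(pretest))
                * (1 - cumulative(testing));
        System.out.printf("Overall efficiency: %.2f%%%n", 100 * (1 - overallRemaining)); // ~99.32

        // DRE itself: E defects found before delivery, D found after delivery.
        int e = 950, d = 50;                                      // hypothetical counts
        System.out.printf("DRE = %.1f%%%n", 100.0 * e / (e + d)); // 95.0
    }
}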

Defect tracking and its analysis have traditionally been used to measure software quality throughout the lifecycle. However, in Agile methodologies, it has been suggested that pre-production defect tracking may be detrimental to software teams [TECH]. Many suggest that pre-production tracking makes it difficult to determine a true value of software quality. Pre-production defect tracking (especially resulting from QA) is still very important, but the focus of the activity should shift to prevention. All defects should be measured by phase of origin (requirements, design, code, user documents and bad fixes) so that the causes, and ways to improve the process, can be identified.

As previously stated, for more than 40 years customer satisfaction has had a strong correlation with the volume of defects in applications when they are released to customers. Released defect levels are a product of defect potentials and defect removal efficiency. Jones and Bonsignour observe that the Agile community has not yet done a good job of measuring defect potentials, defect removal efficiency, delivered defects or customer satisfaction [JONEa]. A development group that does not reach a defect removal efficiency of 85% or above will not have a good customer satisfaction rating.

For defects, identify the areas in the code that have the most bugs, the length of time to fix bugs and the number of bugs each team can fix during an indicated time span. Track the bug opened/closed ratios to determine whether more bugs are being uncovered than in previous iterations or whether a team is falling behind; this may indicate a need to review and fix rather than attempting to deliver a new feature. Determine the reasons for any changes in trend. Collect post-sprint defects, QA defects, and post-release defect arrivals. Complete a root cause analysis; for example, determine why a particular defect escaped from testing or why a defect was injected into the code. It is especially insightful when a defect count or trend is matched with a QA activity.

To compare between teams, systems or organizations, defect density (the number of bugs per LOC or another size metric such as function points) is used. Defect density is a recognized industry standard and a best practice. Current defect density numbers can be compared against data retrieved from organizations such as Gartner or the International Software Benchmarking Standards Group (ISBSG), normally for a fee. The true defect density is not known until after the release, and for this reason Microsoft uses code churn to predict defect density. Moreover, the defect data must filter incidents to get defects. Incidents can be labeled in the data store as: Change Request Agreed, Deferred, Duplicate, Existing Production Incident, Merged with another Defect, No Longer an Issue, Not a Fault, Not in Scope of Project, Resolution Implemented, Referred to Another Project, Third Party Fix, Risk Accepted by the Business, Workaround Accepted by the Business, or other customized exceptions. There are also problems associated with comparing defect density against outsiders. Every tool has its own definition of size (LOC), it is easy for projects to add more code to make the LOC metric look better, and comparisons between code languages are meaningless without an agreed-upon LOC equivalency table. Moreover, source code is not always available, such as in third-party applications. Therefore, benchmarking against yourself may be the most effective approach.

Figure 9 compares the Agile and Waterfall defect removal rates in Hewlett-Packard projects [SIRI]. The Agile process has a more sustained defect removal rate throughout the lifecycle, whereas the Waterfall process displays a late peak with a gradual decline. This information was collected from two different product releases created by the same team but using two different processes. Note that it is easier to compare and observe defect trends in the Agile project and, therefore, easier to introduce modifications. Defects are important to study. A Special Analysis Report on Software Defect Density from the ISBSG reveals useful information about defects in software, both in development and in the initial period after a system has gone into operation:

• The split of where defects are found, i.e. in development or in operation, seems to follow the 80:20 rule. Roughly 80% of defects are found during development, leaving 20% to be found in the first weeks of a system's operation.
• Fortunately, in the case of extreme defects, less than 2.5% were found in the first weeks of operation.
• Extreme defects make up only 2% of the defects found in the Build, Test and Implement tasks of software development.
• The industry hasn't improved over time: software defect densities show no changing trend over the last 15 years.

Figure 9: Agile to Waterfall Defect Removal in Hewlett-Packard Projects

Maintainability is part of every quality model. Especially in light of DevOps development, which stresses testing in its process, maintainability is an important quality characteristic. A system will be around for a long time, and the traditional assumption that existing systems will inevitably decay and become more difficult and expensive to maintain should be resisted. The system must deliver new and better services at a reasonable cost. As new features are added to the system, they must be testable. The symptoms of poor testability are unnecessary complexity, unnecessary coupling, redundancy and a failure to relate the software model to the physical model. When these conditions exist, they also make automated testing more difficult. In the presence of these symptoms, tests either do not get created or have a lower probability of being executed because of the effort and time commitment. Developers cannot be assured that the system delivers the intended value if tests do not exist or are not executed.

The process has a prevailing influence over the quality of the code. One of the myths of Agile is that an iterated set of user stories will emerge with a coherent design requiring at most some refactoring to single out commonalities. In practice, these stories do not self-organize. Experience shows that when new functionality is added, a system tends to become more complex; hence the law of increasing entropy. Refactoring and technical debt are inextricably linked in the Agile space. Refactoring is a method of removing or reducing the presence of technical debt. However, not all technical debt is a direct refactoring candidate; technical debt can stem from documentation, test cases or any deliverable. Developers, product managers and researchers disagree about what constitutes technical debt. The simplest definition found is that technical debt is the difference between what was promised and what was actually delivered, including technical shortcuts made to meet delivery deadlines. An easy way to document technical debt is to raise an issue within the issue-tracking system (e.g., Jira). The issue can be documented with different priorities, such as those that block future functionality or hamper implementation. If a problem is small, it can be added to a sprint once the sprint's planned focus has been completed. This bookkeeping helps monitor technical debt. The process of refactoring must be incorporated, and as stated previously, reviews are better at uncovering evolvability defects.

Technical debt has become a popular euphemism for bad code. This debt is real, and we incur it both consciously and accidentally. Static analysis alone cannot fully calculate debt, and it may not always be possible to pay (fix) debt in the future. Modules are built on top of the original technical debt, which creates dependencies that eventually become too ingrained and too expensive to fix. Some researchers suggest that there are seven deadly sins in bad code, each one representing a major axis of quality analysis: bad distribution of complexity, duplications, lack of comments, coding rule violations, potential bugs, no unit tests or useless ones, and bad design. Many of these can be mitigated with the proper techniques and tools. The SonarQube default project dashboard tracks and displays each of these deadly sins.

Study after study has shown that poor requirements management is the leading cause of failure for traditional software development teams. When it comes to requirements, Agile software developers typically focus on functional ones that describe something of value to end users: a screen, report, feature, or business rule. Most often these functional requirements are captured in the form of user stories, although use cases or usage scenarios are also common, and more advanced teams will iteratively capture the details as customer acceptance tests. Over the years, Agilists have developed many strategies for dealing with functional requirements effectively, which is likely one of the factors leading to the higher success rates enjoyed by Agile teams. Disciplined Agile teams go even further, realizing that there is far more to requirements than this, and that non-functional requirements and constraints must also be considered. Although Agile teams have figured out how to address functional requirements effectively, most are struggling with non-functional requirements. The definition of software quality is very diverse, as seen in Figures 5 and 6. However, it is widely accepted that a project with many defects lacks quality. The simplest way of assessing software quality is by the frequency of critical or blocker bugs discovered post-release, as was outlined in the Zeroturnaround survey in Section 1.2.2. The problem with this assessment is that quality is measured after the fact. It is not acceptable to postpone the assurance of software quality until after release and, as outlined several times, the cost of removing bugs only increases the later they are found. Is there a direct method to quantify quality pre-release? There is no single metric that defines good versus bad software.


Software quality can be measured indirectly from attributes associated with producing quality software. From the Zeroturnaround survey, seven key traits of a high-performing DevOps culture, the environments that produced good quality systems on time, were outlined. Five of the seven are characteristics that can be copied and measured directly: these high-quality-producing environments deploy daily, handle non-functional requirements during every sprint, exploit automated testing, mandate strict version control and perform peer code review. The remaining two trends of successful DevOps teams, implementing end-to-end performance monitoring and metrics and allocating more cycle time to the reduction of technical debt, are much more challenging. What can be measured is code churn, static analysis findings, test failures, coverage, performance, bugs and bug arrival rates, and an indication of size. Any metric will be criticized for its effectiveness, especially if one is searching for a silver bullet. All metrics are somewhat helpful; each measures a particular attribute of the software. Also, the manner in which metrics are applied may not be perfect, and many practitioners use a metric for a purpose for which it was never intended. The original purpose of the McCabe metric (also known as cyclomatic complexity) was to measure the effort needed to develop test cases for a component or module. Every piece of code contains sections of sequence, selection and iteration, and this metric quantifies the number of linearly independent paths through a program's source code. To perform the basis path testing proposed by McCabe, each linearly independent path through the program must be tested; in this case, the number of test cases will equal the cyclomatic complexity of the program. Therefore, a module with a higher McCabe value requires more testing effort than a module with a lower value, since the higher value indicates more pathways through the code. A higher value also implies that a module may be more difficult for a programmer to understand, since the programmer must understand the different pathways and the results of those pathways.

For example, a cyclomatic complexity analysis can have problems stemming from recursion and fall-through. If one of the project goals is good performance, recursion should be avoided. Fall-through, where one component passes control down to another such that there is no single entry/exit point, also affects the metric. Les Hatton claimed (keynote at TAIC-PART 2008, Windsor, UK, Sept 2008) that McCabe's cyclomatic complexity number has the same predictive ability as lines of code. The selected threshold for cyclomatic complexity is based on categories established by the Software Engineering Institute, as follows:

Cyclomatic Complexity | Risk Evaluation
1-10 | A simple module without much risk
11-20 | A more complex module with moderate risk
21-50 | A complex module of high risk
51 and greater | An untestable program of very high risk
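As a small illustration (the method below is invented for the example), the following code has three decision points, counting the loop and each if as one decision, so its cyclomatic complexity is 4 and basis path testing would call for four test cases. Static analysis tools typically report this value per method.

// Illustrative only: a small method with three decision points
// (one loop condition and two if conditions), giving a cyclomatic
// complexity of 3 + 1 = 4. Basis path testing would therefore need
// four test cases, one per linearly independent path.
public class ComplexityExample {

    static int classify(int[] values, int threshold) {
        int above = 0;
        for (int v : values) {            // decision 1: loop continues or ends
            if (v > threshold) {          // decision 2
                above++;
            }
        }
        if (above == 0) {                 // decision 3
            return -1;                    // nothing exceeded the threshold
        }
        return above;
    }

    public static void main(String[] args) {
        System.out.println(classify(new int[] {1, 5, 9}, 4));  // 2
        System.out.println(classify(new int[] {1, 2}, 4));     // -1
        System.out.println(classify(new int[] {}, 4));         // -1
    }
}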

2.8 Defects and Complexity Linked

In the section "High Risk Code and High Risk Changes", one of the bullets suggests that complex code is high risk. Identifying the most complex code and monitoring its rate of change helps developers decide where to focus their review, testing and refactoring efforts. Software complexity encompasses numerous properties, all of which affect the external and internal interactions of the software. Higher levels of complexity increase the risk of unintentionally interfering with those interactions and increase the chance of introducing defects when creating or changing the software. Many measures of software complexity have been proposed. Perhaps the most common measure is McCabe's cyclomatic complexity, a measure of the number of linearly independent paths through a piece of code. Using cyclomatic complexity by itself, however, can produce the wrong results, because numerous other properties introduce complexity beyond the control flow of the software. Another important perspective comes from understanding the change in the complexity of a system over time, identifying components that:

• cross a defined threshold of complexity and are thus candidates for review and refactoring;

• suddenly change in complexity and are forecast to continue that trend;

• increase in complexity where it was not expected, as a possible indication of poor programming or design.

It is important to manage control flow complexity for testability. The completeness of test plans is often measured in terms of coverage. There are several levels or dimensions of coverage to consider:

• Function, or subroutine coverage – measures whether every function or subroutine has been tested

• Code, or statement coverage – measures whether every line of code has been tested

• Branch coverage – measures whether every case for a condition has been tested, i.e., tested for both true and false

• Loop coverage – measures whether every case of loop processing has been tested, i.e. zero iterations, one iteration, many iterations

• Path coverage – measures whether every possible combination of branch outcomes has been tested

Large programs can have huge numbers of paths through them. A program with a mere 20 control point statements (IF, FOR, WHILE, CASE) can have over one million different paths through it (paths = 2^n, and 2^20 is more than one million). Removing redundant conditions and organizing necessary conditions in the simplest possible way helps minimize control flow complexity, and thus minimizes both the probability of defects and the required testing effort.
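To make the coverage dimensions and the path explosion concrete (the method below is invented for the example), the sketch has three independent conditions: two tests can achieve full branch coverage, but full path coverage needs 2^3 = 8 combinations, which is why removing or simplifying conditions pays off directly in testing effort.

// Illustrative only: three independent conditions give 2^3 = 8 possible
// paths. Branch coverage (each condition evaluated both true and false)
// can be reached with as few as two tests; path coverage needs all eight.
public class CoverageExample {

    static int applyDiscounts(int price, boolean member, boolean coupon, boolean clearance) {
        if (member)    price -= 5;     // condition 1
        if (coupon)    price -= 10;    // condition 2
        if (clearance) price /= 2;     // condition 3
        return Math.max(price, 0);
    }

    public static void main(String[] args) {
        // Two tests hit every branch (all-true and all-false)...
        System.out.println(applyDiscounts(100, true, true, true));    // 42
        System.out.println(applyDiscounts(100, false, false, false)); // 100
        // ...but path coverage would require all 2^3 = 8 combinations.
        System.out.println((int) Math.pow(2, 3) + " paths in total");
    }
}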

Control flow is not the only aspect of concern when managing testability. Managing the data flow and its impact on the complexity of the code implementation is also a concern. Several methods exist to measure the use, organization or allocation of data.

• Span between data references is based on the position of the data references and the number of statements between each reference (the span). The larger the span, the more difficult it becomes for the developer to determine the value of a variable at a particular point, the more likely defects become, and the more testing is required.

• Particular data can possess different roles or usages within different modules or the same module. These roles are: input needed to produce a module's output; data changed or created within a module; data used for control; and data passing through unchanged. Researchers have observed that the type of data usage contributes to complexity in different amounts, with data used for control contributing the most.

By considering these data flow complexity factors when designing the program code, the ultimate testability and quality of the program can be increased.

After identifying these complex parts, developers can:

• Remove hard coding.
• Revisit other design aspects and determine whether the design needs to be upgraded.
• Hold a managed code walkthrough to inspect the code for defects.
• Refactor the section of code to simplify it; possibly break it into smaller, more manageable and more testable pieces.
• Seek alternative design solutions that avoid those parts.
• Adjust the programmer resource plan to place the most reliable programmers on those challenging programs.
• Allow additional time and resources for more extensive testing.

The "Pareto Principle", more commonly known as the "80/20" rule, describes the relation between causes and results. It claims that roughly 80% of the output is a direct result of about 20% of the input; it is generally known, for example, that 80% of the problems are located in 20% of the code. This phenomenon was also observed in the ISBSG study. A question all developers would like answered is: where is the risk? Which component or module is vulnerable or defect-prone? To assist in this quest, many software fault prediction models have been proposed. These models consist of various sets of metrics, both static and dynamic, to predict software fault-proneness. The problem is that each metric only partially reflects the many aspects that influence fault-proneness. Even though much effort has been directed toward this goal, none of the software fault prediction techniques has proven to be consistently accurate [BISH]. It is known that no single metric can predict bugs, and that testing demonstrates the presence of bugs but does not prove their absence. We have also found that enriched data (both product and process data) is helpful in prediction.

We outlined the importance of linking metrics to the organization's or project's objectives to demonstrate achievement of those objectives. There are other important factors that must also be considered during metrics adoption. In an Agile environment, much of the decision-making is delegated to the development teams, and those teams require information to assist them in their daily operations. Measurement needs to be integrated into the workflow to provide this assistance and to avoid measurement becoming a task of simply collecting data.

Capers Jones, who has been collecting software data for more than thirty years, makes this comment about metrics [JONEe]: "Accurate measurements of software development and maintenance costs and accurate measurement of quality would be extremely valuable. But as of 2014, the software industry labors under a variety of non-standard and highly inaccurate measures compounded by very sloppy measurement practices. For that matter, there is little empirical data about the efficacy of software standards themselves." Even with the metric inconsistency problem mentioned above, internally consistent metrics are important for internal benchmarks and comparisons. Metrics on size, productivity and quality are the key ones to concentrate on, and in order to analyze their value, consistency is the key. Metrics are like words in a sentence; together they create sense and meaning. Overanalyzing the data provided by one metric or one set of metrics (such as productivity) can be harmful to other aspects of the software, such as quality.

2.9 Performance, a Factor in Quality

Performance directly translates into utility for the end user. There are different levels of performance, as seen in the properties of Table 2, such as time-behavior, resource utilization and capacity. Ultimately, performance is about reducing user response times and latency throughout the system while retaining functional accuracy and consistency. Performance can be measured through low-level code performance and benchmarking and, at a higher level, by establishing a balance between resources and requirements.

The first focus will be on Java resource utilization and system requirements. In general, there are four main resources that are key to the performance of any executing system: CPU computing power, memory (both cache and RAM), IO/network, and the database. Database access is separated from IO/network resources because it can greatly affect performance and is often responsible for IO bottlenecks. The performance requirements can also be placed into four categories: throughput, scalability, responsiveness and latency, and resource consumption. Throughput is normally rated as the maximum number of concurrent users the system can accommodate at once. When the number of users grows, how the system responds to the increasing number of requests is a measure of its scalability. Responsiveness is the length of time the system takes to first respond to a request, and latency is the time required to process the request. The system must serve not only users; other tasks also consume resources and influence throughput. In general, most performance problems can be described in terms of one or more of these categories. For example, if speed is the issue, latency is most likely the problem.
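A rough way to relate these terms to numbers, shown below, is to time individual requests and derive latency and throughput from the same run. This is only a sketch with a placeholder workload; real measurements should come from a load-testing or APM tool such as those discussed later in this section.

import java.util.Arrays;

// Rough illustration: measure per-request latency with System.nanoTime()
// and derive average latency, a high percentile, and throughput for a
// placeholder workload.
public class LatencySketch {

    static void handleRequest() {
        // Placeholder for real work (e.g., a service call or query).
        long sum = 0;
        for (int i = 0; i < 200_000; i++) sum += i;
        if (sum < 0) System.out.println(sum);   // keep the loop from being optimized away
    }

    public static void main(String[] args) {
        int requests = 500;
        long[] latenciesNanos = new long[requests];

        long runStart = System.nanoTime();
        for (int i = 0; i < requests; i++) {
            long t0 = System.nanoTime();
            handleRequest();
            latenciesNanos[i] = System.nanoTime() - t0;
        }
        long runNanos = System.nanoTime() - runStart;

        Arrays.sort(latenciesNanos);
        double avgMs = Arrays.stream(latenciesNanos).average().orElse(0) / 1_000_000.0;
        double p95Ms = latenciesNanos[(int) (requests * 0.95) - 1] / 1_000_000.0;
        double throughput = requests / (runNanos / 1_000_000_000.0);

        System.out.printf("avg latency %.3f ms, p95 %.3f ms, throughput %.0f req/s%n",
                avgMs, p95Ms, throughput);
    }
}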

Performance tuning requires effort. In a survey conducted by a tool vendor [PLUM], it was found that 76% of respondents struggle most with gathering, making sense of, reproducing, and linking to the root cause the evidence required for performance analysis. Figure 10 presents a pie chart identifying the most time-consuming part of the process [ZERO]. The survey also asked how long it took to detect the root cause and solve the performance issue; the average time to find and fix a root cause is 80 hours.

There are three major types of performance tools used to identify performance problems or to optimize performance; they fall into categories that monitor, profile or test. Java profiling and monitoring tools measure and optimize performance during runtime. Performance testing identifies areas of heavy resource utilization. There are many Java monitoring tools or Application Performance Management (APM) tools. The issue with monitoring is that many production environments are a complex mixture of services very carefully balanced to work together. In addition, with applications shifting to the cloud and dramatically different enterprise architectures, APM tools are challenged to provide real performance benefits across systems with virtual perimeters. A blog posted on June 10, 2014 from profitbricks identifies 38 APM tools (https://blog.profitbricks.com/application-performagement-tools/). This list contains some of the most comprehensive APM tools available. Tool #5 is Compuware APM. An article from Zeroturnaround, which has its own APM tool, lists New Relic APM for web applications and also lists this tool for complex applications; the Zeroturnaround article lists the name as Dynatrace, a more recent name change. This APM has the largest market share as published by a Gartner report [GART]. One of Dynatrace's selling points is that it eliminates false positives and erroneous alerts, thereby reducing the cost of deploying and managing the application. Another tool in the same category, mentioned in both sources, is AppDynamics. It is listed as tool #2 in profitbricks, and the basic tool is free while the pro tool has a cost.


Figure 10: Most Time-Consuming Part of Performance Tuning

Monitoring tools that identify memory leaks, garbage collection inefficiencies and locked threads can also be used. These are less powerful, usually work through the JVM, and are less costly. One such tool is Plumbr, which runs as a Java agent on the JVM. It is used as an overall monitoring tool. Java Mission Control is a performance monitoring tool by Oracle and is free. It has a nice, simple, configurable dashboard for viewing statistics of many JVM properties.

Many of the monitoring tools assist in identifying when a performance problem exists; engineers must then find the cause and eliminate the issue. There are many tools and sources used for this evidence-gathering phase. Many engineers use the application log or heap and thread dumps as evidence. JVM tools such as jconsole, jmc, jstat and jmap can be used. On average, an engineer uses no fewer than four different tools to gather enough evidence to solve a performance problem. Other specialized tools, such as those offered by JClarity (Illuminate and Censum), can be used to identify the problem. Illuminate is a performance monitoring tool, while Censum is an application focused on garbage collection log analysis. Takipi was created for error tracking and analysis, informing developers exactly when and why production code breaks. Whenever a new exception is thrown or a log error occurs, Takipi captures the exception and shows the variable state that caused it, across methods and machines. Takipi lays this information over the actual code that executed at the moment of the error, so developers can analyze the exception as if they were there when it happened.
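Some of this evidence can also be captured from inside the JVM with the standard java.lang.management API. The sketch below is illustrative only and is not a replacement for the tools above; it prints a thread dump (with partial stack traces) and basic heap figures, which is often the first evidence examined.

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Illustrative only: capture a thread dump and heap usage from inside the
// running JVM using the standard java.lang.management beans.
public class JvmEvidence {
    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        for (ThreadInfo info : threads.dumpAllThreads(true, true)) {
            System.out.print(info);          // ThreadInfo.toString() includes a partial stack trace
        }

        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        System.out.println("Heap used (bytes): " + memory.getHeapMemoryUsage().getUsed());
        System.out.println("Heap max  (bytes): " + memory.getHeapMemoryUsage().getMax());
    }
}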

Code profilers gather information about low-level code events to focus on performance questions; many of them use instrumentation to extract this information. A popular and frequently mentioned profiler is YourKit, one of the most established leaders in Java profiling. Another profiler is JProfiler. It can also display a call graph in which methods are represented by colored rectangles, providing instant visual feedback about where slow code resides.

Application monitoring tools point out problems in performance, profilers provide low-level insight and highlight individual parts, and performance testing tools tell us whether a new solution is better than the previous one. Apache JMeter is an open source Java application for load testing functional behavior and measuring performance. The Zeroturnaround article [ZERO] has a section labeled "Performance Issues in Action" where the reader is led through an application (Atlassian Confluence) using the performance tools. It uses JMeter to create and gather profile data, YourKit to display and analyze the profile data, and XRebel to diagnose other, mostly HTTP, performance issues. It is an excellent exercise for those not familiar with these types of tools.

Teams are constantly delivering code. SonarQube can be used to analyze the frequency of changes and the size of changes, and to correlate this information with error data to assist in understanding whether the code is being changed too much or too quickly to be safe. Another metric which measures change is code churn. Code churn is a measure of the amount of code change taking place within a software unit over time. It is easily extracted from a system's change history, as recorded automatically by a version control system. Code churn has been used to predict fault potential, where large and/or recent changes contribute the most to fault potential. Microsoft uses code churn as an early prediction of system defect density, using a set of relative code churn measures that relate the amount of churn to other variables such as component size and the temporal extent of churn [NAGA].
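Churn can be approximated directly from the version control history. The sketch below assumes a git repository in the working directory and sums the added and deleted lines per file reported by git log --numstat over the last 30 days; it is an approximation for illustration, not the Microsoft measure cited above.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;

// Rough sketch: approximate per-file code churn (lines added + deleted)
// over the last 30 days by parsing "git log --numstat" output.
// Assumes it runs inside a git working copy; binary files (reported as "-")
// are skipped.
public class CodeChurn {
    public static void main(String[] args) throws IOException, InterruptedException {
        Process git = new ProcessBuilder(
                "git", "log", "--since=30.days", "--numstat", "--pretty=format:")
                .redirectErrorStream(true)
                .start();

        Map<String, Long> churnPerFile = new HashMap<>();
        try (BufferedReader out = new BufferedReader(
                new InputStreamReader(git.getInputStream()))) {
            String line;
            while ((line = out.readLine()) != null) {
                String[] parts = line.split("\t");
                if (parts.length != 3 || parts[0].equals("-")) continue; // skip blanks/binaries
                long churn = Long.parseLong(parts[0]) + Long.parseLong(parts[1]);
                churnPerFile.merge(parts[2], churn, Long::sum);
            }
        }
        git.waitFor();

        churnPerFile.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .limit(10)
                .forEach(e -> System.out.println(e.getValue() + "\t" + e.getKey()));
    }
}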

2.10 Security, a Factor in Quality

If you think Java is relatively safe, just think about the yearly security reports beginning in 2010. Java became the main vehicle for malware attacks in the third quarter of 2010, when the attacks increased 14-fold, according to Microsoft's Security Intelligence Report Volume 10 [MICR]. In 2012, Kaspersky Lab, a leading anti-virus company, labeled it the year of Java vulnerabilities. Kaspersky reported that in 2012, Java security holes were responsible for 50% of attacks, while Windows components and Internet Explorer were exploited in only 3% of the recorded incidents [KASP]. Cisco's 2014 Annual Security Report puts the blame on Oracle's Java as a leading cause of security woes and reported that Java represented 91% of all indicators of compromise in 2013 [CISC]. Perhaps the main reason Java is such a target is the same reason why it is popular with enterprises and developers; namely, it is portable and works on any operating system. Moreover, patching a large Java installation, such as the JRE, is difficult, and there is the possibility that the patch could break functionality within an application.

Why focus on application security? Estimates from reliable sources report that anywhere from 70% to 90% of security incidents are due to application vulnerabilities. Moreover, only security built into the application itself stands a chance of preventing sophisticated attacks.

A report from the SANS Institute, "2015 State of Application Security: Closing the Gap", can provide a current general view of application software security [SANSa]. The report was driven by a survey given to 435 qualified respondents answering questions about application security and its practices. Respondents were divided into builders and defenders, with 35% being builders and 65% defenders. The most interesting and important of the questions focused on security standards, the shift of security responsibilities within development, the list of current practices, the risk of third party applications and the rate of repairs using secure development life-cycle practices. These topics will be discussed in Sections 2.10.1-2.10.5.

2.10.1 Security Standards

Many security standards and requirements frameworks have been developed in attempts to address risks to enterprise systems and their resident critical data. A survey question asked the participants to select the security standards or models followed by their organization. Ten standards or guidelines were explicitly listed, as seen in Figure 11. The Open Web Application Security Project (OWASP) Top 10, a community-driven, consensus-based list of the top 10 application security risks, with lists available for web and mobile applications, is by far the leading application security standard among the builders who participated in the survey [OWAS].


Figure 11: Application Security Standards in Use

The survey report provided a few reasons for the overwhelming reliance on the OWASP Top 10. First of all, the OWASP Top 10 is the shortest and simplest of the software security guidelines to understand since there are only 10 different areas of concern. Also, most static analysis and dynamic analysis security tools report vulnerabilities in OWASP Top 10 risk categories, making it easy to demonstrate compliance. The OWASP Top 10, like the MITRE/SANS Top 25 [MITR], is also referenced in several regulatory standards. After the OWASP Top 10, much more comprehensive standards, such as ISO/IEC 27034 and NIST 800-53/64, often required in government work, are used as security guidelines. Fewer institutions use the more general coding guidelines and process frameworks such as CERT Secure Coding Standards, Microsoft’s SDL and BSIMM (Building Security In Maturity Model).

The problem with standards and guidelines is that much of the effort has essentially become an exercise in reporting on compliance and has actually diverted security program resources from the constantly evolving attacks that must be addressed. The National Security Agency (NSA) recognized this problem and began an effort to prioritize a list of the controls that would have the greatest impact in improving risk posture against real-world threats. The SANS Institute coordinated the input and formulated the Critical Security Controls for Effective Cyber Defense [SANSb]. This compilation contains much valuable information, with a strong emphasis on "What Works": security controls where products, processes, architectures and services are in use that have demonstrated real-world effectiveness. Section 6 of the Critical Security Controls for Effective Cyber Defense report addresses Application Software Security (CSC 6) directly, with eleven suggestions to implement.

Many of the SANS survey respondents (47%) indicated that their application security program needed improvement. Some organizations rated themselves as above average; however, this may be because the recent slew of security breaches did not directly impact them, perhaps giving them a false sense of confidence.


2.10.2 Shift of Security Responsibilities within Development

The majority (59%) of builder respondents followed lightweight Agile or Lean methods (mostly Scrum), 14% still used the Waterfall method and a smaller percentage followed more structured development approaches. More of the surveyed organizations are considering adopting DevOps and SecDevOps practices and approaches to share the responsibility for making systems secure and functional among builders, operations and defenders. These methods are viewed as a radically different way of thinking about and doing application security. Currently, to produce secure code, most proceed externally through pen testing and compliance reviews. Many concur that defenders need to work collaboratively with builders and operations teams to embed iterative security checks throughout software design, development and deployment. The main takeaway is that "builders, defenders and operations should be sharing tools and ideas as well as responsibility for building and running systems, while ensuring the availability, performance and security of these systems" [SANSa]. Application security should be everyone's duty: eighty-four percent of successful security breaches have been accomplished through application software vulnerabilities [PUTM].

Within the development cycle, Agile and application security appear to possess conflicting goals. Developers work diligently to provide value and meet release deadlines, while security is concerned with the potential exposure and negative impact that applications can generate for the business and their customers. Agile developers adopt change as part of their development culture, but even adding a checkpoint for security could be perceived as an impediment to productivity, especially during a tight schedule. Many application builders are also unaware of inherent security issues in their code. Mandating code scanning for vulnerabilities and fixing the issues will not, by itself, create a culture that contributes to secure application development. Developers also need to be educated in best practices for producing secure code. Everyone must contribute to fine-tuning the process, determining the best points at which to perform security reviews and code scanning. "Getting security right means being involved in the software development process" [MCGR].

The Closing the Gap report [SANSa] listed four important areas to include throughout the development lifecycle for effective application security:

• "Design and build. Consider compliance and privacy requirements; design security features; develop use cases and abuse cases; complete attack surface analysis; conduct threat modeling; follow secure coding standards; use secure libraries and use the security features of application frameworks and languages.
• Test. Use dynamic analysis, static analysis, interactive application security testing (IAST), fuzzing, code reviews, pen testing, bug bounty programs and secure component life-cycle management.
• Fix. Conduct vulnerability remediation, root cause analysis, web application firewalls (WAF) and virtual patching and runtime application self-protection (RASP).
• Govern. Insist on oversight and risk management; secure SDLC practices, metrics and reporting; vulnerability management; secure coding training; and managing third-party software risk."

No indication was provided of how to include these in the various development lifecycles. In the Waterfall methodology, there is a one-to-one mapping.
However, in Agile, these must be adapted to be iterative and incremental.


2.10.3 List of Current Practices

In this section, the focus is on the list of current practices compiled from the builders' responses (Table 7). Risk assessment is the leading practice for all types of applications except web applications, and penetration testing is the second leading practice for internal apps. Applications are currently the biggest source of data breaches; NIST claims that 90% of security vulnerabilities exist at the application layer. Risk assessment, or analyzing how hackers might conduct attacks, can provide developers with a better idea of specific defenses. To ensure that an application has no weak points, penetration testing is used. The article "37 Powerful Penetration Testing Tools for Every Penetration Tester" is a good resource for identifying the scope and features of current penetration tools [SECU]. Practicing secure coding techniques is another method of keeping applications from getting hacked; a brief example follows Table 7. SANS Software Security offers a course designed specifically for Java, DEV541: Secure Coding in Java/JEE: Developing Defensible Applications (https://software-security.sans.org/course/secure-coding-java-jee-developing-defensible-applications).

Table 7: Builders’ Application Security Practices

2.10.4 Risk of Third Party Applications

The survey reports that 79% of the builder respondents use open source or third-party software libraries in their applications. This agrees with a 2012 CIO report that over 80% of a typical software application consists of open source components and frameworks consumed in binary form [OLAV]. The CIO report also details that many organizations regularly download software components and frameworks with known security vulnerabilities, even when newer, patched versions of those components or frameworks are available. Many of these contain such well-known vulnerabilities as HeartBleed, ShellShock, POODLE and FREAK. A thorough assessment must therefore be made when using or procuring third-party components and applications.
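As a rough illustration of such an assessment, the Java sketch below flags components whose name and version appear on an advisory list. The component names, versions and advisory entries are hypothetical placeholders; a real audit would rely on a software composition analysis tool fed by a live vulnerability feed rather than a hard-coded list.

import java.util.List;
import java.util.Map;

// Minimal sketch: flag third-party components whose name/version pair appears
// on a locally maintained advisory list. All data below is hypothetical.
public class DependencyAudit {

    record Component(String name, String version) {}

    public static void main(String[] args) {
        // Components consumed in binary form by the application (hypothetical).
        List<Component> dependencies = List.of(
                new Component("openssl-wrapper", "1.0.1f"),
                new Component("logging-lib", "2.3.0"));

        // Known-vulnerable versions, keyed by component name (hypothetical).
        Map<String, String> advisories = Map.of(
                "openssl-wrapper", "1.0.1f");

        for (Component dep : dependencies) {
            String badVersion = advisories.get(dep.name());
            if (badVersion != null && badVersion.equals(dep.version())) {
                System.out.printf("FLAG: %s %s matches a known advisory; upgrade before release.%n",
                        dep.name(), dep.version());
            }
        }
    }
}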


2.10.5 Rate of Repairs

In the survey, 26% of defenders took two to seven days to deploy patches to critical applications in use, another 22% took eight to thirty days, and 14% needed thirty-one days to three months to deploy patches satisfactorily (Figure 12). Serious security vulnerabilities should be repaired as quickly as possible; judging from the survey responses, most respondents need assistance in this effort.

Figure 12: Time to Deploy Patches

Developers require fundamental software security knowledge to understand a vulnerability and fix the code properly, test for regressions, and build and deploy the fix quickly. Perhaps more importantly, the vulnerability must be analyzed for its root cause, and that root cause must be addressed to break a vicious and likely dangerous cycle.

2.10.6 Other Code Security Strategies

There are other useful strategies to assist in developing secure code. Michael Howard provided lessons learned from five years of building more secure code at Microsoft [HOWA]. For security reviews, Microsoft ranks modules (code) by their potential for vulnerabilities and by their age. As the code base becomes larger, analysis tools are required, and they can help determine how much review and testing to apply. For example, if analyzing one component produces 10 warnings while analyzing a component of similar size yields 100 warnings, the second component is in greater need of review. The output of the analysis can be used to determine overall code riskiness. Microsoft and many other companies apply tools at check-in time to catch bugs early and then run them at frequent intervals to deal with any new issues quickly; they have learned that executing the tools only every few months leaves developers having to deal with hundreds of warnings at a time. For every security vulnerability identified, a root cause analysis is performed. The analyst also determines why an actual issue was not discovered by the tools. There are three possible reasons: the tool did not find the vulnerability; the tool found it but mistakenly triaged the issue as low priority; or the tool did find the issue and humans ignored it. Such an analysis allows the fine-tuning of the tools and their use. There is a great deal of manual work involved in security assessment, so automation should be pursued wherever possible. Build or buy tools that scan code and upload the results to a central site for analysis by security experts; some tools can even combine the results of different tool outputs.
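As a minimal illustration of the warning-count heuristic described above, the sketch below ranks components by static-analysis warning density (warnings per KLOC). The component names and counts are hypothetical; in practice the counts would come from the analysis tools run automatically at check-in.

import java.util.Comparator;
import java.util.List;

// Minimal sketch: rank components by static-analysis warning density so
// reviewers look at the riskiest code first. Names and numbers are hypothetical.
public class ReviewPriority {

    record ComponentStats(String name, int warnings, int kloc) {
        double warningsPerKloc() {
            return (double) warnings / kloc;
        }
    }

    public static void main(String[] args) {
        List<ComponentStats> components = List.of(
                new ComponentStats("billing", 10, 12),
                new ComponentStats("provisioning", 100, 12),  // similar size, 10x the warnings
                new ComponentStats("reporting", 40, 30));

        components.stream()
                .sorted(Comparator.comparingDouble(ComponentStats::warningsPerKloc).reversed())
                .forEach(c -> System.out.printf("%-14s %6.1f warnings/KLOC -> more review effort%n",
                        c.name(), c.warningsPerKloc()));
    }
}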

2.10.7 Design Vulnerabilities

Many security vulnerabilities are not coding issues at all but design issues; therefore, Microsoft mandates threat modeling and attack surface analysis as part of its Security Development Lifecycle (SDL) process. Among the lessons learned: “It's essential to build threat models to uncover potential design weaknesses and determine your software's attack surface. You need to make sure that all material threats are mitigated and that the attack surface is as small as possible.” Microsoft continues to review its products for features that are not secure enough for the current computing environment and deprecates those deemed insecure.

Design-level vulnerabilities are the hardest defect category to handle. Design-level problems accounted for about 50% of the security flaws uncovered during Microsoft's "security push" in 2002 [HOGL]. Unfortunately, ascertaining whether a program has design-level vulnerabilities requires great expertise, which makes finding such flaws not only difficult but also particularly hard to automate. Examples of design-level problems include error handling in object-oriented systems, object sharing and trust issues, unprotected data channels (both internal and external), incorrect or missing access control mechanisms, lack of auditing/logging or incorrect logging, and ordering and timing errors (especially in multithreaded systems). These sorts of flaws almost always lead to security risk. Security issues that are not syntactic or code related, such as business logic flaws, cannot be detected by scanning code and need to be identified through threat modeling and abuse case modeling.

Software security practitioners perform many different tasks to manage software security risks, including:
• creating security abuse/misuse cases;
• listing normative security requirements;
• performing architectural risk analysis;
• building risk-based security test plans;
• using static analysis tools;
• performing security tests;
• performing penetration testing in the final environment; and
• cleaning up after security breaches.

Three of these are closely linked: architectural risk analysis, risk-based security test planning, and security testing, because a critical aspect of security testing relies on probing security risks. If we hope to secure a system, it is important to also work on the architectural or design risk. Over the last few years, much progress has been made in static analysis and code scanning tools; the same cannot be said of architectural risk. There are some good process frameworks, such as Microsoft's STRIDE model, but to obtain the expected results these models require specialists and are difficult to turn into widespread practice. To assist in developing secure software during the design phase, the SwA Forum and Working Groups developed a pocket guide [SOFT] which includes the following topics:

• Basic Concepts
• Misuse Cases and Threat Modeling
• Design Principles for Secure Software
• Secure Design Patterns
  o Architectural-level Patterns
  o Design-level Patterns
• Multiple Independent Levels of Security and Safety (MILS)
• Secure Session Management
• Design and Architectural Considerations for Mobile Applications
• Formal Methods and Architectural Design
• Design Review and Verification
• Key Architecture and Design Practices for Mitigating Exploitable Software Weaknesses
• Questions to Ask Developers

The above activities, combined with secure coding techniques, will enable more secure and reliable software.

Section 3: Assessment of Development Methods and Project Data

In this report, five resources were used to provide comparisons and assessments of the development methodology and/or project data. These are discussed in Sections 3.1 to 3.5:

3.1 The Namcook Analytics Software Risk Master (SRM) tool (the estimation report is attached in Appendix A).
3.2 A table from a Crosstalk article by Capers Jones, modified for NPAC.
3.3 The scoring method of factors in software development from Software Engineering Best Practices by Capers Jones (also included as a separate spreadsheet in the Excel file).
3.4 A DevOps self-assessment by IBM (see the assessment results in Appendix B).
3.5 The list of “Thirty Software Engineering Issues that have Stayed Constant for Thirty Years.”

Additional information on software quality is contained in section 1.5 of this report.

3.1 The Namcook Analytics Software Risk Master (SRM) tool

The Namcook Analytics Software Risk Master (SRM) tool predicts requirements size in terms of pages, words and diagrams. It also predicts requirements bugs or defects and “toxic requirements” which should not be included in the application; a toxic requirement is one that causes harm later in development and/or maintenance. Reproduced below from the Namcook website are samples of typical SRM predictions for software projects of 1,000 function points (Tables 8 and 9) and 10,000 function points (Tables 10 and 11).

Table 8: Metrics for Projects with 1,000 Function Points

Requirements creep to reach 1,000 function points = 149
Requirements pages = 275
Requirements words = 110,088
Requirements diagrams = 180
Requirements completeness = 91.44%
Requirements reuse = 25%
Requirements bugs = 169
Toxic requirements = 4
Requirements test cases = 667
Reading days (1 person) = 4.59
Amount one person can understand = 93.27%
Financial risks from delays, overruns = 22.33%

Table 9: 1,000 Function Points Requirements Methods

Interviews
Joint Application Design (JAD)
Embedded users
UML diagrams
Nassi-Schneidewind diagrams
FOG or FLESCH readability scoring
IBM Doors or equivalent
Requirement inspections
Agile
Iterative
Rational (RUP)
Team Software Process (TSP)

Table 10: Metrics for Projects with 10,000 Function Points

Requirements creep to reach 10,000 function points = 2,031
Requirements pages = 2,126
Requirements words = 850,306
Requirements diagrams = 1,200
Requirements completeness = 73.91%
Requirements reuse = 17%
Requirements bugs = 1,127
Toxic requirements = 29
Requirements test cases = 5,472
Reading days (1 person) = 35.43
Amount one person can understand = 12.08%
Financial risks from delays, overruns = 42.50%

Table 11: 10,000 Function Points Requirements Methods

Focus groups
Joint Application Design (JAD)
Quality Function Deployment (QFD)
UML diagrams
State change diagrams
Flow diagrams
Nassi-Schneidewind charts
Dynamic, animated 3D requirements models
FOG or FLESCH readability scoring
IBM Doors or equivalent
State change diagrams
Text static analysis
Requirements inspections
Automated requirements models (RUP)
Team Software Process (TSP)

In general, “greenfield requirements” for novel applications are more troublesome than “brownfield” requirements, which are frequently replacements for aging legacy applications whose requirements are well known and understood. In total, requirements bugs or defects account for approximately 20% of the bugs entering the final released application. “Requirements bugs are resistant to testing and the optimal methods for reducing them include requirements defect prevention and pre-test requirements inspections. The use of automated requirements models is recommended. The use of automated requirements static analysis tools is recommended. The use of tools that evaluate readability such as the FOG and FLESCH readability scores is recommended.” The preceding quote and the tables are from [JONEc].

Dolores Zage registered as a user on the Namcook.com site and was able to use the SRM demo application on the website. Figure 13 below lists the input that was selected to produce the estimation report. Average settings were used for project staffing details, which correspond to an even mix of experts and novices. For the project scope, standalone PC had to be selected because other settings caused a PHP error. Interestingly, no size factor was requested. With the given knowledge of the project and the limitations of the application, the SRM tool calculated that NPAC would be 465.06 function points, or about 53,330 LOC.

SOFTWARE TAXONOMY AND PROCESS ASSESSMENT REPORT

General Information:

Today's date - 08/18/2015
Industry or NAIC Code - telecommunications
Company - BSU
Country - IN, USA
Project Start Date - 18-AUG-2015
Planned Delivery Date - Unknown
Actual Delivery Date - Unknown
Project Name - numbers
Data Provided By - Dolores
Project Manager - NA

Project Staffing Details:

Project Staffing Schedule - Normal staff; normal schedule
Client Project Experience - Average experienced clients
Project Management Experience - Average experienced management
Development Team Experience - Even mix of experts and novices
Methodology Experience - Even mix of experts and novices
Programming Language Experience - Even mix of experts and novices
Hardware Platform Experience - Even mix of experts and novices
Operating System Experience - Even mix of experts and novices
Test Team Experience - Even mix of experts and novices
Quality Assurance Team Experience - Even mix of experts and novices
Customer Support Team Experience - Even mix of experts and novices
Maintenance Team Experience - Even mix of experts and novices

Project Taxonomy Input:

Project Nature - New software application development
Work Hours per month - 132
Project Scope - Standalone program: PC
Project Class - External program developed under government contract (civilian)
Primary Project Type - Communications or telecommunications
Secondary Project Type - Service oriented architecture (SOA)
Problem Complexity - Majority of avg, a few complex problems, algorithms - 7
Code Complexity - Fair structure with some large modules - 7
Data Complexity - More than 20 files, large complex data interactions - 11
Development Methodology - Agile, Internally Created
Development Methodology Value - 10
Current CMMI Level - Level 4: Managed
Primary Programming Language - Java - 6.00 - 90%
Secondary Programming Language - SQL - 25.00 - 10%
Effective Programming Level - 7.9
Number of maintenance sites - 1
Number of initial client sites - 80
Annual growth of client sites - 15
Number of application users - 1000
Annual growth of application users - 10

Testing Methodologies:

Defect Prevention - QFD
Pretest Removal - Desk Check; Static Analysis; Inspections
Test Removal - Unit; Function; Regression; Component; Performance; System; Acceptance


Figure 13: SRM Tool Settings

The entire estimation report is in Appendix A. Based on the pretest removal and test removal selections in the tool, the defect removal efficiency was 99.4%, as reported in the estimation report.

3.2 Crosstalk Table

The data for Table 12 stem from approximately 600 companies, of which 150 are Fortune 500 companies. The table divides projects into excellent, average and poor categories. All of the projects are 1,000 function points in size and coded in Java. These data can be extrapolated for comparisons with NPAC data. For convenience, the data were transferred to an Excel spreadsheet into which the NPAC data can be inserted. The closer the NPAC data align with the excellent category, the higher the probability that NPAC can be rated as excellent.

Note: Extra explanations are denoted below Table 12 for the numbers in parentheses within the table cells.

Table 12: Comparisons of Excellent, Average and Poor Software Results

Topics Excellent Average Poor NPAC

Project Info
Size in function points 1000 1000 1000 (1)

Programming Language Java Java Java Java
Language level 6.25 6.0 5.75 (2)


Source statements per function point 51.2 53.33 55.75 (3)

Certified reuse percent 20% 10% 5% (4)

Quality
Defect per function point 2.82 3.47 4.27 4.95 (5)

Defect Potential 2818 3467 4266 (6)

Defects per KLOC 55.05 65.01 76.65

Defect Removal Efficiency 99% 90% 83%

Delivered Defects 28 347 725

High Severity Defects 4 59 145

Security Vulnerabilities 2 31 88

Delivered per function point .03 0.35 0.73

Delivered per KLOC .55 6.5 13.03

Key Quality Control Methods
Formal estimates of defects YES NO NO

Formal inspections of deliverables YES NO NO

Static Analysis of Code YES YES NO

Formal Test Case Design YES YES NO

Testing by certified test personnel YES NO NO

Mathematical test case design YES NO NO

Project Parameter Results
Schedule in calendar months 12.02 13.8 18.2

Technical staff + management 6.25 6.67 7.69

Effort in staff months 75.14 92.03 139.96

Effort in staff hours 9919 12147 18477

Cost in dollars $751,415 $920,256 $1,399,770

Cost per function point $751.42 $920.26 $1,399.77

Cost per KLOC $14,676 $17,255 $25,152

Productivity Rates
Function points per staff month 13.31 10.87 7.14

Work hours per function point 9.92 12.15 18.48

Lines of code per staff month 681 580 398

Cost Drivers
Bug repairs 25% 40% 45%

Paper documents 20% 17% 20%

Code Development 35% 18% 13%

Meetings 8% 13% 10%

Management 12% 12% 12%

Total 100% 100% 100%

Methods, Tools, Practices
Development Methods TSP/PSP (7) Agile Waterfall

Requirements Methods JAD Embedded Interview

CMMI Levels 5 3 1

Work hours per month 132 132 132

Unpaid overtime 0 0 0

Team experience experienced average inexperienced

Formal risk analysis YES YES NO

Formal quality analysis YES NO NO

Formal change control YES YES NO

Formal sizing of project YES YES NO

Formal reuse analysis YES NO NO

Parametric estimation tools YES NO NO

Inspections of key materials YES NO NO

Accurate status reporting YES YES NO

Accurate defect tracking YES NO NO

More than 15% certified reuse YES MAYBE NO

Low cyclomatic complexity YES MAYBE NO

Test coverage > 95% YES MAYBE NO

Notes on cell contents

1. Function point count (FP)
2. Language level – years of experience in Java
3. KLOC/FP
4. Reuse percentage
5. See Section 3.2.1, Ranges of Software Development Quality
6. Defect potential – using the commercial application type data: 4.95 * FP
7. PSP (Personal Software Process) provides a standard personal process structure for software developers. TSP (Team Software Process) is a guideline for software product development teams; it focuses on helping development teams improve their quality and productivity to better meet their goals for cost and progress. (Watts Humphrey; a precursor to DevOps)

3.2.1 Ranges of Software Development Quality

Given the size and economic importance of software, one might think that every industrialized nation would have accurate data on software productivity, quality, and demographics. Such data do not seem to exist; there are no effective national averages for any software topic. A national repository of software quality data would be very useful to compare against, but it does not exist. One reason is that quality data are more difficult to collect than productivity data: defects are identified by many separate development activities, such as requirements reviews, static analysis, desk checking, and testing, and these counts are not always included in the quality data. Currently, the best data on software productivity and quality tend to come from companies that build commercial estimation tools and companies that provide commercial benchmark services, all of which are fairly small companies. If you combine the data from all of the 2014 software benchmark groups, such as Galorath, ISBSG, Namcook Analytics, Price Systems, Q/P Management Group, Quantimetrics, QSM, Reifer Associates and Software Productivity Research, the total number of projects is about 80,000. However, all of these are competing companies, and with a few exceptions, such as the recent joint study by ISBSG, Namcook, and Reifer, the data are not shared or compared and are not always consistent.

The data in Tables 13 and 14 are compiled quality data from Namcook, covering about 20,000 projects; they are approximate average values for software quality aggregated by application size and type.

Table 13: Quality Data Based on Project Size in Function Points

Size (FP)    Defect Potential (per FP)    Removal Efficiency    Defects Delivered (per FP)
1            1.50                         96.93%                0.05
10           2.50                         97.50%                0.06
100          3.00                         96.65%                0.10
1,000        4.30                         91.00%                0.39
10,000       5.25                         87.00%                0.68
100,000      6.75                         85.70%                0.97
Average      3.88                         92.46%                0.37


Table 14: Quality Data Based on Project Type

Type                  Defect Potential (per FP)    Removal Efficiency    Defects Delivered (per FP)
Domestic outsource    4.32                         94.50%                0.24
IT projects           4.62                         92.25%                0.36
Web projects          4.64                         91.30%                0.40
Systems/embedded      4.79                         98.30%                0.08
Commercial            4.95                         93.50%                0.32
Government            5.21                         88.70%                0.59
Military              5.45                         98.65%                0.07
Average               4.94                         93.78%                0.30

The values in Tables 13 and 14 vary by application size and by application type. Many suggest that for national average purposes, the values by type are more meaningful than those by size, since there are very few applications larger than 10,000 function points, so these large sizes distort average values. Viewed across industries, the 2014 defect potential averages about 4.94 per function point, defect removal efficiency averages about 93.78%, and delivered defects average about 0.30 per function point. Defect potentials range from about 1.25 to about 7.50 per function point, and defect removal efficiency ranges from a high of 99.65% to a low of below 77.00%.

3.3 Scoring Method of Methods, Tools and Practices in Software Development

Software development and software project management offer dozens of methods and hundreds of tools and practices. Which ones should be used? One approach is to evaluate and rank the many different factors on a scale. The Excel file containing Table 12 includes another spreadsheet listing the various methods and practices scored on a scale that ranges from +10 to -10. A score of +10 implies that a factor is very beneficial to the quality and productivity of a project; -10 indicates that it is very detrimental. An average value is given based on the size and type of projects. The data for the scoring stem from observations among about 150 Fortune 500 companies, 50 smaller companies, and 30 government organizations; the negative scores also include data from 15 lawsuits. The actual values are not as important as the distribution into the various categories. Using this method, one can display the range of impact of using the various methods, tools and practices together.
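A minimal sketch of how such a scorecard might be tallied is shown below. The practice names and scores are hypothetical placeholders rather than values from the Jones spreadsheet; the point is only that summing or averaging the scores of the practices a project actually uses gives a rough view of their combined impact.

import java.util.Map;

// Minimal sketch of the +10 to -10 scoring idea: combine the scores of the
// practices a project actually uses. Names and scores are hypothetical.
public class PracticeScorecard {

    public static void main(String[] args) {
        Map<String, Integer> adoptedPractices = Map.of(
                "Static analysis", 9,
                "Formal inspections", 9,
                "Parametric estimation", 7,
                "Inadequate defect tracking", -8);

        int net = adoptedPractices.values().stream().mapToInt(Integer::intValue).sum();
        double average = adoptedPractices.values().stream()
                .mapToInt(Integer::intValue).average().orElse(0.0);

        System.out.printf("Net impact score: %d, average per practice: %.1f%n", net, average);
    }
}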

3.4 DevOps Self-Assessment by IBM

A self-assessment of DevOps practices was also performed using the IBM DevOps self-assessment (http://www.surveygizmo.com/s3/1659087/IBM-DevOps-Self-Assessment). A copy of the questions, the answers provided and the results is in the file DevOps+Self+Assessment+Results.pdf and is reproduced in Appendix B.

Based on the answers to the questions in the assessment, each DevOps practice is rated as scaled, reliable, consistent or practiced (as seen in Figure 14) in the six areas of Design, Construct, Configuration Management, Build, Test and Assess Quality.


Figure 14: Levels of Achievement in DevOps Practices

3.5 Thirty Software Engineering Issues that Have Stayed Constant for Thirty Years

In [JONEb], we find the following persistent issues in software engineering:

1. Initial requirements are seldom more than 50% complete.
2. Requirements grow at about 2% per calendar month during development.
3. About 20% of initial requirements are delayed until a second release.
4. Finding and fixing bugs is the most expensive software activity.
5. Creating paper documents is the second most expensive software activity.
6. Coding is the third most expensive software activity.
7. Meetings and discussions are the fourth most expensive activity.
8. Most forms of testing are less than 35% efficient in finding bugs.
9. Most forms of testing touch less than 50% of the code being tested.
10. There are more defects in requirements and design than in source code.
11. There are more defects in test cases than in the software itself.
12. Defects in requirements, design, and code average 5.0 per function point.
13. Total defect removal efficiency before release averages only about 85%.
14. About 15% of software defects are delivered to customers.
15. Delivered defects are expensive and cause customer dissatisfaction and technical debt.
16. About 5% of modules in applications will contain 50% of all defects.
17. About 7% of all defect repairs will accidentally inject new defects.
18. Software reuse is only effective for materials that approach zero defects.
19. About 5% of software outsource contracts end up in litigation.
20. About 35% of projects > 10,000 function points will be cancelled.
21. About 50% of projects > 10,000 function points will be one year late.
22. The failure mode for most cost estimates is to be excessively optimistic.
23. Productivity rates in the U.S. are about 10 function points per staff month.
24. Assignment scopes for development are about 150 function points.
25. Assignment scopes for maintenance are about 750 function points.
26. Development costs about $1200 per function point in the U.S. (range < $500 to > $3000).
27. Maintenance costs about $150 per function point per calendar year.
28. After delivery, applications grow at about 8% per calendar year during use.
29. Average defect repair rates are about 10 bugs or defects per month.
30. Programmers and managers need about 10 days of annual training to stay current.


3.6 Quality and Defect Removal

There are various definitions of quality; a common definition in software engineering is conformance to requirements. However, requirements themselves can have defects, and some requirements are even labeled as toxic. There are other “ility” qualities such as maintainability and reliability, but these cannot be measured directly. This is why quality comes down to the absence of defects, which leaves two powerful metrics for understanding software quality: 1) software defect potentials and 2) defect removal efficiency (DRE). The phrase “software defect potentials” was first used in IBM circa 1970. Defect potential is the total of bugs or defects likely to be found in all software deliverables, such as the requirements, architecture, design, code, user documents, test cases and bad fixes. The quality benchmarks for defect potentials on leading projects are < 3.00 per function point, combined with defect removal efficiency levels that average > 97% for all projects and > 99% for mission-critical projects.

The DRE metric was also developed at IBM in the early 1970s, at the same time IBM was developing formal inspections. The concept of DRE is to track all defects found by the development team and then compare those to post-release defects reported by customers in a fixed time period after the initial release (normally 90 days). If the development team found 900 defects prior to release and customers reported 100 defects in the first three months, then the total volume of bugs was an even 1,000, so defect removal efficiency is 90%.
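Stated as a formula (restating the example above, not adding to it):

\[
\mathrm{DRE} = \frac{\text{defects found before release}}{\text{defects found before release} + \text{defects reported in the first 90 days}} = \frac{900}{900 + 100} = 90\%.
\]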

The U.S. average circa 2013 for DRE is just a bit over 85%. Testing alone is not sufficient to raise DRE much above 90%. To approach or exceed 99% in DRE, it is necessary to use a synergistic combination of pre-test static analysis and inspections combined with formal testing using mathematically designed test cases, ideally created by certified test personnel. DRE can also be applied to defects found in other materials such as requirements and design. Requirements, architecture, and design defects are resistant to testing and, therefore, pre-test inspections of requirements and design documents should be used for all major software projects. Table 15 illustrates current ranges for defect potentials and defect removal efficiency levels in the United States circa 2013 for applications in the 1,000 function point size range:
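One way to see why a combination of removal activities is needed: if each stage is assumed to act only on the defects that escaped the previous stage (which is how the subtotals in the Appendix A estimate compose), the cumulative efficiency of a series of stages can be sketched as

\[
\mathrm{DRE}_{\text{cumulative}} = 1 - \prod_{i}\left(1 - \mathrm{DRE}_{i}\right).
\]

Using the Appendix A figures as an example, defect prevention (26%), pre-test removal (96%) and test removal (82%) combine to \(1 - (0.74 \times 0.04 \times 0.18) \approx 99.5\%\), consistent with the 99.4% cumulative removal efficiency reported there; the small difference comes from bad-fix injections and rounding.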

Table 15: Software Defect Potentials and Defect Removal Efficiency

Defect Origins            Defect Potential (per FP)    Defect Removal    Defects Delivered (per FP)    % of Total
Requirements defects      1.00                         75.00%            0.25                          31.15%
Design defects            1.25                         85.00%            0.19                          23.36%
Test case defects         0.75                         85.00%            0.11                          14.02%
Bad fix defects           0.35                         75.00%            0.09                          10.90%
Code defects              1.50                         95.00%            0.08                          9.35%
User document defects     0.60                         90.00%            0.06                          7.48%
Architecture defects      0.30                         90.00%            0.03                          3.74%
TOTAL                     5.75                         85.00%            0.80                          100.00%
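The columns of Table 15 compose multiplicatively: for each origin, the delivered defects per function point are the defect potential times the fraction of defects that escape removal. For the requirements row, for example,

\[
\text{Delivered per FP} = \text{Defect potential} \times \left(1 - \mathrm{DRE}\right) = 1.00 \times (1 - 0.75) = 0.25.
\]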

3.6.1 Error-Prone Modules (EPM)

One of the most important findings by IBM in the early 1970s was that errors were not randomly distributed through all modules of large systems, but tended to cluster in a few modules, which were termed “error-prone modules.” For example, 57% of customer-reported bugs in the IBM IMS database application were found in 32 modules out of a total of 425; more than 300 IMS modules had zero defect reports. A Microsoft study found that fixing 20% of the bugs would eliminate 80% of system crashes. Other companies have replicated these findings, and error-prone modules are an established fact of large systems.

3.6.2 Inspection Metrics

One of the merits of formal inspections of requirements, design, code, and other deliverables is the suite of standard metrics that are part of the inspection process. Inspection data routinely include preparation effort; inspection session team size and effort; defects detected before and during inspections; defect repair effort after inspections; and the calendar time of the inspections for specific projects. These data are useful in comparing the effectiveness of inspections against other methods of defect removal such as pair programming, static analysis, and various forms of testing. To date, inspections have the highest defect removal efficiency (> 85%) of any known form of software defect removal.

3.6.3 General Terms of Software Failure and Software Success

The terms “software failure” and “software success” are ambiguous in the literature. Here is Capers Jones’ definition of “success”, in which he attempts to quantify the major issues troubling software [JONEd]. Success means:
• < 3.00 defects per function point;
• > 97% defect removal efficiency;
• > 97% of valid requirements implemented;
• < 10% requirements creep;
• 0 toxic requirements forced into the application by unwise clients;
• > 95% of requirements defects removed;
• a development schedule achieved within +/- 3% of a formal plan; and
• costs achieved within +/- 3% of a formal parametric cost estimate.

Section 4: Conclusions and Project Take-Aways

We found many excellent suggestions for enabling teams to deliver quality software, but not all things will work for all teams.

4.1 Process

1. Changing processes leads to differences in software quality.

2. Mixing the distinguishing characteristics of high-performing Agile and DevOps cultures can lead to rapid delivery and maximized outcomes through collaborative performance. (Section 1.1)

3. The more collaborative the process becomes, the easier it is to attain item 2. Agile and DevOps are based on teamwork and cooperation. (Section 1.1) Make the process visible and available to all teams, with delivery tasks and trends (metrics) available to everyone. Raise awareness of product quality: everyone is responsible and owns the trends.

4. The key points for a high-performing DevOps culture are: (Section 1.2.1)
• Deploy daily – decouple code deployment from feature releases.
• Handle all non-functional requirements (performance, security) at every stage.
• Exploit automated testing to catch errors early and quickly – 82% of high-performance DevOps organizations use automation [PUPP].
• Employ strict version control – version control in operations has the strongest correlation with high-performing DevOps organizations [PUPP]. Save all products into a software configuration management (SCM) system, making them readily available, merging contributions by multiple authors, and determining where changes have been made. Along with an SCM system, use a configuration management system. (Section 1.2.4)
• Implement end-to-end performance monitoring and metrics.
• Perform peer code review.
• Use collaborative code review platforms such as Gerrit, CodeFlow, ReviewBoard, Crucible or SmartBear; review against coding standards first and apply checklists. (Sections 1.3, 1.3.1)
• Apply a separate checklist for security. (Section 1.3.2)
• Use static analysis tools such as Findbugs (byte code), PMD (source code) and CheckStyle. Note that SonarQube can take the output from these tools and present it; SonarQube also has an indicator of poor design that is available before “human reviews”. (Section 1.3)
• Monitor the code review process. (Section 1.3.3)
• Allocate more cycle time to the reduction of technical debt.
• Reviews assist in identifying evolvability defects, which make code harder to understand and maintain. (Section 1.3.4)
• Agile requires refactoring; refactoring and technical debt are linked.

5. The key success word for Agile is continuous: continuous testing, planning, iteration, integration and improvement, resulting in continuously delivering tested, working software.

6. As test coverage increases, both predictability and quality increase, and automation can promote greater coverage. (Section 1.2.2.3) A raw code coverage number is only relevant when it is too low and requires further analysis when it is high: determine what is not covered and why. Multiple studies show that about 85% of defects in production could have been detected by simply testing all possible 2-way combinations of input parameters; a free testing tool is available from NIST (the Advanced Combinatorial Testing System).

7. Review high risk code and high risk changes for both vulnerabilities and defects. (Section 1.3.6)

8. Integrate QA into the development process. This fosters the collaboration outlined in item 2. (Section 1.5)

9. Groom the product backlog. Many development teams do not have a ready, usable product backlog. Over 80% of teams have user stories for their product backlog, but less than 10% find them acceptable. A product backlog in a high ready state can dramatically (by as much as 50%) improve a team's velocity. (Section 1.6)

10. AUTOMATE, AUTOMATE, AUTOMATE …
• When the same deployment tools are used for all development and test environments, errors are detected and fixed early.
• Studies have determined that there is no single best tool, underscoring the fact that quality is based on practices, not on the exact tool. Tools can make a team more productive and collaborative, and can enforce a practice. (Section 1.2.2.3)
• More than 80% of high-performing software development organizations rely on automated tools for infrastructure management and deployment. Automated testing (checking) is a factor in quality production environments.

11. Develop a defect prevention strategy.
• Defect logging and documentation provide the key parameters for analysis (root cause -> preventive actions -> improvement) and measurement.
• Defect removal efficiency (DRE) must be over 85%, and ideally closer to 95%. (Section 2.7)
• From both an economic and a quality standpoint, defect prevention and testing are both necessary to achieve lower costs, shorter schedules and low levels of defects.
• Conduct dynamic appraisals through functional and performance testing. Coverage, coverage and more coverage.
• As of 2015 there are more than 20 forms of testing. The assumed test stages include 1) unit test, 2) function test, 3) regression test, 4) component test, 5) usability test, 6) performance test, 7) system test, and 8) acceptance test. Most forms of testing have only about a 35% DRE, so at least 8 types of testing are needed to top 80% in overall testing DRE.
• Static appraisals can eliminate 40-50% of the coding defects. (Section 1.3)
• Defects do not originate only in code. Only 35% of the total defect potential originates from the code; requirements account for 20%, design 25%, documents 12%, and bad fixes another 8%.

4.2 Product Measurements

12. No single metric can provide a complete quality measure, and selecting the set of metrics that provides the essential quality coverage is also impossible.

13. It is important to understand that quality needs to be improved faster and to a higher level than productivity in order for productivity to improve. The major reason for this link is that identifying and fixing defects is overall the most expensive activity in software development. Quality leads and productivity follows. Attempting to improve productivity without first improving quality will not be successful.

14. If only one quality aspect of development is measured, it should be defects. Defects are at the core of the customer's and external reviewer’s value perception. Released defect levels are a product of defect potentials and defect removal efficiency. The phrase "defect removal efficiency" refers to one of the most powerful of all quality metrics. Fixing bugs on the same day as they are discovered can double the velocity of a team. (Section 2.7)

15. Collect just enough feedback to respond to unexpected events, and change the process as needed. Metrics on the number of test runs and passes, code coverage metrics and defect metrics should be reviewed to ensure that the code is providing the value required. The SonarQube default quality setting tracks the seven deadly sins in bad code: bad distribution of complexity, duplications, lack of comments, coding rules violations, potential bugs, no unit tests or useless ones, and bad design. (Section 2.7)

16. Next concentrate on performance and security. These features are externally visible to customers and the public. Defective and slow software will make customers demand a new product. Security problems will lead to headlines in the news. (Section 2.9)

• Monitoring the performance of complex applications requires a tool such as Dynatrace.
• Use a monitoring tool such as Java Mission Control to find memory leaks, collection inefficiencies and locked threads.
• Use code profilers such as YourKit or JProfiler to identify and remove bottlenecks.

17. Another quality aspect on the radar should be maintainability. Maintainability is a quality attribute listed in hundreds of quality models. The system will be used and updated for an extended time, and it should not become more difficult and expensive to maintain. As new services (features) are added, they should be added at a reasonable cost and also be testable. Symptoms of poor maintainability are unnecessary complexity, unnecessary coupling, redundancy, and a software model that does not reflect the actual or physical model. Note that many of these symptoms already appear among the seven sins of bad code. (Sections 2.7 and 1.3)

18. Numbers are not as important as trends (Section 2.6).

Acknowledgements

The authors thank Chris Drake, Michael Iacovelli and Frank Schmidt at iconectiv for the valuable insights and suggestions regarding this work that they shared with us through numerous teleconferences during the summer of 2015. This research is also supported by the National Science Foundation under Grant No. 1464654.

Appendix A Namcook Analytics - Estimation Report

Copyright © by Namcook Analytics. All rights reserved.

Web: www.Namcook.com

Blog: http://Namcookanalytics.com

Part 1: Namcook Analytics Executive Overview

Project name: numbers

Project manager: NA

Key Project Dates:

Today's Date: 08-18-15
Project start date: 08-18-15
Planned delivery date: 08-17-16
Predicted delivery date: 07-07-16
Planned schedule months: 12.01
Actual schedule months: 10.82
Plan versus actual: -1.19


Key Project Data:

Size in FP: 465.06
Size in KLOC: 53.33
Total Cost of Development: $375,586.16

Part 2: Namcook Development Schedule

Project Development

Staffing Effort Schedule Project $ per Wk Hrs

Months Months Costs Funct. Pt. per FP

Requirements 1.32 4.70 3.56 $46,994.93 $101.05 1.33 Design 1.76 6.66 3.78 $66,576.16 $143.16 1.89 Coding 3.30 9.53 2.89 $95,340.54 $205.01 2.71 Testing 2.97 7.01 2.36 $70,078.58 $150.69 1.99 Documentation 0.91 1.45 1.60 $14,525.71 $31.23 0.41 Quality Assurance 0.83 1.82 2.19 $18,157.13 $39.04 0.52 Total Project 0.87 6.39 10.82 $63,913.11 $137.43 1.81 3.47 37.56 16.38 $375,586.16 $807.61 10.66

Gross schedule months 16.38 Overlap % 0.66 Predicted Delivery Date 10.82 07-07-16 Client target delivery schedule and date 12.01 08-17-16

Difference (predicted minus target) -1.19

Odds 70% Odds 50% Odds 10% 05-13-16 03-28-16 02-12-16

Features deferred to meet Productivity FP/Month LOC/Month WH/Month schedule:


Productivity (+ 12.38 660.39 10.66 Function Pts. (38) reuse) Productivity (- 10.11 539.07 13.06 Lines of code (2,008) reuse) % deferred -9.92%

Estimates for User Development Activities

User Activities Staffing Schedule Effort Costs $ per FP User requirements team: 0.72 5.41 3.87 $0 $0.00 User prototype team: 0.58 2.70 1.57 $0 $0.00 User change control team: 0.62 10.82 6.71 $0 $0.00 User acceptance test team: 1.03 1.62 1.68 $0 $0.00 User installation team: 0.81 0.81 0.66 $0 $0.00 0.75 4.27 14.48 $0 $0.00

Number of Initial Year 1 Users: 1,000 12.00

Number of users needing training: 900 0.05 41.86 $0 $0.00 TOTAL USER COSTS 56.34 $0 $0.00

$ per function point $0.00 % of Development 0.00%

Staffing by Occupation

Occupation Normal Peak

Groups Staffing Staffing

1 Programmers 4 5
2 Testers 3 5
3 Designers 1 2
4 Business analysts 1 2
5 Technical writers 1 1
6 Quality assurance 1 1
7 1st line management 1 2
8 Data base administration 0 0
9 Project office staff 0 0
10 Administrative staff 0 0
11 Configuration control 0 0
12 Project librarians 0 0
13 2nd line managers 0 0
14 Estimating specialists 0 0
15 Architects 0 0
16 Security specialists 0 0
17 Performance specialists 0 0
18 Function point specialists 0 0
19 Human factors specialists 0 0
20 3rd line managers 0 0
TOTAL 14

Risks                 Odds
Cancellation          12.81%
Negative ROI          16.23%
Cost Overrun          14.09%
Schedule Slip         17.08%
Unhappy Customers     36.00%
Litigation            5.64%
Average Risk          16.98%
Financial Risk        23.65%

Less than 15% = Acceptable

15% - 35% = Caution

Greater than 35% = Danger

Part 3: Namcook Quality Predictive Outputs


Software Quality

Defect Potentials                    Potential
Requirements defect potential        380
Design defect potential              365
Code defect potential                572
Document defect potential            79
Total Defect Potential               1,396

Defect Prevention    Efficiency    Remainder    Bad Fixes
JAD                  0%            1,396        0
QFD                  27%           1,019        10
Prototype            0%            1,029        (0)
Models               0%            1,029        0
Subtotal             26%           1,029        10

Pre-Test Removal               Efficiency    Remainder    Bad Fixes
Desk check                     26%           761          21
Pair programming - not used    0%            782          21
Static analysis                55%           361          10
Inspections                    88%           45           1
Subtotal                       96%           46           53

Test Bad Test Per Per Test Efficiency Remainder Removal Fixes Cases KLOC FP scripts Test Planning

Unit 31% 32 1 480 19 1 66
Function 33% 22 1 522 21 1 69
Regression 12% 20 1 235 9 1 46
Component 30% 14 1 313 13 1 53
Performance 11% 13 0 157 6 0 38
System 34% 9 0 496 20 1 67


Acceptance 15% 8 0 106 4 0 31
Subtotal 82% 8 4 2,308 93 5 144

Defects delivered 8
High severity 1
Security flaws 1
High severity % 14.92%
Deliv. per FP 0.02
High sev per FP 0.00
Security flaws per FP 0.00
Deliv. per KLOC 0.34
High sev per KLOC 0.05
Security flaws per KLOC 0.02

Cumulative Removal Efficiency 99.40%

Defect prevention costs $40,832.31
Pre-Test Defect Removal Costs $71,988.87
Testing Defect Removal Costs $140,837.96
Total Development Defect Removal Costs $253,659.14
Defect removal % of development 67.54%
Defect removal per FP 545.43
Defect removal per KLOC 10,226.87
Defect removal per defect 110.49
Three-year Maintenance Defect Removal Costs $60,769.16
TCO defect removal costs $314,428.29
Defect removal % of TCO 0.25%
Reliability (days to first defect) 29.54
Reliability (days between defects) 198.03
Customer satisfaction with software 96.42%


Part 4: Namcook Maintenance and Cost Benchmarks

Maintenance Summary Outputs for three years

Year of first release 2016 Application size at first release - function points 465 Application growth (three years) - function points 586 Application size at first release - lines of code 24803 Application growth (three years) - lines of code 31245 Application users at first release 1000 Application users after three years 1359233 Incidents after three years 4857

Staff Effort Cost per Cost per

Three-Year Totals Personnel Months Costs Function Pt. Function Pt. 1,000 1,260

Management 43.93 1581.57 7907862.60 17003.96 13498.29 Customer support 658.43 23703.33 118516657.07 254841.65 202301.52 Enhancement 0.43 15.43 77142.54 165.88 131.68 Maintenance 0.34 12.15 60769.16 130.67 103.73 TOTAL 703.12 25312.49 126562431.37 272142.16 216035.22

Namcook Total Cost of Ownership Benchmarks

Staffing Effort Costs $ per FP % of TCO

at release

Development 3.47 38 $375,586.16 $807.61 Cost per Maintenance Mgt. 43.93 1582 $7,907,862.60 $17,003.96 6.23% Customer support 658.43 23703 $118,516,657.07 $254,841.65 93.37% Enhancement 0.43 15 $77,142.54 $165.88 0.06% Maintenance 0.34 12 $60,769.16 $130.67 0.05% User Costs 0.75 14 $0.00 $0.00 0.00%


Total TCO 707.35 25365 $126,938,017.53 $272,949.76 100.00%

Part 5: Additional Data Points

Note: Namcook Analytics uses SRM and IFPUG function points as default values.

This section provides application size in 21 metrics.

Alternate Size Metrics                        Size     % of IFPUG
1 IFPUG 4.3                                   465      100.00%
2 Automated code based function points        498      107.00%
3 Automated UML based function points         479      103.00%
4 Backfired function points                   465      100.00%
5 COSMIC function points                      558      120.00%
6 Fast function points                        451      97.00%
7 Feature points                              465      100.00%
8 FISMA function points                       474      102.00%
9 Full function points                        544      117.00%
10 Function points light                      449      96.50%
11 IntegraNova function points                507      109.00%
12 Mark II function points                    493      106.00%
13 NESMA function points                      484      104.00%
14 RICE objects                               2,591    557.14%
15 SCCQI function points                      1,479    318.00%
16 Simple function points                     453      97.50%
17 SNAP non functional size metrics           118      25.45%
18 SRM pattern matching function points       465      100.00%
19 Story points                               362      77.78%
20 Unadjusted function points                 414      89.00%
21 Use-Case points                            217      46.67%

Document Sizing


Percent Complete

1 Requirements 192 76,781 94.55% 2 Architecture 46 18,268 93.24% 3 Initial design 223 89,230 87.46% 4 Detail design 379 151,472 88.77% 5 Test plans 76 30,329 91.16% 6 Development Plans 26 10,231 91.24% 7 Cost estimates 46 18,268 94.24% 8 User manuals 184 73,783 94.88% 9 HELP text 88 35,152 95.06% 10 Courses 67 26,973 94.79% 11 Status reports 47 18,887 93.24% 12 Change requests 90 35,952 99.55% 13 Bug reports 496 198,214 92.51% TOTAL 1,959 783,541 93.13%

Work hours per page - writing 0.95 Work hours per page - reading 0.25 Total document hours - writing 1,860.91 Total document hours - reading 85.01 Total document hours 1,945.92 Total document months 14.74 Total document $ 147,425.42 $ per function point 317.00 % of total development 39.25%


DevOps Practices Self Assessment

Please describe your purpose for completing the assessment.

I only want to self-assess my practices

Enter the contact information in the fields provided. This information will be used to forward the results to you. If you included your IBM representative's information, a copy of your responses will be forwarded to your IBM representative.

Your name

Dolores

Your Email Address

[email protected]

IBM representative name

Na

IBM representative email

[email protected]

What is your company's industry?

Education

What is the geographic area of your company's primary operations?

North America

Please select the assessment experience you prefer.

I would like to manually select each practice to assess

Please select up to two adoption paths to focus your assessment. The next step will let you select from a list of practices to further refine your assessment questions.

Develop / Test (Design, Construct, Configuration Management, Build, Test, Quality Assessment)

Please select one or more practices from the list to focus your self-assessment.

Design Construct Configuration Management Build Quality Management Quality Assessment

Design is focused on producing products during a phase of the project using formal processes for review and measures of completion. The confidence of design activities to ensure scope and requirements are understood for implementation is low and effectiveness is not measured. Formal method is used to review design products for approval or to improve or correct.

Partial

Developers work independently and deliver code changes, deliveries are formally scheduled and resource intensive. Integration is a planned event that impacts most development activities in an application or project.

Partial

Code deliveries and integrations are performed periodically using a common process and automation. Integrations are accomplished by individual developers and automated when possible. Coding techniques are available and used inconsistently. Common architecture standards for application coding are defined and trained. Code reviews are effective and manually initiated.

Yes

Coding best practices are defined consistently across technologies and include reviews and automated validation through scanning/testing. Consistent architecture standards are used across organization and validated in testing and reviews.

Yes

Code changes are collaboratively developed across technologies, application and teams, continuously. Developers have immediate access to relevant information for code changes to ensure iterative improvements or changes to design, requirements or coding implementation are understood. Standards in coding and reviews are measured standards across the organization. Best coding and validation practices are trained, used and verified consistently.

Yes

Source control of assets is largely a manual activity that relies heavily on individuals following processes. Performing changes on an asset by more than one team member is only accomplished through locks and access controls. Merging asset changes is accomplished on desktops manually and formally scheduled by a specialized integration team. Applying changes across versions for different releases is performed outside of the configuration management tool or process.

No

Builds are performed manually and automated across projects and environments. Build systems range from developer's IDE (usually for Dev only) to a formal centralized build server which is normally used for formal promotions to QA-Production. Build management and standards are controlled at the project level. Formal builds are scheduled following formal delivery and integration events to validate application level integration and application promotion.

No

Informal builds produced by individual developers via their desktop IDE are used for validation but never deployed to environments. A centralized build service is in place that controls the artifacts and processes used in the build. The automated build process includes code scanning and unit / security tests. The build process periodically produces a build of each application under change for testing or verification purposes. Build results are measured and monitored by development teams consistently across the enterprise.

Yes

A daily build of an application under change is promoted to test. Build is provided as a service that supports continuous integration, compilation and packaging by platform. Individual developers, teams and applications regularly schedule automated builds based on changes to source code. A dependency management process is in place for software builds using a dependency-management repository to trace the standard libraries and provision them at build time.

Yes

All builds could be promoted through the software delivery pipeline, if desired. Each project/team tracks changes to the build process as well as source code and dependencies. Modifying the build process requires approval, so access to the official build machines and build server configuration is restricted where compliance is a factor or where Enterprise Continuous Integration becomes a production system. Build measures are used to improve development and configuration management processes.

Yes

Crash reporting is incorporated into mobile applications to provide basic measures for quality assessment. Crash reporting is used to improve basic stability of the applications.

Yes

Quality reporting is embedded into mobile applications to support user sentiment and usage patterns. Measures are used to drive changes into development teams for usability improvements, defect correction and enhancements.

Yes

Mobile application teams assess quality by monitoring social media, application repositories, and user feedback. Each monitoring source is used to define defects or enhancements to the specific application. Measures are used to determine the impact of application team improvements on user satisfaction.

No

The main objective of testing is to validate that the product satisfies the specified requirements. However, testing is still seen by many stakeholders as being the project phase that follows coding.

No

Design

Reliable

Construct

Scaled

Configuration Management

Scaled

Build

Scaled

Test

Reliable

Assess Quality

Scaled

References

[BASI] Basili, V., Gianluigi Caldiera, and H. Dieter Rombach. The goal question metric approach. In Encyclopedia of Software Engineering. Wiley, 1994.

[BISH] Bisht, A., AS Dhanoa, AS Dhillon, G Singh, “A Survey on Quality Prediction of Software Systems”, isems.org.

[BLAC] Black, R., "Measuring Defect Potentials and Defect Removal Efficiency", 2008, http://www.rbcs-us.com/images/documents/Measuring-Defect-Potentials-and-Defect-Removal-Efficiency.pdf

[CHIL] Childers, B., "Geek Guide: Slow Down to Speed Up, Continuous Quality Assurance in a DevOps Environment", Linux Journal, 2014.

[CISC] "Cisco 2014 Annual Security Report", http://www.cisco.com/web/offers/lp/2014-annual-security-report/index.html

[DEMI] W. Deming, Out of the Crisis. MIT Center for Advanced Engineering Study, Cambridge, MA, 1982.

[DYNA] Dynatrace, “DevOps: Hidden Risks and How to Achieve Results”, 2015. http://www.dynatrace.com/content/dam/en/general/ebook-devops.pdf

[FENT] Fenton, N., J. Bieman, Software Metrics: A Rigorous and Practical Approach, Third Edition, 2014, CRC Press, Boca Raton, FL.

[FOWL] Fowler, M. “An Appropriate Use of Metrics”, Feb. 2013, http://martinfowler.com/articles/useOfMetrics.html

[FRAN] P. Frankl and O. Iakounenko. Further empirical studies of test effectiveness. In Proc. 6th ACM SIGSOFT International Symposium on the Foundations of Software Engineering (FSE’98), pages 153–162. ACM Press, 1998.

[GALE] Galen, R., "2 Dozen Wild & Crazy Agile Metrics Ideas", RGalen Consulting.

[GART] Gartner, "Market Share Analysis: Application Performance Monitoring, 2014", May 27, 2015, http://www.gartner.com/technology/reprints.do?id=1-2H15OOF&ct=150602&st=sb

[GILB] Gilb, T and L. Brodie, “What’s Wrong with Agile Methods: Some Principles and Values to Encourage Quantification”, Methods and Tools, Summer 2007, accessed 6/2015 http://www.methodsandtools.com/archive/archive.php?id=58

[HOGL] Hoglund G. and McGraw G (2004): Exploiting Software: How to Break Code. Addison-Wesley, 2004.

[HOWA] Howard, M., "Lessons Learned from Five Years of Building More Secure Software", Trustworthy Computing, Microsoft, http://www.google.com/url?url=http://download.microsoft.com/download/A/E/1/AE131728-943B-42B4-B130-C1DEBE68F503/Trustworthy%2520Computing.pdf&rct=j&frm=1&q=&esrc=s&sa=U&ved=0CBQQFjAAahUKEwi61_Srx9jHAhULPZIKHZPmBSQ&usg=AFQjCNG1RERb7HJ8OakiFwEL_FLJAc6j6w

[IBM] IBM Developer Works, "11 proven practices for more effective, efficient peer code review", accessed 6/2015, http://www.ibm.com/developerworks/rational/library/11-proven-practices-for-peer-review/

[JONEa] Jones, C and O. Bonsignour, The Economics of Software Quality, 2011, Pearson Publishing

[JONEb] Jones, C., “Software Engineering issues for 30 years”, http://www.namcook.com/index.html

[JONEc] Jones, C., “Examples of Software Risk Master (SRM) Requirements Predictions”, January 11, 2014, http://namcookanalytics.com/wp-content/uploads/2014/01/RequirementsData2014.pdf

[JONEd] Jones, C., "Evaluating Software Metrics and Software Measurement Practices", Version 4.0, March 14, 2014, Namcook Analytics LLC, http://namcookanalytics.com/wp-content/uploads/2014/03/Evaluating-Software-Metrics-and-Software-Measurement-Practices.pdf

[JONEe] Jones, C., "The Mess of Software Metrics", Version 2, September 12, 2014, http://namcookanalytics.com/wp-content/uploads/2014/09/problems-variations-software-metrics.pdf

[KABA] Kabanov, Jevgeni, "Developer Productivity Report 2013 – How Engineering Tools & Practices Impact Software Quality & Delivery", Zeroturnaround, 2013, http://pages.zeroturnaround.com/RebelLabs-AllReportLanders_DeveloperProductivityReport2013.html?utm_source=Productivity%20Report%202013&utm_medium=allreports&utm_campaign=rebellabs&utm_rebellabsid=76

[KASP] Kaspersky Lab, "Oracle Java surpasses Adobe Reader as the most frequently exploited software", December 2012, http://www.kaspersky.com/about/news/virus/2012/Oracle_Java_surpasses_Adobe_Reader_as_the_most_frequently_exploited_software

[MANT] Mantyla, M.V., and C. Lassenius, "What types of defects are really discovered in code reviews?", IEEE Transactions on Software Engineering, 2009, 35(3), 430-448.

[MCGR] McGraw, G., S. Migues, J. West, "BSIMM6", October 2015, https://www.bsimm.com/wp-content/uploads/2015/10/BSIMM6.pdf

[MICR] Microsoft, “Microsoft Security Intelligence Report volume 10 (July – December 2010)”, http://www.microsoft.com/en-us/download/details.aspx?id=17030

[MITR] 2011 CWE/SANS Top 25 Most Dangerous Software Errors, http://cwe.mitre.org/top25

[MOCK] Mockus, A., Nachiappan Nagappan, Trung T. Dinh-Trong, "Test coverage and post-verification defects: A multiple case study" Proceedings of the 2009 3rd International Symposium on Empirical Software Engineering and Measurement, October 2009.

[NAGA] Nagappan, N., and T. Ball, “Use of Relative Code Churn to Predict System Defect Density”, Microsoft Research, 2005, http://research.microsoft.com/pubs/69126/icse05churn.pdf


[NESM] NESMA, "Agile metrics", http://nesma.org/2015/04/Agile-metrics/

[OLAV] Olavsrud, T., "Do Insecure Open Source Components Threaten Your Apps?", CIO, March 2012, http://www.cio.com/article/2397662/governance/do-insecure-open-source-components-threaten-your-apps-.html

[OWAS] “OWASP Top 10”, www.owasp.org/index.php/Category:OWASP_Top_Ten_Project

[PAUK] Paukamainen, I., "Case: Testing in Large Scale Agile Development", presentation, http://testingassembly.ttlry.mearra.com/files/2014%20ISMO%20Case_TestingInLargeScaleAgileDevelopment.pdf

[PFLE] S. L. Pfleeger, N. Fenton, and N. Page, “Evaluating software engineering standards,” IEEE Comput., vol. 27, no. 9, pp. 71–79, 1994.

[PLUM] "Java performance tuning survey results", November 13, 2014, https://plumbr.eu/blog/performance-blog/java-performance-tuning-survey-results-part-i

[PUPP] Puppet Labs, IT Revolution Press and Thoughtworks, 2014 State Of DevOps Report and 2013 State Of DevOps Infographic, 2015, https://puppetlabs.com/2013-state-of-devops-infographic

[PUTM] Putman, R., "Secure Agile SDLC", https://www.brighttalk.com/webcast/1903/92961

[SANSa] SANS Institute, "2015 State of Application Security: Closing the Gap", May 14, 2015, https://software-security.sans.org/blog/2015/05/14/2015-state-of-application-security-closing-the-gap

[SANSb] SANS Institute, “Critical Security Controls”, https://www.sans.org/critical-security-controls/

[SCAL] http://scaledAgileframework.com/features-components/

[SECU] “37 Powerful Penetration Testing Tools for Every Penetration Tester”, Security testing, June 2015, http://www.softwaretestinghelp.com/penetration-testing-tools/

[SIRI] Sirias, C., "Project Metrics for Software Development", InfoQ, July 14, 2009, http://www.infoq.com/articles/project-metrics

[SMAR] SmartBear, "11 Best Practices for Peer Code Review", whitepaper.

[SOFT] Software Assurance Pocket Guide Series: Development, Volume V, "Architecture and Design Considerations for Secure Software", Version 1.3, February 22, 2011, https://buildsecurityin.us-cert.gov/sites/default/files/Architecture_and_Design_Pocket_Guide_v1.3.pdf

[TECH] "Quality metrics: A guide to measuring software quality", SearchSoftwareQuality, http://searchsoftwarequality.techtarget.com/guides/Quality-metrics-A-guide-to-measuring-software-quality

[VERI] Verizon DBIR 2012, IDC, Infonetics Research


[WIKI] Software Quality, https://en.wikipedia.org/wiki/Software_quality

[ZERO] zeroturnaround.com/rebellabs/the-developers-guide
