Transforming Your Way to Control Charts That Work

November 19, 2009

Richard L. W. Welch, Associate Technical Fellow
Robert M. Sabatino, Six Sigma Black Belt
Northrop Grumman Corporation

Approved for Public Release, Distribution Unlimited: Northrop Grumman Case 09-2031, Dated 10/22/09

Outline

• The quest for high maturity – Why we use transformations
• Expert advice
• Observations & recommendations
• Case studies
– Software code inspections
– Drawing errors
• Counterexample – Software test failures
• Summary

The Quest for High Maturity

• You want to be Level 5
• Your CMMI appraiser tells you to manage your code inspections with statistical process control
• You find out you need control charts
• You check some textbooks. They say that “in-control” looks like this

[Figure: Summary for ln(COST/LOC) and a control chart of LN(Hours per New/Mod LOC) by Review Closed Date. ASU Log Cost Model, ASU Peer Reviews, using lognormal probability density function, with 95% confidence intervals for the mean and median. ASU Eng Checks & Elec Mtngs.]

• You collect some peer review data
• You plot your data . . .

A typical situation (like ours, 5 years ago)

The Reality

The data

[Figure: Summary for Cost/LOC and a normal probability plot of Cost/LOC, with 95% confidence intervals for the mean and median.]

• Highly asymmetrical
• Flunks Anderson-Darling test for normality (p < 0.005)

The ASU Cost Model
• Penalizes due diligence in reviews
– Up to 11% false alarm rate (Chebyshev’s inequality)
• Doesn’t flag superficial reviews
– No lower control limit
• Skews the central tendency
– Average cost looks like it busts the budget

[Figure: Raw Data for Cost/LOC, a control chart of Hours per New/Mod LOC by Review Closed Date. ASU Eng Chks & Elec Mtngs.]

What do you do?

An Idea

• Someone suggests that a data transformation might help
• You decide to hit the textbooks. Two of the best are
– Donald Wheeler, Understanding Variation: The Key to Managing Chaos, 2nd edition, SPC Press, 2000
– Douglas Montgomery, Design and Analysis of Experiments, 6th edition, Wiley, 2005
• We’ll see what they have to say in a minute, but first let’s look at some factoids about data transformations . . .

Data Transformation Factoids

• A data transformation is a mathematical function that converts your data into something else
– For example, converting temperature from Fahrenheit to Celsius
• When data do not meet the necessary characteristics to apply a desired probability model, the right data transformation can help
• Data transformations are encountered in many statistical applications
– Data analysis
– Experimental design
– Statistical process control (SPC)
– Pattern recognition
• At Aerospace Systems, 40% of our control charts use a data transformation
– We control about 50 different product development metrics

This presentation includes several slides, labeled “factoids,” that amplify the main discussion with additional technical background.

Expert Advice

Dueling Experts

• Unfortunately, they give conflicting advice

Wheeler favors the analysis of raw data. Montgomery advocates data transformations prior to analysis.

Who can we believe?

Wheeler

• Doesn’t favor transforming data
• Reasons that you cannot always meet the assumptions for specialized models like the binomial or Poisson
• Favors empirical use of individuals/moving range charts for SPC, or equivalent techniques for other applications, because of their simplicity
– But this downplays the difficulties encountered with non-normal or non-symmetric data
– Practitioners must take care not to overvalue the utility of 2-sigma and 3-sigma confidence bands/control limits for non-normal data

Calculus Factoids

• Tchebysheff’s Inequality
– For a random variable with finite variance (and no particular distribution), ±k-sigma limits can be exceeded up to (1/k²) × 100% of the time
• For k = 2, a 2-sigma confidence limit can be violated up to 25% of the time – not 5%!
• For k = 3, the 3-sigma control bands can be exceeded up to 11% of the time – not 0.3%!
– This can result in an unworkable rate of false positives
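These figures are easy to verify numerically. The sketch below compares the distribution-free Chebyshev bound with the exact two-sided tail probability of a normal distribution, using only the standard library:

```python
import math

def chebyshev_bound(k):
    """Upper bound on P(|X - mu| >= k*sigma) for ANY finite-variance distribution."""
    return 1.0 / k**2

def normal_tail(k):
    """Exact two-sided tail P(|Z| >= k) for a normal distribution."""
    return math.erfc(k / math.sqrt(2))

for k in (2, 3):
    print(f"k={k}: Chebyshev bound {chebyshev_bound(k):.1%}, normal tail {normal_tail(k):.2%}")
```

For k = 3 the gap is roughly a factor of 40 (11.1% vs. 0.27%), which is exactly why 3-sigma limits derived from badly non-normal data can produce an unworkable false-alarm rate.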

For an extreme case of “dirty” data, you could need ±18-sigma control limits to achieve the same degree of control as ±3-sigma limits with well-behaved, normally distributed data.

But we never have dirty data, so why worry?

Montgomery

• Boldly recommends

– Logarithmic transformation y_ij = ln x_ij for lognormal data
– Square root transformation y_ij = x_ij^0.5 or y_ij = (1 + x_ij)^0.5 for Poisson data
– Arcsine transformation y_ij = arcsin(x_ij^0.5) for point-binomial data (that is, binomial data expressed in fractional form)
– Empirically-derived transformations for other cases (think Box-Cox)
• Focuses on stabilizing the variance

– Transformations work when there is some underlying physical causality
• How do you know?
• What happens when you have small samples and take one more data point?
– You may be sacrificing desirable decision-theoretic properties like unbiasedness, maximum likelihood, or uniformly most powerful tests
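Montgomery's three textbook transformations take only a few lines to apply. The sample arrays below are hypothetical stand-ins, not data from this presentation:

```python
import numpy as np

# Hypothetical sample data for each case (illustrative only)
lognormal_data = np.array([0.5, 1.2, 3.4, 0.8, 2.1])   # e.g., cost per LOC
poisson_counts = np.array([0, 2, 1, 0, 3, 1])           # e.g., defects per drawing
fractions      = np.array([0.1, 0.25, 0.5, 0.05])       # e.g., fraction defective

y_log    = np.log(lognormal_data)          # lognormal data -> approximately normal
y_sqrt   = np.sqrt(1 + poisson_counts)     # Poisson counts (the "1 +" handles zeros)
y_arcsin = np.arcsin(np.sqrt(fractions))   # point-binomial fractions in [0, 1]
```

Each transform stabilizes the variance for its target distribution family; the analysis (control limits, tests) is then run on the transformed series and results are back-transformed for reporting.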

Decision Theory Factoids

• Unbiased
– An unbiased estimator is, in the long run, accurate – that is, it has no non-random component of error
• Maximum Likelihood
– A maximum likelihood estimator is the one most likely to be correct
• Uniformly Most Powerful
– A uniformly most powerful decision rule has the best chance of flagging any discrepant or out-of-control condition

Statisticians do not abandon these properties without good reason.

Observations & Recommendations

Listen to both experts

Observations

• The Central Limit Theorem is a powerful ally
– For any distribution with a finite variance, the sample mean approaches a normal distribution as the sample size n grows
– Error of approximation varies as 1/n^0.5
– In a practical sense, mild excursions from normality can be mitigated by observing more data
• A transformation can be worthwhile
– When a physical model explains why it should work
– When an empirical model is well-founded in a large amount of data
• Many studies of similar phenomena
• Large data sets
• Transformations can be difficult
– Mathematically messy
– Hard to interpret
– Hard to explain
– Cultural resistance to their use
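The 1/n^0.5 error decay is easy to see by simulation. The sketch below draws sample means from heavily skewed exponential data (scale and sample sizes are illustrative) and shows their spread shrinking at exactly that rate:

```python
import numpy as np

rng = np.random.default_rng(0)
# Exponential data are strongly right-skewed, yet the distribution of the
# sample mean tightens like 1/sqrt(n) as n grows
for n in (25, 100, 400):
    means = rng.exponential(scale=1.0, size=(20000, n)).mean(axis=1)
    print(f"n={n:4d}  std of sample mean = {means.std():.4f}  (theory: {1 / np.sqrt(n):.4f})")
```

Each quadrupling of n halves the spread of the sample mean, which is the practical content of "observe more data to mitigate mild non-normality."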

Recommendations

• Due diligence requires that we investigate and compare using raw data vs. transformed data
– Modern tools limit us only by our imaginations
• Simpler is better unless there is harm
– Statistical decision theory embraces the double negative: we do not abandon the easiest approach until we compile compelling evidence that it doesn’t work
– Consider both the incidence rates and costs of wrong decisions (false positives and negatives, Type I and II errors, etc.)
• The “correct” answer is not pre-determined
– In 1936, R. A. Fisher derived the first classifier based on the multivariate normal distribution
– In 1976, the senior author showed Fisher’s data were lognormally distributed and that a transformation gave more accurate results
– Fisher’s error rate was 3.3%; the author’s was 2%
– Is the difference between 3.3% and 2% meaningful to your application?

Fisher’s paper is “The Use of Multiple Measurements in Taxonomic Problems,” Annals of Eugenics, 7, pp. 179-188. The lognormality of the data was shown in the senior author’s PhD thesis, Studies in Bayesian Discriminant Analysis, and published in the 1979 paper “Multivariate Non-Gaussian Bayes Discriminant Analysis,” Statistica, 1979:1, pp. 13-24, with C. P. Tsokos.

Software Code Inspections

Case Study #1

Reality, Part Two

The data

[Figure: Summary for Cost/LOC and a normal probability plot of Cost/LOC, with 95% confidence intervals for the mean and median.]

• Highly asymmetrical
• Flunks Anderson-Darling test for normality (p < 0.005)

The control chart: the ASU Cost Model
• Penalizes due diligence in reviews
– 11% false alarm rate (Chebyshev’s inequality)
• Doesn’t flag superficial reviews
– No lower control limit
• Skews the central tendency
– Average cost looks like it is busting the budget

[Figure: Raw Data for Cost/LOC, a control chart of Hours per New/Mod LOC by Review Closed Date. ASU Eng Chks & Elec Mtngs.]

What do you do?

Stabilizing the Data

• The senior author’s presentation at the 2005 CMMI Technology Conference, “Logarithms Can Be Your Friends: Controlling Peer Review Costs” (November 16, 2005, Richard L. W. Welch, PhD, Chief Statistician, Northrop Grumman Corporation), demonstrated how a log-cost model can successfully control software code inspections
– Peer review unit costs (hours per line of code) behave like commodity prices in the short term
– Short-term commodity price fluctuations follow a lognormal distribution
– As a consequence, commodity prices follow a lognormal distribution
– Therefore, taking the natural logarithm of a sequence of peer review costs transforms the sequence to a normally distributed series

Notes:
• Details on the log-cost model, “one of the most ubiquitous models in finance,” can be found at riskglossary.com (http://www.riskglossary.com/articles/lognormal_distribution.htm)
• Prior CMMI Technology Conference & User Group papers are published on-line at http://www.dtic.mil/ndia/
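The effect of the log transform on lognormal cost data can be illustrated with a short simulation. The parameters below are hypothetical, not the ASU figures; the point is that the raw data are strongly right-skewed while the logged data are roughly symmetric:

```python
import numpy as np

def skewness(x):
    """Sample skewness: third standardized moment."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return float((z**3).mean())

# Hypothetical review costs (hours/LOC) drawn from a lognormal distribution
rng = np.random.default_rng(42)
cost = rng.lognormal(mean=-1.0, sigma=0.8, size=5000)

print(skewness(cost))          # strongly right-skewed raw data
print(skewness(np.log(cost)))  # roughly symmetric after the log transform
```

On the logged scale, symmetric 3-sigma control limits make sense again, which is the essence of the log-cost model.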

Our Data on Logs

[Figure: Summary for ln(COST/LOC), ASU Peer Reviews, and a normal probability plot of ln(COST/LOC) with 95% confidence intervals for the mean and median. Anderson-Darling test p = 0.759.]

• Impacts
– Normality of the transformed data minimizes false alarms
– We catch superficial reviews
– Code reviews do not bust the budget
• Demonstrated utility & applicability
– > 7,000 peer reviews over 6 years provide large sample validation

[Figure: ASU Log Cost Model, Initial Release, a control chart of LN(Hours per New/Mod LOC) by Review Closed Date, using lognormal probability density function. ASU Eng Checks & Elec Mtngs. Data through Review 7351 on 8/25.]

Demonstrating an in-control, stable process

Lognormal Factoid

Since the sample mean of the transformed data is

    \bar{Y} = \frac{1}{n} \sum_{i=1}^{n} \ln x_i = \frac{1}{n} \ln \prod_{i=1}^{n} x_i = \ln \left( \prod_{i=1}^{n} x_i \right)^{1/n}

the inverse transformation yields the geometric mean of the untransformed data:

    e^{\bar{Y}} = \left( \prod_{i=1}^{n} x_i \right)^{1/n} = \bar{x}_{\mathrm{geo}}

As a result of a similar derivation, the inverse transformation of the sample standard deviation of the transformed data is the geometric standard deviation of the untransformed data.
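The identity above is easy to check numerically; a minimal sketch using only the standard library:

```python
import math

def geometric_mean(xs):
    # exp of the mean of the logs equals the n-th root of the product
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

data = [1.0, 2.0, 4.0, 8.0]   # product = 64, so the geometric mean is 64**0.25
print(geometric_mean(data))
```

This is why centerlines computed on the log scale back-transform to geometric (not arithmetic) means, which is the statistic actually reported from a log-cost chart.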

Drawing Errors

Case Study #2

Drawing Errors

[Figure: Chart of Observed and Expected Values, expected vs. observed total errors per drawing (bins 0, 1, 2, >=3).]

• Drawing errors follow a Poisson distribution
– Data are discrete counts of defects per drawing
– C charts are commonly used to analyze Poisson-distributed defect data
• C-charts lack insight
– No ready indicator of deteriorating or improving process performance
• Control limits establish the process capability for expected number of defects per drawing
– Special causes are lagging indicators
– Fractional control limits confuse data analysts
– For example, UCL = 2.4 defects. What are 2.4 defects? 3 defects are a special cause and 2 are not. But what about a run of consecutive drawings, each with two errors?

C charts detect performance changes poorly

C Chart Performance

• Is this a stable process?
• How sensitive is the C Chart to changes in process performance?
– Baseline phase simulated using Poisson distribution with mean = 0.4
– Increased mean value by an additional 25% in two successive steps
– Sample size = 60 drawings

[Figure: Drawing Simulation (includes sampling error), a C chart of error count vs. number of drawings across the baseline phase, a simulated 25% mean shift, and an additional simulated 25% mean shift. UCL = 2.064, C-bar = 0.333, LCL = 0.]
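The C-chart limits in a simulation like this come from the standard formulas, UCL/LCL = c-bar ± 3·sqrt(c-bar), with a negative lower limit clipped to zero. A sketch under illustrative assumptions (the seed and counts are not the presentation's simulation run):

```python
import numpy as np

rng = np.random.default_rng(1)
baseline = rng.poisson(lam=0.4, size=60)    # 60 drawings, mean 0.4 errors each

c_bar = baseline.mean()
ucl = c_bar + 3 * np.sqrt(c_bar)            # standard C-chart upper limit
lcl = max(0.0, c_bar - 3 * np.sqrt(c_bar))  # negative lower limit clips to zero

print(f"C-bar = {c_bar:.3f}, UCL = {ucl:.3f}, LCL = {lcl}")
```

Note the clipped LCL = 0: with a low-defect process the C chart has no lower limit at all, so it can never flag a suspicious absence of defects.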

Detecting Shifts in Process Performance

• Wheeler suggests transforming raw “counts of events” to “measurements of process activity” (i.e., rates)

  Date   Dwg #   No. Errors   Any defects found?   Drawings since a defect was found   Defect Rate (E/D)
  4/1    1       0            No                   -                                   -
  4/1    2       0            No                   -                                   -
  4/2    3       0            No                   -                                   -
  4/4    4       0            No                   -                                   -
  4/5    5       1            Yes                  5                                   0.200
  4/6    6       0            No                   -                                   -
  4/7    7       0            No                   -                                   -
  4/8    8       1            Yes                  3                                   0.333
  4/9    9       4            Yes                  1                                   4.000
  4/10   10      1            Yes                  1                                   1.000
  4/11   11      3            Yes                  1                                   3.000
  4/12   12      0            No                   -                                   -
  4/12   13      0            No                   -                                   -
  4/12   14      0            No                   -                                   -
  4/13   15      0            No                   -                                   -
  4/13   16      0            No                   -                                   -
  4/14   17      0            No                   -                                   -
  4/15   18      0            No                   -                                   -
  4/16   19      2            Yes                  8                                   0.250

  Each “trial” runs until a drawing with defects occurs (trial length varies); the measurement of the trial outcome is the defect rate, continuous data used for monitoring process performance.

• The defect rate series is plotted on an Individuals & Moving Range (ImR) chart & analyzed
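Wheeler's counts-to-rates transformation is only a few lines of code. This sketch reproduces the rates in the table above for the first ten drawings:

```python
def defect_rates(counts):
    """Wheeler's transformation: each time defects are found, emit
    defects / (drawings since the last drawing with defects)."""
    rates, since_last = [], 0
    for c in counts:
        since_last += 1
        if c > 0:
            rates.append(c / since_last)
            since_last = 0
    return rates

# Per-drawing error counts for drawings 1-10 from the table above
counts = [0, 0, 0, 0, 1, 0, 0, 1, 4, 1]
print(defect_rates(counts))   # [0.2, 0.333..., 4.0, 1.0]
```

The resulting rate series is continuous rather than integer-valued, which is what makes it suitable for an individuals/moving-range chart.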

Yes, this was suggested by the same guy who doesn’t like transformations. See Donald Wheeler, Understanding Variation: The Key to Managing Chaos, 2nd edition, SPC Press, 2000, pp. 100-104.

ImR Control Chart Sensitivity to Shifts in Process Performance

I can see it!!!

• Is this a stable process?
– Same simulation as the C chart
– Data were transformed to establish defect rates
– Control chart parameters computed from the baseline phase

[Figure: Drawing Simulation, an ImR chart of defects per drawing across the baseline phase, a simulated 25% mean shift, and an additional simulated 25% mean shift. Individuals chart: UCL = 1.494, X-bar = 0.516, LCL = -0.462; stability validated in the baseline phase, with out-of-control points (1) flagged after the shifts. Moving range chart: UCL = 1.201, MR-bar = 0.368, LCL = 0.]

Sensitivity Comparison

• This simulation models a 10% process shift
• What chart would you use?

[Figure: Drawing Rate Simulation, an ImR chart of defects per drawing for the baseline phase and a simulated 10% mean shift. Individuals chart: UCL = 1.474, X-bar = 0.547, LCL = -0.38. Moving range chart: UCL = 1.139, MR-bar = 0.349, LCL = 0. Out-of-control points (1) are flagged after the shift.]

[Figure: Drawing Simulation, a C chart of error counts for the baseline phase and the same simulated 10% mean shift. UCL = 2.297, C-bar = 0.4, LCL = 0. No out-of-control points are flagged.]

ImR Chart Factoids

• The mR chart portion monitors the process based on points representing the differences (i.e., moving ranges) between each two consecutive individual data points
– Assesses whether the process variation is in control
– The mR chart should be in control before you establish the baseline with the I chart
– If the mR chart is not in control, the control limits for the I chart will be inflated (too forgiving) and may fail to signal an out-of-control condition
• The I chart portion monitors the process based on individual data points (defect rates)
– Assesses whether the process center (X-bar) is in control
– The process centerline is calculated from the average of the individual data points
– The control limits are set a distance of 3σ above and below the centerline and provide a visual display of the amount of process variation expected
• The ImR chart will not be hypersensitive to rare events containing numerous errors
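The ImR parameters follow from the usual moving-range constants: I-chart limits at X-bar ± 2.66·MR-bar (2.66 = 3/d2 with d2 = 1.128) and an mR-chart UCL of 3.267·MR-bar (D4 for subgroups of 2). A minimal sketch; the input series below is the defect-rate sequence from the case study #2 table:

```python
def imr_limits(xs):
    """Individuals & moving-range chart parameters using the standard
    constants 2.66 (= 3/d2, d2 = 1.128) and 3.267 (= D4 for n = 2)."""
    mrs = [abs(b - a) for a, b in zip(xs, xs[1:])]
    x_bar = sum(xs) / len(xs)
    mr_bar = sum(mrs) / len(mrs)
    return {
        "I":  (x_bar - 2.66 * mr_bar, x_bar, x_bar + 2.66 * mr_bar),  # (LCL, CL, UCL)
        "mR": (0.0, mr_bar, 3.267 * mr_bar),                          # (LCL, CL, UCL)
    }

rates = [0.200, 0.333, 4.000, 1.000, 3.000, 0.250]
print(imr_limits(rates))
```

In practice an I-chart LCL that computes to a negative value is clipped to zero for a rate that cannot go negative, just as with the C chart.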

Software Test Failures

Counterexample

SW Test Returns – Poisson Model

• How often have you heard, as a universal truth, “Defects are Poisson distributed”?
• How about reality, as the Test group sees it?
• The u-chart plots SW test failures
– This assumes SW test failures follow a Poisson distribution
– Being diligent, we check the goodness of the model

[Figure: Returns from SW Test, a u-chart of returns per build by build date, and a Chart of Observed and Expected Values for failures per build (bins 0 through >=6).]

χ² Test for Poisson: N = 54, DF = 5, Chi-Sq = 88.7177, p-value = 0.000. Not a good fit.

Does Transforming the Data Help?

• Square root transformation
– As recommended by Montgomery for data we thought should follow a Poisson model
– Anderson-Darling test for normality of Sqrt(1+Test): A-squared = 2.38, p-value < 0.005. No.
• Lognormal
– When we try a Box-Cox transformation, we discover the optimal parameter lambda = 0
– This is the same as a lognormal transformation
– Goodness-of-fit tests for 1+Test: lognormal AD = 2.243, p-value < 0.005; Box-Cox transformation AD = 2.243, p-value < 0.005. Still no.

[Figure: Summary for Sqrt(1+Test) with 95% confidence intervals for the mean and median, and probability plots for 1+Test (lognormal and normal, 95% CI) after the Box-Cox transformation (lambda = 0).]
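The Box-Cox lambda is found by maximizing the profile log-likelihood over candidate values; lambda = 0 corresponds to the log transform. A sketch of that search on simulated lognormal data (illustrative, not the presentation's test-failure counts, where no lambda gave a good fit):

```python
import numpy as np

def boxcox(x, lam):
    """Box-Cox transform; lambda = 0 is defined as the natural log."""
    return np.log(x) if abs(lam) < 1e-9 else (x**lam - 1) / lam

def boxcox_loglik(x, lam):
    """Profile log-likelihood of the Box-Cox parameter under a normal model."""
    y = boxcox(x, lam)
    return -len(x) / 2 * np.log(y.var()) + (lam - 1) * np.log(x).sum()

rng = np.random.default_rng(7)
x = rng.lognormal(mean=0.0, sigma=0.5, size=500)   # true lambda is 0 here

grid = np.linspace(-2, 2, 81)
best = grid[np.argmax([boxcox_loglik(x, lam) for lam in grid])]
print(best)   # should land near 0, i.e. the log transform
```

When the estimated lambda lands near 0, as on this slide, Box-Cox is telling you that the best power transform in its family is the log, so a lognormal model and a Box-Cox fit stand or fall together.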

Discussion

• Sometimes no model seems to work
– May have multiple failure mechanisms, with mixing of distributions
– Need lots of data to sort out
• In this case, no obvious transformation suggests itself
• We revert to the basic principle that simpler is better
– Individuals chart
– Not great, but the simplest choice

[Figure: Returns from SW Test, an individuals chart of returns per build by build date with out-of-control points (1) flagged, and a data summary with Anderson-Darling AD = 3.258, p-value < 0.005, mean, and median.]

Not much better, but simple

Summary

• The Central Limit Theorem is a powerful ally
• A transformation can be worthwhile
• Transformations can be difficult
• Due diligence requires that we investigate and compare using raw data vs. transformed data
• Simpler is better unless there is harm
• The “correct” answer is not pre-determined

When used carefully, transformations expand analysis capability

Richard L. W. Welch, PhD, Northrop Grumman Corporation, (321) 951-5072, [email protected]
Robert M. Sabatino, Northrop Grumman Corporation, (321) 726-7629, [email protected]