Liars Figure, and Figures Lie


How many times have you heard the old joke used to deflate statistics, “liars figure and figures lie”? Or the Mark Twain quote, “Lies, Damned Lies, and Statistics”? This month I will demonstrate where these old truisms come from, and how to avoid them. I will run four different analyses on exactly the same data and arrive at four completely different interpretations: three lies and one truth. Let the saga begin . . .

I have retrieved 25 months of operational data which management wants analyzed. The first method we will use is a common one, a bar chart. Most people would probably succumb to the demand “I only want to see the current year of data!”, but let us assume I am allowed to show all 25 months. The resulting bar chart is shown as figure 1. For the sake of discussion, increasing numbers are “bad”.

[Figure 1: bar chart of the 25 monthly values, Jan-98 through Jan-00]

My assessment, given to the manager of this process, is as follows (in a breathless manner befitting the adverse trend that has developed):

“The past three months in a row have been increasing! In fact, the current month is at the highest value in more than a year! We must do something! We must find out why this month was so high!” Note the interpretation would be the same if I were showing only one year of data, but of course the current month would now be the “highest on the whole graph!”

The manager who owns the process that generated the data (and thus must be accountable, or find someone to hold accountable for the increase) says “Wait a minute. In an Excel spreadsheet you can add a ‘trendline’ to these charts. This trendline will tell us if we are overall increasing, or overall decreasing.” We dutifully go to Excel, and generate figure 2. This figure shows the “trendline”, which is generated using a “least-squares” fitted straight line, just like many of us learned in high school science class.

[Figure 2: the same data with the Excel trendline, y = -0.132x + 11.987]

“Aha! The trend line is negative! We have an improvement occurring, the rate is decreasing! It is obvious, the Excel trend line shows us the slope is negative. In addition, a projection ahead shows that we should achieve a value of less than 8 by July 2000.” We get ready for a celebration pizza party . . .
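The manager's projection can be checked with a few lines of Python. This is a minimal sketch, assuming the trendline's x simply counts months, with x = 1 for Jan-98 and x = 25 for Jan-00; the article does not state the numbering, and the function names are made up for illustration.

```python
# Trendline coefficients as printed on Figure 2.
SLOPE = -0.132
INTERCEPT = 11.987

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def trendline(x):
    """Value of the fitted line at month number x (assumed: x = 1 is Jan-98)."""
    return SLOPE * x + INTERCEPT


def month_label(x):
    """Calendar label for month number x under the same assumption."""
    return f"{MONTHS[(x - 1) % 12]} {1998 + (x - 1) // 12}"


def first_month_below(target, start=26):
    """First month number at or after `start` where the line drops below target."""
    x = start
    while trendline(x) >= target:  # terminates because the slope is negative
        x += 1
    return x


x = first_month_below(8)
print(month_label(x), round(trendline(x), 3))
```

Under that assumption the line first drops below 8 at month 31, which is July 2000, so the manager's arithmetic is right. Whether the line means anything is a separate question.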

But wait. A consultant arrives saying that he always uses moving averages to smooth out the fluctuations in the raw data. Let us see what a moving average (which averages the last six months together) gives for an interpretation.

[Figure 3: the same data with a 6 Month Moving Average]

“See!” says the consultant. “The moving average shows that we were improving until June 1999, then we got worse. Also, look how high above the average the current month is! We must determine what happened back in June that made us worse!”

The Three Big Lies.

The first interpretation simply reacted to the raw data. Of course at least one point on the graph will be the “highest on the graph”. Likewise, there will also be a lowest. Many people succumb to explaining, in gory detail, exactly why the current result was the way it was. We must also find those to hold accountable (that is, blame).

The second interpretation simply placed a linear regression line (least squares fit) on the data. It is highly unlikely that the slope of such a line will be exactly zero, so there will always be a “positive trend” or a “negative trend” declared. What most people fail to do is examine the “R-squared” value and determine the statistical significance of the slope. The question is: “is the slope of the line significantly different from zero?” In this case, the R-squared = 0.11, usually considered to be a pretty poor fit. An R-squared of 1.0 is a perfect fit.
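The significance question can be answered from the numbers already quoted. The sketch below assumes the standard t-test for the slope of a simple linear regression, which the article does not spell out. It turns R-squared = 0.11 from 25 points into a t statistic and compares it with the usual two-sided 5% critical value. The function names are illustrative, and `fit_line` is included only to show where the slope and R-squared come from.

```python
import math


def fit_line(y):
    """Least-squares line through (1, y[0]), (2, y[1]), ...

    Returns (slope, intercept, r_squared).
    """
    n = len(y)
    x = range(1, n + 1)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    syy = sum((yi - y_mean) ** 2 for yi in y)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0 else 0.0
    return slope, intercept, r_squared


def slope_t_statistic(r_squared, n):
    """t statistic (n - 2 degrees of freedom) for testing slope = 0."""
    return math.sqrt(r_squared * (n - 2) / (1.0 - r_squared))


# Values quoted in the article: R-squared = 0.11 from 25 monthly points.
t = slope_t_statistic(0.11, 25)
T_CRITICAL_23_DF = 2.069  # two-sided 5% critical value, Student's t, 23 d.f.
print(f"t = {t:.2f}, significant at 5%: {t > T_CRITICAL_23_DF}")
```

The t statistic comes out near 1.69, short of 2.069, so by this test the “negative trend” is not distinguishable from a slope of zero.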

The moving average is next to useless. It also fails to tell you what is significant and what is not. All you know is whether the current month was above or below the previous average. In reality, as you update the moving average, the current month replaces the earliest month in the previous average. If the current month was higher than the earliest month, then the moving average increases. Of course, roughly half the data will be above average and half below average. An even worse structure is a cumulative average, where each average value has a differing number of data points in it. Thus a given shift in the data will either make a huge apparent change (early on in the accumulation of data) or hardly any change (once a large amount of data has accumulated).
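Both effects are easy to see in code. The following is a minimal sketch using made-up numbers, not the article's data. It shows that each step of a six-month moving average depends only on the month coming in and the month dropping out, and that the same jump moves a cumulative average a lot early on and hardly at all later.

```python
def moving_average(values, window=6):
    """Average of the most recent `window` points, one result per full window."""
    return [sum(values[i - window:i]) / window
            for i in range(window, len(values) + 1)]


def cumulative_average(values):
    """Running average of all points seen so far."""
    averages, total = [], 0.0
    for count, value in enumerate(values, start=1):
        total += value
        averages.append(total / count)
    return averages


# Made-up illustrative values (NOT the article's data).
series = [12, 9, 11, 8, 10, 13, 7]
ma = moving_average(series)
# The step in the moving average equals (incoming - outgoing) / window.
print(ma[1] - ma[0], (series[6] - series[0]) / 6)

# The same jump from 10 to 16, early versus late in a cumulative average.
early = cumulative_average([10, 16])
late = cumulative_average([10] * 24 + [16])
print(early[-1] - early[-2], late[-1] - late[-2])
```

In the second comparison the jump of 6 moves the cumulative average by 3 when it arrives as the second point, and by about 0.24 when it arrives as the twenty-fifth.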

What is truth in this case? Let us try to find the answer using a control chart. The control chart is shown below as Figure 4. This chart shows that the data are actually stable, and that no change has occurred. There are no significant trends on this graph. For more details on control charting, please see the Hanford Trending Primer at http://www.hanford.gov/safety/vpp/trend.htm.
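The article does not say which type of control chart was used or how its limits were calculated. As a hedged illustration only, the sketch below assumes an individuals (XmR) chart, where the limits sit at the average plus or minus 2.66 times the average moving range. The data values and function names are made up for the example.

```python
def xmr_limits(values):
    """Limits for an individuals (XmR) chart.

    Returns (lower_control_limit, center_line, upper_control_limit).
    The factor 2.66 is 3 / 1.128, the usual constant for moving ranges
    taken over two consecutive points.
    """
    center = sum(values) / len(values)
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    mr_bar = sum(moving_ranges) / len(moving_ranges)
    half_width = 2.66 * mr_bar
    return center - half_width, center, center + half_width


def points_outside(values):
    """Indexes of points beyond either control limit (one signal rule only)."""
    lcl, _, ucl = xmr_limits(values)
    return [i for i, v in enumerate(values) if v < lcl or v > ucl]


# Made-up illustrative values (NOT the article's data).
monthly = [9, 12, 7, 11, 14, 8, 10, 13, 6, 10]
lcl, center, ucl = xmr_limits(monthly)
print(f"LCL = {lcl:.2f}, average = {center:.2f}, UCL = {ucl:.2f}")
print("points outside limits:", points_outside(monthly))
```

Only the simplest signal, a point beyond a limit, is checked here. Control chart practice typically adds run rules as well, which this sketch leaves out.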

And where did this data come from? It was generated from a normal distribution random number generator, following an average of 10 and a standard deviation of 3. Only the control chart gave us the correct interpretation of the data.

[Figure 4: control chart of the data, Jan-98 through Jan-00, showing the Upper Control Limit, Average = 10.3 (Jan98 - Jan00), and the Lower Control Limit]

Steven S Prevette
ASQ Certified Quality Engineer

This article is to appear in the October 1999 ASQ newsletter
