PAD5700 week one

MASTER OF PUBLIC ADMINISTRATION PROGRAM
PAD 5700 -- Statistics for public management
Fall 2013
Introduction and relative standing

Statistic of the week

x̄ -- the sample mean

Greetings, and welcome to PAD-5700 Statistics for public management. We will start this class with a bit of statistics, then follow this with course mechanics. Levin and Fox offer a useful definition of statistics: "A set of techniques for the reduction of quantitative data (that is, a series of numbers) to a small number of more convenient and easily communicated descriptive terms" (p. 14).

An introduction to the course

Numbers are clean, crisp and unambiguous, though they may clearly, crisply and unambiguously represent a murky, sloppy and ambiguous reality; or may represent equally murky, sloppy and ambiguous research; may even on the odd occasion get distorted by dishonest people! But descriptive statistics can be very powerful ways of making an argument. For instance, a big issue in today's political climate concerns out of control government in America. What do the numbers show?

Table 1
Size of government compared

            Economic   Size of      Regulation   Gov't1
            freedom    government                % GDP
G7+
US          7.93       7.13         7.78         16
Australia   7.83       6.80         7.91         17
Canada      7.92       6.54         8.16         19
France      7.20       5.43         6.45         23
Germany     7.47       5.64         6.34         18
Italy       6.75       5.71         5.77         21
Japan       7.38       6.18         7.34         18
Sweden      7.26       3.61         7.16         26
UK          7.78       6.02         7.76         22
BRICs
Brazil      6.18       6.39         5.00         20
China       6.44       3.28         5.93         11
India       6.48       6.84         6.16         12
Russia      6.57       7.27         5.69         18
Laggards
Pakistan    5.80       7.71         6.12         11
Nigeria     5.93       5.89         6.99         n/a
Vietnam     6.15       6.27         6.34         6
Venezuela   4.35       5.09         4.91         14

Sources: The first three columns are from the Fraser Institute and Cato Institute's 2010 Economic Freedom of the World Report. The data is on a 1-10 scale, with 10 equal to more economic freedom (i.e. less government 'meddling'). The final column is from the World Bank's World Development Report 2011, pages 350-1.
Note: 1 -- Government final consumption, as % GDP.

Table 1 provides some data on this. It can also walk us through some of the basic

concepts that we will discuss in this course.

Variables (see Berman & Wang, p. 21) – Economic freedom is more or less what we’re talking about: how ‘free’ is the US economy from this allegedly over-bearing ‘state’?

Operationalization (Berman & Wang, pp. 48-52) – Having conceptualized the variable that we want to analyse, we now need to ‘operationalize’ it, or put it into operation mathematically. In plain English: we need to measure economic freedom. Two conservative thinktanks, the Cato and Fraser Institutes, have provided an ‘operationalization’ of economic freedom in their annual Economic Freedom of the World Report. They indicate how they ‘operationalize’ economic freedom in their Chapter One, and provide more extensive details in the Appendix.

Descriptive statistics (Berman & Wang, pp. 103-4) – Having operationalized economic freedom, we can get some sense of how the US compares. Note that the Fraser/Cato economic freedom index measures economic freedom, so higher scores = more freedom, or less government meddling. Table 1 indicates that the US does quite well, indeed was the most ‘free’ economy of the countries listed. One can also go to the original report (linked in the sources) and go to page 7, which lists all countries. The US ranked 6th on this scale, with only Switzerland, Chile, New Zealand, Singapore and Hong Kong more ‘free’.

As we will see below this brief introduction to the course, descriptive statistics especially include measures of central tendency and measures of dispersion. In addition to describing datasets in terms of where they are centered and how spread out they are, these measures (especially the standard deviation) also allow us to draw extremely powerful inferences. This is because the standard deviation also serves as a measure of relative standing.

Inferential statistics (Berman & Wang, pp. 163-5) – We can also draw ‘inferences’ with data. As Berman & Wang put it: “[I]nferential statistics allow inferences to be made about characteristics of the population from which the data were drawn. A key application for these statistics is to address whether a relationship exists in the first place, and inferential statistics provide statistical evidence for answering this important question” (p. 163). Three especially powerful types of inferential methods that we’ll use will be hypothesis testing, correlation analysis (which Berman & Wang tend to treat as bivariate regression), and multivariate regression.

Hypothesis testing. The data presented above are from 2008. Assume that we wanted to see if economic freedom has changed since ‘The Great Recession’. We can test this by comparing the mean economic freedom score in 2008 with the mean economic freedom score in 2012. Because I have a 2012 economic freedom score loaded onto this dataset for a competing index – The Index of Economic Freedom produced by the Heritage Foundation – I’ll use their figures. Heritage reports a 2008 score of 60.08 (they use a 100 point scale) and a 2012 score of 60.19: an increase. However the big question in hypothesis testing is, as Berman & Wang suggest above, whether a relationship exists between the passing of these four years and economic freedom around the world. Or as I like to put it: how likely is it that the difference that we note could have come about by chance. A hypothesis test of Heritage’s 2008 and 2012 scores shows a significance

level of 0.684. In plain English, this tells us that even if there had been no change in the economic freedom score over these four years, 68.4% of the time pure chance would give you a difference as large as the one we observe (between the 2008 score of 60.08 and the 2012 score of 60.19). Given that it is very likely this much change could have occurred randomly, we cannot conclude that economic freedom has changed from 2008 to 2012, or to put it technically: the observed difference is not statistically significant.
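The logic of this kind of test can be sketched in a few lines of Python. Note the scores below are made-up illustrative numbers, not the actual Heritage country data, and the p-value uses a normal approximation to the t distribution (a reasonable shortcut for samples as large as these country indices; SPSS uses the exact t distribution).

```python
from math import sqrt
from statistics import mean, stdev, NormalDist

# Hypothetical 2008 -> 2012 changes in economic freedom scores for
# ten countries (made-up numbers, NOT the actual Heritage data).
diffs = [1, -1, 2, 0, -2, 1, 0, -1, 1, -1]

n = len(diffs)
d_bar = mean(diffs)              # mean change: the 'observed difference'
se = stdev(diffs) / sqrt(n)      # standard error of the mean change
t = d_bar / se                   # paired-samples t statistic

# Two-tailed p-value: how often pure chance would give a difference
# at least this large, if there were really no change at all.
p = 2 * (1 - NormalDist().cdf(abs(t)))
print(t, p)
```

With these toy numbers the changes cancel out exactly, so the t statistic is 0 and the p-value is 1: chance alone would produce a difference this large essentially every time, so we could conclude nothing.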

Correlation (Berman & Wang, from p. 239) – The implication of the concerns about out of control government in the US is that this hurts the country. So it is hypothesized that less economic freedom correlates with worse social and economic outcomes. If we were able to ‘operationalize’ social and economic outcomes, we could then look to see if these two are related in some way. Happily, such measures of socio-economic outcomes exist; an especially popular one is the Human Development Index, produced by the United Nations Development Program. For a description of the index, click here; for recent results, click here, page 16.

Figure 2 is a ‘scatterplot’ (Berman & Wang, pp. 240-1), which is a visual presentation of correlation results. Most folks can see that the trend is up and to the right: as economic freedom increases, so does human development. Correlation analysis can also be done quantitatively. SPSS results for the same relationship presented in Figure 2 are presented in Table 3.
 The ‘N’ refers to the sample size: 139 countries in the dataset used (my ‘Global Government’ dataset, linked here) reported results for both of our variables.
 The Sig. (2-tailed) refers to the statistical significance of this result. The likelihood that this result occurred randomly is close to zero (.000).
 Finally, the Pearson Correlation (Berman & Wang, pp. 245-6). It is a scale from -1 to 1. The positive number tells us the two variables are positively related. The number indicates the strength of that

Table 3
Correlation between economic freedom and human development

                                            Economic freedom   Human development
                                            (2008 -- Cato)     index, 2007 (HDR 2009)
Economic freedom     Pearson Correlation    1                  .692**
(2008 -- Cato)       Sig. (2-tailed)                           .000
                     N                      141                139
Human development    Pearson Correlation    .692**             1
index, 2007          Sig. (2-tailed)        .000
(HDR 2009)           N                      139                182

**. Correlation is significant at the 0.01 level (2-tailed).


relationship. As a rule of thumb, a figure of 0-0.3 is considered a weak relationship; 0.3 to 0.6 a moderate relationship; and 0.6 to 1.0 a strong relationship. So that Pearson Correlation of 0.692 suggests that there is a strong, positive relationship between economic freedom and human development.
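The Pearson Correlation is easy to compute by hand, or in a few lines of code. The sketch below uses only the standard library, and toy data rather than the actual freedom/HDI dataset; the function name is my own. It just shows the mechanics: covariance, scaled by both spreads, giving a number between -1 and 1.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation: covariance scaled by both spreads."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sx = sqrt(sum((xi - mx) ** 2 for xi in x))
    sy = sqrt(sum((yi - my) ** 2 for yi in y))
    return cov / (sx * sy)

# Toy data: a nearly linear, positive relationship
# (stand-ins for economic freedom and human development scores).
freedom = [1, 2, 3, 4, 5]
development = [1, 2, 3, 4, 6]
r = pearson_r(freedom, development)
print(round(r, 3))
```

A perfectly linear relationship would give r = 1 exactly; the slightly off-trend last point here pulls r just below 1, which is how strong real-world relationships like the 0.692 above typically look.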

Multivariate regression analysis (Berman & Wang, from p. 252) – Here the idea is to look at relationships between a number of causes and an outcome (to grossly simplify here in the first hour of this course). We will use a simple model to illustrate. In the correlation analysis above, we looked at the effect of economic freedom on human development. Now we will look at the effect of economic freedom and effective public services on human development. For these multiple ‘independent’ variables, multivariate regression will look at the effect of economic freedom on human development, holding constant the quality of public services; then will look at the impact of public services on human development, holding constant the degree of economic freedom. Results are presented in Table 4, below.

Table 4
Regression of economic freedom and public services on human development

Variable           Coefficient        Standardized   T value   Probability
                   (standard error)   coefficient
Constant           0.346                             5.86      0.000
Economic Freedom   0.018 (0.010)      0.089          1.71      0.090
Public Services    0.063 (0.004)      0.842          16.21     0.000

Adjusted r2 = 0.820   F(2, 135) = 313.4   p = .000

We’ll hold off on the details of interpreting multivariate regression for a couple of weeks, but for now the key point is that the effect of economic freedom on human development drops dramatically once the quality of public services is held constant. This is especially evident in the standardized coefficient (pp. 259-60), which is interpreted the same way as the Pearson Correlation.
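What ‘holding constant’ actually does can be illustrated with a well-known trick (the Frisch-Waugh result): regress each variable on the control, keep the leftovers (residuals), and the slope between those residuals equals the multiple-regression coefficient. The data and helper functions below are mine, made up so that the outcome is an exact function of the two predictors; this is a sketch of the idea, not the SPSS procedure.

```python
def slope_intercept(x, y):
    """Ordinary least squares for a single predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    return b, my - b * mx

def residuals(x, y):
    """What is left of y after removing its linear dependence on x."""
    b, a = slope_intercept(x, y)
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]

# Made-up data: hdi = 1 + 2*freedom + 3*services, exactly.
freedom  = [1, 2, 3, 4, 5]
services = [2, 1, 4, 3, 5]
hdi = [1 + 2 * f + 3 * s for f, s in zip(freedom, services)]

# Effect of freedom on hdi, holding services constant:
r_freedom = residuals(services, freedom)  # freedom, purged of services
r_hdi = residuals(services, hdi)          # hdi, purged of services
b_freedom, _ = slope_intercept(r_freedom, r_hdi)
print(b_freedom)
```

The recovered slope is the 2 we built into the data: the ‘freedom’ effect with ‘services’ held constant, which is exactly what the multivariate coefficients in Table 4 report.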

Descriptive statistics

Measures of central tendency
 Mean -- Note the notation for the equation for this on page 107 of Berman and Wang. Don't freak out, we will simplify this. For our purposes, the following functions as a useful formula for the sample mean:

x̄ = Σxi / n

 Median -- the middle number in a dataset (Berman & Wang, pp. 109-12)
 Mode -- the number that occurs most often (Berman & Wang, p. 112)
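Python's standard library computes all three measures directly. The incomes below are made-up numbers, just to show the calls:

```python
from statistics import mean, median, mode

incomes = [25, 30, 30, 35, 45, 200]  # made-up incomes ($1000s)

print(mean(incomes))    # arithmetic average
print(median(incomes))  # middle value (here, the average of 30 and 35)
print(mode(incomes))    # most frequent value
```

Note how the one outlier (200) pulls the mean well above the median -- a good reason to report both.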


Measures of variation
 Range -- the high minus the low (Berman & Wang, p. 73)
 The average deviation (O'Sullivan, pp. 351-2) -- the mean absolute value of the distances of each observation from the sample mean. Expressed mathematically, it would look like this:

AD = Σ|xi - x̄| / n

 Q. Why absolute value? A. To avoid positives and negatives cancelling themselves out.
 Standard deviation -- the square root of the mean of the squared distances of each observation from the sample mean (O'Sullivan, pp. 350-1). Again expressed mathematically, it looks like this:

s = √[ Σ(xi - x̄)² / (n - 1) ]

To break down this formula into its constituent parts:
 Start with subtracting the mean from individual observations. By doing this we find out how far each observation varies, or deviates (hence standard 'deviation'), from the mean.
 We square these deviations, for the reasons explained above (to avoid the ‘cancelling out’ problem).
 The Greek doo-hickey (sigma: Σ) is the summation sign, which just says "add them all up,” so we are going to add up all of the squared deviations.
 By then dividing the sum of the squared deviations by the number of observations, we get a mean squared deviation.
 Keeping in mind, of course, that we use n-1 in the denominator rather than n, for reasons explained below.
 Finally, we take the square root of the mean squared deviation, to get back to the original, intuitive units.
 Q. Why n-1? A. 'Degrees of freedom'. Q. Hello?
 "You may wonder why we use the divisor (n - 1) instead of n when calculating the sample variance. Wouldn't using n seem more logical, so that the sample variance would be the average squared distance from the mean? The trouble is, using n tends to produce an underestimate of the population variance... So we use (n - 1) in the denominator to provide the appropriate correction for this tendency. Since sample statistics like s2 are primarily used to estimate population parameters..., (n - 1) is preferred to n when defining the sample variance" (McClave and Sincich, p. 58).
 If this is confusing, Levin and Fox are a little more helpful: "...Thus, the sample variance and the standard deviation are slightly biased estimates (tend to be too small) of the population variance and standard deviation. It is necessary, therefore, to let out the seam a bit... To do so, we divide by N - 1 rather than N" (Levin and Fox, p. 134).
 And see Berman and Wang, in a footnote (#5, on page 186).
 Notes:
 Because of the squaring, the standard deviation tends to emphasize outliers, other things being equal.
 The standard deviation is also socially constructed. It has no obvious, intuitive superiority over other methods of expressing variation. But it is the accepted method, and much of statistics is based on it.


The standard deviation drives statistics (I'll argue). Get to know it: it's cool, and wants to be your friend. Practice calculating it, as this is the best way to become familiar with what the thing does. It is simply a way of measuring the mean (average) variability within a sample. I have done an example, step by step, which I think you can access through the following link:  Standard deviation example calculation
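Here is such a calculation, step by step, in code, following the bullet points above. The five observations are made up:

```python
from math import sqrt

data = [2, 4, 6, 8, 10]
n = len(data)
x_bar = sum(data) / n                   # the sample mean (here, 6.0)

deviations = [x - x_bar for x in data]  # how far each point is from the mean
squared = [d ** 2 for d in deviations]  # square, to avoid cancelling out
mean_sq = sum(squared) / (n - 1)        # divide by n-1, not n
s = sqrt(mean_sq)                       # square root: back to original units

print(s)
```

For this dataset the squared deviations sum to 40, so the variance is 40/4 = 10 and the standard deviation is √10 ≈ 3.16.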

Relative standing

Having calculated the standard deviation, one could fairly ask: so what? Is this just yet more inane, useless, fancy schmancy academic nonsense that is forced down your throats at university?

Wellll...maybe. But:

Variability. First, the standard deviation gives us some sort of indication of variability, so that one can compare two samples drawn from the same sort of units, and get some idea which varies more, which less.

Relative standing. Beyond this, the standard deviation is also (indeed more) useful as a measure of relative standing. Measures of relative standing include things like percentile rankings, something that many of us would have received with GRE (or earlier SAT) scores. The standard deviation allows us to do something similar (Berman and Wang, pp. 124-8).

Essentially, the structure of the standard deviation equation is such that when you throw a sample in there, the standard deviation that gets spat out defines the relationship indicated in the figure at right (source, and described in Berman and Wang, p. 124).

The standard deviation is a numerical measure of variability, just as a bell shaped frequency distribution is a graphic representation of variability.

Imagine a bell shaped frequency distribution of (say) income levels in (say) Bradford County. Assume the mean family income is $45,000 (census.gov indicates that the median was $41,397 in 2008; for a roughly symmetric, bell-shaped distribution the mean and median are about the same). Imagine that this data was plotted on a frequency distribution, with the lower end tapering off close to the x axis at about $15,000 income, and the upper end tapering off close to the x axis at about $75,000 income. The center of the frequency distribution is, of course, at the mean of $45,000. Draw it out if you have to. Now given this distribution of incomes for Bradford County residents, the likelihood of someone having an income of $125,000 would be very low. Your diagram indicating the distribution of family incomes in the county indicates that there just aren't people that wealthy in the county.


The standard deviation numerically does the same thing. In this case, the number of standard deviations an observation is from the mean tells you a great deal about how likely the observation is. As the figure above (and Berman and Wang, p. 124) indicates, about 99.7% of observations in a sample would be expected to fall within 3s (three standard deviations) of the mean. So the Bradford County income example above would have a standard deviation of about $10,000. Given a mean of $45,000 and a standard deviation of $10,000, a figure of $125,000 would be a whopping 8 standard deviations from the mean, something which is very, very unlikely.

This is expressed in terms of 'z score', a concept which we will use as well. Note the equation for the z score (Berman & Wang, p. 125):

z = (x - x̄) / s

This simple equation does exactly what we just did for income of Bradford County residents: gives us an indication of the relative standing of an observation by seeing how far the observation is from the mean (by subtracting the mean from the observation) then dividing this difference by the standard deviation. The distance, in standard deviations of an observation from the mean, tells us how likely it is.
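The z score equation, applied to the Bradford County numbers from the text, in code:

```python
def z_score(x, x_bar, s):
    """Number of standard deviations an observation lies from the mean."""
    return (x - x_bar) / s

# Bradford County example from the text: mean $45,000, s $10,000.
z = z_score(125_000, 45_000, 10_000)
print(z)
```

An observation 8 standard deviations out is, for a roughly bell-shaped distribution, essentially impossible -- which is exactly what the frequency-distribution sketch told us.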

Introduction to SPSS

For this introduction to SPSS we will set up an SPSS spreadsheet, using the income-raw data linked here.

Step 1 -- SPSS (and most spreadsheets) have variables along the top, and individual cases along the left hand side. In SPSS, step 1 is to create the variables. To do this, you click 'Variable View', in the lower left hand side of the spreadsheet. The names of the four variables can then be typed into the first, 'Name' column of the 'Variable View' function. Call these:
 case
 inc2000
 inc2005
 city

Step 2 -- the second column allows you to indicate what type of variable you are using. The two most common choices are 'String' and 'Numeric'. The latter is clear enough. String refers to non-quantifiable variables, like the names of the individual members of this dataset. So call case a string variable, and make the other three numeric. 'City' also seems like a non-numeric type of variable, and it is. However, we will do numerical analytical stuff with this variable, and if we call it a string variable SPSS won't allow us to do this.

You can also specify the number of decimals that you want reported, and adjust the width of the column in the data set.

Step 3 -- Label. The variable names listed in Step 1 are for internal SPSS purposes. In the 'Label' column, you can give the variables longer names for reporting purposes. Do this as follows:
 case -- Case


 inc2000 -- 2000 income ($1000s)
 inc2005 -- 2005 income ($1000s)
 city -- City

Step 4 -- Values. The Values column allows you to specify values of nominal variables that you intend to use in the analysis. This small dataset has two variables that are names, rather than numbers: case and city. Case is just a designation of the individuals referred to, and has no analytical value. City, on the other hand, does have analytical value. SPSS is a statistical (quantitative) package, though, so we need to give numbers to these two cities. We do this as follows:
 Click on the right side of the cell for 'city' in the Values column. This will open a 'Value Labels' window.
 Type 0 for Value, and North Bend for Label, click Add.
 Type 1 for Value, and South Bend for Label, click Add.
 Click OK

Step 5 -- return to Data View (bottom left hand side button) and insert the data.

Descriptive statistics. Having put your dataset together, you can now run descriptive statistics like the ones we discussed above:
 Go to Analyze, Descriptive statistics, Descriptives.
 Put '2000 Income' in as Variable.
 Click 'Options', and select (check) Mean, Std. Deviation, Range, Minimum, Maximum, S.E. Mean.
 Click 'Continue', then 'Okay'. You should get this:

Table 5
Descriptive Statistics -- North Bend & South Bend

                       N    Range    Minimum   Maximum   Mean      Std. Error   Std. Deviation
2000 income ($1000s)   20   175.00   25.00     200.00    43.2500   8.36404      37.40514
Valid N (listwise)     20

Case summaries. We can also get wild and crazy and do a quick table presenting the difference between North Bend and South Bend.
 Go to Analyze, Reports, Case Summaries.
 Load ‘2000 income’ and ‘2005 income’ in as Variables.
 Load ‘City’ as ‘Grouping Variable’.
 Click off the ‘Display cases’.
 Click ‘Statistics’, and let’s keep it simple: in ‘Cell Statistics’, load Mean, Median, Minimum, Maximum, and Standard Deviation.
 Click ‘Continue’, and ‘Okay’. You should get Table 6 (I deleted the ‘Totals’ at the end):

Table 6
Case Summaries -- North Bend & South Bend

                             2000 income   2005 income
City                         ($1000s)      ($1000s)
North Bend   Mean            51.50         66.00
             Median          36.00         40.50
             Minimum         25            30
             Maximum         200           291
             Std. Deviation  52.610        79.526
South Bend   Mean            35.00         40.00
             Median          35.00         40.00
             Minimum         25            30
             Maximum         45            50
             Std. Deviation  5.888         5.888

I know what you’re thinking: OMG make it end!!!


It has (ended). Note that the Mean 2000 income for North Bend is higher than that for South Bend: $51,500 v. $35,000. But don’t forget the 2008 v. 2012 Economic Freedom scores that we looked at above: though different, this difference was not statistically significant. We’ll look at this when next we meet. Given that I’ve got some space below, and so will be wasting no more paper, I’ll run the test.

Hypothesis test (independent samples test):
 Click Analyze, Compare Means, Independent-Samples T Test.
 Load 2005 income as Test Variable.
 Load city as Grouping Variable. Click ‘Define Groups’ and type 0 (that is how we coded North Bend) for Group 1, and 1 (that is how we coded South Bend) for Group 2.
 Click Continue and Okay. You should get this:

Table 7
Group Statistics -- North Bend v. South Bend, 2005 income

                                   N    Mean    Std. Deviation   Std. Error Mean
2005 income ($1000s)  North Bend   10   66.00   79.526           25.148
                      South Bend   10   40.00   5.888            1.862

Table 8
Independent Samples Test, North Bend v. South Bend, 2005 income

              t       df   Sig. (2-tailed)
2005 income   1.031   18   .316

I’ve reformatted Table 8. In a nutshell, what this tells us is that the variation in the sample (and the small sample size!) is such that the observed difference ($66,000 v. $40,000) isn’t large enough that we can be confident that the difference didn’t just result from randomness. The significance value for the test (.316 or .329, we will discuss the difference) means that there is about a 1/3rd chance that we would randomly observe a difference as large as $26,000 (i.e. $66k - $40k), so we cannot conclude that North Bend is richer than South Bend.
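In fact you can reproduce the t statistic in Table 8 from the group statistics in Table 7 alone, using the pooled two-sample formula (the ‘equal variances assumed’ version of the test):

```python
from math import sqrt

# Group statistics from Table 7 (2005 income, $1000s).
n1, mean1, sd1 = 10, 66.00, 79.526   # North Bend
n2, mean2, sd2 = 10, 40.00, 5.888    # South Bend

# Pooled variance: a weighted average of the two sample variances.
sp2 = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
se = sqrt(sp2 * (1 / n1 + 1 / n2))   # standard error of the difference
t = (mean1 - mean2) / se
df = n1 + n2 - 2

print(round(t, 3), df)
```

The result matches Table 8: t of 1.031 with 18 degrees of freedom. Notice how North Bend's enormous standard deviation (79.526, thanks to that one $291k case) inflates the standard error and drags the t statistic down.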

Hypothesis test (Paired samples test): if you’re bored, try this one, to see if these two cities combined have become richer from 2000 to 2005.
 Click Analyze, Compare Means, Paired-Samples T Test.
 Load 2000 income and 2005 income in as Paired Variables. Click okay.

References

Levin, Jack and James Fox (2011). Elementary Statistics in Social Research. Allyn and Bacon.
McClave, James and Terry Sincich (2003). A First Course in Statistics. Prentice Hall.
O’Sullivan, Elizabethann, Gary Rassel and Maureen Berner (2008). Research Methods for Public Administrators. New York: Longman.
Welch, Susan and John Comer (1988). Quantitative Methods for Public Administrators. The Dorsey Press.