
Chapter Coverage and Supplementary Lecture Material: Chapter 17

Please replace the hand-out in class with a print-out or on-line reading of this material. There were typos and a few errors in that hand-out. This covers (and then some) the material in pages 275-278 of Chapter 17, which is all you are responsible for at this time.

There are a number of important concepts in Chapter 17 which need to be understood in order to proceed further with the exercises and to prepare for the data analysis we will do on data collected in our group research projects. I want to go over a number of them, and will be sure to cover those items that will be on the quiz.

First, at the outset of the chapter, the authors explain the difference between quantitative analysis which is descriptive and that which is inferential. They then explain the difference between univariate analysis, bivariate analysis, and multivariate analysis.

Descriptive data analysis, the authors point out on page 275, involves focusing on the data collected per se, which describe what Ragin and Zaret refer to as the object of the research: what has been studied (objectively, one hopes). This distinction between the object of research and the subject of research is an important one, and will help us understand the distinction between descriptive and inferential data analysis. As I argue in the paper I have written: “In thinking about the topic of a research project, it is helpful to distinguish between the object and subject of research. Ragin and Zaret distinguished between the object of research, which are the observational units, and the subject of research, such as relationships among variables (Ragin and Zaret 1983), the nature of a social mechanism, or some other subject.” (Dover, Michael. 2006. “Teaching Yourself How to Write a Thesis: Several Easy Steps,” article submitted to Teaching Sociology, citing Ragin, Charles and David Zaret. 1983. "Theory and Method in Comparative Research: Two Strategies." Social Forces 61:731-53.)

                                 Univariate        Bivariate                Multivariate
                                 (1 Variable)      (Relationship between    (Relationship between
                                                   2 variables)             3 or more variables)

Descriptive                      Frequencies       To Be Discussed          To Be Discussed
(Characteristics of a sample
itself or of a population)

Inferential                      To Be Discussed   To Be Discussed          To Be Discussed
(Inference from a sample to
a larger population)

As the text points out (p. 275), “Descriptive analysis does not provide a basis for generalizing beyond our particular study or sample.” In other words, descriptive analysis doesn’t permit coming to conclusions, based upon the empirical results, about anything other than the object of the research. It doesn’t permit inferring anything about a larger population, even if the sample studied was a random sample of that larger population. Doing so requires the use of inferential data analysis, which is covered in the second half of this chapter, after it covers descriptive data analysis. The authors imply an answer to, but evade, the question of whether relationships between variables can be studied in descriptive data analysis. They do point out, “Even when we describe relationships between variables in our study, that alone does not provide sufficient grounds for inferring those relationships exist in general or have any theoretical meaning.” Implied here is that one can in fact discuss relationships among variables, but because most social science research focuses on efforts to generalize to larger populations, mere descriptive research is often considered of lesser scientific importance. In social work research, however, it is often the case that the population of interest is, for instance, the clients of an agency, or the workers within an agency, or a defined neighborhood area. Important research can be done that is descriptive about such a population, even if one can’t infer to larger populations.

In other words, even in descriptive research, the subject of a research project can be relationships between variables among a sample studied, if the sample studied is the actual population of interest. This is an important point not covered by the text, and it is one which is relevant to our research this term. If one is studying an entire population, say all SWK 100 students or all SWK 250 students, one can do data analysis that is descriptive but which permits studying characteristics of that entire population. One assumption behind inferential analysis is that one is starting with a random sample of a population, such that one can analyze relationships among variables found in the sample and project with some degree of confidence that the same findings would apply to the larger population. It is widely agreed that the larger a random sample, the better, resources permitting. In fact, the larger the random sample, the smaller the sampling error associated with the research. But think about that for a moment. What if one took a 99% random sample, i.e. one drew from a population a sample that included all but 1% of the population? Could one infer results from a “descriptive” data analysis of that sample that applied to the entire population? For instance, say a strong relationship was found between gender and belief in the importance of HIV testing, and that this was found at the .01 level of statistical significance. This means that, if there were no such relationship in the population, fewer than 1% of random samples drawn from it would show a relationship this strong by chance. Such an interpretation is made irrespective of the size of the random sample! But, thinking logically, one could communicate such a finding in, say, a report to respondents or a press release with greater confidence if there was a large random sample, correct? So what if the random sample were 99% of the population? Still true, presumably.

But what if the entire population was studied, with a 100% response rate, in other words, all possible members of that population responded? The relationship between gender and belief in the importance of HIV testing could still be reported. Therefore, at the level of a population being studied as a population, the distinction between descriptive and inferential data analysis disappears. Accordingly, there is nothing magical about the distinction between descriptive and inferential data analysis. And there is nothing “inferior” about descriptive data analysis compared to inferential data analysis. I might add, I disagree with the authors that descriptive analysis can’t be of theoretical relevance. As I argue: “As such, my dissertation was a theoretically relevant exploratory and descriptive study. One lesson here is that theory is relevant not only to explanatory studies but to descriptive studies as well.” (Dover, 2006).

Descriptive

First, let’s discuss univariate analysis, the analysis of characteristics of a single variable, such as the distribution of values, the dispersion of values, etc. What is univariate analysis used for? Let’s take the example of age in our dataset. Univariate analysis of age can help us understand the age distribution of the sample. It can’t tell us anything about the relationship between age and another variable, say education; that would be bivariate analysis. Nor can it tell us anything about the relationship between age and education, controlling for health; that would be multivariate analysis.

Univariate analysis - a single variable
Bivariate analysis - two variables
Multivariate analysis - more than 2 variables

What are some of the key types of univariate analysis? Three types of univariate analysis are referred to as ways of measuring the central tendency of a variable, and these include the mode, the median and the mean. But before discussing these, let’s be sure that we understand the concept of a frequency distribution. When one does a Histogram in Excel, as is done in Exercise 2, one is producing a frequency distribution that is expressed in the table that accompanies the Histogram chart. For each value of the variable (to be more specific, for each value which is a code assigned to an attribute of the variable, as well as for each value representing missing data of some kind, if there is such missing data), the frequency of that value is found. For an interval variable (definition below), there may be dozens or even hundreds of values found in a frequency distribution, but for ordinal or nominal variables, there are typically only a few or up to around a dozen such values. For them, a table such as that accompanying the histogram is a good way to display a frequency distribution. A frequency distribution, then, is an excellent way to do a univariate analysis of the characteristics of a single variable.
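As a rough illustration of what such a frequency distribution contains, here is a minimal Python sketch. The responses, the codes 1-4, and the use of 9 as a code for missing data are all hypothetical and invented for this example; they are not taken from our dataset or from Exercise 2.

```python
from collections import Counter

# Hypothetical coded responses for one variable (not real course data).
# Codes 1-4 stand for attributes; 9 is an assumed code for missing data.
responses = [1, 2, 2, 3, 2, 4, 9, 3, 2, 1, 4, 2, 9, 3]

def frequency_distribution(values):
    """Return (value, frequency) pairs, sorted by value."""
    return sorted(Counter(values).items())

freq_table = frequency_distribution(responses)
for value, count in freq_table:
    print(f"value {value}: {count} case(s)")
```

The printed table plays the same role as the table that accompanies a histogram: one row per value, including the value that stands for missing data.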

Another way to engage in a univariate analysis is to use a statistic or a parameter to characterize that variable. Note that the authors don’t use the term statistic! In fact, one can read entire textbooks on statistics and find the term statistic isn’t defined. Of the leading four dictionaries of sociology, only one defines the term statistic: “A mathematical value that summarizes the characteristics of a sample.” (George A. Theodorson and Achilles G. Theodorson, 1969, Modern Dictionary of Sociology). The mode, median and mean are three types of statistics (when applied to a sample) and are three forms of what are known as population parameters when applied to a population. Seen this way, statistics are hopefully less intimidating: they are merely ways of understanding something about a sample of a population or about a population itself. There are three common measures of central tendency: the mode, the median and the mean.

The Mode: The mode is the most important measure of central tendency for nominal variables, where mean and median are meaningless, because the attributes of nominal variables aren’t rank-ordered. The mode value, or modal value, of a single variable is simply the value representing the most frequent attribute of that variable. So for a nominal variable it is the mode, rather than the mean, which is the most suitable way of portraying the typical case in a distribution.

So, if within a population, the most common age is 60 (which is how old the first wave of baby boomers is today), the mode of the variable age would be 60. Remember, an attribute is a characteristic of a person or thing, and the values of a variable merely represent some attribute. Even with age, 58 is a number as well as a value of the variable age, and so you might think 58 is not an attribute of a person or thing, but believe me, it is! Being 58 years old is an attribute of a person and the value, 58, in the variable age represents that attribute! 58 is both a value and an attribute.

Now, say you had a variable TIRED and it was ordinal, and the attributes of the variable were “very tired”, “tired”, “not very tired”, and “not tired at all”. You might code these attributes with the values 1 (very tired), 2 (tired), 3 (not very tired), and 4 (not tired at all). Tired is an attribute of the variable TIRED. Now, let’s say that more respondents answered “tired” than any of the other three attributes. It would be the modal attribute. So when you say modal value, you mean 2. If you say modal attribute, you mean tired.

Let’s once again distinguish between a variable’s values and its attributes:

Values of a variable - the numbers assigned to attributes of a variable

Attributes of a variable - the characteristics of a person or thing which a variable seeks to measure by assigning values of some kind to the characteristics being measured.

And, as may be apparent, coding refers to the assignment of a number or numeral to the attributes of a variable (see p. 275). For each attribute, a code is assigned. For instance, 1 for male, or 65 for 65 years of age.
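To make the distinction between values and attributes concrete, here is a small Python sketch built on the TIRED example. The codebook follows the coding described above; the ten responses are hypothetical, invented only for illustration.

```python
from collections import Counter

# Codebook for the TIRED example: value (code) -> attribute.
TIRED_CODEBOOK = {
    1: "very tired",
    2: "tired",
    3: "not very tired",
    4: "not tired at all",
}

# Hypothetical coded answers from ten respondents (invented for illustration).
tired_values = [2, 1, 2, 3, 2, 4, 2, 3, 1, 2]

def modal_value(values):
    """Return the most frequent value (the first one counted if there is a tie)."""
    return Counter(values).most_common(1)[0][0]

mode_code = modal_value(tired_values)        # the modal value
mode_attribute = TIRED_CODEBOOK[mode_code]   # the modal attribute
print(mode_code, mode_attribute)
```

Here the modal value is the number stored in the data, and the modal attribute is what that number stands for in the codebook.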

Keep in mind that the fact that the most common value, the modal value, is 2, doesn’t imply anything at all about the other values or attributes of that variable. It is merely one statistic or parameter of the sample or population being described. The mode is a descriptive univariate statistic.

Again, the fact that the modal value is, say, 49 years of age doesn’t mean that half the subjects are over 49 and half are under 49, or anything at all like that. It merely means that more subjects were 49 than any other age. In fact, 99% of those who are 49 could turn 50 tomorrow, and 99% of those who are 50 may have had their birthday yesterday! So any value, 49 or 50, is merely a value assigned to an attribute, and it isn’t necessarily a perfect measure. True, an interval level variable like age tends to have fewer problems of the kind associated with an uneven distribution of exact values within the value coded as representative of an attribute. If you had age as an ordinal scale, say grouped in ten year ranges, 20-30, 30-40, etc., and if one of those ranges included at the upper end or lower end the beginning or end of the baby boom, the respondents within that range might not be evenly distributed. That is important to realize, for reasons that will be discussed further below. But as the above example about actual birth date shows, even within interval level measures like age in years, respondents aren’t necessarily evenly distributed.

The Median: The median is considered an ordinal measure of central tendency, and as such it can be used with ordinal and interval/ratio variables. This is because both ordinal and interval variables are rank-ordered. The median value of a variable is the value which describes the attribute which is in the middle of the range of all values of a rank-ordered variable. One way to understand the median is that half the cases are above the median and half the cases are below the median. That is true on this quiz and on most quizzes, and it is true in most research, because in most cases the median is used as an average for interval variables. But, and this is not what the text says, but what I say, when applied to ordinal variables, median does not refer to how many cases are at what rank-ordered category, but to what the middle category of the pre-determined rank-ordered categories is. If you have strongly disagree, disagree, neither agree nor disagree, agree, strongly agree, the median on this scale is always neither agree nor disagree, irrespective of whether everyone answered agree! So all you can say is that the actual score is above or below the median on the scale. You can say the modal score on the ordinal scale is the one most frequently chosen. And as explained further below, you can’t use the mean to describe such an ordinal scale except under certain circumstances.

Since the attributes of a nominal variable can’t be rank-ordered, it doesn’t make sense to use mean or median with a nominal variable. Yes, you could “sort” the values assigned to the attributes of a nominal variable according to their frequency. But this has no meaning of the sort that would permit median or mean, as they aren’t truly “rank-ordered.” For an ordinal scale, the median is entirely independent of the number of cases found at that level. It doesn’t matter if there is only one case with that value, and there are a hundred cases at each one of the 10 values lower ranked than it and a hundred cases at each one of the 10 values ranked higher than it. It is the median because there are 10 values above and 10 values below. On a Likert scale, with five points, the median is always the middle point on the scale. This is one reason why, if you want to describe an ordinal scale in terms of the median, it is useless to have an even number of categories, although the number of categories should be determined theoretically or empirically, not with a view to the kind of analysis to be done.

The median may be the mode; in fact it often is. Take the bell curve. In a perfect bell curve, the median, the mode and the mean (average) are all the same value. (To be completely accurate, in a distribution where several values are tied for most frequent, multiple modes exist, and the mean and the median may each be one of those modes.)

Take the following distribution, which I’ve posted as a spreadsheet, mynormal.xls, to the quantitative data analysis page. The median (middle) value is 10, the mean value is 10, and one of the multiple modes is 10 (since there is only one case per value, every value is a mode). In mynormal.xls and mynormal.sav (an SPSS file), I’ve also created another column with 4 cases per value, with an N (total number of cases) of 164, and a third variable with 3 of the cases of 10 removed, leaving only 1 value of 10, and an N of 161. That value of 10 is still the mean and is still the median! But it is not the mode. The distribution is no longer a normal distribution. It is no longer a “bell curve”, since at the apex of the curve, it plunges down in the exact middle (at 10), so that you could draw a line parallel to the x axis from 9.6 to 10.4. This normal distribution is a good place to start with thinking about univariate data analysis.

Normal 1 Case Per Value N=41

2 2.4 2.8 3.2 3.6 4 4.4 4.8 5.2 5.6 6 6.4 6.8 7.2 7.6 8 8.4 8.8 9.2 9.6 10 10.4 10.8 11.2 11.6 12 12.4 12.8 13.2 13.6 14 14.4 14.8 15.2 15.6 16 16.4 16.8 17.2 17.6 18

These data represent 41 numbers beginning with 2 and ending with 18, spaced .4 apart. If you plot a histogram with a normal curve superimposed on them, for 4 sets of these numbers (N=164), you find that the left tail of the curve begins at the value of 2 on the x axis, at a height of approximately 3.75 on the y axis, and curves up to the apex at approximately 13.75. Four is just an arbitrary number of sets that produces a readable histogram. What you see is that these values don’t represent a perfect bell curve, but that they have many of the characteristics of a normal distribution. The superimposed bell curve shows that a normal distribution of these values on the x axis (which range from 2 to 18) would involve having about 4 cases at 2 and about 14 cases at 10, with the curve descending again to about 4 cases at 18.
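The claims about the mean, median and mode of these columns can be checked with a short Python sketch. This is a reconstruction from the description in these notes, not the contents of mynormal.xls itself; the variable names are my own.

```python
from statistics import mean, median, multimode

# The 41 values from 2 to 18, spaced 0.4 apart (rounded to avoid float noise).
one_per_value = [round(2 + 0.4 * k, 1) for k in range(41)]

# Second column: 4 cases per value (N = 164).
four_per_value = [v for v in one_per_value for _ in range(4)]

# Third column: remove 3 of the 4 cases of 10, leaving a single 10 (N = 161).
dipped = list(four_per_value)
for _ in range(3):
    dipped.remove(10.0)

for name, data in [("1 per value", one_per_value),
                   ("4 per value", four_per_value),
                   ("10 thinned out", dipped)]:
    is_mode = 10.0 in multimode(data)   # is 10 among the most frequent values?
    print(name, len(data), round(mean(data), 6), median(data), is_mode)
```

In the third column, 10 remains the mean and the median, but it is no longer one of the modes, because every other value still has four cases.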

Again, it is important to point out that mean or median can’t be used to describe the central tendency of a nominal variable such as race.

More on the median and something on averages: The median is defined by the text as “an average that represents the value of the ‘middle’ case in a rank-ordered set of observations.” For people aged 17, 18, 19, 20 and 21, the median is 19. Why is it referred to as an “average”? Because average according to the text is “an ambiguous term that generally suggests typical or normal.” In fact, mean, median and mode are all “specific examples of mathematical averages”.
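For those who want to see the three averages side by side, here is a minimal Python sketch using the standard library. The first list is the age example from the text; the second list, with one very old respondent, is hypothetical and added only to show how the three measures can differ.

```python
from statistics import mean, median, mode

ages = [17, 18, 19, 20, 21]           # the example from the text
print(median(ages))                    # value of the middle case: 19
print(mean(ages))                      # arithmetic mean: 19

# Hypothetical ages with one extreme value (invented for illustration).
ages_with_outlier = [17, 18, 19, 19, 20, 21, 89]
print(mode(ages_with_outlier))         # most frequent age: 19
print(median(ages_with_outlier))       # middle case is still 19
print(mean(ages_with_outlier))         # pulled up to 29 by the one 89-year-old
```

The mode and the median stay at 19 in the second list, while the mean moves a long way because of a single extreme score.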

The authors are correct: what we normally call the average is the mean. That is why we should talk about means where possible, rather than averages. The arithmetic mean is the correct term. How do you calculate the median? It’s hard, actually, if you have many values on the variable (for instance, if it is interval). You almost have to do it manually or use SPSS. For ordinal variables, it is easy to find the median. Just look at the original coding and find the value assigned to the middle attribute of the rank-ordered variable. Do not sort the results by the frequency of the cases found at each level of an ordinal value, and pick the value that is in the “middle” of the sort! That is not the median! The median is pre-determined by the construction of the ordinal scale. Thus, where there is an even number of items in an ordinal scale, to speak of the median is meaningless.

The Mean: The mean is not always sufficient in portraying the typical case in a distribution. It can describe with a statistic one way of understanding the average of a distribution, if there aren’t a lot of extreme values. The mean is an arithmetic measure of central tendency. Strictly speaking, a mean is relevant only to an interval or ratio variable. Why? Because of the example I gave before about age 49 and 50. In other words, even with an interval variable like age, the value for the overall attribute age and the specific attribute 49 or 50 isn’t an exact measure, to the day. The same is true with each category of an ordinal variable. There is very tired, and there is VERY TIRED! It could be that if you were to ask, “How tired are you on a scale of 1-12, with 1 as the most tired?”, everyone in the range from 10-12 answered 12, whereas everyone in the range from 7-9 answered 7. Now, strictly speaking, a 12 point scale is not an interval scale. It’s just an ordinal scale of TIRED with many categories! But let’s assume we treat it as interval. Were the respondents equally distributed, with the same number answering 1 as 12 and everything in between, the mean score would be 6.5. But if they weren’t evenly distributed, as in the example where everyone that was between 7 and 9 answered 7, and everyone who was between 10 and 12 answered 12, the mean could be quite different. Now, let’s “recode” the evenly distributed variable down to a four point scale.

12-point score:      1  2  3       4  5  6      7  8  9           10  11  12
Recoded category:    Very Tired    Tired        Not Very Tired    Not Tired At All
                     (1-3)         (4-6)        (7-9)             (10-12)

Take the following example. You asked if people were very tired (theorized as the same as 10-12 on a 12 point scale), tired (same as 7-9), not very tired (same as 4-6), or not tired at all (same as 1-3), and you coded these four attributes 1, 2, 3, and 4, respectively, so that 1 was very tired, 2 was tired, and so forth. You found that equal numbers of respondents answered 1, 2, 3 and 4. Would it be accurate to say that the mean value of the ordinal variable “tired” was 2.5? If you knew for a fact that the cases were evenly distributed within each of the four categories of TIRED and that there were equal numbers of respondents in each of the four categories, such a statistic would have meaning. And if you had determined this prior to recoding the variable into an ordinal one, instead of having asked the question as an ordinal scale to begin with, you could report that fact and “treat” the variable as interval in statistical analyses. But in most cases you don’t know that. The danger is that for all you know, the people who answered “very tired” were all the equivalent of 12s on a 12 point scale, and the people who answered “tired” were at the equivalent of 7 on a 12 point scale. In which case your effort to use a mean score to describe a central tendency for TIRED would be very misleading.
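The danger described above can be sketched in a few lines of Python. Both sets of 12-point scores below are hypothetical, and the recoding function is my own illustration: it simply collapses the 12-point scale into four bands of three points each (1-3, 4-6, 7-9, 10-12) and numbers the bands 1 to 4. Which end of the scale counts as “very tired” does not matter for the point being made.

```python
from statistics import mean

def recode_to_four(score):
    """Collapse a 1-12 score into codes 1-4 using bands 1-3, 4-6, 7-9, 10-12."""
    return (score - 1) // 3 + 1

# Hypothetical respondents, one at each point of the 12-point scale.
evenly_spread = list(range(1, 13))

# Hypothetical respondents bunched at the bottom of each band.
bunched = [1, 1, 1, 4, 4, 4, 7, 7, 7, 10, 10, 10]

for name, scores in [("evenly spread", evenly_spread), ("bunched", bunched)]:
    recoded = [recode_to_four(s) for s in scores]
    print(name, mean(scores), mean(recoded))
```

Both groups end up with a mean of 2.5 on the four point scale, even though their means on the underlying 12 point scale are 6.5 and 5.5. The four point mean cannot tell you which situation you are in.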

The point I’m making above is that extreme scores within a range of values can affect the mean score. A few people very high in age could affect the average age. But depending upon where those who are “very high” are within a range of ordinally scaled values, it could produce a mean score of an ordinal scale which is not a good representation of what we are trying to measure. So it is the mean that is most affected by extreme scores.

Now let’s talk some more about some of the concepts in this chapter. After central tendency, the text discusses dispersion. The dispersion of a variable is defined as “the distribution of values around some central value, such as an average.” But another way of thinking about dispersion is what the range of the values of a variable is. The range is “a measure of dispersion that is composed of the highest and lowest values of a variable in some set of observations.” So you might have age values from 18 to 89; that is the range. The most important measure of dispersion is the standard deviation, a descriptive statistic which portrays “how far away from the mean individual scores on average are located.” In other words, it is a kind of average of the differences between each score and the mean for all scores! We’ll see how this is done in the next exercise. For now, see the handout from Knoke’s text on calculation of the standard deviation. And see my new spreadsheet, mynormal.xls, on the Quantitative Data Analysis main page. It shows how to calculate the standard deviation using the above data set of 41 cases that ranged from 2.0 to 18, and how that can be done in one cell using an Fx formula as well as how this is done using columns of data and math as simple as addition, subtraction, a square, and a square root. The standard deviation is calculated by subtracting the mean from each value, squaring each of these deviations from the mean, adding up the squares, dividing this sum by the degrees of freedom (the number of cases minus one), and then taking the square root of that number.
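The steps just listed can also be followed in a short Python sketch, using the 41 values from 2 to 18 described earlier. This is my own step-by-step version for illustration, not the contents of mynormal.xls; the last line uses the standard library's `stdev` function as a check, in the same spirit as the one-cell formula.

```python
import math
from statistics import stdev

# The 41 values from 2.0 to 18.0, spaced 0.4 apart.
values = [round(2 + 0.4 * k, 1) for k in range(41)]

n = len(values)
mean_value = sum(values) / n
deviations = [v - mean_value for v in values]   # subtract the mean from each value
squared = [d ** 2 for d in deviations]          # square each deviation
sum_of_squares = sum(squared)                   # add up the squares
variance = sum_of_squares / (n - 1)             # divide by the number of cases minus one
sd = math.sqrt(variance)                        # take the square root

print(round(mean_value, 4), round(variance, 4), round(sd, 4))
print(round(stdev(values), 4))                  # library check of the same quantity
```

Dividing by the number of cases minus one, rather than by the number of cases, is what makes this the sample standard deviation.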

Put simply, the standard deviation portrays how far away from the mean individual scores on average are located. To interpret the standard deviation, assuming a normal bell-curve like distribution, see the handout from Agresti and Finlay p. 53. As you can see, 68% of all cases (the number on the y or vertical axis) will be within one standard deviation of the mean, on either side of it. So you take the mean score, and subtract the standard deviation to get the smallest score on the X axis that will be within one standard deviation from the mean (the scores from the 16th to the 50th percentile). And you add the standard deviation to the mean to get the maximum score at the 84th percentile. So now you know the scores (or whatever other measure is relevant to the X axis) that are within the range represented by just over 2/3rds of respondents. You could do the same with 2 standard deviations to understand the range for 95% of all respondents, since approximately 95% of all respondents (the 2.5th to the 97.5th percentile) are within 2 standard deviations of the mean in such a normal distribution.
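As a rough check on these percentages, here is a small Python sketch using the standard library's `NormalDist`. The mean of 70 and standard deviation of 10 are hypothetical numbers chosen for illustration; they are not from the handout.

```python
from statistics import NormalDist

# Hypothetical scores: mean 70, standard deviation 10 (invented numbers).
scores = NormalDist(mu=70, sigma=10)

def share_within(dist, k):
    """Return the low score, high score, and share of cases within k SDs of the mean."""
    low = dist.mean - k * dist.stdev
    high = dist.mean + k * dist.stdev
    return low, high, dist.cdf(high) - dist.cdf(low)

for k in (1, 2):
    low, high, share = share_within(scores, k)
    print(f"within {k} SD: {low} to {high}, about {share:.1%} of cases")
```

With these assumed numbers, scores from 60 to 80 cover roughly 68% of cases, and scores from 50 to 90 cover roughly 95%.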

On p. 282 is material about effect size. This is used in clinical research. The main thing you need to know is how to think about whether an effect size is weak, medium or strong. Don’t assume an effect size is the same thing as a correlation. Because I consider the text confusing in this regard, I’ve eliminated questions about effect size from the quiz.

Statistical significance testing identifies the probability that our findings can be attributed to chance. It tends to be expressed in whether the findings are deemed statistically significant at some predetermined level, such as .01 or .05, meaning that, if there were no real relationship in the population, results like ours would turn up by chance in no more than 1% or 5% of random samples drawn from that population, to which the same measure was applied. In other words, it is a measure of how confident we can be that the results are not a random occurrence. Remember, an effect size is not the same as the overall value of an intervention. It is merely one measure of that value and must be considered in light of other findings. A determination of substantive significance or practical significance or, in clinical studies, clinical significance involves value judgments about the importance of the variables or the problem studied. It is not true that a relationship cannot have substantive significance unless it also has a large effect size or a high degree of statistical significance. For instance, the choice of .05 is arbitrary. If the probability that a finding is due to chance were .10 rather than .05, we might still be happy with that! However, I have removed questions about substantive, clinical and statistical significance from the Chapter 17 quiz; we’ll get into this when we discuss Chi Square.

Let’s talk some more about levels of measurement. The nominal level of measurement is a level which describes variables whose different attributes are categorical and can only be described in terms of how many cases fall into each category, not in terms of degree.

I handed out another handout to supplement our Decision Chart. It is from Agresti and Finlay. One thing about understanding statistics and research methods: you can’t count on being able to understand this just from reading one text and listening to one instructor. Find other texts in the library. Go on the web. Talk to other students. Stare at the data and ponder the definitions. Write your own! You have to make it make sense to you. Also, don’t think that the fact that it may take you longer than some other students to understand this means you don’t understand it as well. In the end, initial confusion can lead to ultimate clarity at a higher level than you might think.

One key reason why determining the level of measurement of a variable is important is that it influences the type of calculations that can be performed on it. For instance, if you wanted to measure the number of children someone had, would you use a nominal measurement like “oodles” or “none”, or an ordinal one like many, some, none, which is at least an effort at a rank-order? No, you would use a ratio variable (interval with a true zero).

Levels of Measurement: Various characteristics

The following chart, devised by Dr. Susan Carol Losh (http://edf5481-01.su00.fsu.edu/LevelsofAnalysis.htm), can help to further explain the distinction between nominal, ordinal, interval and ratio variables.

VARIABLE   CATEGORIES   CATEGORIES   CASES CAN BE   CATEGORIES   CATEGORIES     NON ARBITRARY
           EXHAUSTIVE   MUTUALLY     SEPARATED BY   CAN BE       SEPARATED BY   ZERO
                        EXCLUSIVE    CATEGORY       RANK-        EQUAL
                                                    ORDERED      INTERVAL

NOMINAL    X            X            X
ORDINAL    X            X            X              X
INTERVAL   X            X            X              X            X
RATIO      X            X            X              X            X              X
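One way to keep track of how the level of measurement limits the calculations you can perform is a simple lookup, sketched here in Python. The names `LEVELS`, `CENTRAL_TENDENCY`, `allowed_measures` and `at_least` are my own, made up for this illustration; the content follows the discussion of mode, median and mean in these notes.

```python
# Levels of measurement, ordered from fewest to most properties in the chart.
LEVELS = ["nominal", "ordinal", "interval", "ratio"]

# Measures of central tendency that are meaningful at each level,
# following the discussion of mode, median and mean in these notes.
CENTRAL_TENDENCY = {
    "nominal": ["mode"],
    "ordinal": ["mode", "median"],
    "interval": ["mode", "median", "mean"],
    "ratio": ["mode", "median", "mean"],
}

def allowed_measures(level):
    """Return the measures of central tendency suited to a level of measurement."""
    return CENTRAL_TENDENCY[level.lower()]

def at_least(level, minimum):
    """True if `level` has every property in the chart that `minimum` has."""
    return LEVELS.index(level.lower()) >= LEVELS.index(minimum.lower())

print(allowed_measures("ordinal"))
print(at_least("ratio", "interval"))
```

Because each level in the chart has all the properties of the levels above it, a ratio variable can be treated as interval, ordinal or nominal, but not the other way around.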