Topic 2. Descriptive Statistics, Data Screening, and Transforming Variables

Univariate Descriptive Statistics
• Procedures that help us organize, screen, and describe data from a sample or a population
• These allow us to identify patterns that would be impossible to see if we looked at the data all at once
• Tables
  o Frequency distribution
• Graphs
  o For nominal and ordinal variables
    ▪ Pie chart
    ▪ Bar graph
  o For interval and ratio variables
    ▪ Histogram
    ▪ Line chart
    ▪ Box plot
    ▪ Quantile-normal plots
• Statistics
  o Central tendency
    ▪ Mode, median, mean
  o Variability
    ▪ Range, inter-quartile range, variance, and standard deviation
  o Other statistics
    ▪ Quantiles and percentiles, quartiles, skewness, kurtosis

First things first
You should use all of these to examine your data. Statistics are the most efficient (a single number vs. a whole table or chart), but they are often misleading because they omit many details to achieve this efficiency. You should begin every analysis by examining each variable (i.e., before you do anything else that is more complicated).

The Uses of frequency distributions and graphs
• Summarizing vast quantities of data
• Obtaining information about the sample – how representative of the population are the sample data?
• Data screening – searching for unusual cases/outliers, coding errors, missing data
• Testing assumptions of statistical tests – for example, normality

Some guidelines for creating good frequency distributions and graphs
• Choose the design based on the level of measurement of the variable
• Categories should be mutually exclusive and exhaustive
• The table/graph should stand on its own (i.e., it should pass the “lost in the parking lot” test)
• Always include a descriptive title, labels, the data source, and the total number of cases

Frequency distributions
• A table reporting the number of observations/cases falling into each category of the variable
• Can be used for all levels of measurement, but are better for variables with a small number of categories
• Content of table depends on level of measurement
  o % allows us to make comparisons across groups easily
  o valid % if there are missing data
  o cumulative % for ordinal, interval, and ratio variables

Frequency Distributions in SPSS
Analyze → Descriptive Statistics → Frequencies

Syntax: freq vars=satfin.

Unedited output from SPSS (fine for data screening, but not for presenting your research to others):

SATFIN SATISFACTION WITH FINANCIAL SITUATION

                               Frequency   Percent   Valid Percent   Cumulative Percent
Valid     1 SATISFIED                834      29.6            29.8                 29.8
          2 MORE OR LESS            1261      44.8            45.0                 74.7
          3 NOT AT ALL SAT           708      25.1            25.3                100.0
          Total                     2803      99.5           100.0
Missing   8 DK                         9        .3
          9 NA                         5        .2
          Total                       14        .5
Total                               2817     100.0

Edited output:

Table 1. Satisfaction with Financial Situation^a (2000; N=2,803).

                                     Frequency   Percent   Valid Percent   Cumulative Percent
Valid     1 Satisfied                      834      29.6            29.8                 29.8
          2 More or less satisfied        1261      44.8            45.0                 74.7
          3 Not satisfied at all           708      25.1            25.3                100.0
          Total                           2803      99.5           100.0
Missing   8 Don’t know                       9        .3
          9 Not available                    5        .2
          Total                             14        .5
Total                                     2817     100.0

Source: Davis and Smith (2007).

a. Question wording: “We are interested in how people are getting along financially these days. So far as you and your family are concerned, would you say that you are pretty well satisfied with your present financial situation, more or less satisfied, or not satisfied at all?”
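The three percent columns in the SATFIN table can be reproduced by hand. The following is a minimal Python sketch, not SPSS output; the function and variable names are illustrative, and the counts are copied from Table 1. Percent divides by all cases, valid percent divides by non-missing cases only, and cumulative percent is a running total of valid percent.

```python
# Counts copied from Table 1 (SATFIN); all names here are illustrative.
valid_counts = {
    "Satisfied": 834,
    "More or less satisfied": 1261,
    "Not satisfied at all": 708,
}
missing_count = 14  # 9 "Don't know" + 5 "Not available"


def frequency_table(valid, n_missing):
    """Return rows of (category, frequency, percent, valid %, cumulative %)."""
    n_valid = sum(valid.values())
    n_total = n_valid + n_missing
    rows, running = [], 0
    for category, freq in valid.items():
        running += freq
        rows.append((
            category,
            freq,
            round(100 * freq / n_total, 1),     # percent of all cases
            round(100 * freq / n_valid, 1),     # percent of valid cases
            round(100 * running / n_valid, 1),  # cumulative valid percent
        ))
    return rows


table = frequency_table(valid_counts, missing_count)
for row in table:
    print(row)
```

The printed rows should match the Percent, Valid Percent, and Cumulative Percent columns of the table above.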

Graphs

Generating Simple Graphs in SPSS
1. Bar graphs, pie charts, and histograms: Analyze → Descriptive Statistics → Frequencies; Select the ‘Charts’ box
2. Line charts: Graphs → Legacy Dialogs

Bar graph
Figure 1. Marital Status (2000; N=2,816).
[Figure: bar graph; y-axis: Percent; categories: Married, Divorced, Never Married, Widowed, Separated.]
Source: Davis and Smith (2007).

Pie chart
Figure 2. Satisfaction with Financial Situation (2000; N=2,803).
[Figure: pie chart; Satisfied 29.8%, More or less 45.0%, Not at all 25.3%.]
Source: Davis and Smith (2007).

Histogram
Figure 3. The Number of Work Hours Last Week (2000; N=1,818).

Line Chart
Figure 4. The Number of Work Hours Last Week (2000; N=1,818).

[Figure 3: histogram; y-axis: Frequency; x-axis: hours worked last week, in bins.]
[Figure 4: line chart; y-axis: Count; x-axis: hours worked last week.]
Source: Davis and Smith (2007).
Note: SPSS collapses the data into ‘bins’ – that is, scores from 0 to 5 are summarized by the first bar; you can change this in the chart editor.

References
Davis, James Allan and Smith, Tom W. General social surveys, 1972-2006 [machine-readable data file] / Principal Investigator, James A. Davis; Director and Co-Principal Investigator, Tom W. Smith; Co-Principal Investigator, Peter V. Marsden; Sponsored by National Science Foundation. NORC ed. Chicago: National Opinion Research Center [producer]; Storrs, CT: The Roper Center for Public Opinion Research, University of Connecticut [distributor], 2007.

Central tendency (See the Excel file for examples)

Mode
• The mode is the category (nominal/ordinal) or score (interval-ratio) with the largest frequency
• The mode is always the category or score on the variable, not the frequency or percent
• Bimodal and essentially bimodal distributions

Median (an order statistic)
• The median is the score that divides the distribution into two equal parts so that half the cases are above it and half are below it
• Order is implied in the definition of the median. You have to be able to order the categories of the variable from least to most or most to least to calculate the median. You can only calculate the median for ordinal and interval-ratio variables
• Note – if you are finding the median for an ordinal variable with an even number of cases, it is inappropriate to calculate the average of the middle two cases
• The median is not affected by the scores assigned to the categories
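For an ordinal variable, the median can be read off a frequency distribution by finding the first category at which the running total of cases reaches half of the valid cases. The sketch below is one way to do this in Python under that assumption; the function name is illustrative, and the counts are the valid SATFIN counts from Table 1. When the two middle cases fall in different categories, this sketch simply reports the lower one rather than averaging.

```python
def ordinal_median(ordered_counts):
    """Return the first category whose cumulative count reaches half the cases.

    ordered_counts: list of (category, frequency) pairs, already ordered
    from the lowest category to the highest.
    """
    n = sum(freq for _, freq in ordered_counts)
    running = 0
    for category, freq in ordered_counts:
        running += freq
        if running >= n / 2:
            return category


# Valid SATFIN counts from Table 1, in category order.
satfin = [("Satisfied", 834), ("More or less", 1261), ("Not at all", 708)]
median_category = ordinal_median(satfin)
print(median_category)  # More or less
```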

Quantiles, percentiles, and quartiles (order statistics)
• A percentile is a score at or below which a specific percentage of cases fall; quantiles differ only in that the value is a proportion rather than a percent
• The lower quartile is equal to the 25th percentile and the .25 quantile. The middle quartile is equal to the median, the 50th percentile, and the .50 quantile. The upper quartile is equal to the 75th percentile and the .75 quantile

Mean
The mean is the average, which is obtained by adding up all of the scores and dividing by the number of scores:

$$\bar{y} = \frac{\sum_{i=1}^{N} y_i}{N}$$

Some important properties of the mean:
1. You can only calculate the mean for interval-ratio variables
2. The mean is the center of gravity of the distribution because it incorporates all scores in the distribution (unlike the median and mode). This means that if we subtract the mean from each case and add up the differences, the result will always equal zero
3. Unlike the mode and the median, the mean is sensitive to extreme scores in the distribution. This is because it takes into account every score
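Properties 2 and 3 can be checked with a few lines of Python. This is only a sketch: the five scores are hypothetical values chosen for easy arithmetic, not data from the course files.

```python
import statistics

# Hypothetical scores, chosen only to keep the arithmetic simple.
scores = [2, 4, 6, 8, 30]

mean = sum(scores) / len(scores)
deviations = [y - mean for y in scores]

print(mean)             # 10.0
print(sum(deviations))  # 0.0 (property 2: deviations from the mean sum to zero)

# Property 3: make the largest score more extreme and compare mean and median.
extreme = [2, 4, 6, 8, 300]
extreme_mean = sum(extreme) / len(extreme)
print(extreme_mean)                # 64.0 (the mean moves a lot)
print(statistics.median(scores))   # 6
print(statistics.median(extreme))  # 6 (the median does not move)
```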

Dichotomous variables (i.e., those with only two categories) are special.

Sex
(1) Male – frequency = 3
(2) Female – frequency = 7

If we recode sex so that 1 becomes 0 and 2 becomes 1 (i.e., into a ‘dummy variable’), then the mean is 7/10 = .7, which is the proportion female!

Variability (See the Excel file for examples)

The Importance of Measuring Variability
If we just look at central tendency, we may be misled because two distributions can have the exact same mean, median, and/or mode, but different degrees of variability.

Commuting times for two different routes (minutes):

                      Freeway   Downtown
Week 1   Monday            21         31
         Tuesday           19         27
         Wednesday         20         29
         Thursday          28         27
         Friday            53         31
Week 2   Monday            24         30
         Tuesday           18         28
         Wednesday         19         27
         Thursday          28         27
         Friday            60         33

Mean                     29.0       29.0

• On average, it takes 29.0 minutes to get to work regardless of the route
• Notice that there is more variability day-to-day if you take the freeway
• If you value being on-time, then you should take the downtown route and leave the house about 33 minutes before work
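The commuting example can be reproduced in a few lines. This Python sketch uses the ten times per route from the table above (variable names are illustrative) and shows that the means agree while the shortest and longest trips do not.

```python
# Commuting times in minutes, copied from the table above.
freeway = [21, 19, 20, 28, 53, 24, 18, 19, 28, 60]
downtown = [31, 27, 29, 27, 31, 30, 28, 27, 27, 33]

for name, times in (("Freeway", freeway), ("Downtown", downtown)):
    mean = sum(times) / len(times)
    print(name, "mean:", mean, "shortest:", min(times), "longest:", max(times))
```

Both routes average 29.0 minutes, but the freeway runs from 18 to 60 minutes while downtown stays between 27 and 33.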

Range
The range is the difference between the maximum observed value and the minimum observed value. Its drawback is that it depends only on the two most extreme scores, so a single outlier can inflate it. In the presence of outliers, you should use the inter-quartile range.

Inter-quartile Range
The inter-quartile range is the difference between the values at the lower and upper quartiles. It is similar to the range except that it focuses on two less extreme scores (i.e., the lower and upper quartiles instead of the minimum and maximum values).

First order the cases from the lowest score to the highest score.
To find the case at the lower quartile: N*.25
To find the case at the upper quartile: N*.75

Once you identify the cases at the lower and upper quartiles, subtract the score for the case at the lower quartile from that of the case at the upper quartile. The result is the inter-quartile range.
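The range and the inter-quartile range can be computed for the commuting data with the N*.25 and N*.75 rule. This is a Python sketch under one stated assumption: when N*.25 or N*.75 is not a whole number (here N = 10 gives 2.5 and 7.5), the position is rounded up to the next case. Other software, including SPSS, may interpolate and give slightly different quartiles. Function names are illustrative.

```python
import math


def quartile_scores(scores):
    """Return (lower, upper) quartile scores from the N*.25 and N*.75 cases."""
    ordered = sorted(scores)
    n = len(ordered)
    # Assumption: a fractional case position is rounded up to the next case.
    lower_case = math.ceil(n * 0.25)
    upper_case = math.ceil(n * 0.75)
    # Case positions count from 1; Python list indexes count from 0.
    return ordered[lower_case - 1], ordered[upper_case - 1]


def value_range(scores):
    return max(scores) - min(scores)


def iqr(scores):
    lower, upper = quartile_scores(scores)
    return upper - lower


freeway = [21, 19, 20, 28, 53, 24, 18, 19, 28, 60]
downtown = [31, 27, 29, 27, 31, 30, 28, 27, 27, 33]

print(value_range(freeway), iqr(freeway))    # 42 9
print(value_range(downtown), iqr(downtown))  # 6 4
```

The two Fridays (53 and 60 minutes) drive the freeway range to 42, while the inter-quartile range of 9 ignores them.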

Variance and Standard Deviation
Variance ($S_y^2$): a measure of variation for interval-ratio variables; it is the average of the squared deviations from the mean (note: for sample data the denominator is usually N-1 rather than N).

$$S_y^2 = \frac{\sum_{i=1}^{N} (y_i - \bar{y})^2}{N}$$

Standard deviation ($S_y$): a measure of variation for interval-ratio variables; it is equal to the square root of the variance and can be read, loosely, as the average deviation from the mean.

$$S_y = \sqrt{\frac{\sum_{i=1}^{N} (y_i - \bar{y})^2}{N}}$$
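The two formulas, and the N versus N-1 note, can be illustrated with the commuting data. This is a minimal Python sketch; the `sample` argument is an illustrative name for switching the denominator from N to N-1.

```python
import math


def variance(scores, sample=False):
    """Average squared deviation from the mean (N, or N-1 if sample=True)."""
    n = len(scores)
    mean = sum(scores) / n
    sum_of_squares = sum((y - mean) ** 2 for y in scores)
    return sum_of_squares / (n - 1 if sample else n)


def standard_deviation(scores, sample=False):
    return math.sqrt(variance(scores, sample))


freeway = [21, 19, 20, 28, 53, 24, 18, 19, 28, 60]
downtown = [31, 27, 29, 27, 31, 30, 28, 27, 27, 33]

for name, times in (("Freeway", freeway), ("Downtown", downtown)):
    print(
        name,
        round(variance(times), 2),
        round(variance(times, sample=True), 2),
        round(standard_deviation(times), 2),
    )
```

With N in the denominator the freeway variance is 203.0 and the downtown variance is 4.2, so the freeway standard deviation (about 14.2 minutes) is roughly seven times the downtown one (about 2.0 minutes) even though the means are identical.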

The shape of the distribution
Distributions can be either symmetric or skewed
• A symmetric distribution is a mirror image of itself if you divide the distribution in half
• In a unimodal symmetric distribution the mean, median, and mode are equal

Skewed distributions are characterized by extreme values on one side of the distribution
• Those that have extremely high values (compared to the rest of the distribution) are positively skewed
• Those that have extremely low values (compared to the rest of the distribution) are negatively skewed
• The mean is pulled toward the side with the extreme values; the median, however, is largely unaffected
• The easiest way to tell positive skew from negative skew is to compare the mean to the median
  o If the mean is higher than the median then the variable is positively skewed
  o If the mean is lower than the median then the variable is negatively skewed

[Figures: two bar charts titled ‘Positive Skew’ and ‘Negative Skew’; y-axis: Percent (0 to 30); x-axis: Score on X (1 to 11).]
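The mean-versus-median comparison can be written as a small rule of thumb. The Python sketch below is illustrative only (the function name is an assumption); it applies the rule to the freeway commuting times and to the mean (41.90) and median (40.00) that SPSS reports for HRS1 in these notes.

```python
import statistics


def skew_direction(mean, median):
    """Rule of thumb: compare the mean to the median."""
    if mean > median:
        return "positive"
    if mean < median:
        return "negative"
    return "no skew indicated"


freeway = [21, 19, 20, 28, 53, 24, 18, 19, 28, 60]
freeway_skew = skew_direction(statistics.mean(freeway), statistics.median(freeway))
print(freeway_skew)  # positive (mean 29 is above the median 22.5)

# Mean and median of HRS1 as reported in the SPSS output in these notes.
hrs1_skew = skew_direction(41.90, 40.00)
print(hrs1_skew)  # positive
```

This is a quick check, not a substitute for looking at the histogram or the skewness statistic.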

How do you choose a measure of central tendency or variability?
1. Level of measurement:

                      Nominal   Ordinal   Interval and ratio
Central tendency
  Preferred           Mode      Mode      Mean
                                Median    Median
                                          Mode
Variability
  Preferred                               Standard deviation
                                          Variance
                                          Range
                                          Inter-quartile range

2. Shape of the distribution – for skewed interval-ratio variables, use the median and inter-quartile range
3. Research objective – for example, do you want the most typical value, the value in the middle of the distribution, or the average of all scores?

Measures of central tendency and variability in SPSS
1. Analyze → Descriptive Statistics → Frequencies; Select the ‘Statistics’ box
2. Analyze → Descriptive Statistics → Explore; Select the ‘Statistics’ box
3. Analyze → Descriptive Statistics → Descriptives; Select the ‘Options’ box

Statistics

MARITAL MARITAL STATUS
N        Valid      2,816
         Missing        1
Mode                    1   ‘Married’

SATFIN SATISFACTION WITH FINANCIAL SITUATION
N        Valid      2,803
         Missing       14
Median               2.00   ‘More or less’
Mode                    2   ‘More or less’

HRS1 NUMBER OF HOURS WORKED LAST WEEK
N                        Valid       1818
                         Missing      999
Mean                                41.90
Median                              40.00
Mode                                   40
Std. Deviation                     13.395
Variance                          179.430
Skewness                             .212
Std. Error of Skewness               .057
Kurtosis                            1.668
Std. Error of Kurtosis               .115
Range                                  86
Minimum                                 3
Maximum                                89
Percentiles              25         40.00
                         50         40.00
                         75         49.00

Data Screening

Missing Data
What are some of its sources?
• The use of contingency questions in questionnaires
• The use of multiple ballots
• Non-response and refusals
• Interviewer errors
Why does it matter?
• Missing data reduces sample size, which reduces statistical power
• Missing data can make a representative sample non-representative
• Missing data can influence our estimates

There are a variety of methods available to deal with missing data – the most advanced are far beyond the scope of this course. Regardless, you need to examine how many cases have missing data on each variable and why.

Outliers
What are they?
• Definitions vary, but cases more than 3 standard deviations from the mean can be considered outliers
What causes them and why do they matter?
• They may be data errors
• They may result from the inclusion of a case from a different population
• They may influence our estimates because they can have great leverage – we’ll discuss this later
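The 3-standard-deviation definition can be turned into cutoffs. This Python sketch uses the mean (41.90) and standard deviation (13.395) that SPSS reports for HRS1 in these notes; the function names are illustrative.

```python
def sd_bounds(mean, sd, k=3):
    """Return the (lower, upper) cutoffs k standard deviations from the mean."""
    return mean - k * sd, mean + k * sd


# Mean and standard deviation of HRS1 from the SPSS output in these notes.
lower, upper = sd_bounds(41.90, 13.395)
print(round(lower, 3), round(upper, 3))  # 1.715 82.085


def is_outlier(value):
    return value < lower or value > upper


print(is_outlier(89))  # True: the maximum of 89 hours is above the upper cutoff
print(is_outlier(3))   # False: the minimum of 3 hours is above the lower cutoff
```

Because definitions vary, a case that is not flagged by this rule can still be flagged by another one.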

You can search for outliers using your descriptive statistics, frequency distributions, and graphs. The Explore command in SPSS lists the five highest and five lowest values, which you can then check against your definition of an outlier:

Extreme Values
                                                Case Number   Value
HRS1 NUMBER OF HOURS      Highest   1                  1089      89
WORKED LAST WEEK                    2                  1986      89
                                    3                  2494      89
                                    4                  2223      89
                                    5                  2298      89 (a)
                          Lowest    1                   787       3
                                    2                   914       3
                                    3                  1245       3
                                    4                  2288       4
                                    5                  1785       4 (b)
a. Only a partial list of cases with the value 89 are shown in the table of upper extremes.
b. Only a partial list of cases with the value 4 are shown in the table of lower extremes.

You can also identify outliers by using a box plot: Analyze → Descriptive Statistics → Explore; Select the ‘Plots’ box

Box Plot
Figure 5. The Number of Work Hours Last Week (2000; N=1,818).
[Figure: box plot of the number of hours worked last week; y-axis from 0 to 100.]
Source: Davis and Smith (2007).

A box plot is a summary plot based on the median, quartiles, and extreme values.

A line across the box indicates the median [marked in red in Figure 5; 40 in this example].

The box represents the inter-quartile range, which contains 50% of values [from 40 to 49 in this example].

The whiskers are lines that extend from the box to the highest and lowest non-outlier values. Non-outlier values are those within 1.5 times the inter-quartile range of the edges of the box. In this example, IQR=9 and 1.5*9=13.5, so the whiskers can extend at most from 49 up to 62.5 and from 40 down to 26.5.

Outliers (i.e., cases with values between 1.5 and 3 box lengths from the upper or lower edge of the box) are represented by circles. IQR=9, so 1.5*9=13.5 and 3*9=27. Outliers are between 62.5 and 76 and between 13 and 26.5.

Extreme values (i.e., cases with values more than 3 box lengths from the upper or lower edge of the box) are represented by asterisks. Extreme values extend above 76 and below 13 in this example.
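The box plot cutoffs follow directly from the quartiles. The Python sketch below recomputes them for HRS1 (lower quartile 40, upper quartile 49) and classifies a few values; the function names are illustrative, and values falling exactly on a cutoff are treated as belonging to the less extreme group, which is an assumption.

```python
def box_plot_fences(q1, q3):
    """Return the cutoffs 1.5 and 3 box lengths (IQRs) from the box edges."""
    box_length = q3 - q1
    return {
        "lower_extreme": q1 - 3 * box_length,
        "lower_outlier": q1 - 1.5 * box_length,
        "upper_outlier": q3 + 1.5 * box_length,
        "upper_extreme": q3 + 3 * box_length,
    }


def classify(value, q1, q3):
    fences = box_plot_fences(q1, q3)
    if value < fences["lower_extreme"] or value > fences["upper_extreme"]:
        return "extreme value"
    if value < fences["lower_outlier"] or value > fences["upper_outlier"]:
        return "outlier"
    return "not an outlier"


# Quartiles of HRS1 from the SPSS output: 25th = 40, 75th = 49.
fences = box_plot_fences(40, 49)
print(fences)

for hours in (45, 20, 70, 3, 89):
    print(hours, classify(hours, 40, 49))
```

The cutoffs come out as 13, 26.5, 62.5, and 76, matching the values in the text, so 3 and 89 hours are both extreme values.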

Normality
Why does it matter?
• Some statistical procedures assume normality – their results may be invalid if this assumption is violated
• Even when it is not assumed, skewed distributions can influence estimation and hypothesis testing by causing other statistical problems

You can examine normality using any of the charts listed above (e.g., a histogram) as well as quantile-normal plots: Analyze → Descriptive Statistics → Explore; Select the ‘Plots’ box and check the ‘Normality plots with tests’ box

Figure 6. A Normal Q-Q Plot of Work Hours.
[Figure: normal Q-Q plot; x-axis: Observed Value (0 to 100); y-axis: Expected Normal (-4 to 4).]

The Q refers to quantiles. This plot displays the quantiles of our variable against the expected quantiles (i.e., if it were normal).

If the variable is normally distributed, the dots will all fall on the line.

This plot suggests that work hours deviates from normality; for example, there are:
• More cases between 4 and 24 hours per week than expected
• Fewer cases between 27 and 39 hours per week than expected
• Etc.

Transforming Variables

Recoding Variables in SPSS
Using the Menu: Transform → Recode → Into Different Variables

SPSS Syntax: recode tvhours (0=0) (1=1) (2=2) (3 4=3) (5 thru 24=4) (else=sysmis) into tvhrcat.

The original variable:

TVHOURS HOURS PER DAY WATCHING TV

                     Frequency   Percent   Valid Percent   Cumulative Percent
Valid     0                107       3.8             5.9                  5.9
          1                380      13.5            20.8                 26.6
          2                510      18.1            27.9                 54.5
          3                310      11.0            16.9                 71.5
          4                233       8.3            12.7                 84.2
          5                 95       3.4             5.2                 89.4
          6                 64       2.3             3.5                 92.9
          7                 18        .6             1.0                 93.9
          8                 47       1.7             2.6                 96.4
          10                24        .9             1.3                 97.8
          11                 4        .1              .2                 98.0
          12                21        .7             1.1                 99.1
          13                 1        .0              .1                 99.2
          14                 2        .1              .1                 99.3
          15                 7        .2              .4                 99.7
          20                 3        .1              .2                 99.8
          21                 1        .0              .1                 99.9
          24                 2        .1              .1                100.0
          Total           1829      64.9           100.0
Missing   -1 NAP           940      33.4
          98 DK              3        .1
          99 NA             45       1.6
          Total            988      35.1
Total                     2817     100.0

The recoded variable:

TVHRCAT

                     Frequency   Percent   Valid Percent   Cumulative Percent
Valid     .00              107       3.8             5.9                  5.9
          1.00             380      13.5            20.8                 26.6
          2.00             510      18.1            27.9                 54.5
          3.00             543      19.3            29.7                 84.2
          4.00             289      10.3            15.8                100.0
          Total           1829      64.9           100.0
Missing   System           988      35.1
Total                     2817     100.0

Note – Be sure to add value labels to the tvhrcat variable so that you will remember that 3 now means 3 or 4 and 4 now means 5 or more.

SPSS Syntax for adding value labels:
add value labels tvhrcat
  0 '0'
  1 '1'
  2 '2'
  3 '3 or 4'
  4 '5 or more'.
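To see why the recoded frequencies come out as they do, the recode rule can be mimicked outside SPSS. The Python sketch below is an illustration, not SPSS syntax: `recode_tvhours` is a hypothetical function that mirrors the recode command above, `None` stands in for system-missing, and the counts are copied from the TVHOURS table.

```python
from collections import Counter


def recode_tvhours(hours):
    """Mirror: (0=0) (1=1) (2=2) (3 4=3) (5 thru 24=4) (else=sysmis)."""
    if hours in (0, 1, 2):
        return hours
    if hours in (3, 4):
        return 3
    if 5 <= hours <= 24:
        return 4
    return None  # stands in for system-missing


# Frequencies from the TVHOURS table, including the missing codes -1, 98, 99.
tvhours_counts = {
    0: 107, 1: 380, 2: 510, 3: 310, 4: 233, 5: 95, 6: 64, 7: 18, 8: 47,
    10: 24, 11: 4, 12: 21, 13: 1, 14: 2, 15: 7, 20: 3, 21: 1, 24: 2,
    -1: 940, 98: 3, 99: 45,
}

tvhrcat_counts = Counter()
for hours, freq in tvhours_counts.items():
    tvhrcat_counts[recode_tvhours(hours)] += freq

print(dict(tvhrcat_counts))
```

Category 3 collects 310 + 233 = 543 cases and category 4 collects the 289 cases watching 5 or more hours, matching the TVHRCAT table; the 988 cases with missing codes all become missing.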

Computing Variables in SPSS
Using the Menu: Transform → Compute

SPSS Syntax: compute wktvdiff=hrs1-tvhours.

Here is the result:

Statistics
WKTVDIFF
N                        Valid         1181
                         Missing       1636
Mean                                39.6842
Median                              39.0000
Mode                                  38.00
Std. Deviation                     13.83398
Variance                          191.37898
Skewness                               .184
Std. Error of Skewness                 .071
Kurtosis                              1.690
Std. Error of Kurtosis                 .142
Range                                 98.00
Minimum                              -10.00
Maximum                               88.00
Percentiles              25         35.0000
                         50         39.0000
                         75         46.0000

[Figure: histogram of WKTVDIFF; y-axis: Frequency (0 to 400); x-axis from -10.0 to 90.0.]

Transformations to reduce skewness and to pull in outliers
It is also possible to transform variables to reduce skew and to pull in outliers; tvhours is positively skewed:

TVHOURS HOURS PER DAY WATCHING TV

                     Frequency   Percent   Valid Percent   Cumulative Percent
Valid     0                107       3.8             5.9                  5.9
          1                380      13.5            20.8                 26.6
          2                510      18.1            27.9                 54.5
          3                310      11.0            16.9                 71.5
          4                233       8.3            12.7                 84.2
          5                 95       3.4             5.2                 89.4
          6                 64       2.3             3.5                 92.9
          7                 18        .6             1.0                 93.9
          8                 47       1.7             2.6                 96.4
          10                24        .9             1.3                 97.8
          11                 4        .1              .2                 98.0
          12                21        .7             1.1                 99.1
          13                 1        .0              .1                 99.2
          14                 2        .1              .1                 99.3
          15                 7        .2              .4                 99.7
          20                 3        .1              .2                 99.8
          21                 1        .0              .1                 99.9
          24                 2        .1              .1                100.0
          Total           1829      64.9           100.0
Missing   -1 NAP           940      33.4
          98 DK              3        .1
          99 NA             45       1.6
          Total            988      35.1
Total                     2817     100.0

[Figure: histogram of TVHOURS; y-axis: Frequency (0 to 600).]

SPSS Syntax:
* Natural log - you have to add 1 because the natural log of 0 is undefined.
compute tvhr_ln=ln(tvhours+1).

TVHR_LN

                     Frequency   Percent   Valid Percent   Cumulative Percent
Valid     .00              107       3.8             5.9                  5.9
          .69              380      13.5            20.8                 26.6
          1.10             510      18.1            27.9                 54.5
          1.39             310      11.0            16.9                 71.5
          1.61             233       8.3            12.7                 84.2
          1.79              95       3.4             5.2                 89.4
          1.95              64       2.3             3.5                 92.9
          2.08              18        .6             1.0                 93.9
          2.20              47       1.7             2.6                 96.4
          2.40              24        .9             1.3                 97.8
          2.48               4        .1              .2                 98.0
          2.56              21        .7             1.1                 99.1
          2.64               1        .0              .1                 99.2
          2.71               2        .1              .1                 99.3
          2.77               7        .2              .4                 99.7
          3.04               3        .1              .2                 99.8
          3.09               1        .0              .1                 99.9
          3.22               2        .1              .1                100.0
          Total           1829      64.9           100.0
Missing   System           988      35.1
Total                     2817     100.0

[Figure: histogram of TVHR_LN; y-axis: Frequency (0 to 600).]
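The transformed scores in the TVHR_LN table can be checked by hand. This Python sketch applies the same ln(x + 1) transformation to a few of the original tvhours scores; the variable names are illustrative.

```python
import math

# A few of the original tvhours scores from the frequency table.
tvhours = [0, 1, 2, 3, 4, 24]

# Same transformation as the SPSS compute statement: ln(tvhours + 1).
tvhr_ln = [round(math.log(hours + 1), 2) for hours in tvhours]
print(tvhr_ln)  # [0.0, 0.69, 1.1, 1.39, 1.61, 3.22]
```

The gap between 0 and 1 hour becomes .69 on the log scale, while the much larger gap between 4 and 24 hours shrinks to about 1.61, which is how the transformation pulls in the high scores.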
