STATISTICAL ANALYSIS USING SPSS
Anne Schad Bergsaker
24 September 2020

BEFORE WE BEGIN...

LEARNING GOALS
1. Know about the most common tests used in statistical analysis
2. Know the difference between parametric and non-parametric tests
3. Know which statistical tests to use in different situations
4. Know when you should use robust methods
5. Know how to interpret the results of a test done in SPSS, and whether a model you have made is good or not

TECHNICAL PREREQUISITES
If you do not have SPSS installed on your own device, use remote desktop by going to view.uio.no. The data files used in the examples are from the SPSS Survival Manual, and can be downloaded as a single .zip file from the course website. Try to do what I do, and follow the same steps. If you missed a step or have a question, don't hesitate to ask.

TYPICAL PREREQUISITES/ASSUMPTIONS FOR ANALYSES

WHAT TYPE OF STUDY WILL YOU DO?
There are, in general, two types of studies: controlled studies and observational studies. In a controlled study you either have parallel groups with different experimental conditions (independent), or you let everyone start out the same and then actively change the conditions for all participants during the experiment (longitudinal). Observational studies involve observing without intervening, in order to look for correlations. Keep in mind that observational studies cannot tell you anything about cause and effect, only about co-occurrence.

RANDOMNESS
All cases/participants in a data set should, as far as possible, be a random sample. This assumption is at the heart of statistics. If you do not have a random sample, you will have problems with unforeseen sources of error, and it will be more difficult to draw general conclusions, since you can no longer assume that your sample is representative of the population as a whole.

INDEPENDENCE
Measurements from the same person will not be independent. Measurements from individuals who belong to a group, e.g. members of a family, can influence each other and are therefore not necessarily independent. In ordinary linear regression analysis the data need to be independent. For t-tests and ANOVA there are special solutions if you have data from the same person at different points in time or under different test conditions. Getting around the issue of individuals who have influenced each other is more complicated. It can be done, but it is beyond the scope of this course.

OUTLIERS
Extreme values that stand out from the rest of the data, or data from a different population, will always make it more difficult to build a good model: such points do not fit the model well, yet they may have a great influence on the model itself. If you have outliers, you may want to consider transforming or trimming your data (removing the top and bottom 5%, 10%, etc.), or removing single points (if they appear to be measurement errors). Alternatively, you can use more robust methods that are less affected by outliers. If you do remove points or otherwise change your data set, you have to explain why; it is not enough to say that the data points do not fit the model. The model should be adapted to the data, not the other way around.
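To make the notion of an outlier concrete: boxplots (including those SPSS produces in Explore, covered below) conventionally flag points lying more than 1.5 interquartile ranges outside the middle half of the data. The following is a minimal Python sketch of that rule, not part of the original slides; the data values are invented for illustration.

    import numpy as np

    def iqr_outliers(x, k=1.5):
        """Flag points more than k interquartile ranges outside the box.
        k=1.5 is the usual boxplot convention; k=3 marks 'extreme' values."""
        x = np.asarray(x, dtype=float)
        q1, q3 = np.percentile(x, [25, 75])
        iqr = q3 - q1
        lower, upper = q1 - k * iqr, q3 + k * iqr
        return x[(x < lower) | (x > upper)]

    data = np.array([2.1, 2.4, 2.2, 2.8, 2.5, 2.3, 9.7])
    print(iqr_outliers(data))  # -> [9.7]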
LINEARITY AND ADDITIVITY
Most of the tests we will run assume that the relationship between variables is linear. A non-linear relationship will not necessarily be discovered, and a model based on linearity will not describe the data well. Additivity means that a model based on several variables is best represented by adding the effects of the different variables. Most standard models assume additivity.

HOMOSCEDASTICITY/CONSTANT VARIANCE
The deviations between the data and the model are called residuals. The residuals should be normally distributed, and they should also have a more or less constant spread throughout the model. Correspondingly, the variance or spread in data from different categories or groups should also be more or less the same. If the error in the model changes as the input increases or decreases, we do not have homoscedasticity; we have a problem: heteroscedasticity.

NORMALITY OR NORMAL DISTRIBUTION
Most tests assume that something or other is normally distributed (the residuals, the sampling distribution of the estimates, etc.), and use this to their advantage. This is the case for t-tests, ANOVA, Pearson correlation and linear regression. Because of the central limit theorem we can assume that for large data sets (more than 30 cases) the parameter estimates will have a normal sampling distribution, regardless of the distribution of the data. However, if there is much spread in the data, or there are many outliers, you will need more cases; it is usually a good idea to have at least 100 just to be safe. It is still a good idea to check how the data are distributed before starting on any more complicated analyses.

NORMALITY
To check whether the data are normally distributed we use Explore:
Analyze > Descriptive Statistics > Explore
• Dependent list: the variable(s) you wish to analyze
• Factor list: a categorical variable that defines groups within the data in the dependent list
• Label cases by: makes it easier to identify extreme outliers

NORMALITY
Explore: Plots
• Boxplots, Factor levels together: if you have provided a factor variable, this option makes one plot with all the groups
• Boxplots, Dependents together: if you have provided more than one dependent variable, this puts the different variables together in the same graph
• Descriptive: Histogram is usually the most informative choice
• Normality plots with tests: plots and tables that make it clearer whether the data are normal or not

NORMALITY
Output: Tests of Normality and Extreme Values
• Tests of Normality: if the data are perfectly normal, the significance value (Sig.) will be greater than 0.05. HOWEVER, this hardly ever happens in large data sets, so it is better to look at the plots when deciding whether the data are normal or not.
• Extreme values: the five largest and the five smallest values.

NORMALITY
• Histogram: here we see that the data are a little skewed, but overall almost normal
• Box plot: shows much the same as the histogram

NORMALITY
• Normal Q-Q plot: the black line shows where the data should lie if they were perfectly normal. Except for the right tail, the data lie fairly close to the line.
• Detrended normal Q-Q plot: shows the deviation between the data and the normal distribution more clearly. There is no clear trend in the deviation, which is a good sign, but we see even more clearly that the right tail is heavier than expected under the normal distribution.
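The slides run these checks through the SPSS menus, but the same inspection can be reproduced in code. Below is a minimal Python sketch, not part of the original slides, using an invented, mildly skewed sample; with one of the course data files you would read the variable in with pandas instead. The Shapiro-Wilk test is one of the tests SPSS reports in its Tests of Normality table.

    import numpy as np
    from scipy import stats
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(42)
    data = rng.gamma(shape=9.0, scale=1.0, size=200)  # mildly right-skewed sample

    # Shapiro-Wilk test, as in the "Tests of Normality" output. As noted on
    # the slide: with large samples even tiny deviations give p < 0.05,
    # so the plots matter more than the test.
    stat, p = stats.shapiro(data)
    print(f"Shapiro-Wilk W = {stat:.3f}, p = {p:.3f}")

    # Histogram and normal Q-Q plot, like the Explore output.
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
    ax1.hist(data, bins=20)
    ax1.set_title("Histogram")
    stats.probplot(data, dist="norm", plot=ax2)  # Q-Q plot with reference line
    ax2.set_title("Normal Q-Q plot")
    plt.tight_layout()
    plt.show()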
HELP, MY DATA DO NOT FULFILL THE REQUIREMENTS!
In some cases you will be able to choose a non-parametric test instead, which does not have the same strict prerequisites. Other options:
• Trim the data: remove the highest and lowest 5%, 10%, 15%, etc. Trimming based on the standard deviation is usually a bad idea, since the standard deviation is itself heavily influenced by outliers.
• Winsorizing: replace the values of extreme data points with the highest (or lowest) value that is not an outlier. (Trimming and winsorizing are both sketched in code at the end of this section.)
• Bootstrapping: create many hypothetical samples based on the values you already have, do the same type of analysis on all of them, and use the results to obtain interval estimates.
• Transform variables that deviate greatly from the normal distribution.

BOOTSTRAPPING
Bootstrapping is the most commonly used form of robust analysis, and it is very easy to apply in SPSS. In bootstrapping, SPSS treats your sample as a kind of population: new cases are drawn at random from your sample, with replacement, to create a new sample of N cases, where N is usually the number of cases in your original sample. This is repeated as many times as you want, typically at least 1000 times, and the statistical parameters you ask for are calculated for each of these new samples. Based on the results from all of the samples, an interval estimate is given for each parameter, e.g. for a mean or a correlation coefficient. (The idea is sketched in code after the correlation section below.)

TRANSFORMATION OF 'ABNORMAL' DATA
A last resort, as it can make the results harder to interpret.

NUMBER OF CASES/DATA MEASUREMENTS
There is no single number of cases needed to use a specific statistical test, or to obtain significance. It depends on the type of analysis and the size of the effect you are looking for: the smaller the effect, the more cases you need. The central limit theorem states that even if the population the data are drawn from is not normal, the estimates you produce will have a normal sampling distribution as long as you have enough data, typically at least 30 cases. However, 30 cases is the absolute minimum, and not always enough: if your data have a high variance you will need more, and if you wish to compare groups, each group should have more than 30 cases.

EXPLORE RELATIONSHIP BETWEEN VARIABLES

CORRELATION
Correlation measures the strength of the linear relationship between variables. If two variables are correlated, a change in one variable will correspond with a change in the other. Correlation is given as a number between -1 and 1, where 1 indicates a perfect positive correlation, -1 a perfect negative correlation, and 0 no linear relationship.
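The following is a minimal Python sketch, not from the original slides, showing a Pearson correlation together with the percentile bootstrap interval promised in the bootstrapping section above. Everything here is invented for illustration (simulated variables, 2000 resamples, a 95% interval); in practice you would load one of the course data files instead.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)

    # Simulated example data: two weakly related variables.
    x = rng.normal(size=100)
    y = 0.5 * x + rng.normal(size=100)

    # Point estimate and p-value for the Pearson correlation.
    r, p = stats.pearsonr(x, y)
    print(f"r = {r:.3f}, p = {p:.4f}")

    # Bootstrap: resample cases with replacement, recompute r each time,
    # and read a 95% interval off the percentiles - the same idea SPSS
    # applies when you tick the Bootstrap option.
    n = len(x)
    boots = []
    for _ in range(2000):
        idx = rng.integers(0, n, size=n)  # draw n cases, with replacement
        boots.append(stats.pearsonr(x[idx], y[idx])[0])
    lo, hi = np.percentile(boots, [2.5, 97.5])
    print(f"95% bootstrap CI for r: [{lo:.3f}, {hi:.3f}]")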
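Finally, as referenced in the remedies list above, here is a minimal Python sketch of trimming and winsorizing; the data values and the 10% cut are invented, and scipy is assumed to provide both operations.

    import numpy as np
    from scipy import stats
    from scipy.stats.mstats import winsorize

    data = np.array([1.8, 2.0, 2.1, 2.3, 2.4, 2.6, 2.7, 2.9, 3.0, 14.2])

    # Trimming: drop the top and bottom 10% of cases before averaging.
    print(stats.trim_mean(data, proportiontocut=0.10))  # mean without 1.8 and 14.2 -> 2.5

    # Winsorizing: keep all 10 cases, but pull the most extreme value in
    # each tail back to the nearest remaining value (14.2 -> 3.0, 1.8 -> 2.0).
    print(winsorize(data, limits=(0.10, 0.10)))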