Practice-Oriented Paper

Conducting and Interpreting T-test and ANOVA Using JASP

Caroline J. Handley
Asia University

Many foreign language university educators conduct quantitative research; for some, analysing and interpreting the data can be the most challenging aspect. A short introduction to data analysis is presented, focusing on why effect sizes and confidence intervals should be reported as well as the probability of a statistical difference between groups (p value). This is followed by an explanation of how to perform two commonly used analyses in applied linguistics research, t-test and ANOVA, using JASP, an open-source software package created at the University of Amsterdam. Finally, further consideration is given to effect size and confidence intervals, as related to statistical power, to highlight the importance of good study design. The overview provided should be useful for novice researchers who need to perform and report basic statistical analyses of quantitative data obtained through classroom research.

This paper provides an introduction to basic statistical analysis of quantitative data for novice researchers. Statistical analysis inevitably involves technical language; the technical terms used in this paper are explained and, for ease of reference, described in table format (Appendix A). A brief overview of some theoretical issues in understanding statistical results is given first. Then, after a summary of the JASP software1 (JASP Team, 2018; Version 0.9.0.1; downloaded from jasp-stats.org), a fictional data set is used to illustrate how to perform two frequently used analyses in applied linguistics research using JASP. First, I explain how to import data into JASP, then how to calculate and visualise the numbers that describe the data (descriptive statistics). Next, an example is given of an independent t-test and an analysis of variance (ANOVA). These tests yield a probability (p) value which enables researchers to make predictions about the whole population based on the sample used in one's experiment (inferential statistics). Finally, the data are used to discuss interpreting and supplementing p values and designing reliable experiments.

The experimental design described in this paper involves the comparison of students in a control group (i.e., a group which receives no experimental intervention) and one or more experimental groups. The intervention is the independent variable (the factor manipulated) and its effect is measured on the dependent variable (test scores) by means of an independent t-test and ANOVA (explained below). Both test the null hypothesis that any difference between the scores of each group is due to chance variation (random error). The tests give the probability of obtaining the observed results if the null hypothesis were true, stated as a p value, which is typically declared statistically significant if it is less than .05 (if the results would be obtained due to chance in less than 5% of studies). This probability is based on the difference between the mean (average test score) of each group and the variance within each group (the difference between each student's score and their group mean), summarised as the standard deviation of each group. The tests do not test the research hypothesis (the alternative hypothesis) of a systematic effect of the intervention on test scores, so a significant p value does not indicate that the research hypothesis is true (see Cohen, 1994 for a detailed explanation).

Statistical significance testing, although widely used, has long been criticised due to two further limitations (e.g., Carver, 1978). The p value obtained is largely dependent on the sample size (the number of students in each group) and does not provide any information about the magnitude of any difference between groups. Since 1994, APA guidelines have stated that effect sizes should always be reported alongside p values (American Psychological Association, 2009), although this is rare in applied linguistics research (Norris & Ortega, 2000; Plonsky, 2011). Effect size is a measure of the magnitude of the effect or difference between groups (or interventions), so it more directly diagnoses the importance of the factor(s) or variables being investigated (Plonsky, 2015). Benchmarks proposed by Cohen (1992) for the behavioural sciences are often used. For differences between the means of two groups (such as the t-test), he suggested that Cohen's d = 0.2 should be considered a small effect, 0.5 a medium effect, and any value over 0.8 a large effect. Unlike when reporting p values, the zero before the decimal point is needed because effect sizes larger than one are possible. However, Oswald and Plonsky (2010) tentatively proposed that in L2 research, Cohen's d values of 0.4, 0.7, and 1.0 should be considered the benchmarks for small, medium, and large effects. For differences between the means of more than two groups (such as ANOVA), the standards for eta squared (η²) or omega squared (ω²) are 0.01 for small, 0.06 for medium, and 0.14 for large effects (Field, 2017).

Confidence intervals (CIs) are another measure that should be reported (Cumming, 2012), as they show the uncertainty around the mean values obtained. They are based on the standard error, which is calculated from the standard deviation and sample size, meaning that confidence intervals decrease as sample or group size increases. The 95% CI covers a range of values above and below the mean of each group, indicating 95% confidence that the true mean lies somewhere within this range. Small groups are more vulnerable to chance variation, so their 95% CIs are likely to be wide, reflecting low confidence in the accuracy of one's results.

Effect sizes and CIs are explained not only to encourage better reporting of results, but also to highlight the importance of sample size in quantitative research. In the fictional experiment created to model data analysis, I used the average group size in second language research of 20 participants (Plonsky, 2013; Plonsky & Gass, 2011). This choice was deliberate, to illustrate why this sample size is often too small to reliably detect a significant difference even when the research hypothesis is true. If the group size is too small, an experiment is underpowered, and chance largely determines whether the result will be statistically significant (Ioannidis, 2005). This issue is revisited, in relation to the analyses conducted, at the end of this paper, and some further resources for understanding and conducting quantitative data analysis are provided in Appendix B.
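For readers who want to check such calculations outside JASP, Cohen's d for two independent groups can be computed directly from each group's mean, standard deviation, and size. The short Python sketch below is illustrative only; the values fed into it are invented for this example, not taken from any real study.

import math

def cohens_d(mean1, sd1, n1, mean2, sd2, n2):
    # Pooled SD, weighting each group's variance by its degrees of freedom
    pooled_sd = math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2))
    return (mean1 - mean2) / pooled_sd

# Invented values: two groups of 20 whose means differ by 1.5 points
print(round(cohens_d(12.0, 3.0, 20, 13.5, 3.5, 20), 2))  # -0.46
# d is negative because the first group's mean is lower, as with the
# t statistic reported for the fictional data later in this paper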


Data Analysis in JASP

JASP is used in this paper as it is free software, available for Windows, Mac, and Linux. It is user-friendly for novices and includes guided analyses of various data sets, many taken from Field (2017). Results of statistical tests performed in JASP are displayed in tables and graphs that can be exported into documents in APA format. In addition, JASP is regularly updated to expand its functionality. JASP also offers advanced statistical analyses, such as structural equation modelling (SEM) and Bayesian versions of most inferential tests. Bayesian analysis is an alternative approach to probability testing which is being adopted in many scientific disciplines (e.g., Van de Schoot, Winter, Ryan, Zondervan-Zwijnenburg, & Depaoli, 2017). It is not discussed in this paper, but a short introduction by the developers can be downloaded from the website (Wagenmakers, Gronau, Etz, & Matzke, 2017).

The analyses described below were performed using a deliberately simple fictional data set, simulating the scores of 20 participants per group on a vocabulary test with a maximum score of 20. The imaginary scenario was that students read texts containing a total of 20 unknown words, then were tested on their knowledge of those target words under one of two (or, for the ANOVA example, three) conditions: no explicit vocabulary instruction (control group), Intervention A, and an alternative Intervention B. The research hypothesis for both analyses was that the students who received the teaching intervention(s) would perform significantly better on the vocabulary test than the control group. Therefore, the null hypothesis being tested was that there would be no statistically significant difference in test scores between groups.
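Readers without access to the original spreadsheet can create a comparable fictional data set themselves. The Python sketch below is one way to do so; the file name, group labels, and the population means and standard deviations are my assumptions, chosen only to mimic the scenario just described, and the fixed seed makes the file reproducible.

import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=1)  # fixed seed for reproducibility

n = 20  # participants per group, matching the group size used in this paper
# Assumed population means and SDs for the control and Intervention A groups
control = np.clip(rng.normal(11.0, 3.0, n).round(), 0, 20)
intervention_a = np.clip(rng.normal(13.5, 3.0, n).round(), 0, 20)

data = pd.DataFrame({
    "ID": range(1, 2 * n + 1),
    "Group": ["Control"] * n + ["InterventionA"] * n,
    "Score": np.concatenate([control, intervention_a]).astype(int),
})
data.to_csv("vocab_scores.csv", index=False)  # a .csv file JASP can open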

Importing and Manipulating Data in JASP

JASP can open data files that are saved as .csv, .sav, or .ods files. Excel spreadsheets can be saved as .csv files by selecting that option from the dropdown menu on saving. When the file is open in JASP, any subsequent changes saved in the original spreadsheet are automatically updated. The data for each participant should be entered in a separate row, with appropriate column headings. In the example data sets used here, there are a control group and one or two intervention groups. This means that for both the t-test and ANOVA there are just three columns of data: participant ID, group type, and vocabulary test score. As Figure 1 shows, JASP displays information in three panels. The data are on the left, the instructions for the analyses are in the centre, and the results of the analyses performed are on the right. On loading the spreadsheet, JASP guesses the type of data in each column: nominal (non-numerical data, such as gender), ordinal (ranked numbers, such as the responses on a Likert scale), or scale (numbers that are separated by equal distances, such as test scores2). If JASP guesses incorrectly, click on the symbol next to the column heading to change it before proceeding. The available statistical tests are spread across the top bar, from which a sub-type is selected. The middle panel will then appear, with boxes for entering data into the analysis. Data can be moved by dragging and dropping or by highlighting and then clicking the arrow, and can be moved back out of the analysis with the same ease. Below these boxes there are expandable bars with extra options. This design keeps the screen uncluttered and allows users to add various additional elements to the main analysis as required.

Descriptive Statistics

Inferential statistical tests such as the t-test and ANOVA are based on two fundamental numbers from descriptive statistics: the mean (M) and standard deviation (SD) of each group. Thus, the mean and standard deviation of both or all groups should be calculated and visualised before performing such inferential statistical tests, and reported alongside their results (Plonsky, 2015). Visualising the data enables one to see whether the distribution of data points (in this case, test scores) looks similar to a bell-shaped curve, called a normal distribution. This is important as it is related to the SD, which is a measure of the variance within each group. Under a normal distribution, approximately 68% of all data points are within 1 SD of the mean and 95% lie within 2 SDs. In the example analysis, in the instructions panel, Score is moved into the Variables box, then Split by Group (Figure 1). JASP provides the mean, standard deviation, and minimum and maximum test score for each group, shown in the Results panel on the right.

Figure 1. Descriptive statistics for the vocabulary scores of two groups as generated in JASP, with data on the left, instructions in the centre, and results on the right.

This information is visualised by clicking on Plots and selecting Boxplot. Although it might not be necessary to include the boxplot in a paper, it is a useful tool for understanding one's data. As can be seen in the image on the left in Figure 2, a boxplot shows the median (thick black horizontal line) surrounded by a box which represents the interquartile range (the middle 50% of scores), with whiskers above and below extending to the most extreme scores within 1.5 box-lengths. Any outliers are shown as dots beyond the whiskers. It is a good idea to check outliers in case they represent a mistake in the data (e.g., an extra digit). Colour, violin, and jitter options can be added to the boxplot by selecting the appropriate boxes. Violin and jitter effects are shown in the image on the right in Figure 2. The violin element shows the distribution curve, mirror-imaged and rotated 90°, and the jitter element adds small circles representing every individual data point. These effects show that test scores in both groups approximate a normal distribution and that there is considerable overlap between groups. To find out whether the difference between the means of the two groups, given the variance in scores (SD) within each group, is too improbable to be attributed to random sampling error alone, an independent t-test is conducted.
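The descriptive statistics JASP displays can also be reproduced with a few lines of code, which may help readers verify their understanding. The sketch below assumes the vocab_scores.csv file generated in the earlier sketch; pandas computes the group summaries and matplotlib draws a plain boxplot comparable to Figure 2.

import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv("vocab_scores.csv")  # file created in the earlier sketch

# Mean, SD, and minimum and maximum score per group, as in JASP's Descriptives
print(data.groupby("Group")["Score"].agg(["mean", "std", "min", "max"]).round(2))

# A basic boxplot by group; outliers, if any, appear as separate points
data.boxplot(column="Score", by="Group")
plt.ylabel("Vocabulary test score")
plt.show()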


Figure 2. Boxplots of vocabulary test scores by group as generated in JASP.

Independent T-Test

As each group contains different students, an independent samples t-test is needed. JASP offers three choices of t-test: Student, Welch, and Mann-Whitney. Student's and Welch's t-tests can be performed on scale data (they are called parametric tests and are based on the mean and the values of all data points). The Mann-Whitney test is used with ordinal data or if the distribution of data within the groups is not approximately normal (it is a non-parametric variant, based on the median and ranked data points). The Student's t-test also assumes the amount of variance within each group is similar. The Assumption Checks option allows one to see whether the assumptions of normal distributions and equal variance within each group are met, and JASP explains how to interpret the results. Welch's t-test does not require equal variance; in statistical terms it is more robust than the Student's t-test, making it a more reliable test which should be preferred (Delacre, Lakens, & Leys, 2017). With the fictional data used here, the variance of the two groups is equal, so Student's and Welch's t-tests yield identical results (shown in the table in Figure 3). T-tests produce a t statistic (in this case -2.545), and the p value depends on that and the degrees of freedom (df), which is the number of observations (in this case students) minus the number of groups. With 40 students in two groups there are 38 df. The p value shows the probability of obtaining these results if there were only random variation between the two groups. Within the social sciences, if this is less than 5% (p < .05) the result is considered statistically significant.
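The same three tests are available outside JASP; for instance, SciPy offers all of them in Python. The sketch below again assumes the simulated vocab_scores.csv file rather than the paper's actual scores, so the statistics it prints will not match the values reported here.

import pandas as pd
from scipy import stats

data = pd.read_csv("vocab_scores.csv")
a = data.loc[data["Group"] == "Control", "Score"]
b = data.loc[data["Group"] == "InterventionA", "Score"]

print(stats.ttest_ind(a, b))                   # Student's t-test (assumes equal variances)
print(stats.ttest_ind(a, b, equal_var=False))  # Welch's t-test (does not)
print(stats.mannwhitneyu(a, b))                # Mann-Whitney, the non-parametric option

# df for Student's test: observations minus groups, here 20 + 20 - 2 = 38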


Figure 3. Welch's t-test and descriptives plot as generated in JASP.

Exact values of p are reported between .05 and .01. At lower values of p, it is often reported as being less than a certain value, for example, p < .01 or p < .001. If p is less than .001, the difference between groups is considered highly significant. The table created in JASP contains all the information needed to report the results, after stating the mean and SD of each group. For this fictional data set, the results could be reported as follows: "An independent t-test showed that the control group recalled significantly fewer words than the group which received the vocabulary teaching intervention, t(38) = -2.545, p = .015." The effect size should also be reported; it is displayed by checking the Effect size box under Additional Statistics. This returns the value for Cohen's d. The value obtained for the fictional data is d = -0.805, which is considered a large effect size (Cohen, 1992), although Oswald and Plonsky (2010) suggest it should be interpreted as a medium-sized effect within second language research. Finally, a Descriptives plot is created, which shows the mean score for each group, surrounded by bars which represent the 95% confidence interval (CI) around each mean. Due to the small group sizes, the 95% CIs are very wide, showing that there is a lot of uncertainty as to the value of the true mean of the population to which each group belongs. If this were a real experiment, the results should be reported with caution, as it is uncertain whether similar results would be obtained with different groups of students.
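The 95% CI bars in such a descriptives plot are simply the group mean plus or minus the critical t value times the standard error. The sketch below computes such an interval for one invented group of 20 scores, to show how wide the interval is at this sample size.

import numpy as np
from scipy import stats

def mean_ci(scores, confidence=0.95):
    # CI = mean +/- t-critical * standard error (SD / sqrt(n))
    scores = np.asarray(scores, dtype=float)
    m = scores.mean()
    se = scores.std(ddof=1) / np.sqrt(len(scores))
    t_crit = stats.t.ppf((1 + confidence) / 2, df=len(scores) - 1)
    return m, m - t_crit * se, m + t_crit * se

# Invented scores for one group of 20 (not the paper's data)
group = [11, 14, 9, 13, 12, 15, 10, 12, 13, 11, 16, 8, 12, 14, 10, 13, 11, 12, 15, 9]
print(mean_ci(group))  # mean of 12.0 with a CI from roughly 11 to 13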

ANOVA

Like the t-test, ANOVA tests for a significant difference between groups, but it is used when there are more than two groups. A simple ANOVA design is described here, in which the only difference from the t-test above is the addition of an extra group of 20 students who experienced Intervention B. The process for analysing the data in JASP is basically the same as described above for the t-test. Once the data file has been loaded and descriptive statistics have been produced, the ANOVA test is selected (Figure 4). Score is moved to the Dependent Variable box, and Group is entered as the Fixed Factor. In the t-test the default order, in which the control group was compared to the intervention, was reported (which is why the t statistic and effect size were negative, as the control group scores were lower). This time, by clicking on the spreadsheet headers in the data panel in JASP and selecting the control group then the down arrow, the order is reversed, so that the group receiving Intervention A is compared with Intervention B and then with the control group. The result is again significant, although slightly less so, at p = .023. As the result is statistically significant, post hoc tests are performed to identify which groups are significantly different. These are similar to conducting multiple t-tests, but they control the Type I error rate (multiple comparisons inflate the error rate). In JASP, expand the Post Hoc Tests bar in the centre panel and move Group into the box on the right. Tukey is the default option, although other tests, such as Bonferroni, can be selected. Tukey requires equal group sizes; Bonferroni is more generalisable if one is unsure which test is most appropriate. Although there is an effect size box, Cohen's d is not appropriate for multiple comparisons. Instead, the correct effect size estimates are under the Additional Options bar. Omega squared (ω²) was selected, as this has been shown to perform better with small group sizes than the default option, eta squared (η²), and equally well with larger groups (Okada, 2013). The results can be reported in a similar way to the t-test above. However, in this case the F statistic is calculated and there are two values for degrees of freedom (df).

Figure 4. ANOVA, post hoc tests, and descriptives plot as generated in JASP.

The first df value is the number of groups minus one (3 - 1 = 2) and the second is the sum of the df for each group (three groups of 20, each contributing 20 - 1 df, therefore 3(20 - 1) = 57). To report this, one could write: "An independent one-way ANOVA showed there was a significant, medium-sized effect of teaching intervention on vocabulary test scores; F(2, 57) = 4.017, p = .023, ω² = 0.091. Post hoc tests using Tukey's test showed that students who received Intervention A recalled significantly more words than the control group (p = .023). However, there was no significant difference in test scores between the control group and the group that received Intervention B (p = .776), and the difference between Interventions A and B, on average 2.25 words, was marginally non-significant (p = .111)." Note that although there was no significant difference between Intervention B and either of the other two groups, the p values are quite different. One would expect the observed difference in test scores between Interventions A and B in 11% of studies due to random sampling error alone, a higher Type I error rate than is conventionally accepted. However, this does not indicate that there was no difference between the two groups.
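An equivalent analysis can be scripted in Python with SciPy and statsmodels, which also makes the omega squared formula explicit (JASP computes it automatically). The group scores below are simulated under assumed means and SDs, so the printed statistics will differ from those reported above.

import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Simulated scores for three groups of 20 (assumed parameters, not the paper's data)
rng = np.random.default_rng(seed=2)
control = rng.normal(11.0, 3.0, 20)
interv_a = rng.normal(13.5, 3.0, 20)
interv_b = rng.normal(11.8, 3.0, 20)
groups = [control, interv_a, interv_b]

print(stats.f_oneway(*groups))  # one-way ANOVA: F statistic and p value

# Omega squared from the ANOVA sums of squares:
# (SS_between - df_between * MS_within) / (SS_total + MS_within)
grand = np.concatenate(groups)
ss_between = sum(len(g) * (g.mean() - grand.mean()) ** 2 for g in groups)
ss_total = ((grand - grand.mean()) ** 2).sum()
df_between, df_within = len(groups) - 1, len(grand) - len(groups)
ms_within = (ss_total - ss_between) / df_within
print((ss_between - df_between * ms_within) / (ss_total + ms_within))

# Tukey post hoc comparisons, as selected in JASP
labels = ["Control"] * 20 + ["A"] * 20 + ["B"] * 20
print(pairwise_tukeyhsd(grand, labels))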


Discussion

The fictional data presented above, with the small group size of twenty people, was designed to produce a significant difference between the test scores of students in the control group and those who received Intervention A, but not between Intervention A and Intervention B. The Intervention B scores were, on average, only slightly higher than those of the control group. A common mistake is to interpret such results as indicating no difference between Interventions A and B (Amrhein, Greenland, & McShane, 2019). The accurate interpretation is that the results are inconclusive. Intervention A may be more effective than Intervention B, but no conclusion can be drawn without replicating the experiment with larger groups of participants. This does not mean repeating the experiment would necessarily be worthwhile: the effect size could be inflated beyond the true value, and the null hypothesis of no difference between the two interventions could be true. Rather, this example is intended as a caution against conducting research with group sizes that are too small to have a good chance of detecting an effect even when there really is one (an underpowered study), which, unfortunately, is common in applied linguistics (Plonsky, 2013; Plonsky & Gass, 2011).

To accompany his book, Cumming (2012) created interactive spreadsheets using Exploratory Software for Confidence Intervals (ESCI; https://thenewstatistics.com/itns/esci/esci-for-utns/) to illustrate the issues of underpowered studies and random sampling error. The basic message is that with small groups the chance of a Type II error (not rejecting the null hypothesis when it is false) is often high and, concomitantly, so is the chance that a published significant result in an experiment with small groups is a Type I error (rejecting the null hypothesis when it is true). This latter phenomenon has led to the so-called replication crisis in psychology, in which the effects found in famous published studies have not been replicated in large-scale experiments (Open Science Collaboration, 2015). There are several reasons for this, but the most important is probably Abelson's (2012) first law of statistics, "chance is lumpy" (p. 19). In other words, with small group sizes random sampling error (chance effects) can be big enough to create a spurious significant difference between groups.
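Abelson's point can be demonstrated with a small simulation. The sketch below draws many pairs of 20-person samples from two populations that genuinely differ by a medium effect (d = 0.5; all parameters here are assumptions chosen for illustration) and counts how often a t-test reaches p < .05. With groups this small, only around a third of such experiments detect the real effect, and those that do will tend to overestimate its size.

import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=3)
n, true_d, runs = 20, 0.5, 10_000  # group size, true effect in SD units, replications

significant = 0
for _ in range(runs):
    control = rng.normal(0.0, 1.0, n)
    treatment = rng.normal(true_d, 1.0, n)
    if stats.ttest_ind(control, treatment).pvalue < .05:
        significant += 1

print(significant / runs)  # roughly 0.33: a badly underpowered design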


One way to enable better interpretation of the significance of the data is to report confidence intervals (CIs). The 95% CI covers a range of values around the mean of each group (Figures 3 and 4); there is a 95% probability that the true mean lies within these values. In the simulated experiments above, the intervals are quite wide, showing the uncertainty of the results due to random sampling variation. This means the results should be interpreted and reported with caution. Although there is less than a 5% probability of obtaining these results if there were no systematic difference between Intervention A and the control group, the true group mean scores could be much smaller or much larger than those found with the simulated participants. In this experiment Intervention A was effective, compared with no intervention. However, due to the small sample size, it remains uncertain whether this result would be replicated with different student groups. In other words, it is unclear whether this result would be generalisable, which is the reason for conducting inferential statistical tests (for a fuller explanation, see Cumming, 2012). Confidence intervals can also be calculated for effect size, providing the range of values within which the true magnitude of the effect lies (Kelley & Rausch, 2006).

Effect size and CIs illustrate why it is important to design experiments with large enough groups, or enough power, to reliably detect an effect if it does exist. The free software G*Power 3 (http://gpower.hhu.de), described in Faul, Erdfelder, Lang, and Buchner (2007), enables researchers to calculate the power of an experiment, before or after it has been conducted. As an example, in an experiment with two independent groups, to have an 80% chance of detecting a medium effect size of 0.5 (80% power), with the Type I error rate set at the standard alpha (α) = .05, each group would need 64 participants. With a medium effect size and groups of 20 participants, an experiment is so underpowered that even when there is a true effect there is less than a 50% chance of detecting it as a statistically significant difference. A potential solution for the experiment described above would be to design it such that the participants acted as their own control, by participating in both control and intervention conditions, learning two different sets of words. Due to space constraints, such a design has not been explained in this paper, but analysis would require a paired samples t-test or repeated measures ANOVA. Comparing students with themselves under two conditions removes variability between individuals (ability, motivation, etc.). Although there would still be within-individual variation (sleep levels, mood differences, etc.), this is much smaller, so measurement of the dependent variable (test scores) would be more reliable, provided the sets of words were of similar difficulty. Repeated measures designs come closer to comparing the target factors or variables than independent designs (Hand, 1994). Whereas 128 participants are needed to achieve 80% power with two independent groups and a medium effect size, with a repeated measures design the same power is reached with just 34 participants, as illustrated in the sketch below. Alternatively, only one group is needed to investigate the relationship between variables (e.g., the relationship between time of test and test scores), which is analysed by correlation (in JASP, Regression: Correlation Matrix). Another option is to conduct action research (Burns, 2005), rather than attempting to use inferential statistics to generalise beyond one's own teaching context. In sum, a basic understanding of experimental design can prevent a lot of wasted effort as well as increasing the reliability of reported research results.
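The G*Power calculations mentioned above can be approximated in Python with statsmodels; the sketch below solves for the group sizes needed for 80% power with a medium effect (d = 0.5, α = .05) under both designs, and also shows the power actually achieved with independent groups of 20.

import math
from statsmodels.stats.power import TTestIndPower, TTestPower

# Independent two-group design: n per group for d = 0.5, alpha = .05, power = .80
n_ind = TTestIndPower().solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print(math.ceil(n_ind))  # 64 per group, i.e., 128 participants in total

# Paired (repeated measures) design with the same assumptions
n_paired = TTestPower().solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print(math.ceil(n_paired))  # 34 participants

# Power achieved with only 20 participants per independent group
print(TTestIndPower().power(effect_size=0.5, nobs1=20, alpha=0.05))  # about 0.34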

Conclusion

The design and aim of the simulated experiment were very simple, as the focus was on how to conduct two common analyses in applied linguistics research, t-test and ANOVA, using JASP. More complicated research questions or longitudinal designs often necessitate independent intervention and control groups, requiring such analyses. The description given here should guide novice researchers in conducting basic quantitative data analysis and interpreting the results. In addition, the need for good experimental design was highlighted, including the reporting of effect size and CIs. The fictional experiment was underpowered due to small group size; a repeated measures design would have generated more reliable results. When conducting quantitative research, researchers should ensure access to sufficient participants to have at least an 80% chance of achieving their research goals. Otherwise, spending a large amount of time collecting interesting information about students and their learning may be meaningless.


Notes

1. Newer versions of JASP have since been released with additional capabilities.
2. In classical test theory, a linear distance between raw scores is assumed. However, as highlighted in the Rasch model, in reality distances between raw scores are not equal due to differences in item difficulty.

References

Abelson, R. P. (2012). Statistics as principled argument (Kindle version). Retrieved from Amazon.co.uk.
American Psychological Association. (2009). Publication manual of the American Psychological Association (6th ed.). Washington, DC: American Psychological Association.
Amrhein, V., Greenland, S., & McShane, B. (2019, March 20). Scientists rise up against statistical significance. Nature, 567(7748), 305-307. https://doi.org/10.1038/d41586-019-00857-9
Burns, A. (2005). Action research: An evolving paradigm? Language Teaching, 38(2), 57-74. https://doi.org/10.1017/S0261444805002661
Carver, R. P. (1978). The case against statistical significance testing. Harvard Educational Review, 48(3), 378-399. https://doi.org/10.17763/haer.48.3.t490261645281841
Cohen, J. (1992). A power primer. Psychological Bulletin, 112(1), 155-159. https://dx.doi.org/10.1037/0033-2909.112.1.155
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49(12), 997-1003. https://dx.doi.org/10.1037/0003-066X.49.12.997
Cumming, G. (2012). Understanding the new statistics: Effect sizes, confidence intervals, and meta-analysis (Kindle version). Retrieved from Amazon.co.uk.
Delacre, M., Lakens, D., & Leys, C. (2017). Why psychologists should by default use Welch's t-test instead of Student's t-test. International Review of Social Psychology, 30(1), 92-101. https://doi.org/10.5334/irsp.82
Faul, F., Erdfelder, E., Lang, A. G., & Buchner, A. (2007). G*Power 3: A flexible statistical power analysis program for the social, behavioral, and biomedical sciences. Behavior Research Methods, 39(2), 175-191. https://doi.org/10.3758/BF03193146
Field, A. P. (2017). Discovering statistics using IBM SPSS statistics (5th ed.). London, England: Sage.
Goss-Sampson, M. A. (2018). Statistical analysis in JASP: A guide for students. Retrieved from https://jasp-stats.org/jasp-materials/
Hand, D. J. (1994). Deconstructing statistical questions. Journal of the Royal Statistical Society. Series A (Statistics in Society), 157(3), 317-356. https://doi.org/10.2307/2983526
Ioannidis, J. P. A. (2005). Why most published research findings are false. PLoS Medicine, 2(8), e124. https://doi.org/10.1371/journal.pmed.0020124
Kelley, K., & Rausch, J. R. (2006). Sample size planning for the standardized mean difference: Accuracy in parameter estimation via narrow confidence intervals. Psychological Methods, 11(4), 363-385. https://dx.doi.org/10.1037/1082-989X.11.4.363
Norris, J. M., & Ortega, L. (2000). Effectiveness of L2 instruction: A research synthesis and quantitative meta-analysis. Language Learning, 50(3), 417-528. https://doi.org/10.1111/0023-8333.00136
Okada, K. (2013). Is omega squared less biased? A comparison of three major effect size indices in one-way ANOVA. Behaviormetrika, 40(2), 129-147. https://doi.org/10.2333/bhmk.40.129
Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), 1-8. https://doi.org/10.1126/science.aac4716
Oswald, F. L., & Plonsky, L. (2010). Meta-analysis in second language research: Choices and challenges. Annual Review of Applied Linguistics, 30, 85-110. https://doi.org/10.1017/S0267190510000115
Plonsky, L. (2011). The effectiveness of second language strategy instruction: A meta-analysis. Language Learning, 61(4), 993-1038. https://dx.doi.org/10.1111/j.1467-9922.2011.00663.x
Plonsky, L. (2013). Study quality in SLA: An assessment of designs, analyses, and reporting practices in quantitative L2 research. Studies in Second Language Acquisition, 35(4), 655-687. https://doi.org/10.1017/S0272263113000399
Plonsky, L. (2015). Statistical power, p values, descriptive statistics, and effect sizes: A "back-to-basics" approach to advancing quantitative methods in L2 research. In L. Plonsky (Ed.), Advancing quantitative methods in second language research (Kindle version). Retrieved from Amazon.co.uk.
Plonsky, L., & Gass, S. (2011). Quantitative research methods, study quality, and outcomes: The case of interaction research. Language Learning, 61(2), 325-366. https://doi.org/10.1111/j.1467-9922.2011.00640.x
Van de Schoot, R., Winter, S. D., Ryan, O., Zondervan-Zwijnenburg, M., & Depaoli, S. (2017). A systematic review of Bayesian articles in psychology: The last 25 years. Psychological Methods, 22(2), 217-239. https://dx.doi.org/10.1037/met0000100
Wagenmakers, E. J., Gronau, Q. F., Etz, A., & Matzke, D. (2017, May 4). The JASP book. Retrieved from https://osf.io/3bdqp

Author bio

Caroline Handley is an English lecturer at Asia University in Tokyo. She is currently pursuing a PhD at Swansea University in Wales, where she is using word association tasks to research the relation between conceptual and linguistic knowledge in lexical processing. [email protected]

Received: November 23, 2018 Accepted: April 23, 2019


Appendix A

Technical Terminology

The table below summarises the technical terms related to quantitative data analysis used in this paper. The terms are listed so that terms related to similar concepts are adjacent.

Descriptive statistics: The numbers that describe the data collected, such as the mean (M) and standard deviation (SD).

Inferential statistics: Tests performed to make predictions about the whole population based on the sample used in one's experiment.

Mean (M): The average of all the data points. (It should not be confused with the median, which is the middle number in a ranked data set.)

Standard deviation (SD): A measure of the amount of variance within a group. It summarises the distance of each data point from the mean. A small SD indicates most data points are close to the mean; a larger SD shows the data points are more dispersed.

Normal distribution: A bell-shaped curve which describes the distribution of data points around the mean. Most data points cluster around the mean, with few points in the tail ends of the curve.

Null hypothesis (H0): The hypothesis that any difference between groups is due purely to chance (random noise).

Alternative hypothesis (H1): The research hypothesis or theory being tested, which proposes that there is a systematic difference between groups due to some factor.

Independent variable: The variable (or factor) that is changed or controlled in an experiment in the prediction that it will have an effect on another variable.

Dependent variable: The variable that is measured in an experiment (such as test score), to test the null hypothesis that it will not be systematically altered by the effect of the independent variable.

Control group: Scientific design involves the random assignment of individuals to one of two groups. The control group receives no treatment or intervention but provides a comparison for the experimental group.

Experimental group: This group receives the treatment or intervention (the independent variable). The alternative hypothesis is that this will produce a statistically significant difference in the dependent variable compared to the control group, indicating a systematic effect. In repeated measures designs the experimental group acts as its own control.

Type I error: Rejecting the null hypothesis when it is true. By convention, the chance of making this error is kept to less than 5% (α = .05).

Type II error: Not rejecting the null hypothesis when it is false. The chance of making this error is related to effect size and sample size: the smaller the systematic effect of the independent variable on the dependent variable, the larger the sample size needed to reduce the chance of making this error.

Statistical significance: Results are said to be statistically significant if the observed difference between groups would be expected to occur due to random sampling error in 5% or fewer of studies.

Parametric tests: Inferential statistical tests based on the mean and individual data points. Such tests assume the data collected are normally distributed.

Non-parametric tests: Inferential statistical tests based on the median and ranked data points. Such tests make no assumptions about the distribution of the data collected.

Independent t-test: A statistical test used to compare the means of the control and experimental group; it indicates the probability that the two groups come from the same population (are identical), given the variance within each group.

ANOVA: A statistical test used to compare the means of the control and two or more experimental groups; it indicates the probability that all groups come from the same population, given the variance within each group.

Repeated measures ANOVA: A statistical test used to compare the means of one group under three or more different conditions (including a control condition); it indicates the probability that all conditions are equal, given the variance within each condition. The equivalent t-test is called a paired samples t-test.

Post-hoc tests: Tests conducted after an ANOVA that yields a statistically significant result, to find out where the differences are. They are similar to conducting multiple t-tests but prevent inflation of the Type I error rate. The most common tests are Tukey's HSD and Bonferroni.

Effect size: A calculation of the magnitude of the difference between two or more groups. There are different measures of effect, such as Cohen's d, each with standardised benchmarks for small, medium, and large effects.

Confidence interval (CI): An estimate of the range of values within which the true population mean is expected to lie. A 95% CI is typically calculated, meaning that there is a 95% probability that the mean is within this range. Larger sample sizes enable greater accuracy of measurement and smaller CIs.

Power: Statistical power is related to Type II errors. It is the probability that an effect will be detected as significant if there is a true effect. Power increases with larger effect sizes and larger sample sizes. Many studies are underpowered because the sample size is too small. Power can be calculated before or after conducting an experiment.


Appendix B

Suggested Resources

Books

Cumming, G. (2012). Understanding the new statistics: Effect sizes, confidence intervals, and meta-analysis. Routledge (Kindle version). Retrieved from Amazon.co.uk.
Goss-Sampson, M. A. (2018). Statistical analysis in JASP: A guide for students. Retrieved from https://jasp-stats.org/jasp-materials/
Plonsky, L. (2015). Advancing quantitative methods in second language research. Routledge (Kindle version). Retrieved from Amazon.co.uk.

Free Online Resources

• Coursera: an intermediate-level course by Daniël Lakens about interpreting statistics (coursera.org/learn/statistical-inferences)
• JALT Testing and Evaluation SIG: produces the journal Shiken, which includes articles on statistical issues, typically in relation to classroom assessment (teval.jalt.org)
• JASP Statistics: a collection of videos demonstrating how to perform various data analyses using JASP (youtube.com/channel/UCSulowI4mXFyBkw3bmp7pXg)
• Khan Academy: a beginner-level course on statistics and probability (khanacademy.org/math/statistics-probability)
• Penn State Eberly College of Science Statistics course: a text-only elementary statistics course which is also useful as a reference (onlinecourses.science.psu.edu/stat200/home)
• Statistics of DOOM: a large collection of videos showing how to perform various data analyses using JASP, Excel, SPSS, R, and Python (youtube.com/channel/UCMdihazndR0f9XBoSXWqnYg)
