D.G. Bonett (8/2018)

Module 3

One-factor Experiments

A between-subjects treatment factor is an independent variable with a ≥ 2 levels in which participants are randomized into a groups. It is common, but not necessary, to have an equal number of participants in each group. Each group receives one of the a levels of the independent variable with participants being treated identically in every other respect. The two-group design considered previously is a special case of this type of design.

In a one-factor experiment with a levels of the independent variable (also called a completely randomized design), the population parameters are 휇1, 휇2, …, 휇푎 where 휇푗 (j = 1 to a) is the population mean of the response variable if all members of the study population had received level j of the independent variable. One way to assess the differences among the a population means is to compute confidence intervals for all possible pairs of differences. For example, with a = 3 levels the following pairwise comparisons of population means could be examined.

휇1 – 휇2 휇1 – 휇3 휇2 – 휇3

In a one-factor experiment with a levels there are a(a – 1)/2 pairwise comparisons. Confidence intervals for any of the two-group measures of effect size (e.g., mean difference, standardized mean difference, mean ratio, median difference, median ratio) described in Module 2 can be used to analyze any pair of groups.

For any single 100(1 − 훼)% confidence interval, we can be 100(1 − 훼)% confident that the confidence interval has captured the population parameter, and if v 100(1 − 훼)% confidence intervals are computed, we can be at least 100(1 − 푣훼)% confident that all v confidence intervals have captured their population parameters. For example, if 95% confidence intervals for 휇1 – 휇2, 휇1 – 휇3, and 휇2 – 휇3 are computed, we can be at least 100(1 − 3(.05))% = 100(1 – .15)% = 85% confident that all three confidence intervals have captured the three population mean differences.

When considering v confidence intervals for some measure of effect size, the researcher would like to be at least 100(1 − 훼)% confident, rather than at least 100(1 − 푣훼)% confident, that all v confidence intervals will capture the v population effect size values. One simple way to achieve this is to use a Bonferroni adjustment 훼* = 훼/v rather than 훼 in the critical t-value or critical z-value for each confidence interval.
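The adjustment is just a division of 훼 before the critical value is computed. The following sketch assumes Python with scipy is available; the number of comparisons and the df are hypothetical illustrations, not values from the text.

```python
from scipy import stats

# Bonferroni adjustment: alpha* = alpha / v for v simultaneous intervals.
alpha = .05
v = 3                                    # e.g., all pairwise comparisons when a = 3
alpha_star = alpha / v

# Hypothetical df = 58 (two groups of n = 30) for each pairwise interval.
t_unadjusted = stats.t.ppf(1 - alpha / 2, 58)
t_adjusted = stats.t.ppf(1 - alpha_star / 2, 58)
print(alpha_star, t_unadjusted, t_adjusted)
```

The adjusted critical value is larger, so each of the v intervals is wider; that is the price paid for the simultaneous 100(1 − 훼)% confidence guarantee.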


When examining all possible pairwise differences, the Tukey-Kramer method yields a narrower confidence interval than the Bonferroni method. The classical Tukey-Kramer method for comparing all possible pairs of means assumes equal population variances, but a version of the Tukey-Kramer method that does not require equal population variances is available. SPSS provides an option to compute Games-Howell confidence intervals for all pairwise comparisons of means that are the same as the unequal-variance version of the Tukey-Kramer confidence intervals. The Tukey-Kramer and Games-Howell methods are used only when the researcher is interested in examining all possible pairwise differences. A Bonferroni confidence interval will be narrower than a Tukey-Kramer or Games-Howell confidence interval if, prior to an examination of the sample results, the researcher is interested in only u < v of the v = a(a – 1)/2 possible pairwise comparisons. For u planned comparisons, the Bonferroni adjustment is 훼* = 훼/u. However, if u of the v possible pairwise comparisons appear interesting after an examination of the sample results, it is necessary to use 훼* = 훼/v and not 훼* = 훼/u.

Example 3.1. There is considerable variability in measures of intellectual ability among college students. One psychologist believes that some of this variability can be explained by differences in how students expect to perform on these tests. Ninety undergraduates were randomly selected from a list of about 5,400 undergraduates. The 90 students were randomly divided into three groups of equal size and all 90 students were given a nonverbal intelligence test (Raven’s Progressive Matrices) under identical testing conditions. The raw scores for this test range from 0 to 60. The students in group 1 were told that they were taking a very difficult intelligence test. The students in group 2 were told that they were taking an interesting “puzzle”. The students in group 3 were not told anything. Simultaneous Tukey-Kramer confidence intervals for all pairwise comparisons of population means are given below.

Comparison    95% Lower Limit    95% Upper Limit
휇1 – 휇2            -5.4               -3.1
휇1 – 휇3            -3.2               -1.4
휇2 – 휇3             1.2                3.5

The researcher is 95% confident that the mean intelligence score would be 3.1 to 5.4 greater if all 5,400 undergraduates had been told that the test was a puzzle instead of a difficult IQ test, 1.4 to 3.2 greater if they all had been told nothing instead of being told that the test is a difficult IQ test, and 1.2 to 3.5 greater if they all had been told the test was a puzzle instead of being told nothing. The simultaneous confidence intervals allow the researcher to be 95% confident regarding all three conclusions.

Linear Contrasts

Some research questions can be expressed in terms of a linear contrast of population means, ∑푐푗휇푗 (with the sum taken over j = 1 to a), where 푐푗 is called a contrast coefficient. For example,

in an experiment that compares two costly treatments (Treatments 1 and 2) with a new inexpensive treatment (Treatment 3), a confidence interval for (휇1 + 휇2)/2 – 휇3 may provide valuable information regarding the relative costs and benefits of the new treatment. Statistical packages and various statistical formulas require linear contrasts to be expressed as ∑푐푗휇푗, which requires the specification of the contrast coefficients. For example, (휇1 + 휇2)/2 – 휇3 can be expressed as (½)휇1 + (½)휇2 + (-1)휇3 so that 푐1 = .5, 푐2 = .5, and 푐3 = -1. Consider another example where Treatment 1 is delivered to groups 1 and 2 by experimenters A and B and Treatment 2 is delivered to groups 3 and 4 by experimenters C and D. In this study we may want to estimate (휇1 + 휇2)/2 – (휇3 + 휇4)/2, which can be expressed as (½)휇1 + (½)휇2 + (-½)휇3 + (-½)휇4 so that 푐1 = .5, 푐2 = .5, 푐3 = -.5, and 푐4 = -.5.

A 100(1 − 훼)% unequal-variance confidence interval for ∑푐푗휇푗 is

∑푐푗휇̂푗 ± 푡훼/2;푑푓 √(∑푐푗²휎̂푗²/푛푗)                    (3.1)

where the sums run from j = 1 to a and df = (∑푐푗²휎̂푗²/푛푗)² / ∑[푐푗⁴휎̂푗⁴/(푛푗²(푛푗 − 1))]. When examining v linear contrasts, 훼 can be replaced with 훼* = 훼/v in Equation 3.1 to give a set of Bonferroni simultaneous confidence intervals.

If the sample sizes are approximately equal and there is convincing evidence from previous research that the population variances are not highly dissimilar, then the unequal-variance standard error in Equation 3.1 could be replaced with an equal-variance standard error √(휎̂푝² ∑푐푗²/푛푗) where 휎̂푝² = [∑(푛푗 − 1)휎̂푗²]/푑푓 and df = (∑푛푗) − 푎.
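Equation 3.1 is easy to compute from group summary statistics. The sketch below assumes Python with scipy; the means, variances, sample sizes, and contrast coefficients are hypothetical values chosen only for illustration.

```python
import math
from scipy import stats

def contrast_ci(means, variances, ns, c, alpha=.05):
    """Unequal-variance CI for a linear contrast of means (Equation 3.1)."""
    est = sum(cj * m for cj, m in zip(c, means))
    terms = [cj**2 * v / n for cj, v, n in zip(c, variances, ns)]
    se = math.sqrt(sum(terms))
    # Satterthwaite approximate df from Equation 3.1.
    df = sum(terms)**2 / sum(cj**4 * v**2 / (n**2 * (n - 1))
                             for cj, v, n in zip(c, variances, ns))
    t = stats.t.ppf(1 - alpha / 2, df)
    return est - t * se, est + t * se

# Hypothetical three-group summary data; contrast (mu1 + mu2)/2 - mu3.
lo, hi = contrast_ci([10.0, 12.0, 15.0], [4.0, 5.0, 6.0], [20, 25, 22], [.5, .5, -1])
print(lo, hi)
```

The interval is centered at the estimated contrast, here (10 + 12)/2 − 15 = −4.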

Standardized Linear Contrasts

In applications where the intended audience may be unfamiliar with the metric of the response variable, it could be helpful to report a confidence interval for a standardized linear contrast of population means which is defined as

휑 = ∑푐푗휇푗 / √(∑휎푗²/푎)

and is a generalization of the standardized mean difference defined in Module 2. The denominator of 휑 is called the standardizer. Some alternative standardizers have been proposed for linear contrasts. One alternative standardizer averages the variances across only those groups that have a non-zero contrast coefficient. Another standardizer uses only the variance from a control group. Although not recommended for routine use, the most popular standardizer is the square root of 휎̂푝² defined above, which can be justified only when the population variances are approximately equal or the sample sizes are equal.

An approximate equal-variance 100(1 − 훼)% confidence interval for 휑 is

휑̂ ± 푧훼/2 푆퐸휑̂                    (3.2)

where 휑̂ = ∑푐푗휇̂푗 / √(∑휎̂푗²/푎) and 푆퐸휑̂ = √[(휑̂²/2푎²) ∑1/(푛푗 − 1) + ∑푐푗²/푛푗]. An unequal-variance confidence interval for 휑 is available and is recommended in studies with unequal sample sizes. When examining v linear contrasts, 훼 can be replaced with 훼* = 훼/v in Equation 3.2 to give a set of Bonferroni simultaneous confidence intervals.

Example 3.2. Ninety students were randomly selected from a research participant pool and randomized into three groups. All three groups were given the same set of boring tasks for 20 minutes. Then all students listened to an audio recording that listed the names of 40 people who will be attending a party and the names of 20 people who will not be attending the party in random order. The participants were told to simply write down the names of the people who will attend the party as they hear them. In group 1, the participants were asked to draw copies of complex geometric figures while they were listening to the audio recording and writing. In group 2, the participants were not told to draw anything while listening and writing. In group 3, the participants were told to draw squares while listening and writing. The number of correctly recorded attendees was obtained from each participant. The sample means and variances are given below.

Complex Drawing    No Drawing    Simple Drawing
휇̂1 = 24.9          휇̂2 = 23.1     휇̂3 = 31.6
휎̂1² = 27.2         휎̂2² = 21.8    휎̂3² = 24.8
n1 = 30            n2 = 30       n3 = 30

The 95% confidence interval for (휇1 + 휇2)/2 – 휇3 is [-9.82, -5.38]. The researcher is 95% confident that the population mean number of correctly recorded attendees averaged across the no drawing and complex drawing conditions is 5.38 to 9.82 lower than the population mean number of correctly recorded attendees under the simple drawing condition. The 95% confidence interval for 휑 is [-2.03, -1.03]. The researcher is 95% confident that the population mean number of correctly recorded attendee names, averaged across the no drawing and complex drawing conditions, is 1.03 to 2.03 standard deviations below the population mean number of correctly recorded attendee names under the simple drawing condition.
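The two intervals in Example 3.2 can be checked (up to rounding) from the summary statistics. The sketch below assumes Python with scipy and applies Equation 3.1 to the unstandardized contrast and Equation 3.2 to 휑; the last decimal may differ slightly from the reported limits depending on the rounding used.

```python
import math
from scipy import stats

# Summary statistics from Example 3.2 (three groups of n = 30).
means = [24.9, 23.1, 31.6]
variances = [27.2, 21.8, 24.8]
ns = [30, 30, 30]
c = [.5, .5, -1]          # (mu1 + mu2)/2 - mu3

# Equation 3.1: unequal-variance CI for the unstandardized contrast.
est = sum(cj * m for cj, m in zip(c, means))
terms = [cj**2 * v / n for cj, v, n in zip(c, variances, ns)]
se = math.sqrt(sum(terms))
df = sum(terms)**2 / sum(cj**4 * v**2 / (n**2 * (n - 1))
                         for cj, v, n in zip(c, variances, ns))
t = stats.t.ppf(.975, df)
ci_raw = (est - t * se, est + t * se)    # close to the reported [-9.82, -5.38]

# Equation 3.2: equal-variance CI for the standardized contrast phi.
a = len(means)
phi = est / math.sqrt(sum(variances) / a)
se_phi = math.sqrt(phi**2 / (2 * a**2) * sum(1 / (n - 1) for n in ns)
                   + sum(cj**2 / n for cj, n in zip(c, ns)))
z = stats.norm.ppf(.975)
ci_phi = (phi - z * se_phi, phi + z * se_phi)   # close to the reported [-2.03, -1.03]
print(ci_raw, ci_phi)
```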


Hypothesis Tests for Linear Contrasts

A confidence interval for ∑푐푗휇푗 can be used to perform a directional two-sided test of the following hypotheses.

H0: ∑푐푗휇푗 = 0        H1: ∑푐푗휇푗 > 0        H2: ∑푐푗휇푗 < 0

If the lower limit for ∑푐푗휇푗 is greater than 0, then reject H0 and accept H1. If the upper limit for ∑푐푗휇푗 is less than 0, then reject H0 and accept H2. The results are inconclusive if the confidence interval includes 0. Note that it is not necessary to develop special hypothesis testing rules for 휑 because ∑푐푗휇푗 = 0 implies 휑 = 0, ∑푐푗휇푗 > 0 implies 휑 > 0, and ∑푐푗휇푗 < 0 implies 휑 < 0.
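The decision rule above is mechanical and easy to encode. A minimal sketch in plain Python (the first two interval limits are taken from Example 3.1; the third is a hypothetical interval that covers 0):

```python
def directional_decision(lower, upper):
    """Directional two-sided test for a linear contrast, based on its CI."""
    if lower > 0:
        return "reject H0, accept H1 (contrast > 0)"
    if upper < 0:
        return "reject H0, accept H2 (contrast < 0)"
    return "inconclusive (interval includes 0)"

print(directional_decision(1.2, 3.5))     # the mu2 - mu3 interval in Example 3.1
print(directional_decision(-3.2, -1.4))   # the mu1 - mu3 interval in Example 3.1
print(directional_decision(-0.4, 2.1))    # hypothetical interval covering 0
```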

In an equivalence test, the goal is to decide if ∑푐푗휇푗 is between -푏 and 푏 or if ∑푐푗휇푗 is outside this range, where 푏 is a number that represents a small or unimportant value of ∑푐푗휇푗. An equivalence test involves selecting one of the following two hypotheses.

H0: |∑푐푗휇푗| ≤ 푏        H1: |∑푐푗휇푗| > 푏

In applications where it is difficult to specify a small or unimportant value of ∑푐푗휇푗, it may be easier to specify 푏 for a standardized linear contrast of means and choose between the following two hypotheses.

H0: |휑| ≤ 푏 H1: |휑| > 푏

Simultaneous Two-sided Directional Tests

Simultaneous confidence intervals could be used to test multiple hypotheses and keep the familywise directional error rate (FWDER) at or below 훼/2. FWDER is the probability of making one or more directional errors when testing multiple null hypotheses. The Holm test is more powerful than tests based on simultaneous confidence intervals and also keeps the FWDER at or below 훼/2. To perform a Holm test of v null hypotheses, rank order the p-values for the v tests from smallest to largest. If the smallest p-value is less than 훼/v, then reject H0 for that test and examine the next smallest p-value; otherwise, do not reject H0 for that test or any of the remaining v – 1 null hypotheses. If the second smallest p-value is less than 훼/(v – 1), then reject H0 for that test and examine the next smallest p-value; otherwise, do not reject H0 for that test or any of the remaining v – 2 null hypotheses. If the third smallest p-value is less than 훼/(v – 2), then reject H0 for


that test and examine the next smallest p-value; otherwise, do not reject H0 for that test or any of the remaining v – 3 null hypotheses (and so on).

Suppose the ranked p-values for three (v = 3) tests of linear contrasts are .004, .028, and .031. For 훼 = .05, the first null hypothesis is rejected because .004 < .05/3. The second null hypothesis is not rejected because .028 > .05/2. The third null hypothesis is not rejected because the second null hypothesis was not rejected even though .031 < .05/1.
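The step-down procedure described above can be written directly. A sketch in plain Python, reproducing the v = 3 example:

```python
def holm(p_values, alpha=.05):
    """Holm step-down test: returns a reject (True) / retain (False) decision
    for each p-value in its original order."""
    order = sorted(range(len(p_values)), key=lambda i: p_values[i])
    decisions = [False] * len(p_values)
    v = len(p_values)
    for step, i in enumerate(order):
        if p_values[i] < alpha / (v - step):
            decisions[i] = True
        else:
            break    # once one test fails, all remaining tests are retained
    return decisions

print(holm([.004, .028, .031]))   # [True, False, False]
```

Note that the third hypothesis is retained even though .031 < .05/1, because the step-down sequence stops as soon as one hypothesis is retained.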

One-way Analysis of Variance

The variability in the response variable scores in a one-factor design can be decomposed into two sources of variability – the variance of scores within treatments (called the error variance or residual variance) and the variance due to mean differences across treatments (also called between-group variance). The decomposition of variability in a one-factor design can be summarized in a one-way analysis of variance (one-way ANOVA) table, as shown below, where n is the total sample size (n = 푛1 + 푛2 + … + 푛푎), SS stands for sum of squares, and MS stands for mean square. The between-group factor (i.e., the independent variable) will be referred to as "Factor A". The components of the ANOVA table for a one-factor design are shown below.

Source    SS     df             MS               F
___________________________________________________
A         SSA    dfA = a – 1    MSA = SSA/dfA    MSA/MSE

ERROR     SSE    dfE = n – a    MSE = SSE/dfE

TOTAL     SST    dfT = n – 1
___________________________________________________

The sum of squares (SS) formulas are

SSA = ∑푛푗(휇̂푗 − 휇̂+)²  where 휇̂+ = ∑푗∑푖 푦푖푗 / ∑푛푗

SSE = ∑푗∑푖(푦푖푗 − 휇̂푗)² = ∑(푛푗 − 1)휎̂푗²

SST = ∑푗∑푖(푦푖푗 − 휇̂+)² = SSA + SSE

where j runs from 1 to a and i runs from 1 to 푛푗.

SSA will equal zero if all sample means are equal and will be large if the sample means are highly unequal. MSE = SSE/dfE is called the mean squared error and is equal to the pooled within-group variance (휎̂푝²) that was defined previously for the equal-variance confidence interval. SST/dfT is the variance for the total set of n scores ignoring group membership.


The SS values in the ANOVA table can be used to estimate a standardized measure of effect size called eta-squared, which can be defined as 휂² = 1 – 휎퐸²/휎푇². In a nonexperimental design, 휎푇² is the variance of the response variable for everyone in the study population and 휎퐸² is the variance of the response variable within each subpopulation of the study population (and 휎퐸² is assumed to be equal across all subpopulations). In an experimental design, 휎퐸² is the variance of the response variable for everyone in the study population assuming they all received a particular treatment and 휎푇² = 휎휇² + 휎퐸², where 휎휇² is the variance of the a population means (휇1, 휇2, … , 휇푎).

An estimate of 휂² can be computed using any of the following three formulas.

휂̂² = 1 – SSE/SST = SSA/(SSA + SSE) = SSA/SST

The value of 휂² can range from 0 to 1 (because SSE has a possible range of 0 to SST) and describes the proportion of the response variable variance in the population that is predictable from the between-group factor. In designs with many groups, 휂² is a useful alternative to an examination of all possible pairwise comparisons.
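The SS decomposition and the three equivalent formulas for 휂̂² can be verified numerically. A sketch in plain Python with small hypothetical raw scores for a = 3 groups:

```python
# Hypothetical raw scores for a = 3 groups.
groups = [
    [22, 25, 27, 24, 26],
    [20, 23, 21, 24, 22],
    [30, 33, 31, 29, 32],
]

n = sum(len(g) for g in groups)
grand_mean = sum(sum(g) for g in groups) / n
group_means = [sum(g) / len(g) for g in groups]

# One-way ANOVA sums of squares.
ssa = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
sse = sum((y - m) ** 2 for g, m in zip(groups, group_means) for y in g)
sst = sum((y - grand_mean) ** 2 for g in groups for y in g)

a = len(groups)
msa = ssa / (a - 1)
mse = sse / (n - a)
f_stat = msa / mse
eta_sq = ssa / sst        # equivalently 1 - SSE/SST or SSA/(SSA + SSE)
print(round(f_stat, 2), round(eta_sq, 3))
```

The identity SST = SSA + SSE holds exactly, so all three 휂̂² formulas agree.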

The estimate of 휂² contains error of unknown magnitude and direction and therefore a confidence interval for 휂² should be reported along with 휂̂². In applications where the goal of the study is to show that all a population means have similar values, a small upper confidence interval limit for 휂² would provide the necessary evidence to make such a claim. The confidence interval for 휂² is complicated but can be obtained in SAS or R.

Example 3.3. Sixty undergraduates were randomly selected from a study population of 4,350 college students and then classified into three groups according to their political affiliation (Democrat, Republican, Independent). A measure of stereotyping was given to all 60 participants. A one-way ANOVA detected differences in the three population means (F(2, 57) = 5.02, p = .010, 휂̂² = .15, 95% CI = [.01, .30]). The researcher can be 95% confident that 1% to 30% of the variance in the stereotyping scores of the 4,350 college students can be predicted from knowledge of their political affiliation.

The F statistic from the ANOVA table is traditionally used to test the null hypothesis H0: 휇1 = 휇2 = … = 휇푎 against the alternative hypothesis that at least one pair of population means is not equal. This type of hypothesis test is referred to as an omnibus test. The null and alternative hypotheses also can be expressed as H0: 휂² = 0 and H1: 휂² > 0. Statistical packages will compute the p-value for the F statistic, which can be used to decide if H0 can be rejected. The use of the F statistic to test H0 is often referred to as an F test.


It is common practice to declare the ANOVA result to be “significant” when the p-value is less than .05, but it is important to remember that a significant result simply indicates a rejection of H0. The rejection of H0: 휂² = 0 is not a scientifically important finding because H0: 휂² = 0 is known to be false in almost every study. Furthermore, a "nonsignificant" result should not be interpreted as evidence that H0: 휂² = 0 is true. Some researchers will conduct a preliminary test of H0: 휂² = 0, and only if the result is "significant" will they proceed with tests or confidence intervals for pairwise comparisons or linear contrasts. However, this preliminary test approach is not required or recommended when using simultaneous confidence intervals or the Holm test.

The directional two-sided test, equivalence test, and noninferiority test do not have the same weakness as the test of H0: 휂² = 0 because these tests provide useful information about the direction or magnitude of an effect. In comparison, rejecting H0: 휂² = 0 in a one-factor design does not reveal anything about how the population means are ordered or about the magnitudes of the population mean differences. In one-factor studies where the test of H0: 휂² = 0 is "significant", a common mistake is to assume that the order of the population means corresponds to the order of the sample means.

Two-Factor Experiments

Human behavior is complex and is influenced in many different ways. In a one-factor experiment, the researcher is able to assess the causal effect of only one independent variable on the response variable. The effect of two independent variables on the response variable can be assessed in a two-factor experiment. The two factors will be referred to as Factor A and Factor B. The simplest type of two-factor experiment has two levels of Factor A and two levels of Factor B. We call this a 2 × 2 factorial experiment. If Factor A had 4 levels and Factor B had 3 levels, it would be called a 4 × 3 factorial experiment. In general, an a × b factorial experiment has a levels of Factor A and b levels of Factor B.

There are two types of two-factor between-subjects experiments. In one case, both factors are between-subjects treatment factors and participants are randomly assigned to the combinations of treatment conditions. In the other case, one factor is a treatment factor and the other is a classification factor. A classification factor is a factor with levels to which participants are classified according to some existing characteristic such as gender, ethnicity, or political affiliation. In a two-factor experiment with one treatment factor and one classification factor, participants are randomly assigned to the treatment conditions within each level of the

classification factor. A study could have two classification factors, but then it would be a nonexperimental design.

Example 3.4. An experiment with two treatment factors takes randomly sampled Coast Guard personnel and randomizes them to one of four treatment conditions: 24 hours of sleep deprivation and 15 hours without food; 36 hours of sleep deprivation and 15 hours without food; 24 hours of sleep deprivation and 30 hours without food; and 36 hours of sleep deprivation and 30 hours without food. One treatment factor is hours of sleep deprivation (24 or 36 hours) and the other treatment factor is hours of food deprivation (15 or 30 hours). The response variable is the score on a complex problem-solving task.

Example 3.5. An experiment with one classification factor and one treatment factor uses a random sample of men and a random sample of women from a volunteer list of students taking introductory chemistry. The samples of men and women are each randomized into two groups, with one group receiving 4 hours of chemistry review and the other group receiving 6 hours of chemistry review. The treatment factor is the amount of review (4 or 6 hours) and the classification factor is gender. The response variable is the score on the final comprehensive exam.

One advantage of a two-factor experiment is that the effects of both Factor A and Factor B can be assessed in a single study. Questions about the effects of Factor A and Factor B could be answered using two separate one-factor experiments. However, two one-factor experiments would require at least twice the total number of participants to obtain confidence intervals with the same precision or hypothesis tests with the same power that could be obtained from a single two-factor experiment. Thus, a single two-factor experiment is more economical than two one-factor experiments.

A two-factor experiment also can provide information that cannot be obtained from two one-factor experiments. Specifically, a two-factor experiment can provide unique information about the interaction effect between Factor A and Factor B. An interaction effect occurs when the effect of Factor A is not the same across the levels of Factor B (which is equivalent to saying that the effect of Factor B is not the same across the levels of Factor A).

The inclusion of a second factor can improve the external validity of an experiment. For example, if there is a concern that participants might perform a particular task differently in the morning than in the afternoon, then time of day (e.g., morning vs. afternoon) could serve as a second 2-level factor in the experiment. If the interaction effect between Factor A and the time-of-day factor (Factor B) is small, then the effect of Factor A would generalize to both morning and afternoon testing conditions, thus increasing the external validity of the results for Factor A.


The external validity of an experiment also can be improved by including a classification factor. In stratified random sampling, random samples are taken from two or more different study populations that differ geographically or in other demographic characteristics. If the interaction between the classification factor and the treatment factor is small, then the effect of the treatment factor can be generalized to the multiple study populations, thereby increasing the external validity of the results for the treatment factor.

The inclusion of a classification factor also can reduce error variance (MSE), which will in turn increase the power of statistical tests and reduce the widths of confidence intervals. For example, in a one-factor experiment with male and female subjects, if women tend to score higher than men, then this will increase the error variance (the variance of scores within treatments). If gender is added as a classification factor, the error variance will then be determined by the variability of scores within each treatment and within each gender, which will result in a smaller MSE.

Consider the special case of a 2 × 2 design. The population means for a 2 × 2 design are shown below.

              Factor B

            b1      b2

        a1  휇11     휇12
Factor A
        a2  휇21     휇22

The main effects of Factor A and Factor B and the AB interaction effect are given below.

A: (휇11 + 휇12)/2 – (휇21+ 휇22)/2

B: (휇11 + 휇21)/2 – (휇12+ 휇22)/2

AB: (휇11 − 휇12) – (휇21 − 휇22) = (휇11 − 휇21) – (휇12 − 휇22) = 휇11 − 휇21 – 휇12 + 휇22

The simple main effects of A and B are given below.

A at b1: 휇11 − 휇21 B at 푎1: 휇11 − 휇12

A at b2: 휇12 − 휇22 B at 푎2: 휇21 − 휇22

The interaction effect can be expressed as a difference in simple main effects, specifically (휇11 − 휇12) – (휇21 − 휇22) = (B at 푎1) – (B at 푎2), or equivalently,


(휇11 − 휇21) – (휇12 − 휇22) = (A at b1) – (A at b2). The main effects can be expressed as averages of simple main effects. The main effect of A is (A at b1 + A at b2)/2 = (휇11 − 휇21 + 휇12 − 휇22)/2 = (휇11 + 휇12)/2 – (휇21 + 휇22)/2. The main effect of B is (B at 푎1 + B at 푎2)/2 = (휇11 − 휇12 + 휇21 − 휇22)/2 = (휇11 + 휇21)/2 – (휇12 + 휇22)/2. All of the above effects are special cases of a linear contrast of means, and confidence intervals for these effects can be obtained using Equation 3.1.
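The identities above (main effects as averages of simple main effects, and the interaction as their difference) can be checked with any 2 × 2 table of means. A plain-Python sketch with hypothetical values:

```python
# Hypothetical 2 x 2 table of population means; key (j, k) is cell (aj, bk).
mu = {(1, 1): 10.0, (1, 2): 14.0, (2, 1): 7.0, (2, 2): 13.0}

# Simple main effects.
A_at_b1 = mu[1, 1] - mu[2, 1]
A_at_b2 = mu[1, 2] - mu[2, 2]
B_at_a1 = mu[1, 1] - mu[1, 2]
B_at_a2 = mu[2, 1] - mu[2, 2]

# Main effects and the AB interaction, from their definitions.
main_A = (mu[1, 1] + mu[1, 2]) / 2 - (mu[2, 1] + mu[2, 2]) / 2
main_B = (mu[1, 1] + mu[2, 1]) / 2 - (mu[1, 2] + mu[2, 2]) / 2
inter_AB = mu[1, 1] - mu[1, 2] - mu[2, 1] + mu[2, 2]

# Main effects are averages of simple main effects; the interaction is their difference.
assert main_A == (A_at_b1 + A_at_b2) / 2
assert main_B == (B_at_a1 + B_at_a2) / 2
assert inter_AB == B_at_a1 - B_at_a2 == A_at_b1 - A_at_b2
print(main_A, main_B, inter_AB)
```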

The main effect of A (which is the average of A at b1 and A at b2) could be misleading because A at b1 and A at b2 will be highly dissimilar if the AB interaction is large.

Likewise, the main effect of B (which is the average of B at 푎1 and B at 푎2) could be misleading if the AB interaction is large because B at 푎1 and B at 푎2 will be highly dissimilar. If the AB interaction effect is large, then an analysis of simple main effects will be more meaningful than an analysis of main effects. If the AB interaction is small, then an analysis of the main effects of Factor A and Factor B will not be misleading and an analysis of simple main effects will be unnecessary.

Pairwise Comparisons in Two-factor Designs

In experiments where Factor A or Factor B has more than two levels, various pairwise comparisons can be examined. Consider a 2 × 3 design where the main effects of Factor B are of interest. The population means are given below.

Factor B

            b1      b2      b3

        푎1  휇11     휇12     휇13
Factor A
        푎2  휇21     휇22     휇23

The following three pairwise main effects can be defined for Factor B

B12: (휇11 + 휇21)/2 – (휇12 + 휇22)/2

B13: (휇11 + 휇21)/2 – (휇13 + 휇23)/2

B23: (휇12 + 휇22)/2 – (휇13 + 휇23)/2

where the subscripts of B represent the levels of the factor being compared.

If one or both factors have more than two levels, then more than one interaction effect can be examined. An interaction effect can be defined for any two levels of Factor A and any two levels of Factor B. For example, in the 2 × 3 design described above, the following three pairwise interaction effects can be defined


A12B12: 휇11 − 휇12 − 휇21 + 휇22

A12B13: 휇11 − 휇13 − 휇21 + 휇23

A12B23: 휇12 − 휇13 − 휇22 + 휇23

where the subscripts of AB represent the levels of Factor A and Factor B being compared. The number of pairwise interaction effects can be overwhelming in larger designs. For example, in a 4 × 3 design, there are six pairs of Factor A levels and three pairs of Factor B levels from which 6 × 3 = 18 pairwise interaction effects could be examined. Pairwise interaction effects are typically examined in designs where the number of levels of each factor is small.

If an AB interaction has been detected, then the simple main effects of Factor A or the simple main effects of Factor B provide useful information. Suppose the simple main effects of Factor B are to be examined and Factor B has more than two levels. In this situation, pairwise simple main effects can be examined. In the 2 × 3 design described above, Factor B has three levels and the pairwise simple main effects of Factor B are

B12 at 푎1: 휇11 − 휇12 B12 at 푎2: 휇21 − 휇22

B13 at 푎1: 휇11 − 휇13 B13 at 푎2: 휇21 − 휇23

B23 at 푎1: 휇12 − 휇13 B23 at 푎2: 휇22 − 휇23

Note that all of the above pairwise comparisons are linear contrasts of the ab population means and can be expressed as ∑푐푗휇푗 with the sum taken over all ab groups. For example, the contrast coefficients that define the B12 pairwise main effect of Factor B (assuming the means in the 2 × 3 table are ordered left to right and then top to bottom) are 1/2, -1/2, 0, 1/2, -1/2, 0; the contrast coefficients that define the A12B12 pairwise interaction effect are 1, -1, 0, -1, 1, 0; and the contrast coefficients that define the pairwise simple main effect B12 at 푎1 are 1, -1, 0, 0, 0, 0.
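These coefficient vectors can be checked against the effect definitions. A plain-Python sketch with hypothetical means for the 2 × 3 design, ordered left to right and then top to bottom as above:

```python
# Hypothetical means (mu11, mu12, mu13, mu21, mu22, mu23).
mu = [10.0, 12.0, 9.0, 14.0, 11.0, 13.0]

def contrast(c, means):
    """Value of the linear contrast sum(c_j * mu_j)."""
    return sum(cj * m for cj, m in zip(c, means))

c_B12_main  = [0.5, -0.5, 0, 0.5, -0.5, 0]     # B12 pairwise main effect
c_A12B12    = [1, -1, 0, -1, 1, 0]             # A12B12 pairwise interaction
c_B12_at_a1 = [1, -1, 0, 0, 0, 0]              # B12 simple main effect at a1

# Each coefficient vector reproduces the corresponding effect definition.
assert contrast(c_B12_main, mu) == (mu[0] + mu[3]) / 2 - (mu[1] + mu[4]) / 2
assert contrast(c_A12B12, mu) == (mu[0] - mu[1]) - (mu[3] - mu[4])
assert contrast(c_B12_at_a1, mu) == mu[0] - mu[1]
print(contrast(c_B12_main, mu))
```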

Two-Way Analysis of Variance

Now consider a general a × b factorial design. The variability of the response variable scores in a two-factor design can be decomposed into four sources of variability: the variance due to differences in means across the levels of Factor A, the variance due to differences in means across the levels of Factor B, the variance due to differences in simple main effects of one factor across the levels of the other factor (the AB interaction), and the variance of scores within treatments (the error

variance). The decomposition of the total variance in a two-factor design can be summarized in the following two-way analysis of variance (two-way ANOVA) table where n is the total sample size.

Source    SS      df                      MS                  F
________________________________________________________________
A         SSA     dfA = a – 1             MSA = SSA/dfA       MSA/MSE

B         SSB     dfB = b – 1             MSB = SSB/dfB       MSB/MSE

AB        SSAB    dfAB = (a – 1)(b – 1)   MSAB = SSAB/dfAB    MSAB/MSE

ERROR     SSE     dfE = n – ab            MSE = SSE/dfE

TOTAL     SST     dfT = n – 1
________________________________________________________________

The TOTAL and ERROR sum of squares (SS) formulas in a two-way ANOVA, shown below, are conceptually similar to the one-way ANOVA formulas

SST = ∑푘∑푗∑푖(푦푖푗푘 − 휇̂++)²

SSE = ∑푘∑푗∑푖(푦푖푗푘 − 휇̂푗푘)²

where 휇̂++ = ∑푘∑푗∑푖 푦푖푗푘 / ∑푘∑푗 푛푗푘 and the sums run over k = 1 to b, j = 1 to a, and i = 1 to 푛푗푘. The formulas for SSA, SSB, and SSAB are complicated unless the sample sizes are equal. If all sample sizes are equal to 푛0, the formulas for SSA, SSB, and SSAB are

SSA = 푏푛0 ∑푗(휇̂푗+ − 휇̂++)²  where 휇̂푗+ = ∑푘∑푖 푦푖푗푘 / 푏푛0

SSB = 푎푛0 ∑푘(휇̂+푘 − 휇̂++)²  where 휇̂+푘 = ∑푗∑푖 푦푖푗푘 / 푎푛0

SSAB = SST – SSE – SSA – SSB.

Partial eta-squared estimates are computed from the sum of squares estimates, as shown below.

휂̂퐴² = SSA/(SST – SSB – SSAB) = SSA/(SSA + SSE)
휂̂퐵² = SSB/(SST – SSA – SSAB) = SSB/(SSB + SSE)
휂̂퐴퐵² = SSAB/(SST – SSA – SSB) = SSAB/(SSAB + SSE)

These measures are called “partial” effect sizes because variability in the response variable due to the effects of other factors is removed. For example, SSB and SSAB are subtracted from SST to obtain 휂̂퐴². The method of computing a confidence interval for a population partial eta-squared parameter is complicated but can be

obtained in SAS or R. In designs where a factor has many levels, a partial eta-squared estimate is a simple alternative to reporting all possible pairwise comparisons among the factor levels.
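For equal cell sizes, the two-way SS formulas and the partial eta-squared estimates can be computed directly. A plain-Python sketch with hypothetical balanced 2 × 2 data (푛0 = 3 scores per cell):

```python
# Hypothetical balanced 2 x 2 data; key (j, k) holds the scores in cell (aj, bk).
y = {
    (1, 1): [10, 12, 11], (1, 2): [14, 15, 13],
    (2, 1): [7, 9, 8],    (2, 2): [12, 14, 13],
}
a, b, n0 = 2, 2, 3
n = a * b * n0

cell_mean = {jk: sum(v) / n0 for jk, v in y.items()}
grand = sum(sum(v) for v in y.values()) / n
row_mean = {j: sum(cell_mean[j, k] for k in (1, 2)) / b for j in (1, 2)}
col_mean = {k: sum(cell_mean[j, k] for j in (1, 2)) / a for k in (1, 2)}

# Sums of squares for the balanced two-way design.
sst = sum((x - grand) ** 2 for v in y.values() for x in v)
sse = sum((x - cell_mean[jk]) ** 2 for jk, v in y.items() for x in v)
ssa = b * n0 * sum((row_mean[j] - grand) ** 2 for j in (1, 2))
ssb = a * n0 * sum((col_mean[k] - grand) ** 2 for k in (1, 2))
ssab = sst - sse - ssa - ssb

# Partial eta-squared estimates.
partial_eta_A  = ssa / (ssa + sse)
partial_eta_B  = ssb / (ssb + sse)
partial_eta_AB = ssab / (ssab + sse)
print(round(ssab, 3), round(partial_eta_A, 3))
```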

The F statistics for the main effect of Factor A, the main effect of Factor B, and the AB interaction effect test the null hypotheses H0: 휂퐴² = 0, H0: 휂퐵² = 0, and H0: 휂퐴퐵² = 0, respectively. Tests of these omnibus null hypotheses suffer from the same problem as the test of the null hypothesis in a one-way ANOVA. Specifically, a “significant” result does not imply a scientifically important result, and a “nonsignificant” result does not imply that the effect is zero. The new APA guidelines recommend supplementing the F statistics and p-values for each effect with confidence intervals for population eta-squared values, standardized linear contrasts of population means, or unstandardized linear contrasts of population means.

Although a “nonsignificant” (i.e., inconclusive) test for the AB interaction effect does not imply that the population interaction effect is zero, it is customary to examine main effects rather than simple main effects if the AB interaction test is inconclusive. If the test for the AB interaction effect is “significant”, it is customary to analyze only simple main effects or pairwise simple main effects. However, a main effect could be interesting, even if the AB interaction effect is “significant”, if the partial eta-squared estimate for the main effect is substantially larger than 휂̂퐴퐵².

Three-factor Experiments

The effects of three independent variables on the response variable can be assessed in a three-factor design. The three factors will be referred to as Factor A, Factor B, and Factor C. Like a two-factor design, a three-factor design provides information about main effects and two-way interaction effects. Specifically, the main effects of Factors A, B, and C can be estimated as well as the AB, AC, and BC two-way interactions. These main effects and two-way interaction effects could be estimated from three separate two-factor studies. A three-factor study has the advantage of providing all this information in a single study and also provides information about a three-way interaction (ABC) that could not be obtained from separate two-factor studies. The factors in a three-factor design can be treatment factors or classification factors. If all factors are classification factors, then the study would be a nonexperimental design.

The simplest type of three-factor study has two levels of each factor and is called a 2 × 2 × 2 factorial design. In general, a × b × c factorial designs have a levels of Factor A, b levels of Factor B, and c levels of Factor C. A table of population means is shown below for a 2 × 2 × 2 factorial design.


                        Factor C
                 c1                  c2
              Factor B            Factor B
             b1       b2         b1       b2
Factor A  푎1  휇111     휇121       휇112     휇122
          푎2  휇211     휇221       휇212     휇222

The main effects of Factors A, B, and C are defined as

A: (휇111 + 휇121 + 휇112 + 휇122)/4 – (휇211 + 휇221 + 휇212 + 휇222)/4
B: (휇111 + 휇211 + 휇112 + 휇212)/4 – (휇121 + 휇221 + 휇122 + 휇222)/4
C: (휇111 + 휇211 + 휇121 + 휇221)/4 – (휇112 + 휇212 + 휇122 + 휇222)/4,

the three two-way interaction effects are defined as

AB: (휇111 + 휇112)/2 – (휇121 + 휇122)/2 – (휇211 + 휇212)/2 + (휇221 + 휇222)/2
AC: (휇111 + 휇121)/2 – (휇112 + 휇122)/2 – (휇211 + 휇221)/2 + (휇212 + 휇222)/2
BC: (휇111 + 휇211)/2 – (휇112 + 휇212)/2 – (휇121 + 휇221)/2 + (휇122 + 휇222)/2,

and the three-way interaction effect is defined as

ABC: 휇111 − 휇121 − 휇211 + 휇221 − 휇112 + 휇122 + 휇212 − 휇222.

The simple main effects of Factors A, B, and C are given below.

A at b1: (휇111 + 휇112)/2 – (휇211 + 휇212)/2
A at b2: (휇121 + 휇122)/2 – (휇221 + 휇222)/2
A at c1: (휇111 + 휇121)/2 – (휇211 + 휇221)/2
A at c2: (휇112 + 휇122)/2 – (휇212 + 휇222)/2

B at a1: (휇111 + 휇112)/2 – (휇121 + 휇122)/2
B at a2: (휇211 + 휇212)/2 – (휇221 + 휇222)/2
B at c1: (휇111 + 휇211)/2 – (휇121 + 휇221)/2
B at c2: (휇112 + 휇212)/2 – (휇122 + 휇222)/2

C at a1: (휇111 + 휇121)/2 – (휇112 + 휇122)/2
C at a2: (휇211 + 휇221)/2 – (휇212 + 휇222)/2
C at b1: (휇111 + 휇211)/2 – (휇112 + 휇212)/2
C at b2: (휇121 + 휇221)/2 – (휇122 + 휇222)/2


The simple-simple main effects of Factors A, B, and C are defined as

A at b1c1: 휇111 − 휇211
A at b1c2: 휇112 − 휇212
A at b2c1: 휇121 − 휇221
A at b2c2: 휇122 − 휇222

B at a1c1: 휇111 − 휇121
B at a1c2: 휇112 − 휇122
B at a2c1: 휇211 − 휇221
B at a2c2: 휇212 − 휇222

C at a1b1: 휇111 − 휇112
C at a1b2: 휇121 − 휇122
C at a2b1: 휇211 − 휇212
C at a2b2: 휇221 − 휇222,

and the simple two-way interaction effects are defined as

AB at c1: 휇111 − 휇121 − 휇211 + 휇221
AB at c2: 휇112 − 휇122 − 휇212 + 휇222
AC at b1: 휇111 − 휇211 − 휇112 + 휇212
AC at b2: 휇121 − 휇221 − 휇122 + 휇222
BC at a1: 휇111 − 휇121 − 휇112 + 휇122
BC at a2: 휇211 − 휇221 − 휇212 + 휇222.

The ABC interaction in a 2 × 2 × 2 design can be conceptualized as a difference in simple two-way interaction effects. Specifically, the ABC interaction is the difference between AB at 푐1 and AB at 푐2, the difference between AC at 푏1 and AC at 푏2, or the difference between BC at 푎1 and BC at 푎2. Although the meaning of a three-way interaction is not easy to grasp, its meaning becomes clearer when it is viewed as the difference in simple two-way interaction effects with each simple two-way interaction viewed as a difference in simple-simple main effects. (Note that 푐1 and 푐2 are used in this section to represent levels of Factor C and should not be confused with the previous use of 푐푗 to represent contrast coefficients).
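The equivalence described above can be checked numerically. In this sketch, the 2 × 2 × 2 cell means are made-up values; all three differences of simple two-way interactions reproduce the same ABC interaction:

```python
# Compute the ABC interaction of a 2x2x2 design three ways, as a difference
# of simple two-way interactions, and confirm they agree.

mu = {  # mu[(a, b, c)] = made-up population cell mean
    (1, 1, 1): 10, (1, 2, 1): 14, (1, 1, 2): 12, (1, 2, 2): 13,
    (2, 1, 1): 8,  (2, 2, 1): 9,  (2, 1, 2): 7,  (2, 2, 2): 11,
}

def ab_at(c):   # simple AB interaction at level c of Factor C
    return mu[1, 1, c] - mu[1, 2, c] - mu[2, 1, c] + mu[2, 2, c]

def ac_at(b):   # simple AC interaction at level b of Factor B
    return mu[1, b, 1] - mu[1, b, 2] - mu[2, b, 1] + mu[2, b, 2]

def bc_at(a):   # simple BC interaction at level a of Factor A
    return mu[a, 1, 1] - mu[a, 1, 2] - mu[a, 2, 1] + mu[a, 2, 2]

abc_1 = ab_at(1) - ab_at(2)   # AB at c1 minus AB at c2
abc_2 = ac_at(1) - ac_at(2)   # AC at b1 minus AC at b2
abc_3 = bc_at(1) - bc_at(2)   # BC at a1 minus BC at a2
assert abc_1 == abc_2 == abc_3
print(abc_1)  # -6 for these made-up means
```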

The two-way interaction effects in a three-factor design are conceptually the same as in a two-factor design. Two-way interactions in a three-factor design are defined by collapsing the three-dimensional table of population means to create a two- dimensional table of means with cell means that have been averaged over the collapsed dimension. For example, a table of averaged population means after collapsing Factor C gives the following 2 × 2 table from which the AB interaction can be defined in terms of the averaged population means.

                    Factor B
                 b1                  b2
Factor A  a1  (휇111 + 휇112)/2    (휇121 + 휇122)/2
          a2  (휇211 + 휇212)/2    (휇221 + 휇222)/2

Three-Way Analysis of Variance

The variability of the response variable scores in a three-factor design can be decomposed into eight sources of variability – three main effects, three two-way interactions, one three-way interaction, and the within-group error variance. The decomposition of the total variance in a three-factor design can be summarized in the following three-way analysis of variance (three-way ANOVA) table where n is the total sample size.

Source SS df MS F ______

A SSA dfA = a – 1 MSA = SSA/dfA MSA/MSE

B SSB dfB = b – 1 MSB = SSB/dfB MSB/MSE

C SSC dfC = c – 1 MSC = SSC/dfC MSC/MSE

AB SSAB dfAB = dfAdfB MSAB = SSAB/dfAB MSAB/MSE

AC SSAC dfAC = dfAdfC MSAC = SSAC/dfAC MSAC/MSE

BC SSBC dfBC = dfBdfC MSBC = SSBC/dfBC MSBC/MSE

ABC SSABC dfABC = dfAdfBdfC MSABC = SSABC/dfABC MSABC/MSE

ERROR SSE dfE = n – abc MSE = SSE/dfE

TOTAL SST dfT = n – 1 ______

The SS formulas for a three-way ANOVA are conceptually similar to those for the two-way ANOVA and will not be presented. Partial eta-squared estimates are computed from the SS estimates in a three-way ANOVA in the same way they are computed in a two-way ANOVA. For example, 휂̂퐴² = SSA/(SSA + SSE) and 휂̂퐴퐵퐶² = SSABC/(SSABC + SSE).

The seven omnibus F tests in the three-way ANOVA suffer from the same problem as the omnibus F tests in the one-way and two-way ANOVA. These tests should be supplemented with confidence intervals for population eta-squared values, linear contrasts of population means, or standardized linear contrasts of population means to provide information regarding the magnitude of each effect.

If an ABC interaction has been detected in a three-way ANOVA, simple two-way interactions or simple-simple main effects should be examined. A two-way interaction could be examined even if an ABC interaction is “significant” if the partial eta-squared estimate for a two-way interaction is substantially larger than the partial eta-squared estimate for the ABC interaction.

If the test for an ABC interaction is inconclusive, the AB, AC, and BC interactions should be examined. Using Factor A as an example, if AB and AC interactions are detected, then simple-simple main effects of A should be examined because Factor A interacts with both Factor B and Factor C. If an AB interaction is detected, but the test for the AC interaction is inconclusive, then the simple main effects of A should be examined at each level of Factor B. Similarly, if an AC interaction is detected, but the test of an AB interaction is inconclusive, then the simple main effects of A should be examined at each level of Factor C. Even if AB and AC interactions have been detected, the main effect of A could be examined if the partial eta-squared estimate for the main effect of A is substantially larger than the partial eta-squared estimates for the AB and AC interactions.

If the tests for the ABC, AB, AC, and BC interactions are all inconclusive, then all three main effects should be examined. An analysis of main effects can be justified even if interactions are “significant” if the partial eta-squared estimates for the main effects are substantially larger than the partial eta-squared estimates for the interaction effects.

Assumptions

In addition to the random sampling and independence assumptions, the ANOVA tests, the equal-variance Tukey-Kramer confidence intervals for pairwise comparisons, the equal-variance confidence interval for ∑푐푗휇푗, the equal-variance confidence interval for 훿, and the confidence interval for 휂² all assume equality of population variances across treatment conditions and approximate normality of the population response variable scores within each level of the independent variable. The effects of violating these assumptions are similar to those for the equal-variance confidence interval for 휇1 − 휇2 described in Module 2.

The Games-Howell and unequal-variance Tukey-Kramer methods for pairwise comparisons, and the unequal-variance confidence intervals for ∑푐푗휇푗 and 휑, relax the equal population variance assumption and are preferred to the equal-variance methods unless the sample sizes are approximately equal and there is compelling prior information to suggest that the population variances are not highly dissimilar across treatment conditions. The Welch test is an alternative to the one-way ANOVA test that relaxes the equal variance assumption and can be obtained in SAS, SPSS, and R.

The adverse effects of violating the normality assumption on the F tests and the confidence intervals for ∑푐푗휇푗 are usually not serious unless the response variable is highly skewed and the sample size per group is small (푛푗 < 20). However, leptokurtosis of the response variable is detrimental to the performance of the confidence intervals for 휂² and 휑. Furthermore, the adverse effects of leptokurtosis on these confidence intervals are not diminished in large sample sizes. Data transformations are sometimes helpful in reducing leptokurtosis in distributions that are also skewed. A data transformation could render ∑푐푗휇푗 uninterpretable, but 휂² and 휑, which are unitless measures of effect size, will remain interpretable.


To informally assess the degree of non-normality in a design with a ≥ 2 groups, subtract 휇̂푗 from all of the group j scores and then estimate the skewness and kurtosis coefficients from these 푛1 + 푛2 + ⋯ + 푛푎 deviation scores. If the deviation scores are skewed, it may be possible to reduce the skewness by transforming (e.g., log, square-root, reciprocal) the response variable scores.
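A minimal sketch of this informal check, using made-up scores for three groups. The `moments` helper computes ordinary moment-based skewness and excess kurtosis, not the bias-corrected versions some packages report:

```python
# Subtract each group's mean from its scores, pool the deviation scores,
# and estimate skewness and excess kurtosis from the pooled deviations.

def moments(x):
    n = len(x)
    m = sum(x) / n
    s2 = sum((v - m) ** 2 for v in x) / n
    skew = (sum((v - m) ** 3 for v in x) / n) / s2 ** 1.5
    kurt = (sum((v - m) ** 4 for v in x) / n) / s2 ** 2 - 3.0  # excess kurtosis
    return skew, kurt

# Made-up scores for a = 3 groups.
groups = [[4.0, 5.0, 7.0, 12.0], [2.0, 3.0, 4.0, 9.0], [6.0, 7.0, 8.0, 15.0]]

# Deviation scores: subtract the estimated group mean from each score.
deviations = [v - sum(g) / len(g) for g in groups for v in g]

skew, kurt = moments(deviations)
print(round(skew, 2), round(kurt, 2))  # positive skew suggests a transformation
```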

Distribution-free Methods

If the response variable is skewed, a confidence interval for a linear contrast of population medians may be more appropriate and meaningful than a confidence interval for a linear contrast of population means. An approximate 100(1 − 훼)% confidence interval for ∑푐푗휏푗 is

∑푐푗휏̂푗 ± 푧훼/2√(∑푐푗²푆퐸휏̂푗²)                                     (3.3)

where 푆퐸휏̂푗² was defined in Equation 1.10 of Module 1. This confidence interval only assumes random sampling and independence among participants. Equation 3.3 can be used to test H0: ∑푐푗휏푗 = 0 and if H0 is rejected, Equation 3.3 can be used to decide if ∑푐푗휏푗 > 0 or ∑푐푗휏푗 < 0. Equation 3.3 also can be used to test H0: |∑푐푗휏푗| ≤ 푏 against H1: |∑푐푗휏푗| > 푏.
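Equation 3.3 can be sketched as a small function. The median standard errors come from Equation 1.10 of Module 1 (not reproduced here), so they are treated as user-supplied inputs; the medians, SEs, and coefficients below are made-up values:

```python
# Approximate CI for a linear contrast of medians (Equation 3.3).
from math import sqrt

def median_contrast_ci(medians, ses, coefs, z=1.96):
    """CI for sum(c_j * tau_j) given estimated medians and their SEs."""
    est = sum(c * m for c, m in zip(coefs, medians))
    se = sqrt(sum((c * s) ** 2 for c, s in zip(coefs, ses)))
    return est - z * se, est + z * se

# Made-up medians and standard errors for a = 3 groups.
lo, hi = median_contrast_ci(medians=[30.0, 25.0, 22.0],
                            ses=[1.2, 1.5, 1.0],
                            coefs=[0.5, 0.5, -1.0])
print(round(lo, 2), round(hi, 2))
```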

The Kruskal-Wallis test is a distribution-free test of the null hypothesis that the response variable distribution is identical (same location, variance, and shape) in all a treatment conditions (or all a subpopulations in a nonexperimental design). A rejection of the null hypothesis implies differences in the location, variance, or shape of the response variable distribution in at least two of the treatment conditions or subpopulations.

The Kruskal-Wallis test is used as a distribution-free alternative to the F test in the one-way ANOVA and suffers from the same problem as the F test because the null hypothesis is known to be false in virtually every study. In designs with more than two groups, useful information can be obtained by performing multiple Mann-Whitney tests for some or all pairwise comparisons using the Holm procedure. Simultaneous confidence intervals for pairwise differences or ratios of medians, the Mann-Whitney parameter (휋) for pairwise comparisons, or linear contrasts of medians are informative alternatives to the Kruskal-Wallis test. Some researchers use the Kruskal-Wallis test as a screening test to determine if multiple Mann-Whitney tests or simultaneous confidence intervals are necessary.


Sample Size Requirements for Desired Precision

The sample size requirement per group to estimate a linear contrast of a population means with desired confidence and precision is approximately

푛푗 = 4휎̃²(∑푐푗²)(푧훼/2/푤)² + 푧훼/2²/(2푚)                            (3.4)

where 휎̃² is a planning value of the average within-group variance, w is the desired confidence interval width, and m is the number of non-zero 푐푗 values. Note that Equation 3.4 reduces to Equation 2.5 for the special case of comparing two means. Equation 3.4 also can be used for factorial designs where a is the total number of treatment combinations. The MSE from previous research is often used as a planning value for the average within-group variance.

Example 3.7. A researcher wants to estimate (휇11 + 휇12)/2 – (휇21 + 휇22)/2 in a 2 × 2 factorial experiment with 95% confidence, a desired confidence interval width of 3.0, and a planning value of 8.0 for the average within-group error variance. The contrast coefficients are 1/2, 1/2, -1/2, and -1/2. The sample size requirement per group is approximately 푛푗 = 4(8.0)(1/4 + 1/4 + 1/4 + 1/4)(1.96/3.0)² + 0.48 = 14.2 ≈ 15.
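A sketch of Equation 3.4 that reproduces Example 3.7:

```python
# Per-group sample size for estimating a linear contrast of means (Eq 3.4).
from math import ceil

def n_per_group_contrast(var, coefs, width, z=1.96):
    """Per-group n for a CI on sum(c_j * mu_j) with desired width."""
    m = sum(1 for c in coefs if c != 0)       # number of non-zero coefficients
    sum_c2 = sum(c ** 2 for c in coefs)
    n = 4 * var * sum_c2 * (z / width) ** 2 + z ** 2 / (2 * m)
    return ceil(n)

# Example 3.7: var = 8.0, width = 3.0, coefs = (.5, .5, -.5, -.5)
n = n_per_group_contrast(8.0, [0.5, 0.5, -0.5, -0.5], 3.0)
print(n)  # 15, matching the text
```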

The sample size requirement per group to estimate a standardized linear contrast of a population means (휑) with desired confidence and precision is approximately

푛푗 = [2휑̃²/푎 + 4(∑푐푗²)](푧훼/2/푤)²                                  (3.5)

where 휑̃ is a planning value of 휑. Note that this sample size formula reduces to Equation 2.6 in Module 2 for the special case of a standardized mean difference.

Example 3.8. A researcher wants to estimate 휑 in a one-factor experiment (a = 3) with 95% confidence, a desired confidence interval width of 0.6, and 휑̃ = 0.8. The contrast coefficients are 1/2, 1/2, and -1. The sample size requirement per group is approximately 푛푗 = [2(0.64)/3 + 4(1/4 + 1/4 + 1)](1.96/0.6)² = 68.6 ≈ 69.
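A sketch of Equation 3.5 that reproduces Example 3.8:

```python
# Per-group sample size for estimating a standardized contrast (Eq 3.5).
from math import ceil

def n_per_group_std_contrast(phi, a, coefs, width, z=1.96):
    """Per-group n for a CI on phi with desired width."""
    sum_c2 = sum(c ** 2 for c in coefs)
    n = (2 * phi ** 2 / a + 4 * sum_c2) * (z / width) ** 2
    return ceil(n)

# Example 3.8: a = 3, phi planning value = 0.8, width = 0.6, coefs = (1/2, 1/2, -1)
n = n_per_group_std_contrast(0.8, 3, [0.5, 0.5, -1.0], 0.6)
print(n)  # 69, matching the text
```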

A simple formula for approximating the sample size needed to obtain a confidence interval for 휂² having a desired width is currently not available. However, if sample data can be obtained in two stages, then the confidence interval width for 휂² obtained in the first-stage sample can be used in Equation 1.12 to approximate the additional number of participants needed in the second-stage sample to achieve the desired confidence interval width.


Example 3.9. A first-stage sample size of 12 participants per group in a one-factor experiment gave a 95% confidence interval for 휂² with a width of 0.51. The researcher would like to obtain a 95% confidence interval for 휂² that has a width of 0.30. To achieve this goal, [(0.51/0.30)² – 1]12 = 22.7 ≈ 23 additional participants per group are needed.
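The two-stage adjustment used in Example 3.9 (the text's Equation 1.12 from Module 1, which is not reproduced here) can be sketched as:

```python
# Additional per-group participants needed in stage two to shrink a CI
# from width0 to desired_width, given n0 participants per group in stage one.
from math import ceil

def second_stage_n(n0, width0, desired_width):
    """Stage-two per-group sample size: [(w0/w)^2 - 1] * n0, rounded up."""
    return ceil(((width0 / desired_width) ** 2 - 1) * n0)

extra = second_stage_n(12, 0.51, 0.30)
print(extra)  # 23, matching the text
```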

Sample Size Requirements for Desired Power

The sample size requirement per group to test H0: ∑푐푗휇푗 = 0 for a specified value of 훼 and with desired power is approximately

푛푗 = 휎̃²(∑푐푗²)(푧훼/2 + 푧훽)²/(∑푐푗휇̃푗)² + 푧훼/2²/(2푚)                  (3.6)

where 휎̃² is the planning value of the average within-group variance, ∑푐푗휇̃푗 is the anticipated effect size value, and m is the number of non-zero 푐푗 values. This sample size formula reduces to Equation 2.7 when the contrast involves the comparison of two means. In applications where ∑푐푗휇̃푗 or 휎̃² is difficult for the researcher to specify, Equation 3.6 can be expressed in terms of a planning value for 휑, as shown below

푛푗 = (∑푐푗²)(푧훼/2 + 푧훽)²/휑̃² + 푧훼/2²/(2푚)                          (3.7)

which simplifies to Equation 2.8 in Module 2 when the contrast involves the comparison of two means.

Example 3.10. A researcher wants to test H0: (휇1 + 휇2 + 휇3 + 휇4)/4 − 휇5 = 0 in a one-factor experiment with power of .90, 훼 = .05, and an anticipated standardized linear contrast value of 0.5. The contrast coefficients are 1/4, 1/4, 1/4, 1/4, and -1. The sample size requirement per group is approximately 푛푗 = 1.25(1.96 + 1.28)²/0.5² + 0.38 = 52.9 ≈ 53.
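A sketch of Equation 3.7 that reproduces Example 3.10:

```python
# Per-group sample size to test a standardized contrast with desired power (Eq 3.7).
from math import ceil

def n_per_group_power(phi, coefs, z_alpha=1.96, z_beta=1.28):
    """Per-group n for alpha = .05 (z = 1.96) and power = .90 (z_beta = 1.28)."""
    m = sum(1 for c in coefs if c != 0)
    sum_c2 = sum(c ** 2 for c in coefs)
    n = sum_c2 * (z_alpha + z_beta) ** 2 / phi ** 2 + z_alpha ** 2 / (2 * m)
    return ceil(n)

# Example 3.10: coefs = (1/4, 1/4, 1/4, 1/4, -1), anticipated phi = 0.5
n = n_per_group_power(0.5, [0.25, 0.25, 0.25, 0.25, -1.0])
print(n)  # 53, matching the text
```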

The sample size requirements for v simultaneous confidence intervals or tests are obtained by replacing 훼 in Equations 3.4 - 3.7 with 훼∗ = 훼/푣. Sampling from a more diverse study population can enhance the scientific value of a study, but then 휎̃² must be set to a larger value, which increases the sample size requirement. Some contrasts will require larger sample sizes because they have larger ∑푐푗² values.

Using Prior Information

Suppose a population mean difference for a particular response variable has been estimated in a previous study and also in a new study. The same independent variable is used in both studies, and the two study populations are assumed to have similar demographic characteristics. The previous study used a random sample to estimate 휇1 – 휇2 from one study population, and the new study used a random sample to estimate 휇3 – 휇4 from another study population. Both 휇1 – 휇2 and 휇3 – 휇4 are assumed to describe the effect of the same independent variable that was used in the two studies. This is a 2 × 2 factorial design with a classification factor where Study 1 and Study 2 are the levels of the classification factor.

If a confidence interval for (휇1 – 휇2) − (휇3 – 휇4) suggests that 휇1 – 휇2 and 휇3 – 휇4 are not too dissimilar, then the researcher may want to compute a confidence interval for (휇1 + 휇3)/2 – (휇2 + 휇4)/2. A confidence interval for (휇1 + 휇3)/2 – (휇2 + 휇4)/2 will have greater external validity and could be substantially narrower than the confidence interval for 휇1 – 휇2 or 휇3 – 휇4.

A 100(1 − 훼)% confidence interval for (휇1 + 휇3)/2 – (휇2 + 휇4)/2 is obtained from Equation 3.1, and if medians have been computed in each study, an approximate 100(1 − 훼)% confidence interval for (휏1 + 휏3)/2 – (휏2 + 휏4)/2 is obtained from Equation 3.3. The contrast coefficients in Equation 3.1 or 3.3 would be 푐1 = .5, 푐2 = −.5, 푐3 = .5, and 푐4 = −.5.

If a standardized mean difference has been estimated in each study and a confidence interval for 훿1 − 훿2 suggests that these two parameter values are not too dissimilar, the researcher may want to compute the following approximate 100(1 − 훼)% confidence interval for (훿1 + 훿2)/2

(훿̂1 + 훿̂2)/2 ± 푧훼/2√((푆퐸훿̂1² + 푆퐸훿̂2²)/4)                         (3.8)

where 푆퐸훿̂푗² was defined in Equation 2.2 of Module 2.

Example 3.12. An eye-witness identification study with 20 participants per group at Iowa State University assessed participants’ certainty in their selection of a suspect individual from a photo lineup after viewing a short video of a crime scene. Two treatment conditions were assessed in each study. In the first treatment condition the participants were told that the target individual “will be” in a 5-person photo lineup, and in the second treatment condition participants were told that the target individual “might be” in a 5-person photo lineup. The suspect was included in the lineup in both instruction conditions. The estimated means were 7.4 and 6.3 and the estimated standard deviations were 1.7 and 2.3 in the “will be” and “might be” conditions, respectively. This study was replicated at UCLA using 40 participants per group. In the UCLA study, the estimated means were 6.9 and 5.7, and the estimated standard deviations were 1.5 and 2.0 in the “will be” and “might be” conditions, respectively. A 95% confidence interval for (휇1 – 휇2) – (휇3 – 휇4) indicated that 휇1 – 휇2 and 휇3 – 휇4 do not appear to be substantially dissimilar. The 95% confidence interval for (휇1 + 휇3)/2 – (휇2 + 휇4)/2, which describes the Iowa State and UCLA study populations, was [0.43, 1.87].
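A large-sample (z-based) sketch of the averaged mean-difference interval in Example 3.12, with coefficients .5, -.5, .5, -.5. The text's interval [0.43, 1.87] uses a df-adjusted critical value, so this simplified version differs slightly:

```python
# Unequal-variance CI for (mu1 + mu3)/2 - (mu2 + mu4)/2 using summary statistics
# from Example 3.12 (Iowa State then UCLA, "will be"/"might be" conditions).
from math import sqrt

means = [7.4, 6.3, 6.9, 5.7]
sds   = [1.7, 2.3, 1.5, 2.0]
ns    = [20, 20, 40, 40]
coefs = [0.5, -0.5, 0.5, -0.5]

est = sum(c * m for c, m in zip(coefs, means))
se = sqrt(sum(c ** 2 * s ** 2 / n for c, s, n in zip(coefs, sds, ns)))
lo, hi = est - 1.96 * se, est + 1.96 * se
print(round(est, 2), round(lo, 2), round(hi, 2))
```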


Data Transformations and Interaction Effects

Data transformations were described in Module 1 as a way to reduce nonnormality. Most psychological measures are assumed to be interval-scale measurements but they might actually be ordinal-scale measurements. Interval-scale measurements are assumed to be linearly related to the attribute they claim to measure while ordinal-scale measurements are assumed to be monotonically related to the attribute. If any data transformation substantially reduces the magnitude of an interaction effect, the interaction might simply be due to the characteristics of the measurement scale.

Consider the following example of a 2 × 2 design with three participants per group.

                   Factor B
                b1              b2
Factor A  푎1   49, 64, 81      100, 121, 144
          푎2   1, 4, 9         16, 25, 36

The simple main effect of A at b1 is 64.67 – 4.67 = 60 and the simple main effect of A at b2 is 121.67 – 25.67 = 96, which indicates a nonzero interaction effect in this sample. After taking a square root transformation of the data, the simple main effect of A at b1 is 8 – 2 = 6 and the simple main effect of A at b2 is 11 – 5 = 6, which indicates a zero interaction effect. In this example, the estimated interaction effect was reduced to zero by simply transforming the data.
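The arithmetic above can be verified directly:

```python
# Simple main effects of A on the raw and square-root scales for the
# 2x2 example in the text: the interaction vanishes after transformation.
from math import sqrt

cells = {                      # (level of A, level of B) -> scores
    (1, 1): [49, 64, 81], (1, 2): [100, 121, 144],
    (2, 1): [1, 4, 9],    (2, 2): [16, 25, 36],
}

def mean(x):
    return sum(x) / len(x)

def simple_effect_A(b, transform=lambda v: v):
    """Simple main effect of A at level b of Factor B."""
    return mean([transform(v) for v in cells[1, b]]) - \
           mean([transform(v) for v in cells[2, b]])

raw = (simple_effect_A(1), simple_effect_A(2))                       # 60 and 96
transformed = (simple_effect_A(1, sqrt), simple_effect_A(2, sqrt))   # 6 and 6
print(raw, transformed)
```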

Interaction effects can be classified as removable or non-removable. A removable interaction effect (also called an ordinal interaction effect) can be reduced to near zero by some data transformation (e.g., log, square-root, reciprocal). A non- removable interaction effect (also called a disordinal interaction effect) cannot be reduced to near zero by a data transformation. In a two-factor design, if the simple main effects (or simple pairwise main effects) of Factor A have different signs at different levels of Factor B, or the simple main effects (or simple pairwise main effects) of Factor B have different signs at different levels of Factor A, then the interaction effect is non-removable. Otherwise, the interaction effect is removable by some data transformation. In studies where an interaction effect has an important theoretical implication, a more compelling theoretical argument can be made if it can be shown, based on confidence intervals for the simple main effects, that the observed interaction effect is non-removable.


Graphing Results

Results of a two-factor design can be illustrated using a clustered bar chart where the means for the levels of one factor are represented by a cluster of contiguous bars (with different colors, shades, or patterns) and the levels of the second factor are represented by different clusters.

If one factor is more interesting than the other factor, the factor levels within each cluster should represent the more interesting factor because it is easier to visually compare means within a cluster than across clusters. An example of a clustered bar chart for a 2 × 2 design is shown below, where the levels of Factor A define each cluster; in such a graph it is easy to see that the mean for level 2 of Factor A is greater than the mean for level 1 of Factor A within each level of Factor B.
