
The Two-Sample T-Test and the Influence of Outliers


- A simulation study on how the type I error rate is impacted by outliers of different magnitude.

Bachelor’s thesis, Department of Statistics, Uppsala University

Date: 2019-01-15

Author: Carl Widerberg

Supervisor: Harry Khamis

Abstract

This study investigates how outliers of different magnitude impact the robustness of the two-sample t-test. A simulation study approach is used to analyze the behavior of type I error rates when outliers are added to generated data. Outliers may distort parameter estimates such as the mean and the variance and cause misleading test results. Previous research has shown that Welch’s t-test performs better than the traditional Student’s t-test when group variances are unequal. Therefore these two alternative statistics are compared in terms of type I error rates when outliers are added to the samples. The results show that control of type I error rates can be maintained in the presence of a single outlier. Depending on the magnitude of the outlier and the sample size, there are scenarios where the t-test is robust. However, the sensitivity of the t-test is illustrated by deteriorating type I error rates when more than one outlier is included. The comparison between Welch’s t-test and Student’s t-test shows that the former is marginally more robust against outlier influence.

Keywords

Outlier, extreme value, outlying observation, ANOVA, Two-sample t-test, Student’s t-test, Welch’s t-test, type I error rate, robustness.

Table of contents

1. Introduction
2. Theory and methodology
   2.1 What is an outlier?
   2.2 Methodology
       2.2.1 ANOVA
       2.2.2 Student’s t-test
       2.2.3 Welch’s t-test
3. Simulation study
   3.1 Sample size
   3.2 Outlier values
   3.3 Simulation set up
4. Results
   4.1 Small sample size
   4.2 Medium sample size
   4.3 Large sample size
5. Conclusion
References
Appendix A. Additional Simulations
   Alternative 1
   Alternative 2
   Alternative 3

1. Introduction The wide use of Analysis of Variance (ANOVA) for inference may stem from the applicability and the relevance of comparing means between groups. One-way ANOVA is a straightforward approach and yields interpretable results, enabling the researcher to make inferences about population means based on sample means.

The aptness, however, is dependent on a number of assumptions being met. As for all parametric tests, parameters and estimates are a foundation of ANOVA. This creates a sensitivity towards non-normality, both in parameters and errors, and towards unequal variances. These assumptions need to be met if the ANOVA is to be considered reliable (Ramsey et al., 2011). There has been a fair amount of academic interest in these assumptions and how violations affect results. Methods have been developed to provide researchers with more robust procedures using trimming and bootstrapping of data (Keselman et al., 2004). However, Ramsey et al. (2011) argue that these “super-robust” methods have problems dealing with outliers. Trimming, for example, sacrifices large portions of data to remove outliers, but the issue is that outliers might contain important information. While academia has long been aware of the lack of robustness and potentially skewed results which can be consequences of outliers in ANOVA, there is not a great amount of published academic research on the actual effect of outliers on ANOVA.

Hampel et al (1986) argue that researchers oftentimes rely on subjective decisions when faced with outliers. A deviant value in the sample is identified as either an outlier or not, and then the researcher finds a rule or identification method to support their decision (p.57). Sheskin (2000) presents a similar view. What appears to be an outlier for one researcher may not appear so for another, and the individual view is influenced by knowledge within an area. Hampel et al (1986) discuss how outliers are of varying importance in different scenarios. In some disciplines, for example medicine, an outlier can contain crucial information and it might not be suitable to remove it. Sheskin (2000) argues that there is a risk at both ends of the treatment of outliers. At one end, removing outliers might eliminate valid information about an underlying population. At the other end, including all available observations might contaminate the sample and therefore distort the results. Stock & Watson (2015) even argue that the assumptions of the closely related OLS regression should be expanded to include that large outliers in the data are unlikely, that is, that the fourth moments are finite (p.173). The reason behind this view is the risk of misleading results due to large outliers.

Overall there is little opposition towards the view that standard ANOVA techniques are sensitive to outliers. But the method of simply deleting deviating values without further consideration is opposed by many researchers. Sheskin (2000) states that the presence of one or several outliers can have a substantial effect on both the mean and the variance of a distribution. If the variance is impacted by outliers, test statistics might not be reliable. Zimmerman & Zumbo (2009) discuss how the pooling of variances done in ANOVA distorts the probability of making a type I error when variances are unequal. There are alternative tests available, such as the test presented by Welch (1938), which do not pool variances. Research has shown that Welch’s test can manage unequal variances (Delacre et al. 2017), but there is less evidence about how it performs when outliers are included.

In his bachelor’s thesis, Halldestam (2016) examines, through a simulation study, the robustness of the parameter estimates in One-way ANOVA when outliers are present. He finds that outliers increase the type I error probability, concluding that the parameter estimates are not robust. However, the outliers in his study are fixed to a single, rather extreme value. There may be opportunities for deepening the understanding by investigating outliers of varying magnitudes. Expanding the evidence may provide researchers with more guidance when they are faced with outliers in data, potentially decreasing the loss of valuable information stemming from deletion of outliers.

The purpose of this study is therefore to investigate how outliers of different magnitudes influence the special case of One-way ANOVA known as the two-sample t-test. The research questions are: What is the influence of outliers of different magnitude on the type I error rate in the two-sample t-test? Are there any differences in robustness between Welch’s t-test and Student’s t-test when outliers are present?


2. Theory and methodology In practice, assumptions related to statistical tests are rarely perfectly satisfied. It is therefore important to know whether a statistical method is robust to particular violations of assumptions (Agresti & Finlay, 2014, 122). For this purpose a simulation study is arguably suitable. The advantages of simulation studies have been clear since the 1930s, as they give the researcher the ability to control the underlying data (Stigler, 1977). This study turns the focus towards outliers, and the simulation approach enables control of the distribution of the data since it is generated through a computer program instead of collected. Therefore outliers can be added to samples drawn from a known distribution, which makes it easier to capture the effect of outliers without uncertainty regarding the distribution of the underlying population.

However, it is appropriate to first establish a foundation regarding what “outlier” means in practice.

2.1 What is an outlier? There is no universal definition of what actually constitutes an outlier. Barnett & Lewis (1984:4) provide the following definition in their book titled “Outliers in Statistical Data”.

“An outlier is an observation (or subset of observations) which appears to be inconsistent with the remainder of that set of data.”

The authors discuss the wording “appears to be inconsistent”, as it means that the definition of an outlier is dependent on subjective interpretation. This view is shared by other researchers (Hampel et al., 1986; Sheskin, 2000).

To mitigate this subjectivity, there are a number of available outlier identification techniques. However, there is no universally applied approach (Hodge & Austin, 2004, 121). Despite the vast number of published definitions of what constitutes an outlier, practitioners are still interested in ways to objectively identify whether an observation is an outlier or not (Seely et al., 2003, 37).

One of the most common ways of detecting outliers is to use boxplots. Tukey (1977) discusses the convenience in having a rule of thumb for detecting outliers and presents the boxplot. A boxplot is based on the median and the hinges (essentially the quartiles) of a distribution. The hinge-spread is defined as the distance between the 25th and the 75th percentile. This distance creates the “box” in the boxplot. The inner fences are set at 1,5 times the hinge-spread (or interquartile range (IR)) out from the box. The outer fences are 3 times the hinge-spread out from the box. The observations inside the inner fences are categorized as adjacent. Observations between the inner and outer fences are categorized as outside values, and values outside the outer fences are categorized as far out. (Tukey, 1977, 44)
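
As a minimal sketch of these fences in R (the variable x is an arbitrary example sample, and sample quartiles are used in place of Tukey’s hinges), the cut-offs can be computed as:

    # Sketch: Tukey's fences computed from the quartiles of an example sample x
    set.seed(1)
    x <- rnorm(100)                                  # example data from a standard normal
    q <- quantile(x, c(0.25, 0.75))                  # 25th and 75th percentiles (the "box")
    ir <- unname(q[2] - q[1])                        # hinge-spread / interquartile range (IR)
    inner <- c(q[1] - 1.5 * ir, q[2] + 1.5 * ir)     # inner fences
    outer <- c(q[1] - 3 * ir, q[2] + 3 * ir)         # outer fences
    x[x < inner[1] | x > inner[2]]                   # "outside" (and "far out") values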

The separation of far out values (extreme outliers) and outside values (outliers) is suitable for this study. The aim is to investigate the influence of outliers of varying magnitude. In comparison, Ramsey et al. (2011) and Halldestam (2016) both use outliers which are significantly beyond the outer fences. Halldestam (2016) sets outliers to approximately two times the outer fence and Ramsey et al. (2011) set outliers to five standard deviations from the mean. In practice, researchers can be faced with both extreme outliers and outliers that are marginally beyond the fences. Researchers may feel comfortable dealing with extreme outliers, where the effects on p-values are known to be substantial, as compared to outliers of varying magnitude where the influence on the results of the analysis is more opaque. This study will contribute to the understanding of how different kinds of outliers influence ANOVA.

2.2 Methodology The methodological foundation is Analysis of Variance. In this study, the main focus will be on the special case of ANOVA named the two-sample t-test. As the name implies, the analysis concerns the difference in means between two populations based on group sample means. For this purpose both Student’s t and its modification, Welch’s t, are tested. The reason for applying two alternative tests is that the addition of outliers causes the group variances to deviate, which is a violation of one of the assumptions of ANOVA. It has been shown that Welch’s t is more robust in cases of unequal variances (Delacre et al. 2017).

In this section, the aforementioned methods are presented. First, the general case of ANOVA is visited, followed by the modified versions of ANOVA, Student’s t-test and Welch’s t-test.

2.2.1 ANOVA The motive behind using ANOVA in research is to find out whether there are statistically significant differences between group means. These groups are formed by a categorical variable, sorting observations depending on what category level an observation belongs to. For example, two different kinds of fertilizers can be used on a number of plots of soil. By looking at the yield of each plot, one can easily calculate the mean yield of each group and find out whether there are any

differences between the two fertilizer types. However, it is not suitable to make any general remarks by only looking at the group sample means. For generalization to the population, one needs to draw inference by investigating whether the means in the sample are representative of the means in the population, taking into account the random variation of the sample means. ANOVA is used to accomplish this goal.

The standard formula for the one-way ANOVA is:

Y_{ij} = \mu + \alpha_i + \varepsilon_{ij}, \quad i = 1, \dots, k, \; j = 1, \dots, n_i

where \mu is the overall mean, \alpha_i is the differential effect of the ith treatment, and \varepsilon_{ij} is the error term (Scariano & Davenport, 1987).

The aim of ANOVA is to find out whether there are any differences between group means. To do that, ANOVA makes use of the variability in the data. This variability quantifies the spread of the individual observations around the mean. The variability is made up of two quantities, which in ANOVA are kept separate. The first quantity is the sum of squares, and the other is the degrees of freedom associated with the sum of squares. The degrees of freedom are interpreted as the number of independent pieces of information that contribute to a statistic, in this case, the variability around the mean.

The first step is to calculate a grand mean of the total sample. The next step is to calculate group means for each group. If there are differences between the group means, there will be a lot of variability around the grand mean, and fitting the group means will explain most of that variability. If there are no differences between the groups, fitting the group means will remove little of the variability around the grand mean. So at what threshold does the ANOVA reject the null hypothesis of no significant difference between the groups? How much of the variability in the grand mean must be removed by the fitting of the group means to obtain a significant result? To answer this, the ANOVA statistic needs to be broken down into its mathematical components.

The breakdown is done to separate the variability between the groups and the variability within the groups. The between group sum of squares (SSB) is calculated by squaring the deviations of the group means from the grand mean, weighted by the number of observations in each group.


SSB = \sum_{j=1}^{k} n_j (\bar{x}_j - \bar{x}_{grand})^2

where k is the number of groups and n_j is the number of observations in group j.

The sum of squares within groups (SSW) is calculated by squaring the deviation of each observation from its group mean and pooling over groups. This gives a measure of the variation within each of the groups.

SSW = \sum_{j=1}^{k} \sum_{i=1}^{n_j} (x_{ij} - \bar{x}_j)^2

The SSB and the SSW summed together give, logically, the total sum of squares (SST). That is, the sum of squared deviations of each observation around the grand mean.

SST = \sum_{i=1}^{n} (x_i - \bar{x}_{grand})^2

Or, SST = SSB + SSW.
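
The decomposition can be checked numerically. The following R sketch (with two made-up groups g1 and g2) verifies that the three sums of squares satisfy SST = SSB + SSW:

    # Sketch: verifying SST = SSB + SSW on a small two-group example
    set.seed(1)
    g1 <- rnorm(5); g2 <- rnorm(5)
    x <- c(g1, g2)
    grand <- mean(x)
    SSB <- length(g1) * (mean(g1) - grand)^2 + length(g2) * (mean(g2) - grand)^2
    SSW <- sum((g1 - mean(g1))^2) + sum((g2 - mean(g2))^2)
    SST <- sum((x - grand)^2)
    all.equal(SST, SSB + SSW)    # TRUE: the total variability splits into the two parts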

At this point, it is possible to analyze the variability between the group means and see if there are any distinguishable differences. However, for a thorough statistical comparison it is not enough to only look at the variability. The variability needs to be adjusted for the degrees of freedom. In doing this, the sum of squares is transformed into variance.

Starting with the variance between different groups adjusted for the number of groups, that is k-1 degrees of freedom, gives the Mean Square between groups.

MSB = \frac{SSB}{k - 1} = \frac{\sum_{j=1}^{k} n_j (\bar{x}_j - \bar{x}_{grand})^2}{k - 1}

A similar adjustment is made for SSW, but here n-k is used to adjust for sample size. The deviations within each group must always sum to zero, so for any number of observations the deviation of the last observation in each group is predetermined, which removes one degree of freedom per group. This gives the Mean Square within groups.


MSW = \frac{SSW}{n - k} = \frac{\sum_{j=1}^{k} \sum_{i=1}^{n_j} (x_{ij} - \bar{x}_j)^2}{n - k}

The total variance in the data, also called Total Mean Square (MST) is calculated by combining the total sum of squares and the degrees of freedom. Here n-1 is the degrees of freedom.

MST = \frac{SST}{n - 1} = \frac{\sum_{i=1}^{n} (x_i - \bar{x}_{grand})^2}{n - 1}

After obtaining variances, the next step is to formally test whether the differences are statistically significant. For ANOVA with more than two groups, this is done through the F-ratio. If there are no differences between the groups, the variation between the groups would be similar to the variation within the groups. Thus the F-ratio is calculated as:

F = \frac{MSB}{MSW}

An F-ratio above 1 suggests that the group means are different because the variance between the groups is larger than the variance within the groups. If the F-ratio is below 1, the interpretation is the opposite. Since the variation within the group is larger than the variation between the groups it is difficult to distinguish any differences between the groups.

In statistics, the p-value is often of large interest, and to obtain a p-value for the ANOVA, the F-distribution is used to look up the p-value that corresponds to the F-ratio and the degrees of freedom connected to both the within-group and the between-group sum of squares. A p-value below 5% leads to rejection of the null hypothesis of no difference in means between the population groups. A p-value above 5% means that there is not enough evidence to reject the null hypothesis that the population groups are equal in means.
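
As an illustrative sketch (the data y and the grouping factor are arbitrary examples), the mean squares, F-ratio and p-value can be computed step by step in R and compared with the built-in ANOVA:

    # Sketch: F-ratio and p-value for a one-way ANOVA with two groups
    set.seed(2)
    y <- c(rnorm(10), rnorm(10))
    group <- factor(rep(c("A", "B"), each = 10))
    k <- nlevels(group); n <- length(y)
    SSB <- sum(tapply(y, group, function(g) length(g) * (mean(g) - mean(y))^2))
    SSW <- sum(tapply(y, group, function(g) sum((g - mean(g))^2)))
    MSB <- SSB / (k - 1)
    MSW <- SSW / (n - k)
    F_ratio <- MSB / MSW
    p_value <- pf(F_ratio, df1 = k - 1, df2 = n - k, lower.tail = FALSE)  # upper-tail p
    anova(aov(y ~ group))    # reproduces the same F-ratio and p-value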

2.2.2 Student’s t-test If the categorical variable divides the data into two equally sized groups, the special case of ANOVA named Student’s t-test is used. In this scenario, the t-statistic and the t-distribution are used. Since there are two groups, the difference between the means can be measured in a more direct manner compared to the one-way ANOVA. The t-statistic is calculated as:


t^{*} = \frac{\bar{X}_1 - \bar{X}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}

where s_p represents the pooled standard deviation for the equally sized groups and their two variances:

s_p = \sqrt{\frac{s_{x_1}^2 + s_{x_2}^2}{2}}

The t-statistic is then checked against the t-distribution to retrieve the p-value for the test. If a low p-value is obtained, the researcher can reject the null hypothesis of no difference in means in favor of the alternative hypothesis that the population group means differ.
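
A small R sketch of this computation for two equally sized groups (the samples x1 and x2 are arbitrary examples) is:

    # Sketch: Student's t with a pooled standard deviation for two equally sized groups
    set.seed(3)
    n <- 30
    x1 <- rnorm(n); x2 <- rnorm(n)
    sp <- sqrt((var(x1) + var(x2)) / 2)                       # pooled sd for equal group sizes
    t_star <- (mean(x1) - mean(x2)) / (sp * sqrt(1/n + 1/n))
    p_value <- 2 * pt(abs(t_star), df = 2 * n - 2, lower.tail = FALSE)
    t.test(x1, x2, var.equal = TRUE)                          # same t-statistic and p-value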

2.2.3 Welch’s t-test Adding outliers will affect the variances. Equal variance across groups is an assumption of One-way ANOVA and Student’s t-test. Jan & Shieh (2014) state that research has shown that the F-test is not robust to heterogeneous populations and that actual significance levels can be distorted even in cases when group sizes are equal.

To control for this, one option available is to utilize the Welch-Satterthwaite equation through a Welch’s t-test. It is a common view among researchers that Welch’s test is the most appropriate method when there is heterogeneity in the data (Moser & Stevens, 1992). Jan & Shieh (2014) state that Welch’s t test maintains control over type I error rates when variances are unequal. Instead of pooling the variances, the test approximates the degrees of freedom so that they correspond to the separate-variance estimate of the standard error (Welch, 1938). The outcome is a more reliable statistic when the variances are unequal. The Welch’s t-statistic is arrived at through:

t^{**} = \frac{\bar{X}_1 - \bar{X}_2}{s_{\bar{\Delta}}}

Where,

s_{\bar{\Delta}} = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}


And where the approximate degrees of freedom are calculated as,

d.f. = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{(s_1^2/n_1)^2}{n_1 - 1} + \frac{(s_2^2/n_2)^2}{n_2 - 1}}
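
The following R sketch (x1 and x2 are arbitrary example samples with unequal spread) computes the Welch statistic and the approximate degrees of freedom by hand and compares them with R’s default t.test, which is Welch’s test:

    # Sketch: Welch's t-statistic and the Welch-Satterthwaite degrees of freedom
    set.seed(4)
    n1 <- 30; n2 <- 30
    x1 <- rnorm(n1); x2 <- rnorm(n2, sd = 2)
    se <- sqrt(var(x1)/n1 + var(x2)/n2)
    t_welch <- (mean(x1) - mean(x2)) / se
    df <- (var(x1)/n1 + var(x2)/n2)^2 /
          ((var(x1)/n1)^2 / (n1 - 1) + (var(x2)/n2)^2 / (n2 - 1))
    p_value <- 2 * pt(abs(t_welch), df = df, lower.tail = FALSE)
    t.test(x1, x2)    # var.equal = FALSE by default, i.e. Welch's t-test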

Agresti & Finlay (2014) argue that outliers create a case where the t-test is not suitable. This is due to outliers impacting the mean and making the mean a poor representative of the center of the distribution (p.122). However, Agresti & Finlay (2014) do not present any empirical evidence in connection to that statement. Seely et al. (2003) argue that a single outlier added to their dataset would inflate the estimate of variation and also bias the mean towards the outlier value. This is used as motivation for using an alternative test statistic, but little is mentioned about the possible variation in severity depending on the magnitude of the outlier. Delacre et al. (2017) argue that Welch’s t test should always be applied instead of Student’s t because of the enhanced robustness.

Thus, previous research shows that Welch’s t-test is more robust in cases where basic ANOVA- assumptions are not satisfied. The addition of outliers motivates the use of Welch’s t test as extreme values potentially violate these assumptions. The current study investigates the potential differences between the Student’s t-test and the Welch’s t-test when outliers are present. The simulation approach enables an analysis of how the two tests are affected by outliers.


3. Simulation study For the purpose of this study, a simulation approach is suitable. The choice is supported by previous studies using simulations when analyzing ANOVA. Delacre et al. (2017) perform a simulation study in which they investigate the difference in type I error rate between the Student’s t-test and the Welch’s t-test. Halldestam (2016) also performs a simulation study in which he adds outliers to sample-groups and analyzes the type I error rate.

Delacre et al. (2017) show that when variances are unequal, but the group sizes are equal, Student’s t-test manages to maintain good control over the type I-error rate. However, Delacre et al. (2017) argue that overall, Welch’s t-test provides more stable type I-error rates compared to Student’s t-test. As discussed earlier, the addition of outliers impacts the equal variance assumption. However, deviant observations also affect the means of sample-groups. This aspect is not addressed in the Delacre et al. (2017) study since they exclusively focus on unequal variances.

Halldestam (2016) introduces extreme outliers in the groups, which distort the variance homogeneity assumption of the one-way ANOVA. However, the design is balanced as there is an equal number of observations in each of the three groups. So while Delacre et al. (2017) investigate differences between Student’s t and Welch’s t due to unequal variances, and Halldestam (2016) analyzes outliers, there is a possibility to expand the results of both of these studies by jointly investigating outliers of varying magnitude and how well the two t-statistics can deal with the outlier-effects on sample group means and variances.

There are, however, possible drawbacks in using simulation studies. Stigler (1977) argues that many of the available robust estimators related to ANOVA have been tested mostly using simulated data. The data generated through simulation is not necessarily representative of real data, and Stigler (1977) discusses the difficulties relating to generating data that behave the same as real data. One example brought forward is that extreme values are more common in the real datasets used, compared to many simulation studies. The concerns expressed by Stigler (1977) are relevant for the current study. In reality, data could include both outliers and other types of noise or distributional deviations. However, the purpose of the study is to focus on the influence of outliers, and therefore the simulation approach is deemed appropriate.


The first step of the simulation is that a standard normal population is generated using R. From this population a simulation loop is set up. The process starts with the drawing of a sample consisting of two groups from the population. One or more outliers are then added to one of the groups. Then Student’s t-test and Welch’s t-test are performed on the drawn sample, and the p-values of these tests are stored. One p-value for each test is generated for every replication performed. Both Ramsey et al. (2011) and Halldestam (2016) perform 10,000 replications. 10,000 replications is arguably an appropriate trade-off between time requirement and generalizability.

The significance level for the tests is 0,05, which should translate to an average null rejection rate of 0,05 if outliers have no effect on the results. This is because the two population groups are generated from a normal distribution with mean zero and variance 1. The 10 000 stored p-values for each of the two t-tests are then analyzed in relation to the significance level. The fraction of p-values below the significance level is the type I error rate, that is, the percentage of tests rejecting the null hypothesis of no difference between the group means when there is in fact no difference in group means.
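
A minimal sketch of this loop in R, assuming per-group samples of 30 observations and a single outlier of value 3,4 appended to one group (the seed and the exact implementation details are illustrative assumptions, not the original code), could look as follows:

    # Sketch: 10 000 replications of the two t-tests with one outlier added to one group
    set.seed(5)
    reps <- 10000; n <- 30; outlier <- 3.4       # outlier (i) in the small sample setting
    p_welch <- p_student <- numeric(reps)
    for (r in 1:reps) {
      g1 <- rnorm(n)                             # both groups come from N(0, 1)
      g2 <- c(rnorm(n - 1), outlier)             # draw n - 1 observations, then add the outlier
      p_welch[r]   <- t.test(g1, g2)$p.value                    # Welch's t-test
      p_student[r] <- t.test(g1, g2, var.equal = TRUE)$p.value  # Student's t-test
    }
    mean(p_welch < 0.05)      # simulated type I error rate, Welch's t
    mean(p_student < 0.05)    # simulated type I error rate, Student's t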

Depending on the resulting type I error rate, conclusions can be made about the effect of outliers. It is therefore relevant to provide a threshold for what value can be interpreted as significantly deviant from the 0,05 significance level. This can be done using a confidence interval (CI) (Ramsey et al., 2011).

Due to the design of the study, with varying outlier values and sample size settings, a number of type I error rates are generated and analyzed. As a consequence of the large number of comparisons between the 0,05 significance level and the simulated values, the risk of incorrectly declaring that a simulated value differs from 0,05 is inflated. In order to, at least partly, address this inflation, a 99% confidence interval is established. This means that the interval is wider than a 95% CI and thus the simulated type I error rates need to be further away from the 0,05 significance level in order to be considered significantly deviant.

n = number of replications = 10 000, p = 0,05

\hat{\sigma}_{\hat{p}}^2 = \frac{\hat{p}(1 - \hat{p})}{n} = \frac{0,05(1 - 0,05)}{10\,000} = 0,00000475

\hat{\sigma}_{\hat{p}} = \sqrt{0,00000475} = 0,002179

ME = 2,5758 \cdot \hat{\sigma}_{\hat{p}} = 0,0056

99\% confidence interval: [0,0444 ; 0,0556]

This means that type I error rates below 4,44% or above 5,56% will be considered as influenced by the outliers added into the samples. Ramsey et al. (2011) also refer to Bradley’s liberal criterion of robustness (1978), which states that no statistic can be considered robust if the type I error rate is outside the interval 0,5α – 1,5α, that is outside the interval [0,025 : 0,075]. This criterion is used in addition to the 99% CI.
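
Both criteria are straightforward to compute. The sketch below derives the 99% margin of error and checks an example rate of 0,0471 (taken from Table 3) against the CI and against Bradley’s interval:

    # Sketch: 99% CI around the nominal 5% level and Bradley's liberal criterion
    alpha <- 0.05; reps <- 10000
    se <- sqrt(alpha * (1 - alpha) / reps)            # 0.002179
    ci <- alpha + c(-1, 1) * qnorm(0.995) * se        # approximately [0.0444, 0.0556]
    bradley <- c(0.5, 1.5) * alpha                    # [0.025, 0.075]
    rate <- 0.0471                                    # an example simulated type I error rate
    rate < ci[1] | rate > ci[2]                       # significantly deviant from 0.05?
    rate < bradley[1] | rate > bradley[2]             # beyond Bradley's criterion?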

3.1 Sample size An important consideration for the study is the sample size. If the sample size is small, researchers need to be particularly cautious when it comes to outliers, as they can affect the suitability of the mean as a measure of the center (Agresti & Finlay, 2014, p.128). Large sample sizes are especially important when extreme outliers are present, as averaging effects can mitigate the outlier impact (Boneau, 1960). Therefore it is arguably relevant to perform the analysis in different sample size settings. Sample sizes used for simulating are determined using the technique presented by Cohen (1988).

\text{Per group } n = 16 \left(\frac{\sigma}{d}\right)^2

where the effect size d = 0,2\sigma for a small effect, 0,5\sigma for a medium effect, and 0,8\sigma for a large effect.

This yields three different per-group sample sizes:

16\left(\frac{\sigma}{0,2\sigma}\right)^2 = 400, \quad 16\left(\frac{\sigma}{0,5\sigma}\right)^2 = 64, \quad 16\left(\frac{\sigma}{0,8\sigma}\right)^2 = 25

The calculated sample size of 25 is rounded up to 30 to make use of the central limit theorem. Thus, three different per group sample sizes are simulated and analyzed: 30, 64 and 400 observations per group.
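
As a small check (a sketch of the rule above; sigma cancels, so only the effect sizes matter), the three per-group sample sizes can be reproduced in R:

    # Sketch: per-group sample sizes from n = 16 * (sigma / d)^2
    d <- c(small = 0.2, medium = 0.5, large = 0.8)    # effect sizes in units of sigma
    n_per_group <- 16 / d^2                           # sigma cancels out of the ratio
    n_per_group                                       # 400, 64, 25 (the last is rounded up to 30)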


3.2 Outlier values Both Ramsey et al. (2011) and Halldestam (2016) use fixed-value outliers. Ramsey et al. (2011) set outlier values to \mu + 5\sigma while Halldestam (2016) uses outliers with values ±10. This value is arrived at through doubling of the x-values of the outer fences (i.e. 3 IR from the box). Referring back to the Tukey (1977) classification, these outliers are categorized as far out and they can even be considered extreme outliers. Since the aim of the current study is to investigate how outliers of different magnitudes influence the two t-tests of Student and Welch, using a single outlier value is not appropriate.

In order to determine the values of the multiple outliers used in this study, the boxplot presented by Tukey (1977) is utilized. In the population generated, the first and third quartiles are -0,674 and 0,674 respectively. That makes an IR of 1,348. The IR is then utilized to amplify outlier values in a stepwise fashion. The calculated outlier values are presented in Table 1 below.

Table 1. Outlier values. Due to symmetry, the calculations for the negative outlier values are not shown.

Outlier (i)    Q3 + 2 IR    0,674 + 2 * 1,348 ≈ 3,4
Outlier (ii)   Q3 + 3 IR    0,674 + 3 * 1,348 ≈ 4,7
Outlier (iii)  Q3 + 4 IR    0,674 + 4 * 1,348 ≈ 6,1
Outlier (iv)   Q3 + 5 IR    0,674 + 5 * 1,348 ≈ 7,4
Outlier (v)    Q3 + 6 IR    0,674 + 6 * 1,348 ≈ 8,8

The outlier values range between 2 – 6 interquartile ranges from the third quartile. This translates to outlier values between 3,4 – 8,8 which can be compared to the mean of 0 in the generated population. Increasing outlier values with a fixed interval is arguably suitable since it provides nuances to how different outlier values impact the tests and allows for investigation of whether the varying severity of outliers can provide an extended analysis of the effect on type I error rate.
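
The values in Table 1 follow directly from the quartiles of the standard normal distribution; a short R sketch reproducing them is:

    # Sketch: outlier values defined as Q3 + 2..6 IR for a standard normal population
    q3 <- qnorm(0.75)           # 0.674
    ir <- q3 - qnorm(0.25)      # interquartile range, about 1.349
    round(q3 + 2:6 * ir, 1)     # 3.4, 4.7, 6.1, 7.4, 8.8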


3.3 Simulation set up In this section the structure of the simulations is presented. It is shown how the outliers calculated in the previous section are added into samples.

One possible simulation set up strategy is the one used by Ramsey et al. (2011). They run through each of the observations in their dataset and at each data point there is a 5% chance of it being replaced by an outlier. Doing this for all of the observations yields a dataset where approximately 1 out of 20 observations is an outlier. A similar approach is possible for the current study as well. However, the loss of control due to random inclusion of outliers is not suitable for the current study. Halldestam (2016) establishes a pre-specified inclusion-scheme and adds outliers to the three groups in the one-way ANOVA accordingly.

Adopting the approach of Halldestam (2016) enables more control over the different outlier scenarios. In addition, it makes it easier to compare the results of different outlier values since there are no differences in how they are added to the groups. The situations in Table 2 illustrate how outliers are added to the samples. For each scenario, the drawn sample leaves out the same number of observations as will be added as outliers. For example, the total sample drawn for Situation 2A contains 59 observations divided between the two groups. The outlier is then added to balance the group sizes.

Table 2

              A                                                 B
Situation 1   No Outliers (reference group)
Situation 2   One positive (i) outlier in one of the groups.    Two positive (i) outliers in one of the groups.
Situation 3   One positive (ii) outlier in one of the groups.   Two positive (ii) outliers in one of the groups.
Situation 4   One positive (iii) outlier in one of the groups.  Two positive (iii) outliers in one of the groups.
Situation 5   One positive (iv) outlier in one of the groups.   Two positive (iv) outliers in one of the groups.
Situation 6   One positive (v) outlier in one of the groups.    Two positive (v) outliers in one of the groups.

Situation 1 is the reference scenario without any outliers added. It will contain observations from the generated normal distribution, without any influence from outliers. In Situations 2 through 6, outliers of different magnitude are added to one of the two sample groups. Situations 2A and 2B include the “mildest” outlier, and the outlier value is then increased in a stepwise fashion, with Situations 6A and 6B containing the most extreme outlier value.
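
As an illustration of how a single sample is constructed under this scheme (a sketch for Situation 2A in the small sample setting; the group labels are illustrative), consider:

    # Sketch: one Situation 2A sample (60 observations in total, one outlier (i) added)
    n <- 30
    group_a <- rnorm(n)               # group A: 30 observations from N(0, 1)
    group_b <- c(rnorm(n - 1), 3.4)   # group B: 29 drawn observations plus the outlier
    # Situation 2B would instead draw 28 observations and add two outliers:
    # group_b <- c(rnorm(n - 2), 3.4, 3.4)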


Halldestam (2016) finds that one extreme outlier added to one of the three groups in the ANOVA setting yields a type I error rate below 5%. This is described as a possible special case. Zimmerman (1994) found that the probability of a type I error declined to 3% for simulated two-sample t-tests. He argued that this was due to both the extremity of the outlier and the probability of an outlier being included in the sample. These findings motivate analyzing what effect one outlier has on the type I error rate (column A in Table 2), as it will impact both the mean and the variance. Halldestam (2016) further shows that the type I error rate increases with the addition of multiple extreme outliers in his one-way ANOVA setting. These results motivate the choice of including two outliers in one of the sample groups (column B in Table 2).

Each scenario is generated and tested. This means that 10 000 tests, with the null hypothesis of no difference in means between the two groups in the population, are performed. A p-value for each test is extracted, and the fraction of tests with a p-value below the significance level of 5% is the type I error rate. Since the population is generated as normally distributed, there is no actual difference between the two underlying population groups. This means that if the fraction of rejected tests significantly increases or decreases, the statistic cannot be considered robust. The reason is that the presence of outliers may have a heavy influence on the obtained p-values, leading the researcher to wrongfully believe that there are significant differences between population means.


4. Results In this section the results of the simulations are presented. The presentation is divided according to the three alternative sample sizes.

4.1 Small sample size The small sample size contains a total of 60 observations divided between the two groups. The tables contain the fraction of tests rejecting the null hypothesis of no difference between the population groups. The generated population is normally distributed and there is no actual difference between the population groups.

Table 3. Type I error rate when one outlier is present (one outlier, small sample).
The 99% CI is [0,0444 ; 0,0556] and values beyond its limits are marked in bold. Values beyond Bradley’s (1978) criterion [0,025 ; 0,075] are marked with an asterisk (*).

              No Outliers   Outlier (i)   Outlier (ii)   Outlier (iii)   Outlier (iv)   Outlier (v)
Welch’s t     0,0550        0,0471        0,0456         0,0415          0,0363         0,0316
Student’s t   0,0546        0,0478        0,0472         0,0429          0,0392         0,0340

Table 3 shows that for the small sample size the type I error rate seems to decrease with the addition of one outlier. The reference group with no outliers in the sample yields a fraction of around 5,5%, which is right at the upper limit of the confidence interval. This is somewhat surprising as it should be close to 5%, but it is interpreted as due to chance. The two mildest outliers show type I error rates close to 5%, suggesting that there is no significant influence of a single mild outlier, even when the sample size is small. However, the three more severe outliers indicate that the fraction of tests rejected actually decreases with a single outlier present. The type I error rates for these outliers are beyond the lower limit of 4,44% in the 99% CI. This means that the presence of the outlier actually decreases the risk of making a type I error, which is in line with the results of the Halldestam (2016) study.

Analyzing the differences between the two test statistics shows that the fraction for Welch’s t is systematically lower compared to Student’s t. The differences, although not large, support the choice of Welch’s t when outliers are present.


Table 4. Type I error rate when two outliers are present (two outliers, small sample).
The 99% CI is [0,0444 ; 0,0556] and values beyond its limits are marked in bold. Values beyond Bradley’s (1978) criterion [0,025 ; 0,075] are marked with an asterisk (*).

              No Outliers   2 Outlier (i)   2 Outlier (ii)   2 Outlier (iii)   2 Outlier (iv)   2 Outlier (v)
Welch’s t     0,0502        0,0809*         0,0888*          0,0898*           0,0841*          0,0769*
Student’s t   0,0500        0,0805*         0,0904*          0,0926*           0,0885*          0,0819*

Focus now turns to Table 4, where instead of only one outlier, two outliers are added to one of the two groups. The No Outlier situation yields a type I error rate close to 5%, as expected. The fractions when outliers are present show that the type I error rate deteriorates in the small sample size setting, as all are above the 5,56% upper CI limit. All type I error rates are even beyond the upper limit of Bradley’s (1978) robustness criterion of 0,075. Therefore neither of the tests can be considered robust when more than one outlier is present in one of the groups and sample sizes are small.

The highest type I error rate belongs to the group containing two outlier (iii) values. The fraction for Student’s t is 9,26% and Welch’s t is marginally below 9%. In practice this means a significantly increased risk of rejecting the null hypothesis even though there is no difference in the population. It further seems that the fraction first increases with the distance from the third quartile, but only up until 4 IR. Outliers set at 5 and 6 IR from the “box” (Outlier (iv) & (v)) then seem to decrease the type I error rate again, although not to an extent that would suggest robustness.

Comparing Table 3 to Table 4 shows rather interesting results. A single outlier in one of the sample groups actually decreases the type I error rate well below 5%, while adding another outlier of the same value to the same group yields a type I error rate around 8-10%. The explanation for this might be the relationship between the mean difference and the variances in the formulas for the test statistics. Zimmerman & Zumbo (2009) show that when sample sizes are unequal, a larger variance connected to the larger of the two samples will inflate the pooled variance and yield a lower t-statistic and thus lower type I error rates. Although sample sizes are kept equal in the current study, it is possible that a single outlier might influence the variance to a larger extent than the mean, making it more difficult to find a significant difference. Adding a second outlier might in contrast not have such a large additional effect on the variance, but


together with the first outlier it can alter the mean and lead to larger t-statistics from the tests, making it easier to find significant differences between the groups.

Comparing the two test statistics shows that Welch’s t-test again yields lower type I error rates in the case of two outliers in the small sample size setting. Overall this suggests that when the sample size is small and one or two outliers are present in the data, the Welch’s t-test should be the preferred choice. This is consistent with the results presented by Delacre et al. (2017).

4.2 Medium sample size The medium sample size contains a total of 128 observations divided between the two groups.

Table 5. Type I error rate when one outlier is present (one outlier, medium sample).
The 99% CI is [0,0444 ; 0,0556] and values beyond its limits are marked in bold. Values beyond Bradley’s (1978) criterion [0,025 ; 0,075] are marked with an asterisk (*).

              No Outliers   Outlier (i)   Outlier (ii)   Outlier (iii)   Outlier (iv)   Outlier (v)
Welch’s t     0,0516        0,0480        0,0482         0,0478          0,0468         0,0441
Student’s t   0,0516        0,0475        0,0479         0,0477          0,0467         0,0437

Table 5 shows that the reference group with no outliers yields an expected fraction close to the significance level of 5%. Type I error rates, in similarity to the small sample case, are below 5% for all different outlier values. In addition, the fraction of rejected tests declines when the magnitude of the outlier is increased. The tendency is not as distinct as in the small sample setting, however, which is reflected by the fact that only outlier (v) shows a type I error rate lower than 4,44%.

Comparing the two test statistics, there do not seem to be any differences between how Welch’s t and Student’s t manage single outlier values when the total sample size is 128 observations. Overall the medium sample setting is more robust towards a single outlier compared to the smaller sample size.


Table 6. Type I error rate when two outliers are present (two outliers, medium sample).
The 99% CI is [0,0444 ; 0,0556] and values beyond its limits are marked in bold. Values beyond Bradley’s (1978) criterion [0,025 ; 0,075] are marked with an asterisk (*).

              No Outliers   2 Outlier (i)   2 Outlier (ii)   2 Outlier (iii)   2 Outlier (iv)   2 Outlier (v)
Welch’s t     0,0503        0,0678          0,0778*          0,0843*           0,0874*          0,0875*
Student’s t   0,0503        0,0680          0,0779*          0,0851*           0,0892*          0,0910*

Adding a second outlier to the medium sample size of 128 observations leads to an increased type I error rate. Table 6 illustrates that even two mild outliers added to one group increase the type I error rate to around 7%, which is above the upper limit of the 99% confidence interval. The fraction increases when the magnitude of the outlier increases. The highest type I error rate in the medium sample setting is approximately 9%, generated by two outlier (v) values at 6 IR from the “box”.

Comparison of the type I error rate between the two test statistics does not show any apparent differences, although Welch’s t yields a marginally lower fraction in each situation. Overall, the medium sample size shows similar results to the smaller sample size. The same phenomenon with decreasing type I error rates is observed for the medium sample size as in the small sample size setting when a single outlier is added. The addition of another identical outlier in each situation increases the type I error rate towards 9%, which is similar to the small sample setting. However, while there arguably were distinguishable differences between the two test statistics in the small sample setting, the medium sample shows no distinct differences.

4.3 Large sample size The large sample size contains 800 observations divided between the two groups. Again, the results presented in Tables 7 and 8 are the fraction of tests which rejected the null hypothesis of no difference between the population groups, out of the 10 000 replications performed.

Table 7. Type I error rate when one outlier is present (one outlier, large sample).
The 99% CI is [0,0444 ; 0,0556] and values beyond its limits are marked in bold. Values beyond Bradley’s (1978) criterion [0,025 ; 0,075] are marked with an asterisk (*).

              No Outliers   Outlier (i)   Outlier (ii)   Outlier (iii)   Outlier (iv)   Outlier (v)
Welch’s t     0,0504        0,0473        0,0486         0,0492          0,0488         0,0492
Student’s t   0,0503        0,0472        0,0486         0,0491          0,0492         0,0492


Table 7 indicates that adding a single outlier into one of the groups when the sample size is large does not affect the type I error rate. The mildest outlier yields a fraction of 4,7% and the most extreme, located 6 IR from the “box” is just below 5%. All of the fractions are within the limits of the 99% confidence interval. The presence of a single outlier, as long as it is not a severely extreme value, will thus not influence the type I error rate when the sample size is around 400 per group.

The tendency of diminishing differences between the two test statistics observed in the medium sample setting continues in the large sample size setting. As illustrated in Table 7, the type I error rate of Welch’s t and Student’s t are almost identical and both are able to maintain control over the fraction of rejections. It seems as if the advantage of Welch’s t test is most apparent when sample sizes are smaller.

Table 8. Type I error rate when two outliers are present (two outliers, large sample).
The 99% CI is [0,0444 ; 0,0556] and values beyond its limits are marked in bold. Values beyond Bradley’s (1978) criterion [0,025 ; 0,075] are marked with an asterisk (*).

              No Outliers   2 Outlier (i)   2 Outlier (ii)   2 Outlier (iii)   2 Outlier (iv)   2 Outlier (v)
Welch’s t     0,0521        0,0496          0,0556           0,0608            0,0655           0,0693
Student’s t   0,0518        0,0496          0,0556           0,0604            0,0655           0,0698

The effect of adding two outliers of each magnitude into one of the two groups is illustrated in Table 8. There is no clear difference between the reference group and adding two of the mildest outlier (i) as the type I error rates are 5,21% and 4,96% respectively. For the other outlier magnitudes (ii – v) the fractions are all beyond the upper limit of the 99% CI. They also seem to increase when outlier values become more extreme. However, no type I error rates are beyond the 7,5% upper limit of Bradley’s (1978) robustness interval. The most extreme outlier (v) yields a fraction of approximately 7%, or around 700 wrongfully rejected null hypotheses out of the 10 000 replications performed. Using Bradley’s alternative criterion there is not enough deviation to conclude that the statistic is not robust. The large sample size manages to maintain a better control of the type I error rate compared to the small and medium samples.


5. Conclusion The aim of this study was to investigate how outliers of different magnitude influence the type I error rate for two-sample t-tests. For this purpose, the two alternative test statistics Student’s t-test and Welch’s t-test were simulated and analyzed.

The results of the simulations performed show that there are differences in how outliers of different magnitude impact the type I error rate. The effect of one outlier added to one of the two sample groups depends on the sample size. When samples are small, the type I error rate deviates significantly for the three most extreme outliers. The medium sample manages to better resist the impact of outliers, but the most extreme outlier still causes a significantly deviant type I error rate. The large sample size setting absorbs the effect of a single outlier and there are no significant deviations.

When faced with a single outlier in the data, researchers must therefore consider both the magnitude of the outlier and the sample size. If the sample consists of a large number of observations the evidence in this study shows that both Student’s t-test and Welch’s t-test are robust. However, when sample sizes are smaller, a single outlier can impact the average probability of incorrectly rejecting the null hypothesis.

The influence of two outliers added to one of the two sample groups provides more distinct evidence. The type I error rate deviates significantly for every magnitude and sample size, with one exception. Adding two of the mildest outliers does not cause a significantly deviating type I error rate in the large sample size setting. Although the effects are mitigated by increasing the sample size, the results show an unequivocal sensitivity towards two outliers in the same sample group. Researchers must therefore tread with caution when faced with multiple outliers in samples of data. Support for this conclusion was also found in three additional simulations in which the outliers were added to the groups in alternative ways. The additional simulations are presented in Appendix A.

Throughout the simulations performed, the Welch’s t-test yields more robust type I error rates than the Student’s t-test. The comparison of the two alternative statistics suggests that Welch’s t-test should be the preferred choice as it shows a superior control of type I error rates. The enhanced control is most prominent when the sample is small. It should be emphasized, however, that the focus in this study has solely been on the type I error, and not on statistical power.

Investigating the statistical power in the presence of outliers might be a possible topic for future research. Another suggestion for future research is to generate data from distributions other than the standard normal used in this study and investigate how outliers impact the tests in those settings.

In conclusion, the results presented in this study show that outliers impact the robustness of the t-test. However, the results also show that there are cases when control over type I errors can be maintained despite the presence of outliers. Researchers need to be cautious, but there are situations when it is possible to use the t-test without deleting outliers.


References Agresti, A. & Finlay, B. (2014). “Statistical Methods for the Social Sciences”. Pearson Education Limited: Essex.

Barnett, V. & Lewis, T. (1984). Outliers in Statistical Data, Second Edition. John Wiley & Sons.

Boneau, C. A. (1960). “The Effects of Violations of Assumptions Underlying the t Test”. Psychological Bulletin, vol. 57(1), pp. 49-64.

Bradley, J.V. (1978). “Robustness?”. British Journal of Mathematical and Statistical Psychology, vol. 31(2), pp. 144-152.

Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences, Second Edition. Academic Press: New York.

Delacre, M., Lakens, D. & Leys, C. (2017). “Why Psychologists Should by Default Use Welch’s t-test Instead of Student’s t-test”. International Review of Social Psychology, vol. 30(1), pp. 92-101.

Halldestam, M. (2016). “ANOVA – The Effect of Outliers”. Bachelor’s thesis, Department of Statistics at Uppsala University.

Hampel, F.R., Ronchetti, E.M., Rousseeuw, P.J. & Stahel, W.A. (1986). “Robust Statistics: The Approach Based on Influence Functions”. Wiley: New York.

Hodge, V. & Austin, J. (2004). "A Survey of Outlier Detection Methodologies". Artificial Intelligence Review, vol. 22, no. 2, pp. 85-126.

Jan, S-L. & Shieh, G. (2014). “Sample Size Determinations for Welch’s Test in One-Way Heteroscedastic ANOVA”. British Journal of Mathematical and Statistical Psychology, vol. 67, pp. 72-93.

Keselman, H.J., Othman, A.R., Wilcox, R.R. & Fradette, K. (2004). “The New and Improved Two-Sample t-Test”, Psychological Science, vol. 15(1), pp. 47-51.


Moser, B.K. & Stevens, G.R. (1992). “Homogeneity of Variance in the Two-Sample Means Test”. The American Statistician, vol. 46(1), pp. 19-21.

Ramsey, P.H., Barrera, K., Hachimine-Semprebom, P. & Liu, C-C. (2011). “Pairwise Comparisons of Means Under Realistic Nonnormality, Unequal Variances, Outliers and Equal Sample Sizes”. Journal of Statistical Computation and Simulation, vol. 81(2), pp. 125-135.

Scariano, S.M. & Davenport, J.M. (1987). “The Effects of Violations of Independence Assumptions in the One-Way ANOVA”. The American Statistician, vol. 41(2), pp. 123-129.

Seely, R.J., Munyakazi, L., Haury, J. & Simmerman, H. (2003). “Application of the Weisberg t-Test for Outliers”. Pharmaceutical Technology Europe, vol. 15, no. 10, pp. 37.

Sheskin, D.J. (2000). “Handbook of Parametric and Nonparametric Statistical Procedures”. 2nd edn, Chapman & Hall/CRC Press: Boca Raton.

Stigler, S.M. (1977). “Do Robust Estimators Work With Real Data?”. The Annals of Statistics, vol. 5(6), pp. 1055-1098.

Stock, J.H. & Watson, M.W. (2015). “Introduction to Econometrics”. 3rd revised, global edn. Pearson Education: Harlow.

Tukey, J.W. (1977). “Exploratory Data Analysis”. Addison-Wesley: Reading, Mass.

Welch, B. L. (1938). “The Significance Of The Difference Between Two Means When The Population Variances Are Unequal”. Biometrika, vol. 29(3/4), pp. 350-362.

Zimmerman, D.W. (1994). “A Note on the Influence of Outliers on Parametric and Nonparametric Tests”. The Journal of General Psychology, vol. 121(4), pp. 391-396.

Zimmerman, D.W. & Zumbo, B. (2009). “Hazards in Choosing Between Pooled and Separate-Variances t Tests”. Psicológica, vol. 30, pp. 371-390.


Appendix A. Additional Simulations In this appendix, additional simulations are presented. These are included to illustrate alternative “Situations” from the ones chosen and analyzed more thoroughly in the study. The results from these additional situations are presented below.

Alternative 1 – Including one positive outlier in Group A and one negative outlier of the same magnitude in group B.

Tables 9-11 show that both t-tests are sensitive to this type of outlier presence. Type I error rates are significantly deviant for all but the two mildest magnitudes in the large sample. The statistics are seemingly not robust when a positive and a negative outlier are included.

Table 9. Type I error rate: one positive outlier in one group and one negative in the other (2 outliers, small sample).
The 99% CI is [0,0444 ; 0,0556] and values beyond its limits are marked in bold. Values beyond Bradley’s (1978) criterion [0,025 ; 0,075] are marked with an asterisk (*).

              No Outliers   Outlier (i)   Outlier (ii)   Outlier (iii)   Outlier (iv)   Outlier (v)
Welch’s t     0,0490        0,0780*       0,0830*        0,0826*         0,0761*        0,0655
Student’s t   0,0495        0,0798*       0,0870*        0,0873*         0,0827*        0,0724

Table 10. Type I error rate: one positive outlier in one group and one negative in the other (2 outliers, medium sample).
The 99% CI is [0,0444 ; 0,0556] and values beyond its limits are marked in bold. Values beyond Bradley’s (1978) criterion [0,025 ; 0,075] are marked with an asterisk (*).

              No Outliers   Outlier (i)   Outlier (ii)   Outlier (iii)   Outlier (iv)   Outlier (v)
Welch’s t     0,0523        0,0647        0,0746         0,0824*         0,0869*        0,0854*
Student’s t   0,0526        0,0646        0,0755*        0,0833*         0,0857*        0,0877*

Table 11. Type I error rate: one positive outlier in one group and one negative in the other (2 outliers, large sample).
The 99% CI is [0,0444 ; 0,0556] and values beyond its limits are marked in bold. Values beyond Bradley’s (1978) criterion [0,025 ; 0,075] are marked with an asterisk (*).

              No Outliers   Outlier (i)   Outlier (ii)   Outlier (iii)   Outlier (iv)   Outlier (v)
Welch’s t     0,0517        0,0511        0,0529         0,0599          0,0600         0,0645
Student’s t   0,0516        0,0511        0,0529         0,0559          0,0601         0,0645


Alternative 2 - Including one positive outlier and one negative outlier of the same magnitude in the same group.

When the positive and the negative outlier are included in the same group, Tables 12-14 show that average type I error rates are significantly deviant below the 5% level. This is not surprising, since the variance is inflated while the effect on the mean may be offset by the positive and negative outlier values. As shown in Alternative 1, the statistics are seemingly not robust when both positive and negative outliers are included.

Table 12. Type I error rate: one positive and one negative outlier in the same group (2 outliers, small sample).
The 99% CI is [0,0444 ; 0,0556] and values beyond its limits are marked in bold. Values beyond Bradley’s (1978) criterion [0,025 ; 0,075] are marked with an asterisk (*).

              No Outliers   Outlier (i)   Outlier (ii)   Outlier (iii)   Outlier (iv)   Outlier (v)
Welch’s t     0,0507        0,0201*       0,0091*        0,0035*         0,0008*        0,0002*
Student’s t   0,0505        0,0189*       0,0084*        0,0023*         0,0006*        0,0001*

Table 13. Type I error rate: one positive and one negative outlier in the same group (2 outliers, medium sample).
The 99% CI is [0,0444 ; 0,0556] and values beyond its limits are marked in bold. Values beyond Bradley’s (1978) criterion [0,025 ; 0,075] are marked with an asterisk (*).

              No Outliers   Outlier (i)   Outlier (ii)   Outlier (iii)   Outlier (iv)   Outlier (v)
Welch’s t     0,0498        0,0305        0,0206*        0,0101*         0,0057*        0,0030*
Student’s t   0,0496        0,0301        0,0205*        0,0099*         0,0057*        0,0021*

Table 14. Type I error rate: one positive and one negative outlier in the same group (2 outliers, large sample).
The 99% CI is [0,0444 ; 0,0556] and values beyond its limits are marked in bold. Values beyond Bradley’s (1978) criterion [0,025 ; 0,075] are marked with an asterisk (*).

              No Outliers   Outlier (i)   Outlier (ii)   Outlier (iii)   Outlier (iv)   Outlier (v)
Welch’s t     0,0526        0,0454        0,0423         0,0385          0,0352         0,0312
Student’s t   0,0527        0,0455        0,0425         0,0383          0,0350         0,0316


Alternative 3 – Including one positive outlier of the same magnitude in each of the two groups.

Tables 15-17 show that the two tests are not robust in this scenario. Type I error rates are significantly below 5%. The large sample setting seems to mitigate some of the effects.

Table 15. Type I error rate: one outlier in one group and one in the other, both positive (2 outliers, small sample).
The 99% CI is [0,0444 ; 0,0556] and values beyond its limits are marked in bold. Values beyond Bradley’s (1978) criterion [0,025 ; 0,075] are marked with an asterisk (*).

              No Outliers   Outlier (i)   Outlier (ii)   Outlier (iii)   Outlier (iv)   Outlier (v)
Welch’s t     0,0514        0,0193*       0,0071*        0,0026*         0,0006*        0,0002*
Student’s t   0,0502        0,0202*       0,0079*        0,0025*         0,0007*        0,0002*

Table 16. Type I error rate: one outlier in one group and one in the other, both positive (2 outliers, medium sample).
The 99% CI is [0,0444 ; 0,0556] and values beyond its limits are marked in bold. Values beyond Bradley’s (1978) criterion [0,025 ; 0,075] are marked with an asterisk (*).

              No Outliers   Outlier (i)   Outlier (ii)   Outlier (iii)   Outlier (iv)   Outlier (v)
Welch’s t     0,0481        0,0314        0,0233*        0,0129*         0,0077*        0,0038*
Student’s t   0,0483        0,0314        0,0238*        0,0134*         0,0080*        0,0039*

Table 17. Type I error rate: one outlier in one group and one in the other, both positive (2 outliers, large sample).
The 99% CI is [0,0444 ; 0,0556] and values beyond its limits are marked in bold. Values beyond Bradley’s (1978) criterion [0,025 ; 0,075] are marked with an asterisk (*).

              No Outliers   Outlier (i)   Outlier (ii)   Outlier (iii)   Outlier (iv)   Outlier (v)
Welch’s t     0,0457        0,0443        0,0420         0,0389          0,0361         0,0321
Student’s t   0,0458        0,0444        0,0420         0,0390          0,0360         0,0321

Overall, the additional simulations show that the addition of two outliers causes a deviating type I error rate, independently of how they are added to the groups. The differences between Welch’s t-test and Student’s t-test are marginal. Large samples can limit the deviation, but two-sample t-tests cannot be declared robust when two outliers exist in the data.
