ANALYSIS FRAMEWORK

The goal of the DABI analysis framework is to make it easier to find correlations between multiple variables and a target variable through batch processing. Using the DABI website interface, users select the variables they are interested in from any project in the archive, then designate one of them as the target variable, which serves as the dependent variable in every statistical test performed. The system chooses the statistical test according to the types of the variables being correlated. The currently supported variable types are nominal/categorical and numerical. The diagram below shows how variable types dictate which statistical tests are used on the data.

Spearman, Pearson, and One-Way ANOVA are well-known statistical tests. This document discusses the other statistical tests we use and why we chose them over the alternatives.

Theil’s U Test

Theil’s U, also referred to as the Uncertainty Coefficient, is based on the conditional entropy between x and y. In plain terms: given the value of x, how many possible states does y have, and how often do they occur? Like Cramer’s V, another measure of association between categorical variables, its value lies in the range [0,1], where 0 means no association and 1 means one variable is fully determined by the other. Unlike Cramer’s V, it is asymmetric, meaning U(x,y)≠U(y,x) (while V(x,y)=V(y,x), where V is Cramer’s V). This asymmetry lets Theil’s U detect cases where knowing y means we know x, but not vice versa.
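The asymmetry can be illustrated with a minimal Python sketch. It is not the DABI implementation: the function names, the argument order (theils_u(x, y) is taken here to mean "how much of the uncertainty in x is removed by knowing y"), and the toy data are all assumptions made for illustration.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy H(values), in nats."""
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in Counter(values).values())


def conditional_entropy(x, y):
    """Conditional entropy H(x | y), in nats."""
    n = len(x)
    joint_counts = Counter(zip(y, x))
    y_counts = Counter(y)
    h = 0.0
    for (y_value, _x_value), count in joint_counts.items():
        # p(x, y) * log(p(x, y) / p(y)), with the 1/n factors cancelling in the ratio
        h -= (count / n) * math.log(count / y_counts[y_value])
    return h


def theils_u(x, y):
    """Assumed convention: U(x | y) = (H(x) - H(x | y)) / H(x)."""
    h_x = entropy(x)
    if h_x == 0:
        return 1.0  # x is constant, so there is no uncertainty left to explain
    return (h_x - conditional_entropy(x, y)) / h_x


# Hypothetical toy data: every value of y maps to exactly one value of x,
# but each value of x is compatible with two values of y.
x = ["p", "p", "q", "q"]
y = ["a", "b", "c", "d"]

u_x_given_y = theils_u(x, y)  # knowing y pins down x completely
u_y_given_x = theils_u(y, x)  # knowing x only halves the uncertainty about y
print(u_x_given_y, u_y_given_x)
```

In this toy data the first value is 1 and the second is 0.5, which a symmetric measure could not express.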

As an example, if we want to find the correlation between gender and an arbitrary assessment variable with ordinal answers such as great, good, fair, and poor, we run a Theil’s U test. In practice this test serves the same purpose as Cramer’s V, but with one important difference: Cramer’s V is a symmetric test with no concept of dependent and independent variables. With Theil’s U, we can set either gender or the assessment variable as the independent variable, depending on the user’s selection.

Welch T-test [1]

Welch’s ANOVA is an alternative to the traditional one-way ANOVA with some important benefits. One-way analysis of variance determines whether differences between the means of at least three groups are statistically significant. For decades, introductory classes have taught the classic Fisher’s one-way ANOVA, which uses the F-test. However, simulation studies have found that unequal standard deviations cause the Type I error rate to shift away from the target significance level. If the group sizes are equal and the significance level is 0.05, the actual error rate falls between 0.02 and 0.08. If the groups have different sizes, the error rate can be as large as 0.22.
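For reference, the Welch statistic can be sketched in a few lines of Python. This is a minimal illustration of the standard Welch formula using only the standard library, not the DABI implementation; the function name and the sample data are hypothetical. It assumes every group has at least two observations and a nonzero variance.

```python
from statistics import mean, variance


def welch_anova(*groups):
    """Return (F statistic, df1, df2) for Welch's one-way ANOVA.

    Each group is weighted by n / s^2, so groups with noisy or small samples
    contribute less, and no equal-variance assumption is needed.
    """
    k = len(groups)
    if k < 2:
        raise ValueError("at least two groups are required")
    sizes = [len(g) for g in groups]
    means = [mean(g) for g in groups]
    weights = [n / variance(g) for n, g in zip(sizes, groups)]
    total_weight = sum(weights)

    # Variance-weighted grand mean and between-group mean square.
    weighted_mean = sum(w * m for w, m in zip(weights, means)) / total_weight
    between = sum(w * (m - weighted_mean) ** 2
                  for w, m in zip(weights, means)) / (k - 1)

    # Correction term that also determines the denominator degrees of freedom.
    lam = sum((1 - w / total_weight) ** 2 / (n - 1)
              for w, n in zip(weights, sizes))

    f_stat = between / (1 + 2 * (k - 2) * lam / (k ** 2 - 1))
    df1 = k - 1
    df2 = (k ** 2 - 1) / (3 * lam)
    return f_stat, df1, df2


# Hypothetical scores for three groups with clearly different spreads.
group_a = [4.1, 4.3, 4.0, 4.2, 4.4]
group_b = [3.0, 5.5, 4.8, 2.1, 6.0, 4.4]
group_c = [5.9, 6.1, 6.0, 6.2]

f_stat, df1, df2 = welch_anova(group_a, group_b, group_c)
print(f_stat, df1, df2)
```

A p-value follows from the upper tail of the F distribution with (df1, df2) degrees of freedom, for example via scipy.stats.f.sf(f_stat, df1, df2) if SciPy is available. With exactly two groups the statistic reduces to the square of Welch's t statistic.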

The simulation study results for the two types of analysis of variance, with unequal standard deviations and a significance level of 0.05, compare as follows:

-Classic ANOVA error rates extend from 0.02 to 0.22.

-Welch’s ANOVA error rates have a much smaller range of 0.046 to 0.054.

Correlation Ratio Test [2]

In statistics, the correlation ratio is a measure of the relationship between the statistical dispersion within individual categories and the dispersion across the whole population or sample. The measure is defined as the ratio of two standard deviations representing these types of variation. The context here is the same as that of the intraclass correlation coefficient, whose value is the square of the correlation ratio.

Mathematically, the square of the correlation ratio is the weighted variance of the category means divided by the variance of all samples; the correlation ratio itself is the square root of that quantity. In plain terms, the correlation ratio answers the following question: given a continuous number, how well can you tell which category it belongs to? Like Theil’s U and Cramer’s V, its value lies in the range [0,1].
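The definition can be written out as a short Python sketch. The function name is hypothetical and this is not the DABI implementation. The test scores are the worked example from the Wikipedia article cited for this section, where the squared correlation ratio comes out to 6780/9640.

```python
import math
from collections import defaultdict


def correlation_ratio(categories, values):
    """Correlation ratio (eta) between a categorical and a numerical variable."""
    groups = defaultdict(list)
    for category, value in zip(categories, values):
        groups[category].append(value)

    grand_mean = sum(values) / len(values)

    # Dispersion of the category means around the grand mean, weighted by group size.
    between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                  for g in groups.values())
    # Dispersion of all samples around the grand mean.
    total = sum((v - grand_mean) ** 2 for v in values)

    if total == 0:
        return 0.0  # all values identical: nothing to explain
    return math.sqrt(between / total)


# Test scores by subject, taken from the Wikipedia worked example.
scores = {
    "Algebra": [45, 70, 29, 15, 21],
    "Geometry": [40, 20, 30, 42],
    "Statistics": [65, 95, 80, 70, 85, 73],
}
categories = [subject for subject, marks in scores.items() for _ in marks]
values = [mark for marks in scores.values() for mark in marks]

eta = correlation_ratio(categories, values)
print(eta, eta ** 2)
```

A value near 1 means the categories separate the numerical values almost completely. A value near 0 means the category means barely differ from the overall mean.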

[1] https://statisticsbyjim.com/anova/welchs-anova-compared-to-classic-one-way-anova/
[2] https://en.wikipedia.org/wiki/Correlation_ratio