researchpy Documentation Release 0.3.2

Corey Bryant

Jun 02, 2021

CONTENTS:

1 codebook()
    1.1 Arguments
    1.2 Examples

2 summary_cont()
    2.1 Arguments
    2.2 Examples

3 summary_cat()
    3.1 Arguments
    3.2 Examples

4 difference_test()
    4.1 Arguments
    4.2 Examples
    4.3 References

5 ttest()
    5.1 Arguments
    5.2 Examples
    5.3 References

6 crosstab()
    6.1 Arguments
    6.2 Effect size measures formulas
    6.3 Examples
    6.4 References

7 corr_case()
    7.1 Arguments
    7.2 Examples
    7.3 References

8 corr_pair()
    8.1 Arguments
    8.2 Examples
    8.3 References

9 Installing researchpy


Researchpy produces DataFrames that contain relevant statistical testing information that is commonly required for academic research. The information is returned as Pandas DataFrames to make for quick and easy exporting of results to any format/method that works with the traditional Pandas DataFrame. Researchpy is essentially a wrapper that combines various established packages, such as pandas, scipy.stats, and statsmodels, to get all the standard required information in one method. If analyses were not available in these packages, code was developed to fill the gap. Formulas are provided with citations if the code originated from researchpy. All output has been tested and verified by comparing it to established software packages such as Stata, SAS, SPSS, and/or R. Moving forward, researchpy will be using the formula_like approach and is being streamlined during this process. For an example, see the new difference_test() method.

Note: researchpy is only compatible with Python 3.x. Download using either:

• pip install researchpy – For standard install
• conda install -c researchpy researchpy – For installation through conda


CHAPTER ONE

CODEBOOK()

Prints out descriptive information for a Pandas Series or DataFrame object.

1.1 Arguments

codebook(data)

• data, must either be a Pandas Series or DataFrame

returns

• Information from a print()

1.2 Examples

import researchpy as rp
import pandas as pd
import patsy

pdf = pd.DataFrame(patsy.demo_data("y", "x", "age", "disease",
                                   nlevels=3, min_rows=20))

rp.codebook(pdf)


CHAPTER TWO

SUMMARY_CONT()

Returns a data table as a Pandas DataFrame that includes the variable name, total number of non-missing observations, mean, standard deviation, standard error, and the 95% confidence interval. This is compatible with Pandas Series, DataFrame, and GroupBy objects.

2.1 Arguments

summary_cont(group1, conf = 0.95, decimals = 4)

• group1, must either be a Pandas Series or a DataFrame with multiple columns stated
• conf, must be entered in decimal format. The default confidence interval calculated is 95%
• decimals, rounds the output table to the specified number of decimals

returns

• Pandas DataFrame
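The confidence interval column is the mean plus or minus a critical value times the standard error. That arithmetic can be sketched in plain Python as follows (an illustration only, not researchpy's implementation; it uses a normal-approximation critical value of 1.96, whereas researchpy uses the t distribution, so its bounds will differ slightly for small samples):

```python
import math
import statistics

def summary_row(values, z=1.96):
    """Compute N, mean, SD, SE, and an approximate 95% CI for one variable."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)   # sample standard deviation (n - 1 denominator)
    se = sd / math.sqrt(n)          # standard error of the mean
    return {"N": n, "Mean": mean, "SD": sd, "SE": se,
            "95% Conf. Interval": (mean - z * se, mean + z * se)}

row = summary_row([2, 4, 4, 4, 5, 5, 7, 9])
```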

2.2 Examples

import numpy, pandas, researchpy

numpy.random.seed(12345678)

df = pandas.DataFrame(numpy.random.randint(10, size=(100, 2)),
                      columns=['healthy', 'non-healthy'])
df['tx'] = ""
df['tx'].iloc[0:50] = "Placebo"
df['tx'].iloc[50:101] = "Experimental"

df['dose'] = ""
df['dose'].iloc[0:26] = "10 mg"
df['dose'].iloc[26:51] = "25 mg"
df['dose'].iloc[51:76] = "10 mg"
df['dose'].iloc[76:101] = "25 mg"

# Summary statistics for a Series (single variable)
researchpy.summary_cont(df['healthy'])


# Summary statistics for multiple Series
researchpy.summary_cont(df[['healthy', 'non-healthy']])

# Easy to export results; assign to a Python object, which will have
# the Pandas DataFrame class
results = researchpy.summary_cont(df[['healthy', 'non-healthy']])
results.to_csv("results.csv", index=False)

# This works with GroupBy objects as well
researchpy.summary_cont(df['healthy'].groupby(df['tx']))

# Even with a GroupBy object with a hierarchical index
researchpy.summary_cont(df.groupby(['tx', 'dose'])[['healthy', 'non-healthy']])

# Above is the default output; to stack the group results
# above/below each other, use .apply()
df.groupby(['tx', 'dose'])[['healthy', 'non-healthy']].apply(researchpy.summary_cont)

CHAPTER THREE

SUMMARY_CAT()

Returns a data table as a Pandas DataFrame that includes the counts and percentages of each category. If there are missing data present (numpy.nan), they will be excluded from the counts. However, if the missing data is coded as a string, it will be included as its own category.

3.1 Arguments

summary_cat(group1, ascending = False)

• group1, can be a Pandas Series or a DataFrame with multiple columns stated
• ascending, determines whether the output is sorted in ascending order. Default is descending

returns

• Pandas DataFrame

3.2 Examples

import numpy, pandas, researchpy

numpy.random.seed(123)

df = pandas.DataFrame(numpy.random.randint(2, size=(101, 2)),
                      columns=['disease', 'treatment'])

# Handles a single Pandas Series
researchpy.summary_cat(df['disease'])

# Can handle multiple Series, although the output is not pretty
researchpy.summary_cat(df[['disease', 'treatment']])

# If missing is a string, it will show up as its own category
df['disease'][0] = ""
researchpy.summary_cat(df['disease'])


# However, if missing is a numpy.nan, it will be excluded from the counts
df['disease'][0] = numpy.nan
researchpy.summary_cat(df['disease'])

# Results can easily be exported using many methods, including the default
# Pandas exporting methods
results = researchpy.summary_cat(df['disease'])

results.to_csv("summary_cats.csv", index=False)

# This is the default (descending), shown for comparison with the
# ascending output immediately below
researchpy.summary_cat(df['disease'], ascending=False)

researchpy.summary_cat(df['disease'], ascending=True)

CHAPTER FOUR

DIFFERENCE_TEST()

Conducts a few different statistical tests which test for a difference between independent or related samples, with or without equal variances, and has the ability to calculate the effect size of the observed difference. The data is returned in a Pandas DataFrame by default, but can be returned as a dictionary if specified. This method is similar to researchpy.ttest(), except it allows the user to use the formula syntax. This method can perform the following tests:

• Independent sample t-test []
  – pseudo-code: difference_test(formula_like, data, equal_variances = True, independent_samples = True)
• Paired sample t-test []
  – pseudo-code: difference_test(formula_like, data, equal_variances = True, independent_samples = False)
• Welch's t-test []
  – pseudo-code: difference_test(formula_like, data, equal_variances = False, independent_samples = True)
• Wilcoxon ranked-sign test []
  – By default, discards all zero-differences; this is known as the 'wilcox' method.
  – pseudo-code: difference_test(formula_like, data, equal_variances = False, independent_samples = False)

2 objects will be returned for all available tests; the first object will be a descriptive summary table and the second will be the testing result information, which will include the effect size measures if indicated.

4.1 Arguments

difference_test(formula_like, data = {}, conf_level = 0.95, equal_variances = True, independent_samples = True, wilcox_parameters = {"zero_method" : "wilcox", "correction" : False, "mode" : "auto"}, **keywords)

• formula_like: A valid formula; for example, "DV ~ IV".
• data: data to perform the analysis on - contains the dependent and independent variables.
• conf_level: Specify the confidence interval to be calculated.
• equal_variances: Boolean to indicate if equal variances are assumed.
• independent_samples: Boolean to indicate if the groups are independent of each other.


• wilcox_parameters: A dictionary with optional methods for calculating the Wilcoxon signed-rank test. For more information, see scipy.stats.wilcoxon.

conduct(return_type = "Dataframe", effect_size = None)

• return_type: Specify if the results should be returned as a Pandas DataFrame (default) or a Python dictionary (= 'Dictionary').
• effect_size: Specify if effect sizes should be calculated; default value is None.
  – Available options are: None, "Cohen's D", "Hedge's G", "Glass's delta1", "Glass's delta2", "r", and/or "all".
  – User can specify any combination of effect sizes, or use "all" which will calculate all effect sizes.
  – Only effect size "r" is supported for the Wilcoxon ranked-sign test.

returns

• 2 objects will be returned within a tuple:
  – First object will be the descriptive summary information.
  – Second object will be the statistical testing information.

Note: This can be a one step or a two step process.

One step:

difference_test("DV ~ IV", data).conduct()

Two step:

model = difference_test("DV ~ IV", data)
model.conduct()

4.1.1 Effect size measures formulas

Cohen’s ds (between subjects design)

Cohen’s ds [] for a between groups design is calculated with the following equation:

$$d_s = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{(n_1 - 1)SD_1^2 + (n_2 - 1)SD_2^2}{n_1 + n_2 - 2}}}$$
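Expressed as a minimal pure-Python sketch (the helper name cohens_ds is illustrative, not part of researchpy's API; researchpy computes this internally):

```python
import math
import statistics

def cohens_ds(x1, x2):
    """Cohen's d_s: mean difference divided by the pooled standard deviation."""
    n1, n2 = len(x1), len(x2)
    sd1, sd2 = statistics.stdev(x1), statistics.stdev(x2)
    pooled_sd = math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2)
                          / (n1 + n2 - 2))
    return (statistics.mean(x1) - statistics.mean(x2)) / pooled_sd
```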

Cohen’s dav (within subject design)

Another version of Cohen’s d is used in within subject designs. This is noted by the subscript “av”. The formula for Cohen’s dav [] is as follows:

$$d_{av} = \frac{M_{diff}}{\frac{SD_1 + SD_2}{2}}$$
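In code, with M_diff taken as the mean of the paired differences (a sketch only; the post-minus-pre sign convention is an assumption of this helper, flip it if you define the difference the other way):

```python
import statistics

def cohens_dav(pre, post):
    """Cohen's d_av: mean paired difference over the average of the two SDs."""
    diffs = [b - a for a, b in zip(pre, post)]  # post minus pre (assumed convention)
    return statistics.mean(diffs) / ((statistics.stdev(pre)
                                      + statistics.stdev(post)) / 2)
```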


Hedges’s gs (between subjects design)

Cohen’s ds gives a biased estimate of the effect size for a population and Hedges and Olkin [] provides an unbiased estimation. The differences between Hedges’s g and Cohen’s d is negligible when sample sizes are above 20,butitis still preferable to report Hedges’s g []. Hedge’s gs is calculated using the following formula: 3 Hedges’s g푠 = Cohen’s d푠 × (1 − ) 4(푛1 + 푛2 − 9)

Hedges’s gav (within subjects design)

Cohen’s dav gives a biased estimate of the effect size for a population and Hedges and Olkin [] provides a correction to be applied to provide an unbiased estimate. Hedge’s gav is calculated using the following formula [] : 3 Hedges’s g푎푣 = Cohen’s d푎푣 × (1 − ) 4(푛1 + 푛2 − 9)

Glass’s ∆ (between or within subjects design)

Glass’s ∆ is the mean differences between the two groups divided by the standard deviation of the first condition/group or by the second condition/group. When used in a within subjects design, it is recommended to use the pre- standard deviation in the denominator []; the following formulas are used to calculate Glass’s ∆:

$$\Delta_1 = \frac{\bar{x}_1 - \bar{x}_2}{SD_1} \qquad \Delta_2 = \frac{\bar{x}_1 - \bar{x}_2}{SD_2}$$
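Both variants divide the same mean difference by a different group's standard deviation, illustrated here in plain Python (the helper name is illustrative, not researchpy's API):

```python
import statistics

def glass_deltas(x1, x2):
    """Glass's delta1 and delta2: mean difference scaled by each group's own SD."""
    mean_diff = statistics.mean(x1) - statistics.mean(x2)
    return mean_diff / statistics.stdev(x1), mean_diff / statistics.stdev(x2)
```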

Point-Biserial correlation coefficient r (between or within subjects design)

The following formula is used to calculate the Point-Biserial correlation coefficient r using the t-value and degrees of freedom:

$$r = \frac{t}{\sqrt{t^2 + df}}$$
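This conversion only needs the reported t statistic and degrees of freedom, so it can be checked by hand (a pure-Python sketch):

```python
import math

def r_from_t(t, df):
    """Point-biserial r recovered from a t statistic and its degrees of freedom."""
    return t / math.sqrt(t ** 2 + df)
```

For example, the independent t-test output later in this chapter reports t = 1.031736 with 198 degrees of freedom, which gives r of roughly 0.0731.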

The following formula is used to calculate the Point-Biserial correlation coefficient r using the W-value and N. This formula is used to calculate the r coefficient for the Wilcoxon ranked-sign test.

$$r = \sqrt{\frac{W}{\sum \text{rank}}}$$

4.2 Examples

First let’s create an example data set to work through the examples. This will be done using numpy (to create fake data) and pandas (to hold the data in a data frame).


import numpy, pandas, researchpy

numpy.random.seed(12345678)

df = pandas.DataFrame(numpy.random.randint(10, size=(100, 2)),
                      columns=['No', 'Yes'])
df["id"] = range(1, df.shape[0] + 1)
df.head()

   No  Yes  id
0   3    2   1
1   4    1   2
2   0    1   3
3   8    2   4
4   6    6   5

If one has data like this and doesn't want to reshape it, then researchpy.difference_test() will not work and one should use researchpy.ttest() instead. However, moving forward researchpy will be going in the direction of formula syntax style input, and it is recommended to get comfortable using this approach if one plans to use researchpy in the future.

Currently the data is in a wide format and it needs to be in a long format, i.e. one variable with the dependent variable data and another with the independent variable data. The current data structure won't work and it needs to be reshaped; there are a few ways to do this, one of which is shown below.

df2 = pandas.melt(df, id_vars="id", value_vars=["No", "Yes"],
                  var_name="Exercise", value_name="StressReactivity")
df2.head()

   id Exercise  StressReactivity
0   1       No                 3
1   2       No                 4
2   3       No                 0
3   4       No                 8
4   5       No                 6

Now the data is in the correct structure.

# Independent t-test

# If you don't store the 2 returned DataFrames, it outputs as a tuple and
# is displayed
difference_test("StressReactivity ~ C(Exercise)", data=df2,
                equal_variances=True,
                independent_samples=True).conduct(effect_size="all")

(      Variable      N   Mean        SD        SE  95% Conf. Interval
0       healthy  100.0  4.590  2.749086  0.274909  4.044522  5.135478
1   non-healthy  100.0  4.160  3.132495  0.313250  3.538445  4.781555
2      combined  200.0  4.375  2.947510  0.208420  3.964004  4.785996,
                     Independent t-test    results
0   Difference (healthy - non-healthy) =    0.4300
1                  Degrees of freedom =   198.0000
2                                   t =     1.0317
3               Two side test p value =     0.3035
4              Difference < 0 p value =     0.8483
5              Difference > 0 p value =     0.1517
6                           Cohen's d =     0.1459
7                           Hedge's g =     0.1454
8                       Glass's delta =     0.1564
9                                   r =     0.0731)

# Otherwise you can store them as objects
summary, results = difference_test("StressReactivity ~ C(Exercise)", data=df2,
                                   equal_variances=True,
                                   independent_samples=True).conduct(effect_size="all")

summary

       Name    N   Mean  Variance       SD        SE  95% Conf. Interval
0        No  100  4.590   7.55747  2.74909  0.274909   4.044522  5.135478
1       Yes  100  4.160   9.81253   3.1325  0.313250   3.538445  4.781555
2  combined  200  4.375   8.68781  2.94751  0.208420   3.964004  4.785996
3      diff        0.430                    0.416773  -0.391884  1.251884

results

      Independent samples t-test     Results
0          Difference (No - Yes)    0.430000
1           Degrees of freedom =  198.000000
2                            t =    1.031736
3       Two sided test p-value =    0.303454
4       Difference < 0 p-value =    0.848273
5       Difference > 0 p-value =    0.151727
6                    Cohen's Ds    0.145909
7                     Hedge's G    0.145356
8                Glass's delta1    0.156416
9                Glass's delta2    0.137271
10             Point-Biserial r    0.073126

# Paired samples t-test
summary, results = difference_test("StressReactivity ~ C(Exercise)", data=df2,
                                   equal_variances=True,
                                   independent_samples=False).conduct(effect_size="all")

summary


   Name    N  Mean  Variance        SD        SE  95% Conf. Interval
0    No  100  4.59   7.55747  2.749086  0.274909   4.044522  5.135478
1   Yes  100  4.16   9.81253  3.132495  0.313250   3.538445  4.781555
3  diff        0.43            4.063275  0.406327  -0.376242  1.236242

results

          Paired samples t-test    Results
0         Difference (No - Yes)   0.430000
1          Degrees of freedom =  99.000000
2                           t =   1.058260
3      Two sided test p-value =   0.292512
4      Difference < 0 p-value =   0.853744
5      Difference > 0 p-value =   0.146256
6                   Cohen's Dav   0.146219
7                   Hedge's Gav   0.145665
8                Glass's delta1   0.156416
9                Glass's delta2   0.137271
10             Point-Biserial r   0.105763

# Welch's t-test
summary, results = difference_test("StressReactivity ~ C(Exercise)", data=df2,
                                   equal_variances=False,
                                   independent_samples=True).conduct(effect_size="all")

summary

       Name    N   Mean  Variance       SD        SE  95% Conf. Interval
0        No  100  4.590   7.55747  2.74909  0.274909   4.044522  5.135478
1       Yes  100  4.160   9.81253   3.1325  0.313250   3.538445  4.781555
2  combined  200  4.375   8.68781  2.94751  0.208420   3.964004  4.785996
3      diff        0.430                    0.416773  -0.391919  1.251919

results

                Welch's t-test     Results
0        Difference (No - Yes)    0.430000
1         Degrees of freedom =  196.651845
2                          t =    1.031736
3     Two sided test p-value =    0.303476
4     Difference < 0 p-value =    0.848268
5     Difference > 0 p-value =    0.151732
6                   Cohen's Ds    0.145909
7                    Hedge's G    0.145356
8               Glass's delta1    0.156416
9               Glass's delta2    0.137271
10            Point-Biserial r    0.073375


# Wilcoxon signed-rank test
summary, results = difference_test("StressReactivity ~ C(Exercise)", data=df2,
                                   equal_variances=False,
                                   independent_samples=False).conduct(effect_size="r")

summary

Name N Mean Variance SD SE 95% Conf. Interval 0 No 100 4.59 7.55747 2.74909 0.274909 4.044522 5.135478 1 Yes 100 4.16 9.81253 3.1325 0.313250 3.538445 4.781555

results

  Wilcoxon signed-rank test    Results
0                (No = Yes)
1                       W =    1849.5
2       Two sided p-value =  0.333755
3          Point-Biserial r  0.366238

# Exporting descriptive table (summary) and result table (results) to the same
# csv file
summary.to_csv("C:\\Users\\...\\test.csv", index=False)
results.to_csv("C:\\Users\\...\\test.csv", index=False, mode='a')

4.3 References

• scipy.stats.chi2_contingency. The SciPy community, 2016. Retrieved when last updated May 12, 2016. URL: http://lagrange.univ-lyon1.fr/docs/scipy/0.17.1/generated/scipy.stats.chi2_contingency.html.
• scipy.stats.fisher_exact. The SciPy community, 2018. Retrieved when last updated on May 5, 2018. URL: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.fisher_exact.html.
• scipy.stats.kendalltau. The SciPy community, 2018. Retrieved when last updated on May 5, 2018. URL: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kendalltau.html.
• scipy.stats.pearsonr. The SciPy community, 2018. Retrieved when last updated on May 11, 2014. URL: https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.pearsonr.html.
• scipy.stats.spearmanr. The SciPy community, 2018. Retrieved when last updated on May 11, 2014. URL: https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.spearmanr.html.
• scipy.stats.ttest_ind. The SciPy community, 2018. Retrieved when last updated on May 5, 2018. URL: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html.
• scipy.stats.ttest_rel. The SciPy community, 2018. Retrieved when last updated on May 5, 2018. URL: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_rel.html.
• scipy.stats.wilcoxon. The SciPy community, 2018. Retrieved when last updated on May 5, 2018. URL: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.wilcoxon.html.


• statsmodels.stats.contingency_tables.mcnemar. Statsmodels-developers, 2018. URL: https://www.statsmodels.org/dev/generated/statsmodels.stats.contingency_tables.mcnemar.html.
• Jacob Cohen. Statistical Power Analysis for the Behavioral Sciences. Lawrence Erlbaum Associates, second edition, 1988. ISBN 0-8058-0283-5.
• Harald Cramér. Mathematical Methods of Statistics (PMS-9). Volume 9. Princeton University Press, 2016.
• Larry Hedges and Ingram Olkin. Statistical Methods in Meta-Analysis. Academic Press, Inc., 1985. 10.2307/1164953.
• Rex B. Kline. Beyond Significance Testing: Reforming Data Analysis Methods in Behavioral Research. American Psychological Association, 2004. http://dx.doi.org/10.1037/10693-000.
• Daniel Lakens. Calculating and reporting effect sizes to facilitate cumulative science: a practical primer for t-tests and ANOVAs. Frontiers in Psychology, November 2013. doi:10.3389/fpsyg.2013.00863.
• Robert Rosenthal. Parametric measures of effect size. In The Handbook of Research Synthesis, pages 231–244. New York, NY: Russell Sage Foundation, 1994.

CHAPTER FIVE

TTEST()

Returns data tables as Pandas DataFrames with relevant information pertaining to the statistical test conducted. Returns 2 DataFrames so all information can easily be exported, except for the Wilcoxon ranked-sign test, for which only 1 DataFrame is returned.

DataFrame 1 (all except Wilcoxon ranked-sign test) has summary statistic information including the variable name, total number of non-missing observations, mean, standard deviation, standard error, and the 95% confidence interval. This is the same information returned from the summary_cont() method.

DataFrame 2 (all except Wilcoxon ranked-sign test) has the test results for the statistical tests. Included in this are the effect size measures r, Cohen's d, Hedge's g, and Glass's ∆ for the independent sample t-test, paired sample t-test, and Welch's t-test.

For the Wilcoxon ranked-sign test, the returned DataFrame contains the mean for both comparison points, the T-value, the Z-value, the two-sided p-value, and the effect size measure r.

This method can perform the following tests:

• Independent sample t-test [],
• Paired sample t-test [],
• Welch's t-test [], and
• Wilcoxon ranked-sign test []

5.1 Arguments

ttest(group1, group2, group1_name = None, group2_name = None, equal_variances = True, paired = False, correction = None)

• group1 and group2, requires the data to be a Pandas Series
• group1_name and group2_name, will override the Series name
• equal_variances, tells whether equal variances are assumed or not. If not, Welch's t-test is used if the data is unpaired, or the Wilcoxon ranked-sign test is used if the data is paired. The default is True.
• paired, tells whether the data is paired. If the data is paired and equal variance is assumed, a paired sample t-test is conducted; otherwise a Wilcoxon ranked-sign test is conducted. The default is False.

returns

• 2 Pandas DataFrames as a tuple:
  – First returned DataFrame is the summary statistics
  – Second returned DataFrame is the test results.


  – For the Wilcoxon ranked-sign test, only 1 DataFrame is returned

Note: Wilcoxon ranked-sign test: a 0 difference between the 2 groups is discarded from the calculation. This is the 'wilcox' method of scipy.stats.wilcoxon.
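The equal_variances and paired flags jointly select one of the four tests described above; the decision logic can be summarized as a small lookup (the helper which_test is illustrative only, not part of researchpy):

```python
def which_test(equal_variances=True, paired=False):
    """Return the test ttest() conducts for a given flag combination."""
    table = {
        (True, False): "Independent sample t-test",
        (True, True): "Paired sample t-test",
        (False, False): "Welch's t-test",
        (False, True): "Wilcoxon ranked-sign test",
    }
    return table[(equal_variances, paired)]
```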

5.1.1 Effect size measures formulas

Cohen’s ds (between subjects design)

Cohen’s ds [] for a between groups design is calculated with the following equation:

$$d_s = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{(n_1 - 1)SD_1^2 + (n_2 - 1)SD_2^2}{n_1 + n_2 - 2}}}$$

Hedges’s gs (between subjects design)

Cohen’s ds gives a biased estimate of the effect size for a population and Hedges and Olkin [] provides an unbiased estimation. The differences between Hedges’s g and Cohen’s d is negligible when sample sizes are above 20,butitis still preferable to report Hedges’s g []. Hedge’s gs is calculated using the following formula: 3 Hedges’s g푠 = Cohen’s d푠 × (1 − ) 4(푛1 + 푛2 − 9)

Glass’s ∆ (between or within subjects design)

Glass’s ∆ is the mean differences between the two groups divided by the standard deviation of the control group. When used in a within subjects design, it is recommended to use the pre- standard deviation in the denominator []; the following formula is used to calculate Glass’s ∆: (¯푥 − 푥¯ ) ∆ = 1 2 푆퐷1

Cohen’s dz (within subject design)

Another version of Cohen’s d is used in within subject designs. This is noted by the subscript “z”. The formula for Cohen’s dz [] is as follows:

$$d_z = \frac{M_{diff}}{\sqrt{\frac{\sum (X_{diff} - M_{diff})^2}{N - 1}}}$$
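Given the vector of paired difference scores, d_z is their mean over their sample standard deviation, which can be sketched in plain Python (illustration only, not researchpy's implementation):

```python
import math
import statistics

def cohens_dz(diffs):
    """Cohen's d_z: mean of the difference scores over their standard deviation."""
    m = statistics.mean(diffs)
    sd = math.sqrt(sum((x - m) ** 2 for x in diffs) / (len(diffs) - 1))
    return m / sd
```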

Pearson correlation coefficient r (between or within subjects design)

Rosenthal [] provided the following formula to calculate the Pearson correlation coefficient r using the t-value and degrees of freedom:

$$r = \sqrt{\frac{t^2}{t^2 + df}}$$


Rosenthal [] provided the following formula to calculate the Pearson correlation coefficient r using the z-value and N. This formula is used to calculate the r coefficient for the Wilcoxon ranked-sign test.

$$r = \frac{Z}{\sqrt{N}}$$
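In code (a sketch; Z here is the standardized test statistic and N the number of observations, as in the formula above):

```python
import math

def r_from_z(z, n):
    """Effect size r for the Wilcoxon ranked-sign test from a z-value and N."""
    return z / math.sqrt(n)
```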

5.2 Examples

import numpy, pandas, researchpy

numpy.random.seed(12345678)

df = pandas.DataFrame(numpy.random.randint(10, size=(100, 2)),
                      columns=['healthy', 'non-healthy'])

# Independent t-test

# If you don't store the 2 returned DataFrames, it outputs as a tuple and
# is displayed
researchpy.ttest(df['healthy'], df['non-healthy'])

(      Variable      N   Mean        SD        SE  95% Conf. Interval
0       healthy  100.0  4.590  2.749086  0.274909  4.044522  5.135478
1   non-healthy  100.0  4.160  3.132495  0.313250  3.538445  4.781555
2      combined  200.0  4.375  2.947510  0.208420  3.964004  4.785996,
                     Independent t-test    results
0   Difference (healthy - non-healthy) =    0.4300
1                  Degrees of freedom =   198.0000
2                                   t =     1.0317
3               Two side test p value =     0.3035
4              Difference < 0 p value =     0.8483
5              Difference > 0 p value =     0.1517
6                           Cohen's d =     0.1459
7                           Hedge's g =     0.1454
8                       Glass's delta =     0.1564
9                                   r =     0.0731)

# Otherwise you can store them as objects
des, res = researchpy.ttest(df['healthy'], df['non-healthy'])

des

res

# Paired samples t-test
des, res = researchpy.ttest(df['healthy'], df['non-healthy'], paired=True)

des


res

# Welch's t-test
des, res = researchpy.ttest(df['healthy'], df['non-healthy'], equal_variances=False)

des

res

# Wilcoxon signed-rank test
researchpy.ttest(df['healthy'], df['non-healthy'], equal_variances=False, paired=True)

# Exporting descriptive table (des) and result table (res) to the same
# csv file
des, res = researchpy.ttest(df['healthy'], df['non-healthy'])

des.to_csv("C:\\Users\\...\\test.csv", index=False)
res.to_csv("C:\\Users\\...\\test.csv", index=False, mode='a')

5.3 References

• scipy.stats.chi2_contingency. The SciPy community, 2016. Retrieved when last updated May 12, 2016. URL: http://lagrange.univ-lyon1.fr/docs/scipy/0.17.1/generated/scipy.stats.chi2_contingency.html.
• scipy.stats.fisher_exact. The SciPy community, 2018. Retrieved when last updated on May 5, 2018. URL: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.fisher_exact.html.
• scipy.stats.kendalltau. The SciPy community, 2018. Retrieved when last updated on May 5, 2018. URL: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kendalltau.html.
• scipy.stats.pearsonr. The SciPy community, 2018. Retrieved when last updated on May 11, 2014. URL: https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.pearsonr.html.
• scipy.stats.spearmanr. The SciPy community, 2018. Retrieved when last updated on May 11, 2014. URL: https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.spearmanr.html.
• scipy.stats.ttest_ind. The SciPy community, 2018. Retrieved when last updated on May 5, 2018. URL: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html.
• scipy.stats.ttest_rel. The SciPy community, 2018. Retrieved when last updated on May 5, 2018. URL: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_rel.html.
• scipy.stats.wilcoxon. The SciPy community, 2018. Retrieved when last updated on May 5, 2018. URL: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.wilcoxon.html.
• statsmodels.stats.contingency_tables.mcnemar. Statsmodels-developers, 2018. URL: https://www.statsmodels.org/dev/generated/statsmodels.stats.contingency_tables.mcnemar.html.
• Jacob Cohen. Statistical Power Analysis for the Behavioral Sciences. Lawrence Erlbaum Associates, second edition, 1988. ISBN 0-8058-0283-5.
• Harald Cramér. Mathematical Methods of Statistics (PMS-9). Volume 9. Princeton University Press, 2016.


• Larry Hedges and Ingram Olkin. Statistical Methods in Meta-Analysis. Academic Press, Inc., 1985. 10.2307/1164953.
• Rex B. Kline. Beyond Significance Testing: Reforming Data Analysis Methods in Behavioral Research. American Psychological Association, 2004. http://dx.doi.org/10.1037/10693-000.
• Daniel Lakens. Calculating and reporting effect sizes to facilitate cumulative science: a practical primer for t-tests and ANOVAs. Frontiers in Psychology, November 2013. doi:10.3389/fpsyg.2013.00863.
• Robert Rosenthal. Parametric measures of effect size. In The Handbook of Research Synthesis, pages 231–244. New York, NY: Russell Sage Foundation, 1994.


CHAPTER SIX

CROSSTAB()

Returns up to 3 DataFrames, depending on what is desired. Can calculate row, column, or cell percentages if requested; otherwise, counts are returned as the default. DataFrame 1 is always the crosstabulation results; the other 2 DataFrames returned depend on the options selected, which are determined by the arguments test and expected_freqs. If all 3 options are returned, then the order of the returned DataFrames is as follows: crosstabulation results, χ2 test results with an effect size of Cramer's Phi or V depending on the size of the table, and the expected frequency table.

6.1 Arguments

crosstab(group1, group2, prop = None, test = False, margins = True, correction = None, cramer_correction = None, exact = False, expected_freqs = False)

• group1 and group2, requires the data to be a Pandas Series
• prop, can either be 'row', 'col', or 'cell'. 'row' will calculate the row percentages, 'col' will calculate the column percentages, and 'cell' will calculate the cell percentage based on the entire sample
• test, can take "chi-square", "g-test", "mcnemar", or "fisher".
  – If "chi-square", the chi-square (χ2) test of independence [] will be calculated and returned in a second DataFrame.
  – If "g-test", will conduct the G-test (likelihood-ratio χ2) [] and the results will be returned in a second DataFrame.
  – If "fisher", will conduct Fisher's exact test [].
  – If "mcnemar", will conduct the McNemar χ2 [] test for paired nominal data.
• margins, if False will return a crosstabulation table without the total counts for each group. This argument is only supported for counts; the margins will always be returned for the percentages
• correction, if True, applies the Yates' correction for continuity. Valid argument for "chi-square", "g-test", and "mcnemar".
• cramer_correction, if True, applies the bias correction developed by Tschuprow (1925) to Cramer's V.
• exact, is only a valid option when the "mcnemar" test is selected. In that case, exact = True means the binomial distribution will be used. If False (default), the χ2 distribution is used.
• expected_freqs, if True, will return a DataFrame that contains the expected counts for each cell. Not a valid argument for the mcnemar test.

returns

• Up to 3 Pandas DataFrames as a tuple:


  – First DataFrame is always the crosstab table with either the counts, cell, row, or column percentages
  – Second DataFrame is either the test results or the expected frequencies. If a test is selected and expected frequencies are desired, the second DataFrame will be the test results; otherwise, if just expected frequencies are desired, the second DataFrame will be that and there will not be a third DataFrame returned.
  – Third DataFrame is always the expected frequencies

Note: If conducting a McNemar test, make sure the outcomes in both variables are labelled the same.

6.2 Effect size measures formulas

Note: If adjusted 휒2 values are used in the test’s calculation, then those adjusted 휒2 values are also used to calculate effect size.

6.2.1 Cramer’s Phi (2x2 table)

For analyses were it’s a 2x2 table, the following formula is used to calculate Cramer’s Phi (휑) []: √︂ 휒2 휑 = 푁 Where N = total number of observations in the analysis

6.2.2 Cramer’s V (RxC where R or C > 2)

For analyses were it’s a table that is larger than a 2x2, the following formula is used to calculate Cramer’s V []: √︃ 휒2 푉 = (푁 * (푘 − 1))

Where k is the number of categories of either R or C (whichever has fewer categories). The bias-corrected version of Cramer's V is:

$$\tilde{V} = \sqrt{\frac{\tilde{\varphi}^2}{\min(\tilde{r} - 1, \tilde{c} - 1)}}$$

Where r is the number of rows and c is the number of columns, and

$$\tilde{\varphi}^2 = \max\left(0, \frac{\chi^2}{n} - \frac{(c - 1)(r - 1)}{n - 1}\right)$$

$$\tilde{c} = c - \frac{(c - 1)^2}{n - 1} \qquad \tilde{r} = r - \frac{(r - 1)^2}{n - 1}$$
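The corrected statistic only needs the chi-square value, the sample size, and the table shape, so it can be sketched directly from the formulas above (an illustration of the formulas, not researchpy's internal code, which applies this when cramer_correction = True):

```python
import math

def cramers_v_corrected(chi2, n, r, c):
    """Bias-corrected Cramer's V from chi-square, sample size, and table shape."""
    phi2_tilde = max(0.0, chi2 / n - (c - 1) * (r - 1) / (n - 1))
    r_tilde = r - (r - 1) ** 2 / (n - 1)
    c_tilde = c - (c - 1) ** 2 / (n - 1)
    return math.sqrt(phi2_tilde / min(r_tilde - 1, c_tilde - 1))
```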


6.3 Examples

import researchpy, pandas, numpy

numpy.random.seed(123)

df = pandas.DataFrame(numpy.random.randint(3, size=(101, 3)),
                      columns=['disease', 'severity', 'alive'])
df.head()

# If only two Series are passed, it will output a crosstabulation with margin totals.
# This is the same as pandas.crosstab(), except researchpy.crosstab() returns
# a table with hierarchical indexing for a better exporting format style.
researchpy.crosstab(df['disease'], df['alive'])

# Demonstration of calculating cell proportions
crosstab = researchpy.crosstab(df['disease'], df['alive'], prop="cell")
crosstab

# Demonstration of calculating row proportions crosstab= researchpy.crosstab(df[ 'disease'], df['alive'], prop="row") crosstab

# Demonstration of calculating column proportions crosstab= researchpy.crosstab(df[ 'disease'], df['alive'], prop="col") crosstab

# To conduct a Chi-square test of independence, pass "chi-square" in the "test=" argument.
# This will also output an effect size: either Cramer's Phi if it is a 2x2 table, or
# Cramer's V if the table is larger than 2x2.

# This will return 2 DataFrames as a tuple, one with the crosstabulation and the other
# with the test results. It's rather ugly; the recommended way to output is in the next example.
researchpy.crosstab(df['disease'], df['alive'], test="chi-square")

( alive        0   1   2  All
 disease
 0             9  14   7   30
 1             7   9  15   31
 2             7  17  16   40
 All          23  40  38  101,
                 Chi-square test  results
 0    Pearson Chi-square ( 4.0) =  5.1573
 1                      p-value =  0.2715
 2                   Cramer's V =  0.3196)

# To clean up the output, assign each DataFrame to an object. This allows
# for a cleaner view and each DataFrame to be exported
crosstab, res = researchpy.crosstab(df['disease'], df['alive'], test="chi-square")
crosstab
res

# To get the expected frequencies, pass "True" in "expected_freqs="
crosstab, res, expected = researchpy.crosstab(df['disease'], df['alive'],
                                              test="chi-square",
                                              expected_freqs=True)
expected

# Can also conduct the G-test (likelihood-ratio chi-square)
crosstab, res = researchpy.crosstab(df['disease'], df['alive'], test="g-test")
res

# Can also conduct Fisher's exact test

# Need 2x2 data for Fisher's test.
numpy.random.seed(345)

df = pandas.DataFrame(numpy.random.randint(2, size=(90, 2)),
                      columns=['tx', 'cured'])

crosstab, res = researchpy.crosstab(df['tx'], df['cured'], test="fisher")
crosstab
res

# Lastly, the McNemar test
# Make sure your outcomes are labelled the same in
# both variables
numpy.random.seed(345)

df = pandas.DataFrame(numpy.random.randint(2, size=(90, 2)),
                      columns=['time1', 'time2'])



crosstab, res = researchpy.crosstab(df['time1'], df['time2'], test="mcnemar")
crosstab
res

6.4 References

• scipy.stats.chi2_contingency. The SciPy community, 2016. Retrieved when last updated May 12, 2016. URL: http://lagrange.univ-lyon1.fr/docs/scipy/0.17.1/generated/scipy.stats.chi2_contingency.html.
• scipy.stats.fisher_exact. The SciPy community, 2018. Retrieved when last updated on May 5, 2018. URL: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.fisher_exact.html.
• scipy.stats.kendalltau. The SciPy community, 2018. Retrieved when last updated on May 5, 2018. URL: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kendalltau.html.
• scipy.stats.pearsonr. The SciPy community, 2018. Retrieved when last updated on May 11, 2014. URL: https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.pearsonr.html.
• scipy.stats.spearmanr. The SciPy community, 2018. Retrieved when last updated on May 11, 2014. URL: https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.spearmanr.html.
• scipy.stats.ttest_ind. The SciPy community, 2018. Retrieved when last updated on May 5, 2018. URL: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html.
• scipy.stats.ttest_rel. The SciPy community, 2018. Retrieved when last updated on May 5, 2018. URL: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_rel.html.
• scipy.stats.wilcoxon. The SciPy community, 2018. Retrieved when last updated on May 5, 2018. URL: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.wilcoxon.html.
• statsmodels.stats.contingency_tables.mcnemar. Statsmodels-developers, 2018. URL: https://www.statsmodels.org/dev/generated/statsmodels.stats.contingency_tables.mcnemar.html.
• Jacob Cohen. Statistical Power Analysis for the Behavioral Sciences. Lawrence Erlbaum Associates, second edition, 1988. ISBN 0-8058-0283-5.
• Harald Cramér. Mathematical Methods of Statistics (PMS-9). Volume 9. Princeton University Press, 2016.
• Larry Hedges and Ingram Olkin. Statistical Methods in Meta-Analysis. Academic Press, Inc., 1985. 10.2307/1164953.
• Rex B. Kline. Beyond Significance Testing: Reforming Data Analysis Methods in Behavioral Research. American Psychological Association, 2004. http://dx.doi.org/10.1037/10693-000.
• Daniel Lakens. Calculating and reporting effect sizes to facilitate cumulative science: a practical primer for t-tests and ANOVAs. Frontiers in Psychology, November 2013. doi:10.3389/fpsyg.2013.00863.
• Robert Rosenthal. Parametric measures of effect size. In The Handbook of Research Synthesis, pages 231–244. New York, NY: Russell Sage Foundation, 1994.


CHAPTER SEVEN

CORR_CASE()

Conducts Pearson (default method), Spearman rank, or Kendall’s Tau-b correlation analysis using case-wise deletion. Returns the relevant information and results in 3 DataFrames for easy exporting:
• DataFrame 1 is the testing information, i.e. the type of correlation analysis conducted and the number of observations used.
• DataFrame 2 contains the r values in a matrix layout.
• DataFrame 3 contains the p-values in a matrix layout.
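For intuition, case-wise (list-wise) deletion can be sketched with pandas; the data and variable names below are hypothetical, and researchpy performs this step internally:

```python
import numpy as np
import pandas as pd

# Hypothetical scores with scattered missing values
df = pd.DataFrame({
    "beck": [1.0, 2.0, np.nan, 4.0, 5.0],
    "srq":  [2.0, np.nan, 3.0, 5.0, 1.0],
})

# Case-wise deletion keeps only rows complete across every variable,
# so all correlations in the matrix share a single N
complete = df.dropna()
n_used = len(complete)
```

Because the same complete cases feed every pair of variables, the single "Total observations used" figure in DataFrame 1 applies to the whole matrix.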

7.1 Arguments

corr_case(dataframe, method="pearson")

• dataframe can either be a single Pandas Series or multiple Series/an entire DataFrame.
• method takes the values of "pearson" [] (the default if nothing is passed), "spearman" [], or "kendall" [].

7.2 Examples

import researchpy, numpy, pandas

numpy.random.seed(12345)

df = pandas.DataFrame(numpy.random.randint(10, size=(100, 2)),
                      columns=['beck', 'srq'])

# Since it returns 3 DataFrames for easy exporting, if the DataFrames
# aren't assigned to an object, the output tuple is rather messy
# and ugly
researchpy.correlation.corr_case(df[['beck', 'srq']])

(  Pearson correlation test using list-wise deletion
 0               Total observations used =  100,
         beck     srq
 beck       1  0.0029
 srq   0.0029       1,
         beck     srq
 beck  0.0000  0.9775
 srq   0.9775  0.0000)

# As noted above, the 3 returned DataFrames are information, r values,
# and p-values
# The 3 DataFrame design was decided on so each DataFrame can easily
# be exported using already supported Pandas methods
i, r, p = researchpy.correlation.corr_case(df[['beck', 'srq']])
i

   Pearson correlation test using list-wise deletion
0                Total observations used = 100

r

        beck     srq
beck       1  0.0029
srq   0.0029       1

p

        beck     srq
beck  0.0000  0.9775
srq   0.9775  0.0000

7.3 References

• scipy.stats.chi2_contingency. The SciPy community, 2016. Retrieved when last updated May 12, 2016. URL: http://lagrange.univ-lyon1.fr/docs/scipy/0.17.1/generated/scipy.stats.chi2_contingency.html.
• scipy.stats.fisher_exact. The SciPy community, 2018. Retrieved when last updated on May 5, 2018. URL: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.fisher_exact.html.
• scipy.stats.kendalltau. The SciPy community, 2018. Retrieved when last updated on May 5, 2018. URL: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kendalltau.html.
• scipy.stats.pearsonr. The SciPy community, 2018. Retrieved when last updated on May 11, 2014. URL: https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.pearsonr.html.
• scipy.stats.spearmanr. The SciPy community, 2018. Retrieved when last updated on May 11, 2014. URL: https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.spearmanr.html.
• scipy.stats.ttest_ind. The SciPy community, 2018. Retrieved when last updated on May 5, 2018. URL: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html.
• scipy.stats.ttest_rel. The SciPy community, 2018. Retrieved when last updated on May 5, 2018. URL: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_rel.html.
• scipy.stats.wilcoxon. The SciPy community, 2018. Retrieved when last updated on May 5, 2018. URL: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.wilcoxon.html.
• statsmodels.stats.contingency_tables.mcnemar. Statsmodels-developers, 2018. URL: https://www.statsmodels.org/dev/generated/statsmodels.stats.contingency_tables.mcnemar.html.
• Jacob Cohen. Statistical Power Analysis for the Behavioral Sciences. Lawrence Erlbaum Associates, second edition, 1988. ISBN 0-8058-0283-5.
• Harald Cramér. Mathematical Methods of Statistics (PMS-9). Volume 9. Princeton University Press, 2016.
• Larry Hedges and Ingram Olkin. Statistical Methods in Meta-Analysis. Academic Press, Inc., 1985. 10.2307/1164953.
• Rex B. Kline. Beyond Significance Testing: Reforming Data Analysis Methods in Behavioral Research. American Psychological Association, 2004. http://dx.doi.org/10.1037/10693-000.
• Daniel Lakens. Calculating and reporting effect sizes to facilitate cumulative science: a practical primer for t-tests and ANOVAs. Frontiers in Psychology, November 2013. doi:10.3389/fpsyg.2013.00863.
• Robert Rosenthal. Parametric measures of effect size. In The Handbook of Research Synthesis, pages 231–244. New York, NY: Russell Sage Foundation, 1994.


CHAPTER EIGHT

CORR_PAIR()

Conducts Pearson (default method), Spearman rank, or Kendall’s Tau-b correlation analysis using pair-wise deletion. Returns the relevant information and results in 1 DataFrame for easy exporting. The DataFrame contains the variables being compared in the index, followed by the corresponding r value, p-value, and N for the groups being compared.
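For intuition, pair-wise deletion can be sketched with pandas; the data and column names below are hypothetical, and researchpy performs this step internally:

```python
import numpy as np
import pandas as pd

# Hypothetical data; each column is missing in different rows
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 1.0, 3.0, np.nan, 4.0],
    "c": [5.0, np.nan, 2.0, 3.0, np.nan],
})

# Pair-wise deletion: each pair of variables keeps every row where
# both members of that pair are observed, so N can differ by pair
n_ab = df[["a", "b"]].dropna().shape[0]
n_ac = df[["a", "c"]].dropna().shape[0]
n_bc = df[["b", "c"]].dropna().shape[0]
```

This is why the returned DataFrame reports a separate N for each comparison, unlike corr_case(), where a single N applies to the whole matrix.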

8.1 Arguments

corr_pair(dataframe, method="pearson")

• dataframe can either be a single Pandas Series or multiple Series/an entire DataFrame.
• method takes the values of "pearson" [] (the default if nothing is passed), "spearman" [], or "kendall" [].

8.2 Examples

import researchpy, numpy, pandas

numpy.random.seed(12345)

df = pandas.DataFrame(numpy.random.randint(10, size=(100, 4)),
                      columns=['mental_score', 'physical_score',
                               'emotional_score', 'happiness_index'])

# Can pass the entire DataFrame or multiple Series
researchpy.correlation.corr_pair(df)

# Demonstrating how the output looks if there are different Ns for groups
df.loc[0:29, 'happiness_index'] = numpy.nan
researchpy.correlation.corr_pair(df)


8.3 References

• scipy.stats.chi2_contingency. The SciPy community, 2016. Retrieved when last updated May 12, 2016. URL: http://lagrange.univ-lyon1.fr/docs/scipy/0.17.1/generated/scipy.stats.chi2_contingency.html.
• scipy.stats.fisher_exact. The SciPy community, 2018. Retrieved when last updated on May 5, 2018. URL: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.fisher_exact.html.
• scipy.stats.kendalltau. The SciPy community, 2018. Retrieved when last updated on May 5, 2018. URL: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kendalltau.html.
• scipy.stats.pearsonr. The SciPy community, 2018. Retrieved when last updated on May 11, 2014. URL: https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.pearsonr.html.
• scipy.stats.spearmanr. The SciPy community, 2018. Retrieved when last updated on May 11, 2014. URL: https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.spearmanr.html.
• scipy.stats.ttest_ind. The SciPy community, 2018. Retrieved when last updated on May 5, 2018. URL: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html.
• scipy.stats.ttest_rel. The SciPy community, 2018. Retrieved when last updated on May 5, 2018. URL: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_rel.html.
• scipy.stats.wilcoxon. The SciPy community, 2018. Retrieved when last updated on May 5, 2018. URL: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.wilcoxon.html.
• statsmodels.stats.contingency_tables.mcnemar. Statsmodels-developers, 2018. URL: https://www.statsmodels.org/dev/generated/statsmodels.stats.contingency_tables.mcnemar.html.
• Jacob Cohen. Statistical Power Analysis for the Behavioral Sciences. Lawrence Erlbaum Associates, second edition, 1988. ISBN 0-8058-0283-5.
• Harald Cramér. Mathematical Methods of Statistics (PMS-9). Volume 9. Princeton University Press, 2016.
• Larry Hedges and Ingram Olkin. Statistical Methods in Meta-Analysis. Academic Press, Inc., 1985. 10.2307/1164953.
• Rex B. Kline. Beyond Significance Testing: Reforming Data Analysis Methods in Behavioral Research. American Psychological Association, 2004. http://dx.doi.org/10.1037/10693-000.
• Daniel Lakens. Calculating and reporting effect sizes to facilitate cumulative science: a practical primer for t-tests and ANOVAs. Frontiers in Psychology, November 2013. doi:10.3389/fpsyg.2013.00863.
• Robert Rosenthal. Parametric measures of effect size. In The Handbook of Research Synthesis, pages 231–244. New York, NY: Russell Sage Foundation, 1994.

CHAPTER NINE

INSTALLING RESEARCHPY

researchpy is available via pip and through conda.

To install using pip use:
• pip install researchpy

To install using conda use:
• conda install -c researchpy researchpy
