
Research Methodology and Statistical Methods


Morgan Shields www.edtechpress.co.uk

Published by ED-Tech Press, 54 Sun Street, Waltham Abbey Essex, United Kingdom, EN9 1EJ

© 2019 by ED-Tech Press. Reprinted 2020.

Research Methodology and Statistical Methods
Morgan Shields
Includes bibliographical references and index.
ISBN 978-1-78882-100-1

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without either the prior written permission of the publisher or a license permitting restricted copying in the United Kingdom issued by the Copyright Licensing Agency Ltd., Saffron House, 6-10 Kirby Street, London EC1N 8TS.

Trademark Notice: All trademarks used herein are the property of their respective owners. The use of any trademark in this text does not vest in the author or publisher any trademark ownership rights in such trademarks, nor does the use of such trademarks imply any affiliation with or endorsement of this book by such owners.

Unless otherwise indicated herein, any third-party trademarks that may appear in this work are the property of their respective owners and any references to third-party trademarks, logos or other trade dress are for demonstrative or descriptive purposes only. Such references are not intended to imply any sponsorship, endorsement, authorization, or promotion of ED-Tech products by the owners of such marks, or any relationship between the owner and ED-Tech Press or its affiliates, authors, licensees or distributors.

British Library Cataloguing in Publication Data. A catalogue record for this book is available from the British Library.

For more information regarding ED-Tech Press and its products, please visit the publisher's website www.edtechpress.co.uk

TABLE OF CONTENTS

Preface xi

Chapter 1 Concepts of Research ...... 1
Introduction ...... 1
Determining a Theory ...... 3
Defining Variables ...... 3
Extraneous Variables ...... 4
Intervening Variables ...... 4
Developing the Hypothesis ...... 5
Standardization ...... 5
Selecting Subjects ...... 6
Simple Random Sample ...... 6
Systematic Sample ...... 7
Stratified Random Sample ...... 7
Cluster Sample ...... 8
Non-probability Sample ...... 8
Testing Subjects ...... 8
Analyzing Results ...... 9
Determining Significance ...... 10
Communicating Results ...... 10
Replication ...... 11
Putting it All Together ...... 11
Determining a Theory ...... 11
Determining Hypothesis ...... 12
Standardization ...... 12
Selecting Subjects ...... 12
Testing Subjects ...... 13
Analyzing Results ...... 13
Determination of Significance ...... 13
Communicating Results ...... 13
Replication ...... 14
Scope of Research ...... 14
National Innovative Capacity: Modeling, Measuring and Comparing National Capacities ...... 14
Designing Efficient Incentive Systems for Invention and Innovation: Intellectual Property Rights, Prizes, Public Subsidies ...... 15
Research in EPFL Labs: New Economics of Science ...... 15
New R&D Methods and the Production of Reliable Knowledge in Sectors which Lagged Behind ...... 16
New Models of Innovation: Open, Distributed Systems and the Role of Users ...... 17
Other Issues to be Developed ...... 18
Limitations of Research ...... 18
Purposes of Research ...... 18
Data Management ...... 18
...... 19
Types of Research ...... 21
Historical Research in Physical Activity ...... 21
Meta-analysis ...... 24
Descriptive Research ...... 46
Physical Activity Epidemiology Research ...... 51
Experimental Research ...... 51
Quasi-Experiment Research ...... 57

Chapter 2 Quantitative and Qualitative Research ...... 59
Quantitative Research ...... 59
in Quantitative Research ...... 60
Measurement in Quantitative Research ...... 61
Quantitative Methods ...... 62
Quantitative Research Design ...... 62
Quantitative Data Analysis ...... 67
Qualitative Research ...... 74
Primary Data: Qualitative versus Quantitative Research ...... 77
The Nature of Qualitative Research ...... 78
Rationale for Using Qualitative Research ...... 79
Philosophy and Qualitative Research ...... 81
Ethnographic Research ...... 88
Grounded Theory ...... 94
Action Research ...... 98

Chapter 3 Research Process ...... 103
The Process of Social Research ...... 103
Formulating the Research Problem ...... 105
Conceptualizing the Problem ...... 105
The Logic of Research ...... 107
The Nature of Argumentation ...... 108
Some Comments ...... 110
Inductive and Deductive Reasoning ...... 110
The JFK Example ...... 111
Some Definitions ...... 113
Inductive Generalization and Retroductive Inference ...... 114
Concluding Remarks ...... 114
Types of Reasoning in Social Research ...... 115
Example: Giorgi's Study on Religious Involvement in Secularised Societies ...... 116
Summary Comments on the Example ...... 120
Stages in the Research Process ...... 120
Formulating the Research Problem (Cases, Variables and Relationships) ...... 120
Formulating the Research Problem (Research Objectives) ...... 127
Research Design ...... 131
Conceptualisation (Defining Key Concepts) ...... 136
Conceptualisation (Formulating Research Hypotheses) ...... 140
Operationalisation ...... 144
...... 149
Data Collection (Data Sources, Reactivity and Control) ...... 157
Data Collection (Sources of Error) ...... 163
Data Collection (Ensuring Reliability) ...... 170
Data Analysis and Interpretation ...... 174
Writing the Research Report ...... 182

Chapter 4 ...... 190
Definition ...... 191
Standard Normal Distribution ...... 191
General Normal Distribution ...... 191
Notation ...... 192
Alternative Parameterizations ...... 192
History ...... 192
Development ...... 192
Naming ...... 194
Properties ...... 195
Symmetries and Derivatives ...... 195
Moments ...... 196
Fourier Transform and Characteristic Function ...... 197
Moment and Cumulant Generating Functions ...... 197
Cumulative Distribution Function ...... 198
Standard Deviation and Coverage ...... 199
Quantile Function ...... 199
Zero-variance Limit ...... 200
Occurrence and Applications ...... 200
Exact Normality ...... 201
Approximate Normality ...... 201
Assumed Normality ...... 201
Produced Normality ...... 202
Generating Values from Normal Distribution ...... 203

Chapter 5 Statistical Hypothesis Testing ...... 205
Variations and Sub-classes ...... 206
The Testing Process ...... 206
Interpretation ...... 208
Use and Importance ...... 208
Cautions ...... 209
Examples ...... 210
Lady Tasting Tea ...... 210
Courtroom Trial ...... 210
Philosopher's Beans ...... 211
Clairvoyant Card Game ...... 212
Radioactive Suitcase ...... 213
Definition of Terms ...... 214
Origins and Early Controversy ...... 216
Early Choices of Null Hypothesis ...... 218
Null Hypothesis Testing ...... 218
Criticism ...... 220
Alternatives ...... 221
Philosophy ...... 222
Education ...... 223
One- and Two-tailed Tests ...... 223
Applications ...... 224
Coin Flipping Example ...... 225
History ...... 225
Specific Tests ...... 226
Statistical Power ...... 226
Background ...... 227
Factors Influencing Power ...... 228
Interpretation ...... 229
A Priori vs. Post Hoc Analysis ...... 230
Application ...... 231
Extension ...... 231
for Power and Sample Size Calculations ...... 231
Permutation Tests ...... 232
Multiple Comparisons Problem ...... 232
History ...... 233
Definition ...... 233
Classification of Multiple Hypothesis Tests ...... 234
Controlling Procedures ...... 234
Large-scale Multiple Testing ...... 236

Chapter 6 ...... 238
Definitions ...... 238
Applications and Purpose ...... 239
Non-parametric Models ...... 240
...... 240
Kernel Density Estimation ...... 241
Nonparametric Regression ...... 244
Data Envelopment Analysis ...... 247
k-nearest Neighbors Algorithm ...... 251
Support Vector Machine ...... 258
Methods ...... 261
Analysis of Similarities ...... 261
Anderson–Darling Test ...... 262
Cochran's Q Test ...... 263
Cohen's Kappa ...... 263
Friedman Test ...... 267
Kendall Rank Correlation Coefficient ...... 268
Kendall's W ...... 268
Kolmogorov–Smirnov Test ...... 268
Kruskal–Wallis One-way Analysis of Variance ...... 269
Kuiper's Test ...... 270
Log-rank Test ...... 271
Mann–Whitney U Test ...... 272
McNemar's Test ...... 274
Test ...... 274
Resampling (Statistics) ...... 275
Rank Product ...... 280
Siegel–Tukey Test ...... 280
Sign Test ...... 281
Spearman's Rank Correlation Coefficient ...... 282
Squared Ranks Test ...... 283
Tukey–Duckworth Test ...... 283
Wald–Wolfowitz Runs Test ...... 284

Bibliography ...... 285

PREFACE

"All progress is born of inquiry. Doubt is often better than over-confidence, for it leads to inquiry, and inquiry leads to invention" is a famous remark by Hudson Maxim, in the context of which the significance of research can well be understood. Increased amounts of research make progress possible. Research inculcates scientific and inductive thinking, and it promotes the development of logical habits of thinking and organization. Research in common parlance refers to a search for knowledge. One can also define research as a scientific and systematic search for pertinent information on a specific topic. In fact, research is an art of scientific investigation.

The role of research in several fields of applied economics, whether related to business or to the economy as a whole, has greatly increased in modern times. The increasingly complex nature of business and government has focused attention on the use of research in solving operational problems. Research, as an aid to economic policy, has gained added importance, both for government and business.

Statistics is a branch of mathematics dealing with the collection, analysis, interpretation, presentation, and organization of data. Two main statistical methods are used in data analysis: descriptive statistics, which summarize data from a sample using indexes such as the mean or standard deviation, and inferential statistics, which draw conclusions from data that are subject to random variation. Statistics offers methods to estimate and correct for any bias within the sample and data collection procedures. There are also methods of experimental design that can lessen these issues at the outset of a study, strengthening its capability to discern truths about the population.

This book consistently integrates Research Methodology and Statistical Methods, allowing students to learn concurrently about different research designs and the appropriate statistics to use when analyzing data.

– Morgan Shields


1

Concepts of Research

INTRODUCTION Research is the cornerstone of any science, including both the hard sciences such as chemistry or physics and the social sciences such as psychology, management, or education. It refers to the organized, structured, and purposeful attempt to gain knowledge about a suspected relationship. Many argue that the structured attempt at gaining knowledge dates back to Aristotle and his identification of deductive reasoning. Deductive reasoning refers to a structured approach utilizing an accepted major premise, a related minor premise, and an obvious conclusion. This way of gaining knowledge has been called a syllogism, and by following downward from the general to the specific, knowledge can be gained about a particular relationship. An example of an Aristotelian syllogism might be:
Major premise: All students attend school regularly
Minor premise: John is a student
Conclusion: John attends school regularly
In the early 1600s, Francis Bacon identified a different approach to gaining knowledge. Rather than moving from the general to the specific, Bacon looked at the gathering of specific information in order to make general conclusions. This type of reasoning is called inductive and, unlike Aristotelian logic, allows new major premises to be determined. Inductive reasoning has been adopted into the sciences as the preferred way to explore new relationships because it allows us to use accepted knowledge as a means to gain new knowledge.

For example:
Specific premise: John, Sally, Lenny and Sue attended class regularly
Specific premise: John, Sally, Lenny and Sue received high grades
Conclusion: Attending class regularly results in high grades
Researchers combine the powers of deductive and inductive reasoning into what is now referred to as the scientific method. It involves the determination of a major premise and then the analysis of the specific examples that would logically follow. The results might look something like:

Major premise: Attending classes regularly results in high grades
Class attendance (suspected cause):
Group 1: John, Sally, Lenny and Sue attend classes regularly
Group 2: Heather, Lucinda, Ling, and Bob do not attend classes regularly
Grades (suspected effects):
Group 1: John, Sally, Lenny and Sue received high grades
Group 2: Heather, Lucinda, Ling and Bob received C's and D's
Conclusion: Attending class regularly results in higher grades when compared with not attending class regularly (the major premise or hypothesis is therefore supported)

Utilizing the scientific method for gaining new information and testing the validity of a major premise, John Dewey suggested a series of logical steps to follow when attempting to support a theory or hypothesis with actual data. In other words, he proposed using deductive reasoning to develop a theory followed by inductive reasoning to support it. These steps can be found in the table below.
Dewey's Scientific Method:
1. Identify and define the problem.
2. Determine the hypothesis, or reason why the problem exists.
3. Collect and analyse data.
4. Formulate conclusions.
5. Apply conclusions to the original hypothesis.
The steps involved in the research process can vary depending on the type of research being done and the hypothesis being tested. The most stringent types of research, such as experimental methods, contain the most structured process. Naturalistic observation, surveys, and other non-intrusive studies are often less structured. A general process guide for doing research, especially laboratory research, can be found in the table below.
Steps Involved in the Research Process:
1. Determine your theory or educated guess about a relationship. This involves identifying and defining a problem and reviewing the current literature.
2. Operationally define all variables to be involved in the research.
3. Develop the hypothesis by plugging variables into the original theory. A hypothesis is a testable theory with operationally defined variables.
4. Standardize the research methods by developing a research protocol to be used with every subject. Include in this step the methods for subject selection and assignment as well as how you will attempt to control for any extraneous variables.
5. Select subjects and assign them to groups using the protocol developed in step 4.
6. Test subjects using the protocol developed in step 4.
7. Analyse results.
8. Determine the significance of the results and how these results compare with other studies. Critique your research and suggest needs for further research based on your findings.
9. Communicate results through journal publication, book, book chapter, report, presentation, or any means that will benefit those to whom the research applies.
10. Replicate. While part of the research process, this step is most often completed by a different researcher.

DETERMINING A THEORY While you may see a theory as an absolute, such as the theory of gravity or the theory of relativity, it is actually a changing phenomenon, especially in the soft or social sciences. Theories are developed based on what is observed or experienced, oftentimes in the real world. In other words, a theory may have no additional backing other than an educated guess or a hunch about a relationship. For example, while teaching a college course in research, we notice that non-traditional students tend to be more involved in class lectures and perform better on class exams than traditional students. Our theory, then, could be that older students are more dedicated to their education than younger students. At this point, however, we have noticed only a trend within a single class that may or may not exist. We have developed a theory based on our observations and this theory, at least at this point, has no practical applications. Most theories are less concerned with application and more concerned with explanations. For example, we could assume, based on our observations, that older students have witnessed the importance of education through their work and interactions with others. With this explanation, we now have a theoretical cause and effect relationship: students who have had prior experience in the workforce are more dedicated to their education than students who have not had this experience. At this point it is always wise to do a literature review on your topic and areas related to your topic. Results from this search will likely help you determine how to proceed with your research. If, for example, you find that several studies have already been completed on this topic with similar results, doing yet another experiment may add little to what is already known. If this is the case, you would need to rethink your ideas and perhaps replicate the previous research using a different type of subject or a different situation, or you may choose to scrap the study altogether.

DEFINING VARIABLES Variables can be defined as any aspect of a theory that can vary or change as part of the interaction within the theory. In other words, variables are anything that can affect or change the results of a study. Every study has variables, as these are needed in order to understand differences. In our theory, we have proposed that students exposed to the workforce take a more active role in their education than those who have no exposure. Looking at this theory, you might see that several obvious variables are at play, including 'prior work experience' and 'age of student.' However, other variables may also play a role in or influence what we observed. It is possible that older students have better social skills, causing them to interact more in the classroom. They may have learned better studying skills, resulting in higher examination grades. They may feel awkward in a classroom of younger students or doubt their ability more and therefore try harder to succeed. All of these potential explanations or variables need to be addressed for the results of research to be valid. Let's start with the variables that are directly related to the theory. First, the prior work experience is what we are saying has the effect on the classroom performance. We could say that work history is therefore the cause and classroom grades are the effect. In this example, our independent variable, the variable we start with, is work experience. Our dependent variable, or the variable we end up with, is grades. We could add additional variables to our list to create more complex research. If we also looked at the effect of study skills on grades, study skills would become a second independent variable. If we wanted to measure the length of time to graduation along with grades, this would become a second dependent variable. There is no limit to the number of variables that can be measured, although the more variables, the more complex the study and the more complex the statistical analysis. The most powerful benefit of increasing our variables, however, is control. If we suspect something might impact our outcome, we need to either include it as a variable or hold it constant between all groups. If we find a variable that we did not include or hold constant to have an impact on our outcome, the study is said to be confounded. Variables that can confound our results, called confounding variables, are categorized into two groups: extraneous and intervening.

EXTRANEOUS VARIABLES Extraneous variables can be defined as any variable other than the independent variable that could cause a change in the dependent variable. In our study we might realise that age could play a role in our outcome, as could family history, education of parents or partner, interest in the class topic, or even time of day, preference for the instructor’s teaching style or personality. The list, unfortunately, could be quite long and must be dealt with in order to increase the probability of reaching valid and reliable results.

INTERVENING VARIABLES Intervening variables, like extraneous variables, can alter the results of our research. These variables, however, are much more difficult to control for. Intervening variables include motivation, tiredness, boredom, and any other factor that arises during the course of research. For example, if one group becomes bored with their role in the research more so than the other group, the results may have less to do with our independent variable, and more to do with the boredom of our subjects.

DEVELOPING THE HYPOTHESIS The hypothesis is directly related to a theory but contains operationally defined variables and is in testable form. Hypotheses allow us to determine, through research, if our theory is correct. In other words, does prior work experience result in better grades? When doing research, we are typically looking for some type of difference or change between two or more groups. In our study, we are testing the difference between having work experience and not having work experience on college grades. Every study has two hypotheses; one stated as a difference between groups and one stated as no difference between groups. When stated as a difference between groups, our hypothesis would be, “students with prior work experience earn higher grades than students without prior work experience.” This is called our research or scientific hypothesis. Because most statistics test for no difference, however, we must also have a null hypothesis. The null hypothesis is always written with the assumption that the groups do not differ. In this study, our null hypothesis would state that, “students with work experience will not receive different grades than students with no work experience.” The null hypothesis is what we test through the use of statistics and is abbreviated H0. Since we are testing the null, we can assume then that if the null is not true then some alternative to the null must be true. The research hypothesis stated earlier becomes our alternative, abbreviated H1. In order to make research as specific as possible we typically look for one of two outcomes, either the null or the alternative hypothesis. To conclude that there is no difference between the two groups means we are accepting our null hypothesis. If we, however, show that the null is not true then we must reject it and therefore conclude that the alternative hypothesis must be true. While there may be a lot of gray area in the research itself, the results must always be stated in black and white.

STANDARDIZATION Standardization refers to methods used in gathering and treating subjects for a specific study. In order to compare the results of one group to the results of a second group, we must assure that each group receives the same opportunities to succeed. Standardized tests, for instance, painstakingly assure that each student receives the same questions in the same order and is given the same amount of time, the same resources, and the same type of testing environment. Without standardization, we could never adequately compare groups.

For example, imagine that one group of students was given a particular test and allowed four hours to complete it in a quiet and well-lit room. A second group was given the same test but only allowed 30 minutes to complete it while sitting in a busy school lunchroom full of laughing and talking children. If group 1 scored higher than group 2, could we truly say that they did better? The answer is obviously 'no.' To make sure we can compare results, we must make everything equal between the two or more groups. Only then could we say that group 1 performed better than group 2. Standardization of the research methods is often a lengthy process. The same directions must be read to each student, the same questions must be given, and the same amount of time must be assured. All of these factors must be decided before the first subject can be tested. While standardization refers mainly to the testing situation itself, these principles of 'sameness' involve the selection of subjects as well.

SELECTING SUBJECTS If we want to know if Billy performed better than Sally, or if boys scored higher than girls in our class, or even if Asian children receive higher grades in our school than Caucasian children, the selection of subjects is rather simple. When we are testing the entire population of possible subjects, we are adequately assured that no subject bias has occurred. A population refers to the entire pool of possible subjects. In a classroom or other setting where the entire population is relatively small, testing all subjects may be simple. However, if we are attempting to understand or gain knowledge related to a large population, such as all third grade children, all depressed adults, or all retail employees, gathering and testing everyone would be relatively impossible. In this situation, we would need to gather a sample of the population, test this sample, and then make inferences aimed at the entire population that they represent. When determining which potential subjects from a large population to include in our study, there are several approaches to choose from. Each of these sampling techniques has its own strengths and, of course, its own weaknesses. The idea behind adequate sampling, however, remains the same: to gather a sample of subjects that is representative of the greater population. The ideal research sample is therefore often referred to as a representative sample.

SIMPLE RANDOM SAMPLE To assure that the sample of subjects taken from a known population truly represents the population, we could test every subject in the population and choose only those who fall around the mean of the entire population. This technique is usually pointless because doing so means we could just as easily have tested the entire population on our independent and dependent variables. Therefore in order to make sure all possible subjects have an equal opportunity to be chosen, simple random sampling is most often the selection method used.

To choose a random group of 10 students from a class of 30, for example, we could put everyone’s name in a hat and use the first ten names drawn as our sample. In this method, subjects are chosen just as ‘B6’ is chosen in a game of BINGO. This technique can work well with a small population but can be time consuming and archaic when the population size is large. To choose 30 students from a class of 250 students would be easier utilizing technology and what is referred to as a random number table. A random number table is a computer generated list of numbers placed in random order. Each of the 250 students would be randomly assigned a number between one and 250. Then the groups would be formed once again using a random number generator.
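The same procedure can be sketched in a few lines of code. The example below is a minimal illustration, assuming a hypothetical roster of 250 students identified only by number; Python's pseudo-random generator stands in for the printed random number table described above.

```python
# Minimal sketch of simple random sampling (illustrative roster, not real data).
import random

random.seed(42)                         # fixed seed so the example is reproducible
population = list(range(1, 251))        # students numbered 1 to 250

sample = random.sample(population, 30)  # every student has an equal chance of selection
print(sorted(sample))
```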

SYSTEMATIC SAMPLE When a population is very large, assigning a number to each potential subject could also be tiresome and time consuming. A systematic sample is a random sample compiled in a systematic manner. If you had a list of all licensed teachers, for example, and wanted to mail a survey to 200 of them, systematic sampling might be the sampling method of choice. For this example, a page and a teacher number on that page are determined at random. This would represent the first subject and the starting point for choosing the remaining subjects. A random number would be generated, for example 150. Then every 150th teacher would become a subject until you have selected enough for your study. If you complete the list before selecting enough subjects, you would continue back at the beginning of the list. Once the subjects are selected, the technique of random assignment can be used to assign subjects to particular groups.
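A systematic draw is just as easy to express in code. The sketch below assumes a hypothetical list of 30,000 licensed teachers and the illustrative interval of 150 used above: a starting point is chosen at random, every 150th teacher after it is selected, and the count wraps back to the top of the list until 200 subjects have been chosen.

```python
# Minimal sketch of systematic sampling (hypothetical teacher list).
import random

random.seed(1)
teachers = [f"teacher_{i}" for i in range(1, 30001)]
interval = 150
needed = 200

start = random.randrange(len(teachers))                # random starting point
sample, position = [], start
while len(sample) < needed:
    sample.append(teachers[position % len(teachers)])  # wrap around if the list ends
    position += interval
print(sample[:5])
```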

STRATIFIED RANDOM SAMPLE The use of a stratified sample refers to the breaking down of the population into specific subsets before choosing which ones will take part in the study. For example, if you are studying all third grade students in your state, you may want to make sure that every county in your state is represented in your study. If you used a simple random sampling technique, you could conceivably end up with many subjects from one county and no subjects from other counties. A stratified sample allows you to choose your subject pool randomly from a predetermined set of subsets. In this example, we may want to choose 10 subjects at random from each county within the state. Other subsets can also be used, such as age, race, or socioeconomic background. If you wanted to make sure that there were an equal number of males and females, you could use sex as your subset and then randomly choose the same number of subjects from each subset. This type of sampling is useful when the population has some known differences that could result in different outcomes. For instance, if you already know that 80% of the students are male, you may want to select 40 male students and 10 female students so that your sample represents the breakdown of sex within the population.
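A proportional stratified draw can be sketched as follows. The roster is hypothetical and simply mirrors the 80/20 split described above, and the helper function name is an invention for the example.

```python
# Minimal sketch of stratified random sampling with proportional allocation
# (hypothetical roster reflecting an 80% male / 20% female population).
import random

random.seed(7)
roster = [("male", i) for i in range(400)] + [("female", i) for i in range(100)]

def stratified_sample(population, stratum_of, total_n):
    """Draw from each stratum in proportion to its share of the population."""
    strata = {}
    for unit in population:
        strata.setdefault(stratum_of(unit), []).append(unit)
    sample = []
    for units in strata.values():
        n = round(total_n * len(units) / len(population))
        sample.extend(random.sample(units, n))
    return sample

sample = stratified_sample(roster, lambda unit: unit[0], total_n=50)  # roughly 40 male, 10 female
```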

CLUSTER SAMPLE Cluster sampling could be considered a more specific type of stratified sample. When this technique is used, potential subsets of subjects are first randomly eliminated and then the remaining subsets are used to randomly select the sample of subjects to be used in the study. For example, if you are measuring the effect of prior work experience on college grades in a particular state, you may first make a list of all colleges in the state. Then you would randomly select a number of colleges to either include or eliminate in the selection process. Once you have a subset of colleges, you could use the same technique to randomly include or eliminate the specific classes. From the remaining classes, you would then randomly select a group of students with work experience and a group of students with no work experience to be placed in your two groups.
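The two-stage elimination described above can be sketched like this; the college and class names are placeholders, and the numbers retained at each stage are arbitrary choices for the illustration.

```python
# Minimal sketch of two-stage cluster sampling (hypothetical colleges and classes).
import random

random.seed(3)
colleges = [f"college_{i}" for i in range(1, 41)]

# Stage 1: keep a random subset of colleges; the rest are eliminated.
kept_colleges = random.sample(colleges, 8)

# Stage 2: within each remaining college, keep a random subset of classes.
kept_classes = []
for college in kept_colleges:
    classes = [f"{college}_class_{j}" for j in range(1, 21)]
    kept_classes.extend(random.sample(classes, 3))

# Students with and without work experience would then be drawn from these classes.
print(len(kept_classes), "classes remain in the sampling frame")
```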

NON-PROBABILITY SAMPLE Non-probability sampling refers to a group of subjects chosen based on their availability rather than their degree of representativeness of the population. Surveys are often done in this manner. Imagine going to the local mall to gather information about the buying habits of mall shoppers. Your subject pool does not represent all mall shoppers but rather those mall shoppers who happen to walk by your location on that day. The same would hold true for a survey over the phone or via mail. Those who respond to your questions or return the mailed survey do not necessarily represent the population at large. Instead, they represent the population who was home and was willing to respond to your questions or those who took the time to complete and return the survey. While at first glance this method seems unprofessional, it allows for the gathering of information in a short amount of time. It is not considered standardized research and would be scrutinized if submitted to a professional journal, but it does have its place. If you've ever visited a web site and seen a survey, you might have felt compelled to click on the results link. When watching a news programme you may not have changed channels because you are waiting for the results of a survey that will be reported at the end of the programme. We are highly interested in these 'informal' polls, and using a non-probability sample is a quick way to gather large amounts of information in a relatively short amount of time.

TESTING SUBJECTS Once you have determined your variables, applied the concept of standardization, and selected your subjects, you are almost ready to begin the testing process. The concept of testing refers to the application or analysis of your independent and dependent variables. If there is any manipulation of the subjects in your study, it occurs during this phase. Before testing any human subject, however, some type of consent form is necessary. Consent forms basically describe the study, how the results will be used, and any possible negative effects that may occur. They also give the subject the right to withdraw from the study at any time without consequence. Your specific hypothesis does not need to be disclosed, but each subject must be made aware of any general concerns and be able to ask questions before testing can begin. If your hypothesis, for example, asked if there is a difference in effectiveness of different treatments for depression, you might assign your subjects to one of several different groups: cognitive therapy, dynamic therapy, humanistic therapy and possibly no therapy. Each subject would likely be tested prior to the study to determine a baseline for his or her level of depression and would then begin a predetermined and standardized treatment plan. Because you are standardizing your study, each subject should get identical treatment short of the independent variable. In other words, the only thing you want to be different between the groups is the type of therapy received. The no therapy group would be considered a control group and may participate in some type of non-therapy related activity while the other subjects receive therapy. This group is used to determine if time plays a role in the decrease of depressive symptoms. Without a control group you couldn't say that any particular therapy was more helpful than no therapy because subjects may have improved merely because of some outside factor unrelated to treatment.
These factors are called extraneous variables, and control groups, along with randomization, help to keep the impact of these variables to a minimum.

ANALYZING RESULTS The specific analysis performed on the subjects depends on the type of subjects, the type of questions being asked, and the purpose of the research. When we have gathered and tested all possible subjects from a known population, we would use descriptive statistics to analyse our results. Descriptive statistics require the testing of everyone in the population and are used to describe the qualities of the population in numerical format. For example, we could say that the mean score on an IQ test for all third graders at Jefferson High School is 102. We could also state that there is no difference between the IQs of boys and girls within our subjects if the data support these conclusions. When we are using a sample of subjects smaller than the entire population, we must make some inferences using what we call inferential statistics. Like any inferences, we also assume a certain degree of error when making determinations about a population based on a sample of that population. Because of this, the results of inferential statistics are often stated within a predetermined level of confidence. If we found that the mean of one group was 10 points higher than the mean of a second group in our work experience and college grades study, we could not assume that the population means also differ by exactly ten points. We could, however, state that the means of the entire population are likely to differ by five to 15 points, or that there is a 95% probability that the means of the entire population differ by ten points. In this sense, we are predicting the scores of the entire population based on the scores of our sample and stating them within a range or a predetermined level of confidence. This allows us to include the likely error that occurs whenever an inference is made.
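As a concrete illustration of the difference between describing a sample and inferring about a population, the sketch below uses two small sets of made-up course averages (the numbers are purely illustrative, not taken from any real study) and computes the observed difference in means together with a 95% confidence interval for the population difference, using the standard pooled-variance formula under the assumption of roughly equal spread in the two groups.

```python
# Minimal sketch: descriptive difference vs. an inferential confidence interval.
import statistics
from scipy import stats

work_experience = [88, 91, 84, 90, 87, 89, 92, 85, 88, 86]
no_experience   = [82, 80, 85, 83, 81, 86, 79, 84, 82, 78]

m1, m2 = statistics.mean(work_experience), statistics.mean(no_experience)
s1, s2 = statistics.stdev(work_experience), statistics.stdev(no_experience)
n1, n2 = len(work_experience), len(no_experience)

diff = m1 - m2                           # descriptive: what we observed in these samples

# Inferential: 95% confidence interval for the population mean difference.
sp = (((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)) ** 0.5   # pooled standard deviation
se = sp * (1 / n1 + 1 / n2) ** 0.5                                    # standard error of the difference
t_crit = stats.t.ppf(0.975, df=n1 + n2 - 2)                           # two-sided 95% critical value
print(f"observed difference: {diff:.1f}")
print(f"95% CI for the population difference: ({diff - t_crit * se:.1f}, {diff + t_crit * se:.1f})")
```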

DETERMINING SIGNIFICANCE The term significance when related to research has a very specific role. Significance refers to the level of certainty in the results of a study. We can say that our subjects differed by an average of ten points with 100% certainty because we personally witnessed this difference. To say that the population will differ is another story. To do this, we must determine how valid our results are based on a statistical degree of error. If we find, through the use of inferential statistics, that the grades of those with and without work experience are different, we must state the estimated error involved in this inference. While the standard acceptable error is 5%, it can be as high as 20% or as low as 0.1%. The amount of error to be accepted in any study must be determined prior to beginning the study. In other words, if we want to be 95% confident in our results, we set the significance level at .05. If we want to be 99% confident, our significance level is set at .01. We can then state that there is a difference in the population means at the 95% confidence level or at the 99% confidence level if our statistics support this statement. If our statistics estimate that there is 10% error and we said we would accept only 5%, the results of our study would be stated as 'not significant.' When determining significance, we are saying that a difference exists within our acceptable level of error and we must therefore reject the null hypothesis. When results are found to be not significant, the only option available is to accept the null hypothesis.
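The decision rule can be sketched directly in code: choose the significance level before looking at the data, compute the test's p-value, and compare the two. The example below reuses the illustrative course averages from the previous sketch and a standard two-sample t-test from SciPy; the numbers and the choice of test are assumptions made for the illustration, not part of the study described in the text.

```python
# Minimal sketch of the significance decision at alpha = .05 (illustrative data).
from scipy import stats

work_experience = [88, 91, 84, 90, 87, 89, 92, 85, 88, 86]
no_experience   = [82, 80, 85, 83, 81, 86, 79, 84, 82, 78]

alpha = 0.05                                             # accept at most 5% error (95% confidence)
t_stat, p_value = stats.ttest_ind(work_experience, no_experience)

if p_value < alpha:
    print(f"p = {p_value:.4f} < {alpha}: reject the null hypothesis (significant)")
else:
    print(f"p = {p_value:.4f} >= {alpha}: do not reject the null hypothesis (not significant)")
```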

COMMUNICATING RESULTS Results of a study are disseminated in many forms. The highest level of communicating results is often in a peer-reviewed professional journal. Peer-reviewed refers to a group of professionals in a particular field who read all submissions and publish only those that meet the highest degree of scrutiny and applicability. When errors are found in the sampling of subjects, the statistical analysis, or the inferences made, the study will often be rejected or returned to the author for revisions. Published articles in peer-reviewed journals would likely be the best source for research when you begin looking into your theory. Results of research studies are also disseminated through textbooks, book chapters, conferences, presentations, and newsletters. For example, a study comparing the average salary in a particular county might be published in the local newspaper or in a brochure for the chamber of commerce. Our study of non-traditional students and work experience might be summarized in a board meeting of the college's department of student retention or published in a trade journal such as the "Journal of Higher Education." Some studies are never released, especially if the results do not add to the already available research. Other studies are meant only to provide direction for larger studies. Our study of college students may be used only to determine if a larger study is likely to result in important findings. If we get significant results then a larger study, including a broader subject pool, may then be conducted. These types of studies are often called pilot studies because the goal is not to gather knowledge about the population, but rather to guide further research in a particular area.

REPLICATION Replication is the key to the support of any worthwhile theory. Replication involves the process of repeating a study using the same methods, different subjects, and different experimenters. It can also involve applying the theory to new situations in an attempt to determine the generalizability to different age groups, locations, races, or cultures. For example, our study of non-traditional students may be completed using students from another college or from another state. It may be changed slightly to add additional variables such as age, sex, or race to determine if these variables play any role in our results. Replication, therefore, is important for a number of reasons, including:
• Assurance that results are valid and reliable;
• Determination of generalizability or the role of extraneous variables;
• Application of results to real world situations; and
• Inspiration of new research combining previous findings from related studies.

PUTTING IT ALL TOGETHER We asked the question, 'Do college students with work experience earn better grades than those without work experience?' Knowing the steps involved in doing research and now having a basic understanding of the process, we could design our experiment and, with fictional results, could determine our conclusions and how to report our findings to the world. To do this, let's start with our theory and progress through each of the ten steps.

DETERMINING A THEORY Theories are developed through our interaction with our environment. For our particular theory, we observed that older college students tend to perform better on classroom tests than younger students. As we attempt to explain why, we developed our theory that real world work experience creates a motivation in students that allows them to perform better than students without this motivation. Our theory, therefore, states that prior work experience will result in higher grades.

DEFINING VARIABLES Every experiment has an independent and a dependent variable. The independent variable is what we start with; it refers to the separation of our groups. In our case, we want to look at prior work experience so the presence or absence of this would constitute our experimental groups. We may place those students who have been in the workforce for more than one year in group 1 and those with less than one year in group 2. Our dependent variable is our outcome measure so in our case we are looking for a difference in class grades. To operationally define the variable grades, we might use the final course average as our outcome measure. If the independent and dependent variable(s) are difficult to determine, you can always complete the following statement to help narrow them down: The goal of this study is to determine what effect ______(IV) has on ______(DV). For us, the goal is to determine what effect one year or more of prior work experience has on course average.

DETERMINING HYPOTHESIS When we plug our variables into our original theory we get our research hypothesis. Simply stated, Students with one or more years of prior work experience will receive higher final course averages than students with less than one year of prior work experience. Since statistical analysis often tests the null hypothesis or the idea that there is no difference between groups, our null hypothesis could be stated as: Final course averages of students with one or more years of prior work experience will not differ from final course averages of students with less than one year of prior work experience.

STANDARDIZATION To make sure that each subject, no matter which group they belong to, receives the same treatment, we must standardize our research. While this may be difficult in the real world, our goal is to get as close as possible to the ideal. Therefore, we may choose to gather subjects from a general psychology class since this is a class required of most students and will not be affected by college major. We may further decide to research only those students who have a specific instructor to keep the instruction between the two groups as similar as possible. Remember, our goal is to assure, at least as much as possible, that the only difference between the two groups is the independent variable.

SELECTING SUBJECTS Because our population consists of all college students, it will be impossible to include everyone in the study. Therefore we need to apply some type of random selection. Since we want to use only those students who have the same instructor, we may ask all of this instructor's students, prior to any teaching, how much work experience they have had. Those who report a year or more become the potential subject pool for group 1 and those who have less than one year become the subject pool for group 2. We could, at this point, decide to include all of these subjects or to further reduce the subjects randomly. To reduce the subject pool we could assign each student in each group a random number and then choose, at random, a specific number of students to become subjects in our study. For the purpose of this example, we randomly choose 20 students in each group to participate in our study.

TESTING SUBJECTS Since we are not applying any type of treatment to our subjects, this phase in the process can be omitted. If we were determining if the teaching styles of different instructors played a role in grades, we would randomly assign each student to a teacher. In that case, teaching style would become an independent variable in our study.

ANALYZING RESULTS Our original question asked if final averages would be different between our two groups. To determine this, we look at the mean of each group: add up the averages of the 20 subjects in each group and divide each total by 20. If, after comparing the means of each group, we find that group 1 has a mean of 88 and group 2 has a mean of 82, then we can descriptively state that there is a six-point difference between the means of the two groups. Based on this statistic, we would then begin to show support for our alternative hypothesis and can progress to the next step.

DETERMINATION OF SIGNIFICANCE Our goal was not to describe what our subjects' averages were, but rather to make inferences about what is likely happening in the entire population. We must therefore apply inferential statistics to our results to determine the significance, or lack of significance, of our findings. We set our confidence level at 95 per cent and then apply statistical analysis to our results to see if the difference of six points with a sample size of 40 is significant. Imagine that we did find a significant difference. In this case we could say that, with a 95% confidence level, students with one year or more of work experience receive higher averages than those with less than one year of work experience. Since the null hypothesis, which stated that no difference exists between the two groups, was not correct, we must reject it. And by rejecting the null, we automatically accept our alternative hypothesis.
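A sketch of this step, using the worked numbers above (group means of 88 and 82 with 20 students per group), is shown below. The group standard deviations are not given in the text, so an assumed value of 8 points is used purely for illustration; with real data the test would be run on the actual course averages.

```python
# Minimal sketch of testing the six-point difference at the .05 level using
# summary statistics (the standard deviation of 8 is an assumed, illustrative value).
from scipy import stats

mean1, mean2 = 88, 82        # group means from the worked example
n1, n2 = 20, 20              # 20 subjects per group
sd1 = sd2 = 8                # assumed spread of course averages in each group

alpha = 0.05                 # 95% confidence level chosen before the study
t_stat, p_value = stats.ttest_ind_from_stats(mean1, sd1, n1, mean2, sd2, n2)

if p_value < alpha:
    print(f"p = {p_value:.4f}: reject the null hypothesis; the groups differ")
else:
    print(f"p = {p_value:.4f}: the result is not significant at the .05 level")
```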

COMMUNICATING RESULTS When communicating the results of our study we need to do several things. We need to make a case for why we did this research, which is often based on our literature search. We then need to report the process we took in gathering our sample and applying the treatment. We can then report our results and argue that there is a difference between the two groups and that this difference is significant enough to infer it will be present in the entire population. Finally, we must evaluate our research in terms of its strengths, weaknesses, applicability, and needs for further study. In terms of strengths, we might include the rigors of gathering subjects and the fact that we used a random sample of students. We may argue that the statistical methods used were ideal for the study or that we considered the recommendations of previously completed studies in this area. Weaknesses might include the small sample size, the limited pool from which our sample was gathered, or the reliance on self-reported work experience.

To discuss applicability and needs for further studies we could suggest that more studies be completed that use a broader base of subjects or different instructors. We could recommend that other variables be investigated such as student age, type and location of college, family educational history, sex, race, or socioeconomic background. We might even suggest that while our findings were significant they are not yet applicable until these other variables are investigated.

REPLICATION The final step in any research is replication. This can be done by us but is most often completed by other researchers based on their own review of the literature and the recommendations made by previous researchers. If others complete a similar study, or look at different variables and continue to find the same results, our results become stronger. When ten other studies agree with ours, the chances are greatly improved that our results were accurate. If ten other studies disagree with our findings then the validity of our study will be, and most certainly should be, called into question. By replicating studies and using previously gained knowledge to search for new answers, our profession continues to move forward. After all, we used the ideas of other researchers to design our research, and future researchers may incorporate our findings to make recommendations in their research. The cycle is never ending and allows for perpetual seeking of new knowledge.

SCOPE OF RESEARCH

NATIONAL INNOVATIVE CAPACITY: MODELING, MEASURING AND COMPARING NATIONAL CAPACITIES National innovative capacity is the ability of a country to produce and commercialise a flow of innovative technology over the long term. It depends on:
• The strength of a nation's common infrastructure;
• The cluster-specific innovation environment; and
• The quality of linkages.
This research strand aims at building innovation indexes and measuring various dimensions of national innovation capacities. For instance:
• Strategic capacity: the ability to mobilise and concentrate resources under some centralised decision-making processes to achieve a critical scientific or technological objective.
• Revolutionary capacity: the ability to shift resources out of areas of lower and into areas of higher productivity and greater yield. This is a capacity to manage transitions. The difficulty is that such a capacity involves various dimensions which can be conflicting.

DESIGNING EFFICIENT INCENTIVE SYSTEMS FOR INVENTION AND INNOVATION: INTELLECTUAL PROPERTY RIGHTS, PRIZES, PUBLIC SUBSIDIES One central problem in the economics of knowledge is the design of incentive systems that both reward inventors/knowledge producers and encourage dissemination of their output. Several scholars have described the two regimes that allocate resources for the creation of new knowledge: one is the system of granting intellectual property rights, as exemplified by modern patent and copyright systems; the other is the open science regime, as often found in the realm of pure scientific research and sometimes in the realm of commercial technological innovation, often in infant industries. A large range of issues have to be addressed to elucidate the problem of designing efficient incentive systems:
• What is the best solution in the case of particular kinds of new technologies?
• What is the nature of the tension that arises when the two systems come up against each other?
• How should proper incentive systems be designed to encourage research and innovation in areas of high social return and low private profitability?
• Under what conditions does a prize-based reward system provide a more efficient solution than granting intellectual property rights?
• Is there an economic case for granting intellectual property rights in the domain of research tools, instruments, and basic knowledge?

RESEARCH IN EPFL LABS: NEW ECONOMICS OF SCIENCE CEMI will be at the forefront of the College to develop and undertake research in the field of "economics of science" with EPFL as the main case. In this perspective, several topics are obvious:
• Assessing the impact of organizational practice on the productivity of university technology transfer offices.
• Measuring the social value of basic research and the local spillovers, accounting for the effects associated with mobility.
• Scale, scope and spillovers: the determinants of research productivity in several fields.
• Exploring the role of patents in knowledge transfer from EPFL.
• Exploring the effect of the patenting of research tools and biomedical innovation: transfer opportunities and social costs.
• Access policies for large-scale research instruments and databases.
All these topics should give rise to research designs in close collaboration with the other EPFL Schools in order to benefit from the great opportunity to be located in an Institute of Technology. These projects will be designed in close collaboration with Jan-Anders Manson, vice-president for Innovation and Knowledge Transfer.

NEW R&D METHODS AND THE PRODUCTION OF RELIABLE KNOWLEDGE IN SECTORS WHICH LAGGED BEHIND Unequal access to pertinent knowledge bases may well constitute an important condition underlying perceptible differences in the success with which different areas of endeavour are pursued within the same society, and the pace at which productivity advances in different sectors of the economy during a given historical epoch. Today, it remains astonishing to observe the contrast between fields of economic activity where improvements in practice closely reflect rapid advances in human knowledge, such as information technologies, transportation, and certain areas of medical care, and other areas where the state of knowledge appears to be far more constraining. The fact is that knowledge is not being developed to the same degree in every sector. A major policy concern is to understand the factors at the origin of such uneven development, and to implement a proper strategy in order to fill the gap between sectors with fast knowledge accumulation processes and those in which these processes remain weak. To summarise, rapid and effective creation of know-how is most likely to occur when the following conditions converge:
• Practice in the field needs to be well specified, sustainable, replicable, and imitable;
• There needs to be an ability to learn from experience and experiment;
• The ability to experiment offline, with less expense than would be involved in online experimentation, and to gain reliable information relevant to online use, greatly facilitates progress; and
• A strong body of "scientific" knowledge greatly facilitates effective offline experimentation, and also quick and reliable evaluation of varying practice online.
Part of the problem in sectors which are lagging behind deals with the limited ability to conduct experiments. The main research issue here is to analyse the impact of new experimental methods and designs, which have the potential to profoundly transform the way reliable knowledge is produced in these sectors. For instance, one of the most significant developments in modern medicine has been the randomised controlled trial, the significance and use of which grew rapidly after its application to tuberculosis in the 1940s. Today the RCT is widely treated as the evidential 'gold standard' for demonstrating 'what works' and what is medical 'best practice'. Education might be the next sector to be profoundly transformed through the application of RCTs. The growth of RCTs as an approach in educational research has been pushed forward by three important factors: computers, statistical techniques and demand for accountability in both practice and research. There is, therefore, a favourable context. The question is whether this new feature can change and transform the way knowledge is produced and distributed in a sector like education.

NEW MODELS OF INNOVATION: OPEN, DISTRIBUTED SYSTEMS AND THE ROLE OF USERS This project involves the contribution of users to the innovation process, not only in terms of sending market signals but also in terms of actively contributing to the modification of the product. This project emphasizes, therefore, the functional source of innovation: while an innovation is considered a manufacturer's innovation when the developer expects to benefit by selling it, an innovation is a user innovation when the user expects to benefit by using it. This research aims at understanding the capabilities and limitations of user innovation processes, which quite often involve an open and distributed system. Its advocates claim that user innovation, involving free revealing, is an efficient means of producing socially desirable innovation and maximizing "spillovers," or knowledge transfer/leakage. The generation of innovation by users may be a complement to, or it may compete with, innovations produced by manufacturers. In its role as a complement, user innovation may extend the diversity of products without endangering market positions of manufacturers and may help manufacturing firms to mitigate information asymmetry problems vis-à-vis future market needs. As a competitor, user innovation may offer products that better meet user needs. The model involves two major deviations from the private investment model of innovation, which assumes that manufacturers innovate in products and processes to improve their competitive position and that returns to innovation result from excluding other manufacturers from adopting it. First, users of technologies, rather than manufacturers, are often the innovators. Second, user-innovators often freely reveal the proprietary knowledge they have developed at their private expense. A host of empirical studies, mainly conducted by Eric von Hippel, his research group at MIT and his colleagues, show that user innovation is an important economic phenomenon. It constitutes the main source of knowledge in some sectors or an important contributor in others. Deepening our understanding of the conditions leading to user innovation and of its economic impact is, therefore, a relevant issue:
• For a better assessment both of intangibles and intellectual capital at the firm level and of innovation capacities at the national level; and
• For a better understanding of some new organizational forms, such as user communities, which appear to be becoming more relevant in a knowledge society.
Thus our main research questions are the following:
• What are the different channels through which user innovations influence the economy, and how should manufacturers adapt and respond to user innovations?
• What kinds of learning processes/dynamic capabilities do user innovations enable across product/technological generations?
• What kind of policy issues and challenges pertain to user innovation? Given the fact that user innovations contribute significantly to productivity growth and national competitiveness, what kinds of policy should be devised to promote them?

OTHER ISSUES TO BE DEVELOPED

• The economics of knowledge policy: While it is relatively easy to provide a long list of policy recommendations of some relevance in the context of the knowledge economy, it is far more difficult to develop the welfare economics of knowledge investment in order to build a framework for addressing policy issues.
• Methodology for the optimal allocation of R&D funds to new technologies: How does the R&D manager maximize the probability of developing a commercializable technology over a specific period?

LIMITATIONS OF RESEARCH

The limitations of the research are addressed by first discussing the purposes of the research, and then the limitations of the data management, the data analysis, and the validity of the analysis.

PURPOSES OF RESEARCH

The “trustworthiness” of research depends on what counts as knowledge. The general purposes of research can include knowledge production, understanding, and prediction. This study focused on the first two. The first is the production of applied knowledge and process knowledge, both of which are suitable for developmental research. Applied knowledge is context-specific and useful for the solution of practical problems, while process knowledge is usually specified in terms of models. A qualitative approach is called for in the study of context and contextual influences. The second purpose is promoting understanding of a process; in this case, the instructional approach and the underlying phenomena, including participation and dialogue. A secondary aspect is a better understanding of what is required to use the development cycle as a form of inquiry.

DATA MANAGEMENT

Data management involves the procedures used for a systematic, coherent process of data collection, storage and retrieval, with the aim of high quality, accessible data, documentation of the analysis, and retention of the data. The design decisions, implementation, and evaluation of the instructional design course were documented in eight data sources: working logs, electronic mail, syllabi, conferences, draft ID projects, completed ID projects, course evaluations, and self-evaluations. These data sources, which were based on observations, interviews, or documents, were event-driven, meaning that they served our instructional needs to watch, ask, and examine. These observations, interviews, and documents were in place prior to the conceptual framework of the study. As a result, the data sources were not as complete, tightly defined, or structured across the six cases as they would have been had they been researcher-driven. Some data sources, such as syllabi, course evaluations, and self-evaluations, evolved to suit the learning needs of the students. However, because we had presented at research conferences, we had collected and stored data for each case, as well as conducted analysis with most of the data sources, although using different methodologies. These research efforts can be regarded as interim analyses in which we became familiar with procedures for recording observations and personal conferences, as well as retaining and analyzing documents. Over the six cases, we came to have a better understanding of the instructional setting, becoming sensitive to research opportunities and more systematic in our data collection and management efforts, while retaining instruction and responsiveness to learners as our top priority.

DATA ANALYSIS

• Within-Case Analysis: Data displays, structured summaries, and charts allowed a condensed view of the data sources and revealed that some further analysis was needed, such as coding of structured summaries to reveal themes as well as to identify exceptions and differences.
• Across-Case Analysis: A means to report the changes in design decisions, implementation, and evaluation of the model across the six cases. “Each case has a specific history - which we discard at our peril - but it is a history contained within the general principles that influence its development”. This summary attempted to preserve the uniqueness of each case, yet also make comparisons along the developmental cycle. In an effort to extend external validity, what participants “did, said, or designed” was examined in multiple settings. The processes of participation and dialogue within participation were examined in six different developmental configurations, which can be viewed as a replication of the focus of the study. This yielded a set of generalizations on how the model was implemented, as well as the conditions necessary for its use. The danger of this generalization is that “multiple cases will be analysed at high levels of inference, aggregating out the local webs of causality and ending with a smoothed set of generalizations that may not apply to any single case”. The goal was to better understand the overall processes at work across the cases; in this case, teacher and student thinking, participation and dialogue, and teacher responsivity, represented within design decisions, class implementation, and model evaluation. We did not, for example, average course evaluation results, as one way to avoid misinterpretation and superficiality and to preserve case uniqueness.
• Analytic Validity: In traditional instances of qualitative data collection and analysis, the researcher “shifts between cycles of inductive data collection and analysis to deductive cycles of testing and verification”. In this study, sources of data were already in place prior to conceptualizing the study framework. However, the details of the conceptual framework and the subsequent data analysis of the six cases cycled back and forth to realise more appropriate matches of methodology and method to existing data sources and research objectives. The analytic cycle for this study could be better described as one which moved between the conceptual framework, case analysis, and being clear as to the purpose of the study. Although being clear as to the purpose of the study is preferable before constructing a methodology, such purposes are not always clear given the complexity of the processes to be studied, the amount of data, and personal involvement over time.

Potential shortcomings in this research that are sources of bias include the large amount of data, which may have led to missing important information or overweighting some findings due to focusing on a particular and large set of data. Personal involvement with the course also increases the possibility that our recorded observations in working logs highlighted particular incidents while ignoring others. On the other hand, the working log recorded observations and design decisions that would otherwise have been lost to our collective memories over the five years of involvement with the course. Personal involvement as a co-instructor also implied a danger of being selective and overconfident with some data. Another shortcoming was not checking descriptions with the students in each case, or obtaining additional peer review beyond the co-instructor. To address these shortcomings, we used multiple data sources for triangulation to achieve agreement of one data source with another. Multiple sources of data, such as working logs, e-mail, and syllabi, also provided different strengths and complemented each other. Syllabi compactly recorded design decisions, while working logs and e-mail documented the thinking that influenced these design decisions. The data sources were a mix of student-generated and instructor-generated data. A significant amount of time and effort was spent carefully assembling the data reduction notebooks, which included data displays of structured summaries and charts. The effort was made towards the goal of being able to track in the documents the reasoning behind descriptions, summaries, and the generalizations made about the model. During the analysis of these data sources we looked for contrasts, comparisons, and exemplars and reported these during the data reduction so as not to filter out outliers and extreme instances. For example, in the ID project analysis we added a column to record any unique features of a project that might not be addressed by the completeness, consistency, and coherency criteria. Replication of the conceptual framework across multiple cases helped to provide evidence that what was described in one case was based upon the details of the instructional approach and the uniqueness of the setting and participants. We attempted to remain descriptive and to report what decisions were made for the ID course, what occurred during the implementation of the reflexive approach, and to systematically report the evaluation of the model in terms of student performance on ID projects, their perceptions of their learning and the course, and data informing our responsivity to learner needs. Another means of addressing verification of findings and cases was an “auditing” by the dissertation chair, who was also co-instructor in the cases under study. Through periodic reviews of methodology and analysis, numerous inconsistencies in design decisions, for example, were identified and prompted for clarification. Feedback also prompted me to clarify the procedures used to analyse the different data sources. This feedback, although one of the responsibilities of a doctoral chairperson, is another aspect of the reflexive stance that we had used in previous research: the need to assume regular, ongoing, and self-conscious documentation of our teaching. The working logs also served as a “reflexivity journal” to record these efforts.

TYPES OF RESEARCH

HISTORICAL RESEARCH IN PHYSICAL ACTIVITY

Student Learning Outcomes

Historians                                 Scientists
Study the past                             Study the present
Lack of certainty in conclusions           Conclusions have a higher degree of certainty
Writing often has a storytelling format    Writing follows a distinct format
Often study social phenomena               Usually study natural phenomena

Despite some differences, historians still try to take a “science-like” approach and work to disprove findings. They approach their topics in a logical and systematic manner. Historians have different views about social phenomena than scientists do about natural phenomena. Historians assume that humans make and give meaning to social phenomena. Historians are therefore less able to develop predictive theories than scientists, who mostly deal with natural phenomena. Historians do, however, try to generalize based on their interpretations of the data they collect. These generalizations draw on social science theories and can be examined using data taken from other periods or places.

Paradigms

When historians study a problem they use paradigms. A paradigm encompasses the historian’s beliefs that have developed as a result of previous reading, research, experiences, etc. It takes a long time to develop an approach that is coherent and identifiable.

Possible Topics for Historical Research

Any topic that has occurred in your field of interest could be studied. Possible topics might include: the development of the Health Education Department at CWU; changes in the way aerobics are taught; the evolution of female participation in sport; the marketing of university athletics; drug use over the past decade; changes in tourism; impact of the tobacco industry on health legislation, and many more. The Journal of Sport History is one of the primary sources for sport related investigation.

Using Secondary Sources to Locate Information

When beginning to look for information on any topic we usually turn to secondary sources. Secondary sources are reports - books, articles, videos, etc. - about an event of interest. Primary sources of information - firsthand accounts of the event - can often be identified as a result of reviewing these secondary sources. This process is similar to any literature review. If you know exactly what you are seeking, it is advisable to use keywords in indexes and databases. Otherwise you simply begin scanning the literature in the hope of locating information of interest. Based on this review you can become knowledgeable about general topics, find some specific related information, and identify additional resources to review. For the historical researcher this review helps in the development of specific questions to guide the research.

Developing Good Questions

As in all research, the historical researcher attempts to develop a problem by asking good questions. These must be grounded in the research and answerable. As illustrated in the examples in the text, good questions can be developed by delimiting general topics or by having a specific topic in mind and attempting to frame specific questions. This seems comparable to the concepts of deductive and inductive reasoning that we discussed earlier.

Historical Research Design

While scientists typically use one of three categories of design (description, correlation, or experimentation), historical researchers tend to focus on description and analysis.

Descriptive Research

Descriptive history constructs a map of the past and is often a first approach to a topic about which little is known. It involves locating events in time and place.

Analytic Research

Whereas descriptive research establishes background information about events, it does not attempt to explain how or why events occurred. In my dissertation I attempted to explain the reasons for the Soviet Union’s success in sports. I concluded that the creation and support of an extensive youth sport programme provided the foundation of a talented reserve of athletes that enabled the nation to maintain a highly competitive sports programme. In essence, I adopted a way of thinking that might be considered to fall under the scope of analytic research. Whereas scientists in behavioral research might design cause and effect studies, historical researchers must examine relationships between variables and try to infer causes. This process is complicated because there is rarely a single cause and you can never be certain that you have not omitted consideration of some important variable.

Working with Historical Evidence

Historical evidence can be found in varying formats - literature, photos, artifacts, oral tales, songs, etc. The historian can be likened to a detective who searches for clues using whatever sources are available. As noted earlier, primary sources are firsthand sources of evidence rather than the reports of others. Once located, all sources of evidence must be evaluated for their validity. All sources of information need to be subjected to evaluation through processes known as external and internal criticism. External criticism refers to the authenticity of the information. Is the document real or a fake? Internal criticism refers to whether the information is credible. Is it consistent and accurate? When I interviewed Soviet coaches I quickly learned that many were willing to respond to my questions whether or not they were really knowledgeable about the topic. Yes, they were real coaches, but the information they were providing was not valid. Historians follow three rules when applying the process of internal criticism:
• Rule of context - does the information fit with what was said before and after?
• Rule of perspective - did the source have a bias? Do the sources of information have an affiliation with a particular organization? For example, one might be skeptical of a document on the health aspects of smoking published by the tobacco industry.
• Rule of omission - was any information either intentionally or unintentionally omitted? I wonder how extensive the reporting of Jesse Owens’s four gold medals was in the German newspapers at the Nazi-sponsored Berlin Olympics?
Historical researchers must ask themselves, “What is this evidence, evidence of?” We must always consider the material and the source because there is often a very narrow interpretation. For example, if you were to read the CWU Strategic Plan you would learn a lot about what is supposed to be occurring across campus. But is it really happening? A Strategic Plan might not accurately convey what departments and faculty are actually doing. I discovered when conducting research on sports in the former Soviet Union that what was written often did not accurately reflect what was occurring in practice. A question that I should have asked myself when reading the official documents was, “What does this tell me and what doesn’t this tell me?” A final consideration when working with historical evidence is that of context. Context refers to the whole environment in which the evidence existed and was reported. Would it, for example, be appropriate to accept as accurate the opinion of an individual who had just been fired from a health organization you are investigating? While all evidence will have some degree of personal bias, when people have an obvious reason to be biased you must consider the evidence they present with caution.

Making Sense of the Evidence

The final question facing the historical researcher is similar to the question facing all researchers - “What does it mean?” Evidence must be laid out in a logical order and subjected to intense review. Remember from earlier in the course that truth exists only to the extent that it can be disproved. Your presentation of information must withstand rigorous scrutiny. For this reason it would be especially wise to share your ideas with reviewers regularly and attempt to address questions well in advance of any formal presentation.

META-ANALYSIS

In statistics, a meta-analysis combines the results of several studies that address a set of related research hypotheses. This is normally done by identification of a common measure of effect size, which is modelled using a form of meta-regression. The resulting overall averages, when controlling for study characteristics, can be considered meta-effect sizes, which are more powerful estimates of the true effect size than those derived in a single study under a given single set of assumptions and conditions.
From the perspective of the systemic TOGA meta-theory, a meta-analysis is an analysis performed on the level of the tools applied to a particular or more general domain-oriented analysis. Roughly speaking, any analysis A relates to a problem P in an arbitrarily selected ontological domain D and is performed using some analysis tools T; therefore we may write A(P, D, T). In such a declarative formalization, for a given analysis Ax, the meta-analysis domain is Tx (such as a method, algorithm, or methodology) in the (Px, Dx) context, i.e., MA(Pm, Dm = (Px, Tx), Tm), where MA denotes a meta-analysis operator. This systemic definition is congruent with the definitions of meta-theory, meta-knowledge and meta-system.

Backdrop

The first meta-analysis was performed by Karl Pearson in 1904, in an attempt to overcome the problem of reduced statistical power in studies with small sample sizes; analyzing the results from a group of studies can allow more accurate data analysis. However, the first meta-analysis of all conceptually identical experiments concerning a particular research issue, and conducted by independent researchers, has been identified as the 1940 book-length publication Extra-sensory perception after sixty years, authored by Duke University psychologists J. G. Pratt, J. B. Rhine, and associates. This encompassed a review of 145 reports on ESP experiments published from 1882 to 1939, and included an estimate of the influence of unpublished papers on the overall effect. Although meta-analysis is widely used in epidemiology and evidence-based medicine today, a meta-analysis of a medical treatment was not published until 1955. In the 1970s, more sophisticated analytical techniques were introduced in educational research, starting with the work of Gene V. Glass, Frank L. Schmidt and John E. Hunter. The online Oxford English Dictionary lists the first usage of the term in the statistical sense as 1976, by Glass. The statistical theory surrounding meta-analysis was greatly advanced by the work of Nambury S. Raju, Larry V. Hedges, Harris Cooper, Ingram Olkin, John E. Hunter, Jacob Cohen, Thomas C. Chalmers, and Frank L. Schmidt.

Advantages of Meta-analysis

Advantages of meta-analysis include:
• Derivation and statistical testing of overall factors/effect size parameters in related studies;
• Generalization to the population of studies;
• The ability to control for between-study variation;
• The inclusion of moderators to explain variation;
• Higher statistical power to detect an effect than is available in any single study.

Steps in a Meta-analysis

• Search of the literature.
• Selection of studies:
– based on quality criteria, e.g., the requirement of randomization and blinding in a clinical trial;
– selection of specific studies on a well-specified subject, e.g., the treatment of breast cancer;
– deciding whether unpublished studies are included, to avoid publication bias.
• Decide which dependent variables or summary measures are allowed. For instance:
– differences;
– means;
– Hedges’ g, a popular summary measure for continuous data that is standardized in order to eliminate scale differences, but which incorporates an index of variation between groups:
$\delta = \frac{\mu_t - \mu_c}{\sigma}$,
in which $\mu_t$ is the treatment mean, $\mu_c$ the control mean, and $\sigma^2$ the pooled variance.

• Model selection.

Meta-regression Models

Generally, three types of models can be distinguished in the literature on meta-analysis: simple regression, fixed effects meta-regression and random effects meta-regression.

Simple Regression

The model can be specified as

$y_j = \beta_0 + \beta_1 x_{1j} + \beta_2 x_{2j} + \cdots + \varepsilon,$

where $y_j$ is the effect size in study $j$ and $\beta_0$ the estimated overall effect size. The $x_i$ ($i = 1, \ldots, k$) are parameters specifying different study characteristics, and $\varepsilon$ specifies the between-study variation. Note that this model does not allow specification of within-study variation.

Fixed-effects Meta-regression

Fixed-effects meta-regression assumes that the true effect size $\theta$ is normally distributed with $\mathcal{N}(\theta, \sigma_\theta^2)$, where $\sigma_\theta^2$ is the within-study variance of the effect size. A fixed-effects meta-regression model thus allows for within-study variability but no between-study variability, because all studies have the expected fixed effect size $\theta$, i.e., $\varepsilon = 0$:

$y_j = \beta_0 + \beta_1 x_{1j} + \beta_2 x_{2j} + \cdots + \eta_j,$

where $\sigma_{\eta_j}^2$ is the variance of the effect size in study $j$. Fixed-effects meta-regression ignores between-study variation. As a result, parameter estimates are biased if between-study variation cannot be ignored. Furthermore, generalizations to the population are not possible.
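
The fixed-effects model above can be estimated by weighted least squares with inverse-variance weights. The following is a minimal sketch of that idea in Python with NumPy; the function name, the example effect sizes, the covariate, and the variances are all hypothetical, and the sketch is an illustration rather than a definitive implementation.

```python
import numpy as np

def fixed_effect_meta_regression(y, X, v):
    """Weighted least squares for a fixed-effects meta-regression:
    y - per-study effect sizes,
    X - study-level covariates (one row per study, first column of ones),
    v - within-study variances, supplying weights 1/v."""
    y, X, v = np.asarray(y, float), np.asarray(X, float), np.asarray(v, float)
    W = np.diag(1.0 / v)                          # inverse-variance weights
    XtWX = X.T @ W @ X
    beta = np.linalg.solve(XtWX, X.T @ W @ y)     # coefficient estimates
    cov_beta = np.linalg.inv(XtWX)                # their covariance matrix
    return beta, cov_beta

# Hypothetical data: four studies, an intercept and one moderator
beta, cov = fixed_effect_meta_regression(
    y=[0.30, 0.10, 0.45, 0.25],
    X=[[1, 0.0], [1, 1.0], [1, 0.5], [1, 2.0]],
    v=[0.02, 0.03, 0.05, 0.04])
print(beta)
```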

Random Effects Meta-regression

Random effects meta-regression rests on the assumption that the $\theta_j$ in $\mathcal{N}(\theta_j, \sigma_j^2)$ are random variables following a (hyper-)distribution $\mathcal{N}(\theta, \sigma_\theta^2)$:

$y_j = \beta_0 + \beta_1 x_{1j} + \beta_2 x_{2j} + \cdots + \eta + \varepsilon_j,$

where again $\sigma_{\varepsilon_j}^2$ is the variance of the effect size in study $j$. The between-study variance $\sigma_\eta^2$ is estimated using common estimation procedures for random effects models.

Applications in Modern Science

Modern statistical meta-analysis does more than just combine the effect sizes of a set of studies. It can test whether the studies’ outcomes show more variation than the variation expected from sampling different research participants. If that is the case, study characteristics such as the measurement instrument used, the population sampled, or aspects of the studies’ design are coded. These characteristics are then used as predictor variables to analyze the excess variation in the effect sizes. Some methodological weaknesses in studies can be corrected statistically. For example, it is possible to correct effect sizes or correlations for the downward bias due to measurement error or restriction of score ranges.
Meta-analysis leads to a shift of emphasis from single studies to multiple studies. It emphasizes the practical importance of the effect size instead of the statistical significance of individual studies. This shift in thinking has been termed “meta-analytic thinking”. The results of a meta-analysis are often shown in a forest plot.
Results from studies are combined using different approaches. One approach frequently used in meta-analysis in health care research is termed the ‘inverse variance method’. The average effect size across all studies is computed as a weighted mean, whereby the weights are equal to the inverse variance of each study’s effect estimator. Larger studies and studies with less random variation are given greater weight than smaller studies. Other common approaches include the Mantel-Haenszel method and the Peto method. A recent approach to studying the influence that weighting schemes can have on results has been proposed through the construct of gravity, which is a special case of combinatorial meta-analysis.
Signed differential mapping is a statistical technique for meta-analyzing studies on differences in brain activity or structure which use neuroimaging techniques such as fMRI, VBM or PET.
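
As a concrete illustration of the inverse variance method, the sketch below pools hypothetical study effect sizes with inverse-variance weights; the optional random-effects branch uses the DerSimonian-Laird moment estimator, one common choice for the between-study variance mentioned above. The function name and the numbers are illustrative, not taken from the text.

```python
import numpy as np

def inverse_variance_meta(effects, variances, random_effects=False):
    """Pool study effect sizes with inverse-variance weights.
    If random_effects is True, a DerSimonian-Laird estimate of the
    between-study variance (tau^2) is added to each study's variance."""
    y = np.asarray(effects, dtype=float)
    v = np.asarray(variances, dtype=float)
    w = 1.0 / v                                  # fixed-effect weights

    if random_effects:
        # DerSimonian-Laird moment estimator of tau^2
        fixed_mean = np.sum(w * y) / np.sum(w)
        q = np.sum(w * (y - fixed_mean) ** 2)    # Cochran's Q
        df = len(y) - 1
        c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
        tau2 = max(0.0, (q - df) / c)
        w = 1.0 / (v + tau2)                     # random-effects weights

    pooled = np.sum(w * y) / np.sum(w)           # weighted mean effect
    se = np.sqrt(1.0 / np.sum(w))                # standard error of the pooled effect
    return pooled, se

# Example with three hypothetical studies
pooled, se = inverse_variance_meta([0.30, 0.10, 0.45], [0.02, 0.03, 0.05])
print(pooled, se, pooled - 1.96 * se, pooled + 1.96 * se)
```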

Weaknesses

Meta-analysis can never follow the rules of hard science, for example being double-blind, controlled, or proposing a way to falsify the theory in question. It can only be a statistical examination of scientific studies, not an actual scientific study itself. A weakness of the method is that sources of bias are not controlled by the method: a good meta-analysis of badly designed studies will still result in bad statistics. Robert Slavin has argued that only methodologically sound studies should be included in a meta-analysis, a practice he calls ‘best evidence meta-analysis’. Other meta-analysts would include weaker studies, and add a study-level predictor variable that reflects the methodological quality of the studies, to examine the effect of study quality on the effect size.

File Drawer Problem

Another weakness of the method is the heavy reliance on published studies, which may create exaggerated outcomes, as it is very hard to publish studies that show no significant results. This file drawer problem results in a distribution of effect sizes that is biased, skewed or completely cut off, creating a serious base rate fallacy. If there were fifty tests and only ten got results, then the real outcome is only 20% as significant as it appears, except that the other 80% were thrown out by publishers as uninteresting. This should be seriously considered when interpreting the outcomes of a meta-analysis. The problem can be visualized with a funnel plot, which is a scatter plot of sample size and effect sizes. There are several procedures available that attempt to correct for the file drawer problem once it is identified, such as guessing at the cut-off part of the distribution of study effects.
Other weaknesses are Simpson’s paradox; the coding of an effect is subjective; the decision to include or reject a particular study is subjective; there are two different ways to measure effect, correlation or standardized mean difference; the interpretation of effect size is purely arbitrary; it has not been determined whether the statistically most accurate method for combining results is the fixed effects model or the random effects model; and, for medicine, the underlying risk in each studied group is of significant importance, and there is no universally agreed-upon way to weight the risk. The example provided by the Rind et al. controversy illustrates an application of meta-analysis which has been the subject of subsequent criticisms of many of the components of the meta-analysis.

Dangers of Agenda Driven Bias

The most severe weakness and abuse of meta-analysis often occurs when the person or persons doing the meta-analysis have an economic, social, or political agenda such as the passage or defeat of legislation. Those persons with these types of agenda have a high likelihood to abuse meta-analysis due to personal bias. For example, researchers favorable to the author’s agenda are likely to have their studies “cherry picked” while those not favorable will be ignored or labeled as “not credible”. In addition, the favored authors may themselves be biased or paid to produce results that support their overall political, social, or economic goals in ways such as selecting small favorable data sets and not incorporating larger unfavorable data sets. If a meta-analysis is conducted by an individual or organization with a bias or predetermined desired outcome, it should be treated as highly suspect or having a high likelihood of being “junk science”. From an integrity perspective, researchers with a bias should avoid meta-analysis and use a less abuse-prone form of research.

Combinatorial Meta-analysis

Combinatorial meta-analysis (CMA) is the study of the behaviour of statistical properties of combinations of studies from a meta-analytic dataset. In an article that develops the notion of “gravity” in the context of meta-analysis, Dr. Travis Gee proposed that the jackknife methods applied to meta-analysis in that article could be extended to examine all possible combinations of studies or random subsets of studies.

Concept

In the original article, k objects are combined k-1 at a time, resulting in k estimates. It is observed that this is a special case of the more general approach of CMA which computes results for k studies taken 1, 2, 3... k – 1, k at a time. Where it is computationally feasible to obtain all possible combinations, the resulting distribution of statistics is termed “exact CMA.” Where the number of possible combinations is prohibitively large, it is termed “approximate CMA.” CMA makes it possible to study the relative behaviour of different statistics under combinatorial conditions. This differs from the standard approach in meta-analysis of adopting a single method and computing a single result, and allows significant triangulation to occur, by computing different indices for each combination and examining whether they all tell the same story.

Implications

An implication of this is that where multiple random intercepts exist, the heterogeneity within certain combinations will be minimized. CMA can thus be used as a data mining method to identify the number of intercepts that may be present in the dataset by looking at which studies are included in the local minima that may be obtained through recombination. A further implication of this is that arguments over inclusion or exclusion of studies may be moot when the distribution of all possible results is taken into account. A useful tool developed by Dr. Gee is the “PPES” plot. For each subset of combinations, where studies are taken j = 1, 2, ... k – 1, k at a time, the proportion of results that show a positive effect size is taken, and this is plotted against j. This can be adapted to a “PMES” plot, where the proportion of studies exceeding some minimal effect size is taken for each value of j = 1, 2, ... k – 1, k. Where a clear effect is present, this plot should asymptote to near 1.0 fairly rapidly. With this, it is possible then that, for instance, disputes over the inclusion or exclusion of two or three studies out of a dozen or more may be framed in the context of a plot that shows a clear effect for any combination of 7 or more studies. It is also possible through CMA to examine the relationship of covariates with effect sizes. For example, if industry funding is suspected as a source of bias, then the proportion of studies in a given subset that were industry funded can be computed and plotted directly against the effect size estimate. If average age in the various studies was itself fairly variable, then the mean of these means across studies in a given combination can be obtained, and similarly plotted.
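
The exact-CMA idea can be sketched by brute-force enumeration of subsets. The following is only an illustration of the PPES-style summary described above (the proportion of combinations with a pooled effect above a threshold), not Dr. Gee's implementation; the function name and data are hypothetical, and exhaustive enumeration is feasible only for a small number of studies.

```python
from itertools import combinations
import numpy as np

def ppes_curve(effects, variances, min_effect=0.0):
    """For each subset size j, the proportion of all j-study combinations
    whose inverse-variance pooled effect exceeds min_effect."""
    y, v = np.asarray(effects, float), np.asarray(variances, float)
    k = len(y)
    proportions = []
    for j in range(1, k + 1):
        pooled = []
        for idx in combinations(range(k), j):
            w = 1.0 / v[list(idx)]
            pooled.append(np.sum(w * y[list(idx)]) / np.sum(w))
        proportions.append(np.mean(np.array(pooled) > min_effect))
    return proportions

# Hypothetical effects: proportion of combinations with a positive pooled effect
print(ppes_curve([0.4, 0.2, -0.1, 0.5, 0.3], [0.04, 0.05, 0.06, 0.03, 0.05]))
```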

Limitations

CMA does not solve meta-analysis’s problem of “garbage in, garbage out.” However, when a class of studies is deemed garbage by a critic, it does offer a way of examining the extent to which those studies may have changed a result. Similarly, it offers no direct solution to the problem of which method to choose for combination or weighting. What it does offer, as noted above, is triangulation, where agreements between methods may be obtained, and disagreements between methods understood across the range of possible combinations of studies.

Critical Appraisal

Critical appraisal is the use of explicit, transparent methods to assess the data in published research, applying the rules of evidence to factors such as validity, adherence to reporting standards, methods, conclusions and generalizability. Critical appraisal methods form a central part of the systematic review process. They are used in evidence-based health care training to assist clinical decision-making, and are increasingly used in evidence-based social care and education provision.

Effect Size

In statistics, an effect size is a measure of the strength of the relationship between two variables in a statistical population, or a sample-based estimate of that quantity. An effect size calculated from data is a descriptive statistic that conveys the estimated magnitude of a relationship without making any statement about whether the apparent relationship in the data reflects a true relationship in the population. In that way, effect sizes complement inferential statistics such as p-values. Among other uses, effect size measures play an important role in meta-analysis studies that summarize findings from a specific area of research, and in statistical power analyses.
The concept of effect size appears already in everyday language. For example, a weight loss programme may boast that it leads to an average weight loss of 30 pounds. In this case, 30 pounds is an indicator of the claimed effect size. Another example is that a tutoring programme may claim that it raises school performance by one letter grade. This grade increase is the claimed effect size of the programme. These are both examples of “absolute effect sizes,” meaning that they convey the average difference between two groups without any discussion of the variability within the groups. For example, if the weight loss programme results in an average loss of 30 pounds, we do not know whether every participant loses exactly 30 pounds, or whether half the participants lose 60 pounds and half lose no weight at all.
Reporting effect sizes is considered good practice when presenting empirical research findings in many fields. Effect sizes are particularly prominent in social and medical research. Relative and absolute measures of effect size convey different information, and can be used complementarily. A prominent task force in the psychology research community expressed the following recommendation: Always present effect sizes for primary outcomes... If the units of measurement are meaningful on a practical level (e.g., number of cigarettes smoked per day), then we usually prefer an unstandardized measure (regression coefficient or mean difference) to a standardized measure (r or d). – L. Wilkinson and APA Task Force on Statistical Inference

Population and Sample Effect Sizes

The term effect size can refer to a statistic calculated from a sample of data, or to a parameter of a hypothetical statistical population. Conventions for distinguishing sample from population effect sizes follow standard statistical practice: one common approach is to use Greek letters like ρ to denote population parameters and Latin letters like r to denote the corresponding statistic; alternatively, a “hat” can be placed over the population parameter to denote the statistic, e.g., with ρ̂ being the estimate of the parameter ρ.
As in any statistical setting, effect sizes are estimated with error, and may be biased unless the effect size estimator that is used is appropriate for the manner in which the data were sampled and the manner in which the measurements were made. An example of this is publication bias, which occurs when scientists only report results when the estimated effect sizes are large or statistically significant. As a result, if many researchers carry out studies under low statistical power, the reported results are biased to be stronger than the true effects.

Relationship to Test Statistics

Sample-based effect sizes are distinguished from test statistics used in hypothesis testing, in that they estimate the strength of an apparent relationship, rather than assigning a significance level reflecting whether the relationship could be due to chance. The effect size does not determine the significance level, or vice-versa. Given a sufficiently large sample size, a statistical comparison will always show a significant difference unless the population effect size is exactly zero. For example, a sample Pearson correlation coefficient of 0.1 is strongly statistically significant if the sample size is 1000. Reporting only the small p-value from this analysis could be misleading if a correlation of 0.1 is too small to be of interest in a particular application.

Standardized and Unstandardized Effect Sizes

The term effect size can refer to a standardized measure of effect or to an unstandardized measure. Standardized effect size measures are typically used when the metrics of the variables being studied do not have intrinsic meaning, when results from multiple studies are being combined and some or all of the studies use different scales, or when it is desired to convey the size of an effect relative to the variability in the population. In meta-analysis, standardized effect sizes are used as a common measure that can be calculated for different studies and then combined into an overall summary.

Types

Pearson r Correlation

Pearson’s correlation, often denoted r and introduced by Karl Pearson, is widely used as an effect size when paired quantitative data are available; for instance, if one were studying the relationship between birth weight and longevity. The correlation coefficient can also be used when the data are binary. Pearson’s r can vary in magnitude from –1 to 1, with –1 indicating a perfect negative linear relation, 1 indicating a perfect positive linear relation, and 0 indicating no linear relation between two variables. Cohen gives the following guidelines for the social sciences: small effect size, r = 0.1 – 0.23; medium, r = 0.24 – 0.36; large, r = 0.37 or larger.
A related effect size is the coefficient of determination (the square of r, referred to as “r-squared”). In the case of paired data, this is a measure of the proportion of variance shared by the two variables, and varies from 0 to 1. An r² of 0.21 means that 21% of the variance of either variable is shared with the other variable. The r² is always positive, so it does not convey the direction of the relationship between the two variables.
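
For paired quantitative data, r and r² can be obtained directly with NumPy. The data below are invented purely to illustrate the calculation and carry no empirical meaning.

```python
import numpy as np

# Hypothetical paired data: birth weight (kg) and longevity (years)
birth_weight = np.array([2.9, 3.1, 3.4, 2.7, 3.8, 3.0, 3.6, 3.3])
longevity    = np.array([71,  74,  78,  69,  80,  73,  79,  75])

r = np.corrcoef(birth_weight, longevity)[0, 1]   # Pearson correlation
r_squared = r ** 2                               # proportion of shared variance
print(round(r, 3), round(r_squared, 3))
```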

Effect Sizes Based on Means

A (population) effect size θ based on means usually considers the standardized mean difference between two populations,

$\theta = \frac{\mu_1 - \mu_2}{\sigma},$

where $\mu_1$ is the mean for one population, $\mu_2$ is the mean for the other population, and $\sigma$ is a standard deviation based on either or both populations. In the practical setting the population values are typically not known and must be estimated from sample statistics. The several versions of effect sizes based on means differ with respect to which statistics are used. This form of the effect size resembles the computation of a t-test statistic, with the critical difference that the t-test statistic includes a factor of $\sqrt{n}$. This means that for a given effect size, the significance level increases with the sample size. Unlike the t-test statistic, the effect size aims to estimate a population parameter, so it is not affected by the sample size.

Cohen’s d

Cohen’s d is defined as the difference between two means divided by a standard deviation for the data,

$d = \frac{\bar{x}_1 - \bar{x}_2}{s}.$

What precisely the standard deviation $s$ is was not originally made explicit by Jacob Cohen, because he defined it (using the symbol $\sigma$) as “the standard deviation of either population (since they are assumed equal)”. Other authors make the computation of the standard deviation more explicit with the following definition for a pooled standard deviation,

$s = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2}},$

with $\bar{x}_k$ and $s_k$ as the mean and standard deviation for group $k$ ($k = 1, 2$), where

$s_1^2 = \frac{1}{n_1 - 1}\sum_{i=1}^{n_1}\left(x_{1,i} - \bar{x}_1\right)^2.$

This definition of “Cohen’s d” is termed the maximum likelihood estimator by Hedges and Olkin, and it is related to Hedges’ g by a scaling factor,

$g = \sqrt{\frac{n_1 + n_2 - 2}{n_1 + n_2}}\; d.$

Glass’s ∆

In 1976 Gene V. Glass proposed an estimator of the effect size that uses only the standard deviation of the second group,

$\Delta = \frac{\bar{x}_1 - \bar{x}_2}{s_2}.$

The second group may be regarded as a control group, and Glass argued that if several treatments were compared to the control group it would be better to use just the standard deviation computed from the control group, so that effect sizes would not differ under equal means and different variances. Under an assumption of equal population variances a pooled estimate for σ is more precise.

Hedges’ g

Hedges’ g, suggested by Larry Hedges in 1981, is like the other measures based on a standardized difference,

$g = \frac{\bar{x}_1 - \bar{x}_2}{s^*},$

but its pooled standard deviation $s^*$ is computed slightly differently from Cohen’s d,

$s^* = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}.$

As an estimator for the population effect size θ it is biased. However, this bias can be corrected by multiplication with a factor,

$g^* = J(n_1 + n_2 - 2)\, g \approx \left(1 - \frac{3}{4(n_1 + n_2) - 9}\right) g.$

Hedges and Olkin refer to this unbiased estimator $g^*$ as d, but it is not the same as Cohen’s d. The exact form of the correction factor $J(\cdot)$ involves the gamma function,

$J(a) = \frac{\Gamma(a/2)}{\sqrt{a/2}\,\Gamma\!\left((a-1)/2\right)}.$
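
A short sketch of Hedges’ g with the approximate small-sample correction given above, assuming NumPy; the two samples are invented for illustration.

```python
import numpy as np

def hedges_g(x1, x2, corrected=True):
    """Standardized mean difference with the pooled SD that divides by
    n1 + n2 - 2 (Hedges' g); optionally apply the approximate small-sample
    bias correction 1 - 3 / (4(n1 + n2) - 9)."""
    x1, x2 = np.asarray(x1, float), np.asarray(x2, float)
    n1, n2 = len(x1), len(x2)
    s_pooled = np.sqrt(((n1 - 1) * x1.var(ddof=1) + (n2 - 1) * x2.var(ddof=1))
                       / (n1 + n2 - 2))
    g = (x1.mean() - x2.mean()) / s_pooled
    if corrected:
        g *= 1.0 - 3.0 / (4.0 * (n1 + n2) - 9.0)
    return g

# Hypothetical treatment and control scores
treatment = [12.1, 13.4, 11.8, 14.0, 12.9, 13.7]
control   = [10.9, 11.5, 12.0, 10.4, 11.8, 11.1]
print(round(hedges_g(treatment, control), 3))
```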

Distribution of Effect Sizes Based on Means

Provided that the data are Gaussian distributed, a scaled Hedges’ g, $\sqrt{n_1 n_2/(n_1 + n_2)}\, g$, follows a non-central t-distribution with non-centrality parameter $\sqrt{n_1 n_2/(n_1 + n_2)}\,\theta$ and $n_1 + n_2 - 2$ degrees of freedom. Likewise, the scaled Glass’ ∆ is distributed with $n_2 - 1$ degrees of freedom. From the distribution it is possible to compute the expectation and variance of the effect sizes. In some cases large-sample approximations for the variance are used. One suggestion for the variance of Hedges’ unbiased estimator is

$\hat{\sigma}^2(g^*) = \frac{n_1 + n_2}{n_1 n_2} + \frac{(g^*)^2}{2(n_1 + n_2)}.$

Cohen’s ƒ²

Cohen’s ƒ² is an appropriate effect size measure to use in the context of an F-test for ANOVA or multiple regression. The ƒ² effect size measure for multiple regression is defined as

$f^2 = \frac{R^2}{1 - R^2},$

where $R^2$ is the squared multiple correlation. The ƒ² effect size measure for hierarchical multiple regression is defined as

$f^2 = \frac{R_{AB}^2 - R_A^2}{1 - R_{AB}^2},$

where $R_A^2$ is the variance accounted for by a set of one or more independent variables A, and $R_{AB}^2$ is the combined variance accounted for by A and another set of one or more independent variables B. By convention, ƒ² effect sizes of 0.02, 0.15, and 0.35 are termed small, medium, and large, respectively.
Cohen’s $\hat{f}$ can also be found for factorial analysis of variance (ANOVA, i.e., the F-test) by working backwards, using

$\hat{f}_{\text{effect}} = \sqrt{(df_{\text{effect}} / N)(F_{\text{effect}} - 1)}.$

In a balanced design (equal sample sizes across groups) of ANOVA, the corresponding population parameter of $f^2$ is

$f^2 = \frac{SS(\mu_1, \mu_2, \ldots, \mu_K)}{K \times \sigma^2},$

wherein $\mu_j$ denotes the population mean within the j-th group of the total K groups, and σ the equivalent population standard deviation within each group. SS is the sum of squares in ANOVA.
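
The two regression formulas and the ANOVA back-calculation translate directly into code. The sketch below is illustrative only; the function names and numbers are invented, and the ANOVA relation is the approximate one given above.

```python
def cohens_f2(r2_ab, r2_a=0.0):
    """Cohen's f^2 for (hierarchical) multiple regression:
    (R^2_AB - R^2_A) / (1 - R^2_AB); with r2_a = 0 it reduces to R^2 / (1 - R^2)."""
    return (r2_ab - r2_a) / (1.0 - r2_ab)

def f_from_anova(f_stat, df_effect, n_total):
    """Work backwards from an ANOVA F statistic to Cohen's f,
    using f = sqrt((df_effect / N) * (F - 1))."""
    return ((df_effect / n_total) * (f_stat - 1.0)) ** 0.5

# Hypothetical values: a block of predictors raises R^2 from 0.20 to 0.35
print(round(cohens_f2(0.35, 0.20), 3))      # hierarchical f^2
print(round(f_from_anova(4.2, 2, 90), 3))   # f recovered from F with df_effect = 2, N = 90
```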

φ, Cramér’s φ, or Cramér’s V

The best measure of association for the chi-square test is phi. Phi is related to the point-biserial correlation coefficient and Cohen’s d and estimates the extent of the relationship between two variables (2 × 2). Cramér’s Phi may be used with variables having more than two levels. Phi can be computed by finding the square root of the chi-square statistic divided by the sample size. Similarly, Cramér’s phi is computed by taking the square root of the chi-square statistic divided by the sample size and the length of the minimum dimension (k is the smaller of the number of rows r or columns c).

φc is the intercorrelation of the two discrete variables and may be computed for any value of r or c. However, as chi-square values tend to increase with the number of cells, the greater the difference between r and c, the more likely φc will tend to 1 without strong evidence of a meaningful correlation. Cramér’s phi may also be applied to ‘goodness of fit’ chi-square models (i.e., those where c = 1). In this case it functions as a measure of tendency towards a single outcome (i.e., out of k outcomes).
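
A minimal sketch of Cramér’s V computed as described above, assuming SciPy for the chi-square statistic; the contingency table is hypothetical.

```python
import numpy as np
from scipy.stats import chi2_contingency

def cramers_v(table):
    """Cramer's V: sqrt(chi2 / (n * (k - 1))), where k is the smaller of
    the number of rows and columns of the contingency table."""
    table = np.asarray(table)
    chi2 = chi2_contingency(table)[0]         # chi-square statistic
    n = table.sum()                           # total sample size
    k = min(table.shape) - 1
    return np.sqrt(chi2 / (n * k))

# Hypothetical 2 x 3 contingency table of counts
print(round(cramers_v([[20, 15, 25], [30, 10, 20]]), 3))
```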

Odds Ratio

The odds ratio is another useful effect size. It is appropriate when both variables are binary. For example, consider a study on spelling. In a control group, two students pass the class for every one who fails, so the odds of passing are two to one (or more briefly 2/1 = 2). In the treatment group, six students pass for every one who fails, so the odds of passing are six to one (or 6/1 = 6). The effect size can be computed by noting that the odds of passing in the treatment group are three times higher than in the control group (because 6 divided by 2 is 3). Therefore, the odds ratio is 3. However, odds ratio statistics are on a different scale to Cohen’s d. So, this ‘3’ is not comparable to a Cohen’s d of 3.

Relative Risk The relative risk (RR), also called risk ratio, is simply the risk (probability) of an event relative to some independent variable. This measure of effect size differs from the odds ratio in that it compares probabilities instead of odds, but asymptotically approaches the latter for small probabilities. Using the example above, the probabilities for those in the control group and treatment group passing is 2/3 (or 0.67) and 6/7 (or 0.86), respectively. The effect size can be computed the same as above, but using the probabilities instead. Therefore, the relative risk is 1.28. Since rather large probabilities of passing were used, there is a large difference between relative risk and odds ratio. Had failure (a smaller probability) been used as the event (rather than passing), the difference between the two measures of effect size would not be so great. While both measures are useful, they have different statistical uses. In medical research, for example, the odds ratio is favoured for case-control studies and retrospective studies. Relative risk is used in randomized controlled trials and cohort studies. It is also worth noting that one cannot automatically predict the other, though with small probabilities one can be considered a fairly good estimate of the other. 36 Research Methodology and Statistical Methods

Confidence Interval and Relation to Non-central Parameters

Confidence intervals of unstandardized effect sizes, such as the difference of means (µ1 – µ2), can be found in common statistics textbooks and software, while confidence intervals of standardized effect sizes, especially Cohen’s

$\tilde{d} := \frac{\mu_1 - \mu_2}{\sigma}$ and $\tilde{f}^2 := \frac{SS(\mu_1, \mu_2, \ldots, \mu_K)}{K \cdot \sigma^2},$

rely on the calculation of confidence intervals of non-central parameters (ncp). A common approach to constructing a (1 – α) confidence interval for the ncp is to find the critical ncp values that fit the observed statistic to the tail quantiles α/2 and (1 – α/2). SAS and the R package MBESS provide functions for critical ncp. An online calculator based on R and MediaWiki provides an interactive interface, which requires no coding and welcomes copy-left collaboration.

T Test for Mean Difference of Single Group or Two Related Groups

In the case of a single group, M (µ) denotes the sample (population) mean of the group, and SD (σ) denotes the sample (population) standard deviation. N is the sample size of the group. The t test is used for the hypothesis on the difference between the mean and a baseline µbaseline. Usually, µbaseline is zero, though this is not necessary. In the case of two related groups, the single group is constructed from the differences in each pair of samples, while SD (σ) denotes the sample (population) standard deviation of the differences rather than within the original two groups.

$t := \frac{M - \mu_{\text{baseline}}}{SD/\sqrt{N}} = \frac{\sqrt{N}\,\frac{M - \mu}{\sigma} + \sqrt{N}\,\frac{\mu - \mu_{\text{baseline}}}{\sigma}}{SD/\sigma},$

$ncp = \sqrt{N}\,\frac{\mu - \mu_{\text{baseline}}}{\sigma},$

and Cohen’s

$d := \frac{M - \mu_{\text{baseline}}}{SD}$

is the point estimate of

$\frac{\mu - \mu_{\text{baseline}}}{\sigma}.$

So,

$\tilde{d} = \frac{ncp}{\sqrt{N}}.$
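
Because the scaled effect size follows a non-central t-distribution, a confidence interval for the ncp (and hence for d) can be found by searching for the non-centrality values whose tail quantiles match the observed t, as described above. The sketch below assumes SciPy; the bracketing ranges and the conversion to d for the single-group case are illustrative choices, not prescribed by the text.

```python
from scipy.stats import nct
from scipy.optimize import brentq

def ncp_confidence_interval(t_obs, df, alpha=0.05):
    """Critical non-centrality parameters: ncp_lo with P(T >= t_obs) = alpha/2
    and ncp_hi with P(T <= t_obs) = alpha/2, given df degrees of freedom."""
    span = 10.0 + 10.0 * abs(t_obs)          # generous search bracket
    ncp_lo = brentq(lambda nc: nct.sf(t_obs, df, nc) - alpha / 2,
                    t_obs - span, t_obs)
    ncp_hi = brentq(lambda nc: nct.cdf(t_obs, df, nc) - alpha / 2,
                    t_obs, t_obs + span)
    return ncp_lo, ncp_hi

# Single-group example: observed t = 2.8 with N = 25 (df = 24);
# dividing the ncp limits by sqrt(N) gives a confidence interval for d.
lo, hi = ncp_confidence_interval(2.8, df=24)
print(lo / 25 ** 0.5, hi / 25 ** 0.5)
```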

T Test for Mean Difference between Two Independent Groups

Here $n_1$ and $n_2$ are the sample sizes within the respective groups, and

$t := \frac{M_1 - M_2}{SD_{\text{within}} \Big/ \sqrt{\frac{n_1 n_2}{n_1 + n_2}}},$

wherein

$SD_{\text{within}} := \sqrt{\frac{SS_{\text{within}}}{df_{\text{within}}}} = \sqrt{\frac{(n_1 - 1)SD_1^2 + (n_2 - 1)SD_2^2}{n_1 + n_2 - 2}},$

$ncp = \sqrt{\frac{n_1 n_2}{n_1 + n_2}}\,\frac{\mu_1 - \mu_2}{\sigma},$

and Cohen’s

$d := \frac{M_1 - M_2}{SD_{\text{within}}}$

is the point estimate of $\frac{\mu_1 - \mu_2}{\sigma}$. So,

$\tilde{d} = \frac{ncp}{\sqrt{\frac{n_1 n_2}{n_1 + n_2}}}.$

One-way ANOVA Test for Mean Difference Across Multiple Independent Groups

The one-way ANOVA test applies the non-central F distribution, while with a given population standard deviation σ the same test question applies the non-central chi-square distribution:

$F := \frac{\left(SS_{\text{between}}/\sigma^2\right) / df_{\text{between}}}{\left(SS_{\text{within}}/\sigma^2\right) / df_{\text{within}}}.$

For the j-th sample within the i-th group, $X_{i,j}$, denote

$M_i(X_{i,j}) := \frac{\sum_{w=1}^{n_i} X_{i,w}}{n_i}; \qquad \mu_i(X_{i,j}) := \mu_i.$

While,

$\frac{SS\!\left(M_i(X_{i,j});\ i = 1,2,\ldots,K,\ j = 1,2,\ldots,n_i\right)}{\sigma^2} = SS\!\left(\frac{M_i(X_{i,j}) - \mu_i}{\sigma} + \frac{\mu_i}{\sigma};\ i = 1,2,\ldots,K,\ j = 1,2,\ldots,n_i\right)$

$\sim \chi^2\!\left(df = K - 1,\ ncp = SS\!\left(\frac{\mu_i(X_{i,j})}{\sigma};\ i = 1,2,\ldots,K,\ j = 1,2,\ldots,n_i\right)\right).$

So both ncp(s), of F and of χ², equal

$SS\!\left(\mu_i(X_{i,j})/\sigma;\ i = 1,2,\ldots,K,\ j = 1,2,\ldots,n_i\right).$

In the case of $n := n_1 = n_2 = \cdots = n_K$ for K independent groups of the same size, the total sample size is $N := n \cdot K$, and

$\text{Cohen's } \tilde{f}^2 := \frac{SS(\mu_1, \mu_2, \ldots, \mu_K)}{K \cdot \sigma^2} = \frac{SS\!\left(\mu_i(X_{i,j})/\sigma;\ i = 1,2,\ldots,K,\ j = 1,2,\ldots,n\right)}{n \cdot K} = \frac{ncp}{n \cdot K} = \frac{ncp}{N}.$

The t-test of a pair of independent groups is a special case of one-way ANOVA. Note that the non-central parameter $ncp_F$ of F is not comparable to the non-central parameter $ncp_t$ of the corresponding t. Actually, $ncp_F = ncp_t^2$, and in this case $\tilde{f} = \frac{\tilde{d}}{2}$.
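
The relation $ncp_F = ncp_t^2$ is mirrored in the sample statistics: for two groups, the one-way ANOVA F equals the square of the independent-samples t. The sketch below checks this numerically; it assumes NumPy and SciPy and uses simulated data, so the particular numbers are illustrative only.

```python
import numpy as np
from scipy.stats import f_oneway, ttest_ind

# Two hypothetical independent groups of equal size
rng = np.random.default_rng(0)
g1 = rng.normal(0.5, 1.0, size=20)
g2 = rng.normal(0.0, 1.0, size=20)

f_stat, _ = f_oneway(g1, g2)     # one-way ANOVA with two groups
t_stat, _ = ttest_ind(g1, g2)    # independent-samples t-test

print(round(f_stat, 4), round(t_stat ** 2, 4))  # F equals t squared
```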

“Small”, “Medium”, “Large”

Some fields using effect sizes apply words such as “small”, “medium” and “large” to the size of the effect. Whether an effect size should be interpreted as small, medium, or large depends on its substantive context and its operational definition. Cohen’s conventional criteria of small, medium, or large are near-ubiquitous across many fields. Power analysis or sample size planning requires an assumed population parameter for the effect size. Many researchers adopt Cohen’s standards as default alternative hypotheses. Russell Lenth criticized them as “T-shirt effect sizes”:
This is an elaborate way to arrive at the same sample size that has been used in past social science studies of large, medium, and small size (respectively). The method uses a standardized effect size as the goal. Think about it: for a “medium” effect size, you’ll choose the same n regardless of the accuracy or reliability of your instrument, or the narrowness or diversity of your subjects. Clearly, important considerations are being ignored here. “Medium” is definitely not the message!
For Cohen’s d an effect size of 0.2 to 0.3 might be a “small” effect, around 0.5 a “medium” effect and 0.8 to infinity a “large” effect. Cohen’s text anticipates Lenth’s concerns:
“The terms ‘small,’ ‘medium,’ and ‘large’ are relative, not only to each other, but to the area of behavioral science or even more particularly to the specific content and research method being employed in any given investigation.... In the face of this relativity, there is a certain risk inherent in offering conventional operational definitions for these terms for use in power analysis in as diverse a field of inquiry as behavioral science. This risk is nevertheless accepted in the belief that more is to be gained than lost by supplying a common conventional frame of reference which is recommended for use only when no better basis for estimating the ES index is available.”
The last two decades have seen the widespread use of these conventional operational definitions as standard practice in the calculation of sample sizes, despite the fact that Cohen was clear that the small-medium-large categories were not to be used by serious researchers, except perhaps in the context of research with entirely novel variables.

Fisher’s Method

In statistics, Fisher’s method, also known as Fisher’s combined probability test, is a technique for data fusion or “meta-analysis”. It was developed by, and named for, Ronald Fisher. In its basic form, it is used to combine the results from several independent tests bearing upon the same overall hypothesis (H0).

Application to Independent Test Statistics

Fisher’s method combines extreme value probabilities from each test, commonly known as “p-values”, into one test statistic (X²) using the formula

$X^2 = -2\sum_{i=1}^{k}\log_e(p_i),$

where $p_i$ is the p-value for the i-th hypothesis test. When the p-values tend to be small, the test statistic X² will be large, which suggests that the null hypotheses are not true for every test.
When all the null hypotheses are true, and the $p_i$ (or their corresponding test statistics) are independent, X² has a chi-square distribution with 2k degrees of freedom, where k is the number of tests being combined. This fact can be used to determine the p-value for X².
The null distribution of X² is a chi-square distribution for the following reason. Under the null hypothesis for test i, the p-value $p_i$ follows a uniform distribution on the interval [0,1]. The negative natural logarithm of a uniformly distributed value follows an exponential distribution. Scaling a value that follows an exponential distribution by a factor of two yields a quantity that follows a chi-square distribution with two degrees of freedom. Finally, the sum of k independent chi-square values, each with two degrees of freedom, follows a chi-square distribution with 2k degrees of freedom.
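
A minimal sketch of Fisher's combined probability test, assuming SciPy for the chi-square tail probability; the example p-values are invented. (SciPy's scipy.stats.combine_pvalues offers this and related methods as a ready-made routine.)

```python
import numpy as np
from scipy.stats import chi2

def fisher_combine(p_values):
    """Fisher's combined probability test: X^2 = -2 * sum(log p_i),
    compared against a chi-square distribution with 2k degrees of freedom."""
    p = np.asarray(p_values, dtype=float)
    x2 = -2.0 * np.sum(np.log(p))
    combined_p = chi2.sf(x2, df=2 * len(p))   # upper-tail probability
    return x2, combined_p

# Hypothetical p-values from four independent tests of the same H0
print(fisher_combine([0.04, 0.10, 0.08, 0.20]))
```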

Interpretation

Fisher’s method is applied to a collection of independent test statistics, typically based on separate studies having the same null hypothesis. The meta-analysis null hypothesis is that all of the separate null hypotheses are true. The meta-analysis alternative hypothesis is that at least one of the separate alternative hypotheses is true.
In some settings, it makes sense to consider the possibility of “heterogeneity,” in which the null hypothesis holds in some studies but not in others, or where different alternative hypotheses may hold in different studies. A common reason for heterogeneity is that effect sizes may differ among populations. For example, consider a collection of medical studies looking at the risk of a high glucose diet for developing type II diabetes. Due to genetic or environmental factors, the risk associated with a given level of glucose consumption may be greater in some human populations than in others. In other settings, rejecting the null hypothesis for one study implies that the alternative hypothesis holds for all studies. For example, consider several experiments designed to test a particular physical law. When there is no heterogeneity, any discrepancies among the results from separate studies or experiments are due to chance, possibly driven by differences in power, rather than reflecting differences in the true states of the populations being investigated.
In the case of a meta-analysis using two-sided tests, it is possible to reject the meta-analysis null hypothesis even when the individual studies show strong effects in differing directions. In this case, we are rejecting the hypothesis that the null hypothesis is true in every study, but this does not imply that there is a uniform alternative hypothesis that holds across all studies. Thus, two-sided meta-analysis is particularly sensitive to heterogeneity in the alternative hypotheses. One-sided meta-analysis can detect heterogeneity in the effect magnitudes, but is insensitive to heterogeneity in the effect directions.

Relation to Stouffer’s Z-score Method

A closely related approach to Fisher’s method is based on Z-scores rather than p-values. If we let $Z_i = \Phi^{-1}(1 - p_i)$, where $\Phi$ is the standard normal cumulative distribution function, then

$Z \sim \frac{\sum_{i=1}^{k} Z_i}{\sqrt{k}}$

is a Z-score for the overall meta-analysis. This Z-score is appropriate for one-sided right-tailed p-values; minor modifications can be made if two-sided or left-tailed p-values are being analyzed. This method is named for the sociologist Samuel A. Stouffer. Since Fisher’s method is based on the average of the $-\log(p_i)$ values, and the Z-score method is based on the average of the $Z_i$ values, the relationship between these two approaches follows from the relationship between z and $-\log(p) = -\log(1 - \Phi(z))$. For the normal distribution, these two values are not perfectly linearly related, but they follow a highly linear relationship over the range of Z-values most often observed, from 1 to 5. As a result, the power of the Z-score method is nearly identical to the power of Fisher’s method. One advantage of the Z-score approach is that it is straightforward to introduce weights. If the i-th Z-score is weighted by $w_i$, then the meta-analysis Z-score is

$Z = \frac{\sum_{i=1}^{k} w_i Z_i}{\sqrt{\sum_{i=1}^{k} w_i^2}}$, which follows a standard normal distribution under the null hypothesis. While weighted versions of Fisher’s statistic can be derived, the null distribution becomes a weighted sum of independent chi-square statistics, which is less convenient to work with.
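The weighted Z-score above translates directly into code. The following sketch uses hypothetical one-sided p-values and hypothetical weights (for example, study sample sizes) purely for illustration.

```python
# A minimal sketch of Stouffer's Z-score method (weighted form) for one-sided
# p-values, as described above. All numbers are hypothetical.
import numpy as np
from scipy.stats import norm

p_values = np.array([0.08, 0.01, 0.15, 0.04])   # hypothetical one-sided p-values
weights = np.array([40, 120, 60, 80])           # hypothetical study weights (e.g., n)

z_scores = norm.isf(p_values)                   # Z_i = Phi^{-1}(1 - p_i)
z_combined = np.sum(weights * z_scores) / np.sqrt(np.sum(weights**2))
combined_p = norm.sf(z_combined)                # one-sided meta-analysis p-value
print(f"Z = {z_combined:.3f}, combined p-value = {combined_p:.4f}")
```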

Extension to Dependent Test Statistics In the case that the tests are not independent, the null distribution of X2 is more complicated. Dependence among the pi does not affect the expected value of X2, which continues to be 2k under the null hypothesis. If the covariance matrix of the logepi is known, then it is possible to calculate the variance of X2, and from this a normal approximation could be used to form a p-value for X2. Dependence among statistical tests is generally positive, which means that the p-value of X2 is too small if the dependency is not taken into account. Thus, if Fisher’s method for independent tests is applied in a dependent setting, and the p-value is not small enough to reject the null hypothesis, then that conclusion will continue to hold even if the dependence is not properly accounted for. However, if positive dependence is not accounted for, and the meta-analysis p-value is found to be small, the evidence for the alternative hypothesis is generally overstated.
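The normal approximation mentioned above can be sketched as follows. The covariance matrix of the log p-values used here is an assumed, illustrative matrix, not something estimated from real data; the variance calculation simply follows from the rule Var(aX) = a²Var(X) applied to the sum of the log p-values.

```python
# A rough sketch of the normal approximation described above, assuming the
# covariance matrix of the log(p_i) values is known. The matrix and p-values
# below are invented for illustration.
import numpy as np
from scipy.stats import norm

p_values = np.array([0.03, 0.06, 0.04])
k = len(p_values)
x2 = -2.0 * np.sum(np.log(p_values))

# Assumed covariance matrix of the log(p_i); off-diagonal entries reflect
# positive dependence among tests. Under independence it would be the identity
# matrix, since Var(log p_i) = 1 when p_i is uniform under the null.
cov_log_p = np.array([[1.0, 0.3, 0.2],
                      [0.3, 1.0, 0.4],
                      [0.2, 0.4, 1.0]])

mean_x2 = 2.0 * k                      # E[X^2] is unaffected by dependence
var_x2 = 4.0 * cov_log_p.sum()         # Var(-2 * sum(log p_i)) = 4 * sum of covariances

z = (x2 - mean_x2) / np.sqrt(var_x2)
approx_p = norm.sf(z)
print(f"X^2 = {x2:.2f}, approximate p-value = {approx_p:.4f}")
```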

Forest Plot A forest plot is a graphical display designed to illustrate the relative strength of treatment effects in multiple quantitative scientific studies addressing the same question. It was developed for use in medical research as a means of graphically representing a meta-analysis of the results of randomized controlled trials. In the last twenty years, similar meta-analytical techniques have been applied in observational studies and forest plots are often used in presenting the results of such studies also. Although forest plots can take several forms, they are commonly presented with two columns. The left-hand column lists the names of the studies (frequently randomized controlled trials or epidemiological studies), commonly in chronological order from the top downwards. The right-hand column is a plot of the measure of effect for each of these studies incorporating confidence intervals represented by horizontal lines. The graph may be plotted on a natural logarithmic scale when using odds ratios or other ratio-based effect measures, so that the confidence intervals are symmetrical about the means from each study and to ensure undue emphasis is not given to odds ratios greater than 1 when compared to those less than 1. The area of each square is proportional to the study’s weight in the meta-analysis. The overall meta-analysed measure of effect is often represented on the plot as a vertical line. This meta-analysed measure of effect is commonly plotted as a diamond, the lateral points of which indicate confidence intervals for this estimate.

A vertical line representing no effect is also plotted. If the confidence intervals for individual studies overlap with this line, it demonstrates that at the given level of confidence their effect sizes do not differ from no effect for the individual study. The same applies for the meta-analysed measure of effect: if the points of the diamond overlap the line of no effect the overall meta-analysed result cannot be said to differ from no effect at the given level of confidence. Forest plots date back to at least the 1970s. One plot is shown in a 1985 book about meta-analysis. The first use in print of the word “forest plot” may be 1996. The name refers to the forest of lines produced. In September 1990 it was joked that the plot was named after a breast cancer researcher called Pat Forrest, and the name has sometimes been spelt “forrest plot”.
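A minimal matplotlib sketch of the layout described above is shown below. All study names, odds ratios, confidence limits, weights and the pooled estimate are invented for illustration; only the plot structure (log scale, squares sized by weight, line of no effect, diamond for the pooled result) follows the description in the text.

```python
# Sketch of a forest plot with hypothetical study results.
import numpy as np
import matplotlib.pyplot as plt

studies = ["Study A (2001)", "Study B (2005)", "Study C (2010)", "Study D (2016)"]
odds_ratios = np.array([0.80, 1.10, 0.65, 0.90])
ci_low = np.array([0.55, 0.80, 0.45, 0.70])
ci_high = np.array([1.15, 1.50, 0.95, 1.15])
weights = np.array([0.15, 0.25, 0.20, 0.40])        # hypothetical meta-analysis weights

pooled, pooled_low, pooled_high = 0.88, 0.76, 1.02  # hypothetical pooled estimate

fig, ax = plt.subplots(figsize=(6, 3))
y = np.arange(len(studies))[::-1] + 1               # earliest study at the top

# Horizontal lines for confidence intervals, squares sized by study weight
ax.hlines(y, ci_low, ci_high, colors="black")
ax.scatter(odds_ratios, y, marker="s", s=400 * weights, color="black", zorder=3)

# Diamond for the pooled effect, with lateral points at its confidence limits
ax.fill([pooled_low, pooled, pooled_high, pooled], [0.4, 0.55, 0.4, 0.25], color="grey")

ax.axvline(1.0, linestyle="--", color="black")      # vertical line of no effect
ax.set_xscale("log")
ax.set_yticks(list(y) + [0.4])
ax.set_yticklabels(studies + ["Pooled estimate"])
ax.set_xlabel("Odds ratio (log scale)")
plt.tight_layout()
plt.show()
```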

Homogeneity (Statistics) In statistics, homogeneity arises in describing the properties of a dataset, or several datasets, and relates to the validity of the often convenient assumption that the statistical properties of any one part of an overall dataset are the same as any other part. In meta-analysis, which combines the data from several studies, homogeneity measures the differences or similarities between the several studies. Homogeneity can be studied to several degrees of complexity. For example, considerations of homoscedasticity examine how much the variability of data-values changes throughout a dataset. However, questions of homogeneity apply to all aspects of the statistical distributions, including the location parameter. Thus, a more detailed study would examine changes to the whole of the marginal distribution. An intermediate-level study might move from looking at the variability to studying changes in the skewness. In addition to these, questions of homogeneity apply also to the joint distributions. The concept of homogeneity can be applied in many different ways and, for certain types of statistical analysis, it is used to look for further properties that might need to be treated as varying within a dataset once some initial types of non-homogeneity have been dealt with.

Newcastle–Ottawa Scale In statistics, the Newcastle–Ottawa scale is a method for assessing the quality of non-randomised studies in meta-analyses. The scale allocates stars, with a maximum of nine, for the quality of selection, comparability, exposure and outcome of study participants. The method was developed as a collaboration between the Universities of Newcastle, Australia and Ottawa, Canada.

Signed Differential Mapping Signed differential mapping or SDM is a statistical technique for meta-analyzing studies on differences in brain activity or structure which used neuroimaging techniques such as fMRI, VBM or PET. It may also refer to a specific piece of software created by the SDM Project to carry out such meta-analyses.

The Signed Differential Mapping Approach
• An Overview of the method: SDM adopted and combined various positive features from previous methods, such as ALE or MKDA, and introduced a series of improvements and novel features. One of the new features, introduced to avoid positive and negative findings in the same voxel as seen in previous methods, was the representation of both positive differences and negative differences in the same map, thus obtaining a signed differential map (SDM). The method has three steps. First, studies and coordinates of cluster peaks are selected according to SDM inclusion criteria. Second, these coordinates are used to create an SDM map for each study. Finally, study maps are meta-analyzed using several different tests to complement the main outcome with sensitivity and heterogeneity analyses.
• Inclusion criteria: It is not uncommon in neuroimaging studies that some regions are more liberally thresholded than the rest of the brain. However, a meta-analysis of studies with such regional differences in thresholds would be biased towards these regions, as they are more likely to be reported just because authors apply more liberal thresholds in them. In order to overcome this issue SDM introduced a criterion in the selection of the coordinates, which consists in only including those results that appeared statistically significant at the whole-brain level using the same threshold throughout the brain.
• Pre-processing of studies: After conversion of coordinates to Talairach space, an SDM map is created for each study. This consists in recreating the clusters of difference by means of an un-normalized Gaussian kernel, so that voxels closer to the peak coordinate have higher values. A rather large full-width at half-maximum (FWHM) of 25mm is used to account for different sources of spatial error, e.g., coregistration mismatch in the studies, the size of the cluster or the location of the peak within the cluster. Within a study, values obtained by close Gaussian kernels are summed, though values are limited to [-1,1] to avoid a bias towards studies reporting various coordinates in close proximity.
• Statistical comparisons: SDM provides several different statistical analyses in order to complement the main outcome with sensitivity and heterogeneity analyses. The main statistical analysis is the mean analysis, which consists in calculating the mean of the voxel values in the different studies. This mean is weighted by the sample size so that studies with large sample sizes contribute more. The descriptive analysis of quartiles describes the weighted proportion of studies with strictly positive (or negative) values in a voxel, thus providing a p-value-free measure of the effect size. Subgroup analyses are mean analyses applied to groups of studies to allow the study of heterogeneity. Linear model

analyses are a generalization of the mean analysis to allow the study of possible confounds. It must be noted that a low variability of the regressor is critical in meta-regressions, so they are recommended to be understood as exploratory and to be more conservatively thresholded. Jack-knife analysis consists in repeating a test as many times as studies have been included, discarding one different study each time, i.e., removing one study and repeating the analyses, then putting that study back and removing another study and repeating the analysis, and so on. The idea is that if a significant brain region remains significant in all or most of the combinations of studies it can be concluded that this finding is highly replicable. The statistical significance of the analyses is checked by standard randomization tests. It is recommended to use uncorrected p-values = 0.001, as this significance has been found in this method to be approximately equivalent to a corrected p-value = 0.05. A false discovery rate (FDR) = 0.05 has been found in this method to be too conservative. Values in a Talairach label or coordinate can also be extracted for further processing or graphical presentation.
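The jack-knife idea described above can be illustrated in a few lines of code. The sketch below is not the SDM software itself: it uses invented toy voxel maps and sample sizes to show the general procedure of recomputing a sample-size-weighted mean while leaving one study out at a time.

```python
# Illustrative sketch (not the SDM software) of a leave-one-out jack-knife
# over per-study voxel maps, with the pooled map weighted by sample size.
import numpy as np

rng = np.random.default_rng(0)
n_studies, n_voxels = 5, 10
study_maps = rng.normal(0.3, 0.5, size=(n_studies, n_voxels))  # toy signed maps
sample_sizes = np.array([20, 35, 15, 50, 28])                  # toy sample sizes

def weighted_mean_map(maps, n):
    # Mean analysis: voxelwise mean weighted by sample size
    return (n[:, None] * maps).sum(axis=0) / n.sum()

full_map = weighted_mean_map(study_maps, sample_sizes)

for left_out in range(n_studies):
    keep = np.arange(n_studies) != left_out
    jk_map = weighted_mean_map(study_maps[keep], sample_sizes[keep])
    max_change = np.abs(jk_map - full_map).max()
    print(f"leaving out study {left_out}: largest voxel change = {max_change:.3f}")
```

A finding that remains essentially unchanged across all leave-one-out repeats is, in the spirit of the text, more likely to be replicable than one driven by a single study.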

SDM Software SDM is software written by the SDM project to aid the meta-analysis of voxel-based neuroimaging data. It is distributed with a graphical interface and a menu/command-line console. It can also be integrated as an SPM extension.

Study Heterogeneity In statistics, study heterogeneity is a problem that can arise when attempting to undertake a meta-analysis. Ideally, the studies whose results are being combined in the meta-analysis should all be undertaken in the same way and to the same experimental protocols: study heterogeneity is a term used to indicate that this ideal is not fully met. Meta-analysis is a method used to combine the results of different trials in order to obtain a quantified synthesis. The size of individual clinical trials is often too small to detect treatment effects reliably. Meta-analysis increases the power of statistical analyses by pooling the results of all available trials. As one tries to use the meta-analysis to estimate a combined effect from a group of similar studies, there needs to be a check that the effects found in the individual studies are similar enough that one can be confident that a combined estimate will be a meaningful description of the set of studies. However, the individual estimates of treatment effect will vary by chance, because of randomization. Thus some variation is expected. The question is whether there is more variation than would be expected by chance alone. When this excessive variation occurs, it is called statistical heterogeneity, or just heterogeneity. When there is heterogeneity that cannot readily be explained, one analytical approach is to incorporate it into a random effects model. A random effects meta-analysis model involves an assumption that the effects being estimated in the different studies are not identical, but follow some distribution. The model represents the lack of knowledge about why real, or apparent, treatment effects differ by treating the differences as if they were random. The centre of this symmetric distribution describes the average of the effects, while its width describes the degree of heterogeneity. The conventional choice of distribution is a normal distribution. It is difficult to establish the validity of any distributional assumption, and this is a common criticism of random effects meta-analyses. The importance of the particular assumed shape for this distribution is not known.
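One common way to implement the random effects model described above is the DerSimonian-Laird estimator of the between-study variance; the text does not name a specific estimator, so the sketch below should be read as one standard option rather than the method. The effect sizes and standard errors are hypothetical.

```python
# A minimal sketch of a random-effects meta-analysis using the DerSimonian-Laird
# estimator of between-study variance. All inputs are hypothetical.
import numpy as np
from scipy.stats import norm

effects = np.array([0.30, 0.10, 0.45, 0.25])   # hypothetical study effect sizes
se = np.array([0.12, 0.10, 0.15, 0.08])        # hypothetical standard errors

w_fixed = 1.0 / se**2
pooled_fixed = np.sum(w_fixed * effects) / np.sum(w_fixed)

# Cochran's Q quantifies observed heterogeneity; tau^2 is the estimated
# between-study variance (floored at zero).
q = np.sum(w_fixed * (effects - pooled_fixed) ** 2)
df = len(effects) - 1
c = np.sum(w_fixed) - np.sum(w_fixed**2) / np.sum(w_fixed)
tau2 = max(0.0, (q - df) / c)

# Random-effects weights include tau^2 in each study's variance.
w_random = 1.0 / (se**2 + tau2)
pooled_random = np.sum(w_random * effects) / np.sum(w_random)
pooled_se = np.sqrt(1.0 / np.sum(w_random))

z = pooled_random / pooled_se
print(f"tau^2 = {tau2:.4f}, pooled effect = {pooled_random:.3f} "
      f"(95% CI {pooled_random - 1.96*pooled_se:.3f} to {pooled_random + 1.96*pooled_se:.3f}), "
      f"two-sided p = {2*norm.sf(abs(z)):.4f}")
```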

Systematic Review A systematic review is a literature review focused on a single question that tries to identify, appraise, select and synthesize all high quality research evidence relevant to that question. Systematic reviews of high-quality randomized controlled trials are crucial to evidence-based medicine. An understanding of systematic reviews and how to implement them in practice is becoming mandatory for all professionals involved in the delivery of health care. Systematic reviews are not limited to medicine and are quite common in other sciences such as psychology, educational research and sociology.

Characteristics A systematic review is a summary of research that uses explicit methods to perform a thorough literature search and critical appraisal of individual studies to identify the valid and applicable evidence. It is often applied in the biomedical or health care context, but systematic reviews can be applied in any field of research and groups like the Campbell Collaboration are promoting their use in policy-making beyond just health care. It often, but not always, uses statistical techniques to combine these valid studies, or at least uses grading of the levels of evidence depending on the methodology used. A systematic review uses an objective and transparent approach for research synthesis, with the aim of minimizing bias. While many systematic reviews are based on an explicit quantitative meta-analysis of available data, there are also qualitative reviews which adhere to the standards for gathering, analyzing and reporting evidence. The EPPI-Centre have been influential in developing methods for combining both qualitative and quantitative research in systematic reviews. Recent developments in systematic reviews include realist reviews, developed by Ray Pawson and Trisha Greenhalgh, and the meta-narrative approach by Greenhalgh and colleagues. These approaches try to overcome the problems of methodological and epistemological heterogeneity in the diverse literatures existing on some subjects.

Strengths and Weaknesses While systematic reviews are regarded as the strongest form of medical evidence, a review of 300 studies found that not all systematic reviews were equally reliable, and that their reporting could be improved by a universally agreed upon set of standards and guidelines. A further study by the same group found that of 100 guidelines reviewed, 4% required updating within a year, and 11% after 2 years; this figure was higher in rapidly-changing fields of medicine, especially cardiovascular medicine. 7% of systematic reviews needed updating at the time of publication. A 2003 study suggested that extending searches beyond major databases, perhaps into gray literature, would increase the effectiveness of reviews.

DESCRIPTIVE RESEARCH Descriptive research is also called Statistical Research. The main goal of this type of research is to describe the data and characteristics about what is being studied. The idea behind this type of research is to study frequencies, averages, and other statistical calculations. Although this research is highly accurate, it does not uncover the causes behind a situation. Descriptive research is mainly done when a researcher wants to gain a better understanding of a topic; for example, a frozen ready meals company learns that there is a growing demand for fresh ready meals but doesn't know much about the area of fresh food, and so has to carry out research in order to gain a better understanding. It is quantitative and uses surveys and panels and also the use of probability sampling.
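The kind of frequencies and averages referred to here can be produced with a few lines of code. The survey data and column names in the sketch below are hypothetical and serve only to show the shape of a descriptive summary.

```python
# A small sketch of a descriptive summary: frequencies and averages computed
# with pandas on a hypothetical survey sample.
import pandas as pd

survey = pd.DataFrame({
    "buys_fresh_meals": ["yes", "no", "yes", "yes", "no", "yes"],
    "weekly_spend": [18.5, 6.0, 22.0, 15.5, 4.0, 30.0],
})

print(survey["buys_fresh_meals"].value_counts(normalize=True))  # frequencies
print(survey["weekly_spend"].describe())                        # mean, spread, quartiles
```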

Survey Techniques Imagine yourself as the athletic director at Central Washington University. In addition to creating a successful programme, you would clearly want a programme that attracts the support of the student population at the university. While you might suppose that students would flock to support winning programmes, this assumption may not be true. There have been many instances elsewhere of winning programmes failing to attract support. Rather than let this important component of your programme rest with fate, you might wonder what you could do to generate student interest in your athletic programme. Although you could initiate a variety of non-scientific methods to “research” the problem, as an individual knowledgeable about descriptive research methods you could design a more logical and systematic approach to investigating this challenge.

Questionnaires Distributing carefully designed questionnaires to all or a sample of your student population would be one possible approach. As noted in your text there are some key steps to follow when constructing questionnaires. By following these steps you enhance the quality of the information you are able to obtain and also ensure that this information is in a form that can be objectively analyzed. Do remember however, that an important limitation to questionnaires is that they report what people say and not necessarily what they DO. Below the key steps involved in constructing a questionnaire are listed:
• Determine the Objectives: From the example described earlier, it would appear

that one of our principal objectives might be to determine why students either choose to attend or avoid athletic events. We might also want to know what could be done to make attendance more enjoyable. A recommended step is to list objectives, then think about the kind of responses you might anticipate, and plan how you might analyze these data. Planning the questionnaire is obviously a vital first step, and failure to plan well will likely undermine the value of the entire study.
• Delimiting the Sample: Hopefully, you remember our earlier discussions about sampling. Some of the considerations in our example might be whether or not to distribute questionnaires to the entire student population or to select a representative sample. What would a representative sample look like? How will we sample? Do we need equal numbers of males and females? Do we want equal representation from freshmen, sophomores, juniors, and seniors? Do we need to consider racial composition? How many students do we need? Answers to some of these questions will depend on time and money. Certainly however, you can see that we need to think carefully about the sample we use in the study.
• Constructing the Questionnaire: When people first attempt to build questionnaires they quickly discover that questions that appear clear to them are often open to many different interpretations. It is a time-consuming task and once again you would need to consider how you plan to analyze the possible responses. As you formulate your questions you must consider the most appropriate format. Open-Ended Questions allow the responder a variety of response options. The good part is that responders are free to say what they like. The bad part is that they take more time to answer and are tougher to analyze. In our example, we could ask “In what ways could the Athletic Department make your attendance at an athletic event more enjoyable?” Closed Questions direct responders to certain choices among provided options. We can ask responders to rank order choices, select a score on a scale, or respond to provided categories. A closed version of the question posed above might be, “Should the Athletic Department sell concessions at athletic events?”. Alternatively, “Which would be your preferred choice of days to attend basketball games?”. Whichever format you choose, the wording of the questions and the length and appearance of the questionnaire require careful consideration if you are to maximize your returns.
• Conducting a Pilot Study: A pilot study is an essential first step. Some time ago I wanted to know what motivates children to participate in ski school trips. I constructed a questionnaire with the reasons I believed would explain the children’s motivation. Fortunately, before I distributed the questionnaire I shared it with a colleague who pointed out that I did not include the option of “Was signed up by parents.” As it turned out several children did not actually

choose to learn to ski but were simply signed up by their parents. If I had not asked this question I would have missed a significant reason. As it turned out I made a second error by not piloting the questionnaire with kids. On the questionnaire students were asked to rank order their reasons for taking ski classes. We discovered that the younger students did not understand what was meant by “rank order” and other students were not sure what to do if they wanted to score two choices at the same rank. • Cover Letters: Many questionnaires will not be returned. Some people will discard them immediately upon receipt because they feel they don’t have time. Others will look at the length or the type of information requested and then ignore or discard. The cover letter you include with the questionnaire creates a first impression and may sway whether or not you get a return. A brief, grammatically well written letter outlining clearly why you are requesting assistance may keep the questionnaire out of the discard pile. Usually, the cover letter will include a requested date to receive the response. • Sending the Questionnaire: As you can imagine your response rate from university students would probably not be high if they received the questionnaire during finals week. A person sending a questionnaire to coaches would be well advised to avoid the height of the coaching season. To increase the response rate a stamped, self-addressed envelope is advised. Some people will feel guilty not using the stamp and may be more inclined to respond to your questionnaire. • Follow up: Response rates are typically much lower than expected. Some types of questionnaires with certain samples might elicit much higher than average responses. A 50% response rate is often as good as it gets. To increase responses many researchers develop a system of follow-ups. These might begin with a postcard reminder, then be followed up with another questionnaire, and finally - if the response is vital - with a phone call. The authors of your text note that when response rates are extremely low the value of the findings is highly questionable. • Analyzing the Results and Preparing the Report: Once the questionnaires have been returned with a satisfactory response rate, the data must be analyzed and reported. Most often descriptive statistics will be used. Remember however, that methods of analysis will of course have been decided in the planning phase of the study. To a great extent, the nature of the discussion section in your study will depend on the results you obtain.

The Method As explained in your text the Delphi method is a survey technique that involves the use of questionnaires in an attempt to get consensus on a topic. Subjects respond to a first questionnaire, then based on these responses a second questionnaire is developed and administered. Each time the questionnaire is administered is called a “round.” Suppose for example, you were interested in investigating the impact of the 1980 Olympic boycott. Using the Delphi method you could survey knowledgeable Russians and Americans, each time sharing the different perspectives obtained, in an attempt to identify critical issues and perspectives.

Personal Interviews Personal interviews are similar to questionnaires except in the manner in which they are administered. Some of the advantages and disadvantages to this method are indicated below:
Advantages:
• Greater confidentiality possible because of personal contact
• Flexibility to give follow-up questions
• Opportunity to clarify questions
• Can judge adequacy of replies
• Higher return rate
Disadvantages/Challenges:
• Fewer subjects can be sampled
• More expensive because of travel or phone
• Need to be able to take notes quickly or get permission to tape
• Need to be able to listen to one reply and be ready to follow-up immediately with the next question
• Requires skilled interviewer
As noted in the text, conducting an interview effectively requires practice. You will notice from watching TV interviews how the interviewer usually tries to control the pace of the interview. Some people will talk for a long time but say very little if permitted to do so. Another problem is that some people tend to give an opinion when they really are not very knowledgeable about the topic. With practice the researcher quickly identifies whether a subject truly is qualified to respond to a particular question.

Normative Surveys Although it is highly unlikely that you as graduate students would be involved in normative types of surveys, you should be aware that they have been widely used in physical education and health. Two sets of norms often used in public school physical education are the President’s Physical Fitness Challenge and AAHPERD’s Physical Best. Both consist of fitness data collected through nationwide normative surveys that teachers use to compare the fitness levels of children in their schools. In these types of surveys, the intent of the researchers is typically to establish performance norms to which the performance of others can be compared. As you can imagine this endeavor is often the subject of criticism by those who claim that scores taken in one environment lack validity when applied to a different environment. If you are considering research involving questionnaires or interviews be sure to carefully examine the many aspects involved in the successful application of these types of research methods. An excellent resource for further reading is a book entitled Research Processes in Physical Education, by Clark and Clark.

Developmental Research Some years ago a researcher at the University of Oregon named H. Harrison Clarke initiated a growth study in the Oregon town of Medford. This project spawned many research papers, professional presentations, and graduate theses and dissertations. For several years researchers would visit Medford and track the growth and development of children in the public schools. This project provides an example of a longitudinal study - in other words a study of the same subjects over a period of several years. Were these same researchers to conduct a cross-sectional study we might anticipate that they would have visited Medford on one occasion and taken growth measurements of different subjects at several grade levels. Developmental studies can provide some fascinating insights and although only descriptive in nature can spark the researcher’s curiosity for more controlled experimental studies.

Case Studies Some years ago a graduate student at our university was interested in examining the topic of teenage pregnancy. Although she could have researched the topic with a questionnaire she decided to use a case-study approach. The advantage of this method was that the topic could be examined in-depth, albeit gathering information from a much more limited sample of subjects than would have been possible with a questionnaire. In contrast to the questionnaire approach - which requires the researcher to have excellent knowledge of the topic when designing questions - case study researchers often approach their subjects with an inquisitive mind and an openness that permits subjects to respond in an unlimited number of directions. As you can imagine, this less structured approach may take researchers down avenues they did not anticipate traveling and open doors to new kinds of understanding. In terms of the types of case study identified in your text, in the example given above the graduate student was probably conducting an interpretive study. Certainly, however she included a great deal of description. Were she to have approached the topic with the intent of identifying better ways of preventing teenage pregnancies then she would probably have conducted more of an evaluative study. In summary, case studies tend to provide in-depth information about a limited number of subjects, and may produce new insights that generate additional studies.

Observational Research A topic of great interest to those of us in teacher education has been in identifying those factors that contribute to effective teaching. Research on effective teaching has been through several stages. Initially, it was believed that the most effective teachers had special personality characteristics. Later it was suggested that the key to effective teaching might lie in the methods used. Research in this area has often involved the observation of teachers and the categorization of behaviours. In observing the behaviours of teachers in classrooms in which learning is apparently occurring, certain commonalities have been identified. The researchers who have investigated this area have engaged themselves in a type of observational research. Instead of approaching the question of effective teaching by asking questions, they chose to observe the behaviours of teachers. As noted in your text observational research necessitates adherence to certain guidelines in order to be considered as valid and reliable. As some of you know, here at CWU we use several of these observational methods in our undergraduate preparation of PE majors.

Correlational Research Although we discussed the techniques of correlational research in our discussion about statistics, it is important to appreciate that correlational research is descriptive. Because there is no manipulation of variables or controls, in correlational research it is impossible to conclude that something “caused” something else to occur. Remember that correlations are indicators of a relationship and not an effect.

PHYSICAL ACTIVITY EPIDEMIOLOGY RESEARCH Physical activity is assumed to be a key factor in obesity and metabolic disease, as well as in many other chronic diseases. However, for several reasons these associations are still not completely clear. A major reason for this is that physical activity is difficult to accurately measure in epidemiological studies. The main aim of our research is to improve the methods that are used to assess physical activity and energy expenditure. These methods can then be adopted to study the role of physical activity in the development of obesity, diabetes and other metabolic diseases. The overall aims of our research are to: • Develop and evaluate methods for reliable and valid measurement of physical activity and energy expenditure during physical activity in daily life, • Understand the role of physical activity and sedentary behaviour in the development of disease and the means by which activity can contribute to preventing disease. • Understand how genes and other biological factors influence physical activity and sedentary behaviour.

EXPERIMENTAL RESEARCH

Experimental Research and Design
• Experimental Research: An attempt by the researcher to maintain control over all factors that may affect the result of an experiment. In doing this, the researcher attempts to determine or predict what may occur.
• Experimental Design: A blueprint of the procedure that enables the researcher

to test his hypothesis by reaching valid conclusions about relationships between independent and dependent variables. It refers to the conceptual framework within which the experiment is conducted.
• Steps involved in conducting an experimental study:
– Identify and define the problem.
– Formulate hypotheses and deduce their consequences.
– Construct an experimental design that represents all the elements, conditions, and relations of the consequences.
– Select sample of subjects.
– Group or pair subjects.
– Identify and control non-experimental factors.
– Select or construct, and validate instruments to measure outcomes.
– Conduct pilot study.
– Determine place, time, and duration of the experiment.
– Conduct the experiment.
– Compile raw data and reduce to usable form.
– Apply an appropriate test of significance.

Essentials of Experimental Research Manipulation of an independent variable. An attempt is made to hold all other variables except the dependent variable constant - control. Effect is observed of the manipulation of the independent variable on the dependent variable - observation. Experimental control attempts to predict events that will occur in the experimental setting by neutralizing the effects of other factors.

Methods of Experimental Control • Physical Control: – Gives all subjects equal exposure to the independent variable. – Controls non experimental variables that affect the dependent variable. • Selective Control - Manipulate indirectly by selecting in or out variables that cannot be controlled. • Statistical Control - Variables not conducive to physical or selective manipulation may be controlled by statistical techniques.

Validity of Experimental Design Internal Validity asks did the experimental treatment make the difference in this specific instance rather than other extraneous variables? External Validity asks to what populations, settings, treatment variables, and measurement variables can this observed effect be generalized?

Factors Jeopardizing Internal Validity
• History: The events occurring between the first and second measurements in addition to the experimental variable which might affect the measurement.

• Example: Researcher collects gross sales data before and after a 5 day 50% off sale. During the sale a hurricane occurs and results of the study may be affected because of the hurricane, not the sale.
• Maturation: The process of maturing which takes place in the individual during the duration of the experiment which is not a result of specific events but of simply growing older, growing more tired, or similar changes. Example: Subjects become tired after completing a training session, and their responses on the Posttest are affected.
• Pre-testing: The effect created on the second measurement by having a measurement before the experiment. Example: Subjects take a Pretest and think about some of the items. On the Posttest they change to answers they feel are more acceptable. Experimental group learns from the pretest.
• Measuring Instruments: Changes in instruments, calibration of instruments, observers, or scorers may cause changes in the measurements. Example: Interviewers are very careful with their first two or three interviews but on the 4th, 5th, and 6th become fatigued and are less careful and make errors.
• Statistical Regression: Groups are chosen because of extreme scores of measurements; those scores or measurements tend to move towards the mean with repeated measurements even without an experimental variable. Example: Managers who are performing poorly are selected for training. Their average Posttest scores will be higher than their Pretest scores because of statistical regression, even if no training were given.
• Differential Selection: Different individuals or groups would have different previous knowledge or ability which would affect the final measurement if not taken into account. Example: A group of subjects who have viewed a TV programme is compared with a group which has not. There is no way of knowing that the groups would have been equivalent since they were not randomly assigned to view the TV programme.
• Experimental Mortality: The loss of subjects from comparison groups could greatly affect the comparisons because of unique characteristics of those subjects. Groups to be compared need to be the same after as before the experiment. Example: Over a 6-month experiment aimed at changing accounting practices, 12 accountants drop out of the experimental group and none drop out of the control group. Not only is there differential loss in the two groups, but the 12 dropouts may be very different from those who remained in the experimental group.
• Interaction of Factors, such as Selection-Maturation, etc.: Combinations of these factors may interact, especially in multiple group comparisons, to produce erroneous measurements.
Factors Jeopardizing External Validity or Generalizability
• Pre-Testing: Individuals who were pretested might be less or more sensitive to the experimental variable or might have “learned” from the pre-test making

them unrepresentative of the population who had not been pre-tested. Example: Prior to viewing a film on Environmental Effects of Chemicals, a group of subjects is given a 60 item antichemical test. Taking the Pretest may increase the effect of the film. The film may not be effective for a non-pretested group.
• Differential Selection: The selection of the subjects determines how the findings can be generalized. Subjects selected from a small group or one with particular characteristics would limit generalizability. Randomly chosen subjects from the entire population could be generalized to the entire population. Example: Researcher, requesting permission to conduct an experiment, is turned down by 11 corporations, but the 12th corporation grants permission. The 12th corporation is obviously different from the others because it accepted. Thus subjects in the 12th corporation may be more accepting or sensitive to the treatment.
• Experimental Procedures: The experimental procedures and arrangements have a certain amount of effect on the subjects in the experimental settings. Generalization to persons not in the experimental setting may be precluded. Example: Department heads realize they are being studied, try to guess what the experimenter wants and respond accordingly rather than respond to the treatment.
• Multiple Treatment Interference: If the subjects are exposed to more than one treatment then the findings could only be generalized to individuals exposed to the same treatments in the same order of presentation. Example: A group of CPAs is given training in working with managers followed by training in working with comptrollers. Since training effects cannot be deleted, the first training will affect the second.

Tools of Experimental Design Used to Control Factors Jeopardizing Validity
• Pre-Test: The pre-test, or measurement before the experiment begins, can aid control for differential selection by determining the presence or knowledge of the experimental variable before the experiment begins. It can aid control of experimental mortality because the subjects can be removed from the entire comparison by removing their pre-tests. However, pre-tests cause problems by their effect on the second measurement and by causing generalizability problems to a population not pre-tested and those with no experimental arrangements.
• Control Group: The use of a matched or similar group which is not exposed to the experimental variable can help reduce the effect of History, Maturation, Instrumentation, and Interaction of Factors. The control group is exposed to all conditions of the experiment except the experimental variable.
• Randomization: Use of random selection procedures for subjects can aid in control of Statistical Regression, Differential Selection, and the Interaction of

Factors. It greatly increases generalizability by helping make the groups representative of the populations.
• Additional Groups: The effects of Pre-tests and Experimental Procedures can be partially controlled through the use of groups which were not pre-tested or exposed to experimental arrangements. They would have to be used in conjunction with other pre-tested groups or other factors jeopardizing validity would be present.
The method by which treatments are applied to subjects using these tools to control factors jeopardizing validity is the essence of experimental design.
Table: Tools of Control.
Internal Sources:
• History: Control Group
• Maturation: Control Group
• Pre-Testing: Additional Groups
• Measuring Instrument: Control Group
• Statistical Regression: Randomization
• Differential Selection: Pre-Test, Randomization
• Experimental Mortality: Pre-Test
• Interaction of Factors: Control Group, Randomization
External Sources:
• Pre-Testing: Additional Groups
• Differential Selection: Randomization
• Procedures: Additional Groups
• Multiple Treatment Interference: not controlled by these tools

Experimental Designs
Table. Pre-Experimental Design - loose in structure, could be biased.

• One-shot experimental case study (X » O). Aim: to attempt to explain a consequent by an antecedent. Comments: an approach that prematurely links antecedents and consequences; the least reliable of all experimental approaches.
• One group pretest-posttest (O » X » O). Aim: to evaluate the influence of a variable. Comments: an approach that provides a measure of change but can provide no conclusive results.
• Static group comparison (Group 1: X » O; Group 2: - » O). Aim: to determine the influence of a variable on one group and not on another. Comments: weakness lies in no examination of pre-experimental equivalence of groups. Conclusion is reached by comparing the performance of each group to determine the effect of a variable on one of them.


Table. True Experimental Design - greater control and refinement, greater control of validity

• Pretest-posttest control group design (R: [O » X » O] / [O » - » O]). Aim: to study the effect of an influence on a carefully controlled sample. Comments: this design has been called "the old workhorse of traditional experimentation." If effectively carried out, this design controls for eight threats of internal validity. Data are analyzed by analysis of covariance on posttest scores with the pretest as the covariate.
• Solomon four-group design (R: [O » X » O] / [O » - » O] / [- » X » O] / [- » - » O]). Aim: to minimize the effect of pretesting. Comments: this is an extension of the pretest-posttest control group design and probably the most powerful experimental approach. Data are analyzed by analysis of variance on posttest scores.
• Posttest only control group (R: [X » O] / [- » O]). Aim: to evaluate a situation that cannot be pretested. Comments: an adaptation of the last two groups in the Solomon four-group design. Randomness is critical. Probably the simplest and best test for significance in this design is the t-test.

Table. Quasi-Experimental Design - not randomly selected.
• Nonrandomized control group pretest-posttest (O » X » O / O » - » O). Aim: to investigate a situation in which random selection and assignment are not possible. Comments: one of the strongest and most widely used quasi-experimental designs. Differs from experimental designs because test and control groups are not equivalent. Comparing pretest results will indicate degree of equivalency between experimental and control groups.
• Time-series experiment (O » O » X » O » O). Aim: to determine the influence of a variable introduced only after a series of initial observations and only where one group is available. Comments: if substantial change follows introduction of the variable, then the variable can be suspect as to the cause of the change. To increase external validity, repeat the experiment in different places under different conditions.
• Control group time series (O » O » X » O » O / O » O » - » O » O). Aim: to bolster the validity of the above design with the addition of a control group. Comments: a variant of the above design, accompanying it with a parallel set of observations without the introduction of the experimental variable.
• Equivalent time-samples ([X1 » O1] » [X0 » O2] » [X1 » O3]). Aim: to control history in time designs with a variant of the above design. Comments: an on-again, off-again design in which the experimental variable is sometimes present, sometimes absent.

Table. Correlational and Ex Post Facto Design.
• Causal-comparative correlational studies (Oa ↔ Ob). Aim: to seek cause-effect relationships between two sets of data. Comments: a very deceptive procedure that requires much insight for its use. Causality cannot be inferred merely because a positive and close correlation ratio exists.
• Ex post facto studies. Aim: to search backward from consequent data for antecedent causes. Comments: this approach is experimentation in reverse. Seldom is proof through data substantiation possible. Logic and inference are the principal tools of this design.

SELF ASSESSMENT
• Define experimental research.
• Define experimental design.
• List six steps involved in conducting an experimental study.
• Describe the basis of an experiment.
• Name three characteristics of experimental research.
• State the purpose of experimental control.
• State three broad methods of experimental control.
• Name two types of validity of experimental design.
• Define eight factors jeopardizing internal validity of a research design.
• Define four factors jeopardizing external validity.
• Describe the tools of experimental design used to control the factors jeopardizing validity of a research design.
• Define the essence of experimental design.
• Name and describe the four types of experimental designs.

QUASI-EXPERIMENT RESEARCH Quasi-experiment is a research design having some but not all of the characteristics of a true experiment. The element most frequently missing is random assignment of subjects to the control and experimental conditions. Examples of quasi-experiment research design are the natural experiment or trend analysis. The word “quasi” means as if or almost, so a quasi-experiment means almost a true experiment. There are many varieties of quasi-experimental research designs, and there is generally little loss of status or prestige in doing a quasi-experiment instead of a true experiment, although you occasionally run into someone who is biased against quasi-experiments. Some common characteristics of quasi-experiments include the following:
• Matching instead of randomization is used. For example, someone studying the effects of a new police strategy in town would try to find a similar town somewhere in the same geographic region, perhaps in a 5-state area. That other town would have citizen demographics that are very similar to the experimental town. That other town is not technically a control group, but a comparison group, and this matching strategy is sometimes called non-equivalent group design.

• Time series analysis is involved. A time series is perhaps the most common type of longitudinal research found in criminal justice. A time series can be interrupted or non-interrupted. Both types examine changes in the dependent variable over time, with only an interrupted time series involving before and after measurement. For example, someone might use a time series to look at crime rates as a new law is taking effect. This kind of research is sometimes called impact analysis or policy analysis.
• The unit of analysis is often something different than people. Of course, any type of research can study anything—people, cars, crime statistics, neighborhood blocks. However, quasi-experiments are well suited for “fuzzy” or contextual concepts such as sociological quality of life, anomie, disorganization, morale, climate, atmosphere, and the like. This kind of research is sometimes called contextual analysis.

2

Quantitative and Qualitative Research

QUANTITATIVE RESEARCH In the social sciences, quantitative research refers to the systematic empirical investigation of quantitative properties and phenomena and their relationships. The objective of quantitative research is to develop and employ mathematical models, theories and/or hypotheses pertaining to phenomena. The process of measurement is central to quantitative research because it provides the fundamental connection between empirical observation and mathematical expression of quantitative relationships. Quantitative research is used widely in social sciences such as sociology, anthropology, and political science. Research in mathematical sciences such as physics is also ‘quantitative’ by definition, though this use of the term differs in context. In the social sciences, the term relates to empirical methods, originating in both philosophical positivism and the history of statistics, which contrast with qualitative research methods. Qualitative methods produce information only on the particular cases studied, and any more general conclusions are only hypotheses. Quantitative methods can be used to verify which of such hypotheses are true. Quantitative research is generally undertaken using scientific methods, which can include:
• The generation of models, theories and hypotheses
• The development of instruments and methods for measurement
• Experimental control and manipulation of variables
• Collection of empirical data
• Modeling and analysis of data

• Evaluation of results.
In the social sciences particularly, quantitative research is often contrasted with qualitative research which is the examination, analysis and interpretation of observations for the purpose of discovering underlying meanings and patterns of relationships, including classifications of types of phenomena and entities, in a manner that does not involve mathematical models. Approaches to quantitative psychology were first modelled on quantitative approaches in the physical sciences by Gustav Fechner in his work on psychophysics, which built on the work of Ernst Heinrich Weber. Although a distinction is commonly drawn between qualitative and quantitative aspects of scientific investigation, it has been argued that the two go hand in hand. For example, based on analysis of the history of science, Kuhn concludes that “large amounts of qualitative work have usually been prerequisite to fruitful quantification in the physical sciences”. Qualitative research is often used to gain a general sense of phenomena and to form theories that can be tested using further quantitative research. For instance, in the social sciences qualitative research methods are often used to gain better understanding of such things as intentionality and meaning. Although quantitative investigation of the world has existed since people first began to record events or objects that had been counted, the modern idea of quantitative processes has its roots in Auguste Comte’s positivist framework.

STATISTICS IN QUANTITATIVE RESEARCH Statistics is the most widely used branch of mathematics in quantitative research outside of the physical sciences, and also finds applications within the physical sciences, such as in statistical mechanics. Statistical methods are used extensively within fields such as economics, social sciences and biology. Quantitative research using statistical methods starts with the collection of data, based on the hypothesis or theory. Usually a large sample of data is collected - this would require verification, validation and recording before the analysis can take place. Software packages such as SPSS and R are typically used for this purpose. Causal relationships are studied by manipulating factors thought to influence the phenomena of interest while controlling other variables relevant to the experimental outcomes. In the field of health, for example, researchers might measure and study the relationship between dietary intake and measurable physiological effects such as weight loss, controlling for other key variables such as exercise. Quantitatively based opinion surveys are widely used in the media, with statistics such as the proportion of respondents in favour of a position commonly reported. In opinion surveys, respondents are asked a set of structured questions and their responses are tabulated. In the field of climate science, researchers compile and compare statistics such as temperature or atmospheric concentrations of carbon dioxide. Empirical relationships and associations are also frequently studied by using some form of general linear model, non-linear model, or by using factor analysis. A fundamental principle in quantitative research is that correlation does not imply causation. This principle follows from the fact that it is always possible a spurious relationship exists for variables between which covariance is found in some degree. Associations may be examined between any combination of continuous and categorical variables using methods of statistics.
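The point that correlation does not imply causation can be made concrete with a small simulation. In the sketch below, two variables have no direct effect on each other but are both driven by a shared confounder; the variable names and coefficients are invented purely for illustration.

```python
# A small simulation of a spurious correlation driven by a hidden confounder.
import numpy as np

rng = np.random.default_rng(42)
n = 1_000

temperature = rng.normal(size=n)                     # hidden confounder
ice_cream_sales = 2.0 * temperature + rng.normal(size=n)
heat_complaints = 1.5 * temperature + rng.normal(size=n)

r = np.corrcoef(ice_cream_sales, heat_complaints)[0, 1]
print(f"correlation = {r:.2f}")                      # large, despite no direct causal link

# Conditioning on the confounder (simple residualisation here) removes most
# of the association.
resid_a = ice_cream_sales - 2.0 * temperature
resid_b = heat_complaints - 1.5 * temperature
print(f"partial correlation given the confounder = {np.corrcoef(resid_a, resid_b)[0, 1]:.2f}")
```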

MEASUREMENT IN QUANTITATIVE RESEARCH Views regarding the role of measurement in quantitative research are somewhat divergent. Measurement is often regarded as being only a means by which observations are expressed numerically in order to investigate causal relations or associations. However, it has been argued that measurement often plays a more important role in quantitative research. For example, Kuhn argued that, within quantitative research, the results obtained can turn out to be anomalous with respect to accepted theory, and that such anomalies, arising as they do during the process of obtaining data, are often of particular interest, as seen below: When measurement departs from theory, it is likely to yield mere numbers, and their very neutrality makes them particularly sterile as a source of remedial suggestions. But numbers register the departure from theory with an authority and finesse that no qualitative technique can duplicate, and that departure is often enough to start a search. In classical physics, the theory and definitions which underpin measurement are generally deterministic in nature. In contrast, probabilistic measurement models known as the Rasch model and Item response theory models are generally employed in the social sciences. Psychometrics is the field of study concerned with the theory and technique for measuring social and psychological attributes and phenomena. This field is central to much quantitative research that is undertaken within the social sciences. Quantitative research may involve the use of proxies as stand-ins for other quantities that cannot be directly measured. Tree-ring width, for example, is considered a reliable proxy of ambient environmental conditions such as the warmth of growing seasons or amount of rainfall. Although scientists cannot directly measure the temperature of past years, tree-ring width and other climate proxies have been used to provide a semi-quantitative record of average temperature in the Northern Hemisphere back to 1000 A.D. When used in this way, the proxy record only reconstructs a certain amount of the variance of the original record. The proxy may be calibrated to determine how much variation is captured, including whether both short and long term variation is revealed. In the case of tree-ring width, different species in different places may show more or less sensitivity to, say, rainfall or temperature: when reconstructing a temperature record there is considerable skill in selecting proxies that are well correlated with the desired variable.

QUANTITATIVE METHODS Quantitative methods are research techniques that are used to gather quantitative data - information dealing with numbers and anything that is measurable. Statistics, tables and graphs, are often used to present the results of these methods. They are therefore to be distinguished from qualitative methods. In most physical and biological sciences, the use of either quantitative or qualitative methods is uncontroversial, and each is used when appropriate. In the social sciences, particularly in sociology, social anthropology and psychology, the use of one or other type of method has become a matter of controversy and even ideology, with particular schools of thought within each discipline favouring one type of method and pouring scorn on to the other. Advocates of quantitative methods argue that only by using such methods can the social sciences become truly scientific; advocates of qualitative methods argue that quantitative methods tend to obscure the reality of the social phenomena under study because they underestimate or neglect the non-measurable factors, which may be the most important. The modern tendency is to use eclectic approaches. Quantitative methods might be used with a global qualitative frame. Qualitative methods might be used to understand the meaning of the numbers produced by quantitative methods. Using quantitative methods, it is possible to give precise and testable expression to qualitative ideas. This combination of quantitative and qualitative data gathering is often referred to as mixed-methods research.

QUANTITATIVE RESEARCH DESIGN

Key Points
• The aim of quantitative research is to determine how one thing affects another in a population.
• Quantitative research designs are either descriptive or experimental.
• A descriptive study establishes only associations between variables. An experiment establishes causality.
• A descriptive study usually needs a sample of hundreds or even thousands of subjects for an accurate estimate of the relationship between variables. An experiment, especially a crossover, may need only tens of subjects.
• The estimate of the relationship is less likely to be biased if you have a high participation rate in a sample selected randomly from a population. In experiments, bias is also less likely if subjects are randomly assigned to treatments, and if subjects and researchers are blind to the identity of the treatments.
• In all studies, measure everything that could account for variation in the outcome variable.
• In an experiment, try to measure variables that might explain the mechanism of the treatment. In an unblinded experiment, such variables can help define the magnitude of any placebo effect.

TYPES OF DESIGN Research studies aimed at quantifying relationships are of two kinds: descriptive and experimental. In a descriptive study, no attempt is made to change behaviour or conditions—you measure things as they are. In an experimental study you take measurements, try some sort of intervention, then take measurements again to see what happened.

Types of Research Design

Descriptive or observational:
• Case
• Case series
• Cross-sectional
• Cohort or prospective or longitudinal
• Case-control or retrospective

Experimental or longitudinal or repeated-measures:
• Without a control group
  – Time series
  – Crossover
• With a control group

Descriptive Studies

Descriptive studies are also called observational, because you observe the subjects without otherwise intervening. The simplest descriptive study is a case, which reports data on only one subject; examples are studies of an outstanding athlete or of an athlete with an unusual injury. Descriptive studies of a few cases are called case series. In cross-sectional studies variables of interest in a sample of subjects are assayed once and analyzed. In prospective or cohort studies, some variables are assayed at the start of a study, then after a period of time the outcomes are determined. Another label for this kind of study is longitudinal, although this term also applies to experiments. Case-control studies compare cases with controls; comparison is made of the exposure to something suspected of causing the cases, for example volume of high intensity training, or number of cigarettes smoked per day. Case-control studies are also called retrospective, because they focus on conditions in the past that might cause subjects to become cases rather than controls.

A common case-control design in the exercise science literature is a comparison of the behavioral, psychological or anthropometric characteristics of elite and sub-elite athletes: you are interested in what the elite athletes have been exposed to that makes them better than the sub-elites. Another type of study compares athletes with sedentary people on some outcome such as an injury, disease, or disease risk factor. Here you know the difference in exposure, so these studies are really cohort or prospective, even though the exposure data are gathered retrospectively at only one time point. They are therefore known as historical cohort studies.

Experimental Studies

Experimental studies are also known as longitudinal or repeated-measures studies, for obvious reasons. They are also referred to as interventions, because you do more than just observe the subjects.

In the simplest experiment, a time series, one or more measurements are taken on all subjects before and after a treatment. A special case of the time series is the so-called single-subject design, in which measurements are taken repeatedly before and after an intervention on one or a few subjects. Time series suffer from a major problem: any change you see could be due to something other than the treatment. For example, subjects might do better on the second test because of their experience of the first test, or they might change their diet between tests because of a change in weather, and diet could affect their performance of the test.

The crossover design is one solution to this problem. Normally the subjects are given two treatments, one being the real treatment, the other a control or reference treatment. Half the subjects receive the real treatment first, the other half the control first. After a period of time sufficient to allow any treatment effect to wash out, the treatments are crossed over. Any effect of retesting or of anything that happened between the tests can then be subtracted out by an appropriate analysis. Multiple crossover designs involving several treatments are also possible.

If the treatment effect is unlikely to wash out between measurements, a control group has to be used. In these designs, all subjects are measured, but only some of them—the experimental group—then receive the treatment. All subjects are then measured again, and the change in the control group is compared with the change in the experimental group. If the subjects are assigned randomly to experimental and control groups or treatments, the design is known as a randomized controlled trial. Random assignment minimizes the chance that either group is not typical of the population. If the subjects are blind to the identity of the treatment, the design is a single-blind controlled trial. The control or reference treatment in such a study is called a placebo: the name physicians use for inactive pills or treatments that are given to patients in the guise of effective treatments. Blinding of subjects eliminates the placebo effect, whereby people react differently to a treatment if they think it is in some way special. In a double-blind study, the experimenter also does not know which treatment the subjects receive until all measurements are taken. Blinding of the experimenter is important to stop him or her treating subjects in one group differently from those in another. In the best studies even the data are analyzed blind, to prevent conscious or unconscious fudging or prejudiced interpretation.

Ethical considerations or lack of cooperation by the subjects sometimes prevent experiments from being performed. For example, a randomized controlled trial of the effects of physical activity on heart disease has yet to be reported, because it is unethical and unrealistic to randomize people to 10 years of exercise or sloth. But there have been many short-term studies of the effects of physical activity on disease risk factors.

Quality of Designs

The various designs differ in the quality of evidence they provide for a cause-and-effect relationship between variables. Cases and case series are the weakest. A well-designed cross-sectional or case-control study can provide good evidence for the absence of a relationship. But if such a study does reveal a relationship, it generally represents only suggestive evidence of a causal connection. A cross-sectional or case-control study is therefore a good starting point to decide whether it is worth proceeding to better designs. Prospective studies are more difficult and time-consuming to perform, but they produce more convincing conclusions about cause and effect. Experimental studies are definitive about how something affects something else, and with far fewer subjects than descriptive studies! Double-blind randomized controlled trials are the best experiments.

Confounding is a potential problem in descriptive studies that try to establish cause and effect. Confounding occurs when part or all of a significant association between two variables arises through both being causally associated with a third variable. For example, in a population study you could easily show a negative association between habitual activity and most forms of degenerative disease. But older people are less active, and older people are more diseased, so you’re bound to find an association between activity and disease without one necessarily causing the other. To get over this problem you have to control for potential confounding factors. For example, you make sure all your subjects are the same age, or you do sophisticated statistical analysis of your data to try to remove the effect of age on the relationship between the other two variables.
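The age example can be made concrete with a small simulation: age drives both inactivity and disease risk, so activity and disease appear associated even though, in this simulated data, neither causes the other. Adjusting for age in a multiple regression removes most of the spurious association. This is only an illustrative sketch with invented numbers, not an analysis from the text, and regression adjustment is just one of several ways to control for a confounder.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

# Age drives both activity (older -> less active) and disease risk
# (older -> more disease); activity has no direct effect on disease here.
age = rng.uniform(20, 80, n)
activity = 10.0 - 0.08 * age + rng.standard_normal(n)
disease_risk = 0.05 * age + rng.standard_normal(n)

# Crude association: regress disease risk on activity alone.
crude = np.polyfit(activity, disease_risk, 1)[0]

# Adjusted association: regress disease risk on activity AND age.
X = np.column_stack([np.ones(n), activity, age])
adjusted = np.linalg.lstsq(X, disease_risk, rcond=None)[0][1]

print(f"crude activity coefficient:   {crude:+.3f}  (spurious, negative)")
print(f"age-adjusted activity coeff.: {adjusted:+.3f}  (near zero)")
```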

SAMPLES

You almost always have to work with a sample of subjects rather than the full population. But people are interested in the population, not your sample. To generalize from the sample to the population, the sample has to be representative of the population. The safest way to ensure that it is representative is to use a random selection procedure. You can also use a stratified random sampling procedure, to make sure that you have proportional representation of population subgroups.

Selection bias occurs when the sample is not representative of the population. More accurately, a sample statistic is biased if the expected value of the statistic is not equal to the value of the population statistic. A typical source of bias in population studies is age or socioeconomic status: people with extreme values for these variables tend not to take part in the studies. Thus a high compliance is important in avoiding bias. Journal editors are usually happy with compliance rates of at least 70%.

Failure to randomize subjects to control and treatment groups in experiments can also produce bias: if you let people select themselves into the groups, or if you select the groups in any way that makes one group different from another, then any result you get might reflect the group difference rather than an effect of the treatment. For this reason, it’s important to randomly assign subjects in a way that ensures the groups are balanced in terms of important variables that could modify the effect of the treatment. Randomize subjects to groups as follows: rank-order the subjects on the basis of the variable you most want to keep balanced; split the list up into pairs; assign subjects in each pair to the treatments by flipping a coin; check the mean values of your other variables in the two groups, and reassign randomly chosen pairs to balance up these mean values. Human subjects may not be happy about being randomized, so you need to state clearly that it is a condition of taking part.
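The randomization procedure just described (rank on the key variable, form pairs, flip a coin within each pair) can be sketched in a few lines of code. The subjects and baseline scores below are invented, and the final balance-checking and reassignment step mentioned above is shown only as a printed check; treat this as a minimal illustration rather than a validated allocation tool.

```python
import random

# Hypothetical subjects with the baseline variable we most want balanced.
subjects = {"S01": 42, "S02": 55, "S03": 61, "S04": 48,
            "S05": 59, "S06": 44, "S07": 66, "S08": 50}

random.seed(7)

# 1. Rank-order the subjects on the baseline variable.
ranked = sorted(subjects, key=subjects.get)

# 2. Split the ranked list into consecutive pairs.
pairs = [ranked[i:i + 2] for i in range(0, len(ranked), 2)]

# 3. Within each pair, flip a coin to assign one subject to each group.
treatment, control = [], []
for a, b in pairs:
    if random.random() < 0.5:
        a, b = b, a
    treatment.append(a)
    control.append(b)

# 4. Check balance on the baseline variable (reassign pairs by hand if needed).
mean = lambda group: sum(subjects[s] for s in group) / len(group)
print("treatment:", treatment, "mean =", mean(treatment))
print("control:  ", control, "mean =", mean(control))
```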

Effect of Research Design

The type of design you choose for your study has a major impact on the sample size. Descriptive studies need hundreds of subjects to give acceptable confidence intervals for small effects. Controlled trials generally need one-tenth as many, and crossovers need even fewer: one-quarter of the number for an equivalent trial with a control group. Details are given on the stats pages.

Effect of Validity and Reliability

The precision with which you measure things also has a major impact on sample size: the worse your measurements, the more subjects you need to lift the signal out of the noise. Precision is expressed as validity and reliability. Validity represents how well a variable measures what it is supposed to. Validity is important in descriptive studies: if the validity of the main variables is poor, you may need thousands rather than hundreds of subjects. Reliability tells you how reproducible your measures are on a retest, so it impacts on experimental studies. The more reliable a measure, the fewer subjects you need to see a small change in the measure. For example, a controlled trial with 20 subjects in each group or a crossover with 10 subjects may be sufficient to characterize even a small effect, if the measure is highly reliable. See the details on the stats pages.

Pilot Studies

As a student researcher, you might not have enough time or resources to get a sample of optimum size. Your study can nevertheless be a pilot for a larger study. Pilot studies should be done to develop, adapt, or check the feasibility of techniques, or to calculate how big the final sample needs to be. In the latter case, the pilot should be performed with the same sampling procedure and techniques as in the larger study. For experimental designs, a pilot study can consist of the first 10 or so observations of a larger study. If you get respectable confidence limits, there may be no point in continuing to a larger sample. Publish and move on to the next project or lab!

Meta-Analysis If you can’t test enough subjects to get an acceptably narrow confidence interval, you should still be able to publish your finding, because your study will at least set bounds on how big and how small the effect can be. Your finding can be combined with the findings of similar studies in something called a meta-analysis, which derives a confidence interval for the effect from several studies. If your study is not published, it can’t contribute to the meta-analysis! Unfortunately, many reviewers and editors do not appreciate the importance of publishing studies with suboptimal sample sizes. They are still locked into thinking that only statistically significant results are publishable.
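As an illustration of how several individually imprecise studies can be combined, the sketch below performs a simple fixed-effect (inverse-variance weighted) meta-analysis of hypothetical effect estimates and their standard errors. The numbers are invented and the fixed-effect method shown is just one common approach, not necessarily the one a particular review would use.

```python
import math

# Hypothetical effect estimates (e.g. mean differences) and standard errors
# from three small studies, none individually precise.
studies = [(1.2, 0.9), (0.8, 1.1), (1.5, 1.0)]

# Fixed-effect meta-analysis: weight each study by 1 / SE^2.
weights = [1.0 / se**2 for _, se in studies]
pooled = sum(w * est for (est, _), w in zip(studies, weights)) / sum(weights)
pooled_se = math.sqrt(1.0 / sum(weights))

# Approximate 95% confidence interval for the pooled effect.
lower, upper = pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se
print(f"pooled effect = {pooled:.2f}, 95% CI ({lower:.2f}, {upper:.2f})")
```

The pooled confidence interval is narrower than any single study's interval, which is why publishing studies with suboptimal sample sizes still contributes useful information.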

QUANTITATIVE DATA ANALYSIS

Summarizing Data

Producing summary reports of aggregated transaction data for decision support systems can be a complex and resource-intensive operation. Microsoft® SQL Server™ 2000 provides two flexible and powerful components for building such summaries. These components are the main tools programmers should use in performing multidimensional analysis of SQL Server data:
• Data Transformation Services (DTS). DTS supports extracting transaction data and transforming it into summary aggregates in a data warehouse or data mart. For more information, see DTS Overview.
• Microsoft SQL Server Analysis Services. Analysis Services organizes data from a data warehouse into multidimensional cubes with precalculated summary information to provide rapid answers to complex analytical queries. PivotTable® Service provides client access to multidimensional data. Analysis Services also provides a set of wizards for defining the multidimensional structures used in the Analysis processing, and a Microsoft Management Console snap-in for administering the Analysis structures. Applications can then use either the OLE DB for Analysis API or the Microsoft ActiveX Data Objects (Multidimensional) (ADO MD) API to analyze the Analysis data.
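The general idea of pre-aggregating transaction data into multidimensional summaries can be illustrated outside SQL Server with a small pandas sketch: raw transactions are rolled up by the dimensions of interest (here, region and month) so that analytical queries hit a compact summary instead of the raw rows. This is only an analogue of what DTS and Analysis Services do, using invented data; it is not their actual API.

```python
import pandas as pd

# Hypothetical raw transaction data (the kind of detail held in an operational system).
transactions = pd.DataFrame({
    "region":  ["North", "North", "South", "South", "South", "North"],
    "month":   ["Jan", "Feb", "Jan", "Jan", "Feb", "Jan"],
    "product": ["A", "B", "A", "B", "A", "A"],
    "sales":   [120.0, 95.5, 80.0, 60.25, 150.0, 40.0],
})

# Roll the transactions up into a summary "cube" over two dimensions,
# precalculating the aggregates that decision-support queries will need.
summary = transactions.pivot_table(
    values="sales", index="region", columns="month",
    aggfunc="sum", margins=True, margins_name="Total",
)
print(summary)
```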

Variables

Variables aren’t always ‘quantitative’ or numerical. The variable ‘gender’ consists of two text values: ‘male’ and ‘female’. We can, if it is useful, assign quantitative values instead of the text values, but we don’t have to assign numbers in order for something to be a variable. It’s also important to realize that variables aren’t only things that we measure in the traditional sense. For instance, in much social research and in programme evaluation, we consider the treatment or programme to be made up of one or more variables. An educational programme can have varying amounts of ‘time on task’, ‘classroom settings’, ‘student-teacher ratios’, and so on. So even the programme can be considered a variable.

An attribute is a specific value on a variable. For instance, the variable sex or gender has two attributes: male and female. Or, the variable agreement might be defined as having five attributes:
• 1 = strongly disagree
• 2 = disagree
• 3 = neutral
• 4 = agree
• 5 = strongly agree

Another important distinction having to do with the term ‘variable’ is the distinction between an independent and dependent variable.
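As a small illustration of assigning quantitative values to the attributes of a variable, the sketch below codes responses on the five-point agreement scale above. The response data are invented for the example.

```python
# Codes for the attributes of the 'agreement' variable described above.
AGREEMENT_CODES = {
    "strongly disagree": 1,
    "disagree": 2,
    "neutral": 3,
    "agree": 4,
    "strongly agree": 5,
}

# Hypothetical raw responses recorded as text attributes.
responses = ["agree", "neutral", "strongly agree", "disagree", "agree"]

coded = [AGREEMENT_CODES[r] for r in responses]
print("coded responses:", coded)
print("mean agreement: ", sum(coded) / len(coded))
```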

Simple Statistics

There are a wide variety of useful statistical tools that you will encounter in your chemical studies, and we wish to introduce some of them to you here. Many of the more advanced calculators have excellent statistical capabilities built into them, but the statistics we’ll do here require only basic calculator competence and capabilities.

Arithmetic Mean, Error, Percent Error, and Percent Deviation

The statistical tools you’ll either love or hate! These are the calculations that most chemistry professors use to determine your grade in lab experiments, specifically percent error. Of all of the terms below, you are probably most familiar with “arithmetic mean”, otherwise known as an “average”.
• Mean — add all of the values and divide by the total number of data points.
• Error — subtract the theoretical value from your experimental data point.
• Percent error — take the absolute value of the error divided by the theoretical value, then multiply by 100.
• Deviation — subtract the mean from the experimental data point.
• Percent deviation — divide the deviation by the mean, then multiply by 100.

Arithmetic mean = (sum of data points) / (number of data points, n)

Error = experimental value − “true” or theoretical value

Percent error = |error| / theoretical value × 100

Deviation = Experimental value - arithmetic mean

Percent deviation = deviation / arithmetic mean × 100

A sample problem should make this all clear: in the lab, the boiling point of a liquid, which has a theoretical value of 54.0 °C, was measured by a student four (4) times. Determine, for each measurement, the error, percent error, deviation, and percent deviation.

Observed value   Error   Percent error   Deviation   Percent deviation
54.9             0.9     1.7%             0.5         0.9%
54.4             0.4     0.7%             0.0         0.0%
54.1             0.1     0.2%            -0.3        -0.6%
54.2             0.2     0.4%            -0.2        -0.4%

We show the calculations for the first data point as an example:

Arithmetic mean = (54.9 + 54.4 + 54.1 + 54.2) / 4 = 54.4

Error = 54.9 − 54.0 = 0.9

Percent error = 0.9 / 54.0 × 100 ≈ 1.7%

Deviation = 54.9 − 54.4 = 0.5

Percent deviation = 0.5 / 54.4 × 100 ≈ 0.9%
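The whole worked example can be reproduced with a short script. This is just a sketch of the calculations defined above, using the four boiling-point readings and the 54.0 °C theoretical value; the printed percentages are unrounded, so they may differ slightly from a hand calculation that rounds early.

```python
observations = [54.9, 54.4, 54.1, 54.2]   # measured boiling points (deg C)
theoretical = 54.0                        # accepted ("true") value

mean = sum(observations) / len(observations)
print(f"arithmetic mean = {mean:.1f}")

for x in observations:
    error = x - theoretical
    percent_error = abs(error) / theoretical * 100
    deviation = x - mean
    percent_deviation = deviation / mean * 100
    print(f"{x}: error={error:.1f}  %error={percent_error:.1f}%  "
          f"deviation={deviation:.1f}  %deviation={percent_deviation:.1f}%")
```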

Standard Deviation

Standard deviation is a particularly useful tool, perhaps not one that the professor necessarily will require you to calculate, but one that is useful to you in helping you judge the “spread-outness” of your data. Typically, you hope that your measurements are all pretty close together. Picture a generic plot of the standard deviation: the familiar bell-shaped curve centred on the mean.

One standard deviation away from the mean in either direction on the horizontal axis accounts for somewhere around 68 percent of the data points. Two standard deviations, or two sigmas, away from the mean account for roughly 95 percent of the data points. Three (3) standard deviations account for about 99 percent of the data points. If this curve were flatter and more spread out, the standard deviation would have to be larger in order to account for those 68 percent or so of the points. That’s why the standard deviation can tell you how spread out the examples in a set are from the mean.

How do you calculate the standard deviation? It’s not too difficult, but it IS tedious, unless you have a calculator that handles statistics. The formula for the standard deviation is as follows:

σ = √( Σd² / (n − 1) )

Basically, what this says is as follows:
1. Find the deviation “d” for each data point.
2. Square the value of d (d times itself).
3. Sum (add up) all of the squares.
4. Divide the sum by the number of data points (n) minus 1.
5. Take the square root of that value.

If you have a statistics-capable calculator, this is really easy to do, since there is a button that allows you to do this. We, however, don’t have a stats calculator, so we have to do it the hard way. In this example, the student has measured the percentage of chlorine (Cl) in an experiment a total of five times. The arithmetic mean is calculated to be 19.71. The student wants to find out the standard deviation for the data set, with particular interest in the range of values from one sigma below the mean to one sigma above the mean:

Trial               1        2        3        4        5
Percentage of Cl    19.82    19.57    19.68    19.71    19.75
d                   0.11     -0.14    -0.03    0.00     0.04
d²                  0.0121   0.0196   0.0009   0.0000   0.0016

Adding up all of the d² = 0.0342
Dividing 0.0342 by 4 (or 5 − 1) = 0.0086
Taking the square root of 0.0086 = 0.09

This means that the standard deviation for this problem is 0.09, and that if we keep doing the experiment, most (68% or so) of the data points should be between 19.62 (19.71 − 0.09) and 19.80 (19.71 + 0.09). The lower the standard deviation, the better the measurements are.
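The same calculation can be checked in a few lines of code. This sketch reproduces the five chlorine readings and the n − 1 (sample) standard deviation described above.

```python
import math

readings = [19.82, 19.57, 19.68, 19.71, 19.75]  # percentage of Cl, five trials

n = len(readings)
mean = sum(readings) / n

# Deviations from the mean and the sum of their squares.
deviations = [x - mean for x in readings]
sum_sq = sum(d * d for d in deviations)

# Sample standard deviation: divide by n - 1, then take the square root.
sigma = math.sqrt(sum_sq / (n - 1))

print(f"mean  = {mean:.2f}")
print(f"sigma = {sigma:.2f}")
print(f"about 68% of readings expected between {mean - sigma:.2f} and {mean + sigma:.2f}")
```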

Effect Statistic

Effect size is a statistical concept that measures the strength of the relationship between two variables on a numeric scale. For instance, if we have data on the height of men and women and we notice that, on average, men are taller than women, the difference between the height of men and the height of women is known as the effect size. The greater the effect size, the greater the height difference between men and women will be. The effect size statistic helps us determine whether the difference is real or whether it is due to chance factors. In hypothesis testing, effect size, power, sample size and critical significance level are related to each other. In meta-analysis, effect sizes from different studies are combined into a single analysis. In statistical analysis, the effect size is usually measured in one of three ways:
1. Standardized mean difference,
2. Odds ratio,
3. Correlation coefficient.

Type of Effect Size

1. Pearson r correlation: Pearson r correlation was developed by Karl Pearson, and it is most widely used in statistics. This parameter of effect size is denoted by r. The value of the effect size of Pearson r correlation varies between −1 and +1. According to Cohen, the effect size is low if the value of r varies around 0.1, medium if r varies around 0.3, and large if r varies more than 0.5. The Pearson correlation is computed using the following formula:

r = [ NΣxy − (Σx)(Σy) ] / √{ [ NΣx² − (Σx)² ] [ NΣy² − (Σy)² ] }

where
r = correlation coefficient
N = number of pairs of scores
Σxy = sum of the products of paired scores
Σx = sum of x scores
Σy = sum of y scores
Σx² = sum of squared x scores
Σy² = sum of squared y scores

2. Standardized mean difference: When a research study is based on the population means and standard deviations, the following method is used to find the effect size:

θ = (μ₁ − μ₂) / σ

That is, the effect size for the population is found by dividing the difference between the two population means by their standard deviation.

Cohen’s d effect size: Cohen’s d is the difference between two population means divided by a standard deviation calculated from the data. Mathematically, Cohen’s effect size is denoted by:

d = (x̄₁ − x̄₂) / s

where s can be calculated using this formula:

s = √{ [ (n₁ − 1)s₁² + (n₂ − 1)s₂² ] / (n₁ + n₂) }

Glass’s Δ method of effect size: This method is similar to Cohen’s method, but here the standard deviation of the second (control) group is used. Mathematically the formula can be written as:

Δ = (x̄₁ − x̄₂) / s₂

Hedges’ g method of effect size: This method is a modified version of Cohen’s d. Hedges’ g can be written mathematically as follows:

g = (x̄₁ − x̄₂) / s*

where the pooled standard deviation s* is calculated using this formula:

s* = √{ [ (n₁ − 1)s₁² + (n₂ − 1)s₂² ] / (n₁ + n₂ − 2) }

Cohen’s f² method of effect size: Cohen’s f² measures the effect size when methods such as ANOVA or multiple regression are used. Cohen’s f² for multiple regression is defined as:

f² = R² / (1 − R²)

where R² is the squared multiple correlation.

Cramér’s φ or Cramér’s V method of effect size: Chi-square is the best statistic to measure the effect size for nominal data. When a nominal variable has two categories, Cramér’s phi is the best statistic to use. When there are more than two categories, Cramér’s V statistic will give the best result for measuring the effect size of nominal data.

3. Odds ratio: The odds ratio is another useful measure of effect size. The odds ratio is the odds of success in the treatment group relative to the odds of success in the control group. This method is used when the data are binary. For example, it is used if we have the following table:

Frequency           Success   Failure
Treatment group     a         b
Control group       c         d

To measure the effect size of the table, we can use the following odds ratio formula:

Effect size = ad / bc
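The formulas above can be implemented directly. The sketch below computes Pearson’s r, Cohen’s d (with the pooled s as defined in this section), and the odds ratio for invented data; it follows the formulas as given here, so treat it as an illustration rather than a reference implementation.

```python
import math

def pearson_r(x, y):
    """Pearson correlation from the raw-score formula given above."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sx2, sy2 = sum(a * a for a in x), sum(b * b for b in y)
    return (n * sxy - sx * sy) / math.sqrt((n * sx2 - sx**2) * (n * sy2 - sy**2))

def cohens_d(group1, group2):
    """Cohen's d using the pooled standard deviation with n1 + n2 in the denominator."""
    n1, n2 = len(group1), len(group2)
    m1, m2 = sum(group1) / n1, sum(group2) / n2
    v1 = sum((x - m1) ** 2 for x in group1) / (n1 - 1)
    v2 = sum((x - m2) ** 2 for x in group2) / (n2 - 1)
    s = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2))
    return (m1 - m2) / s

def odds_ratio(a, b, c, d):
    """Odds ratio ad / bc for a 2x2 table of successes and failures."""
    return (a * d) / (b * c)

# Invented example data.
heights_men = [178, 182, 175, 180, 185]
heights_women = [165, 170, 168, 162, 172]
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

print(f"Pearson r  = {pearson_r(x, y):.2f}")
print(f"Cohen's d  = {cohens_d(heights_men, heights_women):.2f}")
print(f"odds ratio = {odds_ratio(30, 10, 15, 25):.2f}")
```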

Statistical Model

A statistical model is a set of mathematical equations which describe the behaviour of an object of study in terms of random variables and their associated probability distributions. If the model has only one equation it is called a single-equation model, whereas if it has more than one equation it is known as a multiple-equation model.

In mathematical terms, a statistical model is frequently thought of as a pair (Y, P), where Y is the set of possible observations and P the set of possible probability distributions on Y. It is assumed that there is a distinct element of P which generates the observed data. Statistical inference enables us to make statements about which element(s) of this set are likely to be the true one.

Three notions are sufficient to describe all statistical models.
1. We choose a statistical unit, such as a person, to observe directly. Multiple observations of the same unit over time is called longitudinal research. Observation of multiple statistical attributes is a common way of studying relationships among the attributes of a single unit.
2. Our interest may be in a statistical population of similar units rather than in any individual unit. Survey sampling offers an example of this type of modeling.
3. Our interest may focus on a statistical assembly, where we examine functional subunits of the statistical unit. For example, physiological modeling probes the organs which compose the unit. A common model for this type of research is the stimulus-response model.

One of the most basic models is the simple linear regression model, which assumes a relationship between two random variables Y and X. For instance, one may want to linearly explain child mortality in a given country by its GDP. This is a statistical model because the relationship need not be perfect and the model includes a disturbance term which accounts for effects on child mortality other than GDP. As a second example, Bayes’ theorem in its raw form may be intractable, but assuming a general model H allows it to become

P(A | B, H) = P(B | A, H) P(A | H) / P(B | H)

which may be easier to work with. Models can also be compared using measures such as Bayes factors or mean square error.
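Here is a minimal sketch of the simple linear regression model mentioned above, fitted by ordinary least squares to invented data (child mortality explained by GDP per capita), with a comparison of two candidate models by mean square error. The numbers and variable names are made up for the example; the random noise plays the role of the disturbance term.

```python
import numpy as np

rng = np.random.default_rng(42)

# Invented data: GDP per capita (thousands of dollars) and child mortality
# (deaths per 1,000 live births). The noise term acts as the disturbance
# that accounts for effects other than GDP.
gdp = rng.uniform(1, 60, 40)
mortality = 80 - 1.1 * gdp + rng.normal(0, 8, size=40)

# Model 1: intercept only (mortality unrelated to GDP).
pred_const = np.full_like(mortality, mortality.mean())

# Model 2: simple linear regression mortality = a + b * GDP, fitted by least squares.
b, a = np.polyfit(gdp, mortality, deg=1)
pred_linear = a + b * gdp

# Compare the two models by mean square error, as mentioned above.
mse = lambda pred: np.mean((mortality - pred) ** 2)
print(f"linear model:  mortality = {a:.1f} + ({b:.2f}) * GDP")
print(f"MSE intercept-only: {mse(pred_const):.1f}")
print(f"MSE linear model:   {mse(pred_linear):.1f}")
```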

Competency Model

Competencies are characteristics which drive outstanding performance in a given job, role or function. A competency model refers to a group of competencies required in a particular job, usually 7 to 9 in total. The number and type of competencies in a model will depend upon the nature and complexity of the work, along with the culture and values of the organisation in which the work takes place. Since the early 1970s, leading organizations have been using competencies to help recruit, select and manage their outstanding performers, after Dr David McClelland, Harvard Business School Professor of Psychology, found that traditional tests such as academic aptitude and knowledge tests did not predict success in the job. More recent research by individuals such as Daniel Goleman, in Emotional Intelligence, and Richard Boyatzis, in The Competent Manager, has reinforced and emphasised the importance of competencies as essential predictors of outstanding performance. A competence model, also known as a competency framework, uses the five competences described earlier. These will support the primary tasks and the job specific tasks. Together these tasks reflect the purpose of the job.

QUALITATIVE RESEARCH

Qualitative research plays a major role in supporting marketing decision making, primarily as an exploratory design but also as a descriptive design. Researchers may undertake qualitative research to help define a research problem, to support quantitative, descriptive or causal research designs or as a design in its own right. Qualitative research is often used to generate hypotheses and identify variables that should be included in quantitative approaches. It may be used after or in conjunction with quantitative approaches where illumination of statistical findings is needed. In some cases qualitative research designs are adopted in isolation, after secondary data sources have been thoroughly evaluated or even in an iterative process with secondary data sources.

In this chapter, we discuss the differences between qualitative and quantitative research and the role of each in marketing research. We present reasons for adopting a qualitative approach to marketing research. These reasons are developed by examining the basic philosophical stances that underpin qualitative research. The concept of ethnographic techniques is presented, with illustrations of how such techniques support marketing decision-makers. The concept of grounded theory is presented, illustrating its roots, the steps involved and the dilemmas for researchers in attempting to be objective and sensitive to the expressions of participants. Action research is an approach to conducting research that has been adopted in a wide variety of social and management research settings. Action research is developing in marketing research and offers great potential for consumers, decision-makers and researchers alike. The roots of action research are presented, together with the iterative stages involved and the concept of action research teams. The considerations involved in conducting qualitative research when researching international markets are discussed, especially in contrasting approaches between the USA and Europe. Several ethical issues that arise in qualitative research are identified.
Before the chapter moves on to the substantive issues, one key point to note in the application of qualitative research is the name applied to the individuals who take part in interviews and observations. Note the emphasis in the following example.

• Example:
  – ‘Research is War’: Dutch agency MARE is calling for respondents to be promoted to the status of participants in research. The method that it has been developing puts the respondent in the position of participant rather than a passive reactive pawn in research. We take two participants, show them an image or commercial, and then invite one participant to ask the other what he or she just saw. This results in a conversation that is not directed by researchers, which is an important aspect. Our intervention is mostly focused on finding a good match between two participants who can communicate with one another. The method reveals how a consumer absorbs information and reports about it to fellow consumers, and it shows the client which elements of a commercial message work and which elements do not. Clients were reluctant to use this approach when it was first used in 1995. MARE senses there is a market for it now, so with an amount of refining, adjusting and testing, it ran from September 2005. A multinational in the Netherlands which has young marketers will apply it in research among young consumers.

This example illustrates the creative thinking necessary to get the most from qualitative research and the respect that should be given to individuals who may be asked to engage in a process that sometimes goes way beyond simple questioning. Embracing this attitude, the term ‘participant’ rather than ‘respondent’ or ‘informant’ is used throughout Chapters, i.e., the core qualitative research chapters of the text.

We now move on to the nature of qualitative research with three examples. Given the nature of its products and competitive environment, the first example illustrates why L’Oréal feels that qualitative research is of importance to it. The second example illustrates how Philips uses qualitative techniques to support its trend forecasting and product design. Note in this example the use of an analytical framework to help researchers and decision-makers gain insight from the data they collect. In the third example, Sports Marketing Surveys uses qualitative brainstorming techniques and focus groups as part of a research design to support the development of rugby league. These examples illustrate the rich insights into the underlying behaviour of consumers that can be obtained by using qualitative techniques.

• Example:

  – A research commitment more than skin deep: L’Oréal is the largest supplier of toiletries and cosmetics in the world. The group tucks under its umbrella some of the best known brands and companies in the beauty business: cosmetics houses Lancôme, Vichy and Helena Rubenstein, and fragrance houses Guy Laroche, Cacharel and Ralph Lauren. Given the French penchant for qualitative research, and given the nature of the cosmetics industry, Anne Murray, Head of Research, was asked which type of research she favoured. ‘We’re not particularly pro-quantitative or qualitative. Nevertheless, I do think qualitative in our area is very important. There are many sensitive issues to cover – environmental concerns, animal testing, intimate personal products. And increasingly, we have given to us very technical propositions from the labs, and what is a technical breakthrough to a man in a white coat is not necessarily so to a consumer. So the research department has to be that interface between the technical side and the consumer.’

  – Trend forecasting at Philips: Marco Bevolo is in the future business. At Philips Design, the design department of the electronics multinational, he is responsible for identifying short-term trends in popular culture and aesthetics design. Marco believes his job has a lot in common with marketing research. Trend forecasting at Philips is carried out through a model developed by Marco and his team called Culture scan. The theoretical background of their approach comes from the Birmingham school of popular culture analysis as well as from the kind of cultural analysis performed by scholars who study phenomena such as ‘punks’ in Western cities as if they were a tribe in New Guinea. Rather than hinting at tangible design solutions – colour, ‘touch and feel’ or shape – of TVs, MP3 players or other specific products, Culture scan is supposed to provide an insight into the broader, longer term undercurrents in popular culture and aesthetics design all over the world. These broad trends are then customised by Philips decision teams. Culture scan uses both an internal and external network of experts to collect information on a wide variety of trends. These insights are then filtered by objective tools in order to validate the outcomes and make them actionable. The predictive horizon of Culture scan is 18 to 36 months, with the trends refined every two or three years.

In qualitative research, research agencies and companies are continually looking to find better ways to understand consumers’ thought processes and motivations. This has led to a wealth of research approaches, including techniques borrowed from anthropology, ethnography, sociology and psychology. For example, Intel has a specialist team of researchers, including ethnographers, anthropologists and psychologists, whose principal form of research is in the home. ‘People don’t tell you things because they don’t think you’ll be interested. By going into their homes you can see where and how they use their computers,’ says Wendy March, Intel’s interaction designer of Intel Architecture.

PRIMARY DATA: QUALITATIVE VERSUS QUANTITATIVE RESEARCH As explained in Chapter, primary data are originated by the researcher for the specific purpose of addressing the problem at hand. Primary data may be qualitative or quantitative in nature, as shown in Figure.

Fig. A classification of marketing research data.

Dogmatic positions are often taken in favour of either qualitative research or quantitative research by marketing researchers and decision-makers alike. The positions are founded upon which approach is perceived to give the most accurate understanding of consumers. The extreme stances on this issue mirror each other. Many quantitative researchers are apt to dismiss qualitative studies completely as giving no valid findings – indeed as being little better than journalistic accounts. They assert that qualitative researchers ignore representative sampling, with their findings based on a single case or only a few cases. Equally adamant are some qualitative researchers who firmly reject statistical and other quantitative methods as yielding shallow or completely misleading information. They believe that to understand cultural values and consumer behaviour requires interviewing or intensive field observation. Qualitative techniques they see as being the only methods of data collection sensitive enough to capture the nuances of consumer attitudes, motives and behaviour.

There are great differences between the quantitative and qualitative approaches to studying and understanding consumers. The arguments between qualitative and quantitative marketing researchers about their relative strengths and weaknesses are of real practical value. The nature of marketing decision making encompasses a vast array of problems and types of decision-maker. This means that seeking a singular and uniform approach to supporting decision-makers by focusing on one approach is futile. Defending qualitative approaches for a particular marketing research problem through the positive benefits they bestow, and explaining the negative alternatives of a quantitative approach, is healthy – and vice versa. Business and marketing decision-makers use both approaches and will continue to need both.

The distinction between qualitative and quantitative research can be seen in the context of research designs as discussed in Chapter. There is a close parallel in the distinctions between ‘exploratory and conclusive research’ and ‘qualitative and quantitative research’. There is a parallel, but the terms are not identical. There are circumstances where qualitative research can be used to present detailed descriptions that cannot be measured in a quantifiable manner: for example, in describing characteristics and styles of music that may be used in an advertising campaign or in describing the interplay of how families go through the process of choosing, planning and buying a holiday. Conversely, there may be circumstances where quantitative measurements are used conclusively to answer specific hypotheses or research questions using descriptive or experimental techniques. Beyond answering specific hypotheses or research questions, there may be sufficient data to allow data mining or an exploration of relationships between individual measurements to take place. The concept of data mining illustrated in Chapter allows decision-makers to be supported through exploratory quantitative research.

THE NATURE OF QUALITATIVE RESEARCH

Qualitative research encompasses a variety of methods that can be applied in a flexible manner, to enable participants to reflect upon and express their views or to observe their behaviour. It seeks to encapsulate the behaviour, experiences and feelings of participants in their own terms and context; for example, when conducting research on children, an informality and child-friendly atmosphere is vital, considering features such as the decoration of the room with appropriately themed posters.

Qualitative research is based on at least two intellectual traditions. The first and perhaps most important is the set of ideas and associated methods from the broad area of depth psychology. This movement was concerned with the less conscious aspects of the human psyche. It led to a development of methods to gain access to individuals’ subconscious and/or unconscious levels. So, while individuals may present a superficial explanation of events to themselves or to others, these methods sought to dig deeper and penetrate the superficial. The second tradition is the set of ideas and associated methods from sociology, social psychology and social anthropology, and the disciplines of ethnography, linguistics and semiology. The emphases here are upon holistic understanding of the world-view of people. The researcher is expected to ‘enter’ the hearts and minds of those they are researching, to develop an empathy with their experiences and feelings. Both traditions have a concern with developing means of communication between the researcher and those being researched. There can be much interaction between the two broad traditions, which in pragmatic terms allows a wide and rich array of techniques and interpretations of collected data.

Qualitative research is a significant contributor to the marketing research industry, accounting for substantial expenditure, and is growing. In commercial terms, it is a billion-euro-plus global industry. However, it is not just a matter of business value. Qualitative thinking has had a profound effect upon marketing and the marketing research industry as a whole.

RATIONALE FOR USING QUALITATIVE RESEARCH

It is not always possible, or desirable, to use structured quantitative techniques to obtain information from participants or to observe them. Thus, there are several reasons to use qualitative techniques.

• Example – First get the language right, then tell them a story: Teenagers immediately recognise a communication in their language and are very quick to judge whether advertisers have got it right. They see ads and like them, reject them, ignore them or, in many cases, discuss them. Teenagers are so fluent in ‘marketing speak’ because marketing and advertising are perceived by them to be the kind of work which can be creative, interesting and acceptable. They discuss with one another the advertising which they perceive to be targeting them. Pelgram Walters International conducted a study called Global Village. The main contention of the study was that teenagers around the world have a common language, which speaks to them in the filmed advertising medium. Part of the study consisted of focus group discussions of 12–18 year olds. Pepsi’s Next Generation advertisement was criticised by more media-literate teenage markets for stereotyping teens and misunderstanding who they are. The ad was a montage of very hip skateboarding teens, male teens wearing make-up, perhaps implying that Pepsi is for the next generation which looks thus. The main complaint was ‘we don’t look like that’, the teens saying that they were not all the same as one another. By aligning the brand image with these extreme images, the commercial was less appealing to mainstream teen consumers.

These reasons, either individually or in any combination, explain why certain marketing researchers adopt a particular approach to how they conduct research, analyse data and interpret their findings.

• Preferences and/or experience of the researcher. Some researchers are more oriented and temperamentally suited to do this type of work. Just as some researchers enjoy the challenge of using statistical techniques, there are researchers who enjoy the challenges of qualitative techniques and the interpretation of diverse types of data. Such researchers have been trained in particular disciplines and philosophies that traditionally make use of qualitative research designs and techniques.

• Preferences and/or experience of the research user. Some decision-makers

are more oriented to receiving support in a qualitative manner. This orientation could come from their training but it could also be due to the type of marketing decisions they have to take. Decision-makers working in a creative environment of advertising copy or the development of brand ‘personalities’, for example, may have a greater preference for data that will feed such ‘artistic’ decisions. In the following example, consider how decision-makers would get to understand and represent the language used by teenagers. Consider also the implications for a brand if marketers do not fully understand the language and values of their target markets.

• Sensitive information. Participants may be unwilling to answer or to give truthful answers to certain questions that invade their privacy, embarrass them, or have a negative impact on their ego or status. Questions that relate to sanitary products and contraception are examples of personally sensitive issues. In industrial marketing research, questions that relate to corporate performance and plans are examples of commercially sensitive issues. Techniques that build up an amount of rapport and trust, that allow gentle probing in a manner that suits individual participants, can help researchers get close to participants, and may allow sensitive data to be elicited.

• Subconscious feelings. Participants may be unable to provide accurate answers to questions that tap their subconscious. The values, emotional drives and motivations residing at the subconscious level are disguised from the outer world by rationalisation and other ego defences. For example, a person may have purchased an expensive sports car to overcome feelings of inferiority. But if asked ‘Why did you purchase this sports car?’ that person may say ‘I got a great deal’, ‘My old car was falling apart’, or ‘I need to impress my customers and clients.’ The participants do not have to put words to their deeper emotional drives until researchers approach them! In tapping into those deeper emotional drives, qualitative research can take a path that evolves and is right for the participant.

• Complex phenomena. The nature of what participants are expected to describe may be difficult to capture with structured questions. For example, participants may know what brands of wine they enjoy, what types of music they prefer or what images they regard as being prestigious. They may not be able to clearly explain why they have these feelings or where these feelings are coming from.

• The holistic dimension. The object of taking a holistic outlook in qualitative research is to gain a comprehensive and complete picture of the whole context in which the phenomena of interest occur. It is an attempt to describe and understand as much as possible about the whole situation of interest. Each scene exists within a multi-layered and interrelated context and it may require multiple methods to ensure the researcher covers all angles. This orientation

helps the researcher discover the interrelationships among the various components of the phenomenon under study. In evaluating different forms of consumer behaviour, the researcher seeks to understand the relationship of different contextual environments upon that behaviour. Setting behaviour into context involves placing observations, experiences and interpretations into a larger perspective. An example of this may be of measuring satisfaction with a meal in a restaurant. A questionnaire can break down components of the experience in the restaurant and quantify the extent of satisfaction with these. But what effect did the ‘atmosphere’ have upon the experience? What role did the type of music, the colour and style of furniture, aromas coming from the kitchen, other people in the restaurant, the mood when entering the restaurant, feelings of relaxation or tension as the meal went on, contribute to the feeling of atmosphere? Building up an understanding of the interrelationship of the context of consumption allows the qualitative researcher to build up this holistic view. This can be done through qualitative observation and interviewing.

• Developing new theory. This is perhaps the most contentious reason for conducting qualitative research. Chapter details how causal research design through experiments helps to generate theory. Qualitative researchers may argue that there are severe limitations in conducting experiments upon consumers and that quantitative approaches are limited to elaborating or extending existing theory. The development of ‘new’ theory through a qualitative approach is called ‘grounded theory’, which will be addressed later.

• Interpretation. Qualitative techniques often constitute an important final step in research designs. Large-scale surveys and audits often fail to clarify the underlying reasons for a set of findings. Using qualitative techniques can help to elaborate and explain underlying reasons in quantitative findings.

PHILOSOPHY AND QUALITATIVE RESEARCH

Positivist Perspectives

In Chapter we discussed the vital role that theory plays in marketing research. Researchers rely on theory to determine which variables should be investigated, how variables should be operationalised and measured, and how the research design and sample should be selected. Theory also serves as a foundation on which the researcher can organise and interpret findings. Good marketing research is founded upon theory and contributes to the development of theory to improve the powers of explanation, prediction and understanding in marketing decision-makers.

The dominant perspective of developing new theory in marketing research has been one of empiricism and more specifically positivism. The central belief of a positivist position is a view that the study of consumers and marketing phenomena should be ‘scientific’ in the manner of the natural sciences. Marketing researchers of this persuasion adopt a framework for investigation akin to the natural scientist. For many, this is considered to be both desirable and possible. A fundamental belief shared by positivists is the view that the social and natural worlds ‘conform to certain fixed and unalterable laws in an endless chain of causation’. The main purpose of a scientific approach to marketing research is to establish causal laws that enable the prediction and explanation of marketing phenomena. To establish these laws, a scientific approach must have, as a minimum, reliable information or ‘facts’. The emphasis on facts leads to a focus upon objectivity, rigour and measurement.

As an overall research approach qualitative research does not rely upon measurement or the establishment of ‘facts’ and so does not fit with a positivist perspective. However, if qualitative research is just seen as a series of techniques, they can be used to develop an understanding of the nature of a research problem, and to develop and pilot questionnaires. In other words, the positivist perspective of qualitative research is to see it as a set of techniques, applied as preliminary stages to more rigorous techniques that measure, i.e., surveys and questionnaires. This use of qualitative techniques is fine but may be limiting. To conduct in-depth interviews, focus groups or projective techniques, to understand the language and logic of target questionnaire participants makes good sense. However, using qualitative techniques just to develop quantitative techniques can affect how those techniques are used.

As an illustration, we will examine how focus groups may be conducted. The term ‘focus group discussion’ is commonly used across all continents, yet it subsumes different ways of applying the technique. There are two main schools of thought, which may be termed ‘cognitive’ and ‘conative’.

• Cognitive. American researchers generally follow this tradition, which largely follows a format and interviewing style as used in quantitative studies. ‘American-style groups’ is shorthand in Europe for large groups, a structured procedure and a strong element of external validation. Within the cognitive approach, the analysis or articulation has been worked on before, and so the interviews are largely meant to confirm or expand on known issues.

• Conative. European researchers generally follow this tradition. This style assumes a different starting point, one that emphasises exploration, with analysis taking place during and after the group.
There is less structure to the questions, with group members being encouraged to take their own paths of discussion, make their own connections and let the whole process evolve. Table summarises the differences between the American and European approaches to conducting focus groups. Note the longer duration of the European approach to allow the exploration to develop. To maintain the interest and motivation of participants for this time period, the interview experience must be stimulating and enjoyable.

Table: The two schools of thought about ‘focus group discussions’.

Characteristics                    Cognitive                                   Conative
Purpose                            Demonstration                               Exploration
Sample size                        10-12                                       6-8
Duration                           1.5 hours                                   1.5 to 6 hours
Interviewing                       Logical sequence                            Opportunistic
Questions                          Closed                                      Open
Techniques                         Straight questions, questionnaires,         Probing, facilitation, projectives,
                                   hand shows, counting                        describing
Response required                  Give answers                                Debate issues
Interviewer                        Moderator                                   Researcher
Observer’s role                    To get proof                                To understand
Transcripts                        Rarely necessary                            Usually full
Analysis                           On the spot                                 Time-consuming
Focus of time                      Preplanning                                 Post-fieldwork
Accusations against other style    ‘Formless’                                  ‘Over-controlling’
Suited for                         Testing or proving ideas                    Meaning or understanding
Output                             To be confirmed in quantitative studies     Can be used in its own right to
                                                                               support decision-makers

International marketers have always been aware that qualitative research as it developed in the USA and Europe involves quite different practices, stemming from different premises and yielding different results. American-style qualitative research started from the same evaluative premise as quantitative research but on a smaller scale. This made it cheaper, quicker and useful for checking out the less critical decisions. European-style qualitative research started from the opposite premise to quantitative research: it was developmental, exploratory and creative rather than evaluative. It was used as a tool of understanding, to get underneath consumer motivation.

The American style uses a detailed discussion guide which follows a logical sequence and is usually strictly adhered to. The interviewing technique involves closed questions and straight answers. This type of research is used primarily to inform about behaviour and to confirm hypotheses already derived from other sources. For this reason, clients who have attended groups often feel they do not need any further analysis; the group interaction supplies the answers. Transcripts are rarely necessary and reports are often summarised or even done away with altogether.

The European style is used primarily to gain new insight; it also works from a discussion guide, but in a less structured way. The interviewing technique is opportunistic and probing. Projective techniques are introduced to help researchers understand underlying motivations and attitudes. Because the purpose is ‘understanding’, which requires a creative synthesis of consumer needs and brand benefits, analysis is time consuming and usually involves full transcripts.

In the above descriptions of American and European traditions of applying qualitative techniques, it is clear that the American perspective is positivist, i.e., it aims to deliver a ‘factual’ impression of consumers. The facts may be established, but they may not be enough – they may not provide the richness or depth of understanding that certain marketing decision-makers demand. So, although a positivist perspective has a role to play in developing explanations, predictions and understanding of consumers and marketing phenomena, it has its limitations and critics. The following quote from the eminent qualitative practitioner Peter Cooper cautions us about what we really mean by the term ‘qualitative’:

• There is much qualitative research that still hangs on the positivist model or is little more than investigative journalism. Competition also comes from the media with increasing phone-ins and debates described as ‘research’. We need to be careful about the abuse of what goes under the title ‘qualitative’.

The dominance of positivist philosophy in marketing research has been and is being challenged by other philosophical perspectives, taken and adapted from disciplines such as anthropology and sociology. These perspectives have helped marketing researchers to develop richer explanations and predictions and especially an understanding and a meaning as seen through the eyes of consumers.

Interpretivist Perspectives

In general there are considered to be two main research paradigms that are used by marketing researchers. These are the positivist paradigm and the interpretivist paradigm. The table below presents alternative names that may be used to describe these paradigms.

Table: Alternative paradigm names.

Positivist | Interpretivist
Quantitative | Qualitative
Objectivist | Subjectivist
Scientific | Humanistic
Experimentalist | Phenomenological
Traditionalist | Revolutionist

Whilst it may be easier to think of these as quite clear, distinct and mutually exclusive perspectives of developing valid and useful marketing knowledge, the reality is somewhat different. There is a huge array of versions of these paradigms, presented by philosophers, researchers and users of research findings. These versions change depending upon the assumptions of researchers and the context and subjects of their study, i.e., the ultimate nature of the research problem. It has long been argued that both positivist and interpretivist paradigms are valid in conducting marketing research and help to shape the nature of techniques that researchers apply. In order to develop an understanding of what an interpretivist paradigm means, the next table presents characteristic features of the two paradigms.

Table: Paradigm features.

Issue | Positivist | Interpretivist
Reality | Objective and singular | Subjective and multiple
Researcher-participant | Independent of each other | Interacting with each other
Values | Value free = unbiased | Value laden = biased
Researcher language | Formal and impersonal | Informal and personal
Theory and research design | Simple determinism; cause and effect; static research design; context free; laboratory; prediction and control; reliability and validity; representative surveys; experimental design; deductive | Freedom of will; multiple influences; evolving design; context bound; field ethnography; understanding and insight; perceptive decision making; theoretical sampling; case studies; inductive

Comparison of Positivist and Interpretivist Perspectives

The paradigms can be compared through a series of issues. The descriptions of these issues do not imply that any particular paradigm is stronger than the other. In each issue there are relative advantages and disadvantages specific to any research question under investigation. The issues are dealt with in the following subsections.

Reality

The positivist supposes that reality is ‘out there’ to be captured. It thus becomes a matter of finding the most effective and objective means possible to draw together information about this reality. The interpretivist stresses the dynamic, participant-constructed and evolving nature of reality, recognising that there may be a wide array of interpretations of realities or social acts.

Researcher–participant

The positivist sees the participant as an ‘object’ to be measured in a consistent manner. The interpretivist may see participants as ‘peers’ or even ‘companions’, seeking the right context and means of observing and questioning to suit individual participants. Such a view of participants requires the development of rapport, a degree of interaction and an evolution of method as the researcher learns of the best means to elicit information.

Values

The positivist seeks to set aside his or her own personal values. The positivist’s measurements of participants are guided by established theoretical propositions. The task for the positivist is to remove any potential bias. The interpretivist recognises that his or her own values affect how he or she questions, probes and interprets. The task for interpretivists is to realise the nature of their values and how these affect how they question and interpret.

Researcher Language

In seeking a consistent and unbiased means to measure, the positivist uses a language in questioning that is uniformly recognised. This uniformity may emerge from existing theory or from the positivist’s vision of what may be relevant to the target group of participants. Ultimately, the positivist imposes a language and logic upon target participants in a consistent manner. The interpretivist seeks to draw out the language and logic of target participants. The language used may differ between participants and develop in different ways as the interpretivist learns more about a topic and the nature of participants.

Theory and Research Design

In the development of theory, the positivist seeks to establish causality through experimental methods. Seeking causality helps the positivist to explain phenomena and hopefully predict the recurrence of what has been observed in other contexts. There are many extraneous variables that may confound the outcome of experiments, hence the positivist will seek to control these variables and the environment in which an experiment takes place. The ultimate control in an experiment takes place in a laboratory situation. In establishing causality through experiments, questions of causality usually go hand in hand with questions of determinism, i.e., if everything that happens has a cause, then we live in a determinist universe. The positivist will go to great pains to diagnose the nature of a research problem and establish an explicit and set research design to investigate the problem. A fundamental element of the positivist’s research design is the desire to generalise findings to a target population. Most targeted populations are so large that measurements of them can only be managed through representative sample surveys. Positivists use theory to develop the consistent and unbiased measurements they seek. They have established rules and tests of the reliability and validity of their measurements and continually seek to develop more reliable and valid measurements.

In the development of theory, the interpretivist seeks to understand the nature of multiple influences on marketing phenomena through case studies. The search for multiple influences means focusing upon the intrinsic details of individual cases and the differences between different classes of case. This helps the interpretivist to describe phenomena and hopefully gain new and creative insights to understand, ultimately, the nature of consumer behaviour in its fullest sense. The consumers that interpretivists focus upon live, consume and relate to products and services in a huge array of contexts, hence the interpretivist will seek to understand the nature and effect of these contexts on the chosen cases. The contexts in which consumers live and consume constitute the field in which interpretivists immerse themselves to conduct their investigations.

In understanding the nature and effect of context upon consumers, the interpretivist does not consider that everything that happens has a cause and that we live in a determinist universe. There is a recognition and respect for the notion of free will. The interpretivist will go to great pains to learn from each step of the research process and adapt the research design as his or her learning develops. The interpretivist seeks to diagnose the nature of a research problem but recognises that a set research design may be restrictive and so usually adopts an evolving research design. A fundamental element of the interpretivist’s research design is the desire to generalise findings to different contexts, such as other types of consumer. However, rather than seeking to study large samples to generalise to target populations, the interpretivist uses theoretical sampling. This means that the data gathering process for interpretivists is driven by concepts derived from evolving theory, based on the notion of seeking out different situations and learning from the comparisons that can be made. The purpose is to go to places, people or events that will maximise opportunities to discover variations among concepts.
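To make the contrast between representative sampling and theoretical sampling more concrete, the following Python sketch is offered as a purely illustrative, hypothetical example: the sampling frame, the ‘brand attachment’ scores and the selection rule are invented for the illustration and are not drawn from any study discussed in this chapter.

import random

# Hypothetical sampling frame of 10,000 consumer identifiers.
frame = list(range(10_000))

# Positivist-style probability sampling: every member of the frame has an
# equal, known chance of selection, which supports generalisation to the
# target population.
random.seed(42)
probability_sample = random.sample(frame, k=400)

# Interpretivist-style theoretical sampling (sketch only): cases are chosen
# because they promise maximum variation on a concept emerging from the
# analysis, represented here by an invented 'brand attachment' score.
emerging_concept = {case_id: random.random() for case_id in frame}

def theoretical_sample(scores, k=12):
    """Pick the most contrasting cases on the emerging concept."""
    ranked = sorted(scores, key=scores.get)
    half = k // 2
    return ranked[:half] + ranked[-half:]

contrast_cases = theoretical_sample(emerging_concept)
print(len(probability_sample), len(contrast_cases))

The point of the sketch is the difference in logic rather than the code itself: the first sample is defined by a known selection probability, the second by the analyst’s evolving theoretical interest.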
Interpretivists use theory initially to help guide which cases they should focus upon, the issues they should observe and the context of their investigation. As their research design evolves they seek to develop new theory and do not wish to be ‘blinkered’ or too focused on existing ideas. They seek multiple explanations of the phenomena they observe and create what they see as the most valid relationship of concepts and, ultimately, theory. Interpretivists continually seek to evaluate the worth of the theories they develop, and the strongest means of evaluating interpretivist theory lies in the results of decision making that is based on it. A principal output of research generated by an interpretivist perspective should therefore be findings that are accessible and intended for use. If they are found to be meaningful by decision-makers and employed successfully by them, this may constitute further evidence of the theory’s validity. If employed and found lacking, questions will have to be asked of the theory, about its comprehensibility and comprehensiveness and about its interpretation. If it is not used, the theory may be loaded with validity but have little value.

Summarising the Broad Perspectives of Positivism and Interpretivism

The positivist seeks to establish the legitimacy of his or her approach through deduction. In a deductive approach, the following process unfolds:
• An area of enquiry is identified, set in the context of well-developed theory, which is seen as vital to guide researchers, ensuring that they are not naive in their approach and do not ‘reinvent the wheel’.
• The issues to focus an enquiry upon emerge from the established theoretical framework.
• Specific variables are identified that the researchers deem should be measured, i.e., hypotheses are set.
• An ‘instrument’ to measure specific variables is developed.
• Participants give answers to set and specific questions with a consistent language and logic.

• The responses to the set questions are analysed in terms of the previously established theoretical framework.
• The researchers test theory according to whether their hypotheses are accepted or rejected. From testing theory in a new context, they seek to develop existing theory incrementally.
Such a process means that positivists reach conclusions based upon agreed and measurable ‘facts’. The building and establishment of ‘facts’ forms the premises of deductive arguments. Deductive reasoning starts from general principles from which the deduction is to be made, and proceeds to a conclusion by way of some statement linking the particular case in question to those principles. A deductive approach has a well-established role for existing theory: it informs the development of hypotheses, the choice of variables and the resultant measures.
Whereas the deductive approach starts with theory expressed in the form of hypotheses, which are then tested, an inductive approach avoids this, arguing that it may prematurely close off possible areas of enquiry. The interpretivist seeks to establish the legitimacy of his or her approach through induction. In an inductive approach, the following process unfolds:
• An area of enquiry is identified, but with little or no theoretical framework. Theoretical frameworks are seen as restrictive, narrowing the researcher’s perspective, and an inhibitor to creativity.
• The issues to focus an enquiry upon are either observed or elicited from participants in particular contexts.
• Participants are aided to explain the nature of issues in a particular context.
• Broad themes are identified for discussion, with observation, probing and in-depth questioning to elaborate the nature of these themes.
• The researchers develop their theory by searching for the occurrence and interconnection of phenomena. They seek to develop a model based upon their observed combination of events.
Such a process means that interpretivists reach conclusions without ‘complete evidence’. With the intense scrutiny of individuals in specific contexts that typifies an interpretivist approach, tackling large ‘representative’ samples is generally impossible. Thus, the validity of the interpretivist approach is based upon ‘fair samples’. Interpretivists should not seek only to reinforce their own prejudice or bias, seizing upon issues that are agreeable to them and ignoring those that are inconvenient. If they are to argue reasonably they should counteract this tendency by searching for conflicting evidence. Their resultant theory should be subject to constant review and revision.
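As a small, hedged illustration of the deductive sequence just described, the following Python sketch runs a conventional two-sample t-test; the recall scores, the two treatments and the 0.05 significance level are invented for the example and simply stand in for whatever instrument and analysis a particular study would have specified in advance.

from scipy import stats

# Invented illustration: recall scores for two advertising treatments.
# In a deductive design these data would come from the measurement
# instrument specified by the hypotheses before fieldwork began.
treatment_a = [6.1, 5.8, 7.0, 6.4, 5.9, 6.6, 6.2, 6.8]
treatment_b = [5.2, 5.6, 5.1, 5.9, 5.4, 5.3, 5.7, 5.0]

# H0: the treatments produce equal mean recall; H1: they do not.
t_stat, p_value = stats.ttest_ind(treatment_a, treatment_b)

alpha = 0.05
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"t = {t_stat:.2f}, p = {p_value:.4f} -> {decision}")

The sketch shows only the logic of the decision rule: the hypothesis is stated before the data are examined, and the data are then used to accept or reject it.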

ETHNOGRAPHIC RESEARCH

It is clear that an interpretive approach does not set out to test hypotheses but to explore the nature and interrelationships of marketing phenomena. The focus of investigation is a detailed examination of a small number of cases rather than a large sample. The data collected are analysed through an explicit interpretation of the meanings and functions of consumer actions. The product of these analyses takes the form of verbal descriptions and explanations, with quantification and statistical analysis playing a subordinate role. These characteristics are the hallmark of a research approach that has developed and been applied to marketing problems over many years in European marketing research. This research approach is one of ethnographic research.

Ethnography as a general term includes observation and interviewing and is sometimes referred to as participant observation. It is, however, used in the more specific case of a method which requires a researcher to spend a large amount of time observing a particular group of people, by sharing their way of life. Ethnography is the art and science of describing a group or culture. The description may be of a small tribal group in an exotic land or a classroom in middle-class suburbia. The task is much like the one taken on by the investigative reporter, who interviews relevant people, reviews records, weighs the credibility of one person’s opinions against another’s, looks for ties to special interests and organisations and writes the story for a concerned public and for professional colleagues. A key difference between the investigative reporter and the ethnographer, however, is that whereas the journalist seeks out the unusual, the murder, the plane crash, or the bank robbery, the ethnographer writes about the routine daily lives of people. The more predictable patterns of human thought and behaviour are the focus of enquiry.

The origins of ethnography are in the work of nineteenth-century anthropologists who travelled to observe different pre-industrial cultures. An example in a more contemporary context could be the study of death rituals in Borneo, conducted over two years by the anthropologist Peter Metcalf. Today, ‘ethnography’ encompasses a much broader range of work, from studies of groups in one’s own culture, through experimental writing, to political interventions. Moreover, ethnographers today do not always ‘observe’, at least not directly. They may work with cultural artefacts such as written texts, or study recordings of interactions they did not observe at first hand, or even, as in the following example, the observations of a refrigerator.
• Example – The Electrolux ‘screen fridge’: Electrolux and Ericsson joined forces to test new products using ethnographic methods. They provided a group of Swedish consumers with ‘screen fridges’ which download recipes from the Internet, store shopping lists and have a built-in video camera to record messages. By putting the fridges in people’s homes Electrolux could see how the technology was actually used and find out whether participants would be prepared to pay for it. This type of research is relatively expensive, but it gives in-depth information that could not be generated from a focus group or one-to-one interview.

Before developing an understanding of ethnography in marketing research, it is worth summarising the aims of ethnographic research:
• Seeing through the eyes of others. Viewing events, actions, norms and values from the perspective of the people being studied.
• Description. Attending to mundane detail to help understand what is going on in a particular context and to provide clues and pointers to other layers of reality.
• Contextualism. The basic message that ethnographers convey is that whatever the sphere in which the data are being collected, we can understand events only when they are situated in the wider social and historical context.
• Process. Viewing social life as involving an interlocking series of events.
• Avoiding early use of theories and concepts. Rejecting premature attempts to impose theories and concepts which may exhibit a poor fit with participants’ perspectives. This will be developed further in this chapter when we examine grounded theory.
• Flexible research designs. Ethnographers’ adherence to viewing social phenomena through the eyes of their subjects has led to a wariness regarding the imposition of prior and possibly inappropriate frames of reference on the people they study. This leads to a preference for an open and unstructured research design which increases the possibility of coming across unexpected issues.
This final point is illustrated in the following example, which can be explored in more depth by viewing the reviewed film.
• Example – Kitchen Stories: Kitchen Stories was an unlikely contender for one of the best movies of 2004. Kitchen Stories, a Norwegian/Swedish co-production, is about an ethnographic study in the 1950s into the kitchen routines of single Norwegian men. Beneath the light humour, the film seeks to make a more serious point about whether the rules governing research stand in the way of reaching a true understanding of people. The film is based upon the real-life story of a team of Swedish researchers who set out to design the perfect kitchen. The researcher, Folke, finds himself partnered with the participant from hell: the doddering, grumpy recluse, Isak. Assuming his position on a shaky high chair stuck in the corner of the kitchen, it quickly dawns on Folke that Isak is determined to make his stay as difficult as possible. Both men are under strict instructions not to communicate and in these circumstances, much humour is wrought from Isak’s petty behaviour such as leaving a tap dripping in an attempt to infuriate the well-mannered researcher. Folke’s flouting of the rules marks him out as a less than perfect researcher. The film does not judge Folke but the method he has to work with. The objective nature of the research

translates on screen as cold, dispassionate voyeurism. Ethnography in Kitchen Stories is not the science of observing but a clumsy, foolish exercise that imposes rules on both men. It is only when Folke climbs down from his ‘pedestal’ that he truly begins to understand Isak. Both the formal impartiality adopted by Folke and Isak’s bizarre behaviour cease, and the two are no longer ‘researcher’ and ‘participant’ – simply friends.

The use of ethnographic approaches has rapidly developed in marketing research. Decision-makers are finding great support from the process and findings, as the following example illustrates.
• Example – Demonstrating the value of air time: Capital Radio has been looking at the way the radio affects people’s lives. It wanted to show advertisers how listeners relate to the brand and to demonstrate the value of air time. It teamed up with the Henley Centre for a project called Modal Targeting, which highlighted different modes that listeners go through during the day. The intention was to establish when they were most susceptible to certain advertisements. On six occasions it sent researchers to observe listeners for three days at a time. The researchers did not stay overnight, and they did not tell the people what they were looking for because that might have affected the way they behaved. The skill is that they blend into the background and almost become part of the fixtures and fittings. The study showed that there can be a greater difference between how one consumer feels at the start of the day and the end of the day than how two consumers feel at the same time of the day. Advertisers will now be able to be more effective as they know what ‘mode’ listeners are in. They should not be aiming to sell financial products when people are rushing for a train in the morning.

Ethnography cannot reasonably be classified as just another single method or technique. In essence, it is a research discipline based upon culture as an organising concept and a mix of both observational and interviewing tactics to record behavioural dynamics. Above all, ethnography relies upon entering participants’ natural life worlds – at home, while shopping, at leisure and in the workplace. The researcher essentially becomes a naive visitor in that world by engaging participants during realistic product usage situations in the course of daily life. Whether called on-site, observational, naturalistic or contextual research, ethnographic methods allow marketers to delve into actual situations in which products are used, services are received and benefits are conferred. Ethnography takes place not in laboratories but in the real world. Consequently, clients and practitioners benefit from a more holistic and better nuanced view of consumer satisfactions, frustrations and limitations than in any other research method. A growing trend is for marketers to apply ethnographic methods in natural retail or other commercial environments. There are several objectives that lie behind these studies, one of which is orientated towards a detailed ecological analysis of sales behaviour. In other words, all of the elements that comprise retail store environments – lighting, smells, signage, display of goods, the location, size and orientation of shelving – have an impact upon the consumers’ experience and their ultimate buying behaviour. The ethnographer’s role is to decode the meaning and impact of these ecological elements.
Often, these studies utilise time-lapse photography as a tool for behavioural observation and data collection over extensive periods of time and avoid actual interaction with consumers, as illustrated in the following example.
• Example – Top of the Pops: The point of purchase is a manufacturer’s last opportunity to have an effect on the customers’ decisions. Awareness of the crucial role of in-store influences is growing, and several POP companies have started offering detailed research on how customers react at the point of sale. To achieve this awareness, Electronic Surveillance of Behaviour (ESOB) gives a detailed understanding of how consumers behave in a shop. Kevin Price, Managing Director of Coutts Design, has formed a partnership with The In-Store Audit to utilise ESOB. Says Kevin:
a. With ESOB, shoppers are tracked remotely on video around the store and their movements and actions are followed. Because this technique is fairly unobtrusive, we are able to capture natural shopper behaviour as people are not being followed around by a researcher.
He goes on to explain the complexity of the computer software:
b. The cameras are specially modified and they record a large sample size. They can measure consumer behaviour from entry to exit, following customers around the store and noting the items they touch and the visual cues that they give and get. The cameras may operate for between 10 and 14 days. The information is then analysed and the key clips from the video are used to reinforce the key points that have emerged from the analysis.
One of the key elements of the above example is the context in which the consumer is behaving. The researcher observes shoppers, taking in and reacting to their retail experience, behaving naturally in the set context. The context of shoppers does not just mean the retail outlet they visit. The processes of choosing and buying products, of using products or giving them as gifts, of reflecting upon and planning subsequent purchases, are all affected by contextual factors. Context operates on several levels, including the immediate physical and situational surroundings of consumers, as well as language, character, culture and history. Each of these levels can provide a basis for the meaning and significance attached to the roles and behaviour of consumption. ‘Can we divorce the ways we buy, use and talk about products from the cultural and linguistic context within which economic transactions occur?

The answer is an emphatic no.’ The ethnographer may observe the consumer acting and reacting in the context of consumption. The ethnographer may see a shopper spending time reading the labels on cat food, showing different brands to his or her partner, engaged in deep conversation, pondering, getting frustrated and putting tins back on the shelf. The ethnographer may see the same shopper more purposefully putting an expensive bottle of cognac into a shopping trolley without any discussion and seemingly with no emotional attachment to the product. The ethnographer may want to know what is going on. How may the consumer explain his or her attitudes and motivations behind this behaviour? This is where the interplay of observation and interviewing helps to build such a rich picture of consumers. In questioning the shopper in the above example, responses of ‘we think that Rémy Martin is the best’ or ‘we always argue about which are the prettiest cat food labels’ would not be enough. The stories and contexts of how these assertions came to be would be explored. The ethnographer does not tend to accept simple explanations for activities that in many circumstances may be habitual to consumers.

Ethnographic practice takes a highly critical attitude towards expressed language. It refuses to take accepted words and utterances at face value, searching instead for the meanings and values that lie beneath the surface. In interviewing situations, typically this involves looking for gaps between expressed and non-verbal communication elements. For example, if actual practices and facial and physical gestures are inconsistent with a subject’s expressed attitudes towards the expensive cognac, we are challenged to discover both the reality behind the given answer and the reasons for the ‘deception’. Ethnographic research is also effective as a tool for learning situationally and culturally grounded language, the appropriate words for everyday things as spoken by various age or ethnic groups. Copywriters and strategic thinkers are always pressed to talk about products and brands in evocative and original ways. Ethnography helps act as both a discovery and an evaluation tool.

Ethnographic methods are particularly appropriate when marketing research objectives call for:
• High-intensity situations. To study high-intensity situations, such as a sales encounter, meal preparation and service, or communication between persons holding different levels of authority.
• Behavioural processes. To conduct precise analyses of behavioural processes, e.g., radio listening behaviour, home computer purchasing decisions or home cleaning behaviour (a simple sketch of how such observational records might be summarised follows this list).
• Memory inadequate. To address situations where the participant’s memory or reflection would not be adequate. Observational methods can stand alone or can complement interviewing as a memory jog.
• Shame or reluctance. To work with participants who are likely to be ashamed or reluctant to reveal actual practices to a group of peers. If they were diabetic, for example, participants may be reluctant to reveal that they have a refrigerator full of sweet snacks, something that an ethnographic observer would be able to see without confronting the subject.
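The following Python sketch is a hypothetical illustration of one routine step in analysing such behavioural records: the zone names and durations are invented, and the aggregation shown is only an assumed, simplified version of the kind of output a tracking study such as the ESOB example above might work from.

from collections import defaultdict

# Hypothetical observation log for one tracked shopping trip: each record
# is (store zone, seconds spent there). All values are invented.
observations = [
    ("entrance", 15), ("pet food", 140), ("pet food", 95),
    ("spirits", 40), ("checkout", 180), ("spirits", 25),
]

# Aggregate dwell time per zone, a typical first summary before the analyst
# returns to the video to pull out the key clips for reporting.
dwell = defaultdict(int)
for zone, seconds in observations:
    dwell[zone] += seconds

for zone, total in sorted(dwell.items(), key=lambda item: -item[1]):
    print(f"{zone:10s} {total:4d} s")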

In these applications, the ethnographer is expected to analyse critically the situations observed. The critique or analysis can be guided by theory, but in essence the researcher develops a curiosity, thinks in an abstract manner and at times steps back to reflect and see how emerging ideas connect. By reacting to events and participants as they encounter them, and drawing out what participants see as important, ethnographers are able to create new explanations and understandings of consumers. This ability to develop a new vision, to a large extent unrestricted by existing theory, is the essence of a grounded theory approach, which is explained and illustrated in the next section.

GROUNDED THEORY

The tradition of grounded theory was developed by Glaser and Strauss in the late 1950s and published in their seminal work in 1967. At that time, qualitative research was viewed more as impressionistic or anecdotal, little more than ‘soft science’ or journalism. It was generally believed that the objective of sociology should be to produce scientific theory, and to test this meant using quantitative methods. Qualitative research was seen to have a place, but only to the extent to which it developed questions which could then be verified using quantitative techniques. Glaser and Strauss accepted that the study of people should be scientific, in the way understood by quantitative researchers. This meant that it should seek to produce theoretical propositions that were testable and verifiable, produced by a clear set of replicable procedures. Glaser and Strauss defined theory as follows:
• Theory in sociology is a strategy for handling data in research, providing modes of conceptualization for describing and explaining. The theory should provide clear enough categories and hypotheses so that crucial ones can be verified in present and future research; they must be clear enough to be readily operationalized in quantitative studies when these are appropriate.
The focus upon developing theory was made explicit in response to criticisms of ethnographic studies that present lengthy extracts from interviews or field observations. Strauss sought to reinforce his view of the importance of theory, illustrated by the following quote:
• Much that passes for analysis is relatively low level description. Many quite esteemed and excellent monographs use a great deal of data, quotes or field note selections. The procedure is very useful when the behaviour being studied is relatively foreign to the experiences of most readers or when the factual assertions being made would be under considerable contest by skeptical and otherwise relatively well-informed readers. Most of these monographs are descriptively dense, but alas theoretically thin. If you look at their indexes, there are almost no new concepts listed, ones that have emerged in the course of research.
In contrast to the perhaps casual manner in which some ethnographers may be criticised for attempts at developing theory, the grounded theorist follows a set of systematic procedures for collecting and analysing data. This systematic procedure is used to encourage researchers to use their intellectual imagination and creativity to develop new theories, to suggest methods for doing so, to offer criteria to evaluate the worth of discovered theory, and to propose an alternative rhetoric of justification. The most distinctive feature of grounded theory is its commitment to ‘discovery’ through direct contact with the social phenomena under study, coupled with a rejection of a priori theorising. This feature does not mean that researchers should embark on their studies without any general guidance provided by some sort of theoretical understanding. It would be nigh on impossible for a researcher to shut out the ideas in the literature surrounding a particular subject. However, Glaser and Strauss argued that preconceived theories should be rejected as they obstruct the development of new theories by coming between researchers and the subjects of their study.
In other words, the strict adherence to developing new theory built upon an analytical framework of existing theory can result in ‘narrow-minded’ researchers who do not explore a much wider range of explanations and possibilities. With the rejection of a priori theorising and a commitment to imaginative and creative discovery comes a conception of knowledge as emergent. This knowledge is created by researchers in the context of investigative practices that afford them intimate contact with the subjects and phenomena under study. The following example illustrates the use of grounded theory in the development of theory related to the marketing of health visitors. The example is then followed by a description of the process involved in developing theory through a grounded theory process.
• Example – Grounded theory and the marketing of health visitors: Grounded theory was used to conduct a study into the effectiveness of marketing related to health visiting. The collection and analysis of data were conducted simultaneously as the study worked through interviews and observations, all guided through theoretical sampling. There was a line-by-line analysis of interview transcripts with concepts drawn from these to describe events. Categories based on attitudes, behaviour and the characteristics of informants were built through constantly comparing the data in the transcripts and the emerging concepts. The theory that emerged from this process was further guided by a contextual knowledge of the conditions under which any interactions took place, the nature of the interactions between the informants and health visitors, and the consequences of the actions and interactions undertaken by the informants. In addition, memos gathered throughout the process played a crucial part in developing the theory. The findings and emergent theory centred around the attitudes of the health visitors’ clients. The theory identified tactics which could enhance ‘selling’ in health visiting, strategies for gaining new clientele and methods for influencing behaviour in order to encourage the prevention

of illness. The study addressed the nature of the ‘process’ of health visiting, and the need to promote the service through personal presentations and advertising. Further practical implications emerged from the study, which identified tactics for raising awareness by encouraging clients to recognise problems early.
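Line-by-line coding and constant comparison of the kind described in this example are analytical rather than computational tasks, but very simple tooling can support them. The following Python sketch is a hypothetical illustration only: the open codes and the coded transcripts are invented, and the tallies it produces are merely one crude input to the comparisons a grounded theorist would make when grouping codes into categories.

from collections import Counter
from itertools import combinations

# Invented open codes assigned line by line to three interview transcripts.
coded_interviews = [
    ["trust", "early contact", "trust", "stigma"],
    ["stigma", "advice seeking", "trust"],
    ["early contact", "advice seeking", "trust", "stigma"],
]

# Frequency of each code across the data set.
code_counts = Counter(code for interview in coded_interviews for code in interview)

# Co-occurrence of code pairs within the same interview, one possible aid
# to constant comparison when relating codes to emerging categories.
pair_counts = Counter()
for interview in coded_interviews:
    for pair in combinations(sorted(set(interview)), 2):
        pair_counts[pair] += 1

print(code_counts.most_common())
print(pair_counts.most_common(3))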

Attempting to Gain an Objective Viewpoint

For the grounded theorist, data collection and analysis occur in alternating sequences. Analysis begins with the first interview and observation, which leads to the next interview or observation, followed by more analysis, more interviews or fieldwork, and so on. It is the analysis that drives the data collection. Therefore there is a constant interplay between the researcher and the research act. Because this interplay requires immersion in the data, by the end of the enquiry the researcher is shaped by the data, just as the data are shaped by the researcher. The problem that arises during this mutual shaping process is how one can become immersed in the data and still maintain a balance between objectivity and sensitivity. Objectivity is necessary to arrive at an impartial and accurate interpretation of events. Sensitivity is required to perceive the subtle nuances and meanings of data and to recognise the connections between concepts. Both objectivity and sensitivity are necessary for making discoveries. Objectivity enables the researcher to have confidence that the findings are a reasonable, impartial representation of a problem under investigation, whereas sensitivity enables creativity and the discovery of new theory from data.

During the analytic process, grounded researchers attempt to set aside their knowledge and experience to form new interpretations about phenomena. Yet, in their everyday lives, they rely on knowledge and experience to provide the means for helping them to understand the world in which they live and to find solutions to problems encountered. Most researchers have learned that a state of complete objectivity is impossible and that in every piece of research, quantitative or qualitative, there is an element of subjectivity. What is important is to recognise that subjectivity is an issue and that researchers should take appropriate measures to minimise its intrusion into their investigations and analyses. In qualitative research, objectivity does not mean controlling the variables. Rather it means an openness, a willingness to listen and to ‘give voice’ to participants, be they individuals or organisations. Though it may seem odd, listening well is a quality that not every researcher possesses. The following example illustrates the challenges of ‘listening’.
• Example – Listening is not the same as researching: According to the Coca-Cola Retail Research Group study, the German discount supermarket Aldi is now the strongest retail brand in Europe. Dieter Brandes was an architect of Aldi’s success, and contends that in reaching this position, Aldi never

had a ‘grand’ strategy. ‘We just groped our way forward. It was a dynamic process driven by intuition, incremental adjustments and decisions, whose consequences were not always foreseeable.’ This approach is one of being endlessly curious and being confident to take an experimental approach with much thinking and reflection. The essence of this curiosity is listening. Listening is not the same as researching. Research pursues a pre-identified agenda, that of the researcher. Listening gives other people’s agendas top priority. Because it can pick up looming dangers and new opportunities, listening lies at the heart of effective innovation. But it is very hard to do. Thus, good qualitative research means hearing what others have to say, seeing what others do and representing these as accurately as possible. It means developing an understanding of those they are researching, whilst recognising that researchers’ understandings are often based on the values, culture, training and experiences that they bring from all aspects of their lives; these can be quite different from those of their participants. As well as being open to participants, qualitative researchers reflect upon what makes them, as observers, ‘see’ and ‘listen’ in particular ways. This usually means that, while working on a particular project, the researcher keeps a diary or journal. This diary is used to make notes about the conditions of interviews and observations, of what worked well and what did not, of what questions the researcher would have liked to ask but did not think of at the time. As the researcher reads through the diary in the analysis process, the entries become part of the narrative explored, and reveal to the researcher and to others the way the researcher has developed their ‘seeing’ and ‘listening’. Research diaries will be covered in more detail in examining qualitative data analysis in Chapter.

Developing a Sensitivity to the Meanings in Data

Having sensitivity means having insight into, and being able to give meaning to, the events and happenings in data. It means being able to see beneath the obvious to discover the new. This quality of the researcher develops as they work with data, making comparisons, asking questions, and going out and collecting more data. Through these alternating processes of data collection and analysis, meanings that are often elusive at first later become clearer. Immersion in the data leads to those sudden insights. Insights do not just occur haphazardly; rather, they happen to prepared minds during interplay with the data. Whether we want to admit it or not, we cannot completely divorce ourselves from who we are and what we know. The theories that we carry around in our heads inform our research in multiple ways, even if we use them quite unselfconsciously. Ultimately, a grounded theory approach is expected to generate findings that are meaningful to decision-makers, and appropriate to the tasks they face. As with other interpretivist forms of research, if it is found meaningful by decision-makers and employed successfully by them, there is further evidence of the theory’s validity. Another qualitative approach that is absolutely meaningful to decision-makers, in that its primary focus is to deliver actionable results, is called action research.

ACTION RESEARCH

Background

The social psychologist Kurt Lewin had a central interest in social change, and specifically in questions of how to conceptualise and promote social change. Lewin is generally thought to be the person who coined the term action research and gave it the meanings that are applicable today. In action research, Lewin envisaged a process whereby one could construct a social experiment with the aim of achieving a certain goal. For example, in the early days of the Second World War, Lewin conducted a study, commissioned by the US authorities, on the use of tripe as part of the regular daily diet of American families. The research question was: ‘To what extent could American housewives be encouraged to use tripe rather than beef for family dinners?’ Beef was scarce and was destined primarily for the troops. Lewin’s approach to this research was to conduct a study in which he trained a limited number of housewives in the art of cooking tripe for dinner. He then surveyed how this training affected their daily cooking habits in their own families. In this case, action research was synonymous with a ‘natural experiment’, meaning that the researchers in a real-life context invited participants into an experimental activity. This research approach was very much within the bounds of conventional applied social science with its patterns of authoritarian control, but it was aimed at producing a specific, desired social outcome.

The above example can be clearly seen from a marketing perspective. It is easy to see a sample survey measuring attitudes to beef, to tripe, to feeding the family and to feelings of patriotism. From a survey, one can imagine advertisements extolling the virtues of tripe, how tasty and versatile it is. But would the campaign work? Lewin’s approach was not just to understand the housewives’ attitudes but to engage them in the investigation and the solution – to change attitudes and behaviour.

Lewin is credited with coining a couple of important slogans within action research that hold resonance with the many action researchers who practise today. The first is ‘nothing is as practical as a good theory’ and the second is ‘the best way to try to understand something is to change it’. In action research it is believed that the way to ‘prove’ a theory is to show how it provides an in-depth and thorough understanding of social structures, understanding gained through planned attempts to invoke change in particular directions. The changes achieved are themselves the proof. Lewin’s work was a fundamental building block of what today is called action research. He set the stage for knowledge production based on solving real-life problems. From the outset, he created a new role for researchers and redefined criteria for judging the quality of the enquiry process. Lewin shifted the researcher’s role from being a distant observer to involvement in concrete problem solving. The quality criteria he developed for judging a theory to be good focused on its ability to support practical problem solving in real-life situations.

From Lewin’s work has developed a rich and thriving group of researchers who have developed and applied his ideas throughout the world. In management research, the study of organisational change with the understanding and empowerment of different managers and workers has utilised action research to great effect. There has been little application of action research in marketing research, though that is changing.
Marketing researchers and marketing decision-makers alike are learning of the nature of action research, the means of implementing it and the benefits it can bestow.

Approach

The term ‘action research’ includes a whole range of approaches and practices, each grounded in different traditions, in different philosophical and psychological assumptions, sometimes pursuing different political commitments. Sometimes it is used to describe a positivist approach in a ‘field’ context, or where there is a trade-off between the theoretical interests of researchers and the practical interests of organisation members. Sometimes it is used to describe relatively uncritical consultancy based on information gathering and feedback. It is beyond the scope of this text to develop these different traditions, so the following describes an approach that is grounded in Lewin’s foundations and, like his work, is applicable to marketing.

Action research is a team research process, facilitated by one or more professional researchers linking with decision-makers and other stakeholders who together wish to change or improve particular situations. Together, the researcher and decision-makers or stakeholders define the problems to be examined, generate relevant knowledge about the problems, learn and execute research techniques, take actions, and interpret the results of actions based on what they have learned. There are many iterations of problem definition, generating knowledge, taking action and learning from those actions. The whole process of iteration evolves in a direction that is agreed by the team. Action researchers accept no a priori limits on the kinds of research techniques they use. Surveys, statistical analyses, interviews, focus groups, ethnographies and life histories are all acceptable, if the reason for deploying them has been agreed by the action research collaborators and if they are used in a way that does not oppress the participants.

Action research is composed of a balance of three elements. If any one of the three is absent, then the process is not action research.
• Research. Research based on any quantitative or qualitative techniques, or a combination of them, generates data; in the analysis and interpretation of the data there is shared knowledge.
• Participation. Action research involves trained researchers who serve as facilitators and ‘teachers’ to team members. As these individuals set their

action research agenda, they generate the knowledge necessary to transform the situation and put the results to work. Action research is a participatory process in which everyone involved takes some responsibility.
• Action. Action research aims to alter the initial situation of the organisation in the direction of a more self-managed and more rewarding state for all parties.
An example of an action research team in marketing terms could include:
• Marketing researchers: trained in a variety of qualitative and quantitative research techniques, and with experience of diagnosing marketing and research problems.
• Strategic marketing managers: decision-makers who work at a strategic level in the organisation and have worked with researchers, as well as those who have no experience of negotiating with researchers.
• Operational marketing managers: decision-makers who have to implement marketing activities. These may be the individuals who meet customers on a day-to-day basis and who really feel the impact and success of marketing ideas.
• Advertising agency representatives: agents who have worked with strategic decision-makers. They may have been involved in the development of communications campaigns to generate responses from target groups of consumers.
• Customers: existing customers who may be loyal and have had many years of experience of the company, its products and perhaps even its personnel.
• Target customers: potential customers who may be brand switchers or even loyal customers of competitive companies.

Fig. The action research approach.

The figure illustrates how action research may be applied. This model of action research is taken from the subject area of the management of change, which is relevant to many of the problems faced by marketing decision-makers. The process aims to create a learning community in a team such as that above. The team develops an understanding of issues to the extent that it makes sound judgements and takes effective action to implement the changes it wishes to make. The process in the figure can be described as follows.
• Diagnosis. The present state of affairs would be set out, including the perceived barriers to change and an initial broad statement of desired direction for the organisation. Diagnosis would include documenting the change process and all data gathering activities such as secondary data gathering, surveys, interviews or observations.
• Analysis. An initial interpretation of data gathered would be made. From this the issues to be tackled would be identified. Summary findings and the development of a framework, with set tasks for team members in subsequent data gathering, would be drawn up.
• Feedback. Data analyses would be fed back for examination and discussion in the team. ‘Ownership’ of the diagnosis would be developed to formulate a commitment to action.
• Action. Individual courses of action and the development of broader strategies would be formulated.
• Evaluation. There would be an ongoing review of methods and outcomes. The effectiveness of any action would be evaluated against agreed criteria and critical success factors.
All of these stages are interrelated, so there is no definitive path that the team would take. In subsequent iterations of activities, the team could move around the stages in any order that suits its needs. The process is illustrated in the following example, where action research was used to evaluate a youth Drop-In Centre. The detail of the case is limited, but there is sufficient to see action research being successfully practised in an area that could be considered a marketing research challenge. It also poses the question of what other research designs could have worked in these circumstances.
• Example – Creating ‘The Kit’: A Drop-In Centre for street-involved youth, in a Canadian city, had been running for four years and it was time to evaluate its services. The Centre’s mission and clientele were controversial. Some people felt that a safe place to ‘hang out’ met the initial needs of street-involved youth and allowed staff to reach out informally, build trust, and intervene effectively in crises. Others wanted more structured activities and stricter rules, whilst others thought the Centre attracted ‘high-risk’ youth to the area and wanted it shut down completely. The evaluation identified four objectives:
1. To involve youth in designing and implementing an evaluation to measure the impact of Drop-In services.

2. To improve service delivery to youth.
3. To collaborate with community members on long-term solutions to help integrate street youth into the community.
4. To make the evaluation instrument available to other youth centres.
A team was built of six youth, two staff members and one outside professional researcher. The group met two afternoons per week at the Drop-In. The evaluation began with the team considering three questions: ‘What do we want to know?’, ‘How will we find out?’ and ‘Who do we need to talk to?’ The team worked together and in small groups to develop the initial questionnaire. Framing questions that would get the information it wanted was a lengthy process, with many drafts and redrafts. The team moved on to try alternative methods, with the youth working together to develop tools that would fit the Centre milieu and engage other street-involved youth. When the tools were ready, each session was advertised in the Drop-In, with pizza as an incentive for participation. An iterative process was established for data analysis, ‘walking through’ responses to open-ended questions to learn the principles and techniques of analysis. The youth worked in pairs to describe and summarise the data. The team produced a formal report, a ‘Kit’, and participated in community and academic presentations about its work. The formal written report was produced by everyone brainstorming the contents, the professional researcher drafting each section, submitting it for feedback to the team and redrafting. The result was a thorough, well-crafted document. The youth took the main role in community and academic presentations. Sharing their expertise publicly helped them gain confidence and pride in their hard work and impressive results. Perhaps the most interesting reporting mechanism was ‘The Kit’, a colourful guide for other youth evaluators. ‘The Kit’ was designed and produced entirely by the youth team members.

3

Research Process

THE PROCESS OF SOCIAL RESEARCH

In part 1 we discussed two images of the nature of the research process in some detail: namely research as a journey or quest for truth, and research as a process of knowledge production. The first image emphasises the epistemological dimension of research, namely that the ultimate aim of all science is to arrive at results that are as valid or truthful as possible. The second image focusses our attention on the equally important issue of resources and resource management. In this chapter we address a third image, namely research as a decision-making process. Our attention is now directed towards the researcher and the kind of decisions or judgment calls that he or she has to make during the research process.

Before discussing the kinds of decisions that have to be taken during research, we shall reflect briefly on the nature of the process itself. What exactly happens in the research process? We have already made the point that it is in the World of Science that phenomena in World 1 are made objects of systematic and methodical inquiry. It is for this reason that the relationship between the scientist and the world has traditionally been referred to as the ‘subject-object’ relationship. This was taken to mean that, as the ‘knower’, the scientist is the subject or agent, and the phenomena that she is studying are the ‘objects’ of her inquiry. However, this has become a politically sensitive issue, especially for moral reasons. The criticism is that such a description of the relationship between researchers and their human subjects opens up the possibility of reducing research subjects to mere instruments in the attainment of the larger research goal. It is argued that this results in the dehumanisation of the research subject and negation of the fact that the research subject is a free human agent with equal rights and freedoms to those of the researcher. Viewing this relationship in terms of subject and object disempowers those who participate in research projects. More recently, a number of scholars, including Reason and Rowan, and Morgan, have suggested using the notion of ‘engagement’ to describe the relationship between the scientist and the ‘object’ of inquiry. The notion of ‘engagement’ suggests a kind of reciprocity and even mutual interdependence between the researcher and the research subject or participant.

Despite these criticisms, this way of describing the relationship does establish an important principle, namely that, for the sake of research, the scientist ‘objectifies’ some aspect of the social world. The scientist transforms a certain phenomenon into a cognitive ‘object’ of study by abstracting certain features from the social world. Regardless of whether the relationship between the researcher and the social world is viewed primarily in terms of a ‘subject-object’ framework or in terms of an ‘engagement’ framework, the researcher still has to take certain decisions. He or she has to make judgment calls based on the information that is available. What we refer to as the ‘stages’ in the research process, or as the ‘research cycle’, are in fact ‘clusters’ of related decisions taken by the researcher. These stages are actually decisions to act in one specific way rather than in another. The following simple framework applies to all decision-making, also in research: as rational agents we decide on a certain course of action against the background of the available information.
A decision is deemed to be rational or reasonable when others, usually our peers, concur with our judgment. In other words, a decision is deemed to be a good one when there is consensus that, given the available information, it was the best decision. This implies that the most reasonable decision is the one that is judged to have the best chance of leading to whatever is defined as a successful outcome. We return to our analogy of the journey. The traveller who wishes to ensure that he reaches his destination as planned in terms of the time frame and the budget will take the appropriate decisions regarding the route, schedules, resources and mode of transportation. Such decisions are taken against the background of the available information. So, for instance, if the travel agent has informed the traveller that his plane leaves Johannesburg International at 19h00 on Thursday evening and he plans accordingly, one could say that he has acted rationally. If, in such a case, the traveller arrives at the airport and, owing to poor weather or a technical difficulty, the flight has been cancelled, no one would accuse him of having made a poor decision. Of course, if undertaking the journey is a matter of life and death, one would expect the person to have considered contingency plans. If, on the other hand, our traveller simply arrives at the airport on the assumption that there will be a flight to his particular destination and that there are seats available, this would be seen as somewhat irrational behaviour. This example shows that what is accepted as

‘reasonable behaviour’ or ‘rational decision making’, depends on a number of factors such as the goal or destination of the particular action, the information available and the risk associated with the outcome. In the remainder of part 2, we will explore, in more detail, the notion of ‘rational decision making’ as it applies to research. But we must first identify the ‘kinds’ of decisions that a researcher has to make in a typical research project. The ‘dynamics’ of the process are illustrated in figure.

Fig. The research process.

FORMULATING THE RESEARCH PROBLEM

All empirical scientific inquiry begins with a ‘movement’ from World 1 to World 2. Research commences at the point where someone, in this case the researcher, begins to reflect on some aspect of World 1. This reflection can be a very unstructured thought, a conjecture, a question or a hypothesis. Some phenomenon in the social world, like the nature of depth perception, the activities of drug addicts, the poverty of a large section of the population or the effectiveness of a new crime prevention programme, has prompted the researcher to ask a question that requires an answer. This ‘phrasing of a question’, this ‘putting into words’, involves a cognitive representation of some real-world phenomena. Something concrete and socially real that has causes and effects, is cognitively or mentally represented in the form of concepts that are strung together to form coherent propositions in the form of research questions. In simple terms, by formulating the problem, we are ‘abstracting’ from the ‘concrete’ social phenomenon.

CONCEPTUALIZING THE PROBLEM

We distinguish between merely stating or formulating the problem and the next stage, which involves the conceptualisation of the problem. Conceptualisation involves at least two activities, namely the conceptual clarification or analysis of key concepts in the problem statement, and relating the problem to a broader conceptual framework or context. Conceptual clarification involves definition of the key concepts, usually those referring to the key features of the phenomenon to be studied. If there are standard definitions in the field, these should be used. If not, the researcher has to ensure that the meanings of such concepts are clearly specified. But conceptualisation also means integrating or embedding the research problem within a larger body of knowledge. Obviously this only applies where such a body of knowledge does exist. The reason for doing systematic literature searches is to determine whether previous empirical research or theoretical studies, which may guide one’s own study, have already been conducted.

Operationalisation

Once the research problem has been conceptualised, we must specify how our conceptualisation, which is in the form of a hypothesis, theory or research problem, relates to the real world of things and events. We must establish linkages between our concepts and the phenomenon that we wish to investigate. This is done through the process of operationalisation or operational definitions. There is a clear connection between conceptualisation and operationalisation, based on the distinction between the sense and the reference of concepts. In the process of conceptualisation we analyse, inter alia, the meanings or connotations of concepts and their interrelationships. In the process of operationalisation, we define the references or denotations of these same concepts. The best way to operationalise is to list the measurement ‘operations’ or ‘rules’ in terms of which the classes of World 1 phenomena are uniquely determined. Operationalisation consists of the construction of ‘a set of operations’ or ‘measures’ that link the research problem to the world. Such measures can be either highly structured, as in quantitative measurement procedures, or highly unstructured. But in any empirical social research, the formulation and conceptualisation of the research problem must be followed by a process of operationalisation.

It is important to note that the term ‘measurement’ is used in two senses in the literature. On the one hand, it refers to the construction of a measuring instrument such as a scale, a questionnaire, an interview schedule or an index. On the other hand, it involves the actual process of measuring some phenomenon such as the incidence of juvenile crimes in a specific geographical area or attitudes regarding specific issues. In the latter case, measurement is actually synonymous with data collection. In this book the use of the term ‘operationalisation’ is limited to the actual design and development of a measuring instrument.

Selection of Cases

When formulating the problem, we already identify the ‘kind’ of social phenomenon that we plan to investigate. In an earlier chapter we identified six kinds of ‘units of analysis’ in the social sciences. However, this simply meant that a certain kind of social entity, namely individuals, groups, organisations, social objects, social actions and events, and interventions, was identified. This process does not identify the actual ‘population’ of entities in the real world that meet the definition of ‘the unit of analysis’. The selection of subjects or cases refers to:
• Decisions regarding the population that we wish to study, and
• Decisions on whether, where practically possible, the whole population will be studied or whether we will, in fact, select samples of elements from the population.

Data Collection

Data collection involves applying the measuring instrument to the sample or cases selected for the investigation. We must constantly remind ourselves that the human senses are our ‘first-order’ measuring instruments, even if they are qualitative. On the basis of our visual, auditory and tactile observations and perceptions, we begin to classify responses, people, actions and events. However, because we aspire to truthful representations of the social world, we have to ‘augment’ our observations by more reliable and valid measuring instruments such as scales, questionnaires and observation schedules. If properly constructed and validated over time, such instruments assist us in collecting data that are more likely to be reliable than they would be had we not used instruments.

Data Analysis and Interpretation

Data collection produces new information or data about the world that requires further ‘processing’. Data processing involves at least two kinds of operations, namely data reduction, during which the quantitative and qualitative data are summarised, and data analysis. Data analysis would include both qualitative analysis, which includes processes such as thematic and content analysis, and quantitative or statistical analysis. Data processing is followed by synthesis, which involves ‘interpretation’ or ‘explanation’ of the data.

This concludes our discussion of the main stages in the research process. In the next section we argue that these are not merely consecutive steps in a chronological sequence of events, but that, underlying the process, there is a logic that is peculiar to scientific inquiry.

THE LOGIC OF RESEARCH

The logic of research is the logic of argumentation. Perhaps the best analogy is a legal one. What happens during, for instance, a murder trial? An attorney prosecutes someone by building a case on the evidence available. There may be different kinds of evidence such as eyewitness accounts, forensic evidence and ballistic evidence. The case is defended before a judge or jury, who have to make a judgement. Their judgement is the outcome of a process of weighing, as impartially and objectively as possible, the evidence presented. The social scientist similarly argues for a specific point of view. She has also to adduce evidence in support of the particular point of view and ‘defend’ it before the ‘jury’ of peers – the research community. Many legal expressions, such as weight of evidence, burden of proof, arguing a case and beyond reasonable doubt, are just as applicable to scientific inquiry as they are to a court case!

Making judgements on the basis of the evidence available is part and parcel of being human. All our everyday decisions, for instance about what to wear or which route to take to work, are judgements made on the basis of available evidence. Some judgements have further-reaching effects than others. Some examples are parents having to decide on a school for their child, a doctor who has to prescribe a specific course of antibiotics on the basis of a clinical diagnosis, or a politician deciding to implement a particular policy on the basis of the advice of policy analysts. But these are all examples of the general form of logical reasoning or argumentation. In all these cases, reasoning or argumentation consists of drawing certain conclusions or making certain judgements on the basis of some body of evidence. Scientific reasoning has a similar logical form: in a very basic sense, a scientific thesis or report is no more than an extended logical argument. We shall define argument, following Larry Wright, as ‘the dispassionate marshalling of support for some statement’. Just as, in everyday life, you might defend a certain point of view by citing evidence in its support, the social scientist attempts to muster scientific evidence in support of a specific point of view. If scientific reasoning can be correctly characterised as being like a logical argument, then it follows that any research study has to comply with the rules of logic, the rules of valid reasoning. This suggests that we need to have clarity about concepts such as ‘argument’ and ‘logical reasoning’.

THE NATURE OF ARGUMENTATION

The following is the basic scheme of a typical argument: a set of supporting statements or premises (S1, S2 … Sn) is offered in support of a conclusion (C).

Consider an everyday example:
• There is no doubt that Italian drivers are the worst in Europe. Just note how they ignore red traffic lights and stop signs. And it is hardly necessary to refer to their lack of courtesy when they cut across traffic lanes without indicating their intention of doing so and the way they force their way into lanes without considering other drivers!
This ‘loose’ argument can be represented schematically as follows:

• S1: Italian drivers ignore red traffic lights.
• S2: Italian drivers ignore stop signs.
• S3: Italian drivers force their way into traffic lanes.
• S4: Italian drivers change lanes without giving the necessary signals.

• C: Italian drivers are the worst in Europe.

This particular example does not of course constitute a valid argument. Closer inspection reveals that the premises do not actually provide sufficient support for the conclusion. On the basis of S1 to S4 one would at best be able to claim that Italy has exceptionally poor drivers, but certainly not that they are the worst in Europe. Although they may well be the worst, this conclusion is not substantiated by the supporting evidence. This is a typical example of ‘inductive reasoning’. The distinctive feature of inductive reasoning is that, even if the supporting evidence or premises are accepted as true, there is always the possibility that the conclusion may not be true. The problem in such cases is usually that the conclusion is broader than the premises imply; the conclusion goes ‘beyond’ the data.

Our example illustrates another important feature of logical reasoning, namely that ‘inferential validity’ does not refer to the truth or reliability of the premises but rather to the relationship between premises and conclusion. Drawing an inference refers to the ‘logical jump’ that one makes from the premises to the conclusions. The validity or acceptability of this ‘jump’ is determined, not by the validity of the premises, but by whether the premises provide support for the conclusions. Empirical evidence that can provide support for the truth or likelihood of a conclusion must therefore be both ‘true’, or at least highly probable, and also relevant to the conclusion. But how does one assess ‘relevance’ of evidence? Assume that we were to add the following premise to the preceding example:

• S5: Rome is the capital of Italy.

Although this statement happens to be true, it is clearly quite irrelevant to the conclusion. However, adding the following premise to the argument – ‘S6: The accident rate in Italy is the highest in Europe’ – would constitute more supporting evidence. The reason why S5 is not relevant to our argument while S6 is, is obvious: the ‘problem’ that requires explanation or clarification, is the claim that ‘Italian drivers are the worst in Europe’. S5 does not address this claim in any way, whereas S6 does. The notion of ‘evidence’ therefore presupposes a certain ‘problem’ or context. Relevance of evidence is determined by the problem or context. Thus: the statement in S6 is relevant to the conclusion and also increases the likelihood that our conclusion is true. Whether the addition of S6 constitutes ‘sufficient’ evidence is a matter for debate. Some people would regard the addition of S6 as adequate evidence, while others would want to investigate additional evidence such as the number of traffic violations per driver. The fact remains that S6 is not only relevant but also strengthens the conclusion and thereby increases its inferential validity – the second condition of sufficient support.

To summarise then: there are three key ‘elements’ in any research project:
1. The problem or research question or issue that is being addressed.
2. The evidence required to address or solve the problem adequately.
3. The conclusions that will be drawn on the basis of the evidence collected and will resolve the problem either way.

In the remainder of the book I shall refer to this as the PEC-framework of scientific reasoning. The PEC-framework constitutes – albeit in an oversimplified way – the general form of all reasoning in science. The logic that characterises all sound research is illustrated in the figure below.

Fig. The logic used in research.

SOME COMMENTS

There are obviously many variations of P, E and C. Even a cursory perusal of journals in the social sciences will reveal the considerable variety of research problems in this field. In basic research, such as testing hypotheses or theories, the research issues differ fundamentally from applied research problems such as evaluating the effectiveness of a new social programme or intervention. There is a substantial difference between empirical problems (for instance, what are the causes of absenteeism?) and theoretical or conceptual problems (for instance, how should we define alienation?). This is an important point to emphasise because the kind of research problem ultimately determines the logic of the study. What do we mean by this? It means that the nature of the research problem determines what will constitute adequate evidence. The kind of design and logic required for a study that aims to develop a new explanation or hypothesis for a specific phenomenon will differ from the kind required for a study which aims to confirm an already well-established theory. A study that is breaking new ground differs substantially from a study that aims to evaluate a social programme.

INDUCTIVE AND DEDUCTIVE REASONING

We shall use an extended example to discuss the basic differences between deduction and induction. In the second half of the chapter we shall discuss the difference between inductive generalisation, which is sometimes simply referred to as ‘generalisation’, and retroduction, which is also referred to as ‘diagnostic induction’ or ‘abduction’. The following example, taken from Larry Wright’s excellent book entitled Better reasoning, is a reconstruction of certain real events; and the reader must resist the temptation to allow any personal background knowledge – especially after the movie JFK! – to affect the reading of the example.

THE JFK EXAMPLE

Suppose we were considering various arguments in favour of the claim that President John F. Kennedy was in fact shot by Lee Harvey Oswald. A first formulation of an argument in support of C could be the following:

• S1: Shortly after the assassination, Lee Harvey Oswald was noticed in the book depository from which the shots had been fired.
• C: Lee Harvey Oswald shot President Kennedy.

Although S1 does lend some support to C, it is obvious that this does not constitute a strong argument in favour of the conclusion. So we immediately proceed to strengthen our argument by the addition of S2:
• S1: Shortly after the assassination, Lee Harvey Oswald was noticed in the book depository from which the shots had been fired.
• S2: Oswald’s palm print was found on a rifle left close to the window from which the shots had been fired.
• C: Lee Harvey Oswald shot President Kennedy.

There can be little doubt that the addition of S2 increases the likelihood of C being correct. However, if this evidence were to be presented before a jury, it is unlikely that it would convince the members. Let us assume, therefore – and remember that this is a creative reconstruction! – that we add two further pieces of evidence:

• S1: Shortly after the assassination, Lee Harvey Oswald was noticed in the book depository from which the shots had been fired.
• S2: Oswald’s palm print was found on a rifle left close to the window from which the shots had been fired.
• S3: An eye witness identified Oswald as the assassin.
• S4: According to the ballistic tests, the fatal shots could have been fired from the rifle.
• C: Lee Harvey Oswald shot President Kennedy.

The support for C, offered by the arguments S1 to S4, now appears to be overwhelming. In most courts of law such evidence might even be judged sufficient. But let us continue and think of a somewhat more far-fetched possibility. Wright argues that the case against Oswald could have been made pretty much watertight had the following kind of evidence been available:
• Assume that the owner of the book depository from which the shots were fired had been concerned about the security of his store. As a precautionary measure he had had closed-circuit television installed, and the whole episode had been recorded on tape. The quality of the recording was such that there could not be the slightest trace of doubt that it had indeed been Oswald who had fired the shots.

What is clear is that each new piece of evidence increases the support for the conclusion. The addition of the final video-recorded evidence appears to have made the case watertight. On the basis of the evidence presented it is no longer possible to arrive at an alternative conclusion. One is virtually compelled to accept C! Or is one? Perhaps there is a possibility, however remote, that Oswald was not the assassin! Assume, says Wright, that the evidence just referred to was fabricated. Assume that an amazingly ingenious plot was hatched to frame Oswald for the murder of President Kennedy. With this purpose in mind, an exact replica of the book depository was built elsewhere and equipped with similar video cameras; a similar motorcade was arranged, someone who bore an unusually close resemblance to Oswald did everything that the real Oswald was supposed to have done, everything was recorded, the actual video tapes were replaced with the forged tapes, and so on. Although this is perhaps not the type of evidence that anyone, and particularly not a jury, would regard seriously, it nonetheless remains a logically possible explanation of the existing evidence. In other words, the conclusion does not necessarily follow logically from the evidence because this conclusion, although seemingly outrageous, is at least conceivable. One could think up additional hypothetical examples, and even more outrageous ones, but in the final instance, there is only one way to remove all possible doubt if C is to follow necessarily from the premises. As Wright indicates, were that to happen, the nature of the argument would change radically.

Up to this point we have been concerned with the issue of the weight of the evidence: in logical terms this is an inductive argument. However, when we modify the argument so that the conclusion necessarily follows from the premises, the argument loses its evidential character. When this happens, the supporting evidence is linked to the conclusion on the basis of semantic considerations rather than on any piece of empirical evidence. In such a case, either implicitly or explicitly, the conclusion is then already contained in the premises. This type of argument is called a deductive argument and would include a statement such as S5.
• S1: Shortly after the assassination Lee Harvey Oswald was noticed in the book depository from which the shots had been fired.
• S2: Oswald’s palm print was found on a rifle left close to the window from which the shots had been fired.

• S3: An eye witness identified Oswald as the assassin.
• S4: According to the ballistic tests, the fatal shots could have been fired from the rifle mentioned in S2.
• S5: Mary Oswald’s husband shot President Kennedy.
• C: Lee Harvey Oswald shot President Kennedy.

The conclusion here is explicitly contained in the supporting premises. In fact, the inclusion of S5 immediately makes S1–S4 redundant. This is what is meant by the above reference to the argument losing its ‘evidentiary’ nature. Of course, no-one would offer such an argument for any purpose other than to illustrate the difference between a deductive argument and an inductive argument. This example was used to illustrate the principle of degrees of inductive support and the notion of adequate support. It simultaneously shows that inductive and deductive arguments differ radically. This difference will now be explored more systematically.

SOME DEFINITIONS

It is important to reiterate that in our analysis of inferences – the inferential relationship between premises and conclusion – we are not questioning the epistemic status of the premises. For the sake of argument we accept the truth of all the premises, in other words, that the evidence is reliable. Having accepted the truth of the premises, we are interested in how much support they provide for the conclusion. In the Kennedy example, there were two possible answers to this question: inductive support, in which the premises provide gradual support for the conclusion, and deductive validity, in which the truth of the conclusion is either implicitly or explicitly contained in the premises. This example enables us to define induction and deduction formally:
• In an inductive argument, genuine supporting evidence can only lead to highly probable conclusions. In other words, in an inductive argument, supporting statements merely lend gradual support to the conclusion.
• In a deductive argument, true premises necessarily lead to true conclusions; the truth of the conclusion is already either implicitly or explicitly contained in the truth of the premises.

The following commonplace examples of deductive and inductive arguments illustrate the difference:
• Deductive:
– All mammals have hearts.
– All horses are mammals.
– All horses have hearts.
• Inductive:
– Horse 1 (was observed) to have a heart.
– Horse 2 (was observed) to have a heart.
– Horse 3 (was observed) to have a heart.
– Horse n (was observed) to have a heart.
– All horses have hearts.

The use of the same empirical evidence in both examples illustrates the important differences between inductive and deductive arguments. In both examples the truth of the premises is accepted. In the deductive argument, the conclusion is already contained in the premises, and the conclusion is no more than an explication of the premises. In the inductive argument, however, the conclusion is supported by the observations made and is hence supported by the premises. The conclusion is highly probable, but, however unlikely this may appear, there is still a possibility that a type of horse that does not have a heart may be discovered. Thus, in the inductive argument the conclusion does not follow of necessity. The differences between induction and deduction are summarised as follows by Salmon:
• Deduction:
– If all the premises are true, then the conclusion must be true.
– All of the information or factual content in the conclusion was already contained, at least implicitly, in the premises.
• Induction:
– If all the premises are true then the conclusion is probably, but not necessarily, true.
– The conclusion contains information not even implicitly present in the premises.
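To make the contrast between the two horse arguments explicit, they can be written out formally. The notation below is our own shorthand rather than anything given in the text, with M(x) standing for ‘x is a mammal’, H(x) for ‘x has a heart’ and Ho(x) for ‘x is a horse’:

% Sketch in LaTeX (amsmath assumed for \text); the horizontal line marks the inferential step.
% Deductive form: the conclusion is already contained in the two universal premises.
\[
\frac{\forall x\,(M(x) \rightarrow H(x)) \qquad \forall x\,(Ho(x) \rightarrow M(x))}
     {\forall x\,(Ho(x) \rightarrow H(x))}
\]
% Inductive form: n observed horses with hearts support, but do not entail, the generalisation.
\[
\frac{Ho(h_1) \wedge H(h_1), \quad Ho(h_2) \wedge H(h_2), \quad \ldots, \quad Ho(h_n) \wedge H(h_n)}
     {\text{(probably) } \forall x\,(Ho(x) \rightarrow H(x))}
\]

In the first schema the step across the line is truth-preserving; in the second it merely confers a degree of probability, which is exactly Salmon’s point above.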

INDUCTIVE GENERALIZATION AND RETRODUCTIVE INFERENCE

The two examples of non-deductive reasoning – the JFK example and the horse example – reveal an important difference between two kinds of inductive inference. In the first case, we drew a conclusion on the basis of various items of supporting evidence. Both the premises and the conclusion refer to a single case, namely the assassination of JFK, and the conclusion is offered as the ‘best explanation’ of why something happened. In the second example, we drew a conclusion on the basis of a limited number of observations, but we generalised beyond the actual number of horses observed. In this particular case, the validity of our conclusion depends both on the accuracy of our observations and on whether the cases that we have observed are representative of the total population of horses. It has become common to refer to these two kinds of inductive inference as examples of retroduction and inductive generalisation.

The common denominator in both examples is that our conclusions go beyond the premises in the sense that we add information that is not already contained in the premises. The difference between retroduction and inductive generalisation lies in the way in which our conclusions ‘go beyond’ the premises. In the case of retroductive inference, our conclusion is an ‘inference based on the best explanation’ of the observed events. Our conclusion is offered as an explanation that provides an account of what has been observed. So, in the JFK case, the conclusion is offered as the ‘most plausible’ explanation of the events. In the second example, our conclusion goes beyond the data through the process of generalisation. The observed cases referred to in the premises are accepted as being sufficiently representative of a certain population of similar cases to enable us to formulate a generalisation that would apply to all such cases.

CONCLUDING REMARKS

These distinctions provide us with the following scheme of types of scientific reasoning:
• Deduction: An inference where the conclusion follows necessarily from the premises, that is, the conclusion is already contained in the premises.

• Induction: Inference where the conclusions – in different ways – go beyond the premises.
– Inductive generalisation: An inference that generalises from the specific to the general.
– Retroduction: An inference from the observed to the hidden or underlying mechanism that explains the workings of what is observed.

TYPES OF REASONING IN SOCIAL RESEARCH

The three forms of scientific reasoning distinguished above, namely deduction, inductive generalisation and retroduction, perform three different functions in scientific inquiry. Let us summarise the function of each:
• Deductive reasoning is used when a researcher wishes to test an existing theory and has to generate research hypotheses. By definition, theories are usually fairly abstract and general statements, which makes empirical validation quite difficult. But if a theory is true, or at least provides a plausible explanation of a certain phenomenon, it must be empirically testable. It must be possible to derive research hypotheses from such a theory in order to collect evidence which would either support or refute such hypotheses. Derivation of hypotheses from theories involves deductive reasoning.
• In survey research, field experiments and most other forms of quantitative research, it has become standard practice to draw samples of cases rather than attempting to gather data from the whole population. However, once the researcher has collected data from the sample, she usually wishes to generalise her findings to the target population. This ‘movement’ from sample to population involves inductive generalisation. In fact, some researchers would argue that representative samples are not required if one wishes to generalise beyond the evidence collected. It has become acceptable, also in qualitative research, to use a form of inductive inference called analytic induction, to generalise from a small number of examined cases to a larger population of similar cases.
• Scientists are rarely satisfied with merely establishing that something is the case. It is not enough to know that there is some pattern, some regularity in human behaviour. We also want to know why people act in specific ways, why certain groups of people hold particular views, why some students perform better than others. We are interested in explanations of phenomena and events in the social world. Such explanations are put forward in the form of new hypotheses or theories. Such a hypothesis or theory is judged to be a good explanation if it provides a plausible account of certain observable phenomena; if it can show that it would be reasonable to expect certain patterns or regularities given that particular theory or hypothesis. But how do scientists come up with plausible explanations? It clearly involves a logical jump beyond the data. One needs to go beyond the evidence at hand and ‘think up’ an explanation. The kind of reasoning involved here is called retroductive reasoning or ‘retroduction’ for short.

In the remainder of the chapter, we will discuss an example of research in the social sciences that illustrates how these three forms of reasoning find expression in actual research. But before we discuss the example in detail, it is worth making an additional point. In an earlier chapter we argued that the formulation of the research problem determines what would constitute appropriate evidence to address the problem. And precisely because reasoning is all about drawing conclusions from evidence, it follows that the form of reasoning is in fact determined by the nature of the research problem. In other words, the manner in which the problem is defined determines the kind of reasoning required in a particular study. Let us illustrate this by looking at the reasoning involved in two different kinds of study: hypothesis-testing studies and hypothesis-generating studies. The differences are presented schematically in the figure below.

Fig. The differences between hypothesis-testing studies and hypothesis-generating studies.

Although there are many variations in the statement of the problem and hence also in types of studies, this figure clearly illustrates the formal differences between the three kinds of reasoning.

EXAMPLE: GIORGI’S STUDY ON RELIGIOUS INVOLVEMENT IN SECULARISED SOCIETIES

We begin by reconstructing the main decision-making stages in this study.

Stage 1: The Problem

Giorgi’s general interest is in examining the ‘secularisation profiles’ of various European countries. As suggested by the title of the article, she is specifically interested in focussing on religious involvement in secularised societies. After discussing certain conceptual issues, Giorgi remarks that the patterns usually associated with the process of secularisation are present in European countries. However, there are also certain cross-national variations that require explanation. Although she does not formulate this clearly, the research question of her study could be formulated as follows: if increasing secularisation can be expected over time in societies that are becoming increasingly industrialised and modernised, why are there still notable differences between countries?

Stage 2: Conceptualisation of the Problem

Giorgi then argues that Martin’s general theory of secularisation, which is somewhat unique in that it examines secularisation in its politico-historical context, might in fact suggest an answer to this question. Let us quote Giorgi on what she regards as the core ideas of his theory:
• While never made explicit, one basic assumption underlying Martin’s theory is that both religion, especially through its institutional personification in a church, and the nation-state confer identity upon individuals, even if the identity conferred is of a qualitatively different kind. Consequently, even though national or group identity can be mediated through religion, in the typical case, once the state becomes the identity nexus of a society, religion or rather the Church will be marginalized, and this affects individual religious involvement. But there are variations in the way this displacement is brought about, and hence different patterns can be seen to emerge … Within this context, and in true Weberian style, Martin’s general theory set one basic premise: that historical events can and do become crucial in the way societies change and develop; they do so by delimiting the space along which change proceeds … the most crucial of these historical events for the secularisation process is of course the Reformation that fundamentally divided Europe across religious lines, thus challenging the tradition of the infallible Catholic unity and all that it represented.

The main theses in Martin’s theory can be summarised as follows:
• Religion and the nation-state confer identity on individuals.
• Secularisation actually means that the state replaces the Church as the most important social entity or institution in a society.
• Thus, once the nation-state becomes the identity nexus in a society, the role of religion becomes more marginal.
• Secularisation patterns are not identical across societies.
• Historical events are crucial in determining the way societies change.

• The Reformation is the most crucial of historical events as far as the process of secularisation is concerned.

But what was distinctive about the Reformation? According to Martin, certain elements that had always been present in Christianity – and which always had the potential for secularisation – were brought to the surface by Protestantism. One of the most important of these elements is the individualistic nature of Christianity. With the loss of ritual and symbolism that accompanied the Reformation, secularisation could no longer be held at bay! This leads to the formulation of a final statement, namely:
• The Reformation led to a re-emphasis of the rationalistic and individualistic elements that had always been present in Christianity.

If these seven statements, which form the core of Martin’s theory, are true, what could be expected to follow from this? Martin in fact derives three more specific theses from this theory. We will confine ourselves to his first thesis. This is Martin’s argument as reconstructed by Giorgi:
• Martin associated Protestantism with individual striving, Catholicism with collective class antagonism … In those countries that adopted the Reformation, argues Martin, and hence Protestantism, the formal separation and effectively the subordination of the Church to the State was smoothly established. In Catholic countries, on the other hand, the Church was forced to oppose any rising political secular ideology, including the State that personified the secular in all its self-willed power … This establishes a spiral of secularisation in Catholic countries, which veers between the two extremes of religiosity and atheism, where atheism is often associated with communism. In these countries, religion thus becomes a major issue in class conflict and political conflict in general. Alternatively, in Protestant countries, the Church is subordinate to the State from the start, and hence the cleavage between religion and politics is not as drastic as it is in Catholic countries.

Giorgi has now reached the point where she can test a specific hypothesis, which is deductively derived from Martin’s theory. The hypothesis can be formulated as follows:
• One would expect the patterns of religiosity to differ in countries that are predominantly either Catholic or Protestant. More specifically, one would expect greater religious involvement in Catholic countries as opposed to Protestant countries.

This still very general hypothesis is not immediately empirically testable. The key concept in the statement is that of ‘religious involvement’. This raises the question of operationalisation.

Stage 3: Operationalisation

In an earlier section of the paper, Giorgi has in fact argued that religious involvement can be operationalised through the use of four indicators, namely:

• Church attendance;
• Self-assessed religiosity;
• Doctrinal orthodoxy; and
• Devotionalism.

Each of these concepts is subsequently discussed in some detail and linked to specific items in the European Values Study conducted in ten European countries in 1981. The other key question that follows from the above statement is how one decides which countries are predominantly Catholic and which are predominantly Protestant. Having operationally defined the key concepts in the statement, Giorgi is now in a position to test the specific research hypothesis, which we can formulate as follows:
• One would expect to find higher proportions of religiously involved people in the predominantly Catholic countries of Europe such as France, Italy and Spain than in the predominantly Protestant countries such as Britain and Denmark. Less clear patterns might be expected to occur in the so-called ‘mixed countries’.
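As an aside, operationalisation through multiple indicators often ends in a simple composite score. The sketch below is hypothetical – the actual European Values Study items and their coding are not given in the text – and merely assumes that each indicator has been scored between 0 and 1, with the religious-involvement index taken as their mean:

# Hypothetical composite index; the scaling of the indicators is an assumption.
def involvement_index(church_attendance, self_assessed_religiosity,
                      doctrinal_orthodoxy, devotionalism):
    # Each argument is assumed to be pre-scaled to the 0-1 range.
    indicators = [church_attendance, self_assessed_religiosity,
                  doctrinal_orthodoxy, devotionalism]
    return sum(indicators) / len(indicators)

# Hypothetical respondent: attends church weekly (1.0), calls herself religious (1.0),
# endorses half of the doctrinal items (0.5), prays only occasionally (0.5).
print(involvement_index(1.0, 1.0, 0.5, 0.5))  # 0.75

Comparing the distribution of such an index across predominantly Catholic and predominantly Protestant countries is, in effect, what the research hypothesis above asks for.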

Stage 4: Sample and Data

Giorgi has very little to say on the sample and data. This is because she has used existing data collected as part of an international research programme in 1981. It is precisely because this is an extremely well-known study, which has been well documented elsewhere, that she probably assumed that her reader would be willing to accept the fact that the sampling and data collection are of a sufficiently high scientific quality.

Stage 5: Analysis and Interpretation

On the basis of the available data analysed through a simple two-way table, she draws two conclusions:
• Regarding the differences between Catholics and Protestants, as expected, Catholics not only attend church more regularly and profess to be religious in larger numbers than Protestants; they also endorse the religious doctrines to a greater extent than Protestants, and claim a higher degree of devotionalism. However, the non-conformists are probably as religious, if not more so, than Catholics themselves, despite being Protestant.

Since the first ‘result’ is as expected, no further comment is required. The rather unexpected second result, however, does require an explanation. Giorgi takes up the challenge. She first rejects an explanation in terms of possible biased sampling. The explanation that Giorgi puts forward is the following: “Non-conformists are notorious for their militancy and conservatism in religious matters. If non-conformists comprise one extreme, the non-affiliators comprise the other. Among the latter, religious involvement is very low indeed”.

This concludes our reconstruction of the Giorgi article. We will now show how, although they are not equally ‘visible’, all three kinds of reasoning are present in this study. The main stages in the reasoning process are presented below.

SUMMARY COMMENTS ON THE EXAMPLE

Deductive Reasoning

Deductive reasoning is exemplified in the derivation of the research hypothesis from Martin’s general theory of secularisation. It has the typical form of a conditional: if Martin’s theory is true, then the research hypothesis follows. Giorgi assumes the truth of Martin’s theory and proceeds to test the truth of the hypothesis derived.

Inductive Generalization

Inductive reasoning enters this study at two stages. First, by inferring from the samples of the countries to the countries themselves, Giorgi generalises beyond the observed data. Although she does not defend this move, it is acceptable, given the status of the European Values Study. The second inductive inference takes place when Giorgi concludes, at the end of the paper, that the findings of her study “support Martin’s general theory”. This is a clear example of inductive support. Giorgi’s study strengthens our belief in the truth of Martin’s theory. It does not prove it conclusively but does substantiate it. This form of inductive reasoning is similar to the JFK example discussed earlier.

Retroductive Inference

Faced with the surprising results about the high religiosity of non-conformists, Giorgi postulates an explanation. She has to ‘invent’ an explanation that will account for the observable patterns. Her hypothesis, which is the result of retroductive reasoning, not only accounts for the results as they pertain to non-conformists, but also explains why non-affiliators scored very low on religious involvement. This gives her explanation an initial plausibility. But its real strength and true explanatory value will only be tested in future studies.

STAGES IN THE RESEARCH PROCESS

FORMULATING THE RESEARCH PROBLEM (CASES, VARIABLES AND RELATIONSHIPS)

The ‘objects’ or ‘entities’ that social scientists study are usually referred to as the ‘units of analysis’ of a project. In an earlier chapter we identified seven general categories of ‘units of analysis’: to wit, individuals, organisations, institutions, collectives, social objects, social actions or events, and interventions. It is usually not difficult to identify the unit of analysis of one’s study. It is, very simply, that which one wishes to investigate. Another way of looking at it is that the unit of analysis is the ‘entity’ or ‘phenomenon’ to which one’s conclusions ought to apply. However, a common problem in research is that researchers tend to confuse the unit of analysis with the data source or sources.

The following is an example. Many studies investigate aspects of individual human behaviour such as individual attitudes, beliefs or kinds of behaviour. In these cases the individual is the unit of analysis, whereas many possible data or information sources can be utilised. These might be interview data such as attitudinal surveys, direct observation such as laboratory experiments, and documentary sources such as diaries and letters. In the first two examples the unit of analysis and the data sources are identical. In the third example, the unit of analysis and the data sources are different. This sometimes confuses novice researchers. There are many cases where the unit of analysis and the source of information regarding the unit of analysis differ. In studies of interventions like training programmes and policies, the intervention is the unit of analysis. In such a situation, individuals who have either designed or participated in the programmes, or both, might well be interviewed, in which case there will be more than one data source. A study of a social object such as a political text might similarly require interviews with certain individuals who are so-called experts, to gather information.

In summary then: it is important to distinguish clearly between the ‘what’ that you are investigating and the data sources that have to be explored in gathering information or evidence about the unit of analysis. The best way to identify the unit of analysis is to ‘think ahead’ to the possible outcomes of your study and ask yourself: “To what entity or set of entities will my conclusions apply?” If you were to conclude, on the basis of your evidence, that eighty per cent of a certain group of individuals hold certain beliefs, then the individuals constitute your unit of analysis. If you were to conclude that a certain training programme had been effective in leading to higher productivity, the programme, which is a kind of intervention, is your unit of analysis. The ‘object’ or ‘target’ of your final conclusion is your unit of analysis.

‘Cases’ are defined as the actual concrete instances of the unit of analysis. Whereas the unit of analysis indicates a kind or type of entity, the ‘cases’ are the actual individuals or groups or towns studied. Thus, in an attitudinal survey of university students, the unit of analysis is ‘the individual’, while the cases are the actual students interviewed. Cases can be counted and might range from one to thousands. Although the first step in most studies is to identify the unit of analysis, researchers are less interested in the actual ‘entity’ or ‘object’ than in aspects of and relationships among specific characteristics or features of such objects. We refer to characteristics or features that take on different values, categories, or attributes as variables. In the following section we focus on ‘variables’ and different ways of categorising them.

Variables and their Attributes

Variables may vary over cases. For instance, individuals differ in terms of features such as age, gender and occupation. For an individual, any characteristic may vary over time – we grow older, more educated, hopefully richer, we may change party affiliation and so on. It is clearly important to distinguish between variables and the attributes or categories of which they consist. Age is a variable that can range over a number of years, while political attitudes can have conservative, moderate or radical categories. Similarly, ‘being divorced’ or ‘being female’ are not variables, but categories of the variables ‘marital status’ and ‘gender’ respectively. To distinguish clearly between a variable and its possible categories, you can apply the following rule of thumb. The terms that you use to describe someone, for example ‘middle class’, ‘English speaking’ and ‘poor’, are attributes of variables. The variables here are ‘social status’, ‘language group’ and ‘level of income’. In order to help you understand the concepts ‘case or unit of analysis’, ‘variables’ and ‘categories’ better, some typical examples are given in the table below.

Table. Research questions, units of analysis, and variables.
Research question/hypothesis [what one wants to know] | Unit of analysis [what entities are described and compared] | Variables [with respect to what characteristics]
Are older people politically more conservative than younger people? | Individuals | Age, political attitudes
The greater the increase in air passenger traffic at a city’s airport, the greater the economic growth | Cities | Increase in air traffic, economic growth
The higher the proportion of female employees, the lower the wages in 19th-century factories | Factories | Proportion of female employees, average wages
Does economic development lower the birth rate? | Nations | Level of economic development, birth rate
A student’s university performance is directly related to his/her parents’ income and educational level | Individuals | University performance, parents’ income and educational level
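For readers who work with data files, the same distinctions can be shown in a few lines of Python. The sketch below uses made-up records: the unit of analysis is the individual, each dictionary is one case, the keys are variables, and the values are the attributes or categories that each case takes on each variable.

# Hypothetical cases: each dict is one case of the unit of analysis 'individual'.
cases = [
    {"age": 23, "gender": "female", "marital_status": "single",   "political_attitude": "moderate"},
    {"age": 57, "gender": "male",   "marital_status": "married",  "political_attitude": "conservative"},
    {"age": 35, "gender": "female", "marital_status": "divorced", "political_attitude": "radical"},
]

# 'age' varies numerically over cases; 'gender', 'marital_status' and
# 'political_attitude' vary over labelled categories such as 'female' or 'divorced'.
mean_age = sum(case["age"] for case in cases) / len(cases)
print(mean_age)  # any conclusion drawn applies to individuals, the unit of analysis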

Types of Variables

Variables can be defined in various ways. In this section we shall look at the distinction between independent and dependent variables.

Independent and Dependent Variables

The distinction between the independent variable and the dependent variable is an extremely useful distinction in social research. Although it was originally predominantly applied in experimental research, this distinction is now widely applied in most kinds of quantitative empirical research. An independent variable is the presumed cause of the dependent variable, which is the presumed effect. Independent variables are presumed to be the variables that are producing or causing certain effects as measured by the dependent variable or variables.

Let us take an example. Say we are interested in explaining why some students perform better at university than others. The variable that we are interested in explaining is ‘university performance’. Following on our discussion above, this variable will have different categories, such as ‘excellent performance’, ‘average performance’ and ‘poor performance’. We would hypothesise that many factors produce good university performance: the individual student’s cognitive capacity, motivation, high-school education and preparation, and the socio-economic background of both student and parents. All of these presumed causal or explanatory factors are so-called independent variables. The independent variable is hence the antecedent, which means that it precedes the dependent variable, which is the consequent. This relationship can be stated in the form of a conditional: if A, then B. This conditional statement says that if the independent variable A obtains, then the dependent variable B follows. We can illustrate this point schematically. Note that it is standard practice to use X to refer to the independent variable(s) and Y for the dependent variable. Note also that the arrow joining X2 and X3 indicates a relationship or correlation between these two variables.

Fig. Causal relationship between variables.

In experiments the independent variable is the variable that is manipulated by the experimenter. Say we are interested in studying the effects of two different teaching methods in a school. We are interested in determining whether the introduction of a new method of teaching second languages to primary school students will improve their pass rate. Teaching method is our independent variable, while the ‘pass rate’ is our dependent variable. We will typically manipulate the independent variable by dividing our sample of schools or classes into an experimental group and a control group. If we have ensured that the schools are relatively similar in terms of other important factors, we would like to ascertain whether differences in scholastic performance between the experimental and control schools are to be attributed to the different teaching methods. Studies aimed at showing that a specific intervention such as a teaching method, a new training programme or a new performance appraisal system leads to better results or is more effective than other comparable interventions, are not simple and require a lot of ingenuity and rigour on the part of the researcher.
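A minimal sketch of how such a comparison might look in practice, using entirely made-up pass rates for the experimental and control classes; the independent variable is the teaching method (which group a class belongs to) and the dependent variable is the pass rate:

from scipy import stats

# Hypothetical pass rates (proportion of learners passing) for five classes per group.
experimental_pass_rates = [0.72, 0.68, 0.75, 0.71, 0.70]  # taught with the new method
control_pass_rates      = [0.61, 0.64, 0.59, 0.66, 0.62]  # taught with the old method

# A two-sample t-test asks whether the difference in mean pass rates is larger
# than we would expect from chance variation between classes.
t_statistic, p_value = stats.ttest_ind(experimental_pass_rates, control_pass_rates)
print(t_statistic, p_value)

A small p-value would suggest that the difference in pass rates is unlikely to be due to chance alone, although, as noted above, attributing it to the teaching method also requires that the two groups were comparable in other respects.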

Quantitative and Qualitative Variables

Another important distinction is that between quantitative and qualitative variables. This distinction reflects a fundamental difference in the way that variable categories are represented numerically. A variable is quantitative if its values or categories consist of numbers and the differences between the categories can be expressed numerically. The variable ‘age’, which is measured in ‘years’, signifies a quantitative difference between people of different ages. Qualitative variables have discrete categories which are usually referred to by words or labels. The ‘gender’ variable has the discrete categories ‘male’ and ‘female’; the variable ‘political affiliation’ has the discrete categories ‘ANC’, ‘NP’, ‘IFP’ and so on. Having explained these distinctions, we can now discuss the different kinds of relationships between variables and how researchers describe these.

Relationships between Variables

Researchers are typically, and perhaps ultimately, interested in the way that ‘things’ in the social world relate to each other. The aim of social research might even be defined as ‘the search for enduring patterns or regularities in relationships among phenomena’. We are interested, for example, in determining whether there is a relationship between unemployment and crime, whether high levels of stress in the workplace are related to absenteeism and whether there is a link between religiosity and suicide attempts. The term ‘relationship’ is part of our everyday vocabulary. We already know that certain events are related when, for instance, one event always seems to precede another. Examples are that certain hours in the mornings and afternoons are associated with heavy traffic volumes and that the advent of the Christmas season is related to increased consumer spending. All such relationships have two elements: first, two or more entities or events are involved and, second, the combinations of events or situations usually change or vary simultaneously: in other words, the occurrence of the one, like the advent of Christmas, coincides with the occurrence of the other, namely increased spending. Two or more variables are therefore said to be related, associated or linked to the extent that changes in the one variable are accompanied by systematic and sometimes predictable changes in the other. How variables vary depends on whether they are quantitative or qualitative.

Relationships Among Qualitative Variables

The core idea of a relationship or association between two qualitative variables is that if the one variable changes, the other variable also changes, and that if the one variable does not change, neither does the other. Consider the relationship between race and executive position in an organisation. Affirmative action is one of the most pressing problems in South Africa at the moment. The question is not only whether blacks are adequately represented in organisations, but also which positions they occupy. The tables below represent three hypothetical situations, as illustrated by the association between race and occupation.

Table. Perfect association.
      | Executive position | Non-executive position | Total
White | 40                 | 0                      | 40
Black | 0                  | 40                     | 40
Total | 40                 | 40                     | 80

Table. Moderate association.
      | Executive position | Non-executive position | Total
White | 25                 | 15                     | 40
Black | 15                 | 25                     | 40
Total | 40                 | 40                     | 80

Table. No association.
      | Executive position | Non-executive position | Total
White | 20                 | 20                     | 40
Black | 20                 | 20                     | 40
Total | 40                 | 40                     | 80

In these tables we have illustrated a perfect relationship between race and executive position; a moderate relationship or association; and a situation where there is no association whatsoever between the variables. The first table expresses a perfect relationship – if the category of one variable changes, then the attribute of the other variable also changes. In other words, if the race of a staff member is known, one could predict with a hundred per cent accuracy whether the person occupies an executive or a non-executive position in the organisation. The second table reveals a pattern of modest association where the two variables are related. More often than not, a prediction about an individual’s executive position, based on his/her race, will be correct. The third table illustrates the situation where there is no association between the two variables. We would be inclined to suggest, and justifiably so, that these tables respectively reflect very strong, in fact ‘perfect’, moderate and zero relationships between the variables. The notion of the strength of the relationship between variables can therefore be defined as the proportion of times that we correctly predict the categories of the one variable, knowing the values of the other. A high proportion of accurate predictions means that the variables are strongly related; a low proportion indicates a moderate or weak association. Statistical indices of association, such as the contingency coefficient, may be computed for the distribution of scores tabulated above. Ordinarily these indices will range from 0 to 1.00.
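As an illustrative sketch (not part of the original text), the contingency coefficient, C = sqrt(chi² / (chi² + n)), can be computed for the three tables above with a few lines of Python. One caveat worth flagging: for a 2×2 table this coefficient cannot actually reach 1.00 – its ceiling is about 0.707 – which is one reason measures such as phi or Cramér’s V are sometimes preferred.

import numpy as np

def contingency_coefficient(table):
    # Pearson's chi-square statistic, then C = sqrt(chi2 / (chi2 + n)).
    table = np.asarray(table, dtype=float)
    n = table.sum()
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    chi2 = ((table - expected) ** 2 / expected).sum()
    return np.sqrt(chi2 / (chi2 + n))

tables = {
    "perfect association":  [[40, 0], [0, 40]],
    "moderate association": [[25, 15], [15, 25]],
    "no association":       [[20, 20], [20, 20]],
}
for label, t in tables.items():
    print(label, round(contingency_coefficient(t), 3))
# perfect association  -> 0.707 (the maximum attainable for a 2x2 table)
# moderate association -> 0.243
# no association       -> 0.0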

Relationships among Quantitative Variables

A quantitative variable is a variable of which the categories can be represented numerically: someone is thirty years old, has an IQ of 115 and an income of R30 000 per year. When we investigate relationships among quantitative variables, it becomes possible to say whether a change in one variable represents an increase or decrease in the value of another. In addition to indicating strength of relationships, we are now able to specify two other aspects of relationships: direction and linearity.

A relationship can be either positive or negative. A positive relationship exists if an increase in the value of one variable is accompanied by an increase in the value of the other, or if a decrease in the value of one variable is accompanied by a decrease in the value of the other. The two variables change in the same direction.

Table. Positive relationship between variables.
Age (years) | Income (per annum)
20          | R15 000
30          | R30 000
40          | R45 000
50          | R60 000

There is a negative or inverse relationship between variables if a decrease in the value of one variable is accompanied by an increase in the value of the other. A change in one variable is opposite in direction to a change in the other. A commonplace example is the relationship between distance travelled and petrol remaining in the fuel tank.

Table. Inverse relationship between variables.
Distance (kms) | Petrol remaining (litres)
100            | 40
200            | 30
300            | 20
400            | 10

These two examples can be depicted graphically. The lines in the two graphs illustrate the idea of linearity. The first figure below depicts a positive linear relationship and the second a negative linear relationship.

Fig. Positive linear relationship between variables.

Fig. Negative linear relationship between variables.

Finally, not all relationships are linear; changes in one variable are not necessarily accompanied by changes in the other variable in one direction only. A typical example in social science is the relationship between stress or anxiety and scholastic performance. Some degree of anxiety is apparently productive and actually results in increased performance. However, at some point, too much anxiety and stress becomes counterproductive and results in reduced performance. This is referred to as a curvilinear relationship. In this case, the rate of change in one variable is not consistent over all the values of the other variable. Another example would be the relationship between age and annual earnings: up to retirement, earnings will generally increase with age and will then gradually decline. A curvilinear relationship is depicted in the graph below.

Fig. Curvilinear relationship between variables.
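To make the distinction between direction and linearity concrete, the sketch below (an illustration assuming NumPy is installed; the anxiety and performance scores are invented for demonstration) computes Pearson's r for the two tables above and for a curvilinear pattern. It shows that a linear index of association can be close to zero even when a strong curvilinear relationship exists.

```python
import numpy as np

# Positive linear relationship (age vs income, from the table above)
age = np.array([20, 30, 40, 50])
income = np.array([15000, 30000, 45000, 60000])

# Negative (inverse) linear relationship (distance vs petrol remaining)
distance = np.array([100, 200, 300, 400])
petrol = np.array([40, 30, 20, 10])

# Curvilinear relationship (hypothetical anxiety vs performance scores):
# performance rises with moderate anxiety, then falls again
anxiety = np.array([1, 2, 3, 4, 5, 6, 7])
performance = np.array([40, 55, 65, 70, 65, 55, 40])

def pearson_r(x, y):
    return np.corrcoef(x, y)[0, 1]

print("age vs income:          r =", round(pearson_r(age, income), 2))       # +1.0
print("distance vs petrol:     r =", round(pearson_r(distance, petrol), 2))  # -1.0
print("anxiety vs performance: r =", round(pearson_r(anxiety, performance), 2))  # ~0
```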

FORMULATING THE RESEARCH PROBLEM (RESEARCH OBJECTIVES)

The research objective or purpose gives a broad indication of what researchers wish to achieve in their research. For example, the aims of a project might be:
• To describe or explain certain phenomena or events, or even to predict future patterns of behaviour;

• To evaluate a particular intervention or educational programme; or
• To develop new theories or further refine and test existing theories.

We will eventually propose a classification of different types of research goals to present a more systematic picture of different kinds of research objectives. But before doing so, we must address a more basic question: what are the factors that come into play when a researcher identifies a particular research goal? Research goals do not drop from the skies! What makes a researcher decide to opt for a descriptive goal rather than an explanatory one? Which factors play a role in determining a choice for or against evaluating social interventions? I shall argue that there are at least two factors that codetermine decisions about the research goal: the researcher’s background knowledge of the particular topic and his or her cognitive interests. In terms of our model in part 1, the first factor refers to the epistemic dimension, or existing stock of knowledge, and the second to the sociological dimension, or the social and biographical context of research. I shall discuss each of these factors briefly before returning to the actual differences in types of research goals.

Background Knowledge (The Epistemic Dimension)

The state of existing knowledge on the phenomenon to be researched is an important factor in deciding on the specific goals or objectives of a project. The existence of a well-established tradition of previous studies on a specific topic would suggest one kind of research objective. If, on the other hand, there is little or no previous research on the topic, a different kind of research objective would be more appropriate. Where there is a well-established and long tradition of research in a given sphere, new studies usually aim to test the existing theories and explanations. We will refer to such studies as validational or confirmatory studies. In cases where very little previous research has been conducted, the researcher will typically attempt to collect new data and develop new hypotheses to explain such data. We will refer to such studies as being primarily exploratory.

But we need to be more specific when referring to ‘existing knowledge’. For purposes of this discussion we shall distinguish between two kinds of knowledge, namely descriptive and explanatory knowledge. Descriptive or factual knowledge, which includes data, facts, empirical generalisations, narratives and stories, provides truthful descriptions of phenomena in the world. Descriptive statements make claims about how things are; what the actual state of affairs or fact of the matter is. Explanatory knowledge, which includes theories, interpretations and models, makes causal claims about the world. Explanatory statements suggest plausible explanations of why things are as they are; what the causes of events or the causal mechanisms behind change are. It should be obvious that it is easier to substantiate purely factual or descriptive claims about the world, for instance the claim that ‘fifty-five per cent of a sample opposed legalised abortion’. Explanatory or theoretical claims, on the other hand, are much more difficult to confirm. For example, a theoretical statement which claims that ‘the reason for the majority being opposed to legalised abortion is the dominance of Catholicism in a particular country’ will require a host of evidence, including evidence which rules out other possible rival explanations.

Our discussion thus far has produced two distinctions:
• A distinction between exploratory and confirmatory studies, which is a function of the state of background knowledge.
• A distinction between descriptive and explanatory studies, which is a function of the difference between kinds of knowledge.

When these distinctions are cross-tabulated, they suggest the typology of research studies in the social sciences illustrated in the figure below.

Fig. Typology of research studies in the social sciences.

The numbers in the diagram refer to four generic kinds of studies. We have to emphasise, though, that these four kinds of studies are very broad ideal-types. The aim of the typology is to make us aware of certain distinctions that are helpful in identifying and understanding different kinds of studies in social research.

Exploratory Studies (1)

The aim of such studies, which would include pilot studies and other kinds of qualitative research, is to establish the ‘facts’, to gather new data and to determine whether there are interesting patterns in the data. Tan, Li and Simpson’s study of the influence of American television programmes on the formation of stereotypes amongst foreign audiences is closest to this category.

Replication Studies (2)

Where there is already a well-developed body of evidence or knowledge on a topic, it is sometimes important to replicate and validate previous findings. This is done for various reasons, but mainly to establish whether the same results will be obtained with different samples of subjects under different conditions and time frames. Hill’s replication of Schuman’s work on the relationship between study time at university and grades achieved is an example.

Hypothesis-generating Studies (3)

Empirical findings, as expressed in patterns and generalisations, have to be explained, and this is precisely the aim of the bulk of empirical research: to generate plausible explanations or accounts in the form of hypotheses.

Theory-testing Studies (4)

In certain areas there are well-established and highly plausible theories, such as modernisation theory, socialisation theory and social learning theory. A significant proportion of empirical research is aimed at testing and validating such theories.

The second factor that influences the formulation of the research problem is the researcher’s cognitive interests.

Cognitive Interests (The Sociological Dimension)

‘Cognitive’ interests are those factors that motivate or drive the researcher to undertake a particular study. What are his or her reasons for undertaking the particular study? Some of these reasons might be very specific to the individual concerned, while others might be more closely linked to institutional and other social concerns. Individual interests could include ‘mundane’ objectives such as getting a post-graduate degree or undertaking a research project under contract. Institutional interests would be interests that are linked to the researcher’s institutional ‘home’ or base. Research is undertaken in various institutions and environments such as academic departments, think-tanks, government-funded bureaus, commercial research houses and development agencies. As argued in chapter, the institutional context of research affects research in many ways. This discussion is especially pertinent to the way in which the broader social context influences and even determines the formulation of research problems.

In research textbooks it has become accepted practice to distinguish between predominantly basic or academic research on the one hand, and predominantly applied research on the other, in order to demonstrate how different kinds of interests affect the problem formulation. This distinction is clearly one of degree; it is more a question of perspective and intention than of black and white. The main purpose of the more basic or academic research is to contribute to the existing body of scientific knowledge. This is not to say that such research does not aim to make a contribution to our understanding of the social world, but the focus, the point of departure, is in the World of Science. Typically such research consists of asking questions such as the following: Is this theory or model correct? How can we improve our understanding of X? How do we test this hypothesis? In what areas of my discipline or research domain are there clear deficiencies and a lack of data?

Fig. Basic and applied research.

The more applied research, such as policy research and social problems research, takes a certain problem in the social world as its point of departure. Its primary purpose is to solve a social problem or to make a contribution to real-life issues. Examples would be: How do we solve the housing shortage in this area? Has this literacy programme been effective in this rural area? What are the causes of unemployment in this town?

RESEARCH DESIGN

In our analogy between research and travel we compared the research design of a project to a journey planner or itinerary. If we consider what goes into the planning of a journey, we get some idea of the functions of a research design. Having decided on my destination, I must, as a traveller, consider the best route by means of which to reach it. ‘Best’ implies taking into consideration factors such as the time of year, costs, mode of transportation and the route. In planning the journey I am constrained by what my travel agent and I know about the route from existing maps and guidebooks, and by resources like time and money. But I shall eventually produce an itinerary that meets my needs and which I will follow to reach my destination. A research design is like a route planner: it is a set of guidelines and instructions on how to reach the goal that I have set for myself.

The notions of ‘plan’ and ‘design’ are commonly used in the construction industry, where drawing up building plans or architectural designs invariably precedes the actual construction of a building. Similarly, the research design could be viewed as the ‘blueprint’ of the research project that precedes the actual research process. Building plans and blueprints are step-by-step outlines of what needs to be done. They specify the materials and the specifications according to which they are to be used, the critical deadlines against which particular stages must be completed, and so forth. In the construction industry such plans and designs are necessarily very detailed. But in other spheres of everyday life, designs differ in their degree of articulation and detail.

Someone who wishes to design a new dress might have only a very general ‘picture’ or ‘idea’ in her mind of how the design should look. An artist who composes a new picture might have only the faintest of ideas when starting out. Designs differ in terms of detail and finality. A major determining factor in this process is the degree of control that is deemed necessary for the project. Control and planning are paramount considerations where the risk of error is high and disastrous consequences could result if things are not well planned. This is typically the case where lives are at risk, as in the construction of a bridge or an apartment building. In other cases, where the question of risk is less important, designs are more flexible, open-ended and less fixed.

Although we do not usually encounter similar degrees of risk in the social sciences, the distinction remains valid. Certain areas of pharmaceutical and medical research are obviously high-risk fields. In highly structured research, such as experimental designs aimed at testing research hypotheses, the design is a framework of clearly formulated decision steps. In a semi-structured and open-ended project, such as qualitative exploratory research aimed at developing new hypotheses, the design is more flexible. We can once again illustrate this with our analogy of a journey. If I embark on a business trip which will hopefully see the conclusion of months of extended negotiations and the closing of a multi-million rand contract, I will ensure that all my travel arrangements are taken care of well in advance, even to the extent of making contingency plans should something go wrong. In contrast, if I am embarking on a leisurely trip along the West coast and I intend to explore the area in my own time, questions of meeting deadlines, making bookings and contingency plans may not even arise.

This discussion illustrates the importance of the research problem in a project. The structure and particular logic of a research design is determined by the formulation of the research problem. The degree of structure in our design will be a direct function of the research goals that we have set for ourselves.

Research Design as Maximising Validity

Our construction and travelling examples have shown that the need for design and planning is most urgent when errors and mistakes have to be eliminated. This even applies to research where the design is relatively open-ended. Although it is seldom possible to plan a project in such detail that all error will be eliminated, it is usually possible to identify certain typical threats to validity and to adjust one’s design accordingly. The rationale for a research design is to plan and structure a research project in such a way that the eventual validity of the research findings is maximised through either minimising or, where possible, eliminating potential error.

The Notion of Validity

In chapter we argued that ‘validity’ should be viewed as a synonym for ‘best approximation to the truth’. Very briefly, our argument there was that, although scientists work under the epistemic imperative or search for truth, there are various ontological and sociological constraints that seriously curtail the attainment of this ideal, except in the simplest cases of singular descriptive statements. This does not mean that we should abandon the search for truth, only that we have to accept that at best our research can only produce better or worse approximations of the truth. But merely setting our aims lower – from attaining absolute truth to aiming at an approximation of the truth – does not by itself clarify the notion of ‘validity’. The key question is: how does one recognise valid research? How does one know when one study is more valid than another? These are essential questions, because unless we have a clear idea of what the criteria for ‘validity’ are, there is no sense in defining the function of research designs as ‘maximising validity’.

Our approach is as follows: first, we have to identify the key dimensions of validity. Secondly, we argue that the only feasible way to ‘maximise validity’ is by either minimising or eliminating all foreseeable threats to validity in the research process. The above discussion suggests that we should regard ‘validity’ as a criterion that is applicable to the whole research process. One way to do this is to look at the change in the meaning of ‘validity’ when applied to each of the main stages in the research process. In the following discussion of the stages in the research process, we shall identify, in each case:
• The stage in the research process;
• The major sources of error;
• The particular ‘outcomes’ or ‘products’ of that stage in the research process; and
• The appropriate criterion of validity as it applies to that outcome.

We need to emphasise that ‘validity’ is an epistemic criterion, which means that it is a quality of the elements of knowledge. These knowledge elements are the products of the various stages of decision making in research. The actual process of decision making is more or less objective; objectivity is a criterion of the process, which means that it is a methodological criterion. We would therefore argue that research uses relatively objective methods when conceptualising, sampling, defining, analysing and collecting data. Let us elaborate. In each case in the discussion below, we:
• Define what is understood by the particular stage in the research process;
• Identify the epistemological criterion; and
• Identify the methodological criteria appropriate to the particular stage.

Conceptualisation

‘Conceptualisation’ refers both to the clarification and analysis of the key concepts in a study and to the way in which one’s research is integrated into the body of existing theory and research. As far as the first meaning is concerned, ‘conceptualisation’ is synonymous with ‘conceptual analysis’ and involves the clear and unambiguous definition of central concepts. ‘Conceptualisation’ also refers to the underlying theoretical framework that guides and directs the research. When the research question or problem is formulated in the form of a research hypothesis, two of the important epistemic criteria are empirical testability and explanatory potential. The question surrounding empirical testability is whether one can foresee or even indicate how the hypothesis will be tested. The question of explanatory potential refers to the degree of theoretical support or embeddedness enjoyed by the hypothesis. If the hypothesis is derived from an established theoretical framework that has been successfully applied to explain similar phenomena in the past, this strengthens the conclusions that could be drawn from the research. If not, the particular hypothesis has still to ‘prove itself’. The ‘outcome’ of the conceptualisation phase is a research hypothesis which should, I suggest, meet the criterion of ‘theoretical validity’.

Operationalisation

During the process of operationalisation a measuring instrument, such as a questionnaire or scale, is developed. Ideally, this instrument constitutes a valid measure of the key concepts in the research question. The outcome is a measuring instrument and the predominant epistemological criterion is measurement validity. It has become customary to distinguish aspects or dimensions of measurement validity such as face validity, construct validity, criterion validity and predictive validity. What methodological criteria are applicable in the construction of valid measuring instruments? We shall mention a few. Firstly, the population from which one selects items to construct the instrument must be exhaustive with regard to the phenomenon being investigated. Secondly, the categories used in the scale or questionnaire must be unambiguous and mutually exclusive. Thirdly, scales must meet the criterion of unidimensionality, which means that a single scale cannot be used to measure two or three different dimensions or aspects of a phenomenon.

Sampling

During the process of selecting or sampling, the aim is to obtain a sample that is as representative as possible of the target population. Representativeness is the underlying epistemic criterion of a ‘valid’, that is, unbiased sample. The methodological criteria applied in the process of sampling are: a clear definition of the population, systematic drawing of the sample, drawing probability rather than non-probability samples, and observing the advantages of multi-stage over simple random sampling.
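To illustrate how a probability sample can be drawn systematically, the sketch below (a minimal illustration using Python's standard library; the sampling frame, the stratum names and the sample sizes are invented for demonstration) contrasts a simple random sample with a proportionate stratified random sample drawn from the same frame.

```python
import random

random.seed(42)  # reproducible illustration

# Hypothetical sampling frame: 1 000 staff members in three strata
frame = (
    [("executive", i) for i in range(100)]
    + [("administrative", i) for i in range(300)]
    + [("operational", i) for i in range(600)]
)

# Simple random sample: every member of the frame has an equal chance
simple_sample = random.sample(frame, 100)

# Proportionate stratified random sample: sample within each stratum
# so that the stratum proportions of the population are preserved
def stratified_sample(frame, total_n):
    strata = {}
    for stratum, unit in frame:
        strata.setdefault(stratum, []).append((stratum, unit))
    sample = []
    for stratum, units in strata.items():
        n = round(total_n * len(units) / len(frame))
        sample.extend(random.sample(units, n))
    return sample

strat_sample = stratified_sample(frame, 100)

def composition(sample):
    counts = {}
    for stratum, _ in sample:
        counts[stratum] = counts.get(stratum, 0) + 1
    return counts

print("simple random:", composition(simple_sample))
print("stratified:   ", composition(strat_sample))
```

The stratified sample is guaranteed to mirror the population proportions (10, 30 and 60 units from the three strata), whereas the simple random sample will only do so approximately.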

Data Collection

During data collection, the researcher collects various kinds of empirical information or data, for instance historical, statistical or documentary data. This is accomplished through various methods and techniques of observation such as document analysis, content analysis, interviewing and psychometric testing. There are a number of methodological criteria that ought to be followed during the process of data collection.

These include the suspension of personal prejudices and biases, the systematic and accurate recording of observations, the establishment of trust and rapport with the interviewee, and the creation of optimal conditions, in terms of location or setting, for the collection of the data. The outcome of the process is a set of data or empirical information and the epistemological criterion is that of reliability. We aim to produce reliable data. This means that if we were to use the same measures and hold the conditions under which the data are collected as constant as possible, we should get the same data from situation to situation. Reliability is hence synonymous with stability or consistency over time.

Analysis and Interpretation

We analyse data by identifying patterns and themes in the data and drawing certain conclusions from them. The methodological criteria for analysis include using statistical techniques appropriate to the level of measurement and drawing inferences according to the principles of statistical inference. The outcome of the analysis or interpretation is a set of conclusions, which must follow logically from the empirical evidence if they are to be regarded as ‘valid’ results or conclusions. The discussion thus far can be summarised in what I will refer to as the validity framework, as reflected in the table below.

Table. The validity framework.

Conceptualisation (conceptual analysis)
  Sources of error: complex notions; vagueness; ambiguity; abstract concepts
  Methodological ‘move’ or ‘strategy’ (objective research): thorough literature review; clear and logical definitions
  Outcome/goal/end-product: concepts/definitions
  Epistemic (validity-related) criterion: theoretical validity (clarity/scope)

Operationalisation
  Sources of error: poor sampling of items; leading questions; scaling errors
  Methodological ‘move’ or ‘strategy’: scale validation; face validity; pilot test
  Outcome/goal/end-product: measuring instruments
  Epistemic (validity-related) criterion: measurement validity (construct validity)

Sampling
  Sources of error: bias; heterogeneous populations; incomplete sampling frame
  Methodological ‘move’ or ‘strategy’: probability sampling; stratification; optimal sample size
  Outcome/goal/end-product: sample
  Epistemic (validity-related) criterion: representativeness

Data collection
  Sources of error: observation effects; interviewer bias; respondent bias; context effects
  Methodological ‘move’ or ‘strategy’: multi-method approach; proper training of fieldworkers
  Outcome/goal/end-product: data sets
  Epistemic (validity-related) criterion: reliability

Analysis/interpretation
  Sources of error: competing/rival conclusions or explanations
  Methodological ‘move’ or ‘strategy’: appropriate techniques of analysis; thorough understanding of the literature
  Outcome/goal/end-product: conclusions/results/findings
  Epistemic (validity-related) criterion: inferential validity

CONCEPTUALISATION (DEFINING KEY CONCEPTS)

The term ‘conceptualisation’ is used here as a synonym for ‘conceptual analysis’ or ‘conceptual explication’. Assume that a researcher has decided to conduct a study to establish the relationship between political conservatism and racial prejudice. Even a person with no training in the social sciences would know that the concepts ‘conservatism’ and ‘racial prejudice’ have many connotations. In our everyday life these concepts form part of commonly held attitudes and value orientations. In the language game of the social sciences, the concepts have become embedded in a variety of models and theories in sociology and political science. It is obvious that familiarity with the most important theories relating to the research problem is an essential precondition for an adequate conceptualisation.

One of the most striking characteristics of theories in the social sciences is the incidence of highly abstract and multidimensional concepts. In the social sciences, concepts such as values, culture, solidarity, maturity, meaning, power, peace, revolution, alienation, anomie, structure, function, rite, religion, depression, social distance, anxiety, aggression, motivation, intelligence and success are unavoidable. Many of these concepts have their roots in the world of social sciences research and are therefore usually linked exclusively to certain theories or models. However, even concepts such as power, freedom and revolution, which are part and parcel of everyday life and language, acquire new meaning when they become integrated into a theory in the social sciences such as, for example, that of Karl Marx. The fact that concepts acquire meaning, or even new meaning, within a conceptual framework such as a theory, a model or a typology, has led philosophers of science to refer to such concepts as ‘theoretical concepts’ or ‘constructs’. In chapter, we shall demonstrate that the aim in empirical research is to operationalise such constructs meaningfully by rendering them either measurable or observable. In the next section, we shall discuss in some detail how a highly theoretical sociological concept such as ‘alienation’ can be explicated by means of theoretical definition, after which the notion of ‘theoretical validity’ will be defined more clearly.

Although Hegel was the first author to use the term ‘alienation’ in a theoretically interesting manner, Karl Marx is generally accepted as the first person to have developed a consistent and systematic theory about the concept. Marx endorsed Hegel’s view that alienation is a reality that arises when an individual feels that he or she has lost control. However, Marx differed from Hegel, Feuerbach and others in his view of the origin of alienation. He believed that it stemmed from economic factors, and more specifically that it was a consequence of capitalism:
• In what does alienation consist? First, that the work is external to the worker, that it is not a part of his nature, that consequently he does not fulfil himself in his work but denies himself … His work is not voluntary but imposed, forced labour. It is not the satisfaction of a need, but only a means of satisfying other needs. The object produced by labour, its product, now stands opposed

to it as an alien being, as a power independent of the producer … The performance of work is at the same time its objectification. This performance appears, in the sphere of political economy, as a vitiation of the worker, objectification as a loss and as servitude to the object, and appropriation as alienation.

One can only really understand this paragraph against the backdrop of Marx’s emphasis on the importance of human beings as labourers or makers. A human being attains self-realisation through labour or productivity. According to Marx, the capitalist system, as it existed at the time of his writing, resulted in human beings being alienated from the product of their labour by a system of unequal and unjust relations of production. This system therefore separated people into two clearly identifiable classes: the owners and the workers. The fundamental inequity of the system stems from the structure of the production process. In relative terms, according to Marx, the worker contributes more to the actual production process, while the owner derives a far greater benefit. The worker’s productive ability is reduced to an object or thing; in other words, it is regarded as simply one more commodity on the market. Alienation therefore inevitably results when a quality which is intrinsic to the existence of man is reduced to a mere object or commodity.

The first clear definition of ‘alienation’ was therefore encountered in Marx’s economic theory. Despite the fact that it is a highly theoretical and abstract concept, we now have a clearer grasp of its meaning. The reason for this improved understanding is that the relationships between ‘alienation’ and better-known concepts such as ‘labour’, ‘production relationships’ and ‘inequality’ have been clarified within the framework of a theory. These concepts are obviously still highly abstract terms. Nonetheless, the fact that the term ‘alienation’ has been embedded in a network of other related concepts leads to a more precise definition of its meaning.

In 1959 Melvin Seeman published an article entitled On the meaning of alienation in which he further elucidated the notion of ‘alienation’. His point of departure was that it was possible to define modern mass society more clearly by emphasising five essential structural elements, namely:
• The development of impersonality and a reduction of relationships as a result of differences in status;
• The development of a bureaucracy that leads to secularisation;
• An increase in social differentiation and job specialisation;
• Increasing social mobility; and
• An increase in scale or size.

According to him, these five elements are fundamental to three factors that are relevant to alienation, namely loss of control over work and product, lack of integration within large organisational structures, and a low level of accessibility to reward values.

Seeman maintained that the objective alienation in mass society eventually leads to five socio-psychological phenomena:
• Powerlessness;
• Normlessness;
• Isolation;
• Self-estrangement; and
• Meaninglessness.

Each of these five phenomena is subsequently defined in greater detail. Powerlessness refers to an individual’s perception that he does not have complete control of his behaviour. Normlessness refers to the perception that socially unacceptable behaviour is necessary for the attainment of specific goals. Meaninglessness may be defined as a low expectation of being able to make meaningful predictions about future consequences of behaviour. Isolation is a tendency to attribute limited value to convictions or ideals that are typically highly valued. Self-estrangement indicates a degree of dependence on specific forms of behaviour for expected future consequences of behaviour. This is obviously a coherent theory: an explanation of the causes of alienation is provided.

As demonstrated by Marx, conceptual analysis by means of theoretical definition clearly involves explication of the concept by using other concepts that are sometimes more familiar. In the case under discussion the concepts of powerlessness, normlessness, meaninglessness, isolation and self-estrangement were used. In other definitions of alienation, different dimensions of the concept are emphasised. Keniston emphasises the distinction between alienation from society and self-alienation:
• In societies in which the transition from childhood to adulthood is unusually painful, young people often form their own youth culture with a special set of anti-adult values and institutions, in which they can at least temporarily negate the feared life of the adult … alienation of man from his own creative potentialities, embedded in his fantasy life.

In his typology of the dimensions of alienation, Stroup included indifference, isolation, self-estrangement, powerlessness, loneliness, meaninglessness, disenchantment and anonymity. The attempts of other scientists to define ‘alienation’ more precisely in different theories and typologies could also be referred to. However, these examples will suffice.

Theoretical Validity

What is involved in the theoretical definition of concepts? Concepts, or rather constructs, such as alienation typically have many ‘shades’ of meaning – a variety of connotative elements. Theoretical concepts are rich in connotation. One could use the analogy of a field of meaning to illustrate this idea; Wittgenstein uses the term ‘family of resemblances’. Within a given field of meaning, certain dimensions or aspects of meaning are more closely associated than others. Together, these dimensions within a field of meaning constitute the connotation ascribed to the concept. The relationships between these dimensions of meaning are not a matter of coincidence – they are not simply given. It is only within the framework of a theory or model that such relationships are systematically defined. And this is the function of a theoretical definition: to arrange or logically systematise the most important dimensions of the meanings of theoretical concepts. In this context, to arrange logically implies that the logical rules of correct classification, and the rules of mutual exclusion and exhaustiveness, have to be adhered to.

This can be explained as follows. Assume that we needed to develop a classification of types of societies on the basis of their levels of development. The ‘principle of classification’, or the ‘dimension along which’ societies are classified, is ‘level of development’. The classification that we apply is the following: industrialised societies, agrarian societies and high-technology societies. Obviously this simple typology would not be acceptable, because there is a large degree of overlap between the first and third categories; they are not mutually exclusive. This is one way of saying that the principle of classification has not been adhered to: the distinction between ‘industrialised societies’ and ‘high-technology societies’ is not sufficiently clear because they both cover a similar part of the dimension of ‘level of development’.

Using the example of alienation, we have been able to demonstrate that a good theoretical definition implies that the essential dimensions of the meaning of a concept have been identified, and that, as far as possible, these dimensions are mutually exclusive. At face value, Stroup’s typology would appear to be lacking in terms of compliance with the second requirement, in that the dimensions of isolation and loneliness, and also what he calls indifference and disenchantment, could be regarded as overlapping categories. On the other hand, Seeman’s five dimensions appear to be valid, exhaustive and mutually exclusive categories, even on cursory inspection.

But the notion of ‘theoretical validity’ should not be confined to conceptual clarity. Rose, following Phillips, introduced the term ‘internal theoretical validity’ and listed three characteristics of acceptable theoretical explication: ‘clarity’, ‘scope’ and ‘systematic import’. Rose described each of these terms as follows:
• Clarity is the concept’s potential for leading to indicators, which depends on the degree to which it implies a chain of lower-level concepts; scope is the breadth of the class of phenomena to which the concept applies; and systematic import is the extent to which the concept is used in propositions and theories.

We rarely, if ever, judge a theory solely in terms of conceptual clarity. We also ask whether it explains a wide range of phenomena and how well integrated the various concepts and statements of the theory are.
Because the connotative and denotative dimensions of concepts are so closely related, the ultimate test of a theory, model or typology is the extent to which it leads to valid information on the phenomena that it is supposed to describe or explain.

Critical Assignment

In reading 1, Giorgi defends a specific definition of the concept of ‘religiosity’ and also propounds a distinction between ‘religiosity’ and ‘the religious’. Reconstruct and summarise her arguments. Why is it important to Giorgi to make this distinction?

CONCEPTUALISATION (FORMULATING RESEARCH HYPOTHESES)

What do we mean when we refer to the ‘conceptualisation’ of a study over and above the definition of our key concepts? In this chapter we will demonstrate that ‘conceptualisation’ also involves embedding or incorporating one’s research into the body of knowledge that is pertinent to the research problem being addressed. To do this, the researcher must first do a thorough literature search of previous theoretical and empirical work in this field and then relate her work to the existing literature. What does this mean and why are literature reviews important?
• A literature review serves as a ‘map’ or ‘maps’ of the terrain. With reference to our analogy of the journey, we have to realise that other researchers have ‘travelled’ this way before. In areas where there has been a concentrated focus on a specific phenomenon, a researcher has an obligation to acquaint herself with any publications on major research already conducted in the field, the most widely accepted theoretical positions and the most recent debates.
• A review of previous research also provides guidelines, or at least suggestions, on the design of one’s own project. By studying previous studies on a particular topic, one not only learns about the ‘maps’ and ‘guidebooks’, but also about the ‘itineraries’, that is, the different ways that people have travelled this terrain.
• An intensive study of the existing body of knowledge yields various kinds of ‘resources’. These include conceptual resources, such as useful theoretical formulations or definitions of key concepts that are encountered in a specific field; methodological resources, such as a reliable and valid scale or questionnaire; and appropriate examples of qualitative and quantitative techniques.
• Literature searches are sometimes done by researchers who intend to replicate previous research. In such cases one is interested in both the methodology and the substantive results of previous research.
• Finally, anyone planning to research a field which has hitherto enjoyed limited attention, either worldwide or locally, can learn a great deal by studying related fields and the designs and methods used there.

When we refer to a literature review as a kind of ‘research map’, we must bear in mind that there are different kinds of research maps. When planning a journey we have a choice of any number of maps, including large-scale maps of countries, detailed town maps, guidebooks to countries and cities, and guidebooks that provide historical information as well. Similarly, there is a range of resources from which to choose when a literature review is undertaken. There are specific resources for specific research reviews. A simple classification of these resources and their main applications is given in the table below.

Table. Sources for literature searches.

Dictionaries, encyclopaedias, textbooks
  1. Provide standard definitions of central concepts in a discipline.
  2. Provide descriptions of main research areas.
  3. Provide useful historical overviews of main figures and traditions in a discipline.

Annual reviews, state-of-the-art reviews (usually in special editions of journals)
  1. Usually include authoritative reviews of the most prominent theories and research in specific problem areas.
  2. Include succinct statements and discussions of key debates and issues.

Monographs/books
  1. Comprise extended and well-contextualised in-depth studies on topics.
  2. Normally include well-documented theoretical arguments on key issues in current debates.

Journal articles
  1. Include topical discussions on the latest theoretical and empirical issues.
  2. Include brief reports on key findings and new advances in methodology.
  3. Include book reviews and discussions. Articles are useful ‘second-order’ introductions to primary sources.

How does this classification relate to our discussion of types of research in chapter? In broad and somewhat simplified terms, the following guidelines apply: the more exploratory and open-ended the study, the more useful it will be to look at general sources such as encyclopaedias and review articles, and also to use broad search strategies. For more validational and structured studies, it is more useful to consult subject literature such as books and journal articles and to use more focussed search strategies. The level and context of the research project is an equally important consideration. In chapter we shall discuss, in greater detail, the fundamental differences involved in writing undergraduate assignments, master’s theses, doctoral dissertations and journal articles. We shall discuss the differences in the kind of literature review required for each of these situations, and also the style of reporting.

In summary, then: a survey of the literature is an essential component of any study because it is the main access point or gateway to the relevant body of knowledge. Through reading and studying the work of other academics, we learn how to improve our own research methods, ask the right questions and identify potentially useful answers to such questions. In fact, the body of literature is best viewed as a three-dimensional space that can be explored by a researcher to best locate and position her own work. In the previous chapter we focussed on the issue of conceptual analysis and the meaning of conceptual or theoretical validity.

We showed how a study on alienation would benefit from a review of the theories on alienation and specifically what it means to arrive at a conceptually valid definition of highly abstract concepts. Another important reason for reviewing the literature is that it provides ideas, hypotheses and conjectures for one’s own research. The remainder of this chapter is devoted to a discussion of the distinction between different kinds of hypotheses and between hypotheses, assumptions and postulates.

Kinds of Hypotheses

Scientific statements differ with regard to the degree of evidentiary support that they enjoy. In this respect it is customary to distinguish between ‘hypothetical statements’ or ‘hypotheses’ and ‘substantiated or validated statements’, to which we usually refer as ‘empirical statements’. When we first formulate a statement without knowing whether we have any empirical warrant to accept it as reasonably valid or even true, we call this a hypothesis. A hypothesis is a statement that makes a provisional or conjectural knowledge claim about the world. A ‘good’ hypothesis is empirically testable, which means that we must be able to specify clearly what data would support or refute it. Before discussing the criteria for a ‘good hypothesis’, we must distinguish between different kinds of hypotheses.

Existential and Relational Hypotheses

Hypotheses can be classified into two main groups, namely existential and relational hypotheses. An existential hypothesis is a provisional statement about a certain state of affairs, that is, it makes a claim that something is the case. For example:
• Sixty per cent of rural South Africans are functionally illiterate.
• More than seventy per cent of all South Africans are opposed to legalised abortion.
• Durban has the highest crime rate in the country.

Statements such as these are claims that a certain entity has a certain property and about what the value of that property is. Existential hypotheses are more common in exploratory research, where the main purpose is to find out what the case is. Relational hypotheses postulate that a certain kind of relationship exists between two or more variables. It has become customary to distinguish further between correlational hypotheses and causal hypotheses, depending on the kind of relationship that is being postulated. A correlational hypothesis might claim that there is a positive or negative relationship between people’s educational level and their tolerance of other people, or that there is a relationship between stress and productivity. The first hypothesis postulates a positive relationship, namely that tolerance of other people or groups increases as level of education increases. The second hypothesis postulates a negative relationship, namely that, as stress in the workplace increases, workers become less productive.

Singular and General Hypotheses

Another classification of hypotheses addresses the scope or range of the hypothesis, depending on whether it applies to only one case or to a class of cases. A singular hypothesis is a claim about one specific instance or case, for example: John is a type A person, or Mary is a very bright individual. General or ‘universal’ hypotheses make claims about classes of people, for example: people who have high levels of anxiety are more likely to resort to suicide. This discussion is summarised in the figure below.

Fig. Kinds of hypotheses.

Hypotheses, Assumptions and Postulates

It is important to realise that the epistemic status of a statement is something that varies. Hypotheses and conjectures make claims that require substantiation and cannot therefore claim the same epistemic status as well-established and entrenched scientific theories and models. By the same token, explanations that may thus far have been regarded as fairly plausible accounts of the world may suddenly become suspect, or at least less plausible, and eventually even be rejected because of new evidence.

The discussion of hypotheses also allows us to define ‘postulates’ and ‘assumptions’ in science. Assumptions and postulates have the same epistemic status as hypotheses in that they are also ‘hypothetical’ or ‘conjectural’ statements. The critical difference, though, is that researchers choose not to submit assumptions and postulates to empirical testing. Their truth is accepted, at least for purposes of the investigation at hand. The reasons for this differ for postulates and assumptions. Postulates or axioms are usually accepted as statements that are self-evidently true. Postulates, sometimes also referred to as ‘principles’, of causality in physical nature, or of rationality in human behaviour, are examples. Postulates are usually general principles that are accepted as being applicable to all human behaviour and hence regarded as self-evidently true. They function as ‘first principles’ in a system of derived propositions.

Assumptions and presuppositions function as essential background beliefs that underlie other decisions in the research process. Examples would be assumptions about the nature of the population to be investigated, the most appropriate design for the investigation, or the best definition of the phenomenon. In terms of the model of science developed in part 1 of the book, one way of classifying assumptions would be in terms of the dimensions of science. This would lead to the following typology:
• Epistemological assumptions are assumptions about the nature of knowledge and science, or about the content of ‘truth’ and related ideals.
• Ontological assumptions include assumptions about human nature, society, the nature of history, the status of mental entities, observable and material phenomena, and causality and intentionality in human action and behaviour.
• Methodological assumptions are assumptions about the nature of the research process and the most appropriate methods to be used, about the relative worth of quantitative and qualitative methods, about interpretation versus explanation, and about the ideal of universal statements versus specific and ‘local’ generalisations.

OPERATIONALISATION

Operationalisation, or operational definition, consists of the development of a measuring instrument by means of which accurate data about specific phenomena can be obtained. Let us take the example of a fairly abstract concept in social science, namely ‘alienation’. In this case, operationalisation would involve the development of a measuring instrument to collect reliable data about the phenomenon called ‘alienation’. The aim of the study could vary: the researchers might need to determine the extent to which alienation may be regarded as a characteristic of a certain group of people, such as marginalised street children or highly stressed businesspeople. Another aim may be to determine whether an existing theory or theories provide a correct interpretation of alienation. Irrespective of the specific research aims and the unit of analysis to be chosen, or even of whether the approach will be qualitative or quantitative, the concept of ‘alienation’ must be rendered measurable.

But how do we ‘make’ a concept measurable? It would obviously be quite absurd to approach individuals and to ask them whether they are alienated. Similarly, taking up a position on a street corner or in a factory and trying to observe whether people are alienated would be equally ridiculous. The obvious and most common approach would be to collect data on the theoretical concepts by means of indirect measurement. This would involve compiling a list of questions or items that are assumed to be elements of the phenomenon called ‘alienation’ and presenting them to a sample of individuals in an interview situation. If, for instance, one were to administer twenty items that deal with aspects of alienation, it ought to be possible to gain an overall impression of the person’s position with regard to the phenomenon. The process of operationalisation involves compiling, for purposes of measurement, a list of characteristics denoted by the concept. When a measuring instrument is constructed, the items or questions are regarded as indicators of the list of denoted characteristics. The most commonly used indirect measurement technique in the quantitative tradition is scale construction.

Dean’s social alienation scale, which is based on Seeman’s typology, can be used to illustrate what the process of operationalisation involves. Dean regarded three of Seeman’s dimensions as most typical of the construct ‘alienation’: powerlessness, normlessness and what he referred to as social isolation. He subsequently formulated a number of questions relating to each of these dimensions which he believed would, in combination, define the dimension more clearly. The item format was such that each item had to be rated on a five-point scale, namely strongly agree, agree, uncertain, disagree and strongly disagree. Item scores ranged between 4 and 0. Five of the items were negatively worded, necessitating a reversal of the scoring pattern. Subscale scores were used to determine an individual’s level of powerlessness, normlessness and social isolation, while a total scale score was used to determine his or her overall level of alienation. According to the scheme used, a score of 96 would indicate a maximum level of alienation, with 48 representing a neutral score. For illustrative purposes, a few items from each subscale are reproduced here:
• Social isolation:
  – “Sometimes I feel all alone in the world”.
  – “Real friends are as easy to find as ever”.
  – “There are few dependable ties between people any more”.
  – “People are just naturally friendly and helpful”.
• Powerlessness:
  – “I worry about the future facing today’s children”.
  – “There are so many decisions that have to be made today that sometimes I could just blow up”.
  – “There is little chance for promotion on the job unless a man gets a break”.
  – “We are just so many cogs in the machinery of life”.
• Normlessness:
  – “People’s ideas change so much that I wonder if we’ll ever have anything to depend on”.
  – “Everything is relative, and there just aren’t any definite rules to live by”.
  – “With so many religions abroad, one doesn’t really know which one to believe”.

The content and nature of the measuring instrument is determined by a range of factors, including the formulation of the problem, the methodological preferences of the researcher and the nature of the phenomenon. If the phenomenon of alienation were to be studied amongst a smaller group of people, the researcher would be likely to employ qualitative methods such as in-depth interviews and participant observation. Manifestations of alienation as they occur in literature or in the media, such as newspapers or letters in newspapers, could be investigated by means of one of the forms of content analysis. Studies of a more quantitative nature on alienation would probably be conducted by means of some form of interview schedule or questionnaire.

The central concepts in an investigation must be operationalised, regardless of the data collection technique that is envisaged. The above example sets out the nature of such operationalisation in a quantitative study. However, even in a qualitative study where, for example, we are interested in investigating the degree of alienation evinced by a group of people displaying pathological behaviour such as rapism, the investigators would need to have a clear grasp of the denotative dimensions of alienation. Without such clarity they would be unable to identify the manifestations of alienation correctly in the unstructured interviews and would hence be unable to collect reliable data on the phenomenon. Similarly, in content analysis the researcher must develop a category system in which the central denotative components of the concept of alienation have been accounted for, before being able to analyse newspaper reports or letters to newspapers.
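As a concrete illustration of how a scale of this kind would be scored, the sketch below is a hypothetical example: the responses, the allocation of items to subscales and the choice of which items are reverse-scored are invented for demonstration and are not Dean's actual keying. It converts five-point ratings to 0-4 scores, reverses the negatively worded items and sums subscale and total scores.

```python
# Hypothetical scoring of a Dean-style alienation scale: 24 items rated on
# a five-point agreement scale, scored 0-4, with negatively worded items
# reverse-scored before summing (maximum total = 96, neutral = 48).

RATING_TO_SCORE = {
    "strongly agree": 4, "agree": 3, "uncertain": 2,
    "disagree": 1, "strongly disagree": 0,
}

# Which items belong to which subscale, and which are reverse-scored,
# are assumptions made for this illustration only.
SUBSCALES = {
    "social isolation": range(1, 10),    # items 1-9
    "powerlessness":    range(10, 19),   # items 10-18
    "normlessness":     range(19, 25),   # items 19-24
}
REVERSED_ITEMS = {2, 4, 9, 15, 21}       # the five negatively worded items

def score_item(item_no, rating):
    score = RATING_TO_SCORE[rating.lower()]
    return 4 - score if item_no in REVERSED_ITEMS else score

def score_respondent(responses):
    """responses: dict mapping item number (1-24) to a rating string."""
    item_scores = {no: score_item(no, r) for no, r in responses.items()}
    subscale_totals = {
        name: sum(item_scores[no] for no in items)
        for name, items in SUBSCALES.items()
    }
    return subscale_totals, sum(item_scores.values())

# Example: a respondent who answers "agree" to every item
responses = {no: "agree" for no in range(1, 25)}
subscales, total = score_respondent(responses)
print(subscales)
print("total alienation score:", total)
```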

Measurement Validity

Important questions at this stage are clearly: when are the operationalisations of concepts or constructs valid? When does an operationalisation comply with the requirement of measurement validity? In the field of measurement theory it has become customary to distinguish amongst several types of measurement validity. These are presented schematically in the figure below.

Fig. Types of measurement validity.

Since there are numerous introductory texts in the field of measurement theory, we shall describe each concept only briefly.

Criterion Validity

According to Nunnally, criterion validity is relevant ‘when the purpose is to use an instrument to estimate some important form of behaviour that is external to the measuring instrument itself, the latter being referred to as the criterion’. An example from everyday life is when we use the number of distinctions attained by matriculants as a predictor of academic achievement at university.

If a high positive correlation were to be found between the number of distinctions and tertiary academic achievement, the former could justifiably be regarded as a good predictor of the latter. This is an example of predictive validity, which is the criterion employed to determine whether the measurement can be used to predict a future situation validly. If, in the example of alienation, it were possible to develop criteria of the manifestations of alienation, it ought to be possible to predict future manifestations by means of an alienation scale. When the criterion and the other measurements are obtained simultaneously, this is referred to as concurrent validity. The following is an example of concurrent validity: if scores on an intelligence test and examination scores were to be obtained simultaneously and found to be highly correlated, the intelligence test could justifiably be regarded as a valid indicator of the examination marks.
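In practice, a criterion validity check of this kind amounts to correlating the instrument with the external criterion. The sketch below is illustrative only: the scores are invented and SciPy is assumed to be available. It estimates predictive validity by correlating the number of matric distinctions with a later measure of university achievement.

```python
from scipy.stats import pearsonr

# Hypothetical data for ten students: predictor measured at school,
# criterion (first-year average) measured later at university
distinctions = [0, 1, 1, 2, 2, 3, 3, 4, 5, 6]
first_year_avg = [52, 55, 60, 58, 63, 67, 70, 72, 78, 83]

r, p_value = pearsonr(distinctions, first_year_avg)
print(f"predictive validity coefficient r = {r:.2f} (p = {p_value:.3f})")

# For concurrent validity the logic is identical, except that the
# instrument and the criterion are measured at the same time.
```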

Construct Validity

Obtaining construct validity is probably one of the most difficult problems in social sciences research. Earlier in this section we referred to the fact that the social sciences are characterised by highly theoretical concepts or constructs that are derived from scientific theories and cannot be inductively inferred from the observation of human behaviour. The methodological problem that arises is the following: how does the researcher really know that the items included in the scale or questionnaire actually measure the construct that they are supposed to represent? How, for instance, can Dean be sure that the 24 items actually measure ‘social isolation’, ‘normlessness’ and ‘powerlessness’? A few examples will serve to illustrate the complexity of the issue. Items 10 and 11 might well also measure something akin to ‘relativism’. For argument’s sake, item 7 could be regarded as a measurement of ‘fatalism’. From the above it follows that construct validity refers to the extent to which a scale, index or list of items measures the relevant construct and not something else.

Cook and Campbell mention three threats to construct validity: “inadequate preoperational explication of constructs, mono-operation bias and mono-method bias”. Under the first heading Cook and Campbell discuss the effect of poor conceptualisation on construct validity: “A precise explication of constructs is vital for high construct validity since it permits tailoring the manipulations and measures to whichever definitions emerge from the explication”. This issue was addressed in the section on conceptualisation. The second and third threats to construct validity are related: mono-operation bias refers to problems that surface when single indicators or measurements of a construct are employed, while mono-method bias refers to problems resulting from the use of the same type of measurement technique for collecting data on the construct that is being investigated. In view of the fact that mono-method bias is discussed in the next section, we shall limit ourselves to a few remarks on the issue at this stage. Cook and Campbell define this concept as follows: “Since single operations both underrepresent constructs and contain irrelevancies, construct validity will be lower in single exemplar research than in research where each construct is multiply operationalised in order to triangulate on the referent”.

With reference to our example of alienation, it ought to be clear that, had Dean used only one item to obtain a scale score, mono-operation bias would have occurred. Although it has become customary to employ multiple-item scales for each construct, there is no denying that far too many attitudinal measurements still rely on single items to measure the respondent’s attitudes to a variety of issues. However, when multiple indicators are used, various techniques can be used to help determine the construct validity of theoretical concepts. One such technique is factor analysis. The following example has been slightly adapted from Krausz and Miller; it is a simple illustration of the principle underlying the use of factor analysis to determine construct validity. Assume that the theory that we are employing contains the constructs ‘status’ and ‘intelligence’.
Assume also that six indicators are used to measure these constructs, namely income, educational level, value of fixed assets, problem-solving ability, figure recognition and reading comprehension.

Fig. The use of factor analysis in determining construct validity.
Basically, factor analysis involves an analysis of the intercorrelations between indicators. In the present example, one would expect high intercorrelations between A, B and C, and also between D, E and F. We would also expect very low or zero correlations between the indicators of status and those of intelligence. Were this pattern of correlations to be found, it would suggest the existence of a common factor underlying A, B and C, and a second factor underlying D, E and F. It is important to note that the application of the factor analysis technique is limited to the identification of factors on the basis of the intercorrelations between indicators. The researcher still has to demonstrate the relationship between factor I and factor II and the underlying theory. It should be clear that ‘demonstrating the relationship’ is a matter of interpretation, and that alternative interpretations could exist. Referring once again to the example of alienation, one would expect Dean to have found high intercorrelations within each of the item clusters 1 to 9, 10 to 18 and 19 to 24, but low correlations between items drawn from different subscales, that is, between the items that measure social isolation, normlessness and powerlessness respectively. Thus far we have limited our discussion to the problems surrounding operationalisation in quantitative studies.
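Before turning to qualitative studies, the factor-analytic logic described above can be illustrated numerically. The following Python sketch generates artificial data in which two latent factors drive the six indicators A to F; an off-the-shelf factor analysis routine then recovers the two expected clusters of loadings. The data, loading values and library choice (scikit-learn) are illustrative assumptions and are not taken from the Krausz and Miller example.

import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
n = 500
status = rng.normal(size=n)        # latent factor I
intelligence = rng.normal(size=n)  # latent factor II

# Indicators A-C load on status, D-F on intelligence, plus random noise.
indicators = np.column_stack([
    status + 0.4 * rng.normal(size=n),        # A: income
    status + 0.4 * rng.normal(size=n),        # B: educational level
    status + 0.4 * rng.normal(size=n),        # C: value of fixed assets
    intelligence + 0.4 * rng.normal(size=n),  # D: problem-solving ability
    intelligence + 0.4 * rng.normal(size=n),  # E: figure recognition
    intelligence + 0.4 * rng.normal(size=n),  # F: reading comprehension
])

# High intercorrelations within A-C and within D-F, near-zero across the two sets.
print(np.round(np.corrcoef(indicators, rowvar=False), 2))

# A two-factor solution recovers the two clusters of loadings.
fa = FactorAnalysis(n_components=2).fit(indicators)
print(np.round(fa.components_, 2))

The interpretation of the two recovered factors as ‘status’ and ‘intelligence’ remains, as the text notes, the researcher’s responsibility.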

Obviously operationalisation, in the more technical sense as we have used it thus far, cannot be used in qualitative studies. Nonetheless, the methodological problems concerning the relationship between theory and measurement or observation are similar, although the specific problems differ. One of the major distinguishing characteristics of qualitative research is the fact that the researcher attempts to understand people in terms of their own definition of their world. In terms of Becker’s distinction, the focus is on an ‘insider perspective’ rather than on an ‘outsider perspective’. In qualitative research the natural and subjective components of the sample are emphasised. It is for this reason that qualitative research is also referred to as naturalistic research. From a naturalistic perspective, one of the major assignments in research of this nature is the accurate identification of the ‘native’ or ‘indigenous’ concepts or conceptualisations of the subjects being investigated. It is only after having completed this task that the researcher will attempt to integrate them within the framework of an existing social scientific theory or model. A leading qualitative researcher, Norman Denzin, defines ‘operationalisation’ in qualitative research as follows:
• Naturalists link their theoretical components to the empirical world through the collection of behaviour specimens. They operationalise those concepts through a careful analysis of their specimens. Starting with loose sensitising definitions of their concepts, they empirically operationalise the concepts only after having entered the worlds of interaction that they wish to understand … They include as many behaviours as possible as indications of the concept in question, through the use of naturalistic indicators which represent any segment of the subjects’ behaviour that reflects on, or describes, a sociological concept. An indicator is naturalistic if it derives from the subjects’ world of meaning, action, and discourse – it is not imposed on that world by the observer.
Typically, the concepts generated in qualitative research are concrete concepts, which accurately reflect the world of the subjects. Qualitative researchers quite justifiably claim that qualitative concepts have strong construct validity because they have their roots in the world of the subjects. An obvious limiting factor with concepts of this nature is their limited interpretative scope. For the precise reason that these concepts are part of the world of meaning of a given group, their generalisability will usually be limited.

SAMPLING Sampling is part of our everyday life. We sample restaurants by selecting different ones over a period of time; we decide which car to buy on the basis of a sample or selection of our own and other people’s experiences. We continually gather information from specific instances and generalise to new ones on the basis of their belonging to a common population of instances. In everyday life, ‘sampling’ is pretty much equivalent to ‘selection’. Although we often work on the assumption that sampling in everyday life is reliable, that is, that it represents the ‘population’ from which it is selected, this is often not the case. Sampling in everyday life is usually haphazard and unsystematic and hence often results in decisions being based on inaccurate information. The concept of ‘representative’ sampling is a well-known one. We know that we cannot really judge all politicians by their appearance on television programmes, although we are tempted to do so, and that all Italians are not the types depicted in Mafia movies. This is precisely the stuff that stereotypes are made of. Scientific sampling aims to avoid the pitfalls of biased and unsystematic sampling. Before embarking on a more detailed discussion of the concept and logic of sampling, it must be emphasised that not all social research involves sampling. We must first distinguish between two distinctive research strategies in social research.

Contextual versus Generalising Research Strategies Since 1894, when Wilhelm Windelband proposed the distinction between nomothetic and idiographic research strategies, it has become customary to classify social research into one of these categories. The best description of this distinction is found in the following statement:
• In their quest for knowledge of reality, the empirical sciences either seek the general in the form of the law of nature or the particular in the form of the historically defined structure. On the one hand, they are concerned with the form which invariably remains constant. On the other hand, they are concerned with the unique, immanently defined content of the real event … scientific thought is nomothetic in the former case and idiographic in the latter case.
In terms of this distinction, two general types of research strategies can be identified. On the one hand, there are broad strategies by means of which it would be possible to search for general regularities in human behaviour. On the other, the attention is focussed on a single event or case and its structural coherence. In a previous book, I suggested the use of the terms ‘general strategy’ and ‘contextual strategy’ to refer to these two broad types of research. In a general or generalising strategy, social objects or phenomena are studied for their interest as representative examples of a larger population of similar objects or phenomena. In a contextual strategy we study phenomena because of their intrinsic and immediate contextual significance. Typical examples of studies with a contextual strategy are encountered in the historical disciplines, the ‘cultural’ or ‘hermeneutic’ disciplines like languages, arts, jurisprudence and theology, and studies in social science where the aim is to investigate a single case in an in-depth manner. Well-known examples of in-depth investigations are Bogdan’s study of a single transsexual, Whyte’s study of a specific subculture and, obviously, the multitude of ethnographic studies of specific cultures, communities and tribes. In all these examples, the primary aim of the investigators is to produce an extensive description of the phenomenon in its specific context.

In contrast, the aim of research is often to study a representative number of events or people with a view to generalising the results of the study to a defined population or universe. Typical examples are experimental studies, comparative studies and various kinds of sample surveys. It is important to emphasise that there is no logical or philosophical reason why any one of these strategies should be regarded as being in any way superior to another. They are all legitimate forms of inquiry. In the final instance, it is the researcher who decides – primarily in terms of the specific objectives of her study – on the most appropriate strategy. Finally, a note on terminology. In contextual studies, it is customary to refer to the ‘selection’ of cases that are included in the investigation. It is only in generalizing studies that we would use the term ‘sampling’ when referring to the procedures involved in selecting cases.

Basic Concepts: Population, Census, Target Population, Sample and Sampling Frame The terms ‘population’ and ‘universe’ are used interchangeably in the literature. We begin by looking at three definitions: • A ‘population’ is ‘a collection of objects, events or individuals having some common characteristic that the researcher is interested in studying’. • The ‘universe’ is the ‘complete set of elements and their characteristics about which a conclusion is to be drawn on the basis of a sample’. • The ‘population’ is ‘the aggregate of all the cases that conform to some designated set of specifications’. All three of these definitions emphasise the fact that the population is a collection or set of elements referred to as the ‘population elements’, which meet a certain definition or specification. This implies that populations in the context of sampling are always ‘constructed’ or ‘defined’ sets of elements. They are not naturally given entities. For this reason, the ‘population’ that interests a researcher has nothing to do with the everyday notion of the population of people in a certain country or city. The social researcher might of course be interested in studying the attitudes of the total population of a country, but this would be an example where the statistical notion of ‘population’ and the demographic notion coincide. Another implication of the way in which ‘population’ is defined, is that it refers to a set of elements of various kinds. In fact, we can apply it to all of the seven categories of ‘units of analysis’ that were distinguished in chapter. In social research populations may hence include: • Populations of individual human beings such as adults, school children, the aged in a certain area, the inmates of a prison or all the members of a sports team; • Populations of organisations such as all the financial institutions in a country or all the government departments; • Populations of institutions for instance all the tertiary institutions in South Africa – the universities and technikons; 152 Research Methodology and Statistical Methods

• Populations of collectives, for instance all cities with populations exceeding 50 000, or all developing countries;
• Populations of social activities or events, for example all instances of violence such as murder, armed robbery or rape within a certain time frame;
• Populations of cultural objects such as the set of Agenda programmes televised in 1994 or the collected works of Sigmund Freud; and
• Populations of interventions such as all the training programmes in RDP offices in Gauteng or all the affirmative action programmes in banks in South Africa.
To re-emphasise: in social research the terms ‘population’ and ‘universe’ are ‘defined’ or ‘constructed’ entities within the context of a specific research project. These are not naturally given entities. In terms of our discussion in chapter, a population is the sum total of all the cases that meet our definition of the unit of analysis. Thus, if we say that we wish to study ‘adolescents between the ages of twelve and eighteen who live in Cape Town’ as a unit of analysis, the population of our study will be the aggregate or sum total of all the cases or instances that fall within this definition. A census is a count of all the elements in a population and/or a determination of the distributions of their characteristics, based upon information obtained on each of the elements. For various reasons, though, we usually select only some of the elements with the intention of finding out something about the total population from which they are taken. We refer to that selection of elements as a sample. Selltiz and Cook maintain that, under certain specifications, one population may be included in another. The population of ‘students at the University of Stellenbosch’ would hence include other ‘subpopulations’ or ‘population strata’ such as ‘female students at US’ or ‘male students at US’. Obviously various features could be used to define an almost unlimited number of strata, such as gender, age, height, weight, degree course, place of residence, race and political affiliation. Defining the population is a two-step process: first, the target population, which is the population to which one wishes to generalise, must be identified and, second, the sampling frame must be constructed. When defining the target population there are two important considerations, namely the scope of the generalisation planned and the practical requirements of drawing the sample. Once the target population has been defined, it must be made operational through the construction of the sampling frame. The sampling frame refers to the set of all cases from which the sample will actually be selected. It is important to note that the sampling frame is not a sample; it is the operational definition of the population that provides the basis for sampling. Let us return to an everyday example. Suppose you wish to study the level of service quality of the better hotels in the Western Cape. Your target population, the set of ‘elements’ to which you wish to generalise, is all the ‘top hotels’ in the municipal districts of Cape Town, Bellville, Somerset West and Stellenbosch. The sampling frame is the actual collection of hotels from which you will sample. There are basically two ways of constructing a sampling frame, namely by drawing up a complete list of all the cases that fit one’s definition or by formulating a rule that defines membership.
For example, in a city telephone survey, the sampling frame could consist of the telephone directory for the specific city or the set of all telephone numbers with certain telephone exchanges. In our case, it should be possible, by working systematically through telephone directories, calling tourist information centres or, even better, by procuring a list of all two- to five-star hotels from the Federation of Associated Hospitalities in South Africa to compile a fairly comprehensive sampling frame. A researcher will obviously try to compile a sampling frame that is identical to the target population. Unfortunately this is usually only possible for small, geographically concentrated populations such as organisations or university campuses. Because it is often impossible or impractical for a researcher to compile an accurate list of the target population, existing lists have to be used and these are often incomplete and outdated. Strictly speaking, conclusions should be made only about the population represented by the sampling frame. Yet, it is to the target population that we wish to generalise. Therefore one should always evaluate cases in the target population that have been omitted from the sampling frame.
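As a small illustration of the two ways of constructing a sampling frame mentioned above, the Python sketch below applies a membership rule to a hypothetical register of hotels. The hotel records, districts and star ratings are invented for the example and carry no factual weight.

# Hypothetical register of hotels; in practice this would come from
# directories, tourist information centres or an industry list.
hotel_register = [
    {"name": "Hotel A", "district": "Cape Town", "stars": 4},
    {"name": "Hotel B", "district": "Bellville", "stars": 2},
    {"name": "Hotel C", "district": "Stellenbosch", "stars": 5},
    {"name": "Hotel D", "district": "George", "stars": 5},  # outside the target districts
]

target_districts = {"Cape Town", "Bellville", "Somerset West", "Stellenbosch"}

# Membership rule: a 'top hotel' is a two- to five-star hotel in one of the
# target districts. Applying the rule yields the sampling frame.
sampling_frame = [
    h for h in hotel_register
    if h["district"] in target_districts and 2 <= h["stars"] <= 5
]
print([h["name"] for h in sampling_frame])  # Hotel A, Hotel B, Hotel C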

The Logic of Sampling The key concept in sampling is representativeness. Unless the sample from which we will generalise ‘truthfully’ or ‘faithfully’ represents the population from which it was drawn, we have no reason to believe that the population has the same properties as those of the sample. The concept of ‘representativeness’ can be better explained by relating it to the principles of statistical inference, namely inference from samples to populations. Wright formulates it as follows:
• To decide whether we may soundly infer that the population has a property that we have observed in our sample, we must ask the following question: ‘What is the best explanation of the sample’s having that property?’ Logically there are two possible explanations: either the sample has property P because the population has that property (E1), or the sample has P, but not because the population has that property (E2). There are, furthermore, two versions of E2.
• E2a: The sample may have property P, not because the population has it, but rather because of some distorting feature in the selection procedure.
• E2b: The sample has property P by chance.
It is important to realise that E1 must compete with both versions of E2, because both versions characterise a sample as an unrepresentative and hence unreliable guide to its population. We return to our example of the study of top hotels in the Western Cape. We have thus far completed two steps:
• Step 1: We defined the target population.

• Step 2: We defined the sampling frame. Assume that this yielded a list of 200 hotels. The third step is the actual sampling.
• Step 3: Draw a sample of twenty hotels by choosing the first twenty names on the list provided by FEDHASA.
Suppose we proceed with the study, collect and analyse the data, rank our twenty hotels in terms of a five-point service-quality scale and find that the service at all twenty hotels is rated as extremely poor. This raises the question: how do we explain the fact that all twenty hotels have this property? The above discussion suggests three possible explanations:
• E1: The sample might have turned out that way because all the hotels in the Western Cape do in fact offer service of poor quality.
• E2a: However, our sample findings might be inaccurate and most of the hotels in the Western Cape might actually deliver service of high quality. The inaccurate results might be due to an error in the sampling procedure, probably because the twenty hotels selected were simply the first twenty on the list. Had I investigated more thoroughly I might have discovered that FEDHASA lists the hotels from the lowest ratings to the highest. This would explain why the first twenty names drawn were all low-rated hotels!
• E2b: But suppose I sampled differently. Say I followed a more scientific procedure by sampling every tenth name on the list or by using a table of randomly generated numbers and still came up with the same results. We would then have no choice but to attribute this outcome to the luck of the draw, to chance. It seems very unlikely that, after having taken appropriate measures to ensure a ‘random’ selection procedure, you could still come up with an unrepresentative sample, but it does sometimes happen! In this case, we just happened to draw randomly, from the list of two hundred, the twenty hotels with the lowest levels of service. This is an example of E2b: the sample suggests, wrongly, that all or most of the hotels in the population provide poor service, and this misleading result is due to chance rather than to a flawed selection procedure.
So the difference between the last two versions is simple: E2a is the result of a biased selection procedure; E2b is due to chance. This distinction allows us to define the notion of an unbiased sample in research:
• An unbiased sample is one in which no unrepresentativeness can be traced to the selection procedure. If a selection procedure is responsible for unrepresentativeness in the sample, the sample is biased by the selection procedure.
The most celebrated case of a biased selection is the 1936 presidential election poll in the USA conducted by the Literary Digest. The Digest randomly selected names from telephone directories throughout the country, polled the selected individuals, and compiled the results. On the basis of their sample they predicted a landslide win for Alf Landon, the Republican nominee. But it was Landon’s opponent, Franklin Roosevelt, who won by a landslide. The unrepresentativeness of the Digest’s sample is easily traceable to the selection procedures, for in 1936 the vast majority of the nation’s poor and lower-middle class voters could not afford telephones. So by limiting its sample to voters listed in telephone directories, the Literary Digest inadvertently biased its sample heavily in favour of relatively affluent citizens. In 1936, even more than today, the relatively affluent citizens tended to be Republicans.
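Returning to the hotel example, the contrast between the biased selection procedure of E2a and an effectively random one can be shown with a brief simulation. The quality scores and the low-to-high ordering of the hypothetical FEDHASA list are assumptions made purely for illustration.

import random

random.seed(1)
# A frame of 200 hotels with quality ratings 1 (poor) to 5 (excellent),
# ordered from lowest to highest as the hypothetical FEDHASA list might be.
frame = sorted(random.choices([1, 2, 3, 4, 5], k=200))

first_twenty = frame[:20]                 # the biased procedure of E2a
random_twenty = random.sample(frame, 20)  # an effectively random procedure

print(sum(first_twenty) / 20)   # close to 1: all low-rated hotels
print(sum(random_twenty) / 20)  # close to the mean rating of the whole frame
print(sum(frame) / 200)         # the 'population' value we are trying to estimate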
In South Africa a comparable situation would arise if one were to use the telephone directory as a sampling frame for a national sample. The majority of South Africans, especially those in the rural areas, would then be excluded. It is important to note that a given selection may be unbiased for one property but not for another. In our hypothetical example of the hotel study, we showed how the relevant property, namely service quality levels, was systematically linked to the selection procedure and that this produced a biased sample. But say for instance I was interested in other properties of hotels in the Western Cape, for example, staff composition in terms of race and gender distribution, or the average number of rooms. It is highly unlikely that these properties would be related to the quality rating of the hotel. If this assumption is justified, then the selection procedure initially used would probably not have led to a biased sample. In summary: selection procedures will naturally be connected with some properties and not with others, so a selection procedure that is quite adequate for inferences about one property may well be inadequate for others. This illustrates a general point: the more we know about populations and the connections among their properties, the better. In fact, this example shows why it is essential to use probability sampling designs. It is precisely because we do not usually have sufficient information about the population that we should choose a sampling procedure based on random selection. This brings us to the relationship between bias and randomness. By their very nature, random samples are unbiased, but they are a particular sort of unbiased sample: not all unbiased samples qualify. Statisticians define random sampling as a procedure in which every member of the population has an equal chance of being selected. Giving every member an equal opportunity of being included in the sample implies that, not only should there be no connection between the selection procedure and P but also that there must be no connection between the procedure and any statistical property of the sample. According to Wright, a procedure is effectively random when there is no explanatory connection between the procedure and any statistical property of the sample that is likely to correlate with P. Drawing names from a hat will be effectively random in this sense, although certain properties of the paper slips on which the names are written, such as size, shape, thickness, weight or edge condition may well affect the selection. But since there is not likely to be any connection between these properties and the names themselves, the selection is ‘random enough’ for most purposes. 156 Research Methodology and Statistical Methods

• So the value of a random, or effectively random, sample is that it is often our only route to unbiasedness; and this is increasingly the case the less we know about the property in question.
The discussion thus far has dealt with the problem of unrepresentative samples where the bias is a result or outcome of the selection procedure. The problem of chance or ‘bad luck’ must also be addressed. As Wright aptly comments: “We might do all the right things, get a complete list of the population, use a table of random numbers, employ redundant safeguards and nevertheless end up with all fat ones or all Democrats … purely by chance”. One response is to increase the size of the sample. It is generally true that as the size of the sample increases, it becomes less likely that we will obtain inaccurate results purely by chance. Consider again our hotel example. If there are in fact only twenty poor-quality hotels in the Western Cape and we were unlucky enough to draw these twenty in our sample by using systematic or random sampling, then by increasing the sample to forty, we will already have twenty high-quality hotels in the sample. If we increase the sample size to eighty, we get increasingly closer to the actual value of the property in the population. But we are usually unable to follow this procedure in concrete research. The point about sampling is precisely that it is usually neither feasible nor practical to draw large samples. We sample precisely because we wish to base estimates of population properties on small selections of cases. This finally brings us to the notion of inferential statistics. Statisticians have devised ways of obtaining more accurate estimates of population properties, or in other words, of drawing inferences about populations on the basis of sample information. They have developed procedures that enable us to be more precise about the odds of getting accurate estimates. The question that interests us is the following: as the sample increases, how much more confident are we that the hotels have a certain property? Where the sample was still very small, we could obviously not be very sure that the proportion of good-quality hotels in the population was close to that which we found in our sample – pure chance is still too strong a possibility. As the size of the sample increases, so too does our confidence that the proportion of good-quality hotels in the sample mirrors that of the population. We decide when to stop increasing sample size for the same sort of reasons that we decide to stop collecting data in any empirical investigation, namely once we have enough certainty for the purpose at hand or when the cost of collecting further data exceeds the likely benefit in increased certainty. In summary we can say that probability sampling has two major advantages: it removes the possibility that bias on the part of the investigator will enter into the selection of cases; and, through the process of random selection, the principles of probability theory may be applied to estimate the accuracy of samples. The first point has now been discussed in some detail. Regarding the second, some concluding comments can be based on the hotel example. Assume that the true proportion of excellent-quality hotels in the Western Cape is eighty per cent. Since we do not know what the value of this population property is, we must use the results of our sample study to estimate it.
Suppose we disregard the highly unlikely result of drawing only the poor quality hotels through a process of random selection, and assume that we obtain a result where sixty per cent of the hotels in our sample are rated excellent. In this case, the difference between the sample statistic and the population parameter is defined as the sampling error. The ‘average’ of such errors for an entire sampling distribution is known as the standard error. This means that if we were to draw all possible samples of size twenty from the total population, the average deviation of each sample statistic from the true population value is the standard error. The concept of ‘standard error’ is useful for formulating a general rule about sample size: the larger the sample size, the smaller the standard error. We remarked earlier that probability theory enables us to make statements about the accuracy of a sample statistic in actual research. According to this theory, the distributions of various sample statistics exhibit a consistent and predictable pattern. So, although the population parameter is not known, the theory indicates how sample estimates will be distributed and provides a statistical formula for calculating the standard error – a measure of how much the sample estimate is likely to vary. However, the technical treatment of estimating population parameters using the information obtained through a study of a sample, is beyond the scope of this book. Any good textbook on statistical inference can be consulted.
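The rule that a larger sample yields a smaller standard error can be illustrated with the usual formula for the standard error of a sample proportion, the square root of p(1 − p)/n. The sketch below assumes, as in the example above, a true population proportion of 0.8; the sample sizes are arbitrary.

from math import sqrt

p = 0.8  # assumed population proportion of excellent-quality hotels
for n in (20, 40, 80, 160):
    se = sqrt(p * (1 - p) / n)
    print(f"n = {n:3d}  standard error of the proportion = {se:.3f}")
# The standard error roughly halves each time the sample size is quadrupled.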

Assignment You have to design a study to investigate differences in the political attitudes of South African university students. As an exercise, address the following issues: • Define the target population. • How will you define the sampling frame? • Which variables are most likely to be used to stratify your sample? • What are the likely sources of error that might affect the selection procedure?
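As one possible, purely illustrative starting point for the stratification question in the assignment above, the following sketch draws a proportional stratified sample from a hypothetical frame of students, using university as the stratifying variable. The frame, the list of universities, the strata and the sampling fraction are all invented for the exercise.

import random
from collections import defaultdict

random.seed(2)
# Hypothetical sampling frame: (student_id, university) pairs.
universities = ["UCT", "Wits", "Stellenbosch", "UKZN"]
frame = [(i, random.choice(universities)) for i in range(10_000)]

# Group the frame into strata by university.
strata = defaultdict(list)
for student_id, university in frame:
    strata[university].append(student_id)

# Draw a 2% proportional sample from each stratum.
sample_fraction = 0.02
sample = []
for university, members in strata.items():
    k = round(len(members) * sample_fraction)
    sample.extend((student_id, university) for student_id in random.sample(members, k))

print(len(sample), "students selected across", len(strata), "strata")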

DATA COLLECTION (DATA SOURCES, REACTIVITY AND CONTROL)

Introduction One of the distinctive features of the social sciences is that, to a greater or lesser degree, the participants in social research, to wit, individuals or groups, are aware of the fact that they are ‘objects’ of investigation. Depending on the nature of the particular data source and the manner in which it is collected, human beings are aware of this situation when they participate in research and they tend to react to it. In the literature on methodology, this phenomenon has been known as reactivity since 1957. In this section this term will be applied in a broad sense to refer to the phenomenon that human beings react to the fact that they are participants in research. This reaction manifests itself in a variety of forms, for example, resistance to being interviewed or to completing questionnaires, supplying inaccurate information as a result of apathy or wilfulness, modifying behaviour or information to create a better impression, or deliberately misinforming the researcher. The different manifestations of human reactivity in the process of data collection will be discussed in the next section. However, it is important to note that, depending on the nature of the data source, reactivity is an important variable. In accordance with Mannheim’s scheme, data sources in the social sciences can be divided into two main categories, with two subcategories in each, namely:
• Human behaviour and human characteristics.
• Products of human behaviour and of human characteristics.

Human Behaviour and Human Characteristics Mannheim distinguishes between two main categories: on the one hand, verbal behaviour, which includes verbal or written responses to questions posed by the researcher and on the other, all observable behaviour and characteristics. The first category includes all forms of human behaviour which only become accessible by means of indirect observation such as questionnaires, interviews or projective tests. The second category includes all forms of individual behaviour, social interaction, and observable characteristics such as gender, number of individuals, physical locality, non-verbal behaviour and stature. Direct observation methods are generally used to collect data in this category. Examples would include structured or controlled observation in experimental situations, and participant observation in non-structured situations. The distinction drawn between structured and unstructured observation is clearly not equally applicable in all situations. Nonetheless, it provides a rough systematisation of data-collection techniques, which is useful for the remainder of the discussion.

Products of Both Human Behaviour and Human Characteristics Mannheim divides this category into two subcategories, namely physical traces and archival sources. Physical traces are defined as any physical evidence that has been left behind by earlier human behaviour. In accordance with Webb’s subsequent refinement of this category, physical traces are further subdivided into erosion measures and accretion measures. Examples of erosion measures would include wear on floor tiles at museum exhibits, erosion of library books and wear patterns on attire such as shoes, all of which may be employed as indications of human activity. Examples of accretion measures would include the number of empty liquor bottles per week in refuse cans, the placing of buildings and pot shards. Archival or documentary sources refer to the extensive collections of records, documents, library collections or mass media material that have been amassed. These sources would also include well-known material such as census data, vital statistics, ecological and demographic data, personal documents like diaries, autobiographies and letters, and case studies. Other types of archival sources include mass media material like newspaper reports, the content of radio and television programmes, and film material. Webb et al. also refer to less well-known material such as inscriptions on tombstones, sales records, suicide notes, nursing records on patients, and voting records. Webb and Banks were the first researchers to focus on the existence of ‘differential reactivity’. Data sources where human beings are directly involved are regarded as highly reactive sources, whereas those where human beings are only indirectly involved are far less likely to be regarded as reactive sources. Reactivity becomes the largest single threat to the validity of research findings when human behaviour or characteristics are the sources of data or information. With the exception of covert observation, and irrespective of whether data is collected by means of indirect or direct observation, human respondents or research participants are aware of the researcher and usually react to this situation in some or other way. Obviously the products of human activities such as documents or texts cannot react to the fact that they are being researched. It should nevertheless be borne in mind that the products of human behaviour are the result of decisions and cognitive processes. These products are the sedimentations or ‘residues’ of the human spirit. This is manifested in the fact that when studying a text, the researcher has to be mindful of the original intention or aim of the author and of the researcher’s own historicity. The fact that human beings are rational beings is obviously also manifested in the products of human behaviour. Although data sources in the second category, where human involvement is only indirect, are unlikely to display reactivity to any marked degree, the possibility cannot be ignored. In the remainder of this section we shall discuss the threats to the validity of findings in which human beings are directly involved. How do researchers respond to the high level of reactivity of some data sources? Typically they resort to some form of ‘control’. The researcher could for instance attempt to reduce the effect of error by imposing a greater degree of structure on the observations, or by exerting more control over the research situation. Traditionally, the strongest form of control has been the experimental research design.
Randomisation is one such form of control. It involves the assignment of research subjects or participants to experimental and control groups on a random basis to control for the possible effects of individual differences. As Krathwohl aptly points out, ‘random assignment makes groups comparable in all of the variables we think might present problems and also in all other things we had not expected’. Unfortunately, it is also true that such measures of experimental control are intrusive. Quite frequently, the participants in such research are isolated in a laboratory situation that is removed from their natural environment so as to limit the effects of external nuisance or confounding variables. Banks focussed on the interesting phenomenon that these control measures vary positively with the degree of reactivity of the specific observation technique employed. This means that the greater the number of controls the researcher builds into the research situation, the more likely the participants are to be reactive. Because laboratory conditions such as isolation and random assignment to treatments do not typically form part of the everyday life of the subjects, they are likely to result in artificial and atypical patterns of behaviour. As Groenewald quite justifiably states, this presents a dilemma for the researcher. While, on the one hand, it is desirable to use observation techniques that elicit as little reactivity as possible in order to ensure the highest level of validity, it is, on the other, equally desirable to employ observation techniques that make it possible to exercise as much control as possible. Data-collection sources in which direct observation methods like systematic and participant observation, or indirect observation methods like questionnaires and interviews are used, can be controlled by the use of appropriate statistical techniques. We have, however, already indicated that these data sources are also highly reactive. The second main category of data sources, namely physical and archival sources, does not really allow for any ‘strong’ measures of control. In a certain sense, the data is already given. The researcher can of course select which data sources to use. He may also be able to sample such sources in a content analysis study. But he has no direct control over the ‘production’ of the data. By definition, documentary and archival sources have already been produced. What is important is that the researcher should take steps to ensure the authenticity of such sources. Although physical and archival sources do not allow for much control, the good news is that these data categories are low on reactivity and for this reason do not pose as big a threat to the eventual validity of the findings. The fact that reactivity and control are positively correlated illustrates a point which we made earlier in this book, namely that methodology in general, and research design in particular, inevitably involve compromises. The researcher must constantly weigh the advantages and disadvantages of a number of issues against each other, and eventually decide on whatever measures are, as a whole, likely to increase the validity of his findings most.
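Returning to randomisation as a form of control, the mechanics of random assignment can be sketched in a few lines. The participant labels and group sizes below are arbitrary; the point is simply that a shuffled list splits participants into experimental and control groups by chance alone, rather than by the researcher's judgement.

import random

random.seed(3)
participants = [f"P{i:02d}" for i in range(1, 21)]  # twenty hypothetical participants
random.shuffle(participants)                        # chance, not the researcher, orders them

experimental_group = participants[:10]
control_group = participants[10:]

print("Experimental group:", experimental_group)
print("Control group:     ", control_group)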

The Requirement of Reliability The key validity criterion for data collection is ‘reliability’. This is the requirement that the application of a valid measuring instrument to different groups under different sets of circumstances should lead to the same observations. Smith defines reliability by posing the following question:
• Will the same methods used by different researchers and/or at different times produce the same results?
As suggested by this definition, reliability demands consistency over time. In this sense, reliability refers to the fact that different research participants being tested by the same instrument at different times should respond identically to the instrument. But what are the possible sources of error during data collection? I shall argue that the reliability of data is affected by the following:
• The researchers, ‘experimenters’, ‘interviewers’ or ‘observers’;
• The individuals or ‘subjects’ who participate in the research project;
• The measuring instruments such as questionnaires, scales, tests, interviewing schedules and observation schedules; and
• The research context or the circumstances under which the research is conducted.
Since we have already discussed the issue of ‘measurement validity’ in chapter, our discussion here and in the next chapter will be confined to the effects of the researcher, the research participant and the research context on the reliability of the data. Each of these terms will be used in the widest sense possible. The term researcher includes project leaders, interviewers, experimenters, participant observers and fieldworkers. Participants include individuals or groups of people who are being observed or questioned. The research context refers to both the broad spatio-temporal circumstances under which research is conducted and the specific spatio-temporal setting. A further distinction is drawn between the characteristics and orientations of the researcher and the participant. Researcher or participant characteristics refer to attributes such as gender, nationality, age, socio-economic status and educational level. These characteristics are known as organismic variables. Researcher or subject orientations refer to attitudes, opinions, expectations, preferences, tendencies and values.

Fig. Factors that affect the reliability of data.

In accordance with common usage in the literature on experimental design, we shall refer to the consequences of the nuisance variables associated with each of the four variables as ‘effects’: researcher effects, participant effects, measuring instrument effects and context effects. Researcher effects are the negative consequences relating to validity that are directly attributable to the researcher. Similarly, measuring instrument effects are the negative consequences or lack of validity that may be directly attributed to some aspect of the measuring instrument. A final note on the relationship between measurement validity and reliability. Reliability is a precondition for measurement validity. It is clearly impossible to expect accurate measurements, for instance when using a scale to measure your weight, if the instrument does not consistently produce similar readings. Some of the readings will be accurate and some not. At the same time, reliability is not the only precondition for measurement validity. You can have a scale that consistently gives you the wrong readings! This means that the readings are reliable but invalid. The relationship between reliability and validity can be illustrated by the analogy between a measuring instrument and a gun. An accurate and reliable gun will repeatedly hit the target in the centre. The shots are clustered and on target. If some shots are clustered but not on target, the gun is reliable but not very accurate. If the gun is not consistent, the shots will be all over the target and therefore inaccurate or invalid.

Fig. Reliability and validity portrayed as an analogy to firing consistent and inconsistent guns at a target.
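The gun analogy can also be expressed numerically. In the sketch below, three hypothetical scales repeatedly ‘weigh’ a person whose true weight is 70 kg: one is reliable and valid, one is reliable but systematically wrong, and one is unreliable. All readings are simulated, and the true weight and spread values are assumptions chosen purely for illustration.

import random
from statistics import mean, stdev

random.seed(4)
true_weight = 70.0  # assumed true value

reliable_and_valid = [random.gauss(true_weight, 0.2) for _ in range(10)]
reliable_but_invalid = [random.gauss(true_weight + 5, 0.2) for _ in range(10)]  # consistent, but systematically wrong
unreliable = [random.gauss(true_weight, 4.0) for _ in range(10)]

for label, readings in [("reliable and valid", reliable_and_valid),
                        ("reliable but invalid", reliable_but_invalid),
                        ("unreliable", unreliable)]:
    # small spread = reliable; mean close to 70 = valid
    print(f"{label:22s} mean = {mean(readings):5.1f}  spread = {stdev(readings):4.1f}")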

Concluding Comments The objective of data collection is to produce reliable data. This means that such data is consistent over time and place. But, as we shall see in detail in the following chapter, there are a number of potential sources of error that could result in the production of unreliable data. These sources of error or observation effects have been classified into three categories, namely effects that are due to the researcher, those that result from the reactivity of the participant and those that follow from certain factors in the environment. We have also shown that, in many cases, measures to control these effects lead to higher levels of reactivity and hence to lower reliability. Social researchers have no choice but to strike a compromise between exercising control and creating reactivity, in order to produce the highest degree of reliability possible.

DATA COLLECTION (SOURCES OF ERROR) The term ‘observation effects’ is used in a broad sense to include researcher, participant and context effects. These are sometimes also referred to as ‘confounding’ variables: that is, elements that pose a threat to the reliability of the data collected. The aim in this chapter is to identify different kinds of confounding variables or sources of error and to give examples from various kinds of studies. Although the examples are mainly derived from the experimental and survey research literature, they have a wider application.

Researcher Effects Our discussion of researcher effects is divided into two sections; in the first, we discuss effects associated with researcher characteristics, and in the second, those associated with researcher orientations.

Researcher Characteristics Some of the most important researcher effects associated with specific researcher characteristics or attributes relate to the affiliation of the researcher, the image projected by the researcher to the research participants, and the distance between the researcher and the participants owing to differences between certain characteristics in the researcher and in the participants. The latter category is hence not only an effect of researcher characteristics, but also arises from the interaction between characteristics in the researcher and those in the participants.

Affiliation of the Researcher The researcher’s association with a specific organisation may result in biased responses. If the interviewer is employed by a highly influential organisation that is known for the quality of its research, respondents are likely to be better motivated to answer questions seriously and truthfully. Universities and large research organisations usually have reputations of this nature. However, should the interviewer be associated with an organisation that elicits suspicion or with a completely unknown organisation, respondents are likely to react more negatively to the interview situation. Atkin and Chaffe found that the affiliation or presumed affiliation of interviewers played an important role in research related to government control of violence in television programmes. In cases where the respondents, who were parents, thought that the interviewer represented some government body, they were more inclined to give extreme responses to questions.

Image of the Researcher In an important study, Jack Douglas discusses the problems surrounding conflict in research. According to him, a tacit assumption in research has always been that participants naturally wish to cooperate with the researcher, and that they would obviously provide valid and reliable information. However, Douglas maintains that the assumptions of what he calls ‘the investigative paradigm’ are far more realistic. The investigative paradigm is based on the assumption that social life is pervaded by profound conflicts of interest, values, feelings and actions. Based on a variety of studies that he conducted, including some in massage salons and on nudist beaches, Douglas found that suspicion and mistrust were the rule rather than the exception:
• One manifestation of mistrust was in avoidance or evasive responses. Rather than being the exception, I suspect such evasiveness is the common situation in field research: People rarely tell the whole truth as they see it about the most important things, but they are generally being evasive or misleading rather than lying. A field researcher must understand this and the reasons for it: Primarily a fear of exposure, of being caught in a lie, and an unwillingness to appear less than absolutely ‘moral’ to an academic stranger.
A researcher is often seen as a stranger, an outsider, or an intruder. In the research conducted by Douglas these issues were probably given greater prominence as a result of the sensitive nature of his research. It seems fairly obvious that women in massage salons would regard the researcher as suspect, since the possibility that he might be a policeman cannot be excluded. The examples that we have discussed have all related to fairly general perceptions of the researcher as a suspect or stranger. At a considerably lower level, Brislin et al. use ‘rudeness’ as an all-embracing term to refer to the researcher as someone who interrupts the normal activities of the respondents. However, a variety of issues like the affiliation of the researchers, their interests and cultural background, and the time and place of the research all contribute to the image of the researchers: the positive or negative perception that the participants are likely to have of them.

Distance between Researcher and Participant A large body of research has been conducted in an attempt to establish which effects result from differences between the researcher and the participant. Some of the most important findings have indicated the existence of racial effects, gender effects, status effects, urban-rural effects, and even style-of-dress effects. We shall consider a few of these studies. In a recent study Campbell found race-of-interviewer effects similar to those found in earlier studies by Schuman and Hatchet and also by Hyman. He concluded that racial differences between interviewer and participant may lead to a significant degree of bias. However, this bias is limited to the items in which the race of the respondent is explicitly mentioned by the interviewer. The direction of the observed bias is also constant in the sense that respondents consistently provide responses that are favourable to the race to which the interviewer belongs. In a study on premarital sex, Zehner found that the responses of male participants were not influenced by the gender of the interviewer. In comparison, female participants tended to be far more reticent when they were interviewed by female interviewers. However, in his study on controversial issues related to sexual intercourse, Rangonetti found no significant differences between the answers provided by those who were interviewed by male and female interviewers respectively. What he did find was that, irrespective of the gender of the interviewer, respondents were significantly more open in their responses when they were interviewed by a single interviewer than in a group interview situation. Mendras attempted to establish whether differences in rural and urban background between the interviewer and the participant had an influence on response bias. Giles and Chevasse in turn attempted to establish whether the interviewer’s style of dress had any influence on participants’ responses. Their conclusion was that style of dress could have an even greater influence on responses than the perceived status of the interviewer! In a more recent publication, Sudman and Bradburn found that the distance created between participant and interviewer as a result of interviewer and participant characteristics should not be seen as an issue separate from the content of the questions posed. It has already been noted that racial differences are only found when the content of the question includes a reference to the race of the respondent, and that gender factors were found to be sensitive in Zehner’s study only when the items referred to sexually sensitive themes. According to Sudman and Bradburn, the perceived threat of a question is of greater importance than the other issues. People are simply reluctant to reply to questions that deal with sexual behaviour, alcohol consumption, traffic offences, possession of firearms and the use of drugs. It is hence hardly surprising that gender effects will be observed when questions relating to sexual behaviour, which is a sensitive topic, are posed by members of a specific gender. The same would apply if questions on race relations were to be asked by interviewers from a race group other than that of the interviewee.

Researcher Orientations From research conducted over a broad spectrum it can be concluded that the final data is clearly influenced by the prejudices, expectations, attitudes, opinions and beliefs of the researcher, and that this applies equally to an interview, a laboratory or a field situation. Hyman attempted to systematise the influence of researcher orientations by identifying three types of orientation effects in interview situations:
1. Bias-producing cognitive factors in the interviewer;
2. Attitude structure expectations; and
3. Role expectations.
In the first category Hyman included the cognitive factors that could result in specific expectations in terms of respondents’ answers and are unique to the interviewer. These factors include specific beliefs and perceptions. As an example, Hyman quotes the following passage in which a female interviewer discusses her attitudes towards respondents:
• When asked whether she could make guesses about the attitudes of the respondents, she replied: “I often get fooled. On Russian questions I perhaps unconsciously make guesses. But if I do that I’m likely to write down what I think. Therefore I try not to”. But when the issue is pursued by asking her whether there are any characteristic types of respondents, she says: “Once they start talking, I can predict what they’ll say …”.
Hyman justifiably maintains that expectations of this nature may constitute an important source of bias if the interviewer allows herself to be led by them in her further probing, classification of responses and so on. Under the second category, which Hyman refers to as attitude-structure expectations, he notes the fact that some interviewers tend to believe that the attitudes of respondents are likely to display a uniformity of structure. This leads to a situation where the interviewer expects the respondent to answer later questions in a schedule in accordance with responses provided earlier on. This situation is clearly reflected in a statement like: “Once they start talking, I can predict what they’ll say …”. The third category of orientation effects, which could perhaps more appropriately be referred to as expectancy effects, is defined as follows: we might conceive of role expectations to denote the tendencies of interviewers to believe that certain attitudes or behaviours occur in individuals of given group memberships, and therefore to expect answers of a certain kind from particular persons. Role expectations, which frequently lead to the development of rigid stereotypes, are especially prevalent in cases where men have certain views of female roles, where whites have specific perceptions about blacks, youth about the aged, or the inverse, and so on. As an illustration of this phenomenon, Hyman refers to the remark by a male interviewer who said: “I just don’t think the average woman has as much social consciousness as the average man”. Rosenthal and his coworkers systematically studied similar expectancy effects in experimental studies. One of the best-known studies on experimenter expectancy effects was conducted by Rosenthal and Fode with laboratory rats. The experimenters were undergraduate psychology students who were led to believe that they would acquire practice in established experimental procedures. One half of the experimenters were told that the rats that they would use had been bred from exceptionally intelligent blood stock, while the other half were informed, equally inaccurately, that their rats were less gifted.
In actual fact the rats had been randomly selected from a homogeneous rat colony and there was no reason to expect any difference in their intelligence. The final results confirmed the expectancy effect: the first group of experimenters, who had expected their rats to learn more quickly, reported that this had indeed been the case, whereas the second group, with the supposedly dull rats, reported that their rats had acquired the skills less quickly. In a recent review of the literature on interviewing techniques, Campbell et al. comment on a similar orientation effect which they refer to as reinforcement and feedback. They emphasise the fact that several studies have shown that, when the interviewer provides positive feedback, for instance by saying ‘uh-huh’ or ‘good’, this has a definite influence on subsequent responses. In some cases the interviewer’s systematic approval of a response could have a biasing effect on the information obtained.

Participant Effects

The mere fact that human beings are being studied leads to atypical behaviour. It is probably safe to claim that the first description of participant effects in the literature of the social sciences is to be found in the publication by Roethlisberger and Dickson. Four researchers, Mayo, Roethlisberger, Whitehead and Dickson, did a research project at the Hawthorne factory of the Western Electric Company in 1927. The original intention was to study the effects of working conditions such as temperature, lighting, rest periods and hours of work, on worker productivity by observing six female workers. The interesting and unexpected finding was that the workers' performance increased irrespective of the variable being manipulated. Irrespective of whether working hours were increased or reduced or rest periods lengthened or shortened, productivity consistently increased. The researchers interpreted their findings as meaning that the employees felt flattered to have been able to participate in the experiment! It has subsequently become common practice to refer to this type of participant effect as the Hawthorne effect.

Participant Characteristics

In the preceding section we considered the influence of some of the better-known participant characteristics such as gender, racial group and status in the interaction between researcher and participant. We now turn briefly to another three well-known subject effects: memory decay, the omniscience syndrome and interview saturation.

Memory Decay

According to Smith the researcher has to accept the fact that there is a natural decay in the ability to remember events, a decay that correlates positively with:
• The length of time that has elapsed since the occurrence of the event;
• The irregular occurrence of the event;
• The relative unimportance of the event; and
• Decreased accessibility to relevant data relating to the event.

The Omniscience Syndrome

Some respondents appear to believe that they can answer any question. The researcher must be sensitive to this type of effect to avoid the inclusion of responses that are not authentic. Brislin et al. discuss this phenomenon in more detail.

Interview Saturation

Pareek and Rao justifiably maintain that some members of society, and particularly those who live in metropolitan areas, have become so strongly conditioned to market surveys that they tend to answer questions mechanically and superficially. Apart from the fact that this type of attitude can be identified in the interview situation, initial refusal or reluctance on the part of the respondent is usually also a good indication of over-saturation.

Participant Attitudes

Role Selection

One of the most radical participant effects is the participant's perception of his or her role in the research setting. Webb et al. justifiably maintain that:
• By singling out an individual to be tested the experimenter forces upon the subject a role-defining decision – What kind of person should I be as I answer these questions or do these tasks?
Webb et al. also maintain that the role selection effect is likely to be manifested in a variance between 'don't know' responses and the measurement of imaginary attitudes and opinions. If, for example, the instructions to the interviewee were to read: "You have been selected as part of a scientifically designed sample … It is important that you should answer all the questions …", the importance and uniqueness of the respondent are obviously emphasised. When instructions like these play an important role in the interview situation, it is not at all difficult to predict that fewer 'don't know' responses will be found, and that more imaginary attitudes and opinions will be measured.

Level of Motivation of the Participant

One of the most important variables that can influence the validity of the data collection process either positively or negatively, is the participant's level of motivation. The level of motivation is clearly influenced by a variety of factors such as interviewer characteristics, contextual factors and the manner in which the questions are phrased. Two issues may be emphasised in this context: the degree of interest that the topic has for the interviewee, and the extent to which he or she is likely to feel threatened by the questions that are posed. It has been empirically demonstrated that the more interesting the respondent finds the topic, the more highly motivated he or she will be and this in turn results in an increase in the response rate. As indicated earlier, the level of threat posed by the questions will have an important bearing on people's willingness to respond to them, and also on their level of motivation. Questions that relate to highly private issues are likely to be perceived as threatening by the majority of respondents, and they are likely to respond unreliably. For this reason Cannell and Kahn maintain that the interviewer must make the interviewing experience and task sufficiently meaningful, rewarding and enjoyable to attain and maintain the necessary respondent motivation.

Response Patterns

One of the most important types of observer effect in interviewing is the occurrence of systematic response patterns that are generally referred to as 'response sets'. A number of authors, including Cronbach, Kolson and Green, and Webb et al., have addressed this matter. Kolson and Green focus on the fact that children are inclined to gamble when they are unsure of the meaning of items. Similar response patterns that have been noted, particularly when the meaning of an item is obscure, include a tendency to endorse only the extremes on scaled items or to check the midpoints of the scale. For purposes of our discussion, we shall highlight two well-known types of response patterns: social desirability and the acquiescence response set. The Hawthorne effect is clearly an example of a social desirability tendency. According to Selltiz and Cook most individuals will try to give answers that make them appear well-adjusted, unprejudiced, rational, open-minded and democratic. Rosenberg was also able to confirm that individuals who attained high scores on the Marlowe-Crowne Social Desirability Scale were more inclined to supply extremely positive responses than those with low scores on the scale. The tendency to answer either yes or no to virtually all the items in a questionnaire or scale is referred to as the acquiescence response set. As early as 1937, Sletto found that respondents were more likely to agree with a statement than to disagree with its inverse. In a more recent and detailed study of this issue, Schuman and Presser were able to confirm earlier findings on this topic. Apart from the fact that they were able to confirm the existence of this type of response pattern, which can produce differences that range between ten and fifteen per cent, they also found indications that this phenomenon is more likely to occur amongst respondents with a low level of education than amongst, for instance, university graduates. However, these researchers maintain that we have not yet built up a large enough body of research on the phenomenon of the acquiescence response set to be able to provide an adequate interpretation of the reasons underlying this type of response pattern.

Context Effects

When discussing the research context one can distinguish between broader spatio-temporal factors that are determined by historical, socio-political, and economic factors, and the narrower research setting within which the research is conducted. With regard to the former the researcher must be sensitive to the following types of factors:
• The period during which the research is conducted. It is particularly relevant in the case of longitudinal research where changes in behaviour or attitudes are investigated and significant changes could be the result of external events such as elections, civil unrest, or increased unemployment.
• Cultural factors such as habits, traditions, customs and institutions. The anthropological literature abounds with examples in which researchers, to their own detriment, have failed to take local conventions and customs into account in the design and execution of their research.
• Political factors such as the existence of interest groups, lack of freedom, and intimidation.
Turning to the narrower research setting, the importance of this issue is associated with the perceived 'neutrality' of each setting. In the first two settings, the respondent is familiar with the setting, but the researcher is not. However, the third and fourth categories are neutral territories. Studies on the influence of the research setting have shown that the researcher's impressions of the participant's home or place of work frequently lead to significant data bias. The respondent's role is also directly influenced by the research setting. In the domestic setting a woman's role as a mother may be more noticeable, whereas her role as employer or supervisor may be more noticeable in the workplace.

DATA COLLECTION (ENSURING RELIABILITY)

The wide range of observer effects identified in the previous chapter illustrates how reactivity may influence the collection of data in social research. Although it is usually practically impossible for any researcher to identify and control for all of these effects, he still has a responsibility to plan and execute a study in a manner that will minimise the combined effects of various threats to validity. We now turn to some of the methods that the researcher can use to control for some of the effects. Our discussion does not address specific techniques, which may be found in the publications cited. Our primary concern is with the broad issues.

Triangulation

A first general principle in data collection is that the inclusion of multiple sources of data collection in a research project is likely to increase the reliability of the observations. Denzin coined the term triangulation to refer to the use of multiple methods of data collection. Campbell and Fiske suggested a similar strategy which they called multiple operationism. Both of these concepts refer to the use of a variety of methods and techniques of data collection in a single study. The underlying assumption is that, because various methods complement each other, their respective shortcomings can be balanced out. It is important to bear in mind that specific types of data collection are designed for collection of certain types of data. In a classic article in 1962, Morris Zelditch distinguished between three types of information: frequency distributions, incidents or events, and institutionalised norms and status. For each of these types of information there is a prototypical method of data collection: sample surveys for frequency distributions, direct or participant observation for incidents and events, and the use of informants or interviews for information on norms and status. Zelditch's classification also illustrates the fact that each type of method has specific limitations. By employing different methods of data collection in a single project we are, to some extent, able to compensate for the limitations of each. In an earlier section, we focussed on the fact that not all methods are equally reactive. It is hence an important principle to supplement the more reactive methods such as direct observation, with less reactive methods, such as the use of documentary sources. Two examples of triangulation will illustrate the advantages of such an approach. One of the observation effects identified in the previous chapter is associated with item sensitivity. We have specifically indicated that items that address issues relating to race and sex are likely to result in considerable response variability, particularly when there have been no controls for the race and gender of the interviewer. In the event of such a variation occurring in response to sensitive questions, more reliable information is likely to be obtained by doing a follow-up study using in-depth interviews. Similarly, where historical events are being investigated and memory decay may play an important part, the reliability of the information can probably be increased by the use of documentary sources like diaries and letters.

Ensuring Anonymity

As indicated by Schuman and Presser, respondents tend to be reluctant to provide interviewers with information on sensitive matters. A similar problem surfaces in studies of sensitive behaviour such as so-called deviant behaviour. Douglas indicated that subjects tend to be unusually reluctant or unwilling to participate because they regard the investigation as an invasion of their privacy. The fact that his investigations concerned situations of a sensitive nature – massage parlours and nudist beaches – obviously contributed to this kind of response!

One possible strategy to reduce the effect of such responses would be to emphasise the anonymity of responses and observations where possible. Rather than face-to-face interviews, it may for instance be possible to use postal or telephone surveys. Nevertheless, respondents are not necessarily convinced that the latter approaches actually ensure their anonymity. In the case of studies on so-called deviant behaviour, the assurance that the investigator will not identify the respondents in any way, must be regarded as a minimum requirement for establishing greater validity.

Establishing Rapport

As opposed to ensuring anonymity, one could use a strategy of trying to establish the best possible interpersonal relationship or rapport with the respondent. This strategy is obviously time consuming, and it might hence not always be practical. Douglas reported that a year had elapsed before they discovered that one of their most trustworthy participants had been using a 'nom de plume' all along. The advantage of a strong interpersonal relationship between researcher and participant is that it neutralises initial distrust. It can also serve as a control for role selection effects. If the respondent trusts the interviewer, there is no longer any need for any kind of role playing. The establishment of good rapport can also serve as a control for context effects.

Covert Research

A more drastic strategy is to make use of some form of covert research. Covert observation may assume a variety of guises. Basically, it amounts to the researcher deceiving the participant about the actual purpose of the research or about the identity of the researcher. In such cases all possible measures are taken to ensure that the participant does not become aware of the fact that he or she is part of a research project. A good example of this type of research is Simon Holdaway's study of police brutality. Because he suspected that he would not obtain reliable data in any other way, Holdaway resorted to covert research. He went as far as joining the police force, undergoing the necessary training, and spending several months serving as a policeman on patrol duty. With a single exception, nobody knew that his eventual aim was to conduct a sociological study of police activities. Covert research is particularly applicable in studies where participant observation or in-depth interviewing is used. These are studies in which it is essential for the researcher to establish close ties with the group being investigated while keeping his/her actual identity from them. Other types of covert research are encountered where researchers disguise the fact that research is being conducted. An example of this method is found in the study conducted by Schwartz and Skolnick in which letters of application for employment were manipulated to investigate the effect of a criminal record on 'suitability for appointment'. For a more detailed discussion of experimental designs in the natural context where some form of disguise is used, the reader can refer to Campbell. One of the most common examples of deception in laboratory experiments is to be found in so-called blind and double-blind designs. In blind experimental designs the participants do not know whether they are part of the experimental group or the control group, whereas in double-blind experimental designs there is the additional requirement that the experimenters do not know whether they are dealing with the experimental group or the control group.
Effective covert research is obviously a useful strategy for countering the general guinea-pig effect. Where research participants are unaware of the fact that they are being studied they cannot react to the fact of being investigated. Covert research also controls for expectancy effects. In the example of the double-blind experimental design, one of the most important causes of expectancy effects is eliminated. Although the use of covert strategies like disguise, deception or withholding information is one of the most effective ways of minimising or even eliminating error, there are fundamental ethical objections to the wholesale use of this approach. Covert research necessarily implies that the subject is deceived, or that his/her right to privacy is infringed on, or that he/she has to be lied to. The dilemma confronting the researcher is hence how to weigh the moral interests of the subject against the interests of science. A number of authors have proposed suggestions on ways to neutralise the negative ethical implications of covert research. One approach involves requesting the subject's permission to use the information gathered immediately after completion of the study, while obviously still ensuring the subject's anonymity.
Martin Bulmer’s Social research ethics may be consulted for an excellent exposition of the ethical implications of participant observation and covert research.

Control Group

It has always been the norm to use control groups in experimental studies wherever possible. Apart from the experimental group to which the specific experimental treatment is applied or in which given interventions are made, a comparable control group, which does not undergo the experimental treatment, is used. In an attempt to ensure that the experimental and control groups are comparable, techniques such as random assignment of participants to either the experimental or control groups or matching of participants in the two groups are used. This enables the researcher to draw causal inferences with a higher degree of validity. A control group facilitates control for participant effects such as maturation, history and selection effects. However, researchers who intend to use an experimental approach are cautioned to study carefully the most important participant effects that are likely to occur in different types of experimental design and the measures that may be taken to eliminate these.

Training

Adequate training of experimenters, interviewers, research assistants and fieldworkers is a precondition for any research. One of the specific aims in such training is to counteract researcher effects. In our discussion of researcher effects, we noted the negative consequences of researcher orientation effects, and particularly those associated with researcher expectation effects. The likelihood of obtaining reliable data is increased when interviewers are given clear instructions regarding the aims of the project, and the importance of accurate and consistent interviewing is emphasised. Thorough prior training is also likely to eliminate or reduce the occurrence of some of the other researcher effects that we have not discussed, such as inaccurate noting of responses, coding errors, classification errors and many more.

Selection of Fieldworkers

The cause of one of the more important researcher effects can be found in the distance between researchers and participants. Although various factors such as context or level of motivation result in greater degrees of distance between researchers and participants, researcher characteristics such as gender, race, age, and style of dress are some of the most important factors that fall under this rubric. An obvious solution to this problem is to exercise due care in the selection of fieldworkers. Fieldworkers who share as many characteristics with the sample as possible ought to be given preference.

Replication Studies

In conclusion, one can hardly overemphasise the importance of the principle of replication. As Barber notes, a variety of factors contribute to the situation where exact replication in the social sciences is virtually impossible. Following Lykken's lead, Barber argues in favour of more constructive replication by stressing that more investigators should try to confirm empirical relationships claimed in earlier reports while the replicator formulates his own methods of sampling, measurement and statistical analysis. Constructive replication implies that the researcher wishes to control the findings of an earlier study by investigating the same problem for a different sample and/or by using a different research design. A good example of such studies is included.

DATA ANALYSIS AND INTERPRETATION

The term 'analysis' basically means the 'resolution of a complex whole into its parts'. In this sense, it is usually contrasted with the term 'synthesis', which means 'the construction of a whole out of parts'. These terms were originally used in the domain of logic. In quantitative approaches to empirical research, 'analysis' refers to the stage in the research process where the researcher, through the application of various statistical and mathematical techniques, focusses separately on specific variables in the data set. The word 'synthesis' is not used that often in empirical studies, but it would have a meaning similar to the term 'interpretation'. Interpretation refers to the stage in the research process where the researcher tries to 'bring it all together', either by relating the various individual findings to an existing theory or hypothesis, or by formulating a new hypothesis that would best account for the data. There are fundamental differences between quantitative and qualitative research in terms of data analysis. In the discussion in the remainder of the chapter, I shall first elaborate on the notion of quantitative analysis and then refer to some of the distinctive features of qualitative data analysis. In the final section, I shall discuss the more general issue of drawing valid conclusions from data, irrespective of the kind of data.

Quantitative (Statistical) Data Analysis

This discussion will be confined to statistical data analysis, although the general category of quantitative analysis would normally also include mathematical techniques and computer simulation studies. The aim here is not to discuss specific techniques of statistical analysis, but rather to provide a framework that could be used in making more sense of such techniques. In various chapters of the book we have introduced terminology that is used in statistical analysis: words such as variables, levels of measurement, and relationships between variables and cases. A useful way to look at statistical analysis is in terms of the 'data matrix'. A 'data matrix' is defined as any array of figures or numbers where the rows are the cases and the columns are the variables. The cells of a data matrix represent the actual values of each variable as it applies to a specific case. Assume that we undertook a survey of the attitudes of full-time third-year students at three universities. We are specifically interested in testing a number of hypotheses, namely whether there is a link between 'university' and 'political affiliation in terms of party political support'; whether there is a relationship between gender and political affiliation; and whether there is a relationship between ethnic group and political party affiliation. The data matrix in the table below, which constitutes an excerpt from the total data set, contains information on twenty cases and the variables gender, ethnic group, university, political party support and age. The value labels are as follows:
• Gender: 1 = Male, 2 = Female;
• Ethnic group: 1 = African black, 2 = 'Coloured', 3 = White;
• University: 1 = UCT, 2 = US, 3 = UWC;
• Political party support: 1 = ANC, 2 = NP.

Table. Data matrix of hypothetical study at three Western Cape universities.

Cases   Gender   Ethnic group   University   Political party   Age
1       2        3              2            2                 19
2       1        1              1            1                 20
3       2        2              3            1                 23
4       1        3              2            2                 20
5       2        3              1            2                 20
6       2        3              1            2                 19
7       2        2              2            2                 19
8       1        2              2            1                 21
9       1        2              3            2                 21
10      2        2              3            2                 20
11      1        3              2            2                 19
12      2        1              1            1                 18
13      1        1              1            1                 18
14      2        1              3            1                 19
15      2        3              2            2                 20
16      2        3              2            2                 20
17      2        1              3            1                 20
18      1        2              3            1                 20
19      1        3              2            2                 19
20      2        3              1            2                 19

The domain of statistics has traditionally been divided according to two main functions, namely descriptive statistics and inferential statistics. Descriptive statistics is concerned with organising and summarising the data at hand to render it more comprehensible. Inferential statistics deals with the kinds of inferences that can be made when generalising from data, as from sample data to the entire population. Descriptive statistics can be further divided according to the number of variables that the researcher focusses on: if a single variable is studied the process is known as univariate analysis, when two variables are studied we refer to this as bivariate analysis and when more than two variables are studied we refer to it as multivariate analysis. We shall now discuss briefly the differences between univariate and bivariate descriptive statistics and inferential statistics, using a simple example from our data set. The purpose of this discussion is to give the reader an impression of what is involved in the process of quantitative data analysis and to elaborate on some of the underlying principles. For a detailed discussion of the variety of statistical techniques the student is referred to any good introductory statistical textbook.
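Such a data matrix maps directly onto the data structures used in statistical software. The minimal sketch below, which assumes the pandas library and uses illustrative variable names of our own (gender, ethnic_group, university, party, age), rebuilds the twenty-case matrix and attaches the value labels listed above.

```python
import pandas as pd

# Rows are cases, columns are variables; cell values are the coded responses
# from the hypothetical survey of third-year students at three universities.
data = pd.DataFrame({
    "gender":       [2, 1, 2, 1, 2, 2, 2, 1, 1, 2, 1, 2, 1, 2, 2, 2, 2, 1, 1, 2],
    "ethnic_group": [3, 1, 2, 3, 3, 3, 2, 2, 2, 2, 3, 1, 1, 1, 3, 3, 1, 2, 3, 3],
    "university":   [2, 1, 3, 2, 1, 1, 2, 2, 3, 3, 2, 1, 1, 3, 2, 2, 3, 3, 2, 1],
    "party":        [2, 1, 1, 2, 2, 2, 2, 1, 2, 2, 2, 1, 1, 1, 2, 2, 1, 1, 2, 2],
    "age":          [19, 20, 23, 20, 20, 19, 19, 21, 21, 20, 19, 18, 18, 19, 20, 20, 20, 20, 19, 19],
}, index=range(1, 21))

# Replace the numeric codes with the value labels given in the text.
labels = {
    "gender":       {1: "Male", 2: "Female"},
    "ethnic_group": {1: "African black", 2: "Coloured", 3: "White"},
    "university":   {1: "UCT", 2: "US", 3: "UWC"},
    "party":        {1: "ANC", 2: "NP"},
}
print(data.replace(labels).head())
```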

Univariate Analysis

Univariate analysis is sometimes seen as the first step in the analysis chain, as a stage of data cleaning. During this stage the aim is to get a clear picture of the data by examining one variable at a time. Univariate 'images' or 'pictures' of data come in various forms, namely frequency and percentage tables, graphs and charts and statistical indexes. The figure below is a frequency polygon of the distribution of age in our sample. The specific techniques used will depend on the level of measurement of the variable: nominal/ordinal level data allows for certain techniques, whereas interval level data allows for more powerful statistical analysis.

Fig. Frequency polygon of the distribution of age.

Using our data on university students, we could for instance have done a simple frequency and percentage analysis of political party affiliation as illustrated in the table below.

Table. Frequency/percentage distribution of political party support.

Code   Label   Frequency   Percentage
1      ANC     8           40
2      NP      12          60
       Total   20          100

The scores in the columns under 'frequency' and 'percentage' are referred to as the frequency and percentage distribution of scores on the variable. If we were to do similar univariate analyses on the other three variables we would find that sixty per cent of the sample are female and forty per cent male; twenty-five per cent are African blacks, thirty per cent are 'coloureds' and forty-five per cent are whites; thirty per cent are students at UCT, forty per cent at US and thirty per cent at UWC. It should be obvious that these statistics of the four variables provide us with a much clearer and manageable picture of the data. In fact, the impact of such summary statistics is really only felt when the number of cases becomes quite extensive. We can also get a picture of a distribution by looking at its respective statistical properties. Three kinds of measures of central tendency are usually distinguished, namely the mean, the median and the mode. These measures indicate various points of concentration in a set of values. The mean is the arithmetical average, calculated by adding up all the responses and dividing by the total number of respondents. The median is the midpoint in a distribution, or the value of the middle response; half of the responses are above it and half are below. The mode is the value or category with the highest frequency. The other class of properties that provides a statistical summary of the data is the degree of variability or dispersion in a set of values. The simplest dispersion measure is the range, which is the difference between the lowest and highest values in the data set. In our example the age of our students ranged between 18 and 23. Of several other measures of dispersion, the most commonly used is the standard deviation. This is a measure of the 'average' spread of observations around the mean. The third statistical property that is usually distinguished refers to the shape of a distribution. This property is most readily apparent from a graphic presentation called a frequency or percentage polygon.
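The univariate summaries just described are easy to verify on the hypothetical data set. The sketch below is a minimal illustration assuming the pandas library; it reproduces the frequency and percentage distribution of party support and the measures of central tendency and dispersion for age.

```python
import pandas as pd

# Coded responses taken from the data matrix above.
party = pd.Series([2, 1, 1, 2, 2, 2, 2, 1, 2, 2, 2, 1, 1, 1, 2, 2, 1, 1, 2, 2]).map({1: "ANC", 2: "NP"})
age = pd.Series([19, 20, 23, 20, 20, 19, 19, 21, 21, 20, 19, 18, 18, 19, 20, 20, 20, 20, 19, 19])

# Frequency and percentage distribution of political party support.
freq = party.value_counts()
print(pd.DataFrame({"Frequency": freq, "Percentage": 100 * freq / len(party)}))

# Measures of central tendency for age: mean, median and mode.
print("mean:", age.mean(), "median:", age.median(), "mode:", age.mode().tolist())

# Measures of dispersion for age: range and standard deviation.
print("range:", age.max() - age.min(), "std deviation:", round(age.std(), 2))
```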

Bivariate Analysis

Univariate analysis is a useful tool to give the researcher a feel for the data. It is also an essential stage in the quality check of a data set. But, as noted in an earlier chapter, researchers are more often than not interested in relationships between variables. In our example, we formulated three hypotheses regarding such relationships: between university and political party support, between ethnic group and political party support and between gender and political party support. The researcher generally asks two questions in this regard: first, does the hypothesised relationship exist and, secondly, how much effect or influence does one variable have on the other? As with univariate analysis, the choice of the actual technique depends on the level of measurement. We shall consider an example of a bivariate analysis involving nominal scale variables. When the variables analysed have only a few categories, as in many nominal level measurements, bivariate data is presented in tables. Such tables are known as cross-tabulations or contingency tables. A cross-tabulation requires a table with rows representing the categories of one variable and columns representing the categories of the other. When a dependent variable can be identified, it is customary to make it the row variable and the independent variable the column variable. In our case we would define 'party political support' as the dependent variable and 'ethnic group' as the independent variable. Let us return to our example and illustrate a cross-tabulation between ethnic group and political party affiliation.

Table. Cross-tabulation of ethnic group by political party support. Each cell shows the observed count, the expected count, the row percentage and the column percentage.

Political party              African black   'Coloured'   White     Row total
ANC      count               5               3            0         8
         expected            2.0             2.4          3.6
         row %               62.5%           37.5%        0%        40.0%
         column %            100.0%          50.0%        0%
NP       count               0               3            9         12
         expected            3.0             3.6          5.4
         row %               0%              25.0%        75.0%     60.0%
         column %            0%              50.0%        100.0%
Column total                 5               6            9         20
                             25.0%           30.0%        45.0%     100.0%

Chi-square = 13.75 (df = 2, prob < .001)

Even a cursory look at the table reveals a clear link between ethnic group and political party support. All the African students support the ANC, coloured students are divided in their support of the ANC and the NP, whereas all the white students indicated their affiliation to the NP. This is already strong evidence in favour of our second hypothesis. The other two hypotheses can be tested in the same way. But the interesting question about relationships is not whether they are found in samples of populations, but whether they reflect the true population values. The question is not simply whether a relationship exists in the sample data. The researcher must also determine whether the observed cell frequencies reveal a true relationship between these variables in the population or whether they are simply the result of sampling error and other random error. This leads us to a brief discussion of the concept of statistical significance.
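A cross-tabulation of this kind can be produced in one step. The sketch below is a hypothetical illustration assuming pandas; it rebuilds the observed counts with their marginal totals and the row percentages from the coded vectors in the data matrix.

```python
import pandas as pd

ethnic = pd.Series([3, 1, 2, 3, 3, 3, 2, 2, 2, 2, 3, 1, 1, 1, 3, 3, 1, 2, 3, 3]).map(
    {1: "African black", 2: "Coloured", 3: "White"})
party = pd.Series([2, 1, 1, 2, 2, 2, 2, 1, 2, 2, 2, 1, 1, 1, 2, 2, 1, 1, 2, 2]).map(
    {1: "ANC", 2: "NP"})

# Observed cell frequencies: the dependent variable (party support) forms the rows,
# the independent variable (ethnic group) the columns; margins=True adds the totals.
print(pd.crosstab(party, ethnic, margins=True))

# Row percentages: the distribution of ethnic group within each party.
print((pd.crosstab(party, ethnic, normalize="index") * 100).round(1))
```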

Inferential Statistics

Fig. Statistical techniques and their applications.

The statistic most commonly used to establish whether the observed results in a cross-tabulation represent true population values is the chi-square test for independence.

The chi-square test is based on a comparison of the observed cell frequencies with the cell frequencies one would expect if there were no relationship between the variables. The table above presents the observed counts, expected cell frequencies, row percentages and column percentages. The larger the differences between the actual cell frequencies and those expected assuming no relationship, the larger will be the value of chi-square and the more likely that the relationship exists in the population. When we report chi-square values as being significant, we are saying that it is highly unlikely that the results that we obtained were due to some form of sampling or random error. In conclusion: data analysis is all about investigating variables, the relationships between variables and the patterns in these relationships. The figure above summarises some of the main statistical techniques as they fit into the above distinctions.
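The chi-square computation described here can be checked numerically. The sketch below is a minimal illustration assuming SciPy; it compares the observed cell frequencies from the cross-tabulation with the frequencies expected under independence and returns the chi-square statistic, its degrees of freedom and the associated probability.

```python
from scipy.stats import chi2_contingency

# Observed counts from the cross-tabulation:
# rows = ANC, NP; columns = African black, 'Coloured', White.
observed = [[5, 3, 0],
            [0, 3, 9]]

chi2, p, df, expected = chi2_contingency(observed)
print("chi-square:", round(chi2, 2))   # 13.75
print("degrees of freedom:", df)       # 2
print("p-value:", round(p, 4))         # about 0.001, i.e. unlikely to be due to sampling error
print("expected frequencies under independence:")
print(expected)                        # ANC row: 2.0, 2.4, 3.6; NP row: 3.0, 3.6, 5.4
```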

Qualitative Data Analysis

Most qualitative researchers would not deny the value of quantitative analysis, even in so-called qualitative studies. However, they will certainly object to the wholesale use of such techniques to the exclusion of other methods of analysis. Let me refer to a typical formulation of a qualitative approach that is based on symbolic interactionism, which is a sociological tradition that has its roots in a rejection of the basic tenets of a positivist view of social reality. In a recent article, Paul Rock gave a useful summary of the main principles of symbolic interactionism as it applies to social research. According to Rock, symbolic interactionism emphasises three fundamental features of social life. Firstly, people can make reflexive use of the symbols they employ, that is to say, they can interpret and unravel the meanings of events without merely reacting to them. Secondly, people are symbolic objects to themselves. They constantly construct, judge and modify themselves as social entities. Thirdly, perspectives and plans emerge from the interplay between a socially constructed self and a socially constructed environment. Selves and settings are lent an additional structure by their location in time. They are awarded biographies, and their emergent properties are traced to define a range of possible futures. Against this set of philosophical principles, symbolic interactionism is concerned with four levels of analysis:
• The ways in which the self renders its environment socially significant, is transformed by such a rendition and construes the environment anew;
• The way in which social worlds are built by negotiated perspectives that continually redefine reality;
• The manner in which social worlds influence one another and make new constellations of meaning possible; and
• The relationship between such worlds and the larger, overriding symbolism that lends coherence to society.

According to Rock, such an approach to social inquiry has the following practical implications for data analysis:
• Any attempt to divide the social world into discrete parts must be rejected. Methods such as computer simulations and statistical analysis, which represent any such form of discrete analysis, are therefore not acceptable.
• Symbolic interactionists attempt to relate the procedures that are routinely employed to build up 'social scenes'. Any practice that invokes causes, forces or principles that are too abstract from the actor's perspective, is held to lack credibility.
• A further consequence of adopting such a level of magnification is that interactionism commands a reluctance to generalise features of social worlds. Perspectives, meanings and identities are inextricably anchored in their contexts. A person praying, drinking or fighting is accorded social significance in terms of the setting of his or her behaviour.
In concrete terms it means that qualitative analysis focusses on:
• Understanding rather than explaining social actions and events within their particular settings and contexts;
• Remaining true to the natural setting of the actors and the concepts they use to describe and understand themselves;
• Constructing, with regard to the social world, stories, accounts and 'theories' that retain the internal meaning and coherence of the social phenomenon rather than breaking it up into its constituent 'components'; and
• Contextually valid accounts of social life rather than formally generalisable explanations.
Because of this emphasis on the integrated, meaningful and contextual nature of social phenomena, qualitative researchers have had to develop new methods and strategies of 'analysing', or even better, of 'interpreting' and 'understanding' the social world. Examples of such approaches are:
• The grounded theory approach;
• Analytic induction;
• Phase analysis;
• Phenomenological analysis;
• Discourse analysis; and
• Conversation analysis.
Some of the principles of qualitative data analysis have actually been operationalised and are now used in computer software programmes such as Ethnograph, Nud*ist and Kwalitan. For an authoritative and recent overview of such programmes, compare Weitzman and Miles. In conclusion: the approaches of quantitative and qualitative analysts are clearly quite different. The quantitative researcher analyses data by looking at the particular elements, first in isolation and then in various combinations with other elements. A crucial question in generalising studies is whether the results obtained from the sample data are representative of population characteristics. This question leads to the use of inferential statistics to either estimate population parameters or test hypotheses. In qualitative research, the investigator usually works with a wealth of rich descriptive data, collected through methods such as participant observation, in-depth interviewing and document analysis. The research strategy is usually of a contextual nature. This implies a focus on the individual case in its specific context of meanings and significance. Analysis in these cases means reconstructing the inherent significance structures and the self-understanding of individuals by staying close to the subject. This approach is known as the insider perspective. The overall coherence and meaning of the data is more important than the specific meanings of its parts.
This leads to the use of methods of data analysis that are more holistic, synthetic and interpretative.

WRITING THE RESEARCH REPORT

In a recent study Böhme defends the view that argumentation constitutes the unique context of science and in this way determines the nature of scientific communication.
• In contrast to most other types of communication, scientific communication, however, is argumentation: the coherence of communication is the coherence of an argumentative context. This thesis may seem trivial, but that it is not so is shown by the fact that scientific communication is frequently understood to be an exchange of information. Even the communication of pure measurement results usually is the adducing of empirical evidence for a hypothesis or even is itself an empirical hypothesis for which theoretical arguments have to be brought forward in the publication.
Böhme compares the act of scientific communication, as distinct from the research process, with the traditional context of validation or justification. Scientific communication or reporting is an act of validation; an act in which the scientist argues for a specific view, hypothesis or finding relative to the position taken by other scientists. The logic of reporting is the logic of validation. It is the act of advancing arguments or reasons, empirical or theoretical, in support or refutation of a specific hypothesis or finding. This reference to the 'logic of social inquiry' should remind you of the PEC-framework discussed in an earlier chapter. The basic logic of all research is captured in the specific relationship between the research problem, the evidence collected and the conclusions drawn on the basis of the evidence. At the same time, a central theme in this book is that research is a social activity. We have, in line with numerous prominent scholars, emphasised the social nature of scientific praxis. Research occurs in different social and academic contexts, ranging from the specific interests of a research project, through the institutional, to national and international contexts. Scientific report writing does not take place in a vacuum. The nature of scientific communication, of which research reports or dissertations and articles are prime examples, is determined by the very logic of social science. Like the study itself, the nature of the report is a function of factors such as the purpose of the study, the interests of the researcher and the practical constraints of resources. In the following discussion we show how the 'logics' of different research contexts lead to differences in report writing. We shall distinguish between a master's thesis, a doctoral dissertation and a journal article.

Kinds of Reports

As the word suggests, a master's thesis indicates that its author has mastered a certain domain of research or a topic. The master's thesis provides, or should provide, evidence that a person has mastered the knowledge and skills appropriate to a certain subject. What does this mean? Clearly it must mean that the master's candidate has read and understood the most important previous research on a particular topic. It must also mean that the candidate has successfully integrated such previous research into a new, interesting research problem. Furthermore it means that the candidate has been able to design and execute a research project and has employed the appropriate methodologies to address the research problem adequately. And it finally means that the researcher is able to analyse and interpret the results of his or her study meaningfully and coherently. The emphasis in assessing a master's thesis is hence on whether the candidate provides sufficient evidence of having mastered the skills involved in these various activities, namely literature review, formulating the problem, designing a study and analysing and interpreting the evidence. Put briefly, the master's candidate must 'prove' to the reader that he or she knows how to do proper research! All of these requirements apply equally to a doctoral study. At doctoral level it is taken for granted that the candidate has the required knowledge and skills in a specific domain. The additional and crucial criterion of a doctoral study is that the candidate must make some contribution to the existing body of knowledge. Innovation and novelty are key requirements at this level. Just being able to demonstrate mastery of a topic will not suffice. The doctoral student also has to add to our collective understanding of the social world. Such a contribution can take various forms such as testing existing theories and models and making suggestions for their improvement, evaluating social interventions in order to improve their efficacy, analysing certain key concepts in a discipline and improving our understanding of these. A researcher starts writing and publishing journal articles after having mastered a certain domain and also when he/she has something new to contribute to the topic. A journal article is far more focussed and specific than a master's thesis or a doctoral dissertation. An article reports on new empirical findings or a new conceptualisation of an old problem, without providing extensive coverage of the literature. Of course, different contributions to journals have different aims. Examples of differences in approach are standard research articles; state-of-the-art reviews, which must cover the most recent literature; discussion articles, which engage in debate with other scholars; and book reviews. I want to relate this discussion to the comment above on the logic of research. I shall show that the PEC-framework is useful in understanding the different requirements and interests of the three kinds of research reports that we have discussed. My basic point is simple: the relative weight of each of these 'elements' differs in each of these cases. Let us summarise the central requirement of each kind of report:
• A master's thesis: to prove that one has mastered a certain topic.
• A doctoral dissertation: to make a contribution to the body of knowledge.
• A journal article: to contribute to a topical and well-defined research issue.
I believe that the above implies that a master's candidate must devote a disproportionate part of the actual thesis to a literature study and a discussion of the research design. It is through a literature review and the subsequent formulation of the research problem that we get an idea of the candidate's knowledgeability and skill in that area. The discussion of the research design and methodology also gives an indication of whether the candidate 'knows what she is doing'. In addition to what is required at master's level, a doctoral student must present the new findings or insights in some detail. Not only must she provide evidence of how the findings were arrived at, which involves the same steps as at the master's level, but also why these results are worth noting. This means that the weight of the reporting shifts noticeably from a focus on the statement of the Problem and the design/collection of the Evidence to the presentation and discussion of the Conclusions. In an important sense, a journal article is a smaller version of a doctoral dissertation. When an article is accepted by a good quality journal, the scientific community assumes, although not always uncritically, that the article is the outcome of extensive and well-designed research. This implies that, with the exception of state-of-the-art reviews, normal journal articles will not include extensive literature reviews, simply because we assume that the author knows what he/she is doing. The reader can already assess whether the researcher is 'in touch' with the field by checking which references have been cited. This discussion is summarised in the table below. Although it might seem somewhat artificial to assign a value to each of the main 'elements' in a study, it does illustrate how different contexts of report writing influence the logic and structure of a report.

Table. Percentage of respective types of report allocated to problem, evidence and conclusions.

PEC-framework                          Type of report
                                       Master's            Doctoral dissertation   Journal article
                                       (70-100 pp)         (200-300 pp)            (20 pp)
Problem (including literature review)  30%                 20-30%                  10-15%
Evidence (design and execution)        30%+                20%+                    10-15%
Conclusions                            40%                 50-60%                  70-80%

Guidelines to Report Writing

The guidelines or criteria for better report writing that are discussed below summarise some of the main points made in the book, especially in this section. Although the student can use these as a kind of 'checklist' when writing a report, they should also act as a reminder of the kind of issues involved in actual research. Four categories of guidelines are distinguished, namely theoretical, metatheoretical, methodological and technical.

Theoretical Guidelines

It is generally accepted that scientific research does not take place in a vacuum. Although studies or projects are written and published individually, they always form part of a particular theoretical framework. Knowledge in a given field of research should logically form part of a series of interdependent preceding studies and also of some of the theories or models that exceed the boundaries of those used in the particular framework. Given the importance of the argumentative context of scientific research, it follows that a literature survey should not simply comprise a mechanical description of existing theories: one or more theoretical views should be integrated with the logic of the research objective or task. For example, the researcher should be able to answer the following questions:
• How does the central theme of the investigation relate to other research and existing theories?
• Does the introduction to the study include an explanation of the manner in which the basic argument of the research has been integrated into the wider framework of relevant theory and research?

Guideline 1

The research project should be integrated into the wider framework of relevant theory and research reflected in a review of the literature.
The primary constituents of theories are undoubtedly the concepts in which the researcher categorises the social world as it is observed. Scientists do not always attach the same meaning to concepts. In addition, the social scientist, in contrast to the natural scientist, usually employs lay or everyday terms. It is mainly for these two reasons that concepts with more than one meaning are sometimes used by researchers in a somewhat individualistic way. This happens more frequently when the research deals with problems in which the researcher is personally involved. These considerations form the context of the second guideline.

Guideline 2

All central or important concepts or constructs of the study should be defined explicitly.

Metatheoretical Guidelines

It is generally accepted in philosophy of science today that no research findings can be conclusively proved on the basis of empirical research data. In different stages of the scientific research process and for different reasons the researcher is compelled to make assumptions about specific theories and methodological strategies that are not tested in the specific study. One important category of such assumptions comprises the metatheoretical or metaphysical assumptions underlying the theories, models or paradigms that form the definitive context of the study. Because of the argumentative and public nature of scientific communication, this often tacit dimension of scientific practice should be made explicit.

Guideline 3

The scientist should spell out clearly the metatheoretical assumptions, commitments, (pre)suppositions and beliefs that are applicable to her research.

Methodological Guidelines

The quality of research findings is directly dependent on the methodological procedures followed in the study. For this reason researchers should provide a complete account of the way in which their research has been planned, structured and executed. The most important steps in the research process, namely the statement of the research problem, the research design, and information on data collection, analysis and interpretation should be incorporated in the specification of methodological guidelines.

Research Problem

In empirical research the research hypothesis directs the investigation, while in theoretical research, the central theoretical thesis serves this purpose. In addition, core concepts in the statement of the problem must be clearly defined and in empirical research, also operationalised.

Guideline 4

The research hypothesis or central theoretical thesis must be clearly formulated and operationalised.

Research Design

A research design is an exposition or plan of how the researcher plans to address the research problem that has been formulated. The objective of the research design is to plan, structure and execute the relevant project in such a way that the validity of the findings is maximised. Three aspects are usually included in the research design, namely the aim of the research, data or information sources and considerations of validity and reliability.

The Aim of the Research

At the very outset the researcher should state the aim of the project, whether it is exploratory or validational, hypothesis testing or hypothesis generation.

Guideline 5

The research report should specify the aim or objective(s) being pursued.

Data or Information Sources

There are a variety of data sources available for social science research: physical sources, documentary sources, and indirect and direct observation. Indirect observation includes the use of questionnaires, interviews, scales and tests. Irrespective of the data sources used, the researcher should also report on:
• The nature, credibility and relevance of the sources; and
• The representativeness of the sources.
In empirical research on individuals, representativeness refers to the problems related to sampling. In these cases the researcher is required to provide adequate information on aspects such as the sampling techniques and the demographic characteristics of the sample.

Guideline 6

Information should be provided on the nature, credibility, relevance and representativeness of data and information sources.

Reliability and Validity

In the research design stage researchers should already be considering the factors that could prevent them from making valid inferences. In theoretical research this problem emerges as a problem of objectivity. Examples of such factors include selection of only those views and arguments that support the researcher's views, insufficient provision of supporting evidence or reasons for the final conclusion, and implicit prejudice. In empirical research the researcher should take account of a variety of confounding variables that could threaten the final validity of his findings. The aim of a research design is, after all, to employ various measures to control for systematic bias, confounding variables and other sources of error.

Guideline 7

The research report should include information on the ways in which the reliability or validity and objectivity of the data or information have been controlled.

Data Collection

Regarding data collection, the researcher should report on the methods and techniques of data collection, the period during which the project was executed, the events that could at the time have had an influence on the data collected, and the controls used to ensure that the process of data collection yielded reliable data. Where standardised measuring instruments such as questionnaires and scales have been used they are usually appended at the end of the thesis or dissertation. In the case of journal articles, these would be incorporated into the article.

Guideline 8

The research report should contain detailed information on the methods and context of the data collection.

Data Analysis

Analysis includes both qualitative approaches such as historical and conceptual analysis, and quantitative approaches. It is generally accepted that empirical data can be analysed in different ways. Different approaches to such analysis can sometimes lead to different findings. A few examples are the different ways in which large data sets can be reduced. These include analysis of covariance, and bivariate and multivariate approaches. Since different approaches often enjoy virtually the same validity, the researcher must give reasons for specific choices.

Guideline 9

The procedures used for analysis should be described in full.

Data Interpretation

In both theoretical and empirical research, the report should be concluded with an interpretation of the findings against the background of the original research problem. The criteria of objectivity demand that the interpretation should not be selective, but that data should be reported in full. A valid conclusion is one in which the data or reasons/evidence provide both sufficient and relevant grounds for the conclusion.

Guideline 10

The interpretation and conclusions should be provided within the framework of the original research problem and design and should include all the relevant information or data.

Technical Guidelines

Diverse factors govern the technical editing of a report. The nature and extent of an investigation will obviously determine aspects like the length of the report. The most important aspects to be taken into consideration when editing a report, article or dissertation are:
• The format: the length of the text and line spacing;
• The length;
• The number of copies;
• The reference style;
• The necessity for acknowledgements; and
• The summary.
The precise nature and content of each of these factors will depend on the context of the report. Articles submitted to journals are usually required to comply with the conventions of the journal in question. Universities also have strict rules regarding theses or dissertations submitted to them, while organisations such as the Human Sciences Research Council have their own sets of criteria for research reports. The only guideline that can hence be formulated is guideline 11.

Guideline 11 The research report should comply with the technical guidelines in terms of format, length, number of copies, reference style and summary as laid down by the organization or journal concerned.

Conclusion
The structure of a typical dissertation or thesis is summarised in the table below.
Table. Structure of a thesis.
Introduction (one or two chapters, depending on the scope of the literature review). Main purpose:
• Formulate research problem
• Purpose of study
• Literature review
• Definitions of key concepts
Research design (usually one chapter). Main purpose:
• Operationalisation
• Description of measuring instrument
• Sampling design
• Description of data collection
• Methods of data analysis
Findings (numerous chapters, depending on how the data is organised and presented). Main purpose:
• Discuss results in terms of research problem and research hypothesis (where appropriate)
• Relate findings to literature review
Conclusions, interpretation and recommendations (final chapter). Main purpose:
• Integrate results into main conclusions as they impact on the central research problem of the study
• Where appropriate, make recommendations about further research and other activities

4

Normal Distribution

In probability theory, the normal (or Gaussian or Gauss or Laplace-Gauss) distribution is a very common continuous probability distribution. Normal distributions are important in statistics and are often used in the natural and social sciences to represent real-valued random variables whose distributions are not known. A random variable with a Gaussian distribution is said to be normally distributed and is called a normal deviate. The normal distribution is useful because of the central limit theorem. In its most general form, under some conditions (which include finite variance), it states that averages of samples of observations of random variables independently drawn from independent distributions converge in distribution to the normal, that is, become normally distributed when the number of observations is sufficiently large. Physical quantities that are expected to be the sum of many independent processes (such as measurement errors) often have distributions that are nearly normal. Moreover, many results and methods (such as propagation of uncertainty and least squares parameter fitting) can be derived analytically in explicit form when the relevant variables are normally distributed. The normal distribution is sometimes informally called the bell curve. However, many other distributions are bell-shaped (such as the Cauchy, Student’s t, and logistic distributions). The probability density of the normal distribution is,

f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}
where,
• µ is the mean or expectation of the distribution (and also its median and mode),
• σ is the standard deviation, and
• σ² is the variance.
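As a quick check of this density formula, the short sketch below (Python, standard library only) evaluates it directly and compares the result with the pdf method of statistics.NormalDist; the parameter values are arbitrary illustrations, not values taken from the text.

```python
import math
from statistics import NormalDist

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated directly from the formula above."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

mu, sigma, x = 2.0, 1.5, 3.0            # hypothetical values
print(normal_pdf(x, mu, sigma))          # direct evaluation of the formula
print(NormalDist(mu, sigma).pdf(x))      # the same value from the standard library
```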

DEFINITION

STANDARD NORMAL DISTRIBUTION
The simplest case of a normal distribution is known as the standard normal distribution. This is a special case when µ = 0 and σ = 1, and it is described by this probability density function:
\varphi(x) = \frac{1}{\sqrt{2\pi}} \, e^{-\frac{1}{2}x^2}
The factor 1/√(2π) in this expression ensures that the total area under the curve ϕ(x) is equal to one. The factor 1/2 in the exponent ensures that the distribution has unit variance (and therefore also unit standard deviation). This function is symmetric around x = 0, where it attains its maximum value 1/√(2π), and has inflection points at x = +1 and x = –1. Authors may differ on which normal distribution should be called the “standard” one. Gauss defined the standard normal as having variance σ² = 1/2, that is

\varphi(x) = \frac{e^{-x^2}}{\sqrt{\pi}}
Stigler goes even further, defining the standard normal with variance σ² = 1/(2π):

\varphi(x) = e^{-\pi x^2}

GENERAL NORMAL DISTRIBUTION
Every normal distribution is a version of the standard normal distribution whose domain has been stretched by a factor σ (the standard deviation) and then translated by µ (the mean value):
f(x \mid \mu, \sigma^2) = \frac{1}{\sigma} \, \varphi\!\left(\frac{x - \mu}{\sigma}\right).
The probability density must be scaled by 1/σ so that the integral is still 1. If Z is a standard normal deviate, then X = σZ + µ will have a normal distribution with expected value µ and standard deviation σ. Conversely, if X is a normal deviate with parameters µ and σ², then Z = (X – µ)/σ will have a standard normal distribution.

This variate is called the standardized form of X. Every normal distribution is the exponential of a quadratic function:

f(x) = e^{ax^2 + bx + c}
where a < 0 and c = b²/(4a) + ln(–a/π)/2. In this form, the mean value is µ = –b/(2a), and the variance is σ² = –1/(2a). For the standard normal distribution, a = –1/2, b = 0, and c = –ln(2π)/2.
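The relationship between (a, b, c) and (µ, σ) is easy to verify numerically. The sketch below is a minimal illustration with arbitrarily chosen µ and σ (standard library only): it builds the quadratic-form coefficients from µ and σ and confirms that exp(ax² + bx + c) reproduces the normal density.

```python
import math
from statistics import NormalDist

mu, sigma = 1.0, 2.0                       # hypothetical parameters
a = -1.0 / (2.0 * sigma ** 2)
b = mu / sigma ** 2
c = b ** 2 / (4.0 * a) + math.log(-a / math.pi) / 2.0

x = 0.7                                    # arbitrary evaluation point
quadratic_form = math.exp(a * x ** 2 + b * x + c)
print(quadratic_form)                      # exponential of the quadratic function
print(NormalDist(mu, sigma).pdf(x))        # agrees with the normal density
```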

NOTATION
The probability density of the standard Gaussian distribution (standard normal distribution, with zero mean and unit variance) is often denoted with the Greek letter φ. The alternative form of the Greek letter phi, ϕ, is also used quite often. The normal distribution is often referred to as N(µ, σ²) or 𝒩(µ, σ²). Thus when a random variable X is distributed normally with mean µ and variance σ², one may write X ~ N(µ, σ²).

ALTERNATIVE PARAMETERIZATIONS
Some authors advocate using the precision τ as the parameter defining the width of the distribution, instead of the standard deviation σ or the variance σ². The precision is normally defined as the reciprocal of the variance, 1/σ². The formula for the distribution then becomes,

f(x) = \sqrt{\frac{\tau}{2\pi}} \, e^{-\tau (x - \mu)^2 / 2}.
This choice is claimed to have advantages in numerical computations when σ is very close to zero, and to simplify formulas in some contexts, such as in the Bayesian inference of variables with multivariate normal distribution. Also, the reciprocal of the standard deviation τ′ = 1/σ might be defined as the precision, and the expression of the normal distribution becomes,

f(x) = \frac{\tau'}{\sqrt{2\pi}} \, e^{-(\tau')^2 (x - \mu)^2 / 2}.
According to Stigler, this formulation is advantageous because of a much simpler and easier-to-remember formula, and simple approximate formulas for the quantiles of the distribution.

HISTORY

DEVELOPMENT
Some authors attribute the credit for the discovery of the normal distribution to de Moivre, who in 1738 published in the second edition of his “The Doctrine of Chances” the study of the coefficients in the binomial expansion of (a + b)^n. De Moivre proved that the middle term in this expansion has the approximate magnitude of 2/√(2πn), and that “If m or ½n be a Quantity infinitely great, then the Logarithm of the Ratio, which a Term distant from the middle by the Interval ℓ, has to the middle Term, is −2ℓℓ/n.” Although this theorem can be interpreted as the first obscure expression for the normal probability law, Stigler points out that de Moivre himself did not interpret his results as anything more than the approximate rule for the binomial coefficients, and in particular de Moivre lacked the concept of the probability density function. In 1809 Gauss published his monograph “Theoria motus corporum coelestium in sectionibus conicis solem ambientium” where among other things he introduces several important statistical concepts, such as the method of least squares, the method of maximum likelihood, and the normal distribution. Gauss used M, M′, M′′, … to denote the measurements of some unknown quantity V, and sought the “most probable” estimator: the one that maximizes the probability ϕ(M – V) · ϕ(M′ – V) · ϕ(M′′ – V) · … of obtaining the observed experimental results. In his notation ϕ∆ is the probability law of the measurement errors of magnitude ∆. Not knowing what the function ϕ is, Gauss requires that his method should reduce to the well-known answer: the arithmetic mean of the measured values. Starting from these principles, Gauss demonstrates that the only law that rationalizes the choice of arithmetic mean as an estimator of the location parameter is the normal law of errors:
\varphi(\Delta) = \frac{h}{\sqrt{\pi}} \, e^{-h^2 \Delta^2},
where h is “the measure of the precision of the observations”. Using this normal law as a generic model for errors in the experiments, Gauss formulates what is now known as the non-linear weighted least squares (NWLS) method. Although Gauss was the first to suggest the normal distribution law, Laplace made significant contributions. It was Laplace who first posed the problem of aggregating several observations in 1774, although his own solution led to the Laplacian distribution. It was Laplace who first calculated the value of the integral ∫ e^{−t²} dt = √π in 1782, providing the normalization constant for the normal distribution. Finally, it was Laplace who in 1810 proved and presented to the Academy the fundamental central limit theorem, which emphasized the theoretical importance of the normal distribution. It is of interest to note that in 1809 the American mathematician Adrain published two derivations of the normal probability law, simultaneously and independently from Gauss. His works remained largely unnoticed by the scientific community, until in 1871 they were “rediscovered” by Abbe. In the middle of the 19th century Maxwell demonstrated that the normal distribution is not just a convenient mathematical tool, but may also occur in natural phenomena:

The number of particles whose velocity, resolved in a certain direction, lies between x and x + dx is,

N \, \frac{1}{\alpha \sqrt{\pi}} \, e^{-x^2/\alpha^2} \, dx.

NAMING
Since its introduction, the normal distribution has been known by many different names: the law of error, the law of facility of errors, Laplace’s second law, Gaussian law, etc. Gauss himself apparently coined the term with reference to the “normal equations” involved in its applications, with normal having its technical meaning of orthogonal rather than “usual”. However, by the end of the 19th century some authors had started using the name normal distribution, where the word “normal” was used as an adjective – the term now being seen as a reflection of the fact that this distribution was seen as typical, common – and thus “normal”. Peirce (one of those authors) once defined “normal” thus: “... the ‘normal’ is not the average (or any other kind of mean) of what actually occurs, but of what would, in the long run, occur under certain circumstances.” Around the turn of the 20th century Pearson popularized the term normal as a designation for this distribution:
• “Many years ago I called the Laplace–Gaussian curve the normal curve, which name, while it avoids an international question of priority, has the disadvantage of leading people to believe that all other distributions of frequency are in one sense or another ‘abnormal’.” – Pearson (1920)
Also, it was Pearson who first wrote the distribution in terms of the standard deviation σ as in modern notation. Soon after this, in 1915, Fisher added the location parameter to the formula for the normal distribution, expressing it in the way it is written nowadays:

df = \frac{1}{\sqrt{2\sigma^2 \pi}} \, e^{-(x - m)^2/(2\sigma^2)} \, dx
The term “standard normal”, which denotes the normal distribution with zero mean and unit variance, came into general use around the 1950s, appearing in the popular textbooks by P.G. Hoel (1947), “Introduction to Mathematical Statistics”, and A.M. Mood (1950), “Introduction to the Theory of Statistics”. The name “Gaussian distribution” honours Carl Friedrich Gauss, who introduced the distribution in 1809 as a way of rationalizing the method of least squares as outlined above. Among English speakers, both “normal distribution” and “Gaussian distribution” are in common use, with different terms preferred by different communities.

PROPERTIES
The normal distribution is the only absolutely continuous distribution whose cumulants beyond the first two (i.e., other than the mean and variance) are zero. It is also the continuous distribution with the maximum entropy for a specified mean and variance. Geary has shown, assuming that the mean and variance are finite, that the normal distribution is the only distribution where the mean and variance calculated from a set of independent draws are independent of each other. The normal distribution is a subclass of the elliptical distributions. The normal distribution is symmetric about its mean, and is non-zero over the entire real line. As such it may not be a suitable model for variables that are inherently positive or strongly skewed, such as the weight of a person or the price of a share. Such variables may be better described by other distributions, such as the log-normal distribution or the Pareto distribution. The value of the normal distribution is practically zero when the value x lies more than a few standard deviations away from the mean. Therefore, it may not be an appropriate model when one expects a significant fraction of outliers—values that lie many standard deviations away from the mean—and least squares and other statistical inference methods that are optimal for normally distributed variables often become highly unreliable when applied to such data. In those cases, a more heavy-tailed distribution should be assumed and the appropriate robust statistical inference methods applied. The Gaussian distribution belongs to the family of stable distributions which are the attractors of sums of independent, identically distributed distributions whether or not the mean or variance is finite. Except for the Gaussian which is a limiting case, all stable distributions have heavy tails and infinite variance. It is one of the few distributions that are stable and that have probability density functions that can be expressed analytically, the others being the Cauchy distribution and the Lévy distribution.

SYMMETRIES AND DERIVATIVES
The normal distribution with density f(x) (mean µ and standard deviation σ > 0) has the following properties:
• It is symmetric around the point x = µ, which is at the same time the mode, the median and the mean of the distribution.
• It is unimodal: its first derivative is positive for x < µ, negative for x > µ, and zero only at x = µ.
• The area under the curve and over the x-axis is unity.
• Its density has two inflection points (where the second derivative of f is zero and changes sign), located one standard deviation away from the mean, namely at x = µ – σ and x = µ + σ.
• Its density is log-concave.
• Its density is infinitely differentiable, indeed supersmooth of order 2.

Furthermore, the density ϕ of the standard normal distribution (i.e., µ = 0 and σ = 1) also has the following properties:
• Its first derivative is ϕ′(x) = −x ϕ(x).
• Its second derivative is ϕ′′(x) = (x² − 1) ϕ(x).
• More generally, its nth derivative is ϕ⁽ⁿ⁾(x) = (−1)ⁿ Heₙ(x) ϕ(x), where Heₙ(x) is the nth (probabilist) Hermite polynomial.
• The probability that a normally distributed variable X with known µ and σ is in a particular set can be calculated by using the fact that the fraction Z = (X – µ)/σ has a standard normal distribution.

MOMENTS
The plain and absolute moments of a variable X are the expected values of X^p and |X|^p, respectively. If the expected value µ of X is zero, these moments are called central moments. Usually we are interested only in moments with integer order p. If X has a normal distribution, these moments exist and are finite for any p whose real part is greater than –1. For any non-negative integer p, the plain central moments are:
E[X^p] = \begin{cases} 0 & \text{if } p \text{ is odd,} \\ \sigma^p (p-1)!! & \text{if } p \text{ is even.} \end{cases}
Here n!! denotes the double factorial, that is, the product of all numbers from n to 1 that have the same parity as n. The central absolute moments coincide with plain moments for all even orders, but are nonzero for odd orders. For any non-negative integer p,
E[|X|^p] = \sigma^p (p-1)!! \cdot \begin{cases} \sqrt{2/\pi} & \text{if } p \text{ is odd} \\ 1 & \text{if } p \text{ is even} \end{cases} = \sigma^p \cdot \frac{2^{p/2} \, \Gamma\!\left(\frac{p+1}{2}\right)}{\sqrt{\pi}}.
The last formula is valid also for any non-integer p > –1. When the mean µ ≠ 0, the plain and absolute moments can be expressed in terms of confluent hypergeometric functions ₁F₁ and U:
E[X^p] = \sigma^p \cdot (-i\sqrt{2})^p \, U\!\left(-\frac{p}{2}, \frac{1}{2}, -\frac{1}{2}\left(\frac{\mu}{\sigma}\right)^2\right),
E[|X|^p] = \sigma^p \cdot 2^{p/2} \, \frac{\Gamma\!\left(\frac{1+p}{2}\right)}{\sqrt{\pi}} \; {}_1F_1\!\left(-\frac{p}{2}, \frac{1}{2}, -\frac{1}{2}\left(\frac{\mu}{\sigma}\right)^2\right).

These expressions remain valid even if p is not an integer. The first eight moments are:
Order | Non-central moment | Central moment
1 | µ | 0
2 | µ² + σ² | σ²
3 | µ³ + 3µσ² | 0
4 | µ⁴ + 6µ²σ² + 3σ⁴ | 3σ⁴
5 | µ⁵ + 10µ³σ² + 15µσ⁴ | 0
6 | µ⁶ + 15µ⁴σ² + 45µ²σ⁴ + 15σ⁶ | 15σ⁶
7 | µ⁷ + 21µ⁵σ² + 105µ³σ⁴ + 105µσ⁶ | 0
8 | µ⁸ + 28µ⁶σ² + 210µ⁴σ⁴ + 420µ²σ⁶ + 105σ⁸ | 105σ⁸
The expectation of X conditioned on the event that X lies in an interval [a, b] is given by

E[X \mid a < X < b] = \mu - \sigma^2 \, \frac{f(b) - f(a)}{F(b) - F(a)}
where f and F respectively are the density and the cumulative distribution function of X. For b = ∞ this is known as the inverse Mills ratio. Note that above, the density f of X is used instead of the standard normal density as in the inverse Mills ratio, so here we have σ² instead of σ.
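The truncated-mean formula above is easy to sanity-check by simulation. The sketch below is a minimal illustration with arbitrarily chosen µ, σ, a and b (not values from the text), comparing the analytic expression against the sample mean of draws falling inside [a, b].

```python
from statistics import NormalDist, mean

mu, sigma, a, b = 1.0, 2.0, 0.0, 3.0         # hypothetical parameters
d = NormalDist(mu, sigma)

# Analytic conditional expectation E[X | a < X < b] from the formula above
analytic = mu - sigma ** 2 * (d.pdf(b) - d.pdf(a)) / (d.cdf(b) - d.cdf(a))

# Monte Carlo estimate: average of draws that land inside (a, b)
draws = d.samples(200_000, seed=42)
empirical = mean(x for x in draws if a < x < b)

print(analytic, empirical)                    # the two values should be close
```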

FOURIER TRANSFORM AND CHARACTERISTIC FUNCTION
The Fourier transform of a normal density f with mean µ and standard deviation σ is,
\hat{f}(t) = \int_{-\infty}^{\infty} f(x) \, e^{-itx} \, dx = e^{-i\mu t} \, e^{-\frac{1}{2}(\sigma t)^2}
where i is the imaginary unit. If the mean µ = 0, the first factor is 1, and the Fourier transform is, apart from a constant factor, a normal density on the frequency domain, with mean 0 and standard deviation 1/σ. In particular, the standard normal distribution ϕ is an eigenfunction of the Fourier transform. In probability theory, the Fourier transform of the probability distribution of a real-valued random variable X is closely connected to the characteristic function ϕ_X(t) of that variable, which is defined as the expected value of e^{itX}, as a function of the real variable t (the frequency parameter of the Fourier transform). This definition can be analytically extended to a complex-valued variable t. The relation between both is:
\varphi_X(t) = \hat{f}(-t)

MOMENT AND CUMULANT GENERATING FUNCTIONS
The moment generating function of a real random variable X is the expected value of e^{tX}, as a function of the real parameter t. For a normal distribution with density f, mean µ and deviation σ, the moment generating function exists and is equal to,

M(t) = E[e^{tX}] = \hat{f}(-it) = e^{\mu t} \, e^{\frac{1}{2}\sigma^2 t^2}
The cumulant generating function is the logarithm of the moment generating function, namely,
g(t) = \ln M(t) = \mu t + \tfrac{1}{2} \sigma^2 t^2
Since this is a quadratic polynomial in t, only the first two cumulants are nonzero, namely the mean µ and the variance σ².

CUMULATIVE DISTRIBUTION FUNCTION
The cumulative distribution function (CDF) of the standard normal distribution, usually denoted with the capital Greek letter Φ, is the integral,

\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-t^2/2} \, dt
In statistics one often uses the related error function, or erf(x), defined as the probability of a random variable with normal distribution of mean 0 and variance 1/2 falling in the range [–x, x]; that is

\mathrm{erf}(x) = \frac{2}{\sqrt{\pi}} \int_{0}^{x} e^{-t^2} \, dt
These integrals cannot be expressed in terms of elementary functions, and are often said to be special functions. However, many numerical approximations are known. The two functions are closely related, namely
\Phi(x) = \frac{1}{2}\left[1 + \mathrm{erf}\!\left(\frac{x}{\sqrt{2}}\right)\right]
For a generic normal distribution with density f, mean µ and deviation σ, the cumulative distribution function is,
F(x) = \Phi\!\left(\frac{x - \mu}{\sigma}\right) = \frac{1}{2}\left[1 + \mathrm{erf}\!\left(\frac{x - \mu}{\sigma\sqrt{2}}\right)\right]
The complement of the standard normal CDF, Q(x) = 1 − Φ(x), is often called the Q-function, especially in engineering texts. It gives the probability that the value of a standard normal random variable X will exceed x: P(X > x). Other definitions of the Q-function, all of which are simple transformations of Φ, are also used occasionally. The graph of the standard normal CDF Φ has 2-fold rotational symmetry around the point (0, 1/2); that is, Φ(−x) = 1 − Φ(x). Its antiderivative (indefinite integral) is,
\int \Phi(x) \, dx = x\,\Phi(x) + \varphi(x) + C.
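The Φ–erf relationship above translates directly into code. The short sketch below (standard library only; the µ, σ and x values are arbitrary illustrations) computes the standard and general normal CDF via math.erf and checks them against statistics.NormalDist.

```python
import math
from statistics import NormalDist

def std_normal_cdf(x):
    """Phi(x) = (1/2) * [1 + erf(x / sqrt(2))]."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def normal_cdf(x, mu, sigma):
    """F(x) = Phi((x - mu) / sigma)."""
    return std_normal_cdf((x - mu) / sigma)

print(std_normal_cdf(1.0), NormalDist().cdf(1.0))              # both ≈ 0.8413
print(normal_cdf(3.0, 2.0, 1.5), NormalDist(2.0, 1.5).cdf(3.0))  # identical values
```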

The cumulative distribution function (CDF) of the standard normal distribution can be expanded by integration by parts into a series:

\Phi(x) = \frac{1}{2} + \frac{1}{\sqrt{2\pi}} \cdot e^{-x^2/2} \left[ x + \frac{x^3}{3} + \frac{x^5}{3 \cdot 5} + \cdots + \frac{x^{2n+1}}{(2n+1)!!} + \cdots \right]
where !! denotes the double factorial.

STANDARD DEVIATION AND COVERAGE
About 68% of values drawn from a normal distribution are within one standard deviation σ of the mean; about 95% of the values lie within two standard deviations; and about 99.7% are within three standard deviations. This fact is known as the 68-95-99.7 (empirical) rule, or the 3-sigma rule. More precisely, the probability that a normal deviate lies in the range between µ – nσ and µ + nσ is given by
F(\mu + n\sigma) - F(\mu - n\sigma) = \Phi(n) - \Phi(-n) = \mathrm{erf}\!\left(\frac{n}{\sqrt{2}}\right).
The values for n = 1, 2, …, 6 can be computed to high precision from this formula, as shown below.
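A minimal sketch (standard library only) that evaluates erf(n/√2) for n = 1, …, 6, reproducing the coverage probabilities behind the 68-95-99.7 rule.

```python
from math import erf, sqrt

# Probability that a normal deviate lies within n standard deviations of the mean
for n in range(1, 7):
    coverage = erf(n / sqrt(2.0))
    print(f"n = {n}: P(|X - mu| < {n} sigma) = {coverage:.12f}")
# n = 1 prints roughly 0.6827, n = 2 roughly 0.9545, n = 3 roughly 0.9973
```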

QUANTILE FUNCTION
The quantile function of a distribution is the inverse of the cumulative distribution function. The quantile function of the standard normal distribution is called the probit function, and can be expressed in terms of the inverse error function:
\Phi^{-1}(p) = \sqrt{2} \, \mathrm{erf}^{-1}(2p - 1), \quad p \in (0, 1).
For a normal random variable with mean µ and variance σ², the quantile function is,
F^{-1}(p) = \mu + \sigma \Phi^{-1}(p) = \mu + \sigma \sqrt{2} \, \mathrm{erf}^{-1}(2p - 1), \quad p \in (0, 1).
The quantile Φ⁻¹(p) of the standard normal distribution is commonly denoted as z_p. These values are used in hypothesis testing, construction of confidence intervals and Q-Q plots.

A normal random variable X will exceed µ + z_pσ with probability 1 – p, and will lie outside the interval µ ± z_pσ with probability 2(1 – p). In particular, the quantile z_0.975 is 1.96; therefore a normal random variable will lie outside the interval µ ± 1.96σ in only 5% of cases.

The following table gives the quantile z_p such that X will lie in the range µ ± z_pσ with a specified probability p. These values are useful to determine tolerance intervals for sample averages and other statistical estimators with normal (or asymptotically normal) distributions. Note that the table shows √2 erf⁻¹(p) = Φ⁻¹((p + 1)/2), not Φ⁻¹(p) as defined above.
p     | z_p            | p           | z_p
0.80  | 1.281551565545 | 0.999       | 3.290526731492
0.90  | 1.644853626951 | 0.9999      | 3.890591886413
0.95  | 1.959963984540 | 0.99999     | 4.417173413469
0.98  | 2.326347874041 | 0.999999    | 4.891638475699
0.99  | 2.575829303549 | 0.9999999   | 5.326723886384
0.995 | 2.807033768344 | 0.99999999  | 5.730728868236
0.998 | 3.090232306168 | 0.999999999 | 6.109410204869
For small p, the quantile function has the useful asymptotic expansion,

\Phi^{-1}(p) = -\sqrt{\ln\frac{1}{p^2} - \ln\ln\frac{1}{p^2} - \ln(2\pi)} + o(1)
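The probit values in the table above can be reproduced with the standard library's inverse CDF; the sketch below is a minimal illustration (no third-party packages assumed).

```python
from statistics import NormalDist

std = NormalDist()                        # standard normal, mu = 0, sigma = 1

# Quantile (probit) of the standard normal: Phi^{-1}(p)
print(std.inv_cdf(0.975))                 # ≈ 1.959963984540

# z_p such that X lies in mu ± z_p * sigma with two-sided probability p,
# i.e. z_p = Phi^{-1}((1 + p) / 2) as noted above.
def z_two_sided(p):
    return std.inv_cdf((1.0 + p) / 2.0)

print(z_two_sided(0.95))                  # ≈ 1.96, matching the table row p = 0.95
print(z_two_sided(0.99))                  # ≈ 2.5758
```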

ZERO-VARIANCE LIMIT
In the limit when σ tends to zero, the probability density f(x) eventually tends to zero at any x ≠ µ, but grows without limit if x = µ, while its integral remains equal to 1. Therefore, the normal distribution cannot be defined as an ordinary function when σ = 0. However, one can define the normal distribution with zero variance as a generalized function; specifically, as Dirac’s “delta function” δ translated by the mean µ, that is f(x) = δ(x – µ). Its CDF is then the Heaviside step function translated by the mean µ, namely,

F(x) = \begin{cases} 0 & \text{if } x < \mu \\ 1 & \text{if } x \geq \mu \end{cases}

OCCURRENCE AND APPLICATIONS
The occurrence of the normal distribution in practical problems can be loosely classified into four categories:
1. Exactly normal distributions;
2. Approximately normal laws, for example when such approximation is justified by the central limit theorem;

3. Distributions modeled as normal – the normal distribution being the distribution with maximum entropy for a given mean and variance; and
4. Regression problems – the normal distribution being found after systematic effects have been modeled sufficiently well.

EXACT NORMALITY
Certain quantities in physics are distributed normally, as was first demonstrated by James Clerk Maxwell. Examples of such quantities are:
• Probability density function of a ground state in a quantum harmonic oscillator.
• The position of a particle that experiences diffusion. If initially the particle is located at a specific point (that is, its probability distribution is the Dirac delta function), then after time t its location is described by a normal distribution with variance t, which satisfies the diffusion equation ∂f(x, t)/∂t = (1/2) ∂²f(x, t)/∂x². If the initial location is given by a certain density function g(x), then the density at time t is the convolution of g and the normal PDF.

APPROXIMATE NORMALITY
Approximately normal distributions occur in many situations, as explained by the central limit theorem. When the outcome is produced by many small effects acting additively and independently, its distribution will be close to normal. The normal approximation will not be valid if the effects act multiplicatively (instead of additively), or if there is a single external influence that has a considerably larger magnitude than the rest of the effects.
• In counting problems, where the central limit theorem includes a discrete-to-continuum approximation and where infinitely divisible and decomposable distributions are involved, such as:
– Binomial random variables, associated with binary response variables;
– Poisson random variables, associated with rare events.
• Thermal radiation has a Bose–Einstein distribution on very short time scales, and a normal distribution on longer timescales due to the central limit theorem.

ASSUMED NORMALITY
• “I can only recognize the occurrence of the normal curve – the Laplacian curve of errors – as a very abnormal phenomenon. It is roughly approximated to in certain distributions; for this reason, and on account of its beautiful simplicity, we may, perhaps, use it as a first approximation, particularly in theoretical investigations.” – Pearson (1901)
There are statistical methods to empirically test that assumption.

• In biology, the logarithm of various variables tend to have a normal distribution, that is, they tend to have a log-normal distribution (after separation on male/female subpopulations), with examples including:
– Measures of size of living tissue (length, height, skin area, weight);
– The length of inert appendages (hair, claws, nails, teeth) of biological specimens, in the direction of growth; presumably the thickness of tree bark also falls under this category;
– Certain physiological measurements, such as blood pressure of adult humans.
• In finance, in particular the Black–Scholes model, changes in the logarithm of exchange rates, price indices, and stock market indices are assumed normal (these variables behave like compound interest, not like simple interest, and so are multiplicative). Some mathematicians such as Benoit Mandelbrot have argued that log-Levy distributions, which possess heavy tails, would be a more appropriate model, in particular for the analysis of stock market crashes. The use of the assumption of normal distribution occurring in financial models has also been criticized by Nassim Nicholas Taleb in his works.
• Measurement errors in physical experiments are often modeled by a normal distribution. This use of a normal distribution does not imply that one is assuming the measurement errors are normally distributed; rather, using the normal distribution produces the most conservative predictions possible given only knowledge about the mean and variance of the errors.
• In standardized testing, results can be made to have a normal distribution by either selecting the number and difficulty of questions (as in the IQ test) or transforming the raw test scores into “output” scores by fitting them to the normal distribution. For example, the SAT’s traditional range of 200–800 is based on a normal distribution with a mean of 500 and a standard deviation of 100.
• Many scores are derived from the normal distribution, including percentile ranks (“percentiles” or “quantiles”), normal curve equivalents, stanines, z-scores, and T-scores. Additionally, some behavioral statistical procedures assume that scores are normally distributed; for example, t-tests and ANOVAs. Bell curve grading assigns relative grades based on a normal distribution of scores.
• In hydrology the distribution of long duration river discharge or rainfall, e.g., monthly and yearly totals, is often thought to be practically normal according to the central limit theorem. Fitting the normal distribution to ranked October rainfalls, for example with CumFreq, can show a 90% confidence belt based on the binomial distribution, with the rainfall data represented by plotting positions as part of the cumulative frequency analysis.

PRODUCED NORMALITY
In regression analysis, lack of normality in residuals simply indicates that the model postulated is inadequate in accounting for the tendency in the data and needs to be augmented; in other words, normality in residuals can always be achieved given a properly constructed model.

GENERATING VALUES FROM NORMAL DISTRIBUTION
In computer simulations, especially in applications of the Monte-Carlo method, it is often desirable to generate values that are normally distributed. The algorithms listed below all generate standard normal deviates, since a N(µ, σ²) deviate can be generated as X = µ + σZ, where Z is standard normal. All these algorithms rely on the availability of a random number generator U capable of producing uniform random variates.
• The most straightforward method is based on the probability integral transform property: if U is distributed uniformly on (0,1), then Φ⁻¹(U) will have the standard normal distribution. The drawback of this method is that it relies on calculation of the probit function Φ⁻¹, which cannot be done analytically. Some approximate methods are described in Hart (1968) and in the erf article. Wichura gives a fast algorithm for computing this function to 16 decimal places, which is used by R to compute random variates of the normal distribution.
• An easy-to-program approximate approach that relies on the central limit theorem is as follows: generate 12 uniform U(0,1) deviates, add them all up, and subtract 6 – the resulting random variable will have approximately a standard normal distribution. In truth, the distribution will be Irwin–Hall, which is a 12-section eleventh-order polynomial approximation to the normal distribution. This random deviate will have a limited range of (–6, 6).
• The Box–Muller method uses two independent random numbers U and V distributed uniformly on (0,1). Then the two random variables X and Y,
X = \sqrt{-2 \ln U} \, \cos(2\pi V), \qquad Y = \sqrt{-2 \ln U} \, \sin(2\pi V),
will both have the standard normal distribution, and will be independent. This formulation arises because for a bivariate normal random vector (X, Y) the squared norm X² + Y² will have the chi-squared distribution with two degrees of freedom, which is an easily generated exponential random variable corresponding to the quantity –2ln(U) in these equations; and the angle is distributed uniformly around the circle, chosen by the random variable V.
• The Marsaglia polar method is a modification of the Box–Muller algorithm which does not require computation of the functions sin() and cos(). In this method U and V are drawn from the uniform (–1,1) distribution, and then S = U² + V² is computed. If S is greater than or equal to one, then the method starts over; otherwise the two quantities

X = U\sqrt{\frac{-2 \ln S}{S}}, \qquad Y = V\sqrt{\frac{-2 \ln S}{S}}
are returned. Again, X and Y will be independent and standard normally distributed.
• The ratio method is a rejection method. The algorithm proceeds as follows:
– Generate two independent uniform deviates U and V;
– Compute X = √(8/e) (V − 0.5)/U;
– Optional: if X² ≤ 5 – 4e^{1/4}U then accept X and terminate the algorithm;
– Optional: if X² ≥ 4e^{–1.35}/U + 1.4 then reject X and start over from step 1;
– If X² ≤ –4 ln U then accept X, otherwise start over the algorithm.
The two optional steps allow the evaluation of the logarithm in the last step to be avoided in most cases. These steps can be greatly improved so that the logarithm is rarely evaluated.
• The ziggurat algorithm is faster than the Box–Muller transform and still exact. In about 97% of all cases it uses only two random numbers, one random integer and one random uniform, one multiplication and an if-test. Only in 3% of the cases, where the combination of those two falls outside the “core of the ziggurat” (a kind of rejection sampling using logarithms), do exponentials and more uniform random numbers have to be employed.
• Integer arithmetic can be used to sample from the standard normal distribution. This method is exact in the sense that it satisfies the conditions of ideal approximation; i.e., it is equivalent to sampling a real number from the standard normal distribution and rounding this to the nearest representable floating point number.
• There is also some investigation into the connection between the fast Hadamard transform and the normal distribution, since the transform employs just addition and subtraction and by the central limit theorem random numbers from almost any distribution will be transformed into the normal distribution. In this regard a series of Hadamard transforms can be combined with random permutations to turn arbitrary data sets into normally distributed data.
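As a concrete illustration of two of the generators described above, here is a minimal Python sketch (standard library only) of the Box–Muller transform and the Marsaglia polar method; it is a didactic version, not the optimized routines a numerical library would use.

```python
import math
import random

def box_muller():
    """Return a pair of independent standard normal deviates (Box-Muller)."""
    u = 1.0 - random.random()          # in (0, 1], avoids log(0)
    v = random.random()
    r = math.sqrt(-2.0 * math.log(u))
    return r * math.cos(2.0 * math.pi * v), r * math.sin(2.0 * math.pi * v)

def marsaglia_polar():
    """Return a pair of independent standard normal deviates (polar method)."""
    while True:
        u = random.uniform(-1.0, 1.0)
        v = random.uniform(-1.0, 1.0)
        s = u * u + v * v
        if 0.0 < s < 1.0:              # reject points outside the unit disc
            factor = math.sqrt(-2.0 * math.log(s) / s)
            return u * factor, v * factor

# A N(mu, sigma^2) deviate is then mu + sigma * z for a standard normal z.
z1, z2 = box_muller()
print(10.0 + 2.0 * z1)                  # e.g. a draw from N(10, 4)
```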

5

Statistical Hypothesis Testing

A statistical hypothesis, sometimes called confirmatory data analysis, is a hypothesis that is testable on the basis of observing a process that is modeled via a set of random variables. A statistical hypothesis test is a method of statistical inference. Commonly, two statistical data sets are compared, or a data set obtained by sampling is compared against a synthetic data set from an idealized model. A hypothesis is proposed for the statistical relationship between the two data sets, and this is compared as an alternative to an idealized null hypothesis that proposes no relationship between two data sets. The comparison is deemed statistically significant if the relationship between the data sets would be an unlikely realization of the null hypothesis according to a threshold probability—the significance level. Hypothesis tests are used in determining what outcomes of a study would lead to a rejection of the null hypothesis for a pre-specified level of significance. The process of distinguishing between the null hypothesis and the alternative hypothesis is aided by identifying two conceptual types of errors (type 1 & type 2), and by specifying parametric limits on e.g., how much type 1 error will be permitted. An alternative framework for statistical hypothesis testing is to specify a set of statistical models, one for each candidate hypothesis, and then use model selection techniques to choose the most appropriate model. The most common selection techniques are based on either Akaike information criterion or Bayes factor. Confirmatory data analysis can be contrasted with exploratory data analysis, which may not have pre-specified hypotheses.

VARIATIONS AND SUB-CLASSES
Statistical hypothesis testing is a key technique of both frequentist inference and Bayesian inference, although the two types of inference have notable differences. Statistical hypothesis tests define a procedure that controls (fixes) the probability of incorrectly deciding that a default position (null hypothesis) is incorrect. The procedure is based on how likely it would be for a set of observations to occur if the null hypothesis were true. Note that this probability of making an incorrect decision is not the probability that the null hypothesis is true, nor whether any specific alternative hypothesis is true. This contrasts with other possible techniques of decision theory in which the null and alternative hypothesis are treated on a more equal basis. One naïve Bayesian approach to hypothesis testing is to base decisions on the posterior probability, but this fails when comparing point and continuous hypotheses. Other approaches to decision making, such as Bayesian decision theory, attempt to balance the consequences of incorrect decisions across all possibilities, rather than concentrating on a single null hypothesis. A number of other approaches to reaching a decision based on data are available via decision theory and optimal decisions, some of which have desirable properties. Hypothesis testing, though, is a dominant approach to data analysis in many fields of science. Extensions to the theory of hypothesis testing include the study of the power of tests, i.e., the probability of correctly rejecting the null hypothesis given that it is false. Such considerations can be used for the purpose of sample size determination prior to the collection of data.

THE TESTING PROCESS
In the statistics literature, statistical hypothesis testing plays a fundamental role. The usual line of reasoning is as follows:
1. There is an initial research hypothesis of which the truth is unknown.
2. The first step is to state the relevant null and alternative hypotheses. This is important, as mis-stating the hypotheses will muddy the rest of the process.
3. The second step is to consider the statistical assumptions being made about the sample in doing the test; for example, assumptions about the statistical independence or about the form of the distributions of the observations. This is equally important, as invalid assumptions will mean that the results of the test are invalid.
4. Decide which test is appropriate, and state the relevant test statistic T.
5. Derive the distribution of the test statistic under the null hypothesis from the assumptions. In standard cases this will be a well-known result. For example, the test statistic might follow a Student’s t distribution or a normal distribution.
6. Select a significance level (α), a probability threshold below which the null hypothesis will be rejected. Common values are 5% and 1%.

7. The distribution of the test statistic under the null hypothesis partitions the possible values of T into those for which the null hypothesis is rejected—the so-called critical region—and those for which it is not. The probability of the critical region is α.
8. Compute from the observations the observed value t_obs of the test statistic T.
9. Decide to either reject the null hypothesis in favour of the alternative or not reject it. The decision rule is to reject the null hypothesis H0 if the observed value t_obs is in the critical region, and to accept or “fail to reject” the hypothesis otherwise.
An alternative process is commonly used:
1. Calculate the p-value. This is the probability, under the null hypothesis, of sampling a test statistic at least as extreme as that which was observed.
2. Reject the null hypothesis, in favour of the alternative hypothesis, if and only if the p-value is less than the significance level (the selected probability) threshold.
The two processes are equivalent. The former process was advantageous in the past when only tables of test statistics at common probability thresholds were available. It allowed a decision to be made without the calculation of a probability. It was adequate for classwork and for operational use, but it was deficient for reporting results. The latter process relied on extensive tables or on computational support not always available. The explicit calculation of a probability is useful for reporting. The calculations are now trivially performed with appropriate software. The difference in the two processes applied to the Radioactive suitcase example (below):
• “The Geiger-counter reading is 10. The limit is 9. Check the suitcase.”
• “The Geiger-counter reading is high; 97% of safe suitcases have lower readings. The limit is 95%. Check the suitcase.”
The former report is adequate; the latter gives a more detailed explanation of the data and the reason why the suitcase is being checked. It is important to note the difference between accepting the null hypothesis and simply failing to reject it. The “fail to reject” terminology highlights the fact that the null hypothesis is assumed to be true from the start of the test; if there is a lack of evidence against it, it simply continues to be assumed true. The phrase “accept the null hypothesis” may suggest it has been proved simply because it has not been disproved, a logical fallacy known as the argument from ignorance. Unless a test with particularly high power is used, the idea of “accepting” the null hypothesis may be dangerous. Nonetheless the terminology is prevalent throughout statistics, where the meaning actually intended is well understood. The processes described here are perfectly adequate for computation, but they seriously neglect design of experiments considerations. It is particularly critical that appropriate sample sizes be estimated before conducting the experiment. The phrase “test of significance” was coined by statistician Ronald Fisher.
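To make the two equivalent procedures concrete, here is a small sketch of a two-sided one-sample z-test. The sample values, the hypothesised mean µ0 = 100 and the assumed known σ = 15 are invented for illustration and do not come from the text; the sketch reports both the critical-region decision and the p-value decision, which always agree.

```python
import math
from statistics import NormalDist, mean

sample = [102.1, 98.4, 105.3, 99.0, 101.7, 103.2, 97.8, 104.5]  # hypothetical data
mu0, sigma, alpha = 100.0, 15.0, 0.05     # H0: mu = 100, sigma assumed known

z_obs = (mean(sample) - mu0) / (sigma / math.sqrt(len(sample)))
std = NormalDist()

# Process 1: compare the observed statistic with the critical region
z_crit = std.inv_cdf(1.0 - alpha / 2.0)   # two-sided critical value, ≈ 1.96
reject_by_region = abs(z_obs) > z_crit

# Process 2: compare the p-value with the significance level
p_value = 2.0 * (1.0 - std.cdf(abs(z_obs)))
reject_by_pvalue = p_value < alpha

print(z_obs, p_value, reject_by_region, reject_by_pvalue)  # both decisions coincide
```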

INTERPRETATION
The p-value is the probability that a given result (or a more significant result) would occur under the null hypothesis. For example, say that a fair coin is tested for fairness (the null hypothesis). At a significance level of 0.05, the test would be expected to (incorrectly) reject the null hypothesis in about 1 out of every 20 tests. The p-value does not provide the probability that either hypothesis is correct (a common source of confusion). If the p-value is less than the chosen significance threshold (equivalently, if the observed test statistic is in the critical region), then we say the null hypothesis is rejected at the chosen level of significance. Rejection of the null hypothesis is a conclusion. This is like a “guilty” verdict in a criminal trial: the evidence is sufficient to reject innocence, thus proving guilt. We might accept the alternative hypothesis (and the research hypothesis). If the p-value is not less than the chosen significance threshold (equivalently, if the observed test statistic is outside the critical region), then the evidence is insufficient to support a conclusion. (This is similar to a “not guilty” verdict.) The researcher typically gives extra consideration to those cases where the p-value is close to the significance level. Some people find it helpful to think of the hypothesis testing framework as analogous to a mathematical proof by contradiction. In the Lady tasting tea example, Fisher required the Lady to properly categorize all of the cups of tea to justify the conclusion that the result was unlikely to result from chance. His test revealed that if the lady was effectively guessing at random (the null hypothesis), there was a 1.4% chance that the observed results (perfectly ordered tea) would occur. Whether rejection of the null hypothesis truly justifies acceptance of the research hypothesis depends on the structure of the hypotheses. Rejecting the hypothesis that a large paw print originated from a bear does not immediately prove the existence of Bigfoot. Hypothesis testing emphasizes the rejection, which is based on a probability, rather than the acceptance, which requires extra steps of logic. “The probability of rejecting the null hypothesis is a function of five factors: whether the test is one- or two-tailed, the level of significance, the standard deviation, the amount of deviation from the null hypothesis, and the number of observations.” These factors are a source of criticism; factors under the control of the experimenter/analyst give the results an appearance of subjectivity.

USE AND IMPORTANCE
Statistics are helpful in analyzing most collections of data. This is equally true of hypothesis testing, which can justify conclusions even when no scientific theory exists.

In the Lady tasting tea example, it was “obvious” that no difference existed between (milk poured into tea) and (tea poured into milk). The data contradicted the “obvious”. Real world applications of hypothesis testing include:
• Testing whether more men than women suffer from nightmares
• Establishing authorship of documents
• Evaluating the effect of the full moon on behaviour
• Determining the range at which a bat can detect an insect by echo
• Deciding whether hospital carpeting results in more infections
• Selecting the best means to stop smoking
• Checking whether bumper stickers reflect car owner behaviour
• Testing the claims of handwriting analysts.
Statistical hypothesis testing plays an important role in the whole of statistics and in statistical inference. For example, Lehmann (1992) in a review of the fundamental paper by Neyman and Pearson (1933) says: “Nevertheless, despite their shortcomings, the new paradigm formulated in the 1933 paper, and the many developments carried out within its framework continue to play a central role in both the theory and practice of statistics and can be expected to do so in the foreseeable future”. Significance testing has been the favored statistical tool in some experimental social sciences (over 90% of articles in the Journal of Applied Psychology during the early 1990s). Other fields have favored the estimation of parameters (e.g., effect size). Significance testing is used as a substitute for the traditional comparison of predicted value and experimental result at the core of the scientific method. When theory is only capable of predicting the sign of a relationship, a directional (one-sided) hypothesis test can be configured so that only a statistically significant result supports theory. This form of theory appraisal is the most heavily criticized application of hypothesis testing.

CAUTIONS
“If the government required statistical procedures to carry warning labels like those on drugs, most inference methods would have long labels indeed.” This caution applies to hypothesis tests and alternatives to them. The successful hypothesis test is associated with a probability and a type-I error rate. The conclusion might be wrong. The conclusion of the test is only as solid as the sample upon which it is based. The design of the experiment is critical. A number of unexpected effects have been observed, including:
• The clever Hans effect. A horse appeared to be capable of doing simple arithmetic.
• The Hawthorne effect. Industrial workers were more productive in better illumination, and most productive in worse.
• The placebo effect. Pills with no medically active ingredients were remarkably effective.

A statistical analysis of misleading data produces misleading conclusions. The issue of data quality can be more subtle. In forecasting for example, there is no agreement on a measure of forecast accuracy. In the absence of a consensus measurement, no decision based on measurements will be without controversy. The book How to Lie with Statistics is the most popular book on statistics ever published. It does not much consider hypothesis testing, but its cautions are applicable, including: Many claims are made on the basis of samples too small to convince. If a report does not mention sample size, be doubtful. Hypothesis testing acts as a filter of statistical conclusions; only those results meeting a probability threshold are publishable. Economics also acts as a publication filter; only those results favorable to the author and funding source may be submitted for publication. The impact of filtering on publication is termed publication bias. A related problem is that of multiple testing (sometimes linked to data mining), in which a variety of tests for a variety of possible effects are applied to a single data set and only those yielding a significant result are reported. These are often dealt with by using multiplicity correction procedures that control the family wise error rate (FWER) or the false discovery rate (FDR). Those making critical decisions based on the results of a hypothesis test are prudent to look at the details rather than the conclusion alone. In the physical sciences most results are fully accepted only when independently confirmed. The general advice concerning statistics is, “Figures never lie, but liars figure” (anonymous).

EXAMPLES

LADY TASTING TEA
In a famous example of hypothesis testing, known as the Lady tasting tea, Dr. Muriel Bristol, a female colleague of Fisher, claimed to be able to tell whether the tea or the milk was added first to a cup. Fisher proposed to give her eight cups, four of each variety, in random order. One could then ask what the probability was for her getting the number she got correct, but just by chance. The null hypothesis was that the Lady had no such ability. The test statistic was a simple count of the number of successes in selecting the 4 cups. The critical region was the single case of 4 successes out of 4 possible, based on a conventional probability criterion (< 5%; 1 of 70 ≈ 1.4%). Fisher asserted that no alternative hypothesis was (ever) required. The lady correctly identified every cup, which would be considered a statistically significant result.
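The 1-in-70 figure quoted above is just a combinatorial count: there are C(8, 4) = 70 ways to pick which four of the eight cups had milk first, and only one of them is completely correct. A minimal sketch:

```python
from math import comb

# Number of ways to choose which 4 of the 8 cups are "milk first"
ways = comb(8, 4)                 # 70
p_all_correct = 1 / ways          # probability of a perfect classification by pure guessing
print(ways, p_all_correct)        # 70, ≈ 0.0143 (about 1.4%), below the 5% criterion
```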

COURTROOM TRIAL
A statistical test procedure is comparable to a criminal trial; a defendant is considered not guilty as long as his or her guilt is not proven. The prosecutor tries to prove the guilt of the defendant. Only when there is enough evidence for the prosecution is the defendant convicted.

At the start of the procedure, there are two hypotheses, H0: “the defendant is not guilty”, and H1: “the defendant is guilty”. The first one, H0, is called the null hypothesis, and is for the time being accepted. The second one, H1, is called the alternative hypothesis. It is the alternative hypothesis that one hopes to support. The hypothesis of innocence is only rejected when an error is very unlikely, because one doesn’t want to convict an innocent defendant. Such an error is called an error of the first kind (i.e., the conviction of an innocent person), and the occurrence of this error is controlled to be rare. As a consequence of this asymmetric behaviour, an error of the second kind (acquitting a person who committed the crime) is more common.

                                    | H0 is true (truly not guilty) | H1 is true (truly guilty)
Accept null hypothesis (acquittal)  | Right decision                | Wrong decision (Type II error)
Reject null hypothesis (conviction) | Wrong decision (Type I error) | Right decision
A criminal trial can be regarded as either or both of two decision processes: guilty vs not guilty or evidence vs a threshold (“beyond a reasonable doubt”). In one view, the defendant is judged; in the other view the performance of the prosecution (which bears the burden of proof) is judged. A hypothesis test can be regarded as either a judgment of a hypothesis or as a judgment of evidence.

PHILOSOPHER’S BEANS
The following example was produced by a philosopher describing scientific methods generations before hypothesis testing was formalized and popularized.
Few beans of this handful are white.
Most beans in this bag are white.
Therefore: Probably, these beans were taken from another bag.
This is a hypothetical inference. The beans in the bag are the population. The handful are the sample. The null hypothesis is that the sample originated from the population. The criterion for rejecting the null hypothesis is the “obvious” difference in appearance (an informal difference in the mean). The interesting result is that consideration of a real population and a real sample produced an imaginary bag. The philosopher was considering logic rather than probability. To be a real statistical hypothesis test, this example requires the formalities of a probability calculation and a comparison of that probability to a standard. A simple generalization of the example considers a mixed bag of beans and a handful that contain either very few or very many white beans. The generalization considers both extremes. It requires more calculations and more comparisons to arrive at a formal answer, but the core philosophy is unchanged: if the composition of the handful is greatly different from that of the bag, then the sample probably originated from another bag. The original example is termed a one-sided or a one-tailed test, while the generalization is termed a two-sided or two-tailed test. The statement also relies on the inference that the sampling was random. If someone had been picking through the bag to find white beans, then it would explain why the handful had so many white beans, and also explain why the number of white beans in the bag was depleted (although the bag is probably intended to be assumed much larger than one’s hand).

CLAIRVOYANT CARD GAME
A person (the subject) is tested for clairvoyance. He is shown the reverse of a randomly chosen playing card 25 times and asked which of the four suits it belongs to. The number of hits, or correct answers, is called X. As we try to find evidence of his clairvoyance, for the time being the null hypothesis is that the person is not clairvoyant. The alternative is: the person is (more or less) clairvoyant. If the null hypothesis is valid, the only thing the test person can do is guess. For every card, the probability (relative frequency) of any single suit appearing is 1/4. If the alternative is valid, the test subject will predict the suit correctly with probability greater than 1/4. We will call the probability of guessing correctly p. The hypotheses, then, are:
• Null hypothesis: H₀: p = 1/4 (just guessing); and
• Alternative hypothesis: H₁: p > 1/4 (true clairvoyant).
When the test subject correctly predicts all 25 cards, we will consider him clairvoyant, and reject the null hypothesis. Thus also with 24 or 23 hits. With only 5 or 6 hits, on the other hand, there is no cause to consider him so. But what about 12 hits, or 17 hits? What is the critical number, c, of hits, at which point we consider the subject to be clairvoyant? How do we determine the critical value c? It is obvious that with the choice c = 25 (i.e., we only accept clairvoyance when all cards are predicted correctly) we’re more critical than with c = 10. In the first case almost no test subjects will be recognized to be clairvoyant; in the second case, a certain number will pass the test. In practice, one decides how critical one will be. That is, one decides how often one accepts an error of the first kind – a false positive, or Type I error. With c = 25 the probability of such an error is:

25 1   1  −15 P( reject H0H 0 is valid) = P X=25 p =  =  ≈ 10 , 4   4  and hence, very small. The probability of a false positive is the probability of randomly guessing correctly all 25 times. Being less critical, with c=10, gives: Statistical Hypothesis Testing 213

 1 25  1  PHH(reject0 0 is valid) = P X≥ 10 p = =∑ P  X = k p = ≈ 0.07.  4 k =10  4  Thus, c = 10 yields a much greater probability of false positive. Before the test is actually performed, the maximum acceptable probability of a Type I error (α) is determined. Typically, values in the range of 1% to 5% are selected. (If the maximum acceptable error rate is zero, an infinite number of correct guesses is required.) Depending on this Type 1 error rate, the critical value c is calculated. For example, if we select an error rate of 1%, c is calculated thus: 1  P(reject H0 H 0 is valid) = P X ≥ c p =  ≤ 0.01. 4  From all the numbers c, with this property, we choose the smallest, in order to minimize the probability of a Type II error, a false negative. For the above example, we select: c = 13.

RADIOACTIVE SUITCASE
As an example, consider determining whether a suitcase contains some radioactive material. Placed under a Geiger counter, it produces 10 counts per minute. The null hypothesis is that no radioactive material is in the suitcase and that all measured counts are due to ambient radioactivity typical of the surrounding air and harmless objects. We can then calculate how likely it is that we would observe 10 counts per minute if the null hypothesis were true. If the null hypothesis predicts (say) on average 9 counts per minute, then according to the Poisson distribution typical for radioactive decay there is about a 41% chance of recording 10 or more counts. Thus we can say that the suitcase is compatible with the null hypothesis (this does not guarantee that there is no radioactive material, just that we don’t have enough evidence to suggest there is). On the other hand, if the null hypothesis predicts 3 counts per minute (for which the Poisson distribution predicts only a 0.1% chance of recording 10 or more counts), then the suitcase is not compatible with the null hypothesis, and there are likely other factors responsible for the measurements. The test does not directly assert the presence of radioactive material. A successful test asserts that the claim of no radioactive material present is unlikely given the reading (and therefore...). The double negative (disproving the null hypothesis) of the method is confusing, but using a counter-example to disprove is standard mathematical practice. The attraction of the method is its practicality. We know (from experience) the expected range of counts with only ambient radioactivity present, so we can say that a measurement is unusually large. Statistics just formalizes the intuitive by using numbers instead of adjectives. We probably do not know the characteristics of the radioactive suitcases; we just assume that they produce larger readings. The Poisson probabilities quoted here are reproduced in the short sketch below.
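A minimal illustration (standard library only) of the Poisson tail calculation: the probability of seeing 10 or more counts when the null hypothesis predicts an average of 9 counts per minute, and when it predicts 3.

```python
import math

def poisson_tail(k, lam):
    """P(X >= k) for a Poisson(lam) count, via 1 - P(X <= k - 1)."""
    return 1.0 - sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k))

print(poisson_tail(10, 9.0))   # ≈ 0.41: compatible with the null hypothesis
print(poisson_tail(10, 3.0))   # ≈ 0.001: not compatible with the null hypothesis
```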

To slightly formalize intuition: radioactivity is suspected if the Geiger-count with the suitcase is among or exceeds the greatest (5% or 1%) of the Geiger-counts made with ambient radiation alone. This makes no assumptions about the distribution of counts. Many ambient radiation observations are required to obtain good probability estimates for rare events. The test described here is more fully the null-hypothesis statistical significance test. The null hypothesis represents what we would believe by default, before seeing any evidence. Statistical significance is a possible finding of the test, declared when the observed sample is unlikely to have occurred by chance if the null hypothesis were true. The name of the test describes its formulation and its possible outcome. One characteristic of the test is its crisp decision: to reject or not reject the null hypothesis. A calculated value is compared to a threshold, which is determined from the tolerable risk of error.
DEFINITION OF TERMS
The following definitions are mainly based on the exposition in the book by Lehmann and Romano:
• Statistical hypothesis: A statement about the parameters describing a population (not a sample).
• Statistic: A value calculated from a sample, often to summarize the sample for comparison purposes.
• Simple hypothesis: Any hypothesis which specifies the population distribution completely.
• Composite hypothesis: Any hypothesis which does not specify the population distribution completely.
• Null hypothesis (H0): A hypothesis associated with a contradiction to a theory one would like to prove.
• Alternative hypothesis (H1): A hypothesis (often composite) associated with a theory one would like to prove.
• Statistical test: A procedure whose inputs are samples and whose result is a hypothesis.
• Region of acceptance: The set of values of the test statistic for which we fail to reject the null hypothesis.
• Region of rejection / critical region: The set of values of the test statistic for which the null hypothesis is rejected.
• Critical value: The threshold value delimiting the regions of acceptance and rejection for the test statistic.
• Power of a test (1 − β): The test's probability of correctly rejecting the null hypothesis; the complement of the false negative rate, β. Power is termed sensitivity in biostatistics. ("This is a sensitive test. Because the result is negative, we can confidently say that the patient does not have the condition.")

• Size: For simple hypotheses, this is the test's probability of incorrectly rejecting the null hypothesis – the false positive rate. For composite hypotheses this is the supremum of the probability of rejecting the null hypothesis over all cases covered by the null hypothesis. The complement of the false positive rate is termed specificity in biostatistics. ("This is a specific test. Because the result is positive, we can confidently say that the patient has the condition.")
• Significance level of a test (α): The upper bound imposed on the size of a test. Its value is chosen by the statistician prior to looking at the data or choosing any particular test to be used. It is the maximum exposure to erroneously rejecting H0 he/she is ready to accept. Testing H0 at significance level α means testing H0 with a test whose size does not exceed α. In most cases, one uses tests whose size is equal to the significance level.
• p-value: The probability, assuming the null hypothesis is true, of observing a result at least as extreme as the test statistic.
• Statistical significance test: A predecessor to the statistical hypothesis test. An experimental result was said to be statistically significant if a sample was sufficiently inconsistent with the (null) hypothesis. This was variously considered common sense, a pragmatic heuristic for identifying meaningful experimental results, a convention establishing a threshold of statistical evidence, or a method for drawing conclusions from data. The statistical hypothesis test added mathematical rigor and philosophical consistency to the concept by making the alternative hypothesis explicit. The term is loosely used to describe the modern version which is now part of statistical hypothesis testing.
• Conservative test: A test is conservative if, when constructed for a given nominal significance level, the true probability of incorrectly rejecting the null hypothesis is never greater than the nominal level.
• Exact test: A test in which the significance level or critical value can be computed exactly, i.e., without any approximation. In some contexts this term is restricted to tests applied to categorical data and to permutation tests, in which computations are carried out by complete enumeration of all possible outcomes and their probabilities.
A statistical hypothesis test compares a test statistic (z or t, for example) to a threshold. The test statistic (the formula found in the table below) is based on optimality. For a fixed level of Type I error rate, use of these statistics minimizes Type II error rates (equivalent to maximizing power). The following terms describe tests in terms of such optimality:
• Most powerful test: For a given size or significance level, the test with the greatest power (probability of rejection) for a given value of the parameter(s) being tested, contained in the alternative hypothesis.
• Uniformly most powerful test (UMP): A test with the greatest power for all values of the parameter(s) being tested, contained in the alternative hypothesis.

ORIGINS AND EARLY CONTROVERSY Significance testing is largely the product of Karl Pearson (p-value, Pearson's chi-squared test), William Sealy Gosset (Student's t-distribution), and Ronald Fisher ("null hypothesis", analysis of variance, "significance test"), while hypothesis testing was developed by Jerzy Neyman and Egon Pearson (son of Karl). Ronald Fisher began his life in statistics as a Bayesian (Zabell 1992), but Fisher soon grew disenchanted with the subjectivity involved (namely use of the principle of indifference when determining prior probabilities), and sought to provide a more "objective" approach to inductive inference. Fisher was an agricultural statistician who emphasized rigorous experimental design and methods to extract a result from few samples assuming Gaussian distributions. Neyman (who teamed with the younger Pearson) emphasized mathematical rigor and methods to obtain more results from many samples and a wider range of distributions. Modern hypothesis testing is an inconsistent hybrid of the Fisher vs Neyman/Pearson formulation, methods and terminology developed in the early 20th century. While hypothesis testing was popularized early in the 20th century, evidence of its use can be found much earlier. In the 1770s Laplace considered the statistics of almost half a million births. The statistics showed an excess of boys compared to girls. He concluded by calculation of a p-value that the excess was a real, but unexplained, effect. Fisher popularized the "significance test". He required a null hypothesis (corresponding to a population frequency distribution) and a sample. His (now familiar) calculations determined whether to reject the null hypothesis or not. Significance testing did not utilize an alternative hypothesis, so there was no concept of a Type II error. The p-value was devised as an informal, but objective, index meant to help a researcher determine (based on other knowledge) whether to modify future experiments or strengthen one's faith in the null hypothesis. Hypothesis testing (and Type I/II errors) was devised by Neyman and Pearson as a more objective alternative to Fisher's p-value, also meant to determine researcher behaviour, but without requiring any inductive inference by the researcher. Neyman & Pearson considered a different problem (which they called "hypothesis testing"). They initially considered two simple hypotheses (both with frequency distributions). They calculated two probabilities and typically selected the hypothesis associated with the higher probability (the hypothesis more likely to have generated the sample). Their method always selected a hypothesis. It also allowed the calculation of both types of error probabilities. Fisher and Neyman/Pearson clashed bitterly. Neyman/Pearson considered their formulation to be an improved generalization of significance testing. (The defining paper was abstract. Mathematicians have generalized and refined the theory for decades.) Fisher thought that it was not applicable to scientific research because often, during the course of the experiment, it is discovered that the initial assumptions about the null hypothesis are questionable due to unexpected sources of error. He believed that the use of rigid reject/accept decisions based on models formulated before data is collected was incompatible with this common scenario faced by scientists, and that attempts to apply this method to scientific research would lead to mass confusion.
The dispute between Fisher and Neyman–Pearson was waged on philosophical grounds, characterized by a philosopher as a dispute over the proper role of models in statistical inference. Events intervened: Neyman accepted a position in the western hemisphere, breaking his partnership with Pearson and separating the disputants (who had occupied the same building) by much of the planetary diameter. World War II provided an intermission in the debate. The dispute between Fisher and Neyman terminated (unresolved after 27 years) with Fisher's death in 1962. Neyman wrote a well-regarded eulogy. Some of Neyman's later publications reported p-values and significance levels. The modern version of hypothesis testing is a hybrid of the two approaches that resulted from confusion by writers of statistical textbooks (as predicted by Fisher) beginning in the 1940s. (But signal detection, for example, still uses the Neyman/Pearson formulation.) Great conceptual differences and many caveats in addition to those mentioned above were ignored. Neyman and Pearson provided the stronger terminology, the more rigorous mathematics and the more consistent philosophy, but the subject taught today in introductory statistics has more similarities with Fisher's method than theirs. This history explains the inconsistent terminology (example: the null hypothesis is never accepted, but there is a region of acceptance). Sometime around 1940, in an apparent effort to provide researchers with a "non-controversial" way to have their cake and eat it too, the authors of statistical textbooks began anonymously combining these two strategies by using the p-value in place of the test statistic (or data) to test against the Neyman–Pearson "significance level". Thus, researchers were encouraged to infer the strength of their data against some null hypothesis using p-values, while also thinking they are retaining the post-data collection objectivity provided by hypothesis testing. It then became customary for the null hypothesis, which was originally some realistic research hypothesis, to be used almost solely as a strawman "nil" hypothesis (one where a treatment has no effect, regardless of the context).
Table: A comparison between Fisher's null hypothesis testing and Neyman–Pearson decision theory.
Fisher's null hypothesis testing:
1. Set up a statistical null hypothesis. The null need not be a nil hypothesis (i.e., zero difference).
2. Report the exact level of significance (e.g., p = 0.051 or p = 0.049). Do not use a conventional 5% level, and do not talk about accepting or rejecting hypotheses. If the result is "not significant", draw no conclusions and make no decisions, but suspend judgement until further data is available.
3. Use this procedure only if little is known about the problem at hand, and only to draw provisional conclusions in the context of an attempt to understand the experimental situation.
Neyman–Pearson decision theory:
1. Set up two statistical hypotheses, H1 and H2, and decide about α, β, and sample size before the experiment, based on subjective cost-benefit considerations. These define a rejection region for each hypothesis.
2. If the data falls into the rejection region of H1, accept H2; otherwise accept H1. Note that accepting a hypothesis does not mean that you believe in it, but only that you act as if it were true.
3. The usefulness of the procedure is limited, among others, to situations where you have a disjunction of hypotheses (e.g., either µ1 = 8 or µ2 = 10 is true) and where you can make meaningful cost-benefit trade-offs for choosing alpha and beta.

EARLY CHOICES OF NULL HYPOTHESIS Paul Meehl has argued that the epistemological importance of the choice of null hypothesis has gone largely unacknowledged. When the null hypothesis is predicted by theory, a more precise experiment will be a more severe test of the underlying theory. When the null hypothesis defaults to "no difference" or "no effect", a more precise experiment is a less severe test of the theory that motivated performing the experiment. An examination of the origins of the latter practice may therefore be useful:
1778: Pierre Laplace compares the birthrates of boys and girls in multiple European cities. He states: "it is natural to conclude that these possibilities are very nearly in the same ratio". Thus Laplace's null hypothesis was that the birthrates of boys and girls should be equal, given "conventional wisdom".
1900: Karl Pearson develops the chi-squared test to determine "whether a given form of frequency curve will effectively describe the samples drawn from a given population." Thus the null hypothesis is that a population is described by some distribution predicted by theory. He uses as an example the numbers of fives and sixes in the Weldon dice throw data.
1904: Karl Pearson develops the concept of "contingency" in order to determine whether outcomes are independent of a given categorical factor. Here the null hypothesis is by default that two things are unrelated (e.g., scar formation and death rates from smallpox). The null hypothesis in this case is no longer predicted by theory or conventional wisdom, but is instead the principle of indifference that led Fisher and others to dismiss the use of "inverse probabilities".

NULL HYPOTHESIS STATISTICAL SIGNIFICANCE TESTING An example of Neyman–Pearson hypothesis testing can be made by a change to the radioactive suitcase example. If the "suitcase" is actually a shielded container for the transportation of radioactive material, then a test might be used to select among three hypotheses: no radioactive source present, one present, two (all) present. The test could be required for safety, with actions required in each case. The Neyman–Pearson lemma of hypothesis testing says that a good criterion for the selection of hypotheses is the ratio of their probabilities (a likelihood ratio). A simple method of solution is to select the hypothesis with the highest probability for the Geiger counts observed. The typical result matches intuition: few counts imply no source, many counts imply two sources, and intermediate counts imply one source. Notice also that usually there are problems with proving a negative. Null hypotheses should be at least falsifiable. Neyman–Pearson theory can accommodate both prior probabilities and the costs of actions resulting from decisions. The former allows each test to consider the results of earlier tests (unlike Fisher's significance tests). The latter allows the consideration of economic issues (for example) as well as probabilities. A likelihood ratio remains a good criterion for selecting among hypotheses. The two forms of hypothesis testing are based on different problem formulations. The original test is analogous to a true/false question; the Neyman–Pearson test is more like multiple choice. In the view of Tukey the former produces a conclusion on the basis of only strong evidence while the latter produces a decision on the basis of available evidence. While the two tests seem quite different both mathematically and philosophically, later developments led to the opposite claim. Consider many tiny radioactive sources. The hypotheses become 0, 1, 2, 3, ... grains of radioactive sand. There is little distinction between none or some radiation (Fisher) and 0 grains of radioactive sand versus all of the alternatives (Neyman–Pearson). The major Neyman–Pearson paper of 1933 also considered composite hypotheses (ones whose distribution includes an unknown parameter). An example proved the optimality of the (Student's) t-test: "there can be no better test for the hypothesis under consideration". Neyman–Pearson theory was proving the optimality of Fisherian methods from its inception. Fisher's significance testing has proven a popular, flexible statistical tool in application with little mathematical growth potential. Neyman–Pearson hypothesis testing is claimed as a pillar of mathematical statistics, creating a new paradigm for the field. It also stimulated new applications in statistical process control, detection theory, decision theory and game theory. Both formulations have been successful, but the successes have been of a different character. The dispute over formulations is unresolved. Science primarily uses Fisher's (slightly modified) formulation as taught in introductory statistics. Statisticians study Neyman–Pearson theory in graduate school. Mathematicians are proud of uniting the formulations. Philosophers consider them separately. Learned opinions deem the formulations variously competitive (Fisher vs Neyman), incompatible or complementary. The dispute has become more complex since Bayesian inference has achieved respectability. The terminology is inconsistent.
Hypothesis testing can mean any mixture of two formulations that both changed with time. Any discussion of significance testing vs hypothesis testing is doubly vulnerable to confusion. Fisher thought that hypothesis testing was a useful strategy for performing industrial quality control; however, he strongly disagreed that hypothesis testing could be useful for scientists. Hypothesis testing provides a means of finding test statistics used in significance testing. The concept of power is useful in explaining the consequences of adjusting the significance level and is heavily used in sample size determination. The two methods remain philosophically distinct. They usually (but not always) produce the same mathematical answer. The preferred answer is context dependent. While the existing merger of Fisher and Neyman–Pearson theories has been heavily criticized, modifying the merger to achieve Bayesian goals has been considered.
CRITICISM
Criticism of statistical hypothesis testing fills volumes citing 300–400 primary references. Much of the criticism can be summarized by the following issues:
• The interpretation of a p-value is dependent upon the stopping rule and the definition of multiple comparisons. The former often changes during the course of a study and the latter is unavoidably ambiguous (i.e., "p values depend on both the (data) observed and on the other possible (data) that might have been observed but weren't").
• Confusion resulting (in part) from combining the methods of Fisher and Neyman–Pearson, which are conceptually distinct.
• Emphasis on statistical significance to the exclusion of estimation and confirmation by repeated experiments.
• Rigidly requiring statistical significance as a criterion for publication, resulting in publication bias.
Most of the criticism is indirect. Rather than being wrong, statistical hypothesis testing is misunderstood, overused and misused.
• When used to detect whether a difference exists between groups, a paradox arises. As improvements are made to experimental design (e.g., increased precision of measurement and sample size), the test becomes more lenient. Unless one accepts the absurd assumption that all sources of noise in the data cancel out completely, the chance of finding statistical significance in either direction approaches 100%.
• Layers of philosophical concerns. The probability of statistical significance is a function of decisions made by experimenters/analysts. If the decisions are based on convention they are termed arbitrary or mindless, while those not so based may be termed subjective. To minimize Type II errors, large samples are recommended. In psychology practically all null hypotheses are claimed to be false for sufficiently large samples, so "... it is usually nonsensical to perform an experiment with the sole aim of rejecting the null hypothesis." "Statistically significant findings are often misleading" in psychology. Statistical significance does not imply practical significance, and correlation does not imply causation. Casting doubt on the null hypothesis is thus far from directly supporting the research hypothesis.
• "[I]t does not tell us what we want to know."
Lists of dozens of complaints are available. Critics and supporters are largely in factual agreement regarding the characteristics of null hypothesis significance testing (NHST): while it can provide critical information, it is inadequate as the sole tool for statistical analysis. Successfully rejecting the null hypothesis may offer no support for the research hypothesis. The continuing controversy concerns the selection of the best statistical practices for the near-term future given the (often poor) existing practices. Critics would prefer to ban NHST completely, forcing a complete departure from those practices, while supporters suggest a less absolute change. Controversy over significance testing, and its effects on publication bias in particular, has produced several results. The American Psychological Association has strengthened its statistical reporting requirements after review, medical journal publishers have recognized the obligation to publish some results that are not statistically significant to combat publication bias, and a journal (Journal of Articles in Support of the Null Hypothesis) has been created to publish such results exclusively. Textbooks have added some cautions and increased coverage of the tools necessary to estimate the size of the sample required to produce significant results. Major organizations have not abandoned use of significance tests, although some have discussed doing so.
ALTERNATIVES
The numerous criticisms of significance testing do not lead to a single alternative. A unifying position of critics is that statistics should not lead to a conclusion or a decision, but to a probability or to an estimated value with a confidence interval, rather than to an accept-reject decision regarding a particular hypothesis. It is unlikely that the controversy surrounding significance testing will be resolved in the near future. Its supposed flaws and unpopularity do not eliminate the need for an objective and transparent means of reaching conclusions regarding studies that produce statistical results. Critics have not unified around an alternative. Other forms of reporting confidence or uncertainty could probably grow in popularity. One strong critic of significance testing suggested a list of reporting alternatives: effect sizes for importance, prediction intervals for confidence, replications and extensions for replicability, meta-analyses for generality. None of these suggested alternatives produces a conclusion/decision. Lehmann said that hypothesis testing theory can be presented in terms of conclusions/decisions, probabilities, or confidence intervals: "The distinction between the... approaches is largely one of reporting and interpretation."

On one "alternative" there is no disagreement: Fisher himself said, "In relation to the test of significance, we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result." Cohen, an influential critic of significance testing, concurred, "... don't look for a magic alternative to NHST [null hypothesis significance testing]... It doesn't exist." "... given the problems of statistical induction, we must finally rely, as have the older sciences, on replication." The "alternative" to significance testing is repeated testing. The easiest way to decrease statistical uncertainty is by obtaining more data, whether by increased sample size or by repeated tests. Nickerson claimed to have never seen the publication of a literally replicated experiment in psychology. An indirect approach to replication is meta-analysis. Bayesian inference is one proposed alternative to significance testing. (Nickerson cited 10 sources suggesting it, including Rozeboom (1960).) For example, Bayesian parameter estimation can provide rich information about the data from which researchers can draw inferences, while using uncertain priors that exert only minimal influence on the results when enough data is available. Psychologist John K. Kruschke has suggested Bayesian estimation as an alternative for the t-test. Alternatively, two competing models/hypotheses can be compared using Bayes factors. Bayesian methods could be criticized for requiring information that is seldom available in the cases where significance testing is most heavily used. Neither the prior probabilities nor the probability distribution of the test statistic under the alternative hypothesis is often available in the social sciences. Advocates of a Bayesian approach sometimes claim that the goal of a researcher is most often to objectively assess the probability that a hypothesis is true based on the data they have collected. Neither Fisher's significance testing nor Neyman–Pearson hypothesis testing can provide this information, and they do not claim to. The probability a hypothesis is true can only be derived from use of Bayes' Theorem, which was unsatisfactory to both the Fisher and Neyman–Pearson camps due to the explicit use of subjectivity in the form of the prior probability. Fisher's strategy is to sidestep this with the p-value (an objective index based on the data alone) followed by inductive inference, while Neyman–Pearson devised their approach of inductive behaviour.

PHILOSOPHY Hypothesis testing and philosophy intersect. Inferential statistics, which includes hypothesis testing, is applied probability. Both probability and its application are intertwined with philosophy. Philosopher David Hume wrote, "All knowledge degenerates into probability." Competing practical definitions of probability reflect philosophical differences. The most common application of hypothesis testing is in the scientific interpretation of experimental data, which is naturally studied by the philosophy of science. Fisher and Neyman opposed the subjectivity of probability. Their views contributed to the objective definitions. The core of their historical disagreement was philosophical. Many of the philosophical criticisms of hypothesis testing are discussed by statisticians in other contexts, particularly correlation does not imply causation and the design of experiments. Hypothesis testing is of continuing interest to philosophers.

EDUCATION Statistics is increasingly being taught in schools, with hypothesis testing being one of the elements taught. Many conclusions reported in the popular press (from political opinion polls to medical studies) are based on statistics. An informed public should understand the limitations of statistical conclusions, and many college fields of study require a course in statistics for the same reason. An introductory college statistics class places much emphasis on hypothesis testing – perhaps half of the course. Such fields as literature and divinity now include findings based on statistical analysis. An introductory statistics class teaches hypothesis testing as a cookbook process. Hypothesis testing is also taught at the postgraduate level. Statisticians learn how to create good statistical test procedures (like z, Student's t, F and chi-squared). Statistical hypothesis testing is considered a mature area within statistics, but a limited amount of development continues. The cookbook method of teaching introductory statistics leaves no time for history, philosophy or controversy. Hypothesis testing has been taught as a received, unified method. Surveys showed that graduates of the class were filled with philosophical misconceptions (on all aspects of statistical inference) that persisted among instructors. While the problem was addressed more than a decade ago, and calls for educational reform continue, students still graduate from statistics classes holding fundamental misconceptions about hypothesis testing. Ideas for improving the teaching of hypothesis testing include encouraging students to search for statistical errors in published papers, teaching the history of statistics and emphasizing the controversy in a generally dry subject.

ONE- AND TWO-TAILED TESTS In statistical significance testing, a one-tailed test and a two-tailed test are alternative ways of computing the statistical significance of a parameter inferred from a data set, in terms of a test statistic. A two-tailed test is appropriate if the estimated value may be more than or less than the reference value, for example, whether a test taker may score above or below the historical average. A one-tailed test is appropriate if the estimated value may depart from the reference value in only one direction, for example, whether a machine produces more than one-percent defective products. Alternative names are one-sided and two-sided tests; the terminology "tail" is used because the extreme portions of distributions, where observations lead to rejection of the null hypothesis, are small and often "tail off" towards zero as in the normal distribution or "bell curve".

Fig.: A two-tailed test applied to the normal distribution.

Fig.: A one-tailed test, showing the p-value as the size of one tail.

APPLICATIONS One-tailed tests are used for asymmetric distributions that have a single tail, such as the chi-squared distribution, which are common in measuring goodness-of-fit, or for one side of a distribution that has two tails, such as the normal distribution, which is common in estimating location; this corresponds to specifying a direction. Two-tailed tests are only applicable when there are two tails, such as in the normal distribution, and correspond to considering either direction significant. In the approach of Ronald Fisher, the null hypothesis H0 will be rejected when the p-value of the test statistic is sufficiently extreme (vis-a-vis the test statistic's sampling distribution) and thus judged unlikely to be the result of chance. In a one-tailed test, "extreme" is decided beforehand as either meaning "sufficiently small" or meaning "sufficiently large" – values in the other direction are considered not significant. In a two-tailed test, "extreme" means "either sufficiently small or sufficiently large", and values in either direction are considered significant. For a given test statistic there is a single two-tailed test, and two one-tailed tests, one each for either direction. Given data of a given significance level in a two-tailed test for a test statistic, in the corresponding one-tailed tests for the same test statistic it will be considered either twice as significant (half the p-value), if the data is in the direction specified by the test, or not significant at all (p-value above 0.5), if the data is in the direction opposite that specified by the test. For example, when flipping a coin, testing whether it is biased towards heads is a one-tailed test, and getting data of "all heads" would be seen as highly significant, while getting data of "all tails" would be not significant at all (p = 1). By contrast, testing whether it is biased in either direction is a two-tailed test, and either "all heads" or "all tails" would both be seen as highly significant data. In medical testing, one is generally interested in whether a treatment results in outcomes that are better than chance, which suggests a one-tailed test; but a worse outcome is also interesting for the scientific field, so one should use a two-tailed test, which corresponds instead to testing whether the treatment results in outcomes that are different from chance, either better or worse. In the archetypal lady tasting tea experiment, Fisher tested whether the lady in question was better than chance at distinguishing two types of tea preparation, not whether her ability was different from chance, and thus he used a one-tailed test.

COIN FLIPPING EXAMPLE In coin flipping, the null hypothesis is a sequence of Bernoulli trials with probability 0.5, yielding a random variable X which is 1 for heads and 0 for tails, and a common test statistic is the sample mean (of the number of heads) X̄. If testing for whether the coin is biased towards heads, a one-tailed test would be used – only large numbers of heads would be significant. In that case a data set of five heads (HHHHH), with sample mean of 1, has a 1/32 = 0.03125 ≈ 0.03 chance of occurring (5 consecutive flips with 2 outcomes each: (1/2)^5 = 1/32), and thus would have p ≈ 0.03 and would be significant (rejecting the null hypothesis) if using 0.05 as the cutoff. However, if testing for whether the coin is biased towards heads or tails, a two-tailed test would be used, and a data set of five heads (sample mean 1) is as extreme as a data set of five tails (sample mean 0), so the p-value would be 2/32 = 0.0625 ≈ 0.06 and this would not be significant (not rejecting the null hypothesis) if using 0.05 as the cutoff.
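A short sketch, assuming Python with scipy, that reproduces the one-tailed and two-tailed p-values for five heads in five flips:

```python
from scipy.stats import binom

n, k = 5, 5                                              # five flips, five heads observed
p_one_tailed = binom.sf(k - 1, n, 0.5)                   # P(X >= 5) = 1/32
p_two_tailed = p_one_tailed + binom.cdf(n - k, n, 0.5)   # add the equally extreme "all tails" outcome
print(p_one_tailed, p_two_tailed)                        # 0.03125 and 0.0625
```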

HISTORY The p-value was introduced by Karl Pearson in Pearson's chi-squared test, where he defined P (original notation) as the probability that the statistic would be at or above a given level. This is a one-tailed definition, and the chi-squared distribution is asymmetric, only assuming positive or zero values, and has only one tail, the upper one. It measures goodness of fit of data with a theoretical distribution, with zero corresponding to exact agreement with the theoretical distribution; the p-value thus measures how likely the fit would be this bad or worse.

The distinction between one-tailed and two-tailed tests was popularized by Ronald Fisher in the influential book Statistical Methods for Research Workers, where he applied it especially to the normal distribution, which is a symmetric distribution with two equal tails. The normal distribution is a common measure of location, rather than goodness-of-fit, and has two tails, corresponding to the estimate of location being above or below the theoretical location (e.g., sample mean compared with theoretical mean). In the case of a symmetric distribution such as the normal distribution, the one-tailed p-value is exactly half the two-tailed p-value: Some confusion is sometimes introduced by the fact that in some cases we wish to know the probability that the deviation, known to be positive, shall exceed an observed value, whereas in other cases the probability required is that a deviation, which is equally frequently positive and negative, shall exceed an observed value; the latter probability is always half the former.—Ronald Fisher, Statistical Methods for Research Workers. Fisher emphasized the importance of measuring the tail – the observed value of the test statistic and all more extreme – rather than simply the probability of a specific outcome itself, in his The Design of Experiments (1935). He explains that a specific set of data may be unlikely (under the null hypothesis) while more extreme outcomes remain likely, so that, seen in this light, data which are unlikely only in their specificity, but are not extreme, should not be considered significant.
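The halving relation for a symmetric distribution can be illustrated numerically. A minimal sketch, assuming Python with scipy and an illustrative z statistic of 1.96:

```python
from scipy.stats import norm

z = 1.96                              # an illustrative observed z statistic
p_one_tailed = norm.sf(z)             # P(Z >= z)
p_two_tailed = 2 * norm.sf(abs(z))    # P(|Z| >= |z|)
print(p_one_tailed, p_two_tailed)     # roughly 0.025 and 0.05; the one-tailed value is exactly half
```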

SPECIFIC TESTS If the test statistic follows a Student's t-distribution under the null hypothesis – which is common where the underlying variable follows a normal distribution with unknown scaling factor – then the test is referred to as a one-tailed or two-tailed t-test. If the test is performed using the actual population mean and variance, rather than an estimate from a sample, it would be called a one-tailed or two-tailed Z-test. The statistical tables for t and for Z provide critical values for both one- and two-tailed tests. That is, they provide the critical values that cut off an entire region at one or the other end of the sampling distribution as well as the critical values that cut off the regions (of half the size) at both ends of the sampling distribution.
STATISTICAL POWER
The power of a binary hypothesis test is the probability that the test correctly rejects the null hypothesis (H0) when a specific alternative hypothesis (H1) is true. The statistical power ranges from 0 to 1, and as statistical power increases, the probability of making a Type II error decreases. For a Type II error probability of β, the corresponding statistical power is 1 − β. For example, if experiment 1 has a statistical power of 0.7, and experiment 2 has a statistical power of 0.95, then there is a stronger probability that experiment 1 had a Type II error than experiment 2, and experiment 2 is more reliable than experiment 1 due to the reduction in probability of a Type II error. It can be equivalently thought of as the probability of accepting the alternative hypothesis (H1) when it is true—that is, the ability of a test to detect a specific effect, if that specific effect actually exists. That is,

$$\text{Power} = \Pr(\text{reject } H_0 \mid H_1 \text{ is true}).$$

If H1 is not an equality but rather simply the negation of H0 (so, for example, with H0: µ = 0 for some unobserved population parameter µ, we have simply H1: µ ≠ 0), then power cannot be calculated unless probabilities are known for all possible values of the parameter that violate the null hypothesis. Thus one generally refers to a test's power against a specific alternative hypothesis. As the power increases, there is a decreasing probability of a Type II error (false negative), also referred to as the false negative rate (β), since the power is equal to 1 – β. A similar concept is the Type I error probability, also referred to as the "false positive rate" or the level of a test under the null hypothesis. However, the false negative rate (FNR) is not necessarily the same as the proportion of false negatives normalized over all samples, because the FNR is divided only by the number of actual condition-positive samples rather than by all of them. This effect is more pronounced when the actual positive set size is very different from the actual negative set size. Power analysis can be used to calculate the minimum sample size required so that one can be reasonably likely to detect an effect of a given size. For example: "how many times do I need to toss a coin to conclude it is rigged by a certain amount?" Power analysis can also be used to calculate the minimum effect size that is likely to be detected in a study using a given sample size. In addition, the concept of power is used to make comparisons between different statistical testing procedures: for example, between a parametric test and a nonparametric test of the same hypothesis. A similar but somewhat different concept is statistical sensitivity, which measures how likely it is that a given test gives the correct result (e.g., the likelihood that a test to determine if a patient has a particular disease correctly recognizes the disease).
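To make "power against a specific alternative" concrete, the sketch below computes the power of a two-sided one-sample z-test for a chosen true mean shift. The function name and the parameter values are illustrative assumptions, and Python with scipy is assumed:

```python
import numpy as np
from scipy.stats import norm

def z_test_power(effect, sigma, n, alpha=0.05):
    """Power of a two-sided one-sample z-test against a specific true mean shift
    `effect`, assuming a known standard deviation `sigma` and sample size `n`."""
    z_crit = norm.ppf(1 - alpha / 2)
    shift = effect * np.sqrt(n) / sigma
    # Probability that the test statistic lands in either rejection region
    return norm.cdf(shift - z_crit) + norm.cdf(-shift - z_crit)

print(z_test_power(effect=0.5, sigma=1.0, n=32))   # roughly 0.8 for this illustrative alternative
```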

BACKGROUND Statistical tests use data from samples to assess, or make inferences about, a statistical population. In the concrete setting of a two-sample comparison, the goal is to assess whether the mean values of some attribute obtained for individuals in two sub-populations differ. For example, to test the null hypothesis that the mean scores of men and women on a test do not differ, samples of men and women are drawn, the test is administered to them, and the mean score of one group is compared to that of the other group using a statistical test such as the two-sample z-test. The power of the test is the probability that the test will find a statistically significant difference between men and women, as a function of the size of the true difference between those two populations.

FACTORS INFLUENCING POWER Statistical power may depend on a number of factors. Some factors may be particular to a specific testing situation, but at a minimum, power nearly always depends on the following three factors:
• the statistical significance criterion used in the test,
• the magnitude of the effect of interest in the population, and
• the sample size used to detect the effect.
A significance criterion is a statement of how unlikely a positive result must be, if the null hypothesis of no effect is true, for the null hypothesis to be rejected. The most commonly used criteria are probabilities of 0.05 (5%, 1 in 20), 0.01 (1%, 1 in 100), and 0.001 (0.1%, 1 in 1000). If the criterion is 0.05, the probability of the data implying an effect at least as large as the observed effect when the null hypothesis is true must be less than 0.05 for the null hypothesis of no effect to be rejected. One easy way to increase the power of a test is to carry out a less conservative test by using a larger significance criterion, for example 0.10 instead of 0.05. This increases the chance of rejecting the null hypothesis (i.e., obtaining a statistically significant result) when the null hypothesis is false; that is, it reduces the risk of a Type II error (a false negative regarding whether an effect exists). But it also increases the risk of obtaining a statistically significant result (i.e., rejecting the null hypothesis) when the null hypothesis is not false; that is, it increases the risk of a Type I error (false positive). The magnitude of the effect of interest in the population can be quantified in terms of an effect size, where there is greater power to detect larger effects. An effect size can be a direct value of the quantity of interest, or it can be a standardized measure that also accounts for the variability in the population. For example, in an analysis comparing outcomes in a treated and control population, the difference of outcome means Ȳ − X̄ would be a direct estimate of the effect size, whereas (Ȳ − X̄)/σ, where σ is the common standard deviation of the outcomes in the treated and control groups, would be an estimated standardized effect size. If constructed appropriately, a standardized effect size, along with the sample size, will completely determine the power. An unstandardized (direct) effect size will rarely be sufficient to determine the power, as it does not contain information about the variability in the measurements. The sample size determines the amount of sampling error inherent in a test result. Other things being equal, effects are harder to detect in smaller samples. Increasing sample size is often the easiest way to boost the statistical power of a test. How increased sample size translates to higher power is a measure of the efficiency of the test—for example, the sample size required for a given power. The precision with which the data are measured also influences statistical power. Consequently, power can often be improved by reducing the measurement error in the data. A related concept is to improve the "reliability" of the measure being assessed (as in psychometric reliability).
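For illustration, a standardized effect size can be estimated from two samples as the difference in means divided by a pooled standard deviation. A minimal sketch (the data values are made up for the example), assuming Python with numpy:

```python
import numpy as np

def effect_sizes(treated, control):
    """Direct effect (difference in means) and standardized effect
    (difference divided by a pooled standard deviation, i.e. Cohen's d)."""
    treated, control = np.asarray(treated, float), np.asarray(control, float)
    diff = treated.mean() - control.mean()
    pooled_var = (((len(treated) - 1) * treated.var(ddof=1) +
                   (len(control) - 1) * control.var(ddof=1)) /
                  (len(treated) + len(control) - 2))
    return diff, diff / np.sqrt(pooled_var)

print(effect_sizes([5.1, 5.9, 6.2, 5.7], [4.8, 5.0, 5.3, 4.6]))
```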

The design of an experiment or observational study often influences the power. For example, in a two-sample testing situation with a given total sample size n, it is optimal to have equal numbers of observations from the two populations being compared (as long as the variances in the two populations are the same). In regression analysis and analysis of variance, there are extensive theories and practical strategies for improving the power based on optimally setting the values of the independent variables in the model.

INTERPRETATION Although there are no formal standards for power (sometimes referred to as π), most researchers assess the power of their tests using π = 0.80 as a standard for adequacy. This convention implies a four-to-one trade-off between β-risk and α-risk. (β is the probability of a Type II error, and α is the probability of a Type I error; 0.2 and 0.05 are conventional values for β and α.) However, there will be times when this 4-to-1 weighting is inappropriate. In medicine, for example, tests are often designed in such a way that no false negatives (Type II errors) will be produced. But this inevitably raises the risk of obtaining a false positive (a Type I error). The rationale is that it is better to tell a healthy patient "we may have found something—let's test further," than to tell a diseased patient "all is well." Power analysis is appropriate when the concern is with the correct rejection of a false null hypothesis. In many contexts, the issue is less about determining if there is or is not a difference but rather with getting a more refined estimate of the population effect size. For example, if we were expecting a population correlation between intelligence and job performance of around 0.50, a sample size of 20 will give us approximately 80% power (alpha = 0.05, two-tail) to reject the null hypothesis of zero correlation. However, in doing this study we are probably more interested in knowing whether the correlation is 0.30 or 0.60 or 0.50. In this context we would need a much larger sample size in order to reduce the confidence interval of our estimate to a range that is acceptable for our purposes. Techniques similar to those employed in a traditional power analysis can be used to determine the sample size required for the width of a confidence interval to be less than a given value. Many statistical analyses involve the estimation of several unknown quantities. In simple cases, all but one of these quantities are nuisance parameters. In this setting, the only relevant power pertains to the single quantity that will undergo formal statistical inference. In some settings, particularly if the goals are more "exploratory", there may be a number of quantities of interest in the analysis. For example, in a multiple regression analysis we may include several covariates of potential interest. In situations such as this where several hypotheses are under consideration, it is common that the powers associated with the different hypotheses differ. For instance, in multiple regression analysis, the power for detecting an effect of a given size is related to the variance of the covariate. Since different covariates will have different variances, their powers will differ as well. Any statistical analysis involving multiple hypotheses is subject to inflation of the Type I error rate if appropriate measures are not taken. Such measures typically involve applying a higher threshold of stringency to reject a hypothesis in order to compensate for the multiple comparisons being made (e.g., as in the Bonferroni method). In this situation, the power analysis should reflect the multiple testing approach to be used. Thus, for example, a given study may be well powered to detect a certain effect size when only one test is to be made, but the same effect size may have much lower power if several tests are to be performed. It is also important to consider the statistical power of a hypothesis test when interpreting its results.
A test’s power is the probability of correctly rejecting the null hypothesis when it is false; a test’s power is influenced by the choice of significance level for the test, the size of the effect being measured, and the amount of data available. A hypothesis test may fail to reject the null, for example, if a true difference exists between two populations being compared by a t-test but the effect is small and the sample size is too small to distinguish the effect from random chance. Many clinical trials, for instance, have low statistical power to detect differences in adverse effects of treatments, since such effects may be rare and the number of affected patients small.

A PRIORI VS. POST HOC ANALYSIS Power analysis can either be done before (a priori or prospective power analysis) or after (post hoc or retrospective power analysis) data are collected. A priori power analysis is conducted prior to the research study, and is typically used in estimating sufficient sample sizes to achieve adequate power. Post-hoc analysis of "observed power" is conducted after a study has been completed, and uses the obtained sample size and effect size to determine what the power was in the study, assuming the effect size in the sample is equal to the effect size in the population. Whereas the utility of prospective power analysis in experimental design is universally accepted, post hoc power analysis is fundamentally flawed. Falling for the temptation to use the statistical analysis of the collected data to estimate the power will result in uninformative and misleading values. In particular, it has been shown that post-hoc "observed power" is a one-to-one function of the p-value attained. This has been extended to show that all post-hoc power analyses suffer from what is called the "power approach paradox" (PAP), in which a study with a null result is thought to show more evidence that the null hypothesis is actually true when the p-value is smaller, since the apparent power to detect an actual effect would be higher. In fact, a smaller p-value is properly understood to make the null hypothesis relatively less likely to be true.

APPLICATION Funding agencies, ethics boards and research review panels frequently request that a researcher perform a power analysis, for example to determine the minimum number of animal test subjects needed for an experiment to be informative. In frequentist statistics, an underpowered study is unlikely to allow one to choose between hypotheses at the desired significance level. In Bayesian statistics, hypothesis testing of the type used in classical power analysis is not done. In the Bayesian framework, one updates his or her prior beliefs using the data obtained in a given study. In principle, a study that would be deemed underpowered from the perspective of hypothesis testing could still be used in such an updating process. However, power remains a useful measure of how much a given experiment size can be expected to refine one’s beliefs. A study with low power is unlikely to lead to a large change in beliefs.
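A typical a priori calculation of this kind solves for the sample size that reaches a target power. The sketch below uses the standard normal approximation for a two-sided, two-sample comparison; the function name and default values are illustrative assumptions, and Python with scipy is assumed:

```python
import math
from scipy.stats import norm

def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sided, two-sample comparison
    (normal approximation) to detect a standardized effect size d with the given power."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_power = norm.ppf(power)
    return math.ceil(2 * ((z_alpha + z_power) / d) ** 2)

print(n_per_group(0.5))   # about 63 subjects per group for a medium standardized effect
```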

EXTENSION

Bayesian Power In the frequentist setting, parameters are assumed to have a specific value which is unlikely to be true. This issue can be addressed by assuming the parameter has a distribution. The resulting power is sometimes referred to as Bayesian power which is commonly used in clinical trial design.
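One way to read "Bayesian power" is as the frequentist power averaged over a prior distribution on the true effect. A Monte Carlo sketch under that interpretation (the prior parameters are illustrative assumptions), in Python with numpy and scipy:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def bayesian_power(n, sigma=1.0, alpha=0.05, prior_mean=0.4, prior_sd=0.2, draws=100_000):
    """Average the power of a two-sided one-sample z-test over a normal prior on the
    true effect; prior_mean and prior_sd are illustrative assumptions."""
    effects = rng.normal(prior_mean, prior_sd, draws)       # draws from the assumed prior
    z_crit = norm.ppf(1 - alpha / 2)
    shift = effects * np.sqrt(n) / sigma
    power = norm.cdf(shift - z_crit) + norm.cdf(-shift - z_crit)
    return power.mean()

print(bayesian_power(n=50))
```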

Predictive Probability of Success Both frequentist power and Bayesian power use statistical significance as the success criterion. However, statistical significance is often not enough to define success. To address this issue, the power concept can be extended to the concept of predictive probability of success (PPOS). The success criterion for PPOS is not restricted to statistical significance and is commonly used in clinical trial designs.

SOFTWARE FOR POWER AND SAMPLE SIZE CALCULATIONS Numerous free and/or open source programmes are available for performing power and sample size calculations. These include:
• G*Power
• Free and open source online calculators
• PS
• Power Up!, which provides convenient Excel-based functions to determine minimum detectable effect size and minimum required sample size for various experimental and quasi-experimental designs
• R package
• Russ Lenth's power and sample-size page
• WebPower, free online statistical power analysis
• Samp Size app for Android and iOS (iPhone and iPad)

PERMUTATION TESTS A permutation test (also called a randomization test, re-randomization test, or an exact test) is a type of statistical significance test in which the distribution of the test statistic under the null hypothesis is obtained by calculating all possible values of the test statistic under rearrangements of the labels on the observed data points. In other words, the method by which treatments are allocated to subjects in an experimental design is mirrored in the analysis of that design. If the labels are exchangeable under the null hypothesis, then the resulting tests yield exact significance levels. Confidence intervals can then be derived from the tests. The theory has evolved from the works of Ronald Fisher and E. J. G. Pitman in the 1930s.
To illustrate the basic idea of a permutation test, suppose we have two groups A and B whose sample means are x̄A and x̄B, and that we want to test, at the 5% significance level, whether they come from the same distribution. Let nA and nB be the sample sizes corresponding to each group. The permutation test is designed to determine whether the observed difference between the sample means is large enough to reject the null hypothesis H0 that the two groups have identical probability distributions. The test proceeds as follows. First, the difference in means between the two samples is calculated: this is the observed value of the test statistic, T(obs). Then the observations of groups A and B are pooled. Next, the difference in sample means is calculated and recorded for every possible way of dividing these pooled values into two groups of size nA and nB (i.e., for every permutation of the group labels A and B). The set of these calculated differences is the exact distribution of possible differences under the null hypothesis that group label does not matter. The one-sided p-value of the test is calculated as the proportion of sampled permutations where the difference in means was greater than or equal to T(obs). The two-sided p-value of the test is calculated as the proportion of sampled permutations where the absolute difference was greater than or equal to |T(obs)|. If the only purpose of the test is to reject or not reject the null hypothesis, we can as an alternative sort the recorded differences, and then observe whether T(obs) is contained within the middle 95% of them. If it is not, we reject the hypothesis of identical probability curves at the 5% significance level.
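A small implementation sketch, assuming Python with numpy. For realistic sample sizes complete enumeration of all label rearrangements is infeasible, so this version samples random permutations (a Monte Carlo approximation of the exact test); the data values are made up for the example:

```python
import numpy as np

rng = np.random.default_rng(42)

def permutation_test(a, b, n_permutations=10_000):
    """Two-sided permutation test for a difference in means between groups a and b."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    observed = a.mean() - b.mean()
    pooled = np.concatenate([a, b])
    count = 0
    for _ in range(n_permutations):
        perm = rng.permutation(pooled)                  # re-label the pooled observations
        diff = perm[:len(a)].mean() - perm[len(a):].mean()
        if abs(diff) >= abs(observed):
            count += 1
    return observed, count / n_permutations             # observed T and Monte Carlo two-sided p-value

obs, p = permutation_test([12.6, 11.4, 13.2, 14.3], [10.1, 9.8, 11.0, 10.7])
print(obs, p)
```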

MULTIPLE COMPARISONS PROBLEM In statistics, the multiple comparisons, multiplicity or multiple testing problem occurs when one considers a set of statistical inferences simultaneously or infers a subset of parameters selected based on the observed values. In certain fields it is known as the look-elsewhere effect.

The more inferences are made, the more likely erroneous inferences are to occur. Several statistical techniques have been developed to prevent this from happening, allowing significance levels for single and multiple comparisons to be directly compared. These techniques generally require a stricter significance threshold for individual comparisons, so as to compensate for the number of inferences being made.

HISTORY The interest in the problem of multiple comparisons began in the 1950s with the work of Tukey and Scheffé. Other methods, such as the closed testing procedure (Marcus et al., 1976) and the Holm–Bonferroni method (1979), later emerged. In 1995, work on the false discovery rate began. In 1996, the first conference on multiple comparisons took place in Israel. This was followed by conferences around the world, usually taking place about every two years.

DEFINITION Multiple comparisons arise when a statistical analysis involves multiple simultaneous statistical tests, each of which has a potential to produce a "discovery." A stated confidence level generally applies only to each test considered individually, but often it is desirable to have a confidence level for the whole family of simultaneous tests. Failure to compensate for multiple comparisons can have important real-world consequences, as illustrated by the following examples:
• Suppose the treatment is a new way of teaching writing to students, and the control is the standard way of teaching writing. Students in the two groups can be compared in terms of grammar, spelling, organization, content, and so on. As more attributes are compared, it becomes increasingly likely that the treatment and control groups will appear to differ on at least one attribute due to random sampling error alone.
• Suppose we consider the efficacy of a drug in terms of the reduction of any one of a number of disease symptoms. As more symptoms are considered, it becomes increasingly likely that the drug will appear to be an improvement over existing drugs in terms of at least one symptom.
In both examples, as the number of comparisons increases, it becomes more likely that the groups being compared will appear to differ in terms of at least one attribute. Our confidence that a result will generalize to independent data should generally be weaker if it is observed as part of an analysis that involves multiple comparisons, rather than an analysis that involves only a single comparison. For example, if one test is performed at the 5% level and the corresponding null hypothesis is true, there is only a 5% chance of incorrectly rejecting the null hypothesis. However, if 100 tests are conducted and all corresponding null hypotheses are true, the expected number of incorrect rejections (also known as false positives or Type I errors) is 5. If the tests are statistically independent from each other, the probability of at least one incorrect rejection is 99.4%. The multiple comparisons problem also applies to confidence intervals. A single confidence interval with a 95% coverage probability level will contain the population parameter in 95% of experiments. However, if one considers 100 confidence intervals simultaneously, each with 95% coverage probability, the expected number of non-covering intervals is 5. If the intervals are statistically independent from each other, the probability that at least one interval does not contain the population parameter is 99.4%. Techniques have been developed to prevent the inflation of false positive rates and non-coverage rates that occur with multiple statistical tests.
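The 99.4% figure follows from independence: with 100 independent tests each at the 5% level, the chance of at least one false positive is 1 − 0.95^100. A one-line check, assuming Python:

```python
alpha, m = 0.05, 100
print(1 - (1 - alpha) ** m)   # roughly 0.994, i.e. the 99.4% quoted above
```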

CLASSIFICATION OF MULTIPLE HYPOTHESIS TESTS
The following table defines the possible outcomes when testing multiple null hypotheses. Suppose we have a number m of null hypotheses, denoted by H1, H2, ..., Hm. Using a statistical test, we reject the null hypothesis if the test is declared significant, and we do not reject the null hypothesis if the test is non-significant. Summing each type of outcome over all Hi yields the following random variables:

                                     Null hypothesis is true (H0)   Alternative hypothesis is true (HA)   Total
Test is declared significant         V                              S                                     R
Test is declared non-significant     U                              T                                     m − R
Total                                m0                             m − m0                                m

• m is the total number of hypotheses tested
• m0 is the number of true null hypotheses, an unknown parameter
• m − m0 is the number of true alternative hypotheses
• V is the number of false positives (Type I errors), also called “false discoveries”
• S is the number of true positives, also called “true discoveries”
• T is the number of false negatives (Type II errors)
• U is the number of true negatives
• R = V + S is the number of rejected null hypotheses, also called “discoveries”, whether true or false
In m hypothesis tests of which m0 are true null hypotheses, R is an observable random variable, while S, T, U, and V are unobservable random variables.

CONTROLLING PROCEDURES
If m independent comparisons are performed, the family-wise error rate (FWER) is given by

FWER = 1 − (1 − α_per-comparison)^m.

Hence, unless the tests are perfectly positively dependent (i.e., identical), the FWER increases as the number of comparisons increases. If we do not assume that the comparisons are independent, we can still say

FWER ≤ m · α_per-comparison,

which follows from Boole’s inequality. Example: 0.2649 = 1 − (1 − 0.05)^6 ≤ 0.05 × 6 = 0.3.

There are different ways to ensure that the family-wise error rate is at most α. The most conservative method, which is free of dependence and distributional assumptions, is the Bonferroni correction, α_per-comparison = α/m. A marginally less conservative correction can be obtained by solving the equation for the family-wise error rate of m independent comparisons for α_per-comparison. This yields

α_per-comparison = 1 − (1 − α)^(1/m),

which is known as the Šidák correction. Another procedure is the Holm–Bonferroni method, which uniformly delivers more power than the simple Bonferroni correction by testing the lowest p-value (i = 1) against the strictest criterion, and the higher p-values (i > 1) against progressively less strict criteria,

α_per-comparison = α/(m − i + 1).

(A computational sketch of these three corrections appears after the list of method categories below.) Multiple testing correction refers to re-calculating probabilities obtained from a statistical test which was repeated multiple times. In order to retain a prescribed family-wise error rate α in an analysis involving more than one comparison, the error rate for each comparison must be more stringent than α. Boole’s inequality implies that if each of m tests is performed with Type I error rate α/m, the total error rate will not exceed α. This is called the Bonferroni correction, and is one of the most commonly used approaches for multiple comparisons. In some situations, the Bonferroni correction is substantially conservative, i.e., the actual family-wise error rate is much less than the prescribed level α. This occurs when the test statistics are highly dependent (in the extreme case where the tests are perfectly dependent, the family-wise error rate with no multiple comparisons adjustment and the per-test error rates are identical). For example, in fMRI analysis, tests are done on over 100,000 voxels in the brain. The Bonferroni method would require p-values to be smaller than 0.05/100,000 to declare significance. Since adjacent voxels tend to be highly correlated, this threshold is generally too stringent. Because simple techniques such as the Bonferroni method can be conservative, there has been a great deal of attention paid to developing better techniques, such that the overall rate of false positives can be maintained without excessively inflating the rate of false negatives. Such methods can be divided into general categories:
• Methods where total alpha can be proved never to exceed 0.05 (or some other chosen value) under any conditions. These methods provide “strong” control against Type I error, in all conditions including a partially correct null hypothesis.

• Methods where total alpha can be proved not to exceed 0.05 except under certain defined conditions. • Methods which rely on an omnibus test before proceeding to multiple comparisons. Typically these methods require a significant ANOVA, MANOVA, or Tukey’s range test. These methods generally provide only “weak” control of Type I error, except for certain numbers of hypotheses. • Empirical methods, which control the proportion of Type I errors adaptively, utilizing correlation and distribution characteristics of the observed data. The advent of computerized resampling methods, such as bootstrapping and Monte Carlo simulations, has given rise to many techniques in the latter category. In some cases where exhaustive permutation resampling is performed, these tests provide exact, strong control of Type I error rates; in other cases, such as bootstrap sampling, they provide only approximate control.
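As a rough illustration of the per-comparison thresholds discussed above, the sketch below applies the Bonferroni, Šidák and Holm–Bonferroni criteria to a small set of made-up p-values; the function names and the example p-values are illustrative assumptions, not part of the original text.

```python
import numpy as np

def bonferroni(pvals, alpha=0.05):
    # Reject H_i when p_i <= alpha / m.
    m = len(pvals)
    return np.asarray(pvals) <= alpha / m

def sidak(pvals, alpha=0.05):
    # Reject H_i when p_i <= 1 - (1 - alpha)**(1/m).
    m = len(pvals)
    return np.asarray(pvals) <= 1 - (1 - alpha) ** (1 / m)

def holm(pvals, alpha=0.05):
    # Step-down Holm-Bonferroni: compare the i-th smallest p-value with
    # alpha / (m - i + 1) and stop at the first non-rejection.
    p = np.asarray(pvals)
    m = len(p)
    order = np.argsort(p)
    reject = np.zeros(m, dtype=bool)
    for rank, idx in enumerate(order, start=1):
        if p[idx] <= alpha / (m - rank + 1):
            reject[idx] = True
        else:
            break
    return reject

pvals = [0.001, 0.008, 0.039, 0.041, 0.09]   # illustrative p-values
print(bonferroni(pvals), sidak(pvals), holm(pvals), sep="\n")
```

With these p-values all three procedures reject the two smallest ones; Holm never rejects fewer hypotheses than Bonferroni, which is the sense in which it is uniformly more powerful.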

LARGE-SCALE MULTIPLE TESTING
Traditional methods for multiple comparisons adjustments focus on correcting for modest numbers of comparisons, often in an analysis of variance. A different set of techniques has been developed for “large-scale multiple testing”, in which thousands or even greater numbers of tests are performed. For example, in genomics, when using technologies such as microarrays, expression levels of tens of thousands of genes can be measured, and genotypes for millions of genetic markers can be measured. Particularly in the field of genetic association studies, there has been a serious problem with non-replication — a result being strongly statistically significant in one study but failing to be replicated in a follow-up study. Such non-replication can have many causes, but it is widely considered that failure to fully account for the consequences of making multiple comparisons is one of them. In different branches of science, multiple testing is handled in different ways. It has been argued that if statistical tests are only performed when there is a strong basis for expecting the result to be true, multiple comparisons adjustments are not necessary. It has also been argued that use of multiple testing corrections is an inefficient way to perform empirical research, since multiple testing adjustments control false positives at the potential expense of many more false negatives. On the other hand, it has been argued that advances in measurement and information technology have made it far easier to generate large datasets for exploratory analysis, often leading to the testing of large numbers of hypotheses with no prior basis for expecting many of them to be true. In this situation, very high false positive rates are expected unless multiple comparisons adjustments are made. For large-scale testing problems where the goal is to provide definitive results, the family-wise error rate remains the most accepted parameter for ascribing significance levels to statistical tests. Alternatively, if a study is viewed as exploratory, or if significant results can be easily re-tested in an independent study, control of the false discovery rate (FDR) is often preferred. The FDR, loosely defined as the expected proportion of false positives among all significant tests, allows researchers to identify a set of “candidate positives” that can be more rigorously evaluated in a follow-up study. The practice of trying many unadjusted comparisons in the hope of finding a significant one, whether applied unintentionally or deliberately, is sometimes called “p-hacking.”

Assessing Whether any Alternative Hypotheses are True
A basic question faced at the outset of analyzing a large set of testing results is whether there is evidence that any of the alternative hypotheses are true. One simple meta-test that can be applied when it is assumed that the tests are independent of each other is to use the Poisson distribution as a model for the number of significant results at a given level α that would be found when all null hypotheses are true. If the observed number of positives is substantially greater than what should be expected, this suggests that there are likely to be some true positives among the significant results. For example, if 1000 independent tests are performed, each at level α = 0.05, we expect 0.05 × 1000 = 50 significant tests to occur when all null hypotheses are true. Based on the Poisson distribution with mean 50, the probability of observing more than 61 significant tests is less than 0.05, so if more than 61 significant results are observed, it is very likely that some of them correspond to situations where the alternative hypothesis holds. A drawback of this approach is that it overstates the evidence that some of the alternative hypotheses are true when the test statistics are positively correlated, which commonly occurs in practice. On the other hand, the approach remains valid even in the presence of correlation among the test statistics, as long as the Poisson distribution can be shown to provide a good approximation for the number of significant results. This scenario arises, for instance, when mining significant frequent itemsets from transactional datasets. Furthermore, a careful two-stage analysis can bound the FDR at a pre-specified level. Another common approach that can be used in situations where the test statistics can be standardized to Z-scores is to make a normal quantile plot of the test statistics. If the observed quantiles are markedly more dispersed than the normal quantiles, this suggests that some of the significant results may be true positives.
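The Poisson meta-test in the worked example can be reproduced with a few lines of SciPy. The sketch below simply evaluates the upper-tail probability and the 95th percentile of a Poisson distribution with mean 50; the variable names are illustrative assumptions.

```python
from scipy.stats import poisson

m, alpha = 1000, 0.05      # number of independent tests and per-test level
mu = m * alpha             # expected significant results under the global null (here 50)

# Tail probability of seeing more than k significant results if every null is true.
k = 61
print(poisson.sf(k, mu))   # P(X > 61) under Poisson(50)

# Smallest count c such that P(X <= c) >= 0.95, i.e. P(X > c) <= 0.05.
print(poisson.ppf(0.95, mu))
```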

6

Nonparametric Statistics

Nonparametric statistics is the branch of statistics that is not based solely on parameterized families of probability distributions (common examples of parameters are the mean and variance). Nonparametric statistics is based on either being distribution-free or having a specified distribution but with the distribution’s parameters unspecified. Nonparametric statistics includes both descriptive statistics and statistical inference.

DEFINITIONS
The statistician Larry Wasserman has said that “it is difficult to give a precise definition of nonparametric inference”. The term “nonparametric statistics” has been imprecisely defined in the following two ways, among others.
1. The first meaning of nonparametric covers techniques that do not rely on data belonging to any particular distribution. These include, among others:
– Distribution-free methods, which do not rely on assumptions that the data are drawn from a given probability distribution. As such it is the opposite of parametric statistics. It includes nonparametric descriptive statistics, statistical models, inference and statistical tests.
– Non-parametric statistics (in the sense of a statistic over data, which is defined to be a function on a sample that has no dependency on a parameter), whose interpretation does not depend on the population fitting any parameterised distributions. Order statistics, which are based on the ranks of observations, are one example of such statistics and these play a central role in many nonparametric approaches.

The following discussion is taken from Kendall’s.
• Statistical hypotheses concern the behaviour of observable random variables.... For example, the hypothesis (a) that a normal distribution has a specified mean and variance is statistical; so is the hypothesis (b) that it has a given mean but unspecified variance; so is the hypothesis (c) that a distribution is of normal form with both mean and variance unspecified; finally, so is the hypothesis (d) that two unspecified continuous distributions are identical.
• It will have been noticed that in the examples (a) and (b) the distribution underlying the observations was taken to be of a certain form (the normal) and the hypothesis was concerned entirely with the value of one or both of its parameters. Such a hypothesis, for obvious reasons, is called parametric.
• Hypothesis (c) was of a different nature, as no parameter values are specified in the statement of the hypothesis; we might reasonably call such a hypothesis non-parametric. Hypothesis (d) is also non-parametric but, in addition, it does not even specify the underlying form of the distribution and may now be reasonably termed distribution-free. Notwithstanding these distinctions, the statistical literature now commonly applies the label “non-parametric” to test procedures that we have just termed “distribution-free”, thereby losing a useful classification.
2. The second meaning of nonparametric covers techniques that do not assume that the structure of a model is fixed. Typically, the model grows in size to accommodate the complexity of the data. In these techniques, individual variables are typically assumed to belong to parametric distributions, and assumptions about the types of connections among variables are also made. These techniques include, among others:
– Nonparametric regression, which refers to modeling where the structure of the relationship between variables is treated non-parametrically, but where nevertheless there may be parametric assumptions about the distribution of model residuals.
– Nonparametric hierarchical Bayesian models, such as models based on the Dirichlet process, which allow the number of latent variables to grow as necessary to fit the data, but where individual variables still follow parametric distributions and even the process controlling the rate of growth of latent variables follows a parametric distribution.

APPLICATIONS AND PURPOSE
Nonparametric methods are widely used for studying populations that take on a ranked order (such as movie reviews receiving one to four stars). The use of nonparametric methods may be necessary when data have a ranking but no clear numerical interpretation, such as when assessing preferences. In terms of levels of measurement, non-parametric methods result in “ordinal” data. As nonparametric methods make fewer assumptions, their applicability is much wider than the corresponding parametric methods. In particular, they may be applied in situations where less is known about the application in question. Also, due to the reliance on fewer assumptions, nonparametric methods are more robust. Another justification for the use of nonparametric methods is simplicity. In certain cases, even when the use of parametric methods is justified, nonparametric methods may be easier to use. Due both to this simplicity and to their greater robustness, nonparametric methods are seen by some statisticians as leaving less room for improper use and misunderstanding.
The wider applicability and increased robustness of nonparametric tests comes at a cost: in cases where a parametric test would be appropriate, nonparametric tests have less power. In other words, a larger sample size can be required to draw conclusions with the same degree of confidence.

NON-PARAMETRIC MODELS

HISTOGRAM
A histogram is an accurate representation of the distribution of numerical data. It is an estimate of the probability distribution of a continuous (quantitative) variable and was first introduced by Karl Pearson. It is a kind of bar graph. To construct a histogram, the first step is to “bin” the range of values—that is, divide the entire range of values into a series of intervals—and then count how many values fall into each interval. The bins are usually specified as consecutive, non-overlapping intervals of a variable. The bins (intervals) must be adjacent, and are often (but are not required to be) of equal size. If the bins are of equal size, a rectangle is erected over the bin with height proportional to the frequency—the number of cases in each bin. A histogram may also be normalized to display “relative” frequencies. It then shows the proportion of cases that fall into each of several categories, with the sum of the heights equaling 1. However, bins need not be of equal width; in that case, the erected rectangle is defined to have its area proportional to the frequency of cases in the bin. The vertical axis is then not the frequency but the frequency density—the number of cases per unit of the variable on the horizontal axis. Census Bureau data provide examples of variable bin widths. As the adjacent bins leave no gaps, the rectangles of a histogram touch each other to indicate that the original variable is continuous. Histograms give a rough sense of the density of the underlying distribution of the data, and are often used for density estimation: estimating the probability density function of the underlying variable. The total area of a histogram used for probability density is always normalized to 1. If the intervals on the x-axis all have length 1, then a histogram is identical to a relative frequency plot. A histogram can be thought of as a simplistic kernel density estimation, which uses a kernel to smooth frequencies over the bins. This yields a smoother probability density function, which will in general more accurately reflect the distribution of the underlying variable. The density estimate could be plotted as an alternative to the histogram, and is usually drawn as a curve rather than a set of boxes. Another alternative is the average shifted histogram, which is fast to compute and gives a smooth curve estimate of the density without using kernels. The histogram is one of the seven basic tools of quality control. Histograms are sometimes confused with bar charts. A histogram is used for continuous data, where the bins represent ranges of data, while a bar chart is a plot of categorical variables. Some authors recommend that bar charts have gaps between the rectangles to clarify the distinction.
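A minimal sketch of the binning step described above, using NumPy's histogram routine; the sample data and bin count are illustrative assumptions. With density=True the bar heights integrate to 1, matching the normalization discussed in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=1.0, size=1000)    # illustrative sample

# Equal-width bins: raw counts per bin.
counts, edges = np.histogram(data, bins=20)

# Normalised version: heights are densities, so that sum(height * width) == 1.
density, edges = np.histogram(data, bins=20, density=True)
widths = np.diff(edges)
print(np.sum(density * widths))   # 1.0 (up to floating-point error)
```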

KERNEL DENSITY ESTIMATION In statistics, kernel density estimation (KDE) is a non-parametric way to estimate the probability density function of a random variable. Kernel density estimation is a fundamental data smoothing problem where inferences about the population are made, based on a finite data sample. In some fields such as signal processing and econometrics it is also termed the Parzen–Rosenblatt window method, after Emanuel Parzen and Murray Rosenblatt, who are usually credited with independently creating it in its current form.
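As a hedged illustration of kernel density estimation, the sketch below uses SciPy's gaussian_kde on a made-up sample and evaluates the estimate for several bandwidth settings; the sample, the evaluation grid and the bandwidth values are assumptions for demonstration only.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)
sample = rng.normal(size=100)             # 100 normally distributed values

kde = gaussian_kde(sample)                # bandwidth chosen by Scott's rule by default
grid = np.linspace(-4, 4, 200)
density = kde(grid)                       # smooth estimate of the pdf on the grid

# A smaller bandwidth factor gives a rougher estimate, a larger one a smoother estimate.
rough = gaussian_kde(sample, bw_method=0.1)(grid)
smooth = gaussian_kde(sample, bw_method=1.0)(grid)
```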

Fig. Kernel density estimation of 100 normally distributed random numbers using different smoothing bandwidths.

Statistical Implementation
A non-exhaustive list of software implementations of kernel density estimators includes:

• In Analytica release 4.4, the Smoothing option for PDF results uses KDE, and from expressions it is available via the built-in Pdf function.
• In C/C++, FIGTree is a library that can be used to compute kernel density estimates using normal kernels. A MATLAB interface is available.
• In C++, libagf is a library for variable kernel density estimation.
• In CrimeStat, kernel density estimation is implemented using five different kernel functions – normal, uniform, quartic, negative exponential, and triangular. Both single- and dual-kernel density estimate routines are available. Kernel density estimation is also used in interpolating a Head Bang routine, in estimating a two-dimensional Journey-to-crime density function, and in estimating a three-dimensional Bayesian Journey-to-crime estimate.
• In ELKI, kernel density functions can be found in the package.
• In ESRI products, kernel density mapping is managed out of the Spatial Analyst toolbox and uses the Quartic (biweight) kernel.
• In Excel, the Royal Society of Chemistry has created an add-in to run kernel density estimation based on their Analytical Methods Committee Technical Brief 4.
• In gnuplot, kernel density estimation is implemented by the smooth kdensity option; the datafile can contain a weight and bandwidth for each point, or the bandwidth can be set automatically according to “Silverman’s rule of thumb”.
• In Haskell, kernel density is implemented in the statistics package.
• In Java, the Weka (machine learning) package provides weka.estimators.KernelEstimator, among others.
• In JavaScript, the visualization package D3.js offers a KDE package in its science.stats package.
• In JMP, the Distribution platform can be used to create univariate kernel density estimates, and the Fit Y by X platform can be used to create bivariate kernel density estimates.
• In Julia, kernel density estimation is implemented in the KernelDensity.jl package.
• In MATLAB, kernel density estimation is implemented through the ksdensity function (Statistics Toolbox). This function does not provide an automatic data-driven bandwidth but uses a rule of thumb, which is optimal only when the target density is normal. The Statistics Toolbox also offers the function fitdist, with which a kernel distribution can be fitted to a specific data set; with this function it is possible to choose the kernel function and set the bandwidth. Alternatively, a free MATLAB software package which implements an automatic bandwidth selection method is available from the MATLAB Central File Exchange for:
– 1-dimensional data
– 2-dimensional data

– n-dimensional data
A free MATLAB toolbox with implementations of kernel regression, kernel density estimation, kernel estimation of the hazard function and many others is also available.
• In Mathematica, numeric kernel density estimation is implemented by the function SmoothKernelDistribution and symbolic estimation is implemented using the function KernelMixtureDistribution, both of which provide data-driven bandwidths.
• The Royal Society of Chemistry has also created a macro to run kernel density estimation based on their Analytical Methods Committee Technical Brief 4.
• In the NAG Library, kernel density estimation is implemented via the g10ba routine (available in both the Fortran and the C versions of the Library).
• In Nuklei, C++ kernel density methods focus on data from the Special Euclidean group SE(3).
• In Octave, kernel density estimation is implemented by the kernel_density option (econometrics package).
• In Origin, a 2D kernel density plot can be made from its user interface, and two functions, Ksdensity for 1D and Ks2density for 2D, can be used from its LabTalk, Python, or C code.
• In Perl, an implementation can be found in the Statistics-KernelEstimation module.
• In PHP, an implementation can be found in the MathPHP library.
• In Python, many implementations exist: the pyqt_fit.kde module in the PyQt-Fit package, SciPy (scipy.stats.gaussian_kde), Statsmodels (KDEUnivariate and KDEMultivariate), and Scikit-learn (KernelDensity).
• In R, it is implemented through the density function, the bkde function in the KernSmooth library (the first two included in the base distribution), the ParetoDensityEstimation function in the AdaptGauss library (for Pareto density estimation), the kde function in the ks library, the dkden and dbckden functions in the evmix library (the latter for boundary-corrected kernel density estimation for bounded support), the npudens function in the np library (numeric and categorical data), and the sm.density function in the sm library. The kde.R implementation does not require installing any packages or libraries.
• In SAS, proc kde can be used to estimate univariate and bivariate kernel densities.
• In Stata, it is implemented through kdensity; for example, histogram x, kdensity. Alternatively, the free Stata module KDENS allows a user to estimate 1D or 2D density functions.
• In Apache Spark, the KernelDensity() class can be used.

NONPARAMETRIC REGRESSION
Nonparametric regression is a category of regression analysis in which the predictor does not take a predetermined form but is constructed according to information derived from the data. Nonparametric regression requires larger sample sizes than regression based on parametric models because the data must supply the model structure as well as the model estimates.

Gaussian Process Regression or Kriging
In Gaussian process regression, also known as Kriging, a Gaussian prior is assumed for the regression curve. The errors are assumed to have a multivariate normal distribution and the regression curve is estimated by its posterior mode. The Gaussian prior may depend on unknown hyperparameters, which are usually estimated via empirical Bayes. Smoothing splines have an interpretation as the posterior mode of a Gaussian process regression.

Kernel Regression

Fig. Example of a curve (red line) fit to a small data set (black points) with nonparametric regression using a Gaussian kernel smoother. The pink shaded area illustrates the kernel function applied to obtain an estimate of y for a given value of x. The kernel function defines the weight given to each data point in producing the estimate for a target point.
Kernel regression estimates the continuous dependent variable from a limited set of data points by convolving the data points’ locations with a kernel function—approximately speaking, the kernel function specifies how to “blur” the influence of the data points so that their values can be used to predict the value for nearby locations.
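The Gaussian kernel smoother just described can be sketched directly as a Nadaraya–Watson style estimator: each prediction is a weighted mean of the observed responses, with weights given by a Gaussian kernel of the distance to the target point. The data, the bandwidth and the function name below are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel_smoother(x_train, y_train, x_query, bandwidth=0.5):
    """Locally weighted mean with Gaussian weights (Nadaraya-Watson form)."""
    x_train = np.asarray(x_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    estimates = []
    for x0 in np.atleast_1d(x_query):
        w = np.exp(-0.5 * ((x_train - x0) / bandwidth) ** 2)   # weight by proximity
        estimates.append(np.sum(w * y_train) / np.sum(w))
    return np.array(estimates)

# Illustrative noisy data.
rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0, 10, 50))
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)
x_grid = np.linspace(0, 10, 100)
y_hat = gaussian_kernel_smoother(x, y, x_grid, bandwidth=0.8)
```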

Nonparametric Multiplicative Regression
Nonparametric multiplicative regression (NPMR) is a form of nonparametric regression based on multiplicative kernel estimation. Like other regression methods, the goal is to estimate a response (dependent variable) based on one or more predictors (independent variables). NPMR can be a good choice for a regression method if the following are true:
1. The shape of the response surface is unknown.
2. The predictors are likely to interact in producing the response; in other words, the shape of the response to one predictor is likely to depend on other predictors.
3. The response is either a quantitative or binary (0/1) variable.
This is a smoothing technique that can be cross-validated and applied in a predictive way.

NPMR Behaves Like an Organism NPMR has been useful for modeling the response of an organism to its environment. Organismal response to environment tends to be nonlinear and have complex interactions among predictors. NPMR allows you to model automatically the complex interactions among predictors in much the same way that organisms integrate the numerous factors affecting their performance. A key biological feature of an NPMR model is that failure of an organism to tolerate any single dimension of the predictor space results in overall failure of the organism. For example, assume that a plant needs a certain range of moisture in a particular temperature range. If either temperature or moisture fall outside the tolerance of the organism, then the organism dies. If it is too hot, then no amount of moisture can compensate to result in survival of the plant. Mathematically this works with NPMR because the product of the weights for the target point is zero or near zero if any of the weights for individual predictors (moisture or temperature) are zero or near zero. Note further that in this simple example, the second condition listed above is probably true: the response of the plant to moisture probably depends on temperature and vice versa. Optimizing the selection of predictors and their smoothing parameters in a multiplicative model is computationally intensive. With a large pool of predictors, the computer must search through a huge number of potential models in search for the best model. The best model has the best fit, subject to overfitting constraints or penalties.

The Local Model
NPMR can be applied with several different kinds of local models. By “local model” we mean the way that data points near a target point in the predictor space are combined to produce an estimate for the target point. The most common choices for the local models are the local mean estimator, a local linear estimator, or a local logistic estimator. In each case the weights can be extended multiplicatively to multiple dimensions. In words, the estimate of the response is a local estimate (for example a local mean) of the observed values, each value weighted by its proximity to the target point in the predictor space, the weights being the product of weights for individual predictors. The model allows interactions, because weights for individual predictors are combined by multiplication rather than addition.

Overfitting Controls
Understanding and using these controls on overfitting is essential to effective modeling with nonparametric regression. Nonparametric regression models can become overfit either by including too many predictors or by using small smoothing parameters (also known as bandwidth or tolerance). This can make a big difference with special problems, such as small data sets or clumped distributions along predictor variables. The methods for controlling overfitting differ between NPMR and generalized linear models (GLMs). The most popular overfitting controls for GLMs are the Akaike information criterion (AIC) and the Bayesian information criterion (BIC) for model selection. The AIC and BIC depend on the number of parameters in a model. Because NPMR models do not have explicit parameters as such, these criteria are not directly applicable to NPMR models. Instead, one can control overfitting by setting a minimum average neighborhood size, a minimum data-to-predictor ratio, and a minimum improvement required to add a predictor to a model. Nonparametric regression models sometimes use an AIC based on the “effective number of parameters”. This penalizes a measure of fit by the trace of the smoothing matrix—essentially how much each data point contributes to estimating itself, summed across all data points. If, however, leave-one-out cross-validation is used in the model fitting phase, the trace of the smoothing matrix is always zero, corresponding to zero parameters for the AIC. Thus, NPMR with cross-validation in the model fitting phase already penalizes the measure of fit, such that the error rate of the training data set is expected to approximate the error rate in a validation data set. In other words, the training error rate approximates the prediction (extra-sample) error rate.
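As a rough single-predictor analogue of the leave-one-out idea described above (not the full NPMR procedure), the sketch below scores candidate bandwidths of a Gaussian-kernel local mean by their leave-one-out error, so that no point contributes to estimating itself; the data and the candidate bandwidths are made-up assumptions.

```python
import numpy as np

def loo_error(x, y, bandwidth):
    """Leave-one-out RMSE of a Gaussian-kernel local mean (one predictor)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    errors = []
    for i in range(len(x)):
        w = np.exp(-0.5 * ((x - x[i]) / bandwidth) ** 2)
        w[i] = 0.0                                  # the target point does not estimate itself
        errors.append(y[i] - np.sum(w * y) / np.sum(w))
    return np.sqrt(np.mean(np.square(errors)))

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(0, 10, 60))
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)

bandwidths = [0.05, 0.2, 0.5, 1.0, 2.0]
best = min(bandwidths, key=lambda h: loo_error(x, y, h))
print(best)   # the bandwidth with the lowest leave-one-out error
```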

Related Techniques
NPMR is essentially a smoothing technique that can be cross-validated and applied in a predictive way. Many other smoothing techniques are well known, for example smoothing splines and wavelets. The optimal choice of a smoothing method depends on the specific application, and nonparametric regression models generally need larger data sets to fit well.

Regression Trees
Decision tree learning algorithms can be applied to learn to predict a dependent variable from data. Although the original CART formulation applied only to predicting univariate data, the framework can be used to predict multivariate data, including time series.

DATA ENVELOPMENT ANALYSIS Data envelopment analysis (DEA) is a nonparametric method in operations research and economics for the estimation of production frontiers. It is used to empirically measure productive efficiency of decision making units (or DMUs). Although DEA has a strong link to production theory in economics, the tool is also used for benchmarking in operations management, where a set of measures is selected to benchmark the performance of manufacturing and service operations. In the circumstance of benchmarking, the efficient DMUs, as defined by DEA, may not necessarily form a “production frontier”, but rather lead to a “best-practice frontier” (Cook, Tone and Zhu, 2014). DEA is referred to as “balanced benchmarking” by Sherman and Zhu (2013). Non-parametric approaches have the benefit of not assuming a particular functional form/shape for the frontier, however they do not provide a general relationship (equation) relating output and input. There are also parametric approaches which are used for the estimation of production frontiers. These require that the shape of the frontier be guessed beforehand by specifying a particular function relating output to input. One can also combine the relative strengths from each of these approaches in a hybrid method (Tofallis, 2001) where the frontier units are first identified by DEA and then a smooth surface is fitted to these. This allows a best-practice relationship between multiple outputs and multiple inputs to be estimated. “The framework has been adapted from multi-input, multi-output production functions and applied in many industries. DEA develops a function whose form is determined by the most efficient producers. This method differs from the Ordinary Least Squares (OLS) statistical technique that bases comparisons relative to an average producer. Like Stochastic Frontier Analysis (SFA), DEA identifies a “frontier” which are characterized as an extreme point method that assumes that if a firm can produce a certain level of output utilizing specific input levels, another firm of equal scale should be capable of doing the same. The most efficient producers can form a ‘composite producer’, allowing the computation of an efficient solution for every level of input or output. Where there is no actual corresponding firm, ‘virtual producers’ are identified to make comparisons” (Berg 2010). Attempts to synthesize DEA and SFA, improving upon their drawbacks, were also made in the literature, via proposing various versions of non-parametric SFA and Stochastic DEA.

History
In microeconomic production theory a firm’s input and output combinations are depicted using a production function. Using such a function one can show the maximum output which can be achieved with any possible combination of inputs, that is, one can construct a production technology frontier (Seiford & Thrall 1990). Some 30 years ago DEA (and frontier techniques in general) set out to answer the question of how to use this principle in empirical applications while overcoming the problem that for actual firms (or other DMUs) one can never observe all the possible input-output combinations. Building on the ideas of Farrell (1957), the seminal work “Measuring the efficiency of decision making units” by Charnes, Cooper & Rhodes (1978) applies linear programming to estimate an empirical production technology frontier for the first time. In Germany, the procedure was used earlier to estimate the marginal productivity of R&D and other factors of production (Brockhoff 1970). Since then, there have been a large number of books and journal articles written on DEA or applying DEA to various sets of problems. Other than comparing efficiency across DMUs within an organization, DEA has also been used to compare efficiency across firms. There are several types of DEA, the most basic being CCR, based on Charnes, Cooper & Rhodes; there are also DEA variants which address varying returns to scale: constant returns to scale (CRS), variable returns to scale (VRS), non-increasing returns to scale, or non-decreasing returns to scale (Ylvinger 2000). The main developments of DEA in the 1970s and 1980s are documented by Seiford & Thrall (1990).

Techniques
Data envelopment analysis (DEA) is a linear programming methodology to measure the efficiency of multiple decision-making units (DMUs) when the production process presents a structure of multiple inputs and outputs. “DEA has been used for both production and cost data. Utilizing the selected variables, such as unit cost and output, DEA software searches for the points with the lowest unit cost for any given output, connecting those points to form the efficiency frontier. Any company not on the frontier is considered inefficient. A numerical coefficient is given to each firm, defining its relative efficiency. Different variables that could be used to establish the efficiency frontier are: number of employees, service quality, environmental safety, and fuel consumption. An early survey of studies of electricity distribution companies identified more than thirty DEA analyses—indicating widespread application of this technique to that network industry. (Jamasb, T. J., Pollitt, M. G. 2001). A number of studies using this technique have been published for water utilities. The main advantage to this method is its ability to accommodate a multiplicity of inputs and outputs. It is also useful because it takes into consideration returns to scale in calculating efficiency, allowing for the concept of increasing or decreasing efficiency based on size and output levels. A drawback of this technique is that model specification and inclusion/exclusion of variables can affect the results.” (Berg 2010) Under general DEA benchmarking, for example, “if one benchmarks the performance of computers, it is natural to consider different features (screen size and resolution, memory size, process speed, hard disk size, and others). One would then have to classify these features into “inputs” and “outputs” in order to apply a proper DEA analysis.

However, these features may not actually represent inputs and outputs at all, in the standard notion of production. In fact, if one examines the benchmarking literature, other terms, such as “indicators”, “outcomes”, and “metrics”, are used. The issue now becomes one of how to classify these performance measures into inputs and outputs, for use in DEA.” (Cook, Tone, and Zhu, 2014)
Some of the advantages of DEA are:
• No need to explicitly specify a mathematical form for the production function.
• Proven to be useful in uncovering relationships that remain hidden for other methodologies.
• Capable of handling multiple inputs and outputs.
• Capable of being used with any input-output measurement.
• The sources of inefficiency can be analysed and quantified for every evaluated unit.
Some of the disadvantages of DEA are:
• Results are sensitive to the selection of inputs and outputs (Berg 2010).
• You cannot test for the best specification (Berg 2010).
• The number of efficient firms on the frontier tends to increase with the number of input and output variables (Berg 2010).
A desire to improve upon DEA by reducing its disadvantages or strengthening its advantages has been a major cause for many discoveries in the recent literature. The most frequently used DEA-based method for obtaining unique efficiency rankings is called cross-efficiency. Originally developed by Sexton et al. in 1986, it has found widespread application ever since Doyle and Green’s 1994 publication. Cross-efficiency is based on the original DEA results, but implements a secondary objective where each DMU peer-appraises all other DMUs with its own factor weights. The average of these peer-appraisal scores is then used to calculate a DMU’s cross-efficiency score. This approach avoids DEA’s disadvantages of having multiple efficient DMUs and potentially non-unique weights. Another approach to remedy some of DEA’s drawbacks is Stochastic DEA, which synthesizes DEA and SFA.

Sample Applications
DEA is commonly applied in the electric utilities sector. For instance, a government authority can choose data envelopment analysis as its measuring tool to design an individualized regulatory rate for each firm based on its comparative efficiency. The input components would include man-hours, losses, capital (lines and transformers only), and goods and services. The output variables would include number of customers, energy delivered, length of lines, and degree of coastal exposure. (Berg 2010) DEA is also regularly used to assess the efficiency of public and not-for-profit organizations, e.g., hospitals (Kuntz, Scholtes & Vera 2007; Kuntz & Vera 2007; Vera & Kuntz 2007) or police forces (Thanassoulis 1995; Sun 2002; Aristovnik et al. 2013, 2014).

Examples
In the DEA methodology, formally developed by Charnes, Cooper and Rhodes (1978), efficiency is defined as a ratio of a weighted sum of outputs to a weighted sum of inputs, where the weights structure is calculated by means of mathematical programming and constant returns to scale (CRS) are assumed. In 1984, Banker, Charnes and Cooper developed a model with variable returns to scale (VRS). Assume that we have the following data:
• Unit 1 produces 100 items per day, and the inputs per item are 10 dollars for materials and 2 labour-hours.
• Unit 2 produces 80 items per day, and the inputs are 8 dollars for materials and 4 labour-hours.
• Unit 3 produces 120 items per day, and the inputs are 12 dollars for materials and 1.5 labour-hours.
To calculate the efficiency of unit 1, we define the objective function as

• Maximize efficiency: (u1 · 100)/(v1 · 10 + v2 · 2)
which is subject to the efficiency of every unit (efficiency cannot be larger than 1):
• Efficiency of unit 1: (u1 · 100)/(v1 · 10 + v2 · 2) ≤ 1
• Efficiency of unit 2: (u1 · 80)/(v1 · 8 + v2 · 4) ≤ 1
• Efficiency of unit 3: (u1 · 120)/(v1 · 12 + v2 · 1.5) ≤ 1
and non-negativity:
• All u and v ≥ 0.
But since linear programming cannot handle fractional objectives, we transform the formulation so that the denominator of the objective function is fixed and the linear programme maximizes only the numerator. The new formulation is:
• Maximize efficiency: u1 · 100
• Subject to the efficiency of unit 1: u1 · 100 − (v1 · 10 + v2 · 2) ≤ 0
• Subject to the efficiency of unit 2: u1 · 80 − (v1 · 8 + v2 · 4) ≤ 0
• Subject to the efficiency of unit 3: u1 · 120 − (v1 · 12 + v2 · 1.5) ≤ 0
• Subject to the normalization: v1 · 10 + v2 · 2 = 1
• All u and v ≥ 0.
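The linearized formulation above can be handed directly to a linear programming solver. The sketch below uses scipy.optimize.linprog with the variable order [u1, v1, v2], which is an assumption of this illustration, to compute unit 1's efficiency score.

```python
from scipy.optimize import linprog

# Decision variables: [u1, v1, v2].
# Maximize u1 * 100  <=>  minimize -100 * u1.
c = [-100, 0, 0]

# Output-minus-input constraints (efficiency of each unit cannot exceed 1):
#   u1*output_j - (v1*materials_j + v2*labour_j) <= 0  for every unit j.
A_ub = [
    [100, -10, -2.0],    # unit 1
    [ 80,  -8, -4.0],    # unit 2
    [120, -12, -1.5],    # unit 3
]
b_ub = [0, 0, 0]

# Normalisation of unit 1's virtual input: v1*10 + v2*2 = 1.
A_eq = [[0, 10, 2]]
b_eq = [1]

result = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                 bounds=[(0, None)] * 3)
print(-result.fun)   # efficiency score of unit 1 (1.0 here: all weight goes to materials)
```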

Inefficiency Measuring
Data envelopment analysis (DEA) has been recognized as a valuable analytical research instrument and a practical decision support tool. DEA has been credited for not requiring a complete specification of the functional form of the production frontier nor of the distribution of inefficient deviations from the frontier. Rather, DEA requires only general production and distribution assumptions. However, if those assumptions are too weak, inefficiency levels may be systematically underestimated in small samples. In addition, erroneous assumptions may cause inconsistency with a bias over the frontier. Therefore, the ability to alter, test and select production assumptions is essential in conducting DEA-based research. However, the DEA models currently available offer only a limited variety of alternative production assumptions.

K-NEAREST NEIGHBORS ALGORITHM
In pattern recognition, the k-nearest neighbors algorithm (k-NN) is a non-parametric method used for classification and regression. In both cases, the input consists of the k closest training examples in the feature space. The output depends on whether k-NN is used for classification or regression:
• In k-NN classification, the output is a class membership. An object is classified by a majority vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of that single nearest neighbour.
• In k-NN regression, the output is the property value for the object. This value is the average of the values of its k nearest neighbors.
k-NN is a type of instance-based learning, or lazy learning, where the function is only approximated locally and all computation is deferred until classification. The k-NN algorithm is among the simplest of all machine learning algorithms. Both for classification and regression, a useful technique can be to assign weight to the contributions of the neighbors, so that the nearer neighbors contribute more to the average than the more distant ones. For example, a common weighting scheme consists in giving each neighbour a weight of 1/d, where d is the distance to the neighbour. The neighbors are taken from a set of objects for which the class (for k-NN classification) or the object property value (for k-NN regression) is known. This can be thought of as the training set for the algorithm, though no explicit training step is required. A peculiarity of the k-NN algorithm is that it is sensitive to the local structure of the data. The algorithm is not to be confused with k-means, another popular machine learning technique.

Algorithm
The training examples are vectors in a multidimensional feature space, each with a class label. The training phase of the algorithm consists only of storing the feature vectors and class labels of the training samples. In the classification phase, k is a user-defined constant, and an unlabeled vector (a query or test point) is classified by assigning the label which is most frequent among the k training samples nearest to that query point. A commonly used distance metric for continuous variables is Euclidean distance. For discrete variables, such as for text classification, another metric can be used, such as the overlap metric (or Hamming distance). In the context of gene expression microarray data, for example, k-NN has also been employed with correlation coefficients such as Pearson and Spearman. Often, the classification accuracy of k-NN can be improved significantly if the distance metric is learned with specialized algorithms such as Large Margin Nearest Neighbour or Neighbourhood Components Analysis. A drawback of the basic “majority voting” classification occurs when the class distribution is skewed. That is, examples of a more frequent class tend to dominate the prediction of the new example, because they tend to be common among the k nearest neighbors due to their large number. One way to overcome this problem is to weight the classification, taking into account the distance from the test point to each of its k nearest neighbors. The class (or value, in regression problems) of each of the k nearest points is multiplied by a weight proportional to the inverse of the distance from that point to the test point. Another way to overcome skew is by abstraction in data representation. For example, in a self-organizing map (SOM), each node is a representative (a center) of a cluster of similar points, regardless of their density in the original training data. k-NN can then be applied to the SOM.
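A minimal sketch of the classification phase just described: store the training vectors, compute Euclidean distances to the query, and take a majority vote among the k nearest labels. The helper function and the toy data are illustrative assumptions.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    """Classify one query point by majority vote among its k nearest neighbours."""
    X_train = np.asarray(X_train, dtype=float)
    distances = np.linalg.norm(X_train - np.asarray(x_query, dtype=float), axis=1)
    nearest = np.argsort(distances)[:k]             # indices of the k closest training points
    votes = Counter(y_train[i] for i in nearest)    # majority vote over their labels
    return votes.most_common(1)[0][0]

# Illustrative two-class training data.
X = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]
y = ["square", "square", "square", "triangle", "triangle", "triangle"]
print(knn_predict(X, y, [1.1, 1.0], k=3))   # "square"
```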

Fig. Example of k-NN classification. The test sample (circle) should be classified either to the first class of squares or to the second class of triangles. If k = 3 (solid line circle) it is assigned to the second class because there are 2 triangles and only 1 square inside the inner circle. If k = 5 (dashed line circle) it is assigned to the first class (3 squares vs. 2 triangles inside the outer circle).

Parameter Selection
The best choice of k depends upon the data; generally, larger values of k reduce the effect of noise on the classification but make boundaries between classes less distinct. A good k can be selected by various heuristic techniques. The special case where the class is predicted to be the class of the closest training sample (i.e., when k = 1) is called the nearest neighbour algorithm. The accuracy of the k-NN algorithm can be severely degraded by the presence of noisy or irrelevant features, or if the feature scales are not consistent with their importance. Much research effort has been put into selecting or scaling features to improve classification. A particularly popular approach is the use of evolutionary algorithms to optimize feature scaling. Another popular approach is to scale features by the mutual information of the training data with the training classes. In binary (two-class) classification problems, it is helpful to choose k to be an odd number as this avoids tied votes. One popular way of choosing the empirically optimal k in this setting is via the bootstrap method.

Metric Learning
The k-nearest neighbour classification performance can often be significantly improved through (supervised) metric learning. Popular algorithms are neighbourhood components analysis and large margin nearest neighbour. Supervised metric learning algorithms use the label information to learn a new metric or pseudo-metric.

Feature Extraction
When the input data to an algorithm is too large to be processed and is suspected to be redundant (e.g., the same measurement in both feet and meters), then the input data will be transformed into a reduced representation set of features (also called a feature vector). Transforming the input data into the set of features is called feature extraction. If the features extracted are carefully chosen, it is expected that the feature set will extract the relevant information from the input data in order to perform the desired task using this reduced representation instead of the full-size input. Feature extraction is performed on raw data prior to applying the k-NN algorithm on the transformed data in feature space. An example of a typical computer vision pipeline for face recognition using k-NN, including feature extraction and dimension reduction pre-processing steps (usually implemented with OpenCV):
1. Haar face detection
2. Mean-shift tracking analysis
3. PCA or Fisher LDA projection into feature space, followed by k-NN classification

Dimension Reduction
For high-dimensional data (e.g., with number of dimensions more than 10) dimension reduction is usually performed prior to applying the k-NN algorithm in order to avoid the effects of the curse of dimensionality. The curse of dimensionality in the k-NN context basically means that Euclidean distance is unhelpful in high dimensions because all vectors are almost equidistant to the search query vector (imagine multiple points lying more or less on a circle with the query point at the center; the distance from the query to all data points in the search space is almost the same). Feature extraction and dimension reduction can be combined in one step using principal component analysis (PCA), linear discriminant analysis (LDA), or canonical correlation analysis (CCA) techniques as a pre-processing step, followed by clustering by k-NN on feature vectors in reduced-dimension space. In machine learning this process is also called low-dimensional embedding. For very-high-dimensional datasets (e.g., when performing a similarity search on live video streams, DNA data or high-dimensional time series) running a fast approximate k-NN search using locality sensitive hashing, “random projections”, “sketches” or other high-dimensional similarity search techniques from the VLDB toolbox might be the only feasible option.

Decision Boundary
Nearest neighbour rules in effect implicitly compute the decision boundary. It is also possible to compute the decision boundary explicitly, and to do so efficiently, so that the computational complexity is a function of the boundary complexity.

Data Reduction
Data reduction is one of the most important problems for work with huge data sets. Usually, only some of the data points are needed for accurate classification. Those data are called the prototypes and can be found as follows:
1. Select the class-outliers, that is, training data that are classified incorrectly by k-NN (for a given k).
2. Separate the rest of the data into two sets: (i) the prototypes that are used for the classification decisions, and (ii) the absorbed points that can be correctly classified by k-NN using the prototypes. The absorbed points can then be removed from the training set.

Selection of Class-outliers
A training example surrounded by examples of other classes is called a class outlier. Causes of class outliers include:
• Random error
• Insufficient training examples of this class (an isolated example appears instead of a cluster)
• Missing important features (the classes are separated in other dimensions which we do not know)
• Too many training examples of other classes (unbalanced classes) that create a “hostile” background for the given small class
Class outliers with k-NN produce noise. They can be detected and separated for future analysis. Given two natural numbers, k > r > 0, a training example is called a (k,r)NN class-outlier if its k nearest neighbors include more than r examples of other classes.

CNN for Data Reduction
Condensed nearest neighbour (CNN, the Hart algorithm) is an algorithm designed to reduce the data set for k-NN classification. It selects the set of prototypes U from the training data such that 1NN with U can classify the examples almost as accurately as 1NN does with the whole data set. Given a training set X, CNN works iteratively:
1. Scan all elements of X, looking for an element x whose nearest prototype from U has a different label than x.
2. Remove x from X and add it to U.
3. Repeat the scan until no more prototypes are added to U.
Use U instead of X for classification. The examples that are not prototypes are called “absorbed” points. It is efficient to scan the training examples in order of decreasing border ratio. The border ratio of a training example x is defined as
a(x) = ||x′ − y|| / ||x − y||,
where ||x − y|| is the distance to the closest example y having a different colour than x, and ||x′ − y|| is the distance from y to its closest example x′ with the same label as x. The border ratio is in the interval [0,1] because ||x′ − y|| never exceeds ||x − y||. This ordering gives preference to the borders of the classes for inclusion in the set of prototypes U. A point of a different label than x is called external to x. The calculation of the border ratio is illustrated by the figure: the data points are labeled by colours; the initial point is x and its label is red; external points are blue and green; the closest external point to x is y; the closest red point to y is x′. The border ratio a(x) = ||x′ − y||/||x − y|| is the attribute of the initial point x. Below is an illustration of CNN in a series of figures. There are three classes (red, green and blue), with 60 points in each class initially. One figure shows the 1NN classification map, in which each pixel is classified by 1NN using all the data. Another shows the 5NN classification map; white areas correspond to the unclassified regions, where the 5NN vote is tied (for example, if there are two green, two red and one blue points among the 5 nearest neighbors). A further figure shows the reduced data set: the crosses are the class-outliers selected by the (3,2)NN rule (all three nearest neighbors of these instances belong to other classes), the squares are the prototypes, and the empty circles are the absorbed points. The left bottom corner shows the numbers of class-outliers, prototypes and absorbed points for all three classes. The number of prototypes varies from 15% to 20% for different classes in this example. The final figure shows that the 1NN classification map based on the prototypes is very similar to that based on the initial data set. The figures were produced using the Mirkes applet.
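A minimal sketch of the Hart procedure described above; it keeps prototype indices rather than physically removing points from X (an implementation convenience, not part of the original description), and the toy two-class data are made up.

```python
import numpy as np

def condensed_nearest_neighbour(X, y):
    """Hart's CNN: grow a prototype set U until 1NN on U labels every training point correctly."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    prototypes = [0]                       # seed U with the first training example
    changed = True
    while changed:                         # repeat scans until no point is added
        changed = False
        for i in range(len(X)):
            if i in prototypes:
                continue
            d = np.linalg.norm(X[prototypes] - X[i], axis=1)
            nearest = prototypes[int(np.argmin(d))]
            if y[nearest] != y[i]:         # misclassified by 1NN on U: promote to prototype
                prototypes.append(i)
                changed = True
    return prototypes                      # the remaining points are the "absorbed" ones

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(4, 1, (30, 2))])
y = np.array([0] * 30 + [1] * 30)
print(len(condensed_nearest_neighbour(X, y)), "prototypes kept out of", len(X))
```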

Fig. The dataset.

Fig. The 1NN classification map.

Fig. The 5NN classification map.

Fig. The CNN reduced dataset.

Fig. The 1NN classification map based on the CNN extracted prototypes.
FCNN (for Fast Condensed Nearest Neighbour) is a variant of CNN, which turns out to be one of the fastest data set reduction algorithms for k-NN classification.

k-NN Regression
In k-NN regression, the k-NN algorithm is used for estimating continuous variables. One such algorithm uses a weighted average of the k nearest neighbors, weighted by the inverse of their distance. This algorithm works as follows:
1. Compute the Euclidean or Mahalanobis distance from the query example to the labeled examples.

2. Order the labeled examples by increasing distance.
3. Find a heuristically optimal number k of nearest neighbors, based on RMSE. This is done using cross-validation.
4. Calculate an inverse distance weighted average with the k nearest multivariate neighbours.

k-NN Outlier
The distance to the kth nearest neighbour can also be seen as a local density estimate and thus is also a popular outlier score in anomaly detection. The larger the distance to the k-NN, the lower the local density, and the more likely the query point is an outlier. To take into account the whole neighborhood of the query point, the average distance to the k-NN can be used. Although quite simple, this outlier model, along with another classic data mining method, the local outlier factor, works quite well in comparison to more recent and more complex approaches, according to a large-scale experimental analysis.
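Steps 1, 2 and 4 of the regression procedure above can be sketched as follows; the cross-validated choice of k in step 3 is omitted for brevity, and the helper name, the toy data and the small epsilon guarding against division by zero are illustrative assumptions.

```python
import numpy as np

def knn_regress(X_train, y_train, x_query, k=5, eps=1e-12):
    """Distance-weighted k-NN regression: inverse-distance weighted mean of the k neighbours."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    d = np.linalg.norm(X_train - np.asarray(x_query, dtype=float), axis=1)  # step 1
    nearest = np.argsort(d)[:k]                     # step 2: order by increasing distance
    w = 1.0 / (d[nearest] + eps)                    # step 4: weights = inverse distance
    return np.sum(w * y_train[nearest]) / np.sum(w)

rng = np.random.default_rng(5)
X = rng.uniform(0, 10, size=(80, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=80)
print(knn_regress(X, y, [3.0], k=5))   # roughly sin(3.0), about 0.14
```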

Validation of Results
A confusion matrix or "matching matrix" is often used as a tool to validate the accuracy of k-NN classification. More robust statistical methods such as the likelihood-ratio test can also be applied.

SUPPORT VECTOR MACHINE
In machine learning, support vector machines (SVMs, also support vector networks) are supervised learning models with associated learning algorithms that analyze data used for classification and regression analysis. Given a set of training examples, each marked as belonging to one or the other of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other, making it a non-probabilistic binary linear classifier (although methods such as Platt scaling exist to use SVM in a probabilistic classification setting). An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall.
In addition to performing linear classification, SVMs can efficiently perform a non-linear classification using what is called the kernel trick, implicitly mapping their inputs into high-dimensional feature spaces. When data are not labeled, supervised learning is not possible, and an unsupervised learning approach is required, which attempts to find natural clustering of the data into groups and then map new data to these formed groups. The support vector clustering algorithm, created by Hava Siegelmann and Vladimir Vapnik, applies the statistics of support vectors, developed in the support vector machines algorithm, to categorize unlabeled data, and is one of the most widely used clustering algorithms in industrial applications.

Motivation
Classifying data is a common task in machine learning. Suppose some given data points each belong to one of two classes, and the goal is to decide which class a new data point will be in. In the case of support vector machines, a data point is viewed as a p-dimensional vector (a list of p numbers), and we want to know whether we can separate such points with a (p − 1)-dimensional hyperplane. This is called a linear classifier. There are many hyperplanes that might classify the data. One reasonable choice as the best hyperplane is the one that represents the largest separation, or margin, between the two classes. So we choose the hyperplane so that the distance from it to the nearest data point on each side is maximized. If such a hyperplane exists, it is known as the maximum-margin hyperplane and the linear classifier it defines is known as a maximum margin classifier; or equivalently, the perceptron of optimal stability.

Definition
More formally, a support vector machine constructs a hyperplane or set of hyperplanes in a high- or infinite-dimensional space, which can be used for classification, regression, or other tasks like outlier detection. Intuitively, a good separation is achieved by the hyperplane that has the largest distance to the nearest training-data point of any class (the so-called functional margin), since in general the larger the margin, the lower the generalization error of the classifier.

Fig. Kernel machine.
Whereas the original problem may be stated in a finite-dimensional space, it often happens that the sets to discriminate are not linearly separable in that space. For this reason, it was proposed that the original finite-dimensional space be mapped into a much higher-dimensional space, presumably making the separation easier in that space. To keep the computational load reasonable, the mappings used by SVM schemes are designed to ensure that dot products may be computed easily in terms of the variables in the original space, by defining them in terms of a kernel function k(x, y) selected to suit the problem. The hyperplanes in the higher-dimensional space are defined as the set of points whose dot product with a vector in that space is constant. The vectors defining the hyperplanes can be chosen to be linear combinations with parameters αi of images of feature vectors xi that occur in the data base. With this choice of a hyperplane, the points x in the feature space that are mapped into the hyperplane are defined by the relation:

∑_i α_i k(x_i, x) = constant.
Note that if k(x, y) becomes small as y grows further away from x, each term in the sum measures the degree of closeness of the test point x to the corresponding data base point x_i. In this way, the sum of kernels above can be used to measure the relative nearness of each test point to the data points originating in one or the other of the sets to be discriminated. Note that the set of points x mapped into any hyperplane can be quite convoluted as a result, allowing much more complex discrimination between sets which are not convex at all in the original space.
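The relation above can be evaluated directly once the coefficients α_i, the support vectors x_i and an offset are known; the RBF kernel choice and the variable names below are illustrative, not prescribed by the text.

```python
import numpy as np

def rbf_kernel(x, y, gamma=1.0):
    # a commonly used kernel function k(x, y)
    return np.exp(-gamma * np.sum((np.asarray(x) - np.asarray(y)) ** 2))

def decision_value(x, support_vectors, alphas, b=0.0):
    # sum_i alpha_i * k(x_i, x) + b: points with equal values lie on the same hyperplane
    return sum(a * rbf_kernel(sv, x) for a, sv in zip(alphas, support_vectors)) + b
```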

Applications
SVMs can be used to solve various real world problems:
• SVMs are helpful in text and hypertext categorization as their application can significantly reduce the need for labeled training instances in both the standard inductive and transductive settings.
• Classification of images can also be performed using SVMs. Experimental results show that SVMs achieve significantly higher search accuracy than traditional query refinement schemes after just three to four rounds of relevance feedback. This is also true of image segmentation systems, including those using a modified version of SVM that uses the privileged approach as suggested by Vapnik.
• Hand-written characters can be recognized using SVM.
• The SVM algorithm has been widely applied in the biological and other sciences. SVMs have been used to classify proteins with up to 90% of the compounds classified correctly. Permutation tests based on SVM weights have been suggested as a mechanism for interpretation of SVM models. Support vector machine weights have also been used to interpret SVM models in the past. Post-hoc interpretation of support vector machine models in order to identify features used by the model to make predictions is a relatively new area of research with special significance in the biological sciences.

History
The original SVM algorithm was invented by Vladimir N. Vapnik and Alexey Ya. Chervonenkis in 1963. In 1992, Bernhard E. Boser, Isabelle M. Guyon and Vladimir N. Vapnik suggested a way to create nonlinear classifiers by applying the kernel trick to maximum-margin hyperplanes. The current standard incarnation (soft margin) was proposed by Corinna Cortes and Vapnik in 1993 and published in 1995.

Implementation
The parameters of the maximum-margin hyperplane are derived by solving the optimization problem. There exist several specialized algorithms for quickly solving the quadratic programming (QP) problem that arises from SVMs, mostly relying on heuristics for breaking the problem down into smaller, more manageable chunks.
Another approach is to use an interior point method that uses Newton-like iterations to find a solution of the Karush–Kuhn–Tucker conditions of the primal and dual problems. Instead of solving a sequence of broken-down problems, this approach directly solves the problem altogether. To avoid solving a linear system involving the large kernel matrix, a low-rank approximation to the matrix is often used in the kernel trick.
Another common method is Platt's sequential minimal optimization (SMO) algorithm, which breaks the problem down into 2-dimensional sub-problems that are solved analytically, eliminating the need for a numerical optimization algorithm and matrix storage. This algorithm is conceptually simple, easy to implement, generally faster, and has better scaling properties for difficult SVM problems.
The special case of linear support vector machines can be solved more efficiently by the same kind of algorithms used to optimize its close cousin, logistic regression; this class of algorithms includes sub-gradient descent (e.g., PEGASOS) and coordinate descent (e.g., LIBLINEAR). LIBLINEAR has some attractive training-time properties. Each convergence iteration takes time linear in the time taken to read the training data, and the iterations also have a Q-linear convergence property, making the algorithm extremely fast. The general kernel SVMs can also be solved more efficiently using sub-gradient descent (e.g., P-packSVM), especially when parallelization is allowed. Kernel SVMs are available in many machine learning toolkits, including LIBSVM, MATLAB, SAS, SVMlight, kernlab, scikit-learn, Shogun, Weka, Shark, JKernelMachines, OpenCV and others.

METHODS

ANALYSIS OF SIMILARITIES
Analysis of similarities (ANOSIM) is a non-parametric statistical test widely used in the field of ecology. The test was first suggested by K. R. Clarke as an ANOVA-like test which, instead of operating on raw data, operates on a ranked dissimilarity matrix. Given a matrix of rank dissimilarities between a set of samples, each belonging solely to one treatment group, ANOSIM tests whether we can reject the null hypothesis that the similarity between groups is greater than or equal to the similarity within the groups.

The test statistic R is calculated in the following way:
R = (r_B − r_W) / (M/2),
where r_B is the average of rank similarities of pairs of samples (or replicates) originating from different sites, r_W is the average of rank similarities of pairs among replicates within sites, and M = n(n − 1)/2, where n is the number of samples.
The test statistic R is constrained between the values −1 and 1, where positive numbers suggest more similarity within sites and values close to zero represent no difference between between-site and within-site similarities. Negative R values suggest more similarity between sites than within sites and may raise the possibility of wrong assignment of samples to sites.
For the purpose of hypothesis testing, where the null hypothesis is that the similarities within sites are smaller than or equal to the similarities between sites, the R statistic is usually compared to a set of R′ values obtained by randomly shuffling site labels between the samples and calculating the resulting R′, repeated many times. The proportion of permuted R′ values that equal or exceed the actual R is the p-value for the observed R statistic.
Ranking of dissimilarity in ANOSIM and NMDS (non-metric multidimensional scaling) go hand in hand. Combining both methods complements visualisation of group differences with significance testing. ANOSIM is implemented in several statistical software packages, including PRIMER, the R vegan package and PAST.
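A small NumPy/SciPy sketch of the R statistic and its permutation p-value as defined above; the helper names are illustrative, and dedicated implementations exist in PRIMER, the R vegan package and PAST.

```python
import numpy as np
from scipy.stats import rankdata

def anosim_r(dist, groups):
    """dist: (n, n) dissimilarity matrix; groups: length-n array of site labels."""
    groups = np.asarray(groups)
    n = len(groups)
    iu = np.triu_indices(n, k=1)
    ranks = rankdata(dist[iu])                       # rank all pairwise dissimilarities
    between = groups[iu[0]] != groups[iu[1]]
    m = n * (n - 1) / 2
    return (ranks[between].mean() - ranks[~between].mean()) / (m / 2)

def anosim_pvalue(dist, groups, n_perm=999, seed=0):
    rng = np.random.default_rng(seed)
    r_obs = anosim_r(dist, groups)
    perms = [anosim_r(dist, rng.permutation(np.asarray(groups))) for _ in range(n_perm)]
    return (1 + sum(r >= r_obs for r in perms)) / (n_perm + 1)
```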

ANDERSON–DARLING TEST
The Anderson–Darling test is a statistical test of whether a given sample of data is drawn from a given probability distribution. In its basic form, the test assumes that there are no parameters to be estimated in the distribution being tested, in which case the test and its set of critical values are distribution-free. However, the test is most often used in contexts where a family of distributions is being tested, in which case the parameters of that family need to be estimated and account must be taken of this in adjusting either the test statistic or its critical values. When applied to testing whether a normal distribution adequately describes a set of data, it is one of the most powerful statistical tools for detecting most departures from normality. K-sample Anderson–Darling tests are available for testing whether several collections of observations can be modelled as coming from a single population, where the distribution function does not have to be specified.
In addition to its use as a test of fit for distributions, it can be used in parameter estimation as the basis for a form of minimum distance estimation procedure. The test is named after Theodore Wilbur Anderson (1918–2016) and Donald A. Darling (1915–2014), who invented it in 1952.

Non-parametric k-sample Tests
Fritz Scholz and Michael A. Stephens (1987) discuss a test, based on the Anderson–Darling measure of agreement between distributions, for whether a number of random samples with possibly different sample sizes may have arisen from the same distribution, where this distribution is unspecified. The R package kSamples implements this rank test for comparing k samples among several other such rank tests.
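SciPy exposes both forms of the test; the sketch below, with made-up data, assumes scipy.stats.anderson for the one-sample normality test and scipy.stats.anderson_ksamp for the k-sample version.

```python
import numpy as np
from scipy.stats import anderson, anderson_ksamp

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=100)     # hypothetical sample

res = anderson(x, dist="norm")                   # one-sample test of normality
print(res.statistic, res.critical_values)

y = rng.normal(loc=0.5, scale=1.0, size=80)
print(anderson_ksamp([x, y]))                    # k-sample test, distribution unspecified
```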

COCHRAN’S Q TEST In statistics, in the analysis of two-way randomized block designs where the response variable can take only two possible outcomes (coded as 0 and 1), Cochran’s Q test is a non-parametric statistical test to verify whether k treatments have identical effects. It is named after William Gemmell Cochran. Cochran’s Q test should not be confused with Cochran’s C test, which is a variance outlier test. Put in simple technical terms, Cochran’s Q test requires that there only be a binary response (e.g., success/failure or 1/0) and that there be more than 2 groups of the same size. The test assesses whether the proportion of successes is the same between groups. Often it is used to assess if different observers of the same phenomenon have consistent results (interobserver variability).

Assumptions
Cochran’s Q test is based on the following assumptions:
1. A large sample approximation; in particular, it assumes that b is “large”.
2. The blocks were randomly selected from the population of all possible blocks.
3. The outcomes of the treatments can be coded as binary responses (i.e., a “0” or “1”) in a way that is common to all treatments within each block.
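A from-scratch sketch under these assumptions, using the standard form of the Q statistic (column totals versus block totals) referred to a chi-squared distribution with k − 1 degrees of freedom; the function name and the example data are illustrative, not taken from the text.

```python
import numpy as np
from scipy.stats import chi2

def cochrans_q(X):
    """X: (b blocks, k treatments) array of binary outcomes."""
    X = np.asarray(X)
    b, k = X.shape
    col, row, N = X.sum(axis=0), X.sum(axis=1), X.sum()
    q = k * (k - 1) * np.sum((col - N / k) ** 2) / np.sum(row * (k - row))
    return q, chi2.sf(q, df=k - 1)

# 6 blocks (subjects) x 3 treatments, hypothetical success/failure data
data = [[1, 1, 0], [1, 1, 0], [1, 0, 0], [1, 1, 1], [0, 1, 0], [1, 1, 0]]
print(cochrans_q(data))
```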

COHEN’S KAPPA Cohen’s kappa coefficient (κ) is a statistic which measures inter-rater agreement for qualitative (categorical) items. It is generally thought to be a more robust measure than simple percent agreement calculation, as κ takes into account the possibility of the agreement occurring by chance. There is controversy surrounding Cohen’s Kappa due to the difficulty in interpreting indices of agreement. Some researchers have suggested that it is conceptually simpler to evaluate disagreement between items.

Same Percentages but Different Numbers
A case sometimes considered to be a problem with Cohen’s Kappa occurs when comparing the Kappa calculated for two pairs of raters with the two raters in each pair having the same percentage agreement but one pair give a similar number of ratings in each class while the other pair give a very different number of ratings in each class. (In the cases below, notice B has 70 yeses and 30 nos, in the first case, but those numbers are reversed in the second.) For instance, in the following two cases there is equal agreement between A and B (60 out of 100 in both cases) in terms of agreement in each class, so we would expect the relative values of Cohen’s Kappa to reflect this. However, calculating Cohen’s Kappa for each:

we find that it shows greater similarity between A and B in the second case, compared to the first. This is because while the percentage agreement is the same, the percentage agreement that would occur ‘by chance’ is significantly higher in the first case (0.54 compared to 0.46).

Significance and Magnitude
Statistical significance for kappa is rarely reported, probably because even relatively low values of kappa can nonetheless be significantly different from zero but not of sufficient magnitude to satisfy investigators. Still, its standard error has been described and is computed by various computer programmes.
If statistical significance is not a useful guide, what magnitude of kappa reflects adequate agreement? Guidelines would be helpful, but factors other than agreement can influence its magnitude, which makes interpretation of a given magnitude problematic. As Sim and Wright noted, two important factors are prevalence (are the codes equiprobable or do their probabilities vary) and bias (are the marginal probabilities for the two observers similar or different). Other things being equal, kappas are higher when codes are equiprobable. On the other hand, kappas are higher when codes are distributed asymmetrically by the two observers. In contrast to probability variations, the effect of bias is greater when kappa is small than when it is large.
Another factor is the number of codes. As the number of codes increases, kappas become higher. Based on a simulation study, Bakeman and colleagues concluded that for fallible observers, values for kappa were lower when codes were fewer. And, in agreement with Sim and Wright’s statement concerning prevalence, kappas were higher when codes were roughly equiprobable. Thus Bakeman et al. concluded that “no one value of kappa can be regarded as universally acceptable.” They also provide a computer programme that lets users compute values for kappa specifying number of codes, their probability, and observer accuracy. For example, given equiprobable codes and observers who are 85% accurate, values of kappa are 0.49, 0.60, 0.66, and 0.69 when the number of codes is 2, 3, 5, and 10, respectively.
Nonetheless, magnitude guidelines have appeared in the literature. Perhaps the first were Landis and Koch, who characterized values < 0 as indicating no agreement, 0–0.20 as slight, 0.21–0.40 as fair, 0.41–0.60 as moderate, 0.61–0.80 as substantial, and 0.81–1 as almost perfect agreement. This set of guidelines is however by no means universally accepted; Landis and Koch supplied no evidence to support it, basing it instead on personal opinion. It has been noted that these guidelines may be more harmful than helpful. Fleiss’s equally arbitrary guidelines characterize kappas over 0.75 as excellent, 0.40 to 0.75 as fair to good, and below 0.40 as poor.

Weighted Kappa Weighted kappa lets you count disagreements differently and is especially useful when codes are ordered. Three matrices are involved, the matrix of observed scores, the matrix of expected scores based on chance agreement, and the weight matrix. Weight matrix cells located on the diagonal (upper-left to bottom-right) represent agreement and thus contain zeros. Off-diagonal cells contain weights indicating the seriousness of that disagreement. Often, cells one off the diagonal are weighted 1, those two off 2, etc. The equation for weighted κ is:

κ = 1 − ( ∑_{i=1}^{k} ∑_{j=1}^{k} w_{ij} x_{ij} ) / ( ∑_{i=1}^{k} ∑_{j=1}^{k} w_{ij} m_{ij} ),
where k = number of codes and w_{ij}, x_{ij} and m_{ij} are elements in the weight, observed, and expected matrices, respectively. When diagonal cells contain weights of 0 and all off-diagonal cells weights of 1, this formula produces the same value of kappa as the calculation given above.
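Assuming scikit-learn is available, cohen_kappa_score computes both the unweighted and the weighted statistic; the rating vectors below are made up.

```python
from sklearn.metrics import cohen_kappa_score

rater_a = [0, 1, 2, 2, 1, 0, 1, 2, 0, 1]    # hypothetical ordinal codes from two raters
rater_b = [0, 1, 1, 2, 1, 0, 2, 2, 0, 0]

print(cohen_kappa_score(rater_a, rater_b))                       # unweighted kappa
print(cohen_kappa_score(rater_a, rater_b, weights="linear"))     # linear weights: one off the diagonal = 1
print(cohen_kappa_score(rater_a, rater_b, weights="quadratic"))  # quadratic weights
```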

Kappa Maximum Kappa assumes its theoretical maximum value of 1 only when both observers distribute codes the same, that is, when corresponding row and column sums are identical. Anything less is less than perfect agreement. Still, the maximum value kappa could achieve given unequal distributions helps interpret the value of kappa actually obtained. The equation for κ maximum is:

κ_max = (P_max − P_exp) / (1 − P_exp),

where

P_exp = ∑_{i=1}^{k} p_{i+} p_{+i} (as usual) and P_max = ∑_{i=1}^{k} min(p_{i+}, p_{+i}),

with k = number of codes, p_{i+} the row probabilities, and p_{+i} the column probabilities.

Limitations
Kappa is an index that considers observed agreement with respect to a baseline agreement. However, investigators must consider carefully whether Kappa’s baseline agreement is relevant for the particular research question. Kappa’s baseline is frequently described as the agreement due to chance, which is only partially correct. Kappa’s baseline agreement is the agreement that would be expected due to random allocation, given the quantities specified by the marginal totals of the square contingency table. Thus, Kappa = 0 when the observed allocation is apparently random, regardless of the quantity disagreement as constrained by the marginal totals. However, for many applications, investigators should be more interested in the quantity disagreement in the marginal totals than in the allocation disagreement as described by the additional information on the diagonal of the square contingency table. Thus for many applications, Kappa’s baseline is more distracting than enlightening. Consider the following example:

Fig. Kappa example.

The disagreement proportion is 14/16, or 0.875. The disagreement is due to quantity because allocation is optimal. Kappa is 0.01.

The disagreement proportion is 2/16, or 0.125. The disagreement is due to allocation because quantities are identical. Kappa is −0.07. Here, reporting quantity and allocation disagreement is informative while Kappa obscures information. Furthermore, Kappa introduces some challenges in calculation and interpretation because Kappa is a ratio. It is possible for Kappa’s ratio to return an undefined value due to a zero in the denominator. Furthermore, a ratio reveals neither its numerator nor its denominator. It is more informative for researchers to report disagreement in two components, quantity and allocation. These two components describe the relationship between the categories more clearly than a single summary statistic. When predictive accuracy is the goal, researchers can more easily begin to think about ways to improve a prediction by using the two components of quantity and allocation, rather than one ratio of Kappa.
Some researchers have expressed concern over κ’s tendency to take the observed categories’ frequencies as givens, which can make it unreliable for measuring agreement in situations such as the diagnosis of rare diseases. In these situations, κ tends to underestimate the agreement on the rare category. For this reason, κ is considered an overly conservative measure of agreement. Others contest the assertion that kappa “takes into account” chance agreement. To do this effectively would require an explicit model of how chance affects rater decisions. The so-called chance adjustment of kappa statistics supposes that, when not completely certain, raters simply guess (a very unrealistic scenario).

FRIEDMAN TEST
The Friedman test is a non-parametric statistical test developed by Milton Friedman. Similar to the parametric repeated measures ANOVA, it is used to detect differences in treatments across multiple test attempts. The procedure involves ranking each row (or block) together, then considering the values of ranks by columns. Applicable to complete block designs, it is thus a special case of the Durbin test. Classic examples of use are:
• n wine judges each rate k different wines. Are any of the k wines ranked consistently higher or lower than the others?
• n welders each use k welding torches, and the ensuing welds were rated on quality. Do any of the k torches produce consistently better or worse welds?

The Friedman test is used for one-way repeated measures analysis of variance by ranks. In its use of ranks it is similar to the Kruskal–Wallis one-way analysis of variance by ranks. The Friedman test is widely supported by many statistical software packages.
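For the wine-judging example above, scipy.stats.friedmanchisquare accepts one argument per treatment (wine), each containing the ratings of the n judges; the scores below are made up.

```python
from scipy.stats import friedmanchisquare

# ratings of three wines by five judges (one block per judge, hypothetical values)
wine_1 = [7, 6, 8, 7, 6]
wine_2 = [5, 4, 6, 5, 5]
wine_3 = [8, 7, 9, 8, 7]

stat, p = friedmanchisquare(wine_1, wine_2, wine_3)
print(stat, p)
```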

KENDALL RANK CORRELATION COEFFICIENT In statistics, the Kendall rank correlation coefficient, commonly referred to as Kendall’s tau coefficient (after the Greek letter τ), is a statistic used to measure the ordinal association between two measured quantities. A tau test is a non-parametric hypothesis test for statistical dependence based on the tau coefficient. It is a measure of rank correlation: the similarity of the orderings of the data when ranked by each of the quantities. It is named after Maurice Kendall, who developed it in 1938, though Gustav Fechner had proposed a similar measure in the context of time series in 1897. Intuitively, the Kendall correlation between two variables will be high when observations have a similar (or identical for a correlation of 1) rank (i.e., relative position label of the observations within the variable: 1st, 2nd, 3rd, etc.) between the two variables, and low when observations have a dissimilar (or fully different for a correlation of –1) rank between the two variables. Both Kendall’s τ and Spearman’s ρ can be formulated as special cases of a more general correlation coefficient.
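A minimal SciPy example of the tau coefficient and its associated test; the two rankings are made up.

```python
from scipy.stats import kendalltau

x = [1, 2, 3, 4, 5]          # hypothetical ranks assigned by one assessor
y = [2, 1, 4, 3, 5]          # ranks assigned by another
tau, p = kendalltau(x, y)
print(tau, p)
```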

KENDALL’S W Kendall’s W (also known as Kendall’s coefficient of concordance) is a non-parametric statistic. It is a normalization of the statistic of the Friedman test, and can be used for assessing agreement among raters. Kendall’s W ranges from 0 (no agreement) to 1 (complete agreement). Suppose, for instance, that a number of people have been asked to rank a list of political concerns, from most important to least important. Kendall’s W can be calculated from these data. If the test statistic W is 1, then all the survey respondents have been unanimous, and each respondent has assigned the same order to the list of concerns. If W is 0, then there is no overall trend of agreement among the respondents, and their responses may be regarded as essentially random. Intermediate values of W indicate a greater or lesser degree of unanimity among the various responses. While tests using the standard Pearson correlation coefficient assume normally distributed values and compare two sequences of outcomes at a time, Kendall’s W makes no assumptions regarding the nature of the probability distribution and can handle any number of distinct outcomes. W is linearly related to the mean value of the Spearman’s rank correlation coefficients between all pairs of the rankings over which it is calculated.
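A from-scratch sketch using the usual definition of W from the rank sums (no tie correction); the function name and the small ranking matrix are illustrative, not taken from the text.

```python
import numpy as np

def kendalls_w(ranks):
    """ranks: (m raters, n items) matrix of rankings 1..n per rater, no ties."""
    ranks = np.asarray(ranks, dtype=float)
    m, n = ranks.shape
    R = ranks.sum(axis=0)                       # rank sum per item
    S = np.sum((R - R.mean()) ** 2)
    return 12 * S / (m ** 2 * (n ** 3 - n))     # 0 = no agreement, 1 = complete agreement

# three respondents ranking four political concerns (hypothetical)
print(kendalls_w([[1, 2, 3, 4],
                  [1, 3, 2, 4],
                  [2, 1, 3, 4]]))
```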

KOLMOGOROV–SMIRNOV TEST
In statistics, the Kolmogorov–Smirnov test (K–S test or KS test) is a nonparametric test of the equality of continuous, one-dimensional probability distributions that can be used to compare a sample with a reference probability distribution (one-sample K–S test), or to compare two samples (two-sample K–S test). It is named after Andrey Kolmogorov and Nikolai Smirnov. The Kolmogorov–Smirnov statistic quantifies a distance between the empirical distribution function of the sample and the cumulative distribution function of the reference distribution, or between the empirical distribution functions of two samples. The null distribution of this statistic is calculated under the null hypothesis that the sample is drawn from the reference distribution (in the one-sample case) or that the samples are drawn from the same distribution (in the two-sample case). In each case, the distributions considered under the null hypothesis are continuous distributions but are otherwise unrestricted. The two-sample K–S test is one of the most useful and general nonparametric methods for comparing two samples, as it is sensitive to differences in both location and shape of the empirical cumulative distribution functions of the two samples.
The Kolmogorov–Smirnov test can be modified to serve as a goodness of fit test. In the special case of testing for normality of the distribution, samples are standardized and compared with a standard normal distribution. This is equivalent to setting the mean and variance of the reference distribution equal to the sample estimates, and it is known that using these to define the specific reference distribution changes the null distribution of the test statistic. Various studies have found that, even in this corrected form, the test is less powerful for testing normality than the Shapiro–Wilk test or Anderson–Darling test. However, these other tests have their own disadvantages. For instance, the Shapiro–Wilk test is known not to work well in samples with many identical values.

Fig. Illustration of the Kolmogorov–Smirnov statistic. The red line is the CDF, the blue line is an ECDF, and the black arrow is the K–S statistic.
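SciPy covers both forms of the test described above; the sketch below assumes scipy.stats.kstest for the one-sample case and scipy.stats.ks_2samp for the two-sample case, with simulated data.

```python
import numpy as np
from scipy.stats import kstest, ks_2samp

rng = np.random.default_rng(1)
x = rng.normal(size=200)

print(kstest(x, "norm"))                        # one-sample test against the standard normal
print(ks_2samp(x, rng.normal(0.3, 1.0, 200)))   # two-sample test
```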

KRUSKAL–WALLIS ONE-WAY ANALYSIS OF VARIANCE
The Kruskal–Wallis test by ranks, Kruskal–Wallis H test (named after William Kruskal and W. Allen Wallis), or one-way ANOVA on ranks is a non-parametric method for testing whether samples originate from the same distribution. It is used for comparing two or more independent samples of equal or different sample sizes. It extends the Mann–Whitney U test when there are more than two groups. The parametric equivalent of the Kruskal–Wallis test is the one-way analysis of variance (ANOVA).
A significant Kruskal–Wallis test indicates that at least one sample stochastically dominates one other sample. The test does not identify where this stochastic dominance occurs or for how many pairs of groups stochastic dominance obtains. For analyzing the specific sample pairs for stochastic dominance in post hoc testing, Dunn’s test, pairwise Mann–Whitney tests without Bonferroni correction, or the more powerful but less well known Conover–Iman test are appropriate.
Since it is a non-parametric method, the Kruskal–Wallis test does not assume a normal distribution of the residuals, unlike the analogous one-way analysis of variance. If the researcher can make the less stringent assumptions of an identically shaped and scaled distribution for all groups, except for any difference in medians, then the null hypothesis is that the medians of all groups are equal, and the alternative hypothesis is that at least one population median of one group is different from the population median of at least one other group.
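A minimal SciPy example with three independent samples of unequal size (made-up measurements).

```python
from scipy.stats import kruskal

group_1 = [2.9, 3.0, 2.5, 2.6, 3.2]
group_2 = [3.8, 2.7, 4.0, 2.4]
group_3 = [2.8, 3.4, 3.7, 2.2, 2.0]

h, p = kruskal(group_1, group_2, group_3)
print(h, p)
```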

Exact Probability Tables A large amount of computing resources is required to compute exact probabilities for the Kruskal–Wallis test. Existing software only provides exact probabilities for sample sizes less than about 30 participants. These software programmes rely on asymptotic approximation for larger sample sizes. Exact probability values for larger sample sizes are available. Spurrier (2003) published exact probability tables for samples as large as 45 participants. Meyer and Seaman (2006) produced exact probability distributions for samples as large as 105 participants.

KUIPER’S TEST
Kuiper’s test is used in statistics to test whether a given distribution, or family of distributions, is contradicted by evidence from a sample of data. It is named after Dutch mathematician Nicolaas Kuiper. Kuiper’s test is closely related to the better-known Kolmogorov–Smirnov test (or K-S test as it is often called). As with the K-S test, the discrepancy statistics D+ and D− represent the absolute sizes of the most positive and most negative differences between the two cumulative distribution functions that are being compared. The trick with Kuiper’s test is to use the quantity D+ + D− as the test statistic. This small change makes Kuiper’s test as sensitive in the tails as at the median and also makes it invariant under cyclic transformations of the independent variable. The Anderson–Darling test is another test that provides equal sensitivity at the tails as at the median, but it does not provide the cyclic invariance. This invariance under cyclic transformations makes Kuiper’s test invaluable when testing for cyclic variations by time of year or day of the week or time of day, and more generally for testing the fit of, and differences between, circular probability distributions.

Example
We could test the hypothesis that computers fail more during some times of the year than others. To test this, we would collect the dates on which the test set of computers had failed and build an empirical distribution function. The null hypothesis is that the failures are uniformly distributed. Kuiper’s statistic does not change if we change the beginning of the year and does not require that we bin failures into months or the like. Another test statistic having this property is the Watson statistic, which is related to the Cramér–von Mises test.
However, if failures occur mostly on weekends, many uniform-distribution tests such as K-S and Kuiper would miss this, since weekends are spread throughout the year. This inability to distinguish distributions with a comb-like shape from continuous uniform distributions is a key problem with all statistics based on a variant of the K-S test. Kuiper’s test, applied to the event times modulo one week, is able to detect such a pattern. Using event times that have been modulated with the K-S test can result in different results depending on how the data are phased. In this example, the K-S test may detect the non-uniformity if the data are set to start the week on Saturday, but fail to detect the non-uniformity if the week starts on Wednesday.
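SciPy does not ship Kuiper's test, so the sketch below computes the V = D+ + D− statistic from scratch against a fully specified CDF (uniform on [0, 1] by default); the function name is illustrative and no critical values are computed.

```python
import numpy as np
from scipy.stats import uniform

def kuiper_statistic(sample, cdf=uniform.cdf):
    x = np.sort(np.asarray(sample, dtype=float))
    n = x.size
    F = cdf(x)
    d_plus = np.max(np.arange(1, n + 1) / n - F)   # largest positive ECDF - CDF gap
    d_minus = np.max(F - np.arange(0, n) / n)      # largest negative gap
    return d_plus + d_minus

# e.g., failure dates mapped to a fraction of the year in [0, 1]
print(kuiper_statistic([0.03, 0.12, 0.18, 0.45, 0.47, 0.52, 0.81, 0.95]))
```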

LOG-RANK TEST
In statistics, the log-rank test is a hypothesis test to compare the survival distributions of two samples. It is a nonparametric test and appropriate to use when the data are right skewed and censored (technically, the censoring must be non-informative). It is widely used in clinical trials to establish the efficacy of a new treatment in comparison with a control treatment when the measurement is the time to event (such as the time from initial treatment to a heart attack). The test is sometimes called the Mantel–Cox test, named after Nathan Mantel and David Cox. The log-rank test can also be viewed as a time-stratified Cochran–Mantel–Haenszel test. The test was first proposed by Nathan Mantel and was named the log-rank test by Richard and Julian Peto.

Test Assumptions
The log-rank test is based on the same assumptions as the Kaplan–Meier survival curve, namely, that censoring is unrelated to prognosis, the survival probabilities are the same for subjects recruited early and late in the study, and the events happened at the times specified. Deviations from these assumptions matter most if they are satisfied differently in the groups being compared, for example if censoring is more likely in one group than another.
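Assuming the lifelines package is available, its logrank_test function compares two groups from survival times and event indicators; the data below are made up.

```python
from lifelines.statistics import logrank_test

# hypothetical survival times (months) and event flags (1 = event observed, 0 = censored)
t_treat = [6, 7, 10, 15, 19, 25]
e_treat = [1, 0, 1, 1, 0, 1]
t_ctrl = [4, 5, 8, 9, 12, 14]
e_ctrl = [1, 1, 1, 0, 1, 1]

result = logrank_test(t_treat, t_ctrl, event_observed_A=e_treat, event_observed_B=e_ctrl)
print(result.test_statistic, result.p_value)
```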

MANN–WHITNEY U TEST In statistics, the Mann–Whitney U test (also called the Mann–Whitney–Wilcoxon (MWW), Wilcoxon rank-sum test, or Wilcoxon–Mann–Whitney test) is a nonparametric test of the null hypothesis that it is equally likely that a randomly selected value from one sample will be less than or greater than a randomly selected value from a second sample. Unlike the t-test it does not require the assumption of normal distributions. It is nearly as efficient as the t-test on normal distributions. This test can be used to determine whether two independent samples were selected from populations having the same distribution; a similar nonparametric test used on dependent samples is the Wilcoxon signed-rank test.

Assumptions and Formal Statement of Hypotheses
Although Mann and Whitney developed the Mann–Whitney U test under the assumption of continuous responses with the alternative hypothesis being that one distribution is stochastically greater than the other, there are many other ways to formulate the null and alternative hypotheses such that the Mann–Whitney U test will give a valid test. A very general formulation is to assume that:
1. All the observations from both groups are independent of each other,
2. The responses are ordinal (i.e., one can at least say, of any two observations, which is the greater),
3. Under the null hypothesis H0, the distributions of both populations are equal,
4. The alternative hypothesis H1 is that the distributions are not equal.
Under the general formulation, the test is only consistent (i.e., has power to reject H0 that approaches 100% as the sample size approaches infinity) when the following occurs under H1:
1. The probability of an observation from population X exceeding an observation from population Y is different (larger or smaller) than the probability of an observation from Y exceeding an observation from X; i.e., P(X > Y) ≠ P(Y > X) or P(X > Y) + 0.5 · P(X = Y) ≠ 0.5.
Under more strict assumptions than the general formulation above, e.g., if the responses are assumed to be continuous and the alternative is restricted to a shift in location, i.e., F1(x) = F2(x + δ), we can interpret a significant Mann–Whitney U test as showing a difference in medians. Under this location shift assumption, we can also interpret the Mann–Whitney U test as assessing whether the Hodges–Lehmann estimate of the difference in central tendency between the two populations differs from zero. The Hodges–Lehmann estimate for this two-sample problem is the median of all possible differences between an observation in the first sample and an observation in the second sample.
The Mann–Whitney U test/Wilcoxon rank-sum test is not the same as the Wilcoxon signed-rank test, although both are nonparametric and involve summation of ranks. The Mann–Whitney U test is applied to independent samples. The Wilcoxon signed-rank test is applied to matched or dependent samples.

Example Statement of Results
In reporting the results of a Mann–Whitney U test, it is important to state:
• A measure of the central tendencies of the two groups (means or medians; since the Mann–Whitney U test is an ordinal test, medians are usually recommended)
• The value of U
• The sample sizes
• The significance level.
In practice some of this information may already have been supplied and common sense should be used in deciding whether to repeat it. A typical report might run, “Median latencies in groups E and C were 153 and 247 ms; the distributions in the two groups differed significantly (Mann–Whitney U = 10.5, n1 = n2 = 8, P < 0.05 two-tailed).”
A statement that does full justice to the statistical status of the test might run, “Outcomes of the two treatments were compared using the Wilcoxon–Mann–Whitney two-sample rank-sum test. The treatment effect (difference between treatments) was quantified using the Hodges–Lehmann (HL) estimator, which is consistent with the Wilcoxon test. This estimator (HL∆) is the median of all possible differences in outcomes between a subject in group B and a subject in group A. A non-parametric 0.95 confidence interval for HL∆ accompanies these estimates, as does ρ, an estimate of the probability that a randomly chosen subject from population B has a higher weight than a randomly chosen subject from population A. The median [quartiles] weight for subjects on treatment A and B respectively are 147 [121, 177] and 151 [130, 180] kg. Treatment A decreased weight by HL∆ = 5 kg (0.95 CL [2, 9] kg, 2P = 0.02, ρ = 0.58).” However, it would be rare to find so extended a report in a document whose major topic was not statistical inference.

Implementations
In many software packages, the Mann–Whitney U test (of the hypothesis of equal distributions against appropriate alternatives) has been poorly documented. Some packages incorrectly treat ties or fail to document asymptotic techniques (e.g., correction for continuity). A 2000 review discussed some of the following packages:
• MATLAB has ranksum in its Statistics Toolbox.
• R’s statistics base-package implements the test wilcox.test in its “stats” package.
• SAS implements the test in its PROC NPAR1WAY procedure.
• Python has an implementation of this test provided by SciPy.
• SigmaStat (SPSS Inc., Chicago, IL)
• SYSTAT (SPSS Inc., Chicago, IL)
• Java has an implementation of this test provided by Apache Commons.
• JMP (SAS Institute Inc., Cary, NC)
• S-Plus (MathSoft, Inc., Seattle, WA)
• STATISTICA (StatSoft, Inc., Tulsa, OK)
• UNISTAT (Unistat Ltd, London)
• SPSS (SPSS Inc, Chicago)
• StatsDirect (StatsDirect Ltd, Manchester, UK) implements all common variants.
• Stata (Stata Corporation, College Station, TX) implements the test in its ranksum command.
• StatXact (Cytel Software Corporation, Cambridge, MA)
• PSPP implements the test in its WILCOXON function.
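Following the SciPy entry in the list above, a minimal example with made-up latency data might look like this.

```python
from scipy.stats import mannwhitneyu

group_e = [150, 162, 148, 170, 155, 160, 145, 158]   # hypothetical latencies (ms)
group_c = [240, 250, 230, 260, 245, 255, 238, 249]

u, p = mannwhitneyu(group_e, group_c, alternative="two-sided")
print(u, p)
```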

MCNEMAR’S TEST In statistics, McNemar’s test is a statistical test used on paired nominal data. It is applied to 2 × 2 contingency tables with a dichotomous trait, with matched pairs of subjects, to determine whether the row and column marginal frequencies are equal (that is, whether there is “marginal homogeneity”). It is named after Quinn McNemar, who introduced it in 1947. An application of the test in genetics is the transmission disequilibrium test for detecting linkage disequilibrium.
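Assuming statsmodels is available, its mcnemar function accepts the paired 2 × 2 table directly; the counts below are made up.

```python
from statsmodels.stats.contingency_tables import mcnemar

# rows: test 1 positive/negative, columns: test 2 positive/negative (hypothetical paired counts)
table = [[65, 10],
         [25, 100]]

print(mcnemar(table, exact=True))   # exact binomial version; exact=False uses the chi-squared form
```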

MEDIAN TEST In statistics, Mood’s median test is a special case of Pearson’s chi-squared test. It is a nonparametric test that tests the null hypothesis that the medians of the populations from which two or more samples are drawn are identical. The data in each sample are assigned to two groups, one consisting of data whose values are higher than the median value in the two groups combined, and the other consisting of data whose values are at the median or below. A Pearson’s chi-squared test is then used to determine whether the observed frequencies in each sample differ from expected frequencies derived from a distribution combining the two groups.
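SciPy implements this procedure as scipy.stats.median_test, which returns the chi-squared statistic, the p-value, the grand median and the contingency table; the samples below are made up.

```python
from scipy.stats import median_test

g1 = [12, 15, 11, 19, 14, 13]
g2 = [14, 22, 18, 17, 21, 16]
g3 = [9, 10, 16, 13, 12, 11]

stat, p, grand_median, table = median_test(g1, g2, g3)
print(stat, p, grand_median)
```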

Applications and Comparison to Other Tests
The test has low power (efficiency) for moderate to large sample sizes. The Wilcoxon–Mann–Whitney U two-sample test or its generalisation for more samples, the Kruskal–Wallis test, can often be considered instead. The relevant aspect of the median test is that it only considers the position of each observation relative to the overall median, whereas the Wilcoxon–Mann–Whitney test takes the ranks of each observation into account. Thus the other mentioned tests are usually more powerful than the median test. Moreover, the median test can only be used for quantitative data.
However, although the alternative Kruskal–Wallis test does not assume normal distributions, it does assume that the variance is approximately equal across samples. Hence, in situations where that assumption does not hold, the median test is an appropriate test. Moreover, Siegel & Castellan (1988) suggest that there is no alternative to the median test when one or more observations are “off the scale.”

RESAMPLING (STATISTICS)
In statistics, resampling is any of a variety of methods for doing one of the following:
1. Estimating the precision of sample statistics (medians, variances, percentiles) by using subsets of available data (jackknifing) or drawing randomly with replacement from a set of data points (bootstrapping)
2. Exchanging labels on data points when performing significance tests (permutation tests, also called exact tests, randomization tests, or re-randomization tests)
3. Validating models by using random subsets (bootstrapping, cross validation)
Common resampling techniques include bootstrapping, jackknifing and permutation tests.

Bootstrap
Bootstrapping is a statistical method for estimating the sampling distribution of an estimator by sampling with replacement from the original sample, most often with the purpose of deriving robust estimates of standard errors and confidence intervals of a population parameter like a mean, median, proportion, odds ratio, correlation coefficient or regression coefficient. It may also be used for constructing hypothesis tests. It is often used as a robust alternative to inference based on parametric assumptions when those assumptions are in doubt, or where parametric inference is impossible or requires very complicated formulas for the calculation of standard errors.
Bootstrapping techniques are also used in the updating-selection transitions of particle filters, genetic-type algorithms and related resample/reconfiguration Monte Carlo methods used in computational physics and molecular chemistry. In this context, the bootstrap is used to replace sequentially empirical weighted probability measures by empirical measures. The bootstrap allows the samples with low weights to be replaced by copies of the samples with high weights.
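A minimal NumPy sketch of a percentile bootstrap confidence interval for a median, with simulated data; the number of resamples and the 95% level are arbitrary choices, not prescribed by the text.

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(loc=10.0, scale=3.0, size=50)    # hypothetical observed sample

boot_medians = np.array([
    np.median(rng.choice(sample, size=sample.size, replace=True))   # resample with replacement
    for _ in range(10_000)
])
ci_low, ci_high = np.percentile(boot_medians, [2.5, 97.5])           # percentile bootstrap CI
print(ci_low, ci_high)
```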

Jackknife
Jackknifing, which is similar to bootstrapping, is used in statistical inference to estimate the bias and standard error (variance) of a statistic, when a random sample of observations is used to calculate it. Historically this method preceded the invention of the bootstrap, with Quenouille inventing it in 1949 and Tukey extending it in 1958. This method was foreshadowed by Mahalanobis, who in 1946 suggested repeated estimates of the statistic of interest with half the sample chosen at random. He coined the name ‘interpenetrating samples’ for this method.
Quenouille invented this method with the intention of reducing the bias of the sample estimate. Tukey extended this method by assuming that if the replicates could be considered identically and independently distributed, then an estimate of the variance of the sample parameter could be made and that it would be approximately distributed as a t variate with n − 1 degrees of freedom (n being the sample size).
The basic idea behind the jackknife variance estimator lies in systematically recomputing the statistic estimate, leaving out one or more observations at a time from the sample set. From this new set of replicates of the statistic, an estimate for the bias and an estimate for the variance of the statistic can be calculated. Instead of using the jackknife to estimate the variance, it may instead be applied to the log of the variance. This transformation may result in better estimates, particularly when the distribution of the variance itself may be non-normal.
For many statistical parameters the jackknife estimate of variance tends asymptotically to the true value almost surely. In technical terms, one says that the jackknife estimate is consistent. The jackknife is consistent for the sample means, sample variances, central and non-central t-statistics (with possibly non-normal populations), sample coefficient of variation, maximum likelihood estimators, least squares estimators, correlation coefficients and regression coefficients. It is not consistent for the sample median. In the case of a unimodal variate, the ratio of the jackknife variance to the sample variance tends to be distributed as one half the square of a chi-square distribution with two degrees of freedom.
The jackknife, like the original bootstrap, is dependent on the independence of the data. Extensions of the jackknife to allow for dependence in the data have been proposed. Another extension is the delete-a-group method used in association with Poisson sampling.
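A leave-one-out jackknife sketch for the standard error of a smooth statistic; the function name is illustrative.

```python
import numpy as np

def jackknife_se(x, stat=np.mean):
    x = np.asarray(x, dtype=float)
    n = x.size
    reps = np.array([stat(np.delete(x, i)) for i in range(n)])   # leave-one-out replicates
    return np.sqrt((n - 1) / n * np.sum((reps - reps.mean()) ** 2))

print(jackknife_se(np.array([4.1, 3.8, 5.0, 4.4, 4.9, 3.6, 4.2])))
```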

Comparison of Bootstrap and Jackknife
Both methods, the bootstrap and the jackknife, estimate the variability of a statistic from the variability of that statistic between subsamples, rather than from parametric assumptions. For the more general jackknife, the delete-m observations jackknife, the bootstrap can be seen as a random approximation of it. Both yield similar numerical results, which is why each can be seen as an approximation to the other. Although there are huge theoretical differences in their mathematical insights, the main practical difference for statistics users is that the bootstrap gives different results when repeated on the same data, whereas the jackknife gives exactly the same result each time. Because of this, the jackknife is popular when the estimates need to be verified several times before publishing (e.g., official statistics agencies). On the other hand, when this verification feature is not crucial and it is of interest not to have a number but just an idea of its distribution, the bootstrap is preferred (e.g., studies in physics, economics, biological sciences).
Whether to use the bootstrap or the jackknife may depend more on operational aspects than on statistical concerns of a survey. The jackknife, originally used for bias reduction, is more of a specialized method and only estimates the variance of the point estimator. This can be enough for basic statistical inference (e.g., hypothesis testing, confidence intervals). The bootstrap, on the other hand, first estimates the whole distribution (of the point estimator) and then computes the variance from that. While powerful and easy, this can become highly computer intensive. “The bootstrap can be applied to both variance and distribution estimation problems. However, the bootstrap variance estimator is not as good as the jackknife or the balanced repeated replication (BRR) variance estimator in terms of the empirical results. Furthermore, the bootstrap variance estimator usually requires more computations than the jackknife or the BRR. Thus, the bootstrap is mainly recommended for distribution estimation.”
There is a special consideration with the jackknife, particularly with the delete-1 observation jackknife. It should only be used with smooth, differentiable statistics (e.g., totals, means, proportions, ratios, odds ratios, regression coefficients, etc.; not with medians or quantiles). This could become a practical disadvantage. This disadvantage is usually the argument favoring bootstrapping over jackknifing. More general jackknifes than the delete-1, such as the delete-m jackknife, overcome this problem for the medians and quantiles by relaxing the smoothness requirements for consistent variance estimation.
Usually the jackknife is easier to apply to complex sampling schemes than the bootstrap. Complex sampling schemes may involve stratification, multiple stages (clustering), varying sampling weights (non-response adjustments, calibration, post-stratification) and unequal-probability sampling designs. Theoretical aspects of both the bootstrap and the jackknife can be found in Shao and Tu (1995), whereas a basic introduction is given in Wolter (2007). The bootstrap estimate of model prediction bias is more precise than jackknife estimates with linear models such as linear discriminant function or multiple regression.

Subsampling
Subsampling is an alternative method for approximating the sampling distribution of an estimator. The two key differences from the bootstrap are:
1. The resample size is smaller than the sample size, and
2. Resampling is done without replacement.
The advantage of subsampling is that it is valid under much weaker conditions compared to the bootstrap. In particular, a set of sufficient conditions is that the rate of convergence of the estimator is known and that the limiting distribution is continuous; in addition, the resample (or subsample) size must tend to infinity together with the sample size but at a smaller rate, so that their ratio converges to zero. While subsampling was originally proposed for the case of independent and identically distributed (IID) data only, the methodology has been extended to cover time series data as well; in this case, one resamples blocks of subsequent data rather than individual data points. There are many cases of applied interest where subsampling leads to valid inference whereas bootstrapping does not; for example, such cases include examples where the rate of convergence of the estimator is not the square root of the sample size or when the limiting distribution is non-normal.

Cross-validation
Cross-validation is a statistical method for validating a predictive model. Subsets of the data are held out for use as validating sets; a model is fit to the remaining data (a training set) and used to predict for the validation set. Averaging the quality of the predictions across the validation sets yields an overall measure of prediction accuracy. Cross-validation is employed repeatedly in building decision trees.
One form of cross-validation leaves out a single observation at a time; this is similar to the jackknife. Another, K-fold cross-validation, splits the data into K subsets; each is held out in turn as the validation set. This avoids “self-influence”. For comparison, in regression analysis methods such as linear regression, each y value draws the regression line towards itself, making the prediction of that value appear more accurate than it really is. Cross-validation applied to linear regression predicts the y value for each observation without using that observation.
This is often used for deciding how many predictor variables to use in regression. Without cross-validation, adding predictors always reduces the residual sum of squares (or possibly leaves it unchanged). In contrast, the cross-validated mean-square error will tend to decrease if valuable predictors are added, but increase if worthless predictors are added.
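Assuming scikit-learn is available, 5-fold cross-validation of a linear regression on one of its bundled datasets looks like this; the scoring choice is arbitrary.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)
scores = cross_val_score(LinearRegression(), X, y, cv=5,
                         scoring="neg_mean_squared_error")   # each fold is held out once
print(-scores.mean())                                        # cross-validated MSE
```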

Permutation Tests

Relation to Parametric Tests
Permutation tests are a subset of non-parametric statistics. The basic premise is to use only the assumption that it is possible that all of the treatment groups are equivalent, and that every member of them is the same before sampling began (i.e., the slot that they fill is not differentiable from other slots before the slots are filled). From this, one can calculate a statistic and then see to what extent this statistic is special by seeing how likely it would be if the treatment assignments had been jumbled.
In contrast to permutation tests, the reference distributions for many popular “classical” statistical tests, such as the t-test, F-test, z-test, and χ2 test, are obtained from theoretical probability distributions. Fisher’s exact test is an example of a commonly used permutation test for evaluating the association between two dichotomous variables.

When sample sizes are very large, Pearson’s chi-square test will give accurate results. For small samples, the chi-square reference distribution cannot be assumed to give a correct description of the probability distribution of the test statistic, and in this situation the use of Fisher’s exact test becomes more appropriate.
Permutation tests exist in many situations where parametric tests do not (e.g., when deriving an optimal test when losses are proportional to the size of an error rather than its square). All simple and many relatively complex parametric tests have a corresponding permutation test version that is defined by using the same test statistic as the parametric test, but obtains the p-value from the sample-specific permutation distribution of that statistic, rather than from the theoretical distribution derived from the parametric assumption. For example, it is possible in this manner to construct a permutation t-test, a permutation χ2 test of association, a permutation version of Aly’s test for comparing variances and so on.
The major down-sides to permutation tests are that they:
• Can be computationally intensive and may require “custom” code for difficult-to-calculate statistics. This must be rewritten for every case.
• Are primarily used to provide a p-value. The inversion of the test to get confidence regions/intervals requires even more computation.
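A from-scratch sketch of a two-sample permutation test on the difference in means, with made-up data; the number of permutations and the function name are illustrative choices.

```python
import numpy as np

def permutation_test(a, b, n_perm=10_000, seed=0):
    rng = np.random.default_rng(seed)
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    observed = a.mean() - b.mean()
    pooled = np.concatenate([a, b])
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)                                    # re-label the pooled observations
        diff = pooled[:a.size].mean() - pooled[a.size:].mean()
        hits += abs(diff) >= abs(observed)
    return hits / n_perm                                       # two-sided p-value

print(permutation_test([12.1, 11.4, 13.0, 12.7], [10.2, 9.8, 11.1, 10.5, 10.9]))
```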

Advantages Permutation tests exist for any test statistic, regardless of whether or not its distribution is known. Thus one is always free to choose the statistic which best discriminates between hypothesis and alternative and which minimizes losses. Permutation tests can be used for analyzing unbalanced designs and for combining dependent tests on mixtures of categorical, ordinal, and metric data (Pesarin, 2001). They can also be used to analyze qualitative data that has been quantitized (i.e., turned into numbers). Permutation tests may be ideal for analyzing quantitized data that do not satisfy statistical assumptions underlying traditional parametric tests (e.g., t-tests, ANOVA) (Collingridge, 2013). Before the 1980s, the burden of creating the reference distribution was overwhelming except for data sets with small sample sizes. Since the 1980s, the confluence of relatively inexpensive fast computers and the development of new sophisticated path algorithms applicable in special situations, made the application of permutation test methods practical for a wide range of problems. It also initiated the addition of exact-test options in the main statistical software packages and the appearance of specialized software for performing a wide range of uni- and multi-variable exact tests and computing test-based “exact” confidence intervals.

Limitations
An important assumption behind a permutation test is that the observations are exchangeable under the null hypothesis. An important consequence of this assumption is that tests of difference in location (like a permutation t-test) require equal variance. In this respect, the permutation t-test shares the same weakness as the classical Student’s t-test (the Behrens–Fisher problem). A third alternative in this situation is to use a bootstrap-based test. Good (2005) explains the difference between permutation tests and bootstrap tests the following way: “Permutations test hypotheses concerning distributions; bootstraps test hypotheses concerning parameters. As a result, the bootstrap entails less-stringent assumptions.” Bootstrap tests are not exact.

RANK PRODUCT
The rank product is a biologically motivated test for the detection of differentially expressed genes in replicated microarray experiments. It is a simple non-parametric statistical method based on ranks of fold changes. In addition to its use in expression profiling, it can be used to combine ranked lists in various application domains, including proteomics, metabolomics, statistical meta-analysis, and general feature selection.

Exact Probability Distribution and Accurate Approximation
Permutation re-sampling requires a computationally demanding number of permutations to get reliable estimates of the p-values for the most differentially expressed genes, if n is large. Eisinga, Breitling and Heskes (2013) provide the exact probability mass distribution of the rank product statistic. Calculation of the exact p-values offers a substantial improvement over permutation approximation, most significantly for the part of the distribution that rank product analysis is most interested in, i.e., the thin right tail. However, exact statistical significance of large rank products may take unacceptably long to compute. Heskes, Eisinga and Breitling (2014) provide a method to determine accurate approximate p-values of the rank product statistic in a computationally fast manner.
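The rank product statistic itself is straightforward to compute. A minimal sketch in Python, assuming the input is a genes-by-replicates matrix of fold changes and that rank 1 denotes the most up-regulated gene in each replicate; the function name and matrix orientation are assumptions for illustration, and significance would still have to come from permutation, the exact distribution of Eisinga, Breitling and Heskes (2013), or the approximation of Heskes, Eisinga and Breitling (2014).

```python
import numpy as np

def rank_product(fold_changes):
    """Rank product RP_g = (prod_i r_{g,i}) ** (1/k): the geometric mean, over k
    replicates, of gene g's rank when each replicate is ranked separately
    (rank 1 = largest fold change)."""
    fc = np.asarray(fold_changes, dtype=float)
    k = fc.shape[1]
    # double argsort turns each column of fold changes into ranks 1..n_genes
    ranks = np.argsort(np.argsort(-fc, axis=0), axis=0) + 1
    return np.exp(np.log(ranks).sum(axis=1) / k)    # geometric mean of the k ranks

# Hypothetical usage: 4 genes measured in 3 replicates
# rp = rank_product([[2.1, 1.9, 2.4],
#                    [0.9, 1.1, 1.0],
#                    [1.5, 1.6, 1.2],
#                    [0.4, 0.5, 0.6]])
```

Genes that are consistently near the top of every replicate’s ranking receive small rank products, which is why the thin right tail of the null distribution is the region of interest.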

SIEGEL–TUKEY TEST
In statistics, the Siegel–Tukey test, named after Sidney Siegel and John Wilder Tukey, is a non-parametric test which may be applied to data measured at least on an ordinal scale. It tests for differences in scale between two groups. The test is used to determine if one of two groups of data tends to have more widely dispersed values than the other. In other words, the test determines whether the values of one of the two groups tend to lie away from the center (of the ordinal scale), sometimes to the right and sometimes to the left. The test was published in 1960 by Sidney Siegel and John Wilder Tukey in the Journal of the American Statistical Association, in the article “A Nonparametric Sum of Ranks Procedure for Relative Spread in Unpaired Samples.”

Principle
The principle is based on the following idea: Suppose there are two groups A and B with n observations for the first group and m observations for the second (so there are N = n + m total observations). If all N observations are arranged in ascending order, it can be expected that the values of the two groups will be mixed or sorted randomly, if there are no differences between the two groups (following the null hypothesis H0). This would mean that among the ranks of extreme (high and low) scores, there would be similar values from Group A and Group B. If, say, Group A were more inclined to extreme values (the alternative hypothesis H1), then there will be a higher proportion of observations from group A with low or high values, and a reduced proportion of values at the center.
• Hypothesis H0: σA² = σB² and MeA = MeB (where σ² and Me are the variance and the median, respectively)
• Hypothesis H1: σA² > σB²
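A minimal sketch of this principle in Python: assign the alternating Siegel–Tukey ranks from the extremes inward, then feed those ranks to the usual Wilcoxon/Mann–Whitney machinery (here via scipy.stats.mannwhitneyu). Ties and exact small-sample tables are ignored, and the function names are illustrative assumptions.

```python
import numpy as np
from scipy.stats import mannwhitneyu

def siegel_tukey_ranks(values):
    """Alternating ranks: 1 to the lowest value, 2 and 3 to the two highest,
    4 and 5 to the next two lowest, and so on (no tie handling)."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values)
    ranks = np.empty(len(values), dtype=int)
    lo, hi, rank = 0, len(values) - 1, 1
    take_low, block = True, 1       # the first block draws a single value from the low end
    while lo <= hi:
        for _ in range(block):
            if lo > hi:
                break
            if take_low:
                ranks[order[lo]] = rank
                lo += 1
            else:
                ranks[order[hi]] = rank
                hi -= 1
            rank += 1
        take_low, block = not take_low, 2
    return ranks

def siegel_tukey(a, b):
    """Siegel-Tukey test: a rank-sum test applied to the alternating ranks."""
    ranks = siegel_tukey_ranks(np.concatenate([a, b]))
    return mannwhitneyu(ranks[:len(a)], ranks[len(a):], alternative="two-sided")
```

Because extreme values receive the lowest alternating ranks, the more dispersed group tends to have the smaller rank sum, which is exactly what the rank-sum step then detects.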

SIGN TEST
The sign test is a statistical method to test for consistent differences between pairs of observations, such as the weight of subjects before and after treatment. Given pairs of observations (such as weight pre- and post-treatment) for each subject, the sign test determines if one member of the pair (such as pre-treatment) tends to be greater than (or less than) the other member of the pair (such as post-treatment). The paired observations may be designated x and y. For comparisons of paired observations (x, y), the sign test is most useful if comparisons can only be expressed as x > y, x = y, or x < y. If, instead, the observations can be expressed as numeric quantities (x = 7, y = 18), or as ranks (rank of x = 1st, rank of y = 8th), then the paired t-test or the Wilcoxon signed-rank test will usually have greater power than the sign test to detect consistent differences. If X and Y are quantitative variables, the sign test can be used to test the hypothesis that the difference between X and Y has zero median, assuming continuous distributions of the two random variables X and Y, in the situation when we can draw paired samples from X and Y. The sign test can also test if the median of a collection of numbers is significantly greater than or less than a specified value. For example, given a list of student grades in a class, the sign test can determine if the median grade is significantly different from, say, 75 out of 100. The sign test is a non-parametric test which makes very few assumptions about the nature of the distributions under test – this means that it has very general applicability but may lack the statistical power of the alternative tests. The two conditions for the paired-sample sign test are that a sample must be randomly selected from each population, and the samples must be dependent, or paired. Independent samples cannot be meaningfully paired. Since the test is nonparametric, the samples need not come from normally distributed populations. Also, the test works for left-tailed, right-tailed, and two-tailed tests.
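Since the test statistic is simply the number of positive differences, the sign test reduces to a binomial test with success probability 0.5 once ties are discarded. A minimal sketch in Python, assuming SciPy 1.7 or later for scipy.stats.binomtest; the pre/post weights in the comment are hypothetical.

```python
from scipy.stats import binomtest   # older SciPy versions expose binom_test instead

def sign_test(x, y):
    """Paired sign test: count positive and negative differences (x_i vs y_i),
    drop ties, and test whether positives occur with probability 0.5."""
    pos = sum(1 for a, b in zip(x, y) if a > b)
    neg = sum(1 for a, b in zip(x, y) if a < b)
    n = pos + neg                    # zero differences are discarded
    return binomtest(pos, n, p=0.5, alternative="two-sided")

# Hypothetical pre- and post-treatment weights:
# result = sign_test([72, 80, 65, 90, 77], [70, 81, 60, 85, 74])
# result.pvalue
```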

SPEARMAN’S RANK CORRELATION COEFFICIENT
In statistics, Spearman’s rank correlation coefficient or Spearman’s rho, named after Charles Spearman and often denoted by the Greek letter ρ (rho) or as rs, is a nonparametric measure of rank correlation (statistical dependence between the rankings of two variables). It assesses how well the relationship between two variables can be described using a monotonic function. The Spearman correlation between two variables is equal to the Pearson correlation between the rank values of those two variables; while Pearson’s correlation assesses linear relationships, Spearman’s correlation assesses monotonic relationships (whether linear or not). If there are no repeated data values, a perfect Spearman correlation of +1 or –1 occurs when each of the variables is a perfect monotone function of the other. Intuitively, the Spearman correlation between two variables will be high when observations have a similar (or identical for a correlation of +1) rank (i.e., relative position label of the observations within the variable: 1st, 2nd, 3rd, etc.) between the two variables, and low when observations have a dissimilar (or fully opposed for a correlation of –1) rank between the two variables. Spearman’s coefficient is appropriate for both continuous and discrete ordinal variables. Both Spearman’s ρ and Kendall’s τ can be formulated as special cases of a more general correlation coefficient.
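The identity stated above, that Spearman’s ρ equals the Pearson correlation of the ranks, can be checked directly. A short sketch in Python with made-up data, using scipy.stats.rankdata, pearsonr, and spearmanr.

```python
import numpy as np
from scipy.stats import rankdata, pearsonr, spearmanr

# Made-up data: y is a monotone but nonlinear function of x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.exp(x)

rho_manual, _ = pearsonr(rankdata(x), rankdata(y))   # Pearson correlation of the ranks
rho, p_value = spearmanr(x, y)                        # direct Spearman computation

# rho_manual == rho == 1.0 here, because y is a perfect monotone function of x,
# whereas pearsonr(x, y)[0] is noticeably below 1 (about 0.89) since the
# relationship is not linear.
```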

Related Quantities
There are several other numerical measures that quantify the extent of statistical dependence between pairs of observations. The most common of these is the Pearson product-moment correlation coefficient, a correlation method similar to Spearman’s rank correlation, except that it measures the “linear” relationship between the raw numbers rather than between their ranks. An alternative name for the Spearman rank correlation is the “grade correlation”; in this, the “rank” of an observation is replaced by the “grade”. In continuous distributions, the grade of an observation is, by convention, always one half less than the rank, and hence the grade and rank correlations are the same in this case. More generally, the “grade” of an observation is proportional to an estimate of the fraction of a population less than a given value, with the half-observation adjustment at observed values. Thus this corresponds to one possible treatment of tied ranks. While unusual, the term “grade correlation” is still in use.

Interpretation
The sign of the Spearman correlation indicates the direction of association between X (the independent variable) and Y (the dependent variable). If Y tends to increase when X increases, the Spearman correlation coefficient is positive. If Y tends to decrease when X increases, the Spearman correlation coefficient is negative. A Spearman correlation of zero indicates that there is no tendency for Y to either increase or decrease when X increases. The Spearman correlation increases in magnitude as X and Y become closer to being perfect monotone functions of each other. When X and Y are perfectly monotonically related, the Spearman correlation coefficient becomes 1. A perfect monotone increasing relationship implies that for any two pairs of data values Xi, Yi and Xj, Yj, the differences Xi – Xj and Yi – Yj always have the same sign. A perfect monotone decreasing relationship implies that these differences always have opposite signs. The Spearman correlation coefficient is often described as being “nonparametric”. This can have two meanings. First, a perfect Spearman correlation results when X and Y are related by any monotonic function. Contrast this with the Pearson correlation, which only gives a perfect value when X and Y are related by a linear function. The other sense in which the Spearman correlation is nonparametric is that its exact sampling distribution can be obtained without requiring knowledge (i.e., knowing the parameters) of the joint probability distribution of X and Y.

SQUARED RANKS TEST
In statistics, the Conover squared ranks test is a non-parametric version of the parametric Levene’s test for equality of variance. Among tests of the significance of differences in data dispersion, it appears to be the only non-parametric one; the other such tests are parametric (i.e., they are difference-of-variance tests), whereas Conover’s test is non-parametric. The squared ranks test is arguably a test of the significance of differences in data dispersion, not variance per se. This becomes important, for example, when the data fail to satisfy the rather generous normality conditions associated with Levene’s test, and the squared ranks test is the default alternative under those conditions for certain statistical software programmes, such as the VarianceEquivalenceTest routine in Mathematica. The parametric tests include the Bartlett, Brown–Forsythe, and Fisher ratio tests.

TUKEY–DUCKWORTH TEST
In statistics, the Tukey–Duckworth test is a two-sample location test – a statistical test of whether one of two samples is significantly greater than the other. It was introduced by John Tukey, who aimed to answer a request by W. E. Duckworth for a test simple enough to be remembered and applied in the field without recourse to tables, let alone computers. Given two groups of measurements of roughly the same size, where one group contains the highest value and the other the lowest value:
• Count the number of values in the one group exceeding all values in the other,
• Count the number of values in the other group falling below all those in the one, and
• Sum these two counts (we require that neither count be zero).
The critical values of the total count are, roughly, 7, 10, and 13, i.e., 7 for a two-sided 5% level, 10 for a two-sided 1% level, and 13 for a two-sided 0.1% level. The test loses some accuracy if the samples are quite large (greater than 30) or much different in size (ratio more than 4:3). Tukey’s paper describes adjustments for these conditions.
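The counting rule is simple enough to write out directly. A minimal sketch in Python, assuming the overall maximum and minimum fall in different groups and ignoring ties and the large-sample adjustments mentioned in Tukey’s paper; the function name is illustrative.

```python
def tukey_duckworth_count(a, b):
    """Tukey's quick test: values in the 'high' group exceeding every value in the
    other group, plus values in the 'low' group falling below every value in the
    other group. Compare the total against the rough critical values 7, 10, 13."""
    if max(a) < max(b):              # make `a` the group containing the overall maximum
        a, b = b, a
    high = sum(1 for v in a if v > max(b))   # a-values above everything in b
    low = sum(1 for v in b if v < min(a))    # b-values below everything in a
    if high == 0 or low == 0:
        raise ValueError("not applicable: one group must hold the maximum, the other the minimum")
    return high + low

# Hypothetical usage: a total of 7 or more suggests significance at roughly the 5% level.
# count = tukey_duckworth_count([14, 15, 16, 17, 18], [9, 10, 11, 12, 13])
```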

WALD–WOLFOWITZ RUNS TEST
The Wald–Wolfowitz runs test (or simply runs test), named after Abraham Wald and Jacob Wolfowitz, is a non-parametric statistical test that checks a randomness hypothesis for a two-valued data sequence. More precisely, it can be used to test the hypothesis that the elements of the sequence are mutually independent.
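A minimal sketch of the runs test using the standard normal approximation for the number of runs; the expectation and variance formulas are the usual ones for a two-valued sequence with n1 and n2 symbols of each kind, and the exact small-sample distribution is not implemented here.

```python
import math

def runs_test(sequence):
    """Wald-Wolfowitz runs test: compare the observed number of runs with its
    expectation under randomness and return an approximate two-sided p-value."""
    values = list(sequence)
    symbols = sorted(set(values))
    if len(symbols) != 2:
        raise ValueError("sequence must contain exactly two distinct values")
    n1 = values.count(symbols[0])
    n2 = values.count(symbols[1])
    runs = 1 + sum(values[i] != values[i - 1] for i in range(1, len(values)))
    mu = 2.0 * n1 * n2 / (n1 + n2) + 1.0
    var = (2.0 * n1 * n2 * (2.0 * n1 * n2 - n1 - n2)) / ((n1 + n2) ** 2 * (n1 + n2 - 1))
    z = (runs - mu) / math.sqrt(var)
    p = math.erfc(abs(z) / math.sqrt(2.0))     # two-sided normal p-value
    return runs, z, p

# Hypothetical usage on a coin-flip record:
# runs, z, p = runs_test("HHTTHTHHHTTH")
```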

Related Tests
The Kolmogorov–Smirnov test has been shown to be more powerful than the Wald–Wolfowitz test for detecting differences between distributions that differ solely in their location. However, the reverse is true if the distributions differ in variance and have at the most only a small difference in location. The Wald–Wolfowitz runs test has been extended for use with several samples.
