
Injury Prevention 1998;4:140

Exploratory analysis: what to do first

Inj Prev: first published as 10.1136/ip.4.2.140 on 1 June 1998.

Robert W Platt

McGill University/Montreal Children’s Hospital Research Institute, 2300 Tupper, Montreal, PQ H3H 1P3, Canada

Correspondence to: Dr Platt ([email protected]).

One of the more important but often overlooked parts of statistical analysis is the very first step: an exploratory and descriptive analysis. Typically, researchers take a quick look at the data and then dive into more complex regression models or t tests. In this column, I discuss preliminary analysis in general and look at some techniques less well known than others, but which provide interesting and useful results.

The first step in understanding your data is to establish the kinds of variables you have. Are they continuous (ranging over several values, like weight or height) or categorical (taking only a few values)? Are the continuous variables bounded (like age, which can’t be less than zero) or unbounded? Are there any outliers or strange values?

This last question can be looked at in a simple way. Calculate the mean and standard deviation of a variable, and examine values that are more than three, or if you want to be very careful, two, standard deviations from the mean. If there are outliers, they need to be investigated, and either eliminated (if they are errors) or treated carefully (if they are valid data points). Next, for the continuous variables, look at [histograms] of your data, and for the categorical variables, look at frequency tables. These will tell you roughly what the distributions of the variables are, and this influences the [statistical methods] you can use.

The next thing to consider is bivariate analyses of the data. First, what do we do with continuous variables? A common mistake is to examine correlations first. But these are usually an inefficient way of inspecting the data, because significant correlations depend on a linear relationship between the variables, and if the true relationship is curved, the correlation may not indicate the association. Another approach is to graph a scatterplot of the two variables and check for a relationship.

For categorical variables, it is easiest to inspect bivariate (for example 2 × 2) cross tabulations to identify patterns and potentially interesting relationships. These relationships provide the baseline for further analyses.

Finally, a multivariate exploratory analysis may be needed to detect possible confounding (the mixing of effects of an outcome, an exposure and a third variable that is associated with the primary predictor and also affects the outcome) or effect modification (when the effect of an exposure on the outcome differs for different levels of a third variable). The easiest way to do this is with a bivariate analysis stratified by the third variable. If the latter is categorical, just look at the relationship between the other two variables restricted to the levels of the third, and if it is continuous, create a new [categorical variable]. If there is important confounding or effect modification (the definition of “important” here is arbitrary and depends on the needs of the analysis), these must be accounted for in the formal models when computing estimates of the primary predictor.

After these preliminary analyses, the patterns and relationships in the results should be reasonably clear and the analyses that need to be done should be obvious. If this is the case, then the rest is simple: for continuous variables, t tests, ANOVA, or [regression] can be used to confirm the exploratory work. Similarly, for categorical data, χ² or non-parametric tests can be used.

If patterns in the results are not clear, two things are possible: either there aren’t any interesting relationships, or there are but they are complex and you need to consult a statistician!
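
As a rough illustration of the checks described in this column, here is a minimal sketch in Python using only the standard library. It flags values more than k standard deviations from the mean, builds a frequency table for a categorical variable, and produces one cross tabulation per level of a third variable. The function names, the field names (`helmet`, `injured`, `age_group`) and all data values are invented for illustration; none of them come from the column, and the sketch is not a substitute for a proper statistical package.

```python
from collections import Counter
from statistics import mean, stdev


def flag_outliers(values, k=3.0):
    """Return the values lying more than k sample standard deviations from the mean."""
    centre = mean(values)
    spread = stdev(values)  # sample standard deviation (n - 1 denominator)
    return [v for v in values if abs(v - centre) > k * spread]


def frequency_table(values):
    """Return a {category: count} table for a categorical variable."""
    return dict(Counter(values))


def stratified_crosstabs(rows, exposure, outcome, stratum):
    """Return one exposure-by-outcome table of counts per level of `stratum`."""
    tables = {}
    for row in rows:
        table = tables.setdefault(row[stratum], Counter())
        table[(row[exposure], row[outcome])] += 1
    return tables


# Made-up illustrative data (hypothetical, not taken from the column).
ages = [4, 5, 6, 5, 4, 6, 5, 5, 4, 6, 5, 50]
rows = [
    {"age_group": "child", "helmet": "yes", "injured": "no"},
    {"age_group": "child", "helmet": "no", "injured": "yes"},
    {"age_group": "child", "helmet": "no", "injured": "yes"},
    {"age_group": "adult", "helmet": "yes", "injured": "no"},
    {"age_group": "adult", "helmet": "no", "injured": "no"},
]

# Step 1: outliers. The value 50 is flagged and should be investigated.
print(flag_outliers(ages))

# Step 2: frequency table for a categorical variable.
print(frequency_table(row["helmet"] for row in rows))

# Step 3: exposure-by-outcome counts within each level of the third variable.
for level, table in stratified_crosstabs(rows, "helmet", "injured", "age_group").items():
    print(level, dict(table))
```

Passing `k=2` to `flag_outliers` gives the more careful two standard deviation rule. If the stratum-specific tables look clearly different from one another, that is a hint of possible effect modification worth carrying into the formal models. A continuous third variable would first have to be grouped into categories before being used as `stratum`.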