Simpson's Paradox - When Big Data Sets Go Bad

It's a well-accepted rule of thumb that the larger the data set, the more reliable the conclusions drawn. Simpson's paradox, however, slams a hammer down on the rule, and the result is a good deal worse than a sore thumb. Simpson's paradox demonstrates that a great deal of care has to be taken when combining small data sets into a large one. Sometimes the conclusions from the large data set are exactly the opposite of the conclusions from the smaller sets. Unfortunately, the conclusions from the large set are also usually wrong.

To understand this effect we'll use a set of simulated data. Table 1 shows the average physics grades for students in an engineering program. This is a difficult class used for weeding out weaker students. Most of these students prepared for college by taking high school (HS) physics. The data show a ten point advantage for those with HS physics.

               HS Physics    None    Improvement
  Students         50          5         ---
  Ave Grade        80         70          10

Table 1. Average college physics grades for students in an engineering program.

Table 2 shows the average physics grades for students in a liberal arts program. This class is designed as an elective course for the enrichment of students who would otherwise avoid physics. Few students have prepared for this class by taking HS physics. However, those few who took HS physics have a 10 point grade advantage. In both classes, taking HS physics clearly produced an advantage.

               HS Physics    None    Improvement
  Students          5         50         ---
  Ave Grade        95         85          10

Table 2. Average college physics grades for students in a liberal arts program.

We now combine the data sets. The combined results for students who took HS physics are shown in Table 3. The average college physics grade has been determined by adding all the grade points (4475) and then dividing by the total number of students (55). Table 4 shows the same results for the students without HS physics.
                 # Students    Grades    Grade Pts
  Engineering        50          80        4000
  Lib Arts            5          95         475
  Total              55         ---        4475
  Average           ---         81.4        ---

Table 3. Average college physics grades for students who took high school physics.

                 # Students    Grades    Grade Pts
  Engineering         5          70         350
  Lib Arts           50          85        4250
  Total              55         ---        4600
  Average           ---         83.6        ---

Table 4. Average college physics grades for students who didn't take high school physics.

The results of Tables 3 and 4 indicate that students who took HS physics perform worse than those who didn't by about 2.3 points. This is the opposite of the conclusion drawn from Tables 1 and 2.

Obviously, combining the data sets gives a misleading picture, but why? The answer lies in two parts. First, the data sets for the two major groups (engineering and liberal arts students) were influenced by a lurking variable, course difficulty. The engineering students received a rigorous course. The liberal arts students received a less demanding enrichment course. Second, the groups in the data sets were not the same size. This caused the average of college physics grades to be weighted toward engineering student grades for those who had taken HS physics. Since the engineering students' course was more rigorous, it lowered the average. The opposite was true for the combined results of those who didn't take HS physics: that average was weighted toward the liberal arts grades, which raised it.

There were four separate groups in the study, as follows:
1. Engineering students with HS physics
2. Engineering students without HS physics
3. Liberal arts students with HS physics
4. Liberal arts students without HS physics
If all four groups had been the same size, the results would have indicated that students with HS physics had a 10 point advantage in their college physics grades regardless of the type of college physics they took. Likewise, if the average had been calculated without weighting by group size (that is, as a plain average of the group averages), the results would have indicated the same 10 point advantage.
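The arithmetic behind this can be checked with a short script. The following is a minimal sketch in Python using only the simulated figures from Tables 1 and 2; the names groups, pooled_mean and unweighted_mean are illustrative assumptions, not part of the original study.

```python
# Simulated data from Tables 1 and 2.
# Key: (program, took HS physics) -> (number of students, average grade)
groups = {
    ("Engineering", True): (50, 80),
    ("Engineering", False): (5, 70),
    ("Liberal Arts", True): (5, 95),
    ("Liberal Arts", False): (50, 85),
}


def pooled_mean(hs_physics):
    """Average weighted by group size, as computed in Tables 3 and 4."""
    rows = [(n, g) for (_, hs), (n, g) in groups.items() if hs == hs_physics]
    total_points = sum(n * g for n, g in rows)
    total_students = sum(n for n, _ in rows)
    return total_points / total_students


def unweighted_mean(hs_physics):
    """Plain average of the group averages, ignoring group size."""
    grades = [g for (_, hs), (_, g) in groups.items() if hs == hs_physics]
    return sum(grades) / len(grades)


pooled_gap = pooled_mean(True) - pooled_mean(False)
unweighted_gap = unweighted_mean(True) - unweighted_mean(False)

print(f"Pooled gap:     {pooled_gap:+.1f}")      # -2.3
print(f"Unweighted gap: {unweighted_gap:+.1f}")  # +10.0
```

The pooled gap reproduces the reversal seen in Tables 3 and 4, while the unweighted gap recovers the 10 point advantage seen within each program.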
Conclusions
Simpson's Paradox is caused by a combination of a lurking variable and data from unequal-sized groups being combined into a single data set. The unequal group sizes, in the presence of a lurking variable, can weight the results incorrectly. This can lead to seriously flawed conclusions. The obvious way to prevent it is to not combine data sets of different sizes from diverse sources.
Simpson's Paradox will generally not be a problem in a well-designed experiment or survey if possible lurking variables are identified ahead of time and properly controlled. Controlling them includes eliminating them, holding them constant for all groups, or making them part of the study.
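As an illustration of making the lurking variable part of the study, the comparison can be made within each program rather than across the pooled data. This is a minimal sketch in Python under the assumption that each group is stored as one record; the field names and the function advantage_by_program are hypothetical and chosen only for this example.

```python
# Simulated data from Tables 1 and 2, one record per group.
# Field names are illustrative assumptions.
records = [
    {"program": "Engineering", "hs_physics": True, "students": 50, "avg_grade": 80},
    {"program": "Engineering", "hs_physics": False, "students": 5, "avg_grade": 70},
    {"program": "Liberal Arts", "hs_physics": True, "students": 5, "avg_grade": 95},
    {"program": "Liberal Arts", "hs_physics": False, "students": 50, "avg_grade": 85},
]


def advantage_by_program(records):
    """Return the HS physics grade advantage within each program."""
    by_program = {}
    for r in records:
        # Index each program's average grades by HS physics status.
        by_program.setdefault(r["program"], {})[r["hs_physics"]] = r["avg_grade"]
    return {p: grades[True] - grades[False] for p, grades in by_program.items()}


advantages = advantage_by_program(records)
print(advantages)  # {'Engineering': 10, 'Liberal Arts': 10}
```

Because course difficulty is held constant inside each program, the within-program comparison is not affected by the unequal group sizes.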
