Challenging nostalgia and performance metrics in

Daniel J. Eck

March 28, 2018

Abstract

In this writeup, we show that the great old time players are overrated relative to the great players of recent time and we show that performance metrics used to compare players have substantial era biases. In showing that the old time players are overrated, no individual statistics or era adjusted metrics are used. Instead, we provide substantial evidence that the composition of the eras themselves are drastically different. In particular, there were significantly fewer eligible MLB players available in the older eras of baseball. As a consequence, we argue that ESPN’s greatest MLB players of all time list includes too many old time players in their top 25 and many performance metrics fail to adequately compare players across eras.

1 Introduction

When one looks at the raw or advanced , one is often blown away by the accomplishments of old time baseball players. The greatest players from the old eras of produced mind- boggling numbers. As examples, see ’s average and pitching numbers, ’s 1900 season, ’s 1916 season, Walter Johnson’s 1913 season, ’s 1911 season, ’s 1931 season, ’s 1925 season, and many others. The accomplishments achieved by players during these seasons are far beyond what recent players produce and it seems, at first glance, that players from the old eras were vastly superior to the players that play in more modern eras. But, is this true? Are the old time ball players actually superior? In this paper, we investigate whether or not the old time baseball players are overrated and whether or not the recent baseball players are underrated. We define old time baseball players to be those that played in the MLB in the 1950 season or before. This year is chosen because it coincides with the decennial United States Census and is close to the year in which baseball became integrated. Recent baseball players are defined to be those that played in the MLB in the 1980 season or after. In this analysis, we do not compare baseball players via their statistical accomplishments or the awards they earned over their careers. Such measures often have era biases that are confounded with actual perfor- mance. The only statistics that are of interest are when a player played and the eligible MLB population at that time. The population data (in Section 2) is constructed and used to formally test (in Section 4) whether or not too many old time players are represented in ESPN’s top 25 list (ESPN, 2016). We do discover that there are far too many old time players included in ESPN’s top 25 list. The extent of our conclusions conflicts with many conventional baseball evaluation metrics as well as a sophisticated era-adjustment de- trending metric (Petersen, Penner, and Stanley, 2011). The reference Petersen, Penner, and Stanley (2011) will henceforth be called PPS due to repeated use. A detailed comparison and critique of the methodologies underlying the evaluation metrics and our initial analysis is then given.

1 2 Data

The eligible MLB population is not over any particular time period. As a proxy, we say that the eligible MLB population is the decennial count of males aged 20-24. This demographic data for the United States eligible MLB population is obtained from the US Census (Census, 2016). The MLB started in 1 876 so our data collection begins with the 1880 Census. In 1947, the i ntegration of the MLB began. African American and Hispanic demographic data is added to our dataset starting in 1950. Prior to 1950, the US black and Hispanic populations are excluded from the tallies since baseball was not yet integrated. African American and Hispanic citizens were not the only group discriminated against by pre (and post) integration baseball. Players from Latin American countries were also discriminated against, and citizens from those countries are added to the eligible MLB player pool after 1950. The population data of Latin American countries is obtained from Brea (2003). It should be noted that Brazil and Argentina has been excluded from Latin America in this study since the Brazilians and Argentinians do not largely play baseball and their populations are large compared with Latin American countries where baseball is considered to be a dominant sport. In the 1990s the MLB and minors saw an influx of Asian baseball players from Japan, Korea, and Taiwan. The populations from these countries are added to the eligible MLB player talent pool for those years. The foreign Census data is not as precise so 5% of the total population data is added to our datasets corresponding to the estimate of the proportion of these counties’ populations which are males aged 20-24. The cumulative proportion means that at each era, the population of the previous eras is also included. The population dataset is seen below. As an example of how to interpret this dataset, consider the year 1930. There were an 4.8 million males aged 18-24 in 1930. The proportion of the eligible MLB population throughout all eras that was available to play baseball in 1930 or before is 0.113.

year pop pop prop 1 1880 2.200 0.012 2 1890 2.700 0.026 3 1900 3.200 0.043 4 1910 4.100 0.065 5 1920 4.000 0.087 6 1930 4.800 0.113 7 1940 5.100 0.140 8 1950 5.000 0.167 9 1960 11.515 0.228 10 1970 16.140 0.315 11 1980 21.270 0.429 12 1990 32.385 0.602 13 2000 35.100 0.790 14 2010 39.100 1.000

Table 1: Eligible MLB population throughout the years

3 The Greats

The population data is the essential part of our argument that the great old time baseball players are over- rated. At a quick glance of this population dataset, one can see how small the proportion of the pre-1950

2 eligible MLB population actually is. We now list the top 25 MLB players, along with the years they played, according to ESPN. This list is displayed in Table 2 (ESPN, 2016). Using our definition of old time and recent players, we see that Table 2 has 6 old time players and 1 recent player in the top 10 and this list has 10 old time players and 7 recent players in the top 25. When the eligible MLB population is considered, it appears that old time players are over represented in this top 25 list. In the next Section, we formally test whether or not this is so. rank name years played 1 Babe Ruth 1914-1935 2 1951-1973 3 1954-1976 4 1939-1960 5 1986-2007 6 1951-1968 7 Lou Gehrig 1923-1939 8 Ty Cobb 1905-1928 9 Walter Johnson 1907-1927 10 1941-1963 11 Pedro Martinez 1992-2009 12 1986-2008 13 Honus Wagner 1897-1917 14 Ken Griffey Jr. 1989-2008 15 Joe Dimaggio 1936-1951 16 1955-1966 17 1890-1911 18 1955-1972 19 1984-2007 20 1959-1975 21 Alex Rodriguez 1994-2016 22 1979-2003 23 1988-2009 24 1956-1976 25 Rogers Hornsby 1915-1937

Table 2: ESPN’s list of the top 25 greatest baseball players to ever play in the MLB.

4 Testing

We now test whether or not there are too many old time players or too few recent players included in the top 10 and top 25 lists. It should be noted that these two tests are not independent of one another. If there are too many old time players included, then it is likely that there are too few recent players included simply because there were too many old time players included. One should pick a perspective of interest. However, we will present tests from both perspectives. In order to conduct these statistical tests, we make two assumptions:

• First, we assume that innate talent is uniformly distributed over the eligible MLB population over the different eras.

3 • Second, we assume that the outside competition to the MLB available by other sports leagues is offset by the increased salary incentives received by more recent players.

We now have enough information and assumptions to conduct our tests. We start out testing whether or not ESPN has included too many old time players in their list. The hypotheses that we are interested in testing are of the form:

H1 : The ESPN list in Table 1 is correct. H2 : There are too many old time players included in the ESPN list.

This testing format means that we assume that the ESPN list of the top 10 or top 25 greatest players is correct and we require strong evidence against this assumption to conclude otherwise. We now test whether or not 6 old time players in the top 10 is too many. Based on the population statistics and our two assumptions, the chances that an MLB eligible person from 1950 or before belongs in the top 10 is 0.167. The chances of observing at least 6 such people in the top 10 is calculated using the binomial distribution with n = 10 and p = 0.1667. This chance is found to be 0.0024. This says that the chances of randomly distributed talent producing 6 or more players in the top 10 is 0.24%. This is a very small chance This is strong evidence in favor of H2 as opposed to H1. This calculation is explained in more detail in Appendix A. We now perform a similar test with respect to the top 25 MLB players of all time. There are 10 players included in Table 2 that played before the integration of baseball. We find that the probability of there being 10 or more players in the top 25 on the basis of random chance to be 0.0047. This is significant evidence in favor of there being too many old time players included in the top 25 list of MLB players. We now investigate whether or not there are too few recent players included in ESPN’s list in Table 2. These tests are of the form:

H1 : The ESPN list in Table 1 is correct. H2 : There are too few recent players included in the ESPN list.

It should be noted that these tests are confounded with the tests already conducted. There are two players included in the top 10 that played after 1980. The probability that a player that played after 1980 is included in the top 10 based on population size and random chance is p = 0.571. The probability that at most 1 players from after 1980 would reach the top 10 is equal to 0.003. This chance is small enough to conclude that ESPN has included too few recent players in their top 10. There are a total of 7 players included in the top 25 that played after 1980. We find that the probability of there being at most 7 players from after 1980 as dictated by random chance is 0.003. Again, this chance is small enough to conclude that ESPN has included too few recent players in their top 25. Remark 1. The first test conducted included Ted Williams and Stan Musial in the pre-integration era. However, these two played a significant portion of their careers, with success, in the era after integration. If we redefine old time players to include those who played before 1960, then we arrive at a different conclusion. The probability that 6 or more players from before 1960 would appear on a top 10 list subject to random chance is found to be 0.013. This chance is still very small. Therefore, it would be safe to conclude that there is strong evidence to reject the hypothesis that the ESPN top 10 list is correct when old time players are defined to be those that played before 1960. Remark 2. The redefinition of old time players to those that played before 1960 does not change the conclusion that ESPN’s top 25 list includes too many old time players. The probability that at least 10 players that played before 1960 would appear in a top 25 list is found to be 0.041.

4 5 Comparison of Methods

Our evaluation metric model free analysis of great players conflicts with the top 25 list in Table 2, conven- tional evaluation metrics such as Wins Above Replacement (WAR), and a more sophisticated era-adjusted detrending metric of PPS. The underlying principles of these competing metrics are now compared with out methodology. We start with WAR.

5.1 Critique of WAR Philosophically speaking, WAR is a one-number summary of the added value of a player over a “replace- ment” level player. It is a statistic but it is not thought about as a statistic in the way statisticians think about statistics. For instance, the replacement level player, which is central to the interpretation of WAR, is not a fixed and known entity but rather a random outcome. In each year, the perceived value of the replacement player changes based on the individuals comprising the statistical outcome of that year’s baseball season. Therefore, the most central component to the calculation of WAR is both random and is subject to a strong era-effect. The ESPN top 25 list used a variation of WAR to compare players (Szymborski, 2015). Another disadvantage of WAR is that it is more of a philosophical concept than a calculable statistic. There are many different ways to calculate WAR (Baseball-Reference, 2016; Fangraphs, 2016) and these differing calcu- lations assign different values to the same performance. The arbitrariness of WAR as an actual statistical tool coupled with its inherent randomness makes it a blunt tool for comparing performances across eras. As an example of this, consider the defensive comparison between ’s 1983 season and Rogers Hornsby’s 1917 season provided in Table 3. Clearly Ozzie Smith had the superior defensive season but Rogers Hornsby’s defensive WAR is substantially higher. In all, we see that it does not make sense to use WAR as a way to compare players across eras.

pos Fld% Chances Errors RF/9 dWAR Rogers Hornsby SS 0.939 847 52 5.62 3.5 Ozzie Smith SS 0.975 844 21 5.51 2.3

Table 3: Comparison of Rogers Hornsby’s 1917 defensive season and Ozzie Smith’s 1983 defensive season.

5.2 Critique of detrending We now critique the principles underlying the era-adjusted detrending of PPS. As described by PPS, they detrend player statistics by normalizing achievements to seasonal averages, which they claim accounts for changes in relative player ability resulting from both exogenous and endogenous factors, such as talent dilution from expansion, equipment and training improvements, as well as performance enhancing drug usage. This is stated in their abstract. As an argument in favor of old time players, talent dilution from expansion is terribly misunderstood and is fallacious at best. If anything, the talent pool was more diluted in the earlier eras of baseball because of a small eligible population size and the exclusion of entire populations of people on racial grounds. See Table 4 for the specifics. As an argument in favor of old time players, equipment and training improvements is not without fault because the same improvements are available to every competitor as well. PPS also fails to account for the effect of modern day filming, which did not exist in the older eras, as a way to gain insight on opponents’ strengths and weakness in their training improvements assumption. The principle assumptions about baseball used as justification for PPS detrending are questionable. The mathematics of PPS detrending are also questionable in the context of baseball. We now dive into

5 years eligible pop. number of teams roster size eligible pop. per team 1900 3.2 8 15 26.7 1930 4.8 16 25 12 1950 5 16 25 12.5 1980 21.27 26 25 32.7 2000 35.1 30 25 46.8 2010 39.1 30 25 52.1

Table 4: Relative talent dilution when considering the eligible MLB population sizes at select time periods. Population totals are in thousands. the mathematical formulation of the PPS detrending method. PPS first calculates the prowess Pi(t) of an individual player i as P i(t) = xi(t)/yi(t) (1) where xi(t) is an individual’s total number of successes out of his/her total number of opportunities yi(t) in a given year t. To compute the league-wide average prowess, PPS then computes the weighted average for season t over all players P xi(t) X hP (t)i = i = w (t)P (t) (2) P y (t) i i i i i where yi(t) wi(t) = P . (3) i yi(t) 0 P The index i runs over all players with at least y opportunities during year t, and i yi is the total number of opportunities of all N(t) players during year t. A cutoff y0 = 100 is employed to eliminate statistical fluc- tuations that arise from players with very short seasons. The PPS detrended metric for the accomplishment of player i in year t is then given as, P xD(t) = x (t) (4) i i hP (t)i where P is the average of hP (t)i over the entire period, 2009 1 X P = hP (t)i. 110 t=1900 D The PPS detrended metric xi (t) has its merits. It does adjust individual prowess towards the historical average prowess. It is also dimensionless in a physical sense with respect to the measured accomplish- ment. However, this metric is self-fulfilling; it is specifically constructed in alignment with the opinion that prowesses hP (t)i are simple deviations about P that are best addressed by shrinking era effects towards P . Instead of the detrended metric (4), one could just as easily construct the detrended metric

D0 hP (t)i x (t) = xi(t) . (5) i P which also maintains the merits of dimensionlessness and detrending. However, the metric (5) reaches the D0 exact opposite conclusions as the metric (4). The formulation of xi (t) rewards prowess from strong eras and punishes prowess from weak eras. The only difference between the formulations of the two detrending metrics is the placement of hP (t)i relative to P . The mathematical formulation of PPS detrending is an artifact of the assumptions made. Detrending using (5) maintains all of the merits of PPS detrending but reaches opposite conclusions with the justification of having the opposite opinion as the authors of PPS.

6 5.3 Critique of our testing procedure Our testing procedure is valid under two assumptions made. Our first assumption states that innate talent is uniformly distributed over the entire MLB eligible population over all eras. This assumption is on talent alone and is not concerned with any era effects that are reflected in players’ statistical achievements. It is valid from a genetic perspective where we think that genes associated with baseball talent remain within the MLB eligible population over time. However, the second assumption made is quite strong and it is a result of our data collection methods. This assumption states that the competition to baseball from other sports in more recent eras is offset by increased salary incentives that were not available in older eras. From 1880 to 1900, this assumption may hold. However, many consider the time period from the to the 1960s to be an extended golden era of baseball where competition from other sports was not a serious threat. Baseball players from the 1920s to the 1960s did not make as much money as the modern players, after adjusting for inflation, but the average players in many of these seasons made very competitive salaries. If indeed such a golden era of baseball existed, then the MLB eligible population actually competing for MLB positions would be higher in these eras than in others. Our data collection method has not accounted for this effect. The existence of the golden era places our second assumption in questionable territory. However, we can account for the golden era using various weighting schemes. The idea is to weight each time period’s MLB eligible population with respect to that year’s general population’s overall interest in baseball. The time periods comprising the golden era receive higher weights than the other time periods. MLB interest over the time periods is given by Gallup polling data. The Gallup data is incomplete so values are extrapolated. The extrapolation employed treats the golden era as 1880-1950 and the unobserved weights over this time period are simply given the highest weights in the golden era. The new dataset includes the weighted MLB eligible populations for every ten year period from 1880 to 2010 and the cumulative proportion at each time period is calculated as before. We then perform the same hypothesis testing procedure with respected to the weighted dataset as seen in Section 4. This procedure is replicated over three different weighting schemes which are explained in Appendix B. The p-values from the hypothesis tests with respect to the weighted datasets are seen in Table 5. We also reinvestigate talent dilution with respect to the three weighted datasets as depicted in Table 6.

original data w1 w2 w3 too many old in top 10 0.0024 0.0084 0.0201 0.0422 too many old in top 25 0.0047 0.0248 0.0729 0.1677 too few recent in top 10 0.003 0.0053 0.0086 0.0536 too few recent in top 25 0.0031 0.0076 0.0156 0.1824

Table 5: Checking the robustness of our second assumption. Our second assumption is robust to the first two weighting schemes. It is not robust to the extremely conservative third weighting scheme.

5.4 Comparison of critiques All investigated methods have their shortcomings. However, our testing method holds up well when we reweight our original data in ways that reflect relative disbelief in our second assumption. PPS do not offer any insight on how their method fairs when their underlying assumptions are called into question. We do show that their mathematics will support the exact opposite conclusion if one adopts assumptions that are in complete conflict with their assumptions. As seen in Section 5.1, WAR does not offer a reasonable comparison of MLB players across eras because its derivation involves a strong era effect.

7 years original data w1 w2 w3 1900 26.7 16 21.3 6.7 1930 12 7.2 9.6 4.8 1950 12.5 5 5 4.8 1980 32.7 13.1 13.1 5.2 2000 46.8 18.7 18.7 6.1 2010 52.1 20.9 20.9 6.3

Table 6: Relative talent dilution when considering the weighted eligible MLB population sizes at select time periods. Numbers are in millions of people per team.

6 Discussion

The players from the old eras of baseball receive significant attention and praise as a result of their statistical achievements. On the other hand, more recent players are often tossed aside because of their more modest statistical achievements or suspected steroid use. We argue that, with assumptions on the distribution of talent, that the old time players are overrated and the recent players are underrated. Our statistical tests provide overwhelming evidence in favor of this argument. We conclude that the superior statistical accom- plishments achieved by the old time baseball players are a reflection of our inability to properly compare talent across eras. It is highly unlikely that athletes from such a scarcely populated era of available baseball talent could represent a top 10 and top 25 list so abundantly. It is also highly unlikely that athletes from such an abundantly populated era of available baseball talent could represent a top 10 and top 25 list so scarcely.

Appendix A

We revisit the calculation discussed in the Testing section. The chance that an eligible MLB player whose career was largely before 1950 appears in the top 10 occurs with probability p = 0.167. We use the binomial distribution to calculate the probability of at least 5 eligible MLB players whose career was largely before 1950 appear in the top 10. More formally, we are interested in calculating P (X ≥ 5) where X is the number of eligible MLB players whose career was largely before 1950 appearing in the top 10. So we have

P (X ≥ 6) = P (X = 6) + ··· + P (X = 10).

The binomial distribution is used to calculate P (X = 6), ..., P (X = 10) where

10 P (X = x) = px(1 − p)10−x x for x = 6, 7, 8, 9, 10. For example, when x = 6 we have

10 P (X = 6) = p6(1 − p)4. 6

Specifying x = 6 is equivalent to 6 eligible MLB players whose career was largely before 1950 appearing 10 in the top 10. We do not care about the order that players appear in the top 10. The number 6 corrects 10 for this. The number 6 is the number of possible orderings of 6 “successes” and 4 “failures” that can occur. In our context a success is an eligible MLB player whose career was largely before 1950 and a failure is an eligible MLB player whose career was largely after 1950. Any particular ordering of the top 10 with

8 6 eligible MLB players whose career was largely before 1950 appearing in the list occurs with probability p6(1 − p)4. Therefore the probability that 6 eligible MLB players whose career was largely before 1950 appear in the top 10 is given by 10 P (X = 6) = p6(1 − p)4. 6 R statistical software was used as a calculator throughout the Testing section. Similar calculations to this one are made to calculate p-values in the Testing section.

Appendix B

The weights discussed in Section 5.3 are given in the table below. These weights reflect Gallup polling data (Moore and Carroll, 2002; Carlson, 2001; Gallup, 2016) on the subject of fan interest in baseball throughout the years. Polling data does not exist for all time periods. Specifically, there are no records before 1940. The first two weighting schemes reflect overall fan interest in baseball. It is apparent from the Gallup dataset that the proportion of Americans sampled who are baseball fans has been around 40% from 1940 on. The first weighting scheme (w1) places a 0.60 weight on fan interest with the assumption that more Americans from those time periods favored, and played, baseball due to lack of competition from other sports. The second weighting scheme (w2) places even stronger weights on pre-1940 fan interest. The third weighting scheme (w3) reflects the proportion of Americans sampled that list baseball as their favorite sport. Again, we place a higher weight for pre-1940 fan interest. All three weighting schemes serve as an educated proxy for the eligible MLB population thought to strive towards a career in professional baseball. The weighting schemes w2 and w3 are especially conser- vative. We do not expect fan interest in pre-1940s baseball to be as high as our weighting schemes due to lack of media exposure. Furthermore, we do not expect the proportion of the eligible pre-1940s MLB population thought to strive towards a career in professional baseball due to relatively low MLB salaries in these time periods. The appropriateness of w3 is particularly questionable since individuals play baseball even if it is not the individual’s favorite sport. It should be noted that these weights are obtained from survey data obtained in the United States only and are applied, at equal value, to the other countries. Therefore our weighting schemes are biased but rich survey data on the topic of fan interest in baseball is unavailable from from the other countries.

9 year w1 w2 w3 1 1880 0.60 0.80 0.25 2 1890 0.60 0.80 0.25 3 1900 0.60 0.80 0.25 4 1910 0.60 0.80 0.40 5 1920 0.60 0.80 0.40 6 1930 0.60 0.80 0.40 7 1940 0.40 0.40 0.35 8 1950 0.40 0.40 0.38 9 1960 0.40 0.40 0.34 10 1970 0.40 0.40 0.28 11 1980 0.40 0.40 0.16 12 1990 0.40 0.40 0.16 13 2000 0.40 0.40 0.13 14 2010 0.40 0.40 0.12

Table 7: Weighting schemes

References

Baseball-Reference.com. (2016). WAR Explained. http://www.baseball-reference.com/about/war_explained.shtml

Brea, J. A. (2003). Population Dynamics in Latin America. Population Bulletin, 58, no. 1. http://www.igwg.org/Source/58. 1PopulDynamicsLatinAmer.pdf

Carlson, D. K. (2001). Although Fewer Americans Watch Baseball, Half Say They are Fans. http://www.gallup.com/poll/4624/Although-Fewer-Americans-Watch-Baseball-Half-Say-They-Fans. aspx?g_source=baseball%20interest&g_medium=search&g_campaign=tiles

Census (2016). https://www.census.gov/ ESPN. (2016). ESPN’s Hall of 100. http://www.espn.com/mlb/story/_/page/mlbrank100_greatestplayersalltime/ all-mlbrank-greatest-players-ever

Fangraphs. (2016). What is WAR? http://www.fangraphs.com/library/misc/war/

Gallup. (2016). Sports. http://www.gallup.com/poll/4735/sports.aspx

Moore, D. W. and Carroll, J. (2002). Baseball Fan Numbers Steady, But Decline May Be Pending. http://www.gallup.com/poll/6745/Baseball-Fan-Numbers-Steady-Decline-May-Pending. aspx?g_source=baseball%20interest&g_medium=search&g_campaign=tiles

Petersen, A. M., Penner, O., Stanley, H. E. (2011). Methods for detrending success metrics to account for inflationary and deflationary factors. Eur. Phys. J. B., 79, 67–78. Szymborski, D. (2015). Hall of 100 methodology. http://espn.go.com/mlb/story/_/id/ 12127337/hall-100-methodology-mlb

10