Out of Left Field: Evidence For and Against Baseball’s Conventional Wisdom
Olin College
Christopher Joyce
10/4/2012

This project tests baseball’s conventional wisdom against the facts, to see whether the things that ‘everybody knows’ about baseball are supported or contradicted by statistical evidence. I found that most baseball conventional wisdom does not line up with the statistics, and in some cases is outright contradicted by them.

Contents
Background & Dataset
Exploration #1: Walks per Country
Exploration #2: Good-Fielding Shortstops
Exploration #3: Knuckleballers and Control
Conclusions
Bibliography

Background & Dataset

Baseball is a sport with a 140-year-old history, but it’s only in the past 30 years or so that there’s been much focus on analyzing baseball with statistics (an approach called sabermetrics), rather than relying on what the old hands think they know about the game. One of the first books about sabermetrics was written by an engineering professor at Johns Hopkins, Earnshaw Cook, in 1964 [1]. Teams had used statistics before this point, but the practice was somewhat controversial and kept out of sight; the idea of baseball statistics wasn’t in the public eye.
For a few decades, there was a back-and-forth between those who believed in baseball by statistics and those who believed in baseball by feel. This debate was largely settled in the early 2000s, when Moneyball was published [2]. Theo Epstein, the GM of the Boston Red Sox, said, “That book hit The New York Times best-seller list. People who own baseball teams read The New York Times best-seller list. So they started asking questions about the processes their front offices were using, and it changed things really quickly.”

Despite this advance, statistics haven’t fully converted baseball’s old guard. While all 30 MLB clubs incorporate statistics into their player evaluations, only 15 to 20 of them rely on them “heavily” [2]. This means a third to a half of baseball teams still do things, in large part, by ‘feel’. Statistics have certainly informed what scouts look for, by redefining what a useful player is. For example, on-base percentage wasn’t a metric people looked at heavily as recently as 2002 [2], even though both a walk and a single get a player on base and in a position to score a run.

Given all this, it seems pretty clear that much of baseball’s conventional wisdom may be downright wrong; however, fans and announcers will still insist on what they ‘know’ about baseball, even without any facts [3]. For example, there’s an old baseball adage that ‘You can’t walk your way off the island’, referring to players from Latin America who will swing wildly at pitches. The thought is that it’s harder to get noticed by pro scouts by exercising restraint at the plate and fighting through grueling at-bats than it is by hitting balls way outside the strike zone for home runs [4]. This is something that ‘everybody knows’, but I have never seen any data to support it.

Some of baseball’s conventional wisdom is so pervasive that it’s actually not possible to do a statistical analysis of whether it is true; lefty catchers are one example.
The perception exists, mostly unsubstantiated, that it’s a crippling disadvantage for a catcher to be left-handed, because they’d have a hard time making a pickoff move to first around a right-handed batter. This feeling is so pervasive that only five left-handed catchers have ever played in at least 100 games [5]. The last catcher to play as a lefty in the MLB did so in 1989, and as of 2009 there was not a single left-handed catcher in the MLB or the minor leagues [6].

Teams eventually started making their decisions based on data rather than what ‘everyone knows’, in no small part because of the general manager of the Oakland A’s, Billy Beane. Beane was one of the first to make a real effort to go against the grain of what ‘everyone knew’. Rather than just incorporating statistics, where conventional wisdom disagreed with his math, he’d trust the math [7].

I used Sean Lahman’s compendium of data, which comes in .csv format. It includes notable individual statistics dating all the way back to 1871, as well as team records. This dataset is free, easily importable into Python for analysis, and quite complete. There were a few mismatched tags I had to modify, but none of them affected numbers within the dataset. I use only part of the dataset: the player, batting, fielding, and pitching tables. The rest of the dataset is ripe for more analysis; however, most baseball stereotypes are about individual players, not teams.

Seeing as this dataset came from the internet, I validated it by spot-checking certain player statistics. I checked the entries that jumped out at me when reading through summaries I generated of the dataset, and found that they were true, just not expected. For example, I found that an individual (Ed Porray) whose birth country was listed as “A Boat on the Atlantic Ocean” was, in fact, born while at sea.
Given these tests, I feel confident that my dataset accurately represents what occurred in baseball, allowing for some possible human error.

Exploration #1: Walks per Country

One common piece of conventional wisdom was discussed above: “You can’t walk your way off the island”. The stereotype is that players from Latin America, the Dominican Republic specifically, swing at a lot of balls that aren’t hittable, with the thought that pro scouts don’t notice those who take long at-bats. To test this, I summed all plate appearances and walks by players born in each country, and computed walks per plate appearance for each nation that has sent a man to the majors.

First, I made a histogram of walks by country since 1970. I threw out all pre-1970 data so that the analysis covers only the modern era; it seems like a fairer comparison to look only at seasons after Hispanic players had been playing in the MLB at a high level for a decade or so.

Figure 1: Walks per at-bat, by country.

This chart seems to show that Germans are walking machines, while Afghans actively try not to walk. This comes across as funny right from the get-go, because the spread is so huge. Clearly, something is not quite right within the dataset.

My next strategy was to require a minimum total number of plate appearances for a country to be considered. I set this cutoff at 5,000, because the average starting player will come to bat about 500 times in a season, and ten player-seasons seems like an appropriate sample size to define as the lower bound for comparison.

Figure 2: Walks per at-bat, by country, 5,000 plate appearances per country cutoff.

This graph shows the difference much more clearly. The maximum difference between countries is 4 walks per 100 plate appearances.
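The aggregation and cutoff described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used for the figures: the field names (`country`, `year`, `pa`, `bb`) are hypothetical, each row stands for one player-season, and the sample rows are made up.

```python
from collections import defaultdict

def walk_rate_by_country(rows, min_year=1970, min_pa=5000):
    """Return {country: walks per plate appearance} for countries meeting the cutoff.

    rows: iterable of dicts with the hypothetical keys 'country', 'year',
    'pa' (plate appearances) and 'bb' (walks), one dict per player-season.
    """
    pa_total = defaultdict(int)
    bb_total = defaultdict(int)
    for row in rows:
        if row["year"] < min_year:
            continue  # drop seasons before the cutoff year
        pa_total[row["country"]] += row["pa"]
        bb_total[row["country"]] += row["bb"]
    # Keep only countries with enough plate appearances to compare fairly.
    return {
        country: bb_total[country] / pa_total[country]
        for country in pa_total
        if pa_total[country] >= min_pa
    }

# Made-up player-seasons, for illustration only (not real statistics).
sample = [
    {"country": "A", "year": 1975, "pa": 3000, "bb": 300},
    {"country": "A", "year": 1980, "pa": 3000, "bb": 240},
    {"country": "B", "year": 1990, "pa": 400, "bb": 80},   # below the 5,000 PA cutoff
    {"country": "A", "year": 1960, "pa": 600, "bb": 200},  # before 1970, ignored
]
rates = walk_rate_by_country(sample)
```

Summing walks and plate appearances before dividing weights every plate appearance equally, so a country's rate is not skewed by players who batted only a few times.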
Interestingly, the Dominican Republic does not have the lowest walks-per-plate-appearance figure; it is about one walk in a hundred above Mexico, which has the lowest walks per 100 of any country. The league average is about one walk in a hundred above the DR.

The absolute numbers do indicate a difference in walks between Dominicans and the league average. It’s hard to create a real test statistic for this data, because my dataset comprises all players who have ever taken a swing in the major leagues, so extrapolating from it means very little. However, I can reasonably make the claim that, while an absolute difference does exist, it shouldn’t be one that can be picked out without statistical analysis of the type done here. Monte Carlo simulations with ten thousand iterations put the probability of the specific difference being analyzed (Dominicans versus the league average) arising by chance at less than 1 in 10,000. Given that, I am willing to call the difference statistically significant; however, that is a separate question from whether the stereotype is reasonably noticeable without statistical analysis.

The thing to note here with respect to debunking or confirming the stereotype is not the likelihood of the effect, but rather the size of the effect. If this behavior were being picked out fairly by announcers, they would be more likely to refer to Canadians as walking machines, or to Mexicans as those who never walk, than to single out Dominicans. Those effects are significantly larger: about 2 walks in a hundred off the league average, compared to about 1 in 100 for Dominicans. If an effect twice as strong goes unnoticed, I think it’s reasonable to chalk this one up to confirmation bias: when someone sees a Dominican strike out, they add it as a data point supporting a stereotype they want to believe.
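The text above does not say how the Monte Carlo simulation was set up, so the following is only one plausible sketch, not the method actually used. It assumes a null model in which every plate appearance by the group ends in a walk with the league-average probability, and it counts how often a simulated group walks as rarely as the observed one. The function name, parameters, and all numbers are hypothetical.

```python
import random

def simulate_p_value(group_pa, group_bb, league_rate, iterations=10_000, seed=0):
    """Estimate a one-sided p-value by simulation.

    Assumed null model: each of the group's plate appearances is a walk
    with probability `league_rate`, independently of the others.
    Returns the fraction of simulated groups with at most `group_bb` walks.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    as_extreme = 0
    for _ in range(iterations):
        walks = sum(1 for _ in range(group_pa) if rng.random() < league_rate)
        if walks <= group_bb:
            as_extreme += 1
    return as_extreme / iterations

# Made-up numbers: 1,000 plate appearances, 50 walks, league rate of 9%.
p = simulate_p_value(group_pa=1000, group_bb=50, league_rate=0.09, iterations=2000)
```

A result of zero means only that no simulated group was as extreme in that many iterations, which bounds the probability at roughly one over the iteration count rather than proving it is zero.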
Exploration #2: Good-Fielding Shortstops

Another common baseball stereotype is that the better a fielder your shortstop is, the worse a hitter he is. To analyze this, I ran two correlations: one between games per error and batting average, and one between assists per game and batting average. Both of these are good, simple measures of defensive quality. There are more complicated metrics in existence, but I chose to use both errors and assists because, if a shortstop has few errors but few assists as well, he may not actually be a good defensive shortstop; he might just be slow.
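As a rough illustration of the two correlations described above, here is a minimal Python sketch. It assumes the Pearson coefficient, since the text does not say which correlation measure was used, and the field names (`g`, `e`, `a`, `h`, `ab` for games, errors, assists, hits, and at-bats) and sample seasons are hypothetical.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def shortstop_metrics(seasons):
    """Turn season totals into (games per error, assists per game, batting average)."""
    games_per_error, assists_per_game, batting_avg = [], [], []
    for s in seasons:
        if s["e"] == 0 or s["g"] == 0 or s["ab"] == 0:
            continue  # skip seasons where one of the ratios is undefined
        games_per_error.append(s["g"] / s["e"])
        assists_per_game.append(s["a"] / s["g"])
        batting_avg.append(s["h"] / s["ab"])
    return games_per_error, assists_per_game, batting_avg

# Made-up shortstop seasons, for illustration only (not real statistics).
seasons = [
    {"g": 150, "e": 10, "a": 450, "h": 140, "ab": 560},
    {"g": 140, "e": 20, "a": 380, "h": 165, "ab": 550},
    {"g": 155, "e": 15, "a": 470, "h": 150, "ab": 600},
    {"g": 120, "e": 0, "a": 300, "h": 100, "ab": 400},  # skipped: zero errors
]
gpe, apg, avg = shortstop_metrics(seasons)
r_errors = pearson_r(gpe, avg)   # games per error vs. batting average
r_assists = pearson_r(apg, avg)  # assists per game vs. batting average
```

Skipping zero-error seasons keeps the ratio defined but drops the most sure-handed fielders; using errors per game instead would avoid that at the cost of reversing the sign of the measure.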