An article by EMC Proven Professional Knowledge Sharing Elite Author

A WHOLE NEW BALLGAME: LEVERAGING BIG DATA IN

Bruce Yellin
Advisory Systems Engineer
EMC Corporation

Table of Contents

Introduction – Baseball, Big Data, and Advanced Analytics
Big Data and Baseball - Players, Coaches, Trainers, and Managers
Sportvision
PITCHf/x
HITf/x
FIELDf/x
Big Data and the Business of Baseball
Player Development
Revenue From Fans
Revenue From Media
Big Data Helps Create Algorithmic Baseball Journalism
Listen To Your Data - Grady the Goat - The Curse of the Bambino
Conclusion: The Future of Big Data Baseball
Appendix - PITCHf/x Metadata
Footnotes

PLEASE NOTE: It is recommended that this paper be printed in color.

Disclaimer: The views, processes, or methodologies published in this article are those of the authors. They do not necessarily reflect EMC Corporation’s views, processes, or methodologies.

2014 EMC Proven Professional Knowledge Sharing 2

Big data and advanced analytics touch every facet of our lives. We see them in action with online advertising, cash register printed coupons, and the marketing of airline tickets. Big data measurements range from hundreds of terabytes to petabytes, with analytics refining data quantity into quality actionable information. Baseball has always been a data-rich statistical paradise and is called a data-driven sport, yet the amount of data it used to generate pales in comparison to today’s game. No wonder the impact of big data on baseball has been called its second great renaissance1.

Up until the last twenty years, we judged players based on basic statistics. Now we have much more in-depth data to make more accurate assessments. Big data can now help hitters battle pitchers, pitchers battle hitters, and properly position defensive players for the pitcher-hitter dynamic. Just as big data has helped companies worldwide become more profitable and efficient, it will also help Major League Baseball (MLB) do the same. Big data will give teams a competitive and financial advantage, and bring the fan a more exciting experience. The data and analytics will come from action on the field and business decisions. It will be available to customers in the ballparks and fans around the world, giving them insights that were impossible just a few years ago.

Pretend you are at a game on a really humid, windless, hot August night at Boston’s century-old Fenway Park. The Red Sox lead their arch rival New York Yankees 3-2 with 2 outs in the top of the 9th inning. On 2nd base is Robinson Cano and on 3rd is Brett Gardner, who as one of the fastest runners in baseball could score the tying run in about 3 seconds. Just 20 minutes earlier, most of the 34,000 fans stood and sang Neil Diamond’s classic “Sweet Caroline” with gusto. Now they are standing again, this time vocalizing their disdain for the Yankees. Their thunderous applause, foot stomping, and 120+ decibel roar can be heard blocks away as the pitch is delivered.

The 92 mph four-seam fastball crosses the 17” wide home plate and thuds into catcher Jarrod “Salty” Saltalamacchia’s mitt. An instant later, the umpire calls “Strike Two!” against former Red Sox hero and now Yankees “Evil Empire” foot soldier, Kevin “Youk” Youkilis. Youk doesn’t like the umpire’s call as boisterous fans, packed like sardines in bars along Yawkey Way, Lansdowne Street, and Brookline Avenue scream “Yes!” in near-unison, accompanied by nearly 4 million ESPN viewers across America yelling for or against the right-handed hitter to succeed or fail. A high-definition TV camera zooms in on the perspiration dripping from under Kevin’s helmet. On the mound, Boston’s hard-throwing right-handed warrior Clay Buchholz, his double-knit polyester uniform drenched in sweat, has delivered his 108th pitch of the evening. Statistically, right-handed pitchers would rather face right-handed hitters, so this matchup favors Buchholz, a former teammate of Youk.

Buchholz takes a long time between pitches. He must get Youk because the next batter is even more dangerous. Over the next 24 seconds, both teams reset for the next pitch. Before the game, the Yankees’ big data scientists charted the likely type and location of Buchholz’s next pitch. Youk steps out of the batter’s box waiting for 3rd base coach Robby Thompson to relay Yankee manager Joe Girardi’s offensive signal. Thompson touches his cap, chest, thigh, gestures with his hands, then double-claps telling Youk to expect Buchholz to jam him inside based on a PITCHf/x big data analysis. Gardner and Cano are also looking at the signals so they know what is being attempted.

PITCHf/x data shows the Sox the location and type of the pitch Youk likes to hit. Even with the diminishing skills of a 34-year-old, Youk is still a dangerous career .281 hitter who thrives on pitches in the middle of the plate or low and inside2. The Red Sox bench coach signals the fielders where to play based on the pitching coach’s signal to Salty. The impassioned fans begin to holler again. Salty signals the 6’3” Buchholz a 1-2-3-2-1 with his fingers and moves his mitt high. The first “1” is a fastball, “2” is an even number for an inside location, and the rest is a decoy should Cano steal the signs and relay them to Youk. Salty and Buchholz have agreed with 2 strikes to throw a fastball high and inside to set up their possible follow-up pitch. Both sides have big data insight into the pitch speed, trajectory, and spin.


Wielding a bat like a gladiator’s sword, Youk is about to go into combat against 9 defenders. Youk and Buchholz are ready, both teams’ benches are focused, and the fans are making Fenway Park vibrate with excitement. Buchholz has an incredibly difficult job - release the ball at the precise moment and with the correct spin and angle, because if it is off by a fraction of a millisecond or degree it will either miss the strike zone, or Youk might hit it and tie the game in a mere matter of seconds.

Prior to the game, the relevant big data was sliced and diced and presented on the pitching coach’s iPad. Various game parameters such as his opponents, nightly rest, by-inning pitch counts to tally endurance factors, and even the day/night impact on each type of pitch Buchholz throws could be factored in. This helped Buchholz and Salty develop a pitch strategy for the game based on the humidity, wind speed, and the strengths and weaknesses of the Yankees. In general, if Youk hits a pitch, there is a 40% chance it will be a ground ball (GB), 40% a fly ball (FB), and a 20% chance of a line drive (LD)3. Youk has also done his pre-game iPad homework and knows Buchholz favors the fastball in a 2-strike count. He expects the pitch to cross the plate in an area high center to lower outside, so he prepares himself mentally to go with the pitch and try to hit it to right field. Buchholz and Salty are going to attempt to deceive Youk. The runners increase their lead, Salty shifts from center of home plate to the right side, and Buchholz, throwing from the stretch position, begins his delivery.

Buchholz throws a ball with an outer shell made from two pieces of leather laced with 108 red stitches. The stitches are 7 hundredths of an inch above the smooth surface and push against the airflow, causing it to rotate at hundreds to thousands of RPM depending on the pitch. The red stitches on the seam’s axis can appear to the batter as a “red dot” when the ball is released. A Hall of Famer says “If you can’t see the rotation, you have to be able to recognize if it’s a curve ball or a slider. And if you can’t, you’re not going to be a major league player. Anyone who can hit above .270 can see the red dot. Anyone who saw that dot on the ball, you knew it was the slider. If it was a really big dot, you knew it was a hanger and you could hit it out.”4


In a tenth of a second, the ball completes 2 revolutions, travels 13’, and air resistance starts to slow it down5. The stitches break up the airflow and reduce drag, enabling the ball to go further than if it were smooth. Nicknamed the “Greek God of Walks”, Youk has a keen eye for the ball with 20-11 visual acuity (he sees from 20’ what most people see from 11’)6. He spots the rotation and by processing the pitcher’s release point, his hands and fingers, the ball’s spin, and probability of what Buchholz likes to throw, surmises it’s a 4-seam fastball. The ball’s trajectory is inside and high, not low and outside as Youk had originally guessed. He must “protect” the plate at all costs and prevent getting a 3rd strike, so in the next .15 seconds, he “tells” his body to swing defensively. Salty also recognizes he is out of position and immediately moves his glove to the left. The ball is now 25’ away from the plate. Youk’s knees flex as his weight shifts from a flat stance to his back right leg, and his torso twists clockwise. His left foot strides 12” out and he shifts his weight forward as his coiled torso generates a massive amount of upper body torque. It is a 90 mph fastball. Drag has slowed it by 7 mph. It travels the 55’5” between his release and home in .434 seconds assuming linear deceleration.

[Figure: timeline of the pitch, with distance from home plate running from 55’ to 0’. Elapsed time and feet traveled at each stage: .1s / 13’ - the batter sees the ball and sends the image to the brain (.1 seconds); .18s / 22’ - the brain processes the information and gauges ball speed and location (.08 seconds); .24s / 29’ - the batter must decide in .03 seconds to swing and another .03 seconds to swing high, low, inside, or outside; .27s / 32’ - the brain signals the legs to start the stride (.03 seconds); .434s / 55’ - the 2.75” diameter bat must hit the 3” diameter ball within 1/8” of dead center in ±.007 seconds. Swing phases: .01s swing starts, .05s can still check, .15s power swing.]

Drag decelerates Buchholz’s 90 mph fastball to 83 mph as it approaches Youkilis and crosses home plate in .434 seconds
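The .434 second figure can be checked with a few lines of arithmetic. The sketch below assumes, as the text does, that the ball decelerates linearly, so its average speed is the mean of the release and plate speeds. The function name is illustrative, and the 55’ distance is taken from the timeline figure; using the 55’5” quoted in the text instead gives a slightly longer time.

```python
MPH_TO_FPS = 5280 / 3600  # feet per second in one mile per hour


def flight_time(distance_ft, release_mph, plate_mph):
    """Time of flight under constant (linear) deceleration.

    With constant deceleration the average speed is the mean of the
    start and end speeds, so time = distance / average speed.
    """
    avg_fps = (release_mph + plate_mph) / 2 * MPH_TO_FPS
    return distance_ft / avg_fps


# Distance assumption: 55 ft, as drawn on the timeline figure.
print(f"55 ft:     {flight_time(55.0, 90.0, 83.0):.3f} s")
# The text's 55 ft 5 in release-to-plate distance, for comparison.
print(f"55 ft 5in: {flight_time(55 + 5 / 12, 90.0, 83.0):.3f} s")
# Distance covered in the first tenth of a second at 90 mph.
print(f"first 0.1 s: {90.0 * MPH_TO_FPS * 0.1:.1f} ft")
```

The last line also agrees with the 13’ the ball is said to travel in its first tenth of a second.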

It is high. Gravity pulled the ball down 2.92’, offset by .98’ of lift from backspin, so it is about to cross the plate almost 2’ lower than its release point. Just 23’ and 1/7th of a second away, it drops at a 10° downward trajectory as Youk’s hips begin a counterclockwise turn. He begins his swing by maneuvering the bat through its arc and shifts his weight from back right to front left. If he blinks, he would lose sight for .05 seconds, and it will be all over. A camera flash might cost him .15 seconds7. Youk’s brain and muscle memory instruct his hands for a high and inside fastball. His 31.5 ounce model T141 Louisville Slugger round maple bat is moving at 85 mph and collides with the spinning cork, yarn, and leather-covered 5.2 ounce hardball for a millisecond “…with 6,000-8,000 pounds of force”8. “SMACK” is heard as the impact produces a Batted Ball Speed (BBS) of 120 mph.

His timing has to be within 5 milliseconds in one of two dimensions to hit a fair ball. Fortunately for the Red Sox, the swing’s 2nd dimension is off. The ball misses the bat’s sweet spot by just over ½” at -22° instead of 39° and Youk harmlessly fouls it off.

The count remains at no balls, 2 strikes. The runners go back to their bases and everyone resets, including the fans. The Red Sox manager, a former pitcher, signals Salty for a curveball down and away based on how Youk handled the last pitch. Yankee manager Joe Girardi, a former catcher, also believes it will be a curve and signals that to Robby Thompson. TV viewers are told to expect a curveball.

The Red Sox bench coach gestures his outfielders to the right since Buchholz will toss a curveball low and outside. They are excellent outfielders, but the coach knows from FIELDf/x9 big data that correct positioning could make the difference between a hit and an out. Outfielders generally cover 23’ in a second. Their reaction time is 0.35 seconds forward, 0.52 seconds sideward, and 0.65 seconds backward, creating a defensive oval10.
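The “defensive oval” follows directly from those reaction times. As a rough sketch, assume an outfielder stands still during his reaction time and then runs at a constant 23’ per second (a simplification, since real fielders accelerate and take curved routes). The function and dictionary names are illustrative only.

```python
SPEED_FT_PER_S = 23.0  # "Outfielders generally cover 23' in a second"
REACTION_S = {"forward": 0.35, "sideways": 0.52, "backward": 0.65}


def reach_ft(hang_time_s, direction):
    """Distance a fielder can cover before the ball comes down.

    Assumes no movement during the reaction time, then constant speed.
    """
    running_time = max(0.0, hang_time_s - REACTION_S[direction])
    return SPEED_FT_PER_S * running_time


# Reach in each direction for a ball that stays in the air 4.3 seconds.
for direction in REACTION_S:
    print(f"{direction:9s} {reach_ft(4.3, direction):5.1f} ft")
```

Because the reach is shortest going backward and longest coming forward, the area a fielder can cover is an oval rather than a circle, which is why a few steps of pre-pitch positioning matter.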

Buchholz agrees to Salty’s finger-signals for a curveball. The fans are standing and the noise is deafening. Before the game, Youk’s iPad displayed Buchholz’s PITCHf/x chart to the right11. His hitting coach had said:

1) Buchholz tires after 105 pitches
2) Buchholz likes to use his curveball
3) Buchholz’s curveball slows down to 74-75 mph after 84 pitches.


Youk readies for a slow curveball. The pitch is on its way. But instead of low and away, Buchholz’s 110th pitch is coming down the center of the plate. With an unquenchable desire to be tonight’s hero, Youk’s bat strikes the curveball at 39° and rockets it towards deep center field.

In the physics of baseball, the mass and velocity of the bat and ball determine the “conservation of momentum” and predict the equal and opposite forces at that instant. The “coefficient of restitution” or the bounce of a ball after it is deformed by the bat, causes the ball to gain forward motion because of kinetic energy. The bat deforms slightly before springing back to its natural shape – notice its shape versus the yellow line. Gardner and Cano are off with the “crack of the bat”. Ellsbury, out of position, instinctively turns to his left, sprinting towards the green wall. The fans’ chanting instantly turns into a prayer.
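The article gives no formula for the bat-ball collision, but a simplified model from the physics-of-baseball literature shows how the earlier numbers hang together. It folds the masses and the coefficient of restitution into a single “collision efficiency” q. The value q = 0.2 used here is an assumption (a ballpark figure often quoted for a wood bat struck near the sweet spot), not a number from this article.

```python
def batted_ball_speed(pitch_mph, bat_mph, q=0.2):
    """Exit speed from a simple collision-efficiency model.

    q is an assumed collision efficiency; it bundles the bat and ball
    masses and the coefficient of restitution into one number.
    """
    return q * pitch_mph + (1 + q) * bat_mph


# The earlier fastball: 83 mph at the plate, bat moving at 85 mph.
print(f"{batted_ball_speed(83, 85):.1f} mph")
```

With these inputs the model lands within a couple of mph of the 120 mph Batted Ball Speed quoted for the fouled-off fastball. It also shows why bat speed matters more than pitch speed: each extra mph of bat speed adds 1.2 mph to the ball, while each extra mph of pitch speed adds only 0.2 mph.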

If the ball falls in, the Yanks will lead as Gardner is nearing home and Cano is sprinting from 2nd with the go-ahead run. Some in the crowd scream to the baseball gods for Ellsbury to make the catch. Others turn silent as all hope is seemingly gone. Based on where he was playing, he needs 4.4 seconds to reach the hit, but only has 4.3 seconds as the ball sails over his head. Gardner ties the score at 3-3. The shortstop and 2nd baseman race to short center field to catch the relay throw. Cano has crossed 3rd base on his way home. Youk’s smash hits the base of the wall near the “Stop N’ Shop” sign and ricochets away from Ellsbury. In this battle of wills, Cano crosses home for the Yankee lead as Youk lumbers into 2nd base with a double. The Sox, with a glorious victory seemingly in hand, now trail by a run in the top of the 9th with 2 outs. A left-handed reliever replaces Buchholz and strikes out left-handed Travis Hafner to end the inning.

The Yankees in the bottom of the 9th bring the all-time leader in saves and future Hall of Famer, right-handed Mariano Rivera, to the mound. His devastating cutter fools the best of hitters. The cutter’s two finger fastball grip is off-centered against the ball’s seams. Delivered like his fastball, it has dynamic late movement half-way to home – the distance at which a batter has to decide to swing or not.12 It breaks down and in towards a lefty or away from a righty just before it reaches home. He has a 1.000 WHIP13 (walks plus hits per inning pitched). Legendary Pedro Martinez, by comparison, had a career 1.054 WHIP14, while the 2013 MLB average was a 1.30 WHIP15.

First to bat is Salty, a switch hitter batting left-handed. The Yankees shift players to the right based on FIELDf/x charts. PITCHf/x’s heat map shows Rivera’s distinct cutter locations when he faces a lefty or a righty, and how well he succeeds at avoiding the center of the plate16. During his career, lefties are batting just .209 and righties .21517. Salty pops up the 1st pitch cutter to 2nd baseman Robinson Cano for the 1st out.

Right-handed hitting 3rd baseman Will Middlebrooks is ready to face Rivera. He is fed a steady diet of outside and low cutters and hits a ground ball to 1st baseman Lyle Overbay for the 2nd out. Boston desperately needs a run to tie. Left-handed hitting shortstop Stephen Drew, batting .249 with 12 home runs, is their last hope. A key to Rivera’s success is that batters who do hit his cutter do not hit it well. Rivera jams Drew inside with a cutter on the 1st pitch that breaks his bat – not uncommon against Rivera. The ball is tapped to Overbay for the 3rd out as the Yankees beat their arch rivals. The crowd is silent.

Introduction – Baseball, Big Data, and Advanced Analytics

The modern game of baseball dates back to the late 1840’s and is one of those pure sports where statistics have thrived. Attend a game today and a giant scoreboard displays the batter’s picture, his batting average, home runs, RBIs, and other information. Fans on either side of you may have bought a program with the latest statistics on each team’s players, as well as articles, paid advertisements, and the all-important blank scorecard for each side. To the right is a fan’s scorecard used in the late 1800’s. Serious fans pencil in the players’ last names and uniform numbers, then summarize each pitch, hit, out, walk, and other events in coded shorthand eerily close to that used at the dawn of the game. Fielders have positional numbers such as 1 for the pitcher, 2 for the catcher, and so on. While the concept hasn’t changed much, the scorecard is now available as an Android and iPhone application that helps keep score and links fans into the powerful work of SABR and sabermetrics (“Society for American Baseball Research” plus “metrics”).

Core “counting” statistics such as at bats, runs, hits, singles, doubles, triples, home runs, and more originated with the game, and became integral to America’s pastime with the formation of the Elias Sports Bureau in 191318. In 1947, the Brooklyn Dodgers hired a full-time statistician. Modern-era baseball managers have used statistics to run their team, perhaps none more famous than the fiery Hall of Fame manager Earl Weaver. Well before the birth of the personal computer, Weaver tracked pitcher-hitter matchups on index cards like the one on the right to fine-tune his platooning system19 and make pitching changes in the 1960s.

Managers like Weaver kept charts like the one on the left which shows the 1st and 5th innings were best for scoring runs20, perhaps because the pitcher has yet to reach his groove, then tires, or has already shown the batters his repertoire a few times that game. They kept situational charts, such as the best time to bunt21, or with a runner on 3rd base and 1 out, they could expect to score on average .920 runs22. Bill James evolved the statistical art through his baseball books and is credited with the name sabermetrics in 1980. Sabermetrics derived new statistics from the basic ones, such as Slugging Percentage (SLG), On-Base Percentage (OBP), and On-base Plus Slugging (OPS).

Runs Scored from Situations (x marks an occupied base, in the order 1st, 2nd, 3rd)

Runners   No Outs   1 Out   2 Outs
---        .537     .294    .114
x--        .907     .544    .239
-x-       1.138     .720    .347
--x       1.349     .920    .391
xx-       1.515     .968    .468
x-x       1.762    1.140    .522
-xx       1.957    1.353    .630
xxx       2.399    1.617    .830
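A situational chart like “Runs Scored from Situations” is easy to put to work. The sketch below stores the table’s values in a dictionary and compares the expected runs before and after a successful sacrifice bunt. The names `RUN_EXPECTANCY` and `expected_runs` are illustrative, and treating the bunt as always successful is a simplifying assumption.

```python
# Expected runs for the rest of the inning, keyed by base state.
# Each tuple holds the values for 0, 1, and 2 outs, copied from the table.
RUN_EXPECTANCY = {
    "---": (0.537, 0.294, 0.114),
    "x--": (0.907, 0.544, 0.239),
    "-x-": (1.138, 0.720, 0.347),
    "--x": (1.349, 0.920, 0.391),
    "xx-": (1.515, 0.968, 0.468),
    "x-x": (1.762, 1.140, 0.522),
    "-xx": (1.957, 1.353, 0.630),
    "xxx": (2.399, 1.617, 0.830),
}


def expected_runs(runners, outs):
    """Look up expected runs for a base state ('x--' etc.) and out count."""
    return RUN_EXPECTANCY[runners][outs]


# Sacrifice bunt: runner on 1st with no outs becomes runner on 2nd with 1 out.
before = expected_runs("x--", 0)
after = expected_runs("-x-", 1)
print(f"before {before:.3f}, after {after:.3f}, change {after - before:+.3f}")
```

By these numbers, trading an out for a base lowers the expected runs for the inning, which is the kind of insight that helps a manager decide on the best time to bunt.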

Michael Lewis’s “Moneyball: The Art of Winning an Unfair Game” documented the birth of modern baseball analytics and the actions of Oakland Athletics general manager Billy Beane. Beane bucked the preconceived notions of his scouts and explored analytics to find a better, more affordable way to assemble a winning team. Rather than concentrate on batting averages, RBIs, and stolen bases, he focused on new variables like on-base percentage and slugging percentage to find undervalued talent. Moneyball became legendary when Brad Pitt starred in a 2011 movie based on the book. Moneyball has become an adjective for finding hidden value, and a guiding mantra that benefits individuals, corporations, and nations, not just baseball teams.

In 2002, Beane could only spend $41 million to compete in a league of expensive stars and richer teams like the Yankees (with their $125 million budget). That year, both the Athletics and Yankees won 103 games. One of Beane’s approaches was to find undervalued, patient players who helped produce runs by walking23, in lieu of overvalued players with high metrics such as batting average and stolen bases. Oakland made the playoffs from 2000 through 2003.

Chris Davis is one of the better undervalued stars these days24. He was a 5th round 2006 draft pick of the Rangers, who traded him to the Orioles in 2009. By 2013, Davis was making $3.3 million or 3.5% of the team payroll, and led both leagues with 53 home runs. In comparison, the Yankees’ Alex Rodriguez is the highest paid player at $28 million a year – more than another team’s entire payroll25. How do they stack up?26 Davis was “paid” $62,264 for each of his 53 home runs while Rodriguez got a whopping $4,000,000 for each of his 7 homers.

2013 Value      Chris Davis           Alex Rodriguez
                #     Cost Per        #     Cost Per
Games           160   $20,625         44    $636,364
Appearances     673   $4,903          181   $154,696
Walks           72    $45,833         23    $1,217,391
Hits            167   $19,760         38    $736,842
Runs            103   $32,039         21    $1,333,333
Home Runs       53    $62,264         7     $4,000,000
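The “cost per” columns are simply each player’s 2013 salary divided by the count in that row. A minimal sketch that reproduces the value table from the salaries and counts given above (variable names are illustrative):

```python
DAVIS_SALARY = 3_300_000
RODRIGUEZ_SALARY = 28_000_000

# 2013 counts from the table: (Chris Davis, Alex Rodriguez)
COUNTS = {
    "Games": (160, 44),
    "Appearances": (673, 181),
    "Walks": (72, 23),
    "Hits": (167, 38),
    "Runs": (103, 21),
    "Home Runs": (53, 7),
}


def cost_per(salary, count):
    """Salary divided by the number of events, rounded to the dollar."""
    return round(salary / count)


for stat, (davis, rodriguez) in COUNTS.items():
    print(f"{stat:12s} ${cost_per(DAVIS_SALARY, davis):>9,}"
          f"  ${cost_per(RODRIGUEZ_SALARY, rodriguez):>11,}")
```

Dividing the two home run figures shows Rodriguez cost roughly 64 times as much per home run as Davis that season.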


Concepts like big data have turned raw data into useful information for the fans, players, coaches, managers, and other organizations that promote baseball. Over the last 5 years, sensors, high-speed digital photography, and Doppler radar have led to an explosion of baseball data. Fans especially love the new and exciting ways to follow their heroes, such as this real-time big data Internet view of balls and strikes crossing home at the July 6, 2013 Yankee-Oriole game.

While baseball is still just a game, batters are always looking for an advantage over the pitcher and vice versa. If a batter knows what type of pitch they will get, they gain a significant upper hand by being better prepared to hit it! Big data technology may get them that leg up. At the 2012 Sloan Sports Conference, a paper “Predicting the Next Pitch”27 discussed an advanced analytics approach whereby batters could dramatically increase their knowledge of what the next pitch would be. Rather than guessing or looking for a specific pitch, like a fastball, their big data analysis of 2008 data showed they could improve the odds of correctly anticipating that fastball by anywhere from 18% to an amazing 311%! (Note: MLB rules limit players and coaches to printed reports during a game – i.e. no electronic real-time data can actively influence play on the field28.)

Moneyball’s predictive statistics created radical theories that brought significant changes to a team’s structure. Now, big data is adding new insights and dimensions to the game that never existed before. Fans expect more, and big data is ready to deliver29. They are exposed to the “how” and “why” of player actions, leading them to experience the game at a whole new level. Teams are employing big data to find overlooked domestic and international baseball talent they never knew existed, as well as finding relationships between people and merchandise, ticketing, and the media. That is big data. Beane’s statistics are just a part of it.

Big Data and Baseball - Players, Coaches, Trainers, and Managers

We have seen a fraction of the statistical metrics and analytics employed by baseball. Even before computers, baseball was about tracking and improving. With the advent of computers, new metrics and ways of thinking about the sport emerged. In 2007, tools like PITCHf/x escalated the amount of baseball data, dwarfing the data collections that went on before it30. Pure statistics can distort the game – a run, hit, or out. PITCHf/x and tools like it add insight and enhance the game. In just the last 5 years, advances in computer science have unlocked a Pandora’s Box of data and analysis.

Players always relied on their coaches, and in the early years of the game, some had photographs of their opponents or their own swings. Then came motion pictures shot on film. Eventually, each team developed a VCR tape library of each hitter’s at-bats in addition to footage of opposing pitchers and their arsenals. Hall of Fame right-fielder Tony Gwynn played 20 years for the San Diego Padres with a career .338 batting average and a .847 OPS31. Going back to 1983, he recorded his own games on video tape, played it back on one VCR and recorded his swings on another VCR. He would then further edit the “swings” tape into 3 more tapes – good at-bats, at-bats with hits, and just the swings that produced the hits32. Gwynn was a trendsetter in this regard.

These days, pitchers can graphically review the opposing lineup before the game through big data scouting reports on an iPad. They examine batter strengths and weaknesses against the variety, location, and speed of pitches they throw to see if it gets them a strikeout, pop up, or a less fortunate result in game situations. After the game, the data helps the pitcher see what he did well and areas needing improvement. A hitter can likewise study an opposing pitcher and dissect each pitch to gain an advantage the next time he faces him33. It can improve the predictability of pitches by 12.5% to 50% by leveraging data relationships such as “…Pitcher/Batter prior, Pitcher/Count prior, the previous pitch, and the score of the game.”34 Bloomberg Sports offers this as a software subscription to players.
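To give a feel for how “priors” such as the count and the previous pitch can sharpen a guess, here is a deliberately tiny sketch. It is a plain frequency table, not the method used in the cited research or in any commercial product, and the pitch history is made up for illustration. It predicts the pitch a pitcher has thrown most often in the same count after the same previous pitch, falling back to his overall favorite when that situation has never been seen.

```python
from collections import Counter, defaultdict


def train(history):
    """Build frequency tables from (count, previous_pitch, pitch) tuples."""
    by_situation = defaultdict(Counter)
    overall = Counter()
    for count, previous_pitch, pitch in history:
        by_situation[(count, previous_pitch)][pitch] += 1
        overall[pitch] += 1
    return by_situation, overall


def predict(model, count, previous_pitch):
    """Most frequent pitch for this situation, else the overall favorite."""
    by_situation, overall = model
    seen = by_situation.get((count, previous_pitch))
    source = seen if seen else overall
    return source.most_common(1)[0][0]


# Hypothetical history: FF = four-seam fastball, CU = curveball.
HISTORY = [
    ("0-2", "FF", "CU"),
    ("0-2", "FF", "CU"),
    ("0-2", "FF", "FF"),
    ("1-0", "CU", "FF"),
    ("1-0", "CU", "FF"),
]

model = train(HISTORY)
print(predict(model, "0-2", "FF"))  # situation seen before
print(predict(model, "3-1", "SL"))  # unseen situation, uses overall favorite
```

A real system would add the batter, the score, and far more history, but the principle is the same: the more relevant context, the better the guess compared with always expecting the pitcher’s most common pitch.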

Big data has already impacted the game. It used to be that when a batter was ahead in the count, they expected a fastball. Between 2009 and 2011, the percentage of fastballs thrown in hitter’s counts dropped almost 2.4 percentage points35. The pitcher and pitching coaches are employing dynamic, less predictable mixtures than those that have occurred over the last 150 years of the game.

Fastball decline - hitter's counts

Count   2009 FB%   2011 FB%
All     74.5       72.1
1-0     69.5       67.3
2-0     82.1       79.7
2-1     69.1       66.8
3-1     85.8       83.5

Big data technology helps bring excitement to baseball. Technology researcher Gartner Group says “Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.”36 With metrics that change over time, high-volume is measured in hundreds of terabytes or petabytes, high-velocity in milliseconds, and high-variety involves complexity, using rows and columns to process structured, semi-structured, as well as unstructured data such as social media and tweets. You know you have a big data problem when the dataset’s size exceeds the ability of typical database software tools to capture, store, manage, and analyze it. Big data analytics company Vertica uses the illustration seen here to define the typical dataset37.

A fundamental part of the big data definition is management of disparate data. A typical sabermetric database might have hundreds of thousands of records, but they run into a problem when they couple that database to scouting and medical reports, video, radar, camera tracking, crowd size, crowd noise, home/away, defensive shifts, hotel accommodations, length of road travel and duration, slumps, weather, health and fitness data, team chemistry, financial contract length and value, bonuses, trade data, and other information. In a few years, wearable sensor player data may need to be integrated (they are unlikely to be put in the balls themselves). And that is just the start. Let’s see how various products use big data to help analyze, predict, and optimize player performance, and revolutionize baseball by supplying data that never existed before.


Sportvision

Emmy Award-winner Sportvision is a sports broadcasting technology company that introduced K-Zone in 2001, the same year the Gartner Group published their big data definition38,39. Moneyball had not yet hit the presses, “the” baseball tool was Microsoft Excel, and all MLB statistics totaled less than 2% of the baseball data collected today. Sportvision’s COMMANDf/x, FIELDf/x, PITCHf/x, HITf/x, and SCOUTrax systems thrive on big data in baseball, capturing 2.5 terabytes of data a game and 6 petabytes for all regular season games played by 30 teams. COMMANDf/x analyzes the position of the catcher’s mitt to determine if the pitcher hits the agreed upon spot. FIELDf/x digitally tracks the exact location of every player and hit, and for the first time, brings movement analysis to the game. Sportvision general manager Ryan Zander says “It’s almost overwhelming how much data we’re creating.”40 While some of the data is non-action, such as pitcher warm-ups, birds flying, and players waiting between innings, overall they collect 2-2.5 million records per game.
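The per-game and per-season figures are consistent with each other, as a quick back-of-the-envelope check shows. The sketch assumes a 162-game schedule for each of the 30 teams and decimal units (1 petabyte = 1,000 terabytes).

```python
TEAMS = 30
GAMES_PER_TEAM = 162
TB_PER_GAME = 2.5

# Every game involves two teams, so halve the team-game total.
games_per_season = TEAMS * GAMES_PER_TEAM // 2
total_tb = games_per_season * TB_PER_GAME
total_pb = total_tb / 1000  # decimal petabytes

print(f"{games_per_season} games x {TB_PER_GAME} TB = {total_pb:.3f} PB")
```

That works out to 2,430 regular season games and just over 6 petabytes.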

Sportvision is on the cusp of the big data movement. Much of their revenue comes from selling data generated from missile tracking concepts. Its 2006 PITCHf/x product used in-house compute and storage technology. Along the way, they changed computing models and currently employ multiple replicated cloud databases using Amazon Web Services (AWS) and business intelligence/visualization software from Domo41. Sportvision handles streaming data from every MLB game, sometimes concurrently, and performs real-time analysis as well as split-off data feeds for live Internet and TV action.

Sportvision focuses on giving fans a richer experience while optimizing the management and marketing of sports businesses. While this article focuses on MLB, their amazing enhancement techniques are also employed in football, hockey, golf, basketball, soccer, stock car racing, and the Olympics. For example, their Emmy Award-winning “1st and Ten Line”42 uses a yellow virtual marker to show football fans where a team has to go to reach a first down.

Their MLB product suite illustrates how big data computer vision impacts the sport. Algorithms measure 3D objects such as people and background shapes by leveraging artificial intelligence, cognitive science (study of the mind and its processes), and machine learning. PITCHf/x visualization of the ball crossing the plate and underlying data reach fans in stadiums and at home through the Internet, mobile apps, and televised games. Trajectory and ball speed off the bat are captured and portrayed by HITf/x. Sportvision data allows us to objectively answer questions that burn in the hearts of baseball aficionados everywhere, such as:
• Which fielder is the best when playing shallow?
• Which double-play shortstop/2nd baseman combinations are the most efficient?
• Where should a fielder position himself for a relay throw from the outfield?
• Who has the weakest/strongest throwing arms from the outfield?

While the Oakland Athletics used Moneyball concepts to find undervalued players, tools like PITCHf/x and FIELDf/x now employ advanced big data metrics unavailable in that era. These tools precisely measure player performance and calculate if they are in the top 10% of batted ball velocity, use optimal paths 85% of the time to make a catch, or whether they have 98% throwing accuracy – a real-time refinement of Moneyball.

In the future, big data has to meet new challenges, such as the demands that 3DTV will have when depth is added to a broadcast. While 3DTV has had fits and starts, fans thrill to the batter’s view of the ball as it approaches the plate, or root for a fielder trying to catch up to a batted ball. Higher definition video is just around the corner. 4K Ultra HDTV has 8 megapixels or 4X the resolution of a 1080p 2-megapixel HDTV, and will put a strain on this real-time big data system. Who knows – fans may want to see the stitching on the ball or the grains of wood on bats when this staggering level of detail becomes commonplace.
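The 4X claim is easy to verify from the standard frame sizes, 3840 x 2160 pixels for 4K Ultra HD and 1920 x 1080 for 1080p. These dimensions are the common consumer standards rather than figures from this article.

```python
UHD_4K = (3840, 2160)    # 4K Ultra HD frame, in pixels
FULL_HD = (1920, 1080)   # 1080p frame, in pixels


def megapixels(width, height):
    """Pixels per frame, in millions."""
    return width * height / 1_000_000


ratio = (UHD_4K[0] * UHD_4K[1]) / (FULL_HD[0] * FULL_HD[1])
print(f"4K: {megapixels(*UHD_4K):.1f} MP, "
      f"1080p: {megapixels(*FULL_HD):.1f} MP, ratio {ratio:.0f}x")
```

Every frame carries four times as many pixels, so any system that analyzes video in real time has four times as much raw image data to move and process at the same frame rate.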

PITCHf/x

Sportvision’s PITCHf/x system tracks every pitch in every game using three permanently mounted digital cameras to follow each pitch at 30 frames per second (FPS) through a centralized tracking system. First installed in 200743, one camera is placed high over 1st base and another high over home plate. A third centerfield-mounted camera stays focused on home plate to establish the strike zone since it differs for each batter.


Each camera is attached to its own computer allowing each image to be instantly scanned for a baseball. If one is found, the next frame is checked for a ball in-flight to make sure it is a pitch. The cameras triangulate the ball’s exact mid-flight location through real-time digitally processed pattern-recognition of each video frame. Raw data is transformed into (x,y,z) coordinates of the pitch. For each time t, 9 values are collected:
• 3 positions x, y, z
• 3 velocities vx, vy, vz
• 3 accelerations ax, ay, az
PITCHf/x calculates 60 data points per pitch including the release point, spin, speed, break, arc, location, and where the ball crosses the plate to within an inch. It algorithmically determines 8 pitch types, including fastball, curve, changeup, slider, split-finger fastball, cut fastball, and knuckleball44. The real-time data is used by broadcasters and internet providers like ESPN and Bloomberg, and augmented by staff at the game with box scores and live play-by-play.
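Those 9 values are enough to describe a pitch’s flight if we assume the acceleration on each axis stays constant, the familiar equations of motion from introductory physics. The sketch below makes that assumption explicit; the class name, field names, and sample numbers are hypothetical and are not taken from the PITCHf/x specification.

```python
from dataclasses import dataclass


@dataclass
class PitchFit:
    """Nine-value description of a pitch (feet, ft/s, ft/s^2)."""
    x0: float
    y0: float
    z0: float
    vx0: float
    vy0: float
    vz0: float
    ax: float
    ay: float
    az: float

    def position(self, t):
        """(x, y, z) after t seconds, assuming constant acceleration."""
        start = (self.x0, self.y0, self.z0)
        vel = (self.vx0, self.vy0, self.vz0)
        acc = (self.ax, self.ay, self.az)
        return tuple(p + v * t + 0.5 * a * t * t
                     for p, v, a in zip(start, vel, acc))

    def velocity(self, t):
        """(vx, vy, vz) after t seconds, assuming constant acceleration."""
        vel = (self.vx0, self.vy0, self.vz0)
        acc = (self.ax, self.ay, self.az)
        return tuple(v + a * t for v, a in zip(vel, acc))


# Made-up fastball: starts 50 ft from the plate, 6 ft high, moving toward
# home (negative y) while drag slows it (positive ay) and gravity and spin
# pull it down (negative az).
demo = PitchFit(x0=-2.0, y0=50.0, z0=6.0,
                vx0=5.0, vy0=-130.0, vz0=-5.0,
                ax=-10.0, ay=28.0, az=-15.0)
print(demo.position(0.2))
print(demo.velocity(0.2))
```

From a description like this, quantities such as speed at the plate, break, and the point where the ball crosses the strike zone can all be derived rather than measured separately.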

The combined data stream is transmitted to servers and the video frames are converted into XML data files. Fans in the ballpark can follow the PITCHf/x action on the center field screens or almost instantly on their iPads. At home, fans see PITCHf/x-enhanced TV graphics as the ball is pitched (broadcasters may call it by their own trade name). Multiply this gigabyte-per-pitch activity by as many as 15 league games, often played simultaneously and all processed within the same sub-second window by Major League Baseball Advanced Media (MLBAM), and you get an incredible high-volume, high-velocity big data feed. The XML data is available online for free within seconds. This bonanza of baseball insider information includes the details of a pitcher's repertoire and approach to a batter.

The ball’s speed, location, and trajectory are precisely measured from the moment the pitcher releases it until it crosses home plate. It is algorithmically possible to discern the effect of a fast versus slow pitch, the ball’s break up or down, left or right, and its forward or reverse spin. As a result, its initial velocity, gravity, drag, and the Magnus force45 can be quantified and displayed. Positional data allows the ball to be shown from above looking at home, from the pitcher’s view towards home, and from the batter’s view looking at the pitcher.

PITCHf/x produces statistics that sabermetricians only dreamt about. More than strike, ball, and hit outcomes, it gives information on the ball’s path and exactly where it crosses the strike zone. A trail can be graphically added to the ball for instant replays46. The data also allows for follow-up coaching analysis or further fan enjoyment. Cameras log 36 events per pitch, and there are roughly 300 pitches a game, not counting 8 warm-up pitches every half-inning. With 15 matchups (30 teams) and 162 games a year, the system records over 26 million real-time XML records a season. Compared to the older radar gun approach, PITCHf/x is analogous to visiting a doctor for a continuous electrocardiogram as opposed to just checking your pulse. The radar gun gives a single pitch observation while PITCHf/x gives the full set of flight dynamics.
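The season total quoted above is a straightforward product, and a short sketch makes the factors explicit. The variable names are illustrative only, the inputs are the round figures from the text, and warm-up pitches are left out just as the text leaves them out.

```python
# Back-of-the-envelope count of PITCHf/x records in one regular season,
# using the round figures quoted in the text (warm-up pitches excluded).
EVENTS_PER_PITCH = 36     # camera events logged per pitch
PITCHES_PER_GAME = 300    # approximate pitches in one game
MATCHUPS = 15             # 30 teams paired into 15 games
GAMES_PER_TEAM = 162      # regular-season games per team

games_per_season = MATCHUPS * GAMES_PER_TEAM
records_per_season = EVENTS_PER_PITCH * PITCHES_PER_GAME * games_per_season

print(f"{games_per_season:,} games, {records_per_season:,} records")
```

The product comes to a little over 26 million records, which is where the figure in the paragraph above comes from.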

PITCHf/x provides extensive, automated, and accurate data on velocity, pitch type, location, and result. It tracked over 4 million pitches from 2007-2011 and is used by 16 leagues across the U.S., Canada, South Korea, and the Dominican Republic47. Unfortunately, the system cannot go back in time to analyze the pitching greats and home run sluggers of earlier eras – we have incomplete views of these legends.

Coaches, managers, and scouts use various tools to eke out an extra win and better their playoff chances. Fans can use the same technology to enhance the game. Every year there is something new and exciting to learn about this game. For example, big data tools like PITCHf/x are available on smartphones. Other tools slice and dice the data into amazing analytics such as pitch plots of horizontal movement versus speed.

Fans love the way PITCHf/x shows the exact spot where the ball crosses the graphical plate48, especially when they think the umpire has blown the call, which happens on about one in every six called pitches that do not involve a swing, according to baseball author Tobias Moskowitz49. Truth be told, making 300 correct split-second decisions a game on a ball darting through the air at 90 mph is hard to do. It is also possible the umpire is biased against a batter or pitcher and calls a strike on an actual “ball” and vice-versa. PITCHf/x is invaluable in determining the accurate and optimal strike zone location. It shows which batters swing at bad pitches, and it is easy to chart strike zone consistency by pitch type and velocity. In this PITCHf/x illustration, a batter felt a pitch was outside (it was) and was ejected from the game after disagreeing with the umpire’s “strike” call.

Referencing the charts below, umpire Eric Cooper called the action for a 2012 (squares) game against the Orioles (triangles)50. The chart on the left shows his calls on left-handed batters, and the chart on the right shows his calls on right-handed batters. Red is a called strike and green is a called ball. Remarkably, Cooper was correct in most of his judgment calls. Big data in action!

PITCHf/x can also help analyze the ball’s in-flight behavior. In this 7-pitch sequence, the speed (SPD), break (BRK – the arc of the ball, similar to gravity), left-right horizontal movement caused by its spin (PFX), pitch type, and the result are shown. Notice the big difference between the speed and the break of the curveball versus the fastball.

Here is another example of PITCHf/x helping to pinpoint pitching issues. David Price is a 27-year old left-handed Tampa Bay Rays ace who won the 2012 Cy Young award with a 20-5 record, a 2.56 ERA, and a .602 OPS. He made $4.3 million in 2012 and $10 million in 2013. Great things were expected of him, so fans were concerned with his 3-5 record over the first 12 games of 2013. His ERA jumped up 53%, hits per 9 innings were up 28%, home runs/9 were up 43%, strikeouts/9 were down 14%, and OPS was up 20%.

Year   W   L   ERA   H/9  HR/9  BB/9  SO/9   OPS
2008   0   0  1.93   5.8   0.6   2.6   7.7  .501
2009  10   7  4.42   8.3   1.2   3.8   7.2  .716
2010  19   6  2.72   7.3   0.6   3.4   8.1  .637
2011  12  13  3.49   7.7   0.9   2.5   8.7  .659
2012  20   5  2.56   7.4   0.7   2.5   8.7  .602
2013   3   5  3.94   9.5   1.0   1.6   7.5  .720

What was wrong? If this was just a short-term issue, fans would not have been overly concerned, but the problem occurred start after start. BrooksBaseball.com’s PITCHf/x data showed his four-seam fastball (little movement) was down 1.9 mph and his sinking fastball (downward and horizontal movement) was down 2.3 mph. The slower these pitches are, the less effective his changeup is. He threw his faster pitches 5% less often and his slower pitches 5% more often.

With PITCHf/x data publicly available at no charge, a fan used it to create this custom graphic to the right showing Price’s issues51. From a batter’s perspective, you can see the ball’s movement and speed as it leaves Price’s hand. The fan noted it is not good to have a fastball and sinker (circled in red) look similar to the hitter, and that Price’s curveball was more effective, with a larger drop, in 2012 than during his first 12 appearances in 2013.

R.A. Dickey, winner of the 2012 National League Cy Young Award, is one of the premier knuckleball pitchers. A knuckleball is aerodynamically difficult to hit. It rotates once or twice after release whereas a fastball rotates 16-17 times. Generally thrown between 60-80 mph, a knuckleball wobbles on its way to home, and is further influenced by the wind and drag. On June 18, 2012, Dickey faced the Orioles and had 13 strikeouts in a 5-0 shutout. This PITCHf/x visualization shows his 114 pitches from the catcher’s viewpoint, with 84% of them being knuckleballs – strikes in red and balls in yellow52.

PITCHf/x produces readable XML data that can be easily manipulated in myriad ways and is described in upcoming pages.

TexasLeaguers.com uses the exact same PITCHf/x XML data to produce a bird’s eye view of the pitches53. On the left is Dickey’s release point. The right-hander stands to the 3rd base side of the pitching rubber. His knuckleball “* KN” is incredibly straight, with virtually no left-right bending after it is released. On the right, the ball dramatically drops as it crosses home. In its last 15’ of flight, it drops almost 3’, making it extremely difficult to hit! PITCHf/x provides 50’ of flight data on the way to home, but visualization services like TexasLeaguers extrapolate the data back to the release point and show 55’ of travel. “The ball's path is calculated by plugging the acceleration and velocity data provided by MLBAM into the standard equation for acceleration: x = ax0*t^2 + vx0*t + x0”54. Note that the textbook constant-acceleration form carries a factor of ½ on the t^2 term.
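To make the path calculation concrete, here is a minimal sketch that evaluates the textbook constant-acceleration form, p(t) = p0 + v0·t + ½·a·t², on each axis. The nine fit parameters are copied from the sample pitch record quoted later in this article; the function names are illustrative, and the choice to evaluate the pitch at the front of home plate (17 inches from the plate's back tip) is an assumption of this sketch rather than something the article states.

```python
import math

def position(t, p0, v0, a):
    """Position along one axis (feet) after t seconds of constant acceleration."""
    return p0 + v0 * t + 0.5 * a * t * t

def time_to_reach_y(y_target, y0, vy0, ay):
    """Earliest positive time at which the ball reaches y_target (ay must be non-zero)."""
    # Solve 0.5*ay*t^2 + vy0*t + (y0 - y_target) = 0 for t.
    qa, qb, qc = 0.5 * ay, vy0, y0 - y_target
    root = math.sqrt(qb * qb - 4 * qa * qc)
    candidates = [(-qb - root) / (2 * qa), (-qb + root) / (2 * qa)]
    return min(t for t in candidates if t > 0)

# Fit parameters from the sample <pitch> record quoted later in the article.
x0, y0, z0 = 2.562, 50.0, 6.035            # feet
vx0, vy0, vz0 = -10.7, -137.956, -6.156    # feet per second
ax, ay, az = 15.708, 33.556, -12.449       # feet per second squared

FRONT_OF_PLATE_FT = 17 / 12                # assumed evaluation point
t_plate = time_to_reach_y(FRONT_OF_PLATE_FT, y0, vy0, ay)
x_plate = position(t_plate, x0, vx0, ax)
z_plate = position(t_plate, z0, vz0, az)

print(f"t={t_plate:.3f} s, x={x_plate:.3f} ft, z={z_plate:.3f} ft")
```

Under these assumptions the computed plate crossing agrees, to within rounding, with the px and pz attributes stored in the same record, which is a useful sanity check on both the formula and the sign conventions.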

Let’s leverage PITCHf/x big data to explore the success of the 2011 American League Cy Young winner, Tigers right-handed pitcher Justin Verlander. That year, he had an impressive 24-5 record, averaged over 7 innings per game, held batters to a low .191 average, and struck out over a quarter of his opponents. PITCHf/x shows more than half of his nearly 4,000 pitches were 95 mph fastballs, mixing in a 79 mph curveball (19%), 87 mph changeup (18%), and an 86 mph slider (8%)55.

Verlander 2011      #     %   min Vel  max Vel  avg Vel
Fastball        2,139   55%      91.1    101.4     95.0
Curveball         729   19%      74.4     84.9     79.4
Changeup          719   18%      82.6     91.0     86.8
Slider            327    8%      81.9     90.6     85.9
Total           3,914

His success depends on his devastating circle changeup. It is thrown like a fastball but with a dramatically different grip. You can see the index finger and thumb form a circle that doesn’t touch the seams, and the ball is in the palm of his hand. Verlander’s changeup is 8+ mph slower than his blazing fastball, and that is the key. His delivery fools the batter into believing the pitch is a fastball, but it takes .037 seconds longer to arrive. That is enough to throw off the batter’s timing. When batters expect a changeup, their reaction time to the fastball is too slow. PITCHf/x shows that the difference in velocity makes his changeup very effective.
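The .037-second gap can be roughly reproduced from the average velocities alone. This is a simplified sketch: it assumes each pitch holds its average speed over a 55-foot flight (the travel distance mentioned earlier for release-point visualizations), whereas a real pitch slows down on its way to the plate. The function and constant names are illustrative.

```python
FEET_PER_MILE = 5280
SECONDS_PER_HOUR = 3600

def flight_time(distance_ft, speed_mph):
    """Travel time in seconds, assuming the ball holds a constant speed."""
    speed_fps = speed_mph * FEET_PER_MILE / SECONDS_PER_HOUR
    return distance_ft / speed_fps

DISTANCE_FT = 55.0                           # assumed release-to-plate distance
fastball_s = flight_time(DISTANCE_FT, 95.0)  # 2011 average fastball velocity
changeup_s = flight_time(DISTANCE_FT, 86.8)  # 2011 average changeup velocity
gap_s = changeup_s - fastball_s

print(f"fastball {fastball_s:.3f} s, changeup {changeup_s:.3f} s, gap {gap_s:.3f} s")
```

With these inputs the gap rounds to .037 seconds. A longer assumed distance or a deceleration model would shift the number slightly, so treat it as an order-of-magnitude check.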

The more a ball spins, the harder it is to hit. This chart shows batting averages, slugging percentages, and swinging strike effectiveness by spin rates56. Verlander attributes part of his success to his 3,004 rpm curveball, which rotates 23% more than the MLB average of 2,450 rpm. He threw it 19% of the time in 2011. PITCHf/x “…calculates the spin rate of pitched balls based on the fit trajectory,”57 so the figure is an approximation. Video doesn’t directly capture the rotation, just where the ball is in each frame. It takes more advanced camera technology with higher FPS and data rates to count the rpm of a white ball with red seams using video. Fortunately, PITCHf/x can be augmented with additional data such as TrackMan’s radar system, which does capture the spin of the ball58 without a camera.

Spin Rate   RPM         AVG   SLG   Swinging Strike
Low Spin    <2300      .244  .403    8%
Avg Spin    2300-2500  .185  .340   10%
Plus Spin   2500-2700  .178  .251   12%
High Spin   >2700      .166  .212   15%
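As a small illustration of how the spin bands might be applied, the sketch below buckets a spin rate into the four categories from the chart and recomputes the 23% figure. The function name is hypothetical, and the handling of the exact boundary values (2,300, 2,500, and 2,700 rpm) is an assumption, since the chart does not say which band they belong to.

```python
def spin_bucket(rpm):
    """Map a spin rate to one of the four bands used in the chart."""
    if rpm < 2300:
        return "Low Spin"
    if rpm < 2500:
        return "Avg Spin"
    if rpm <= 2700:
        return "Plus Spin"
    return "High Spin"

VERLANDER_CURVE_RPM = 3004
MLB_AVERAGE_RPM = 2450

# Percentage by which Verlander's curveball out-spins the league average.
pct_above_average = (VERLANDER_CURVE_RPM / MLB_AVERAGE_RPM - 1) * 100

print(spin_bucket(VERLANDER_CURVE_RPM), f"{pct_above_average:.0f}% above average")
```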

On the left is Verlander’s tight release point cluster for 201159. The pitch stratification on the right is very consistent. As a right-handed pitcher, his 95 mph fastball and his 86 mph changeup stay 5-10” toward the left side of home plate – speed is the only difference.

The game footage illustration below shows Verlander with PITCHf/x insights from 4 distinct pitch types superimposed on each other with captions60. He mixes 100 mph fastballs, 79 mph curves, 86 mph changeups, and 88 mph sliders. In this at-bat, veteran Justin Morneau swings and misses multiple times. All 4 pitch types had the same motion and release point. The screen shot on the right shows the pitches have significantly separated with different velocities. In baseball slang, that is called nasty.

Coaches today look at release points for all their pitchers when their velocity drops. It is hard to spot a 3-6” drop with the naked eye, but real-time data shows this instantly. The ultimate use case is in the field of kinesiology where predictive analytics could prevent a shoulder, torso, elbow, wrist, or hand injury when a “bad” pitching motion is detected.

Suppose you are a right-handed batter facing the Giants’ 2008-09 National League Cy Young winner, righty Tim Lincecum, and you want to get a feel for his slider. PITCHf/x big data can display images of batters similar to you and show what you might see in a game. The search finds event IDs #2459447 and #2576000. This visualization of slider #2576000 leaves Tim’s hand at 86.1 mph and has slowed to 85.5 mph on its way to home61.

This illustration shows slider #2459447 was just outside at 75.7 mph, spun at 1,042 rpm (8 rotations from its release until it crossed home), broke to the right 5.46”, and was called a strike. The 2nd slider, #2576000, crossed the plate much lower and at 80 mph, rotated at 689 rpm (5 rotations), had a little less break to the right at 3.65”, and was also called a strike.

Lincecum Pitch ID  Start Speed  End Speed  Break  Rotation  Release Point
2459447                   82.9       75.7    5.5   1,042.1           6.12
2576000                   86.1       80.0    3.7     689.8           6.05
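The rotation counts follow from spin rate multiplied by flight time. The sketch below estimates flight time from the mean of the start and end speeds over an assumed 55-foot flight; both the averaging shortcut and the distance are simplifying assumptions, and the function name is illustrative.

```python
def rotations_in_flight(spin_rpm, start_mph, end_mph, distance_ft=55.0):
    """Estimate how many times the ball turns between release and the plate."""
    mean_fps = (start_mph + end_mph) / 2 * 5280 / 3600   # mph -> feet per second
    flight_seconds = distance_ft / mean_fps
    return spin_rpm / 60 * flight_seconds                # rev/s * seconds

slider_a = rotations_in_flight(1042.1, 82.9, 75.7)   # pitch ID 2459447
slider_b = rotations_in_flight(689.8, 86.1, 80.0)    # pitch ID 2576000

print(f"{slider_a:.1f} and {slider_b:.1f} rotations")
```

Rounded to whole turns, the two estimates match the 8 and 5 rotations quoted for these sliders.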

Detroit Tigers’ Miguel Cabrera is a feared hitter. He won the American League Triple Crown in 2012, and PITCHf/x shows his hitting strengths and weaknesses62. He crushes low and inside pitches. Opposing managers use spray charts to position players where hits are likely to go63. The legend shows if the ball was an out, error, or hit. In 2012, if an opposing manager had his pitcher throw Miguel a curveball, he might use the following spray chart, small sample size and all, and have his 3rd baseman and shortstop play in the blue and red oval areas on the right. Spray charts can be automatically created by coupling HITf/x and PITCHf/x data64.

PITCHf/x is also useful to teach “framing”. As the ball crosses the plate, an algorithm can determine if the umpire will call it a ball or a strike. Good catchers “frame” or move their glove once they catch the ball in order to influence the umpire’s decision. For example, if borderline pitches are called strikes 50% of the time, increasing the called strike success to 60% through framing gives a team a terrific advantage. One study showed that framing saves an average of 20 runs a season65. Another study found that Rays’ catcher Jose Molina saved 73 runs one year by framing the pitch66. That difference could mean a couple of extra victories, potentially worth millions of dollars to a team. Based on PITCHf/x data, Molina frames 13.4% of balls into called strikes67.

PITCHf/x data is free and in XML format, so let’s take a much closer look at it. The data is found at http://gd2.mlb.com/components/game/mlb/. Pick a year >=2007, a month, and a day, and find the game you want and the inning. This example is from the 9th inning of Game 4 of the World Series on October 28, 2012 at Comerica Park, with the Giants’ 1st baseman Brandon Belt facing Tigers pitcher Phil Coke. The full URL is http://gd2.mlb.com/components/game/mlb/year_2012/month_10/day_28/gid_2012_10_28_sfnmlb_detmlb_1/inning/inning_9.xml.
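The URL follows a regular year/month/day/game/inning pattern, so it can be assembled programmatically. The helper below is a hypothetical sketch inferred from the single example URL above; the function name, its parameters, and the meaning of the trailing game number are assumptions, and the code only builds the string without checking whether the server still hosts the file.

```python
from datetime import date

BASE_URL = "http://gd2.mlb.com/components/game/mlb"

def inning_url(game_date, away_code, home_code, inning, game_number=1):
    """Build the inning XML URL following the pattern of the example in the text."""
    game_id = (f"gid_{game_date:%Y_%m_%d}_"
               f"{away_code}mlb_{home_code}mlb_{game_number}")
    return (f"{BASE_URL}/year_{game_date:%Y}/month_{game_date:%m}/"
            f"day_{game_date:%d}/{game_id}/inning/inning_{inning}.xml")

# Game 4 of the 2012 World Series, 9th inning (team codes taken from the example URL).
url = inning_url(date(2012, 10, 28), "sfn", "det", 9)
print(url)
```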

Open inning_9.xml. The metadata definitions are in the appendix. Coke throws 6 pitches: Ball, Ball, Foul ball, Foul ball, Ball, and Swinging Strike. Scroll down to this sequence of pitches.

Let’s focus on the XML data for pitch #4 that Belt fouls off in the red oval:

<pitch des="Foul" des_es="Foul" id="507" type="S" tfs="230849"
  tfs_zulu="2012-10-29T03:08:49Z" x="109.01" y="132.11" sv_id="121028_230849"
  start_speed="94.4" end_speed="86.0" sz_top="3.32" sz_bot="1.53"
  pfx_x="8.23" pfx_z="10.3" px="-0.315" pz="2.919"
  x0="2.562" y0="50.0" z0="6.035"
  vx0="-10.7" vy0="-137.956" vz0="-6.156"
  ax="15.708" ay="33.556" az="-12.449"
  break_y="23.7" break_angle="-43.3" break_length="4.1"
  pitch_type="FF" type_confidence=".676" zone="1" nasty="41"
  spin_dir="141.469" spin_rate="2651.720" cc="" mt=""/>

The x-axis points to the catcher’s right, the y-axis towards the pitcher, and the z-axis vertically upward68. XML is hard to visualize, but it says the four-seam fastball lost 8.4 mph [start_speed = "94.4" minus end_speed = "86.0"] and was fouled off [pitch des = "Foul"].
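Because each pitch is a single XML element, a few lines of code can pull out the attributes discussed above. This is a minimal sketch using Python's standard library; the record is a trimmed copy of the pitch shown above (only the attributes used here), and the strike-zone test is a simplification that assumes a 17-inch-wide plate and ignores the radius of the ball.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the pitch record shown above (only the attributes used here).
PITCH_XML = (
    '<pitch des="Foul" type="S" start_speed="94.4" end_speed="86.0" '
    'sz_top="3.32" sz_bot="1.53" px="-0.315" pz="2.919" pitch_type="FF"/>'
)

pitch = ET.fromstring(PITCH_XML)

# Velocity lost between release and the plate, in mph.
speed_lost = float(pitch.get("start_speed")) - float(pitch.get("end_speed"))

# Simplified strike-zone test: inside half a plate width horizontally and
# between the batter-specific bottom and top of the zone vertically (feet).
HALF_PLATE_FT = 17 / 12 / 2
px, pz = float(pitch.get("px")), float(pitch.get("pz"))
in_zone = (abs(px) <= HALF_PLATE_FT
           and float(pitch.get("sz_bot")) <= pz <= float(pitch.get("sz_top")))

print(pitch.get("pitch_type"), pitch.get("des"),
      f"lost {speed_lost:.1f} mph", "in zone" if in_zone else "out of zone")
```

The same pattern extends to a whole inning file by iterating over every pitch element instead of parsing a single string.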

HITf/x

Whereas PITCHf/x follows the pitcher’s delivery, HITf/x follows the ball’s trajectory off the bat. HITf/x reprocesses the same PITCHf/x images to extract ball data whenever the bat hits it – no separate cameras are required69. It tracks the ball to the pitcher’s mound and measures its speed off the bat as well as its vertical and horizontal angles as it leaves the bat. This helps determine if the ball is a hit but not where it lands. Unlike publicly available PITCHf/x data, HITf/x data is privately licensed, and was only available to the public for a short while in 2009. Much of the HITf/x functionality was incorporated into Sportvision’s latest effort, FIELDf/x, which is discussed in the next section.

Algorithms kick in at the split-second the ball hits the bat to extract three-dimensional trajectory and movement parameters. Other data can be integrated such as pitch tracking, atmospheric conditions, park dimensions, etc. to analyze the expected path. The course can be affected by the ball’s spin, humidity, temperature, drag, and wind speed70. On humid days, the baseball doesn’t travel as far because the yarn in the ball soaks up moisture causing it to absorb more of the bat’s energy.

HITf/x uses 20 optical samples of the batted ball to produce the bat’s contact point, ball speed off the bat, the elevation angle, and where the ball is heading71. These angles are key to understanding the ball’s flight and affecting the defensive alignment designed to prevent the hitter from safely reaching base.

The distance a ball travels is highly dependent on three factors. One is the speed off the bat and its backspin. Dry balls travel further during warm, humid weather at higher altitude with the wind blowing from home to the outfield. HITf/x correlates how hard it is struck with the odds of it being a hit72. An 85 mph hit could be a single, with a double, triple, and home run requiring 95-100 mph. The 2nd factor is the vertical launch angle (VLA), such as 20° (90° would be straight up in the air and 0° would be a straight line drive). The last one is the horizontal launch angle (HLA) or spray of the ball (the location it is headed, with 45° being the right field foul line and 135° the left field foul line)73. These angles are unaffected by conditions like the weather, and mathematically show whether it is a home run or in the playing area. In 2012, the hardest hit ball belonged to Toronto’s Edwin Encarnacion (119 mph).74

Date     Hitter             Team  Pitcher         Team  Ballpark          Speed off bat  Feet  VLA   HLA
9/1/12   Edwin Encarnacion  TOR   J.P. Howell     TB    Rogers Center     118.9          488   24.0  101.5
4/15/12                     CLE   Luis Mendoza    KC    Kauffman Stadium  117.2          481   29.7  64.4
8/17/12  Giancarlo Stanton  MIA   Josh Roenicke   COL   Coors Field       116.3          494   24.9  95.5
6/3/12   Nelson Cruz        TEX   Bobby Cassevah  LAA                     116.3          484   26.5  100.1
7/2/12   Cameron Maybin     SD                    ARI   Chase Field       115.7          485   24.9  98.0
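The horizontal launch angle lends itself to a simple classification. The sketch below uses the foul-line angles given above (45° for right field, 135° for left field) to decide fair versus foul, and then splits fair territory into three equal 30° wedges. That three-way split and the function name are illustrative assumptions, not part of HITf/x.

```python
RIGHT_FIELD_LINE_DEG = 45.0
LEFT_FIELD_LINE_DEG = 135.0

def spray_direction(hla_deg):
    """Classify a batted ball by horizontal launch angle (HLA) in degrees."""
    if hla_deg < RIGHT_FIELD_LINE_DEG or hla_deg > LEFT_FIELD_LINE_DEG:
        return "foul"
    if hla_deg < 75.0:        # assumed boundary between right and center
        return "right field"
    if hla_deg <= 105.0:      # assumed boundary between center and left
        return "center field"
    return "left field"

# HLA values from the 2012 table above.
for hitter, hla in [("Encarnacion", 101.5), ("CLE hitter", 64.4), ("Stanton", 95.5)]:
    print(hitter, spray_direction(hla))
```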

Let’s use the Colorado Rockies, who play at Denver’s Coors Field a mile above sea level, to explain the atmospheric impact on a hit. Thinner air allows the ball to travel further, so a 400’ home run at sea level travels 34’ further at Coors75. On August 17, 2012, the year’s longest home run (green highlight in the previous table and image to the right) was hit at Coors by right-handed Giancarlo Stanton of the Miami Marlins off Rockies righty pitcher Josh Roenicke on a 6th inning breaking ball. With HITf/x data augmented by Hit Tracker76, the ball traveled 494’. It left Stanton’s bat at 116 mph at a 24.9° VLA and a 95.5° HLA. Hit Tracker doesn’t use cameras or radar, but relies on trajectory data to calculate the path, and HITf/x data for direction, flight time, and absolute landing spot.

Not only is the data exciting to fans, it is also useful for positioning fielders. If Yankee right-hander Phil Hughes faced the Orioles, the Yankee manager would shift the defense based on a spray chart lineup card of previous Hughes-Orioles duels77. This shows where Markakis, Machado, and Davis have hit Hughes’ pitches. The shortstop would move to the right of 2nd base when Davis came to bat. A sufficiently large sample size is clearly required.

Based on this chart of where a Red Sox slugger hits the ball78, you might move your shortstop to the right of 2nd base, your 2nd baseman into short right field, and let your 3rd baseman play in the shortstop position. The Rays employ a shift based on spray patterns, and in 2011 their league-leading defense saved 85 runs79.

Scouts can use HITf/x to objectively evaluate prospects and approximate their success in a team’s home stadium. This helps scouts determine if their observations truly exemplify a player’s ability, find out which prospects are under- or over-valued, or figure out if a player is just lucky.

HITf/x can establish a hitter’s true performance profile. If they are in a slump, their batted ball speed could be compared to their historical results. If they average 80 mph but could only muster 70 this week, a coach might want to explore the power drop-off. This helps rule out bad luck as the culprit, such as hitting balls right at fielders, and helps determine if they have a mechanical issue with their swing. While this is just one indicator, it is preferable to looking at a declining batting average and explaining they are having a 0-22 hitting “streak”. Another example is whether game conditions, or the amount of foul territory in a series of parks, could be causing their short-term poor performance.

FIELDf/x

Most of baseball’s statistical history focuses on hitting and pitching, with defensive stats by and large limited to outs, errors, and double-plays once a fielder reaches a ball80. A shortstop who ranges far to his right to scoop up a ball, leaps, spins, and fires the ball to 1st base to get the runner out, is merely credited with an assist. A runner who times the pitcher’s delivery, takes an aggressive lead, sprints to 2nd, and slides in head-first is simply credited with a stolen base. To value a hit, the defense and its alignment, crowd noise, park dimensions, weather, and other factors should be taken into account. FIELDf/x real-time data tells us how the play happened, not simply that it occurred. With FIELDf/x, defensive luck can be factored out of a fielder’s ability to play a position, and a runner is given credit for his running abilities.

Box scores report numeric conclusions – runs, hits, errors, averages, innings, pitches, etc. Quantifying fielding, throw accuracy, and efficient base running has been elusive and often limited to a stopwatch. A fielder’s skill or lack thereof is treated as a binary event, wrapped in adjectives, or ignored. FIELDf/x revolutionizes player evaluation and reporting with everything on the field digitally tracked. Even sabermetric enhancements pale in comparison to the in-depth big data analysis that cutting-edge FIELDf/x provides – the majority of what playing baseball is all about. Sportvision calls FIELDf/x the “…holy grail of baseball metrics”81, and it is a tremendous leap forward for defensive positioning and player reaction times. We can now quantify if a fielder took off in the right direction when the ball was hit, or if he went forward when he should have gone backward. It leads to precise, measurable, and objective views of how much a player adds to or subtracts from a team’s capabilities, and how interrelated his actions are to other events on the field.

With empirical FIELDf/x data, the notion of a stolen base becomes a subject unto itself. Hall of Famer Rickey Henderson’s plaque calls him the “Man of Steal” as he holds the MLB record with 1,406 stolen bases over a 25-year career. If FIELDf/x had been in use when he played, we might have learned more about how he “read” the pitcher, how he evaluated a catcher’s throwing arm, and how he maximized a 4-5 step lead, sliding into 2nd base 2.9 seconds later82. Conversely, teams could use data to better defend against the stolen base. Big data systems show the length of the initial and secondary lead off a bag against a particular pitcher’s wind-up, the catch by the catcher, the catcher’s throw to 2nd base, and the tag on the runner. It determines how fast a runner needs to be against specific pitcher-catcher-fielder combinations. Teaching a runner to get a slightly longer lead could mean the difference between a win and a loss, or reduce successful pitcher pick-offs if the runner can leave a fraction of a second later and still steal the base. The variables include the lead length, the runner’s speed, the type of pitch thrown, the catcher’s reaction time and the strength and accuracy of their arm, and whether the runner slides feet first or head first. The catcher must get their throw to 2nd base in 2.0-2.2 seconds83.

Until recently, precision data required a point-in-time radar gun. FIELDf/x uses four high-definition, computer-controlled Prosilica GX cameras mounted at high points in the stadium to cover the exact position of all offensive and defensive players every second they are on the field84. Initially installed in 2010 at AT&T Park (San Francisco Giants)85, the system uses motion capture processing on the images to interpret the action. It precisely follows the ball as it crosses the plate, and computer vision algorithms turn the offensive and defensive players, offensive coaches, and umpires86 into trackable graphic images. An operator adds the umpire’s ball-strike judgment call and people’s names. As with PITCHf/x, details are recorded as the “…pitcher releases the ball, batter hits the ball, fielder gains possession of a ball (fields it, or catches it from a throw) and fielder throws the ball.”87 With a hit ball, FIELDf/x displays its angle of elevation, maximum height, and how far it travels, all while showing the fielder’s speed to the landing spot and the distance they cover. It captures action every 1/30th of a second, resulting in approximately 50,000 offensive and defensive player object-recognition measurements88 and 600,000 – 1,000,000 data points89,90 a game. It results in an amazing high-volume, high-velocity, and high-variety big data feed.

The system provides a bird's-eye view of all on-field action – leads, speeds, routes, and the ball’s fair or foul trajectory. It reports on a fielder’s range, reaction time, speed to the ball, throwing velocity, and throwing accuracy to a base or as a relay throw. For example, 3rd basemen have about 1½ seconds to glove a ball while middle infielders have an extra ½ second, giving them greater range91. FIELDf/x shows if a fielder takes an optimal path to the ball, or plays better in the daytime or nighttime, or on grass or artificial turf. And as we’ve already seen, managers use this data to position fielders, and decide on pitch selection based on the situation, the pitcher, and the opponent.

The long debate over the quickest left fielder, the shortstop with the greatest range, or who is worthy of a Gold Glove can now be quantified. A full season of FIELDf/x empirical data makes sabermetric statistics on errors per chance, assists, or passed balls passé. FIELDf/x metrics can help with contract negotiations by giving general managers a player’s “hustle value” comparison, or help pair up shortstops and 2nd basemen based on their combined time to complete a double play. For example, if shortstop “Rick” threw to 2nd baseman “John” in .4 seconds and then John threw to 1st baseman “Phil” in .5 seconds, you could then determine if other player pairings could complete the play in less than .9 seconds.

Here is an example of how FIELDf/x allows coaches to accurately evaluate a fielder’s defensive worth to a club. The Yankees lead the Rays 2-1 in the 5th inning with 2 outs on July 2, 2012. The Rays’ Will Rhymes hits an 87 mph pitch from Yankee Freddy Garcia to center field. The ball is 31’ high as Yankee center fielder Curtis Granderson runs 44’ at 12.9 mph towards it92. The YELLOW path he takes is sub-optimal. Between innings, coaches might have discussed this with Granderson to understand what went wrong.

FIELDf/x generates an order of magnitude more real-time data than PITCHf/x. Each camera captures action at 30 FPS and streams it over two link-aggregated gigabit Ethernet ports at up to 240 MB/s during the game. The cameras optically track a ball that could travel 600’, as well as 9 fielders, 4 umpires, a batter, 3 base runners, ball boys, and whoever else enters the field. The system can tell players apart even when they cross each other. All this comes out to 2.5 million records for a 3-hour game93. Combined with other feeds, Sportvision and MLB have amassed over 30 petabytes of data.94 It is stored and analyzed in near-real time and is available to a manager for his decisions or to evaluate player performance. Teams rely on analysts (data scientists) to sift through the daily data volumes for unknown insights that could lead to an extra run or better defense.
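A rough sizing exercise shows why this counts as a high-volume feed. The sketch below uses the figures quoted in the text (four cameras, 30 FPS, up to 240 MB/s per camera, 2.5 million records in a 3-hour game). The video total is an upper bound that assumes every camera streams at its peak rate for the whole game and uses decimal units (1 TB = 1,000,000 MB); actual sustained rates are not given in the text.

```python
CAMERAS = 4                    # FIELDf/x cameras per stadium
FPS = 30                       # frames captured per second
PEAK_MB_PER_SECOND = 240       # per-camera peak streaming rate
GAME_SECONDS = 3 * 60 * 60     # a 3-hour game
RECORDS_PER_GAME = 2_500_000   # records quoted for a 3-hour game

frames_per_camera = FPS * GAME_SECONDS
# Upper bound on raw video moved during one game, in terabytes (decimal).
peak_video_tb = CAMERAS * PEAK_MB_PER_SECOND * GAME_SECONDS / 1_000_000
records_per_second = RECORDS_PER_GAME / GAME_SECONDS

print(f"{frames_per_camera:,} frames per camera, "
      f"up to {peak_video_tb:.1f} TB of video, "
      f"about {records_per_second:.0f} records per second")
```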

Solid defense is key to a winning season. Correctly positioning your players for each opponent, park geometry, weather condition, pitcher/pitch, and other variables can save runs and help your team get to the playoffs. That requires big data. Sabermetrician Bill James equated runs scored (RS) and runs allowed (RA) to a team’s winning percentage95: $\mathit{WinPct} = RS^2 / (RS^2 + RA^2)$. For example, a team that scores 800 runs and gives up 700, $800^2 / (800^2 + 700^2) = .566$, will win 92 games that year. A 5% FIELDf/x defensive improvement could reduce the runs allowed to 665 runs, $800^2 / (800^2 + 665^2) = .591$, or 96 wins. Four extra wins could make or break a season.
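The win totals in this example can be checked with Bill James’s Pythagorean expectation, which estimates winning percentage as runs scored squared divided by the sum of runs scored squared and runs allowed squared. The sketch below applies that formula with the classic exponent of 2 over a 162-game season; the function name is illustrative.

```python
def pythagorean_win_pct(runs_scored, runs_allowed, exponent=2):
    """Bill James's Pythagorean expectation of winning percentage."""
    scored = runs_scored ** exponent
    allowed = runs_allowed ** exponent
    return scored / (scored + allowed)

SEASON_GAMES = 162

baseline_wins = SEASON_GAMES * pythagorean_win_pct(800, 700)
# A 5% defensive improvement trims runs allowed from 700 to 665.
improved_wins = SEASON_GAMES * pythagorean_win_pct(800, 665)

print(f"baseline {baseline_wins:.1f} wins, improved {improved_wins:.1f} wins")
```

Rounded to whole games, the two cases come out to 92 and 96 wins, which is the four-win difference described above.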

This FIELDf/x image shows the ball’s HIT PATH on the right96. A straight line connects the left fielder’s starting point to where the ball is going. The fielder runs 85’ at 20.1 mph to the hit in a sub-optimal 3.9 seconds after taking 1.2 seconds to react to it. The path was 15’ longer than a straight line. His fielding efficiency is .824 (70’ ÷ 85’). The catch probability is calculated in real time from the digital photography. FIELDf/x adds a descriptive language to simple box scores.
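The .824 efficiency figure is consistent with dividing the straight-line distance by the distance actually run. The sketch below assumes that definition, which is inferred from the numbers in the example rather than taken from a published FIELDf/x formula; the function name is illustrative.

```python
def route_efficiency(straight_line_ft, actual_path_ft):
    """Ratio of the ideal straight-line distance to the distance actually run."""
    return straight_line_ft / actual_path_ft

actual_path_ft = 85.0                       # distance the fielder ran
straight_line_ft = actual_path_ft - 15.0    # path was 15 feet longer than ideal

efficiency = route_efficiency(straight_line_ft, actual_path_ft)
print(f"route efficiency {efficiency:.3f}")
```

A perfectly direct route would score 1.0, so lower values flag wasted steps that a coach may want to review.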

These still frames (with graphics enhanced to make the action easy to follow) are from a FIELDf/x test of an Athletics-Mariners game on September 20, 200897. There is 1 out with Seattle’s Johjima at bat, Jeremy Reed on 3rd base, a runner on 2nd base, and Bryan LaHair on 1st base. Oakland’s defense has 3rd baseman Jack Hannahan, shortstop Bobby Crosby, 2nd baseman Cliff Pennington, center fielder Ryan Sweeney, and left fielder Aaron Cunningham.

The action begins with Oakland’s Kirk Saarloos’s pitch to catcher Rob Bowen. Johjima hits it deep to left field. We know from historical FIELDf/x data that Cunningham covers 130’ in 7 seconds, or the time a high fly ball stays in the air. His coach positioned him based on who is at bat and perhaps the type of pitch the pitcher was throwing. Defensive players go after the hit and position themselves for relay throws and a possible play at the plate. Runners prepare to tag up in the event the ball is caught. The ball hits the wall and the runners sprint home. The relay home is late and two runs score. Runners stay on 1st and 2nd.

These tools allow managers to create optimal defensive plans and decide which shortstop to use based on weather, altitude, day game after a night game, type of turf, etc. Coaches learn more about the pitchers their team will face, and the pitcher can study the batters they will oppose in more detail than a traditional scouting report would contain. A manager can build a list of pitch sequences for the pitcher to throw to increase the chances of getting a hitter out. Likewise, a manager can let the hitter know what type of pitch and its location to expect in a situation – runners on, wind, day/night, turf, etc. With this level of detail available in real-time, players in training sessions can get instant feedback on what they did well and what they need to work on98. While the real-time data can’t actively influence play on the field, it does help improve player preparation.

Big Data and the Business of Baseball

On the surface, baseball is a simple game and fun to play. For those running a baseball business, however, simplicity couldn’t be farther from the truth. The overarching organization is Major League Baseball, and it has 30 team franchises. MLB owns the brand, logos, and other aspects licensed for use as defined by a constitution. The league employs revenue sharing, with 31% of each team’s net local revenue distributed evenly to all teams99. Small-market teams get more money than they put in, and very rich teams like the Yankees get back less than they contribute.

At a high level, each team is governed by an MLB franchise contract and permitted to select players, set prices, and operate like a for-profit business. Teams put a “product” on the field to generate fan interest. With each team having its own business model that is tailored to its geographic franchise, some teams are quite profitable while others are lucky to break even. In 2013, the Yankees were the most valuable team at $2.3 billion with 2012 revenue of $471 million. Somewhat surprisingly, the Rays were the lowest-valued team at $451 million, yet they had 2012 revenue of $167 million100.

To bring in sufficient revenues to keep a franchise profitable, a club must begin with the “old school” basics of putting a winning team on the field that fans love to watch, since the fervor of great play in front of a sellout crowd also generates concession revenue. The club then needs to layer on big data improvements, such as assembling a better team for less money, establishing a dynamic and profitable customer relationship, and positioning the team as a marketing and media revenue generating engine.

Consider yourself a team’s general manager, with your goal to assemble a championship contender. A guideline you might have to follow is based on money buying success. These are possible strategies you might use and teams that fit that profile:
• Highest salary – expensive veterans, top-of-their-game all-stars (Yankees)
• High salary – past their prime stars (Cubs)
• Medium salary – primary focus on pitchers, secondary focus on hitters (Giants)
• Low salary – primary focus on hitting, secondary focus on pitching (Angels)
• Lowest salary – minor league players for MLB minimums (Athletics)

There are many more such variables, combinations, and strategies. You probably do not want to group hitters with pitchers, and you probably want to add factors like park dimensions, typical weather conditions, number of day versus night games, travel schedules, and more. You want left-handed hitters if your ballpark has a short right field dimension. Some combinations may not make any sense, while others will have a strong correlation. You will likely get more walks if you have a lot of shorter players. If you want the oldest players or the heaviest players, you are getting veterans who really know how to play the game, but may suffer more injuries or tire more easily. Pick the youngest players and they might be faster, but have less experience. Or you could use Billy Beane-style variables to assemble a team focusing on Defensive Runs Saved101. A strategy of winning at all costs may attract fans to the stadium, in which case you might want to outspend the other teams. Any or all or none of these can be part of big data mining in assembling the right team for your goals. To the right is an example of a strategy that could be tried102.

Leveraging big data can not only help a team make it to the championships, but also generate meaningful extra revenue at the same time. Tickets are sold at a premium, souvenir and concession prices increase, and on cold October nights, clubs sell a lot of hot chocolate. The 2009 Yankees earned an extra $72 million by playing in the postseason103. Broadcasters view the as a revenue bonanza, especially when the teams are in top markets and the best-of-seven series goes to 7 games. For example, the featured the against the Yankees. An average of 19 million viewers saw the first 6 games, jumping to 22 million viewers for Game 7104. Championship games also bring additional revenues to local businesses such as bars, restaurants, and MLB merchandise retailers105.

MLB players earned a contractual minimum of $490,000 in 2013 and the highest paid player made $29 million106. In 2011, Phillies pitcher signed a $120 million 5-year contract that earned him $25 million in 2013107. That equates to $8,067 for each of the 3,099 pitches he threw that year108. Teams collectively spend over $3 billion on player salaries109, and try to determine each player’s worth based on their contributions. Each team typically has 225 major and minor league players, plus 100-200 other full-time employees, 1000-2000 part-time ushers, game-time concession sales people, etc. Hundreds of thousands of direct and indirect workers make part of their living from baseball, including parking attendants, police/security, hotel staff, peanut farmers, gas station attendants, garment workers, newspaper writers, TV and radio commentators and engineers, etc. In truth, baseball is a simple game played by kids in 45 countries110 and viewed by fans of all ages worldwide, while backed by a trillion-dollar global industry that touches the lives of a significant portion of the planet in one way or another.
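The per-pitch figure quoted above is a simple division of the 2013 salary by the number of pitches thrown. A minimal check, using only the two numbers given in the text (the variable names are chosen for the example):

```python
salary_2013 = 25_000_000   # 2013 salary quoted in the text, in dollars
pitches_2013 = 3_099       # pitches thrown that year, as quoted

cost_per_pitch = salary_2013 / pitches_2013
print(f"${cost_per_pitch:,.0f} per pitch")
```

The same calculation could be repeated for any player once salary and pitch counts are known, which is one simple way a club might begin to compare pay against workload.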

Businesses hate uncertainty, and baseball is anything but certain. We want to know the future, but we don’t. In the business of baseball, you have a lot of data at your disposal, and need to make a lot of decisions, with the goal of making the business viable despite this uncertainty. That is where big data helps the most – discovery:
• What roster of players should be assembled?
• Players get injured - who should replace them and how will it affect the team?
• How long should their contracts be and how much are they worth?
• How to predict the future performance of a player?
• What arm angles and pitcher velocity changes can predict medical issues, and how to use medical care and conditioning to get them and keep them healthy?

With big data systems like PITCHf/x and FIELDf/x, each prospect might have a hitting, base running, fielding, and throwing score during a game. Over the course of a season, players would have an empirical rating based on their play – a comprehensive report card. For example, if college player “Fred” has a 92 rating and “Bill” a 96, teams might offer Bill a higher signing bonus based on their ratings. The approach would also work to evaluate and compensate older players who substitute experience for raw athleticism as their career wanes. Some executives might even employ personality and/or medical testing to find the best athletes that can earn the team tens of millions more dollars.

Player Development

Fans yearn to be part of an exciting team – great pitching, power, speed, and defense. They come to the park to see an exciting game and cheer their team to victory. All clubs would like to field a team of feared all-stars, but their collective cost is too high. The 2013 Yankee player payroll was ten times that of the Astros. It is true that generally the team with the best, most expensive players tends to win more games and make it to the playoffs more often than teams with the least expensive players, but it is far from a certainty. Injuries to star players can quickly turn expensive teams into mediocre ones. This wins per dollar graph from 2001 to 2010111 shows teams above the line won more often than those below the line without an ironclad correlation to payroll. The Athletics won an average of 87 games at $804,597 per win while the Yankees spent $2,244,897 for each of their 98 wins. There is obviously more to a successful team than players’ paychecks.
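The cost-per-win figures above can be turned back into the average payroll they imply, which makes the gap between the two clubs easier to see. This short sketch uses only the wins and dollars-per-win numbers quoted in the text; the dictionary layout and the "implied_payroll" key are assumptions made for the example.

```python
teams = {
    "Athletics": {"wins": 87, "cost_per_win": 804_597},
    "Yankees": {"wins": 98, "cost_per_win": 2_244_897},
}

for name, team in teams.items():
    # Average annual payroll implied by wins multiplied by dollars per win
    team["implied_payroll"] = team["wins"] * team["cost_per_win"]
    print(f"{name}: about ${team['implied_payroll'] / 1e6:.0f} million a year")

# How many times more the Yankees paid for each win
ratio = teams["Yankees"]["cost_per_win"] / teams["Athletics"]["cost_per_win"]
extra_wins = teams["Yankees"]["wins"] - teams["Athletics"]["wins"]
print(f"{ratio:.2f}x the cost per win for {extra_wins} more wins a season")
```

Read this way, the numbers support the point in the text: a much larger payroll bought only a modest number of additional wins.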

In 2013, teams with above average minor league affiliates were the Cardinals, Mariners, and Rays112. Prospects come from high schools, colleges, and key locations around the world. In 2012, 27% or 224 MLB players came from the Dominican Republic (87), Venezuela (58), Canada (15), Japan (13), Cuba (11), Mexico (9), Panama (7), and others113. In the pre-Moneyball era, you scouted these locales to find talented players for bargain-basement contracts. Big data has made the task a bit easier. With empirical data and the power of the Internet, scouts can “watch” every game from their office and analyze potential players before they hop on a plane or drive to visit them in action. They can get a feel for a pitcher’s habits and understand what they throw in game situations. Are they effective when behind 3 balls no strikes? Do they a runner at 1st base? Does their changeup have a big break? Do they keep their first pitch down?

In the pre-Moneyball era, players were often ranked by “tools” with “5-tools” being the best. Teams were prone to draft players who had that “look” – handsome, athletic body, and charisma. Today, they want players who get hits from optimal mechanics, have the fastest reaction times, take the shortest path on the bases, and have the most accurate and strongest throwing arms. The big data objective is to employ metrics like these114:
• 60th percentile of ability to go with the pitch
• 92nd percentile of forward exit velocity
• 62nd percentile of path to ball
• 72nd percentile of throwing arm accuracy
• 40th percentile of speed

As a result, Sportvision tools are appearing at little leagues, colleges, and baseball schools, and are used alongside radar guns and stop-watches. SmartKage uses HITf/x and PITCHf/x in a video-sensor to provide over 40 performance metrics to student athletes115. Their hitting, pitching, fielding, running, catching, and agility are quantified to improve the player. Scouts also receive this data116. SmartKage is installed in over 300 locations, 3,500 colleges117, and many minor league parks118.

With all the players in MLB, it is hard to find plentiful data on specific hitter-pitcher matchups. This small sample size problem is evident when a manager picks a against a particular pitcher and decides to use someone who is 2 for 12 against a pitcher, 1 for 3, or even 3 for 20 over a 3-4 year period because they lack a sufficient statistical history to work with. What if pitchers were grouped with others who pitched the same way and clustered based on fastball, changeup, and curve release points, velocity, and horizontal and vertical break characteristics? A grouping might show how a hitter could do against this model of a pitcher’s ability. Vince Gennaro used a YarcData Urika big data appliance to comb through a vast amount of PITCHf/x data to create pitcher clusters against left-handed batters119. In red are pitchers such as Gio Gonzalez and . In green, and . Purple has CC Sabathia and Boone Logan. and Johan Santana are blue groups, with David Price and in yellow. Gennaro’s analysis evaluated 12 different attributes as shown in this table120. This data could help a manager use a stronger statistical correlation to select a lineup. Instead of picking a batter against CC Sabathia, the manager would use the group stats of Sabathia, Craig Breslow, Rex Brothers, , and Boone Logan.

Clustering Pitchers

Hitters' Questions             Model Data
What does he throw?            Top 2 Pitches
                               Pitch Repertoire/Variety
                               Horizontal Pitch Location
                               Vertical Pitch Location
How hard does he throw?        FB Velocity
What kind of movement?         Horizontal Movement
                               Vertical Movement
Where does he come from?       Release Point
How does he like to pitch?     Swinging Strike %
                               Zone %
                               Edge %
                               Top 2-pitch Sequence
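The idea of grouping similar pitchers and pooling a hitter's results against the whole group can be sketched with a plain k-means clustering. This is not Gennaro's method or the Urika appliance's approach; k-means is swapped in here only because it is a compact, well-known way to illustrate clustering. The pitcher labels, the three features, and the hitter history are all hypothetical, and the hitter lines simply reuse the 2 for 12, 1 for 3, and 3 for 20 examples from the text.

```python
import math
from statistics import mean, pstdev

# Hypothetical pitchers: (fastball velocity mph, horizontal movement in, release height ft)
PITCHERS = {
    "P1": (95.0, -8.0, 6.2), "P2": (88.0, 4.0, 5.4), "P3": (94.5, -7.5, 6.1),
    "P4": (87.5, 5.0, 5.5), "P5": (95.5, -8.5, 6.3), "P6": (88.5, 4.5, 5.3),
}
# Hypothetical history for one hitter: pitcher -> (hits, at-bats)
HISTORY = {"P1": (2, 12), "P3": (1, 3), "P5": (3, 20), "P2": (4, 10)}

def standardize(rows):
    """Z-score each column so velocity does not swamp the smaller-scaled features."""
    stats = [(mean(col), pstdev(col)) for col in zip(*rows)]
    return [tuple((v - m) / s for v, (m, s) in zip(row, stats)) for row in rows]

def kmeans(points, k, iterations=20):
    """Plain k-means; returns one cluster index per point."""
    centroids = [points[i] for i in range(k)]  # deterministic start: first k points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign every pitcher to the nearest centroid
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(mean(dim) for dim in zip(*members))
    return labels

names = list(PITCHERS)
labels = dict(zip(names, kmeans(standardize([PITCHERS[n] for n in names]), k=2)))

def pooled_line(target):
    """Sum the hitter's results against every pitcher in the target's cluster."""
    group = [n for n in names if labels[n] == labels[target]]
    hits = sum(HISTORY.get(n, (0, 0))[0] for n in group)
    at_bats = sum(HISTORY.get(n, (0, 0))[1] for n in group)
    return hits, at_bats

hits, at_bats = pooled_line("P1")
print(f"Against P1's cluster: {hits} for {at_bats} ({hits / at_bats:.3f})")
```

Pooling turns three thin samples into one larger one, which is the point of the clustering approach: the manager judges the hitter against a type of pitcher rather than against a handful of at-bats versus one man.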

To Gennaro, “…big data is about discovery. It's about asking the right questions to spot a customer trend, to identify a new market, or, in personnel matters, to find a gem of an employee, someone who you didn't know existed until their name started showing up regularly on Twitter or LinkedIn. Big data helps you to know what don't know.”121

Revenue From Fans

Team owners will tell you the key to a profitable baseball franchise starts with an exciting team, for excitement equals fans. Ticket revenue is just the tip of the iceberg. Selling more tickets means higher concession revenue, increased merchandise sales, more advertisers, etc. Management has many approaches to sell tickets, from giveaway player bobble-head dolls and firework evenings to co-sponsored free caps, balls, or bats for kids, and other items. There are also special stadium events such as “Old-Timers” reunions, retiring a player’s number, and tributes. It’s not just in the ballpark; TV viewers on their couches, passengers listening to the car radio, and Internet baseball fans all experience the fervor through licensed broadcasts. A happy fan base translates into higher revenue, and higher TV/radio ratings lead to higher advertising revenue.

While a widely accepted goal is to sell all of your seats to season subscribers before the season begins, not all teams share that view. The Nationals “…limit the number of season tickets available for 2013 to 20,000 of our 41,000 available seats.”122 Single-game tickets bring the club more revenue than the same games purchased on a discounted season plan. Clubs have also experimented with various ticket sale strategies in order to increase direct ticket revenue. Variable and dynamic pricing are two approaches that allow for “market” ticket pricing against a hot club or discounted tickets against a last-place opponent. When their team wins, ticket prices can increase, or vice-versa if they lose.

With variable pricing, teams set seat prices based on the day, game time, and pre-season popularity of the opponent. For example, a $19 Monday seat may cost $21 for a Wednesday game. The Red Sox introduced premium prices for and Yankees games, and reduced prices for weeknight games in colder months of April, May and September123. This increases attendance for games with less desirable opponents, and raises prices for high demand tickets against top teams. Dynamic pricing uses big data to gauge real-time interest to set prices based on supply and demand. Similar to the airlines, prices are higher when your team is on a hot streak, a player is about to break an important record, your team is vying for the pennant, or it is a great summer day. Twists on dynamic pricing include allowing the fan to upgrade their seat when the stadium is not full at a lower price than if they purchased the better seat weeks earlier.
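The difference between the two pricing approaches can be shown with a small sketch. It is an illustration only and not any club's actual pricing model: the Saturday surcharge, the opponent factors, the floor and ceiling, and both function names are assumptions. Only the $19 Monday and $21 Wednesday prices come from the text.

```python
# All surcharges and multipliers below are hypothetical, for illustration only
BASE_PRICE = 19.00
DAY_SURCHARGE = {"Mon": 0.00, "Wed": 2.00, "Sat": 6.00}
OPPONENT_FACTOR = {"premium": 1.5, "standard": 1.0, "value": 0.8}

def variable_price(day, opponent_tier):
    """Price fixed before the season from the day of week and the opponent's appeal."""
    return round((BASE_PRICE + DAY_SURCHARGE[day]) * OPPONENT_FACTOR[opponent_tier], 2)

def dynamic_price(list_price, seats_sold, capacity, floor=0.5, ceiling=2.0):
    """Rescale a list price as demand appears: an empty park gets floor, a sellout ceiling."""
    fill = seats_sold / capacity
    return round(list_price * (floor + (ceiling - floor) * fill), 2)

print(variable_price("Mon", "standard"), variable_price("Wed", "standard"))
print(dynamic_price(variable_price("Sat", "premium"), seats_sold=35_000, capacity=41_000))
```

Variable pricing is a lookup set once from the schedule, while dynamic pricing takes that list price as an input and keeps adjusting it as sales data arrives. A real system would likely add further signals such as weather, winning streaks, and resale market prices.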

Merchandise creates substantial club revenue because clubs incur no cost while profiting from each item sold. Hats, jerseys, and foam fingers use licensed team logos and names of players past and present that fans relate to. In 2012, almost 25 million fans bought items124. Fans purchased $3.2 billion worth of goods in 2011125 and it is estimated that MLB keeps 4% of each sale before distributing the revenue to teams126.

The Nationals are a technologically advanced club that uses big data for fan marketing and ticket promotion. Their “Ultimate Ballpark Access”127 card is a fan’s ticket, cash, and rewards card that gets them personalized discounts in the park. The card is key to their Customer Relationship Management (CRM) system that integrates marketing, sales, and customer service to get a better understanding of the fan. In exchange, fans get a better game experience with points exchangeable for merchandise or food discounts, seat upgrades, promotions during rain delays for free beer with a hot dog purchase, and other advantages128. With cash tied to the card, fans are encouraged and incentivized to use it at the game.

Real-time data, including a fan’s location in the park, allows the Nationals to redeploy concession staff when queuing causes long waits, offer promotions to fans that arrive early, and reward frequent fans. Knowing more about each fan even allows vendors to greet them by their loyalty card name. “Hi Debi, thanks for coming today! As a loyal fan, I see you are entitled to a free hot dog. Want it with mustard?” They even increase sales during inclement weather through parking and subway discounts if fans buy a ticket.

All Nationals business groups share a common CRM view - ticketing, merchandise manager, concession coordinator, etc. They strive to improve the customer experience and allow fans to dictate their level of participation in the system. The Nationals track how often a season ticket holder attends a game and make it easy for that fan to resell their tickets on StubHub since the goal is a full stadium129. If they see a pattern where a fan buys tickets when certain teams visit, such as when the Mets come to play, they can offer them a Nationals-Mets ticket package. Their loyalty card is also tied in with other merchants such as grocery stores for extra ballpark discounts. They increase the value of the Nationals brand by knowing more about their fans and marketing directly to them.

Business managers also want to know their fans better. Data was always available on season or luxury box ticket holders, but insight into fans that came once a year was hard to come by. Big data to the rescue, especially with a CRM card! For example, they can deduce that a fan who buys beer and cotton candy in the 3rd inning likely has a child at the game, and can text them a 7th inning 50% discount offer on children’s caps.
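The beer-and-cotton-candy inference above is a simple rule over purchase records tied to a loyalty card. A minimal sketch follows, assuming a hypothetical Purchase record and function name; it is not the Nationals' CRM logic, only an illustration of how such a rule could be expressed.

```python
from dataclasses import dataclass

@dataclass
class Purchase:
    fan_id: str   # loyalty card holder (hypothetical identifier)
    item: str
    inning: int

def child_cap_offers(purchases, inning=3):
    """Return fans who bought both beer and cotton candy in the given inning."""
    baskets = {}
    for p in purchases:
        if p.inning == inning:
            baskets.setdefault(p.fan_id, set()).add(p.item)
    # A fan qualifies only when both items appear in the same inning's basket
    return sorted(fan for fan, items in baskets.items()
                  if {"beer", "cotton candy"} <= items)

# Hypothetical concession records
purchases = [
    Purchase("debi", "beer", 3),
    Purchase("debi", "cotton candy", 3),
    Purchase("sam", "beer", 3),
    Purchase("lee", "cotton candy", 6),
    Purchase("lee", "beer", 6),
]
print(child_cap_offers(purchases))
```

The returned list would then feed whatever channel sends the 7th inning discount. In practice a club would probably learn such rules from data rather than hand-code them, but the lookup itself stays this simple.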

Offering fans rich interactive media is part of the big data ticket buying experience. For example, the Yankees ticket website uses an interactive park tour to find a view you like. MLB runs ticketing for most of the teams, and tracks the buying habits of its fans to improve the fan experience and help gain insight into its customers. Getting to the park is made easier using a smart phone or tablet for directions and parking details. MLB’s app also shows team statistics and video highlights from games attended, player songs, special offers, and ticket upgrades130. Fans can receive smart phone alerts about the best parking, gates to use, and bathrooms with short lines, see where they are in the park, bring up a food and drink menu, and have their order delivered to their seats!

More seated fans mean more advertising revenue. Time slots before the game begins, between innings, during pitching changes, and on the rain tarp during inclement weather are ideal for advertisers looking to capture the attention of 30,000-50,000 fans. () is expanding their advertising square footage, including a left field Jumbotron expected to “…generate $300 million over five years to use for Wrigley renovations.”131 Advertising revenue is closely tied to fan attendance, which reflects whether the team is in contention and exciting to watch. For example, the 2010 Mets at drew 2,559,738 fans with a 79-83 record132. A year later, their record fell to 77-85 and they drew 8% fewer fans. Advertising revenue dropped $1.9 million to $44.2 million133.

MLBAM was created to develop a common look-and-feel web site for each team134. They now generate revenue, and in 2012, brought in $650 million through 3 million MLB.tv and MLB.com subscribers. AtBat is their top smart app with almost 7 million downloads. It provides play-by-play action plus the PITCHf/x graphical flight of the ball as it crosses home135. Through electronic devices, fans increase their enjoyment through real-time statistics and individual replays within 15 seconds. “...MLB.tv subscription allows you to stream the game through an Xbox or any Wi-Fi connection. Meanwhile, BAM's AtBat app allows for streaming on your iPhone or iPad. It also allows fans to watch the action in "Game Day" mode, a data-enriched, live-graphic presentation of the game.”136 It takes a lot of real-time big data expertise to unify diverse data feeds behind the scenes, mask the complexity and make it appear simple and seamless to the fan.

MLBAM is also working on bringing PITCHf/x, HITf/x, and FIELDf/x to the home video game console. Imagine being at home, pinch hitting for a player with real runners on base, trying to swing a virtual bat at Mariano Rivera’s cut fastball coming at you on your big screen TV at 90 mph. The gaming potential is tremendous and promises to offer greater fan immersion compared to standard XBOX or PlayStation baseball games.

It should not be a surprise that running a successful ball team is a big data problem. The data shows all sorts of patterns137. Who buys tickets? Who buys merchandise? Who likes what? What’s the best time to make them an offer? What form should that offer take? On the game side, what combination of players produces the best results? Where should you position your defense? What pitches in what sequence should you look for? It’s come down to Moneyball on steroids. Moneyball was “…based on 80,000 box scores and play by play data (1963-2002), or about 1.5GB of just outcome data”138, versus today where multi-terabytes of data are produced in one game – a big data explosion! There is a lot of data, and the answers in that data lead to some amazing conclusions.

Revenue From Media

Clubs also make money from TV and radio deals. Teams contractually share revenue from national contracts with CBS, ESPN, TBS, FOX, etc., amounting to $12.4 billion over 8 years, or nearly $52 million a season per club139. Teams also have local contracts. For example, the Yankees own the YES Network which paid them $85 million in 2013 for broadcasting rights, eventually reaching $350 million a year by 2042140. In 2012, the Yankees sold 49% of YES to FOX Sports for $3.4 billion141. In 2013, the Dodgers and Time Warner Cable signed a $7-8 billion, 20-25 year deal worth $280 million a year142.
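The per-club figure above follows from dividing the national contract total by its length and by the 30 clubs. The short check below uses only numbers quoted in the text, with variable names chosen for the example, and also spells out the range of annual values the quoted Dodgers deal terms allow.

```python
national_total = 12.4e9   # national TV contracts, as quoted
years = 8
clubs = 30

per_club_per_season = national_total / years / clubs
print(f"National money per club per season: ${per_club_per_season / 1e6:.1f} million")

# Dodgers local deal: annual value for each combination of the quoted ranges
for total in (7e9, 8e9):
    for term in (20, 25):
        print(f"${total / 1e9:.0f} billion over {term} years = "
              f"${total / term / 1e6:.0f} million a year")
```

The quoted $280 million a year matches the low end of the total spread over the longest term; the other combinations would imply a higher annual figure.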

Why are sports so popular with networks and advertisers? One big reason is the DVR. Viewers record their favorite shows and fast-forward through commercials to achieve a near non-stop experience. That behavior hurts product advertising and the revenue it generates. People may record their favorite TV show to skip the commercials, but watching DVR'ed sports results in a mediocre experience. There are many ways to track a score, so once you find out the result, you tend to either not watch your DVR’ed game, or watch it with less enjoyment. That is why advertisers pay so much to air commercials between innings – people want to watch sports games live and not record them.

A team can make out well when it comes to negotiating advertising contracts with broadcasters. For a broadcaster, a single game represents four 30-second TV or radio commercials per half inning, or as many as 72 commercials over the 18 half innings of a 9-inning game. In an extreme example, Fox TV sold each 2010 baseball playoff 30-second slot for $225,000, and $450,000 for World Series slots143, with the 2012 All-Star game going for $550,000 per commercial144. There may also be advertisements during the game, such as the “Pepsi home run blast” or the “Kia player of the game”. There are also commercials before and after the game, but it is harder to guarantee advertisers the same viewer demographics, ratings points, and market share of the 18-49 year olds they are trying to reach.

Print advertisers buy space on pocket schedules, media guides, game day handouts, yearbooks, and the back of paper tickets. Ads are also placed in concourses around the park, electronic scoreboards, outfield fences, behind home plate and the bases, and scrolling messages around the field. Premium giveaways with an advertiser’s message, such as a souvenir cup or trinket, are handed to fans as they enter the park. MLB attracts an interesting demographic for advertisers. For example, 25% of the fans earn $100,000 or more compared to 17% of the Milwaukee Designated Market Area145. The team’s demographic audience, market, and team interest dictate the prices charged for signage. For example, a big market team like the Red Sox might charge $300,000 for the same home plate signage that costs $50,000-$60,000 for a half-inning at Kauffman Stadium (Royals)146. When the Red Sox play the Royals, the Boston TV audience sees advertisements that cost 1/6 the price of Fenway ads.

Big Data Helps Create Algorithmic Baseball Journalism

Automated Insights (AI) created realtime.automatedinsights.com147 which robo-writes content about any MLB game in real-time. It includes big data mined observations from previous games, be they from the day before or 100 years ago. With instant updates to games in progress and recaps derived from the final , facts and insights are added to yield a colorful account of the play-by-play.

Originally named StatSheet.com, AI leverages structured statistics to produce big data automated written content. They use linguistic algorithms to automatically publish 500-1,000 word recaps such as this from a Red Sox – Tigers 2013 game148. They generate recaps for 417 other websites covering the NFL, MLB, NCAA Division I college basketball, and others, consisting of 400,000+ tweets on score updates, play-by-play accounts, and more149. Their “…network is already comprised of more than one million unique pages of content and 15,000-20,000 original articles are added every month.”150

Software-generated content can turn quantitative information into English, but qualitative writing still requires a human touch. Some benefits of “robo-writing” include:
• never gets writer’s block
• works morning, noon, and night, and produces content quickly
• can alter its writing style through parameter changes
• can analyze data without getting tired
• low cost per article

Programs can take something simple like “Adam Dunn of the White Sox homered to left center” and instantly turn it into: “That was an MLB leading 35th home run for the year. Adam Dunn moves into sole possession of 50th place all time for home runs in Major League Baseball. When there are two or more home runs in a game, the White Sox are 34-10.”151

Rather than invest in a large data center with physical servers, storage systems, backup and disaster recovery capabilities, large power and cooling bills, and more to pull off this big data feat, they leverage the elastic cloud computing capabilities of Amazon’s AWS. On peak Tuesdays, they spin up 300 AWS servers on demand for one hour to input millions of Yahoo data points and author millions of customized fantasy sports reports, each tailored to an individual fantasy team owner!152 A report could delve into a player’s actions versus historical events, a manager’s decisions, playoff implications, and more. They create hundreds of reports a second and AWS bills them for the actual usage.

This US patent shows the AI basic flow153. Statistics flow from an information provider (305) to any device that can relay baseball stats according to an agreed API, XML, CSV, or spreadsheet format. A content generation engine (320) issues a SQL query against the content generation database (330) for statistics that includes event IDs such as the one used in the PITCHf/x metadata, or other classification trends, streaks, new records, etc.

The actual content is based on templates designed by AI journalists and programmers which include and . The templates can include “…event preview, in-progress event update, event recap/summary, statistical analysis, progress report, etc.”154 Once a template is selected, text is varied to avoid duplication of content by using and . The variation allows the tone of the content to be hopeful, positive, optimistic, angry, indifferent, pessimistic, judgmental, etc. Their algorithms also derive uniqueness from the significance of the event.
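The flow described above (query statistics by event ID, choose a template, vary the wording) can be sketched in miniature. This is not the patented implementation or Automated Insights' code: the table schema, the template wording, and the function names are all hypothetical, and an in-memory SQLite database stands in for the content generation database. The Adam Dunn home run is borrowed from the example earlier in this section.

```python
import random
import sqlite3

# In-memory stand-in for the content generation database (schema is hypothetical)
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (event_id INTEGER, player TEXT, team TEXT, "
           "kind TEXT, season_total INTEGER)")
db.execute("INSERT INTO events VALUES (1, 'Adam Dunn', 'White Sox', 'home_run', 35)")

# Several phrasings per event type so repeated events do not read identically
TEMPLATES = {
    "home_run": [
        "{player} of the {team} homered, his {nth} of the year.",
        "{player} went deep for the {team}, home run number {total} this season.",
    ],
}

def ordinal(n):
    """Format 35 as '35th', 22 as '22nd', 11 as '11th'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"

def write_sentence(event_id, rng=random):
    """Query one event and render it through a randomly chosen template."""
    row = db.execute(
        "SELECT player, team, kind, season_total FROM events WHERE event_id = ?",
        (event_id,),
    ).fetchone()
    player, team, kind, total = row
    template = rng.choice(TEMPLATES[kind])
    return template.format(player=player, team=team, nth=ordinal(total), total=total)

print(write_sentence(1))
```

A production system would presumably add many more templates, tone parameters, and historical lookups to supply the "why it matters" sentences, but the core loop of data in, template chosen, text out is the same.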

This practice goes by many names, including computational journalism, machine written news, automated content creation, algorithmic journalism, and robotic reporting. Companies like AI mine big data to create humanized content around statistics by adding the who, what, where, when, and why with nouns, verbs, and other parts of speech. To guard against repetitiveness, they use artificial intelligence to scan big data datasets for patterns, insights, and trends using natural language constructs. While not yet at the level of a professional writer, they do a fine job with plain English. Their approach can also generate content for real estate listings, financial reports, and traffic alerts.

AI employs journalists who specialize in meta-data structure and content, and work with programmers on data filtering parameters to fit that syntax. They coach algorithms to identify various data “angles”. Content may originate from a hit or an out, winner or loser, record setters, and players who had a great day. The methods involve context and data mining searches from proprietary databases, external data feeds, and external big data sources. Automated mining can free up a real reporter from the drudgery of digging out data clues. The big data aspect adds historical perspective for “why” an event is special. Working with data-rich subjects, turning statistics like RBIs and home runs into sentences and paragraphs is a natural fit for robo-writing. AI’s programmers help fill the gap created by enormous rich data streams and the demand for formulaic journalism155.

Here are two different articles written about the same game. Which was written by the NY Times and which by an algorithm?

Article on the left:

BOSTON — Things looked bleak for the Angels when they trailed by two runs in the ninth inning, but Los Angeles recovered thanks to a key single from to pull out a 7-6 victory over the at Fenway Park on Sunday.

Guerrero drove in two Angels runners. He went 2-4 at the plate.

“When it comes down to honoring Nick Adenhart, and what happened in April in Anaheim, yes, it probably was the biggest hit (of my career),” Guerrero said. “Because I’m dedicating that to a former teammate, a guy that passed away.”

After Chone Figgins walked, doubled and Torii Hunter was intentionally walked, the Angels were leading by one when Guerrero came to the plate against Jonathan Papelbon with two outs and the bases loaded in the ninth inning. He singled scoring Abreu from second and Figgins from third, which gave Angels the lead for good.

Article on the right:

BOSTON — In the hopes of channeling one of their most famous postseason moments, Boston reached back 23 years on Sunday and had Dave Henderson throw out the ceremonial first pitch.

It was Henderson’s home run that rallied the Red Sox when they were down to their last strike against the Angels in the 1986 American League Championship Series. Now with the Red Sox facing elimination by the Angels again, here was Henderson, proof that no game, no series, is over until the final out is secure.

But this time, all these years later, it was the Angels who made that point. Down to their final strike three times in the ninth inning — just as Henderson was in 1986 — Los Angeles scored three runs off Boston’s dominant closer , to win the game, 7-6, and eliminate the Red Sox from the postseason.

Henderson’s magic, apparently, is not reusable.

Still not sure? An algorithm created the one on the left from a box score with on-line quotes thrown in for color156. While the artificial article is not as good as the journalist’s, the quality isn’t noticeably sub-par. It shows state of the art linguistic algorithms combined with big data. Algorithms may help with data mining research, create content that journalists virtually never produce, such as Little League game commentary, and handle simple medical checkup explanations involving cholesterol readings or traffic reports. They can also report on the finer points of pitch movement, trajectories off the bat, and a shortstop’s movement to his left, while updating it every second. Algorithms could also alert a coach in natural language to a tiring pitcher’s lower release point or velocity.

Listen To Your Data - Grady the Goat - The Curse of the Bambino

As good as baseball statistics and analytics have become, they don’t guarantee a win, nor is everyone a true believer. An example of where data was available to help make a better decision - a premise of big data - was during the 2003 ALCS pitting the Red Sox against the Yankees. The Red Sox were just 5 defensive outs away from winning their 5th pennant since 1918 with a chance to win the World Series, something they had not done in 85 years. That they failed to win the game can be traced to the decision of Red Sox manager Grady Little to “go with his gut”, and not with the data available to him, at a critical moment late in the game.

The celebrated Red Sox pitcher Pedro Martinez was on the mound, and had been brilliant, giving up only 3 hits through 7 innings. He had now thrown over 100 pitches, however, and his team’s data analysis showed that his effectiveness dropped noticeably after 6 innings or the 105-pitch mark. Historically, batters had a .555 OPS through 6 innings facing Martinez, jumping to .758 after that. As Martinez passed the 105-pitch mark, the Yankees began to tee off on the previously dominant Red Sox “ace”, chipping away at the Sox’s 5-2 lead. Astonishingly, Little refused to pull Martinez from the game until he had thrown 123 pitches; his last pitch, thrown to the Yankees’ catcher, saw the Yankees tie the game, which they went on to win. The so-called “Curse of the Bambino”, which blamed the Red Sox’s World Series drought on their trade of the legendary Babe Ruth to the Yankees in 1919, was seemingly alive and well. Little was fired 11 days later; his refusal to rely on clear-cut evidence that a new pitcher was needed had broken the hearts of Red Sox Nation for yet another year.
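The numbers in this story can be turned into a small worked example. The sketch below computes the relative jump in OPS and flags when a pitcher has passed either threshold mentioned above. The function names are hypothetical, and the rule is only an illustration of a data-driven check, not the Red Sox’s actual model.

```python
OPS_THROUGH_6 = 0.555   # batters' OPS vs. Martinez through 6 innings (from the text)
OPS_AFTER_6 = 0.758     # batters' OPS after that (from the text)
PITCH_LIMIT = 105       # pitch mark cited in the text
INNING_LIMIT = 6        # innings mark cited in the text


def ops_increase_pct(before, after):
    """Relative increase in OPS, in percent."""
    return (after - before) / before * 100.0


def past_fatigue_threshold(pitch_count, innings_completed):
    """True once either threshold from the text has been reached.

    This flags elevated risk; it is not a full pull-the-pitcher model.
    """
    return pitch_count >= PITCH_LIMIT or innings_completed >= INNING_LIMIT


print(f"OPS rises by {ops_increase_pct(OPS_THROUGH_6, OPS_AFTER_6):.1f}%")
print(past_fatigue_threshold(pitch_count=123, innings_completed=7))
```

On these figures, batters’ OPS was roughly 37% higher once Martinez passed the threshold, which is the kind of evidence the data offered at the moment of decision.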

Conclusion: The Future of Big Data Baseball

Baseball is a game where a “once in a lifetime” batter has a .400 average and a pitcher who wins 15 games a year is ranked in the top 10%. The Moneyball era marked a radical overhaul of baseball analytics and the refinement of the process of baseball. Teams today invest millions in advanced analytics to determine the players they need, how to use them, how to help them, how to maximize revenue, and how to transform their team’s relationship with their fans. A symbiotic relationship between the athlete and the  has emerged.

Fans continue to be exposed to this data explosion, with degrees of difficulty on fielding plays replacing adjectives like “great catch”, and base running metrics making for hotly debated player comparisons. At home, fans see a liberal dose of data capabilities because of advertising demands and new metrics that grab everyone’s attention. Management and the media continue to expand revenues and increase fan interest. Coaches are able to keep players healthier during the grueling season, and turn a player’s average season into a great one with a few extra hits a month. As Vince Gennaro says, “There are going to be pregame meetings on where the shortstop is going to play exactly for each of the hitters”157. Players will make better catches, runners will get more stolen bases, and the pitcher/hitter relationship will become more intense.

Big data baseball continues to grow as PITCHf/x and other systems reach minor league affiliates. Over time, 7,500 major and minor league players could be tracked each year. Add 7,700 U.S. colleges158 and 37,000 American high schools, and big data could someday track over a million players. Following players around the world could bring millions of players within reach of affordable tracking. Big data has come to baseball!

More players equal more games and more data, as demonstrated by Gennaro’s use of Urika. Management will gain more insight into which high school players to draft, how to find undrafted “gems”, which players to put together on teams, how to help player development, how to set negotiating strategies based on empirical player values, and more. The data collected is complex and difficult to derive information from using classical database tools, and some teams have begun to experiment with Hadoop.

Hadoop is a big data processing method in which structured and unstructured data is spread among dozens to hundreds of computers working in parallel to quickly answer complex questions, at speeds similar to a Google search request. The Rays are using Hadoop to help derive extra revenue and create a better fan experience159. Coaches use Cassandra, Hive, or HBase analytics160,161 to discover better defenses, optimize running, or give the hitter or pitcher an advantage. The following is the basic Cassandra logic.

[Figure: Cassandra write path and read path. Write path labels: Client, Write, Request, Acknowledge, Server, Commit Log (Append Only Log), Mem Table - Storage row, Flush, File System, Data - SS Table. Read path labels: Client, Read, Server, Mem Table - Storage row, Key Cache, Miss, Bloom Filter, Random Seek, File System, Data - SS Table.]
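The write and read paths named in the figure can be illustrated with a toy, in-memory model. This is only a sketch of the general log-structured idea (commit log, memtable, flush to an SSTable, Bloom filter check on read). The class names, the two-entry flush limit, and the filter sizes are assumptions, the key cache is omitted, and none of this is Cassandra’s actual code or API.

```python
import hashlib


class TinyBloom:
    """A very small Bloom filter: may give false positives, never false negatives."""

    def __init__(self, size=64, hashes=3):
        self.size = size
        self.hashes = hashes
        self.bits = 0

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


class ToyStore:
    """Toy model of the write path and read path shown in the figure."""

    def __init__(self, memtable_limit=2):
        self.commit_log = []   # append-only log of writes not yet flushed
        self.memtable = {}     # in-memory rows
        self.sstables = []     # flushed (bloom filter, rows) pairs, oldest first
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))   # 1. append to the commit log
        self.memtable[key] = value             # 2. update the memtable
        if len(self.memtable) >= self.memtable_limit:
            self.flush()                       # 3. flush when the memtable is full
        return "ack"                           # 4. acknowledge the client

    def flush(self):
        rows = dict(sorted(self.memtable.items()))
        bloom = TinyBloom()
        for key in rows:
            bloom.add(key)
        self.sstables.append((bloom, rows))
        self.memtable = {}
        self.commit_log.clear()   # simplification: flushed writes leave the log

    def read(self, key):
        if key in self.memtable:               # newest data first
            return self.memtable[key]
        for bloom, rows in reversed(self.sstables):
            # The Bloom filter lets us skip tables that cannot hold the key.
            if bloom.might_contain(key) and key in rows:
                return rows[key]
        return None


store = ToyStore(memtable_limit=2)
for pitch_id, pitch_type in [("pitch:1", "FF"), ("pitch:2", "CU"), ("pitch:3", "SL")]:
    store.write(pitch_id, pitch_type)
print(store.read("pitch:1"), store.read("pitch:3"), store.read("pitch:99"))
```

The design point the figure makes is that writes are cheap (an append plus an in-memory update), while reads consult memory first and use the Bloom filter to avoid unnecessary disk seeks.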

Big data sports managers know that live baseball entertainment is not limited to the ballpark. They would love, for example, to tie PITCHf/x into the untapped market of real-time video games. Fans trying to bat against Mariano Rivera with real game footage could bring in substantial revenue. Someday, the data might tie into holographic baseball in your den, so fans could physically “stand tall” against the pitcher with bases loaded!


Teams don’t want to eliminate the human factors that make baseball a great game; instead, they want to leverage big data to help them test theories and make smarter decisions in all facets of their business. Grady Little’s “old school” gut-feeling days and the era of Earl Weaver note cards may be coming to an end. The “new school” big data explosion shows that the pool of data behind decisions is growing at a rate beyond the abilities of a simple, singular data mining approach162. Mastering these new analytics should yield valuable competitive insights in all facets of the game. The Yankees have realized this and employ 21 statisticians163 who apply their knowledge of analytics, modeling, and computer science to the customer relationship, and solve complex business problems by asking “what if” questions of the data.

Perhaps the day will come when the fan in their seat will ask their smart phone about the likelihood of a runner on 1st base scoring if the batter gets a curveball and drives it into right field. Or transit officials may someday reduce car traffic and improve mass transit service through real-time insight into the number of fans attending a game, fan sentiment, and the score. We want to have a great time at the game and owners want us excited about their product. Trying to understand the data relationships is a big challenge, and teams that succeed will increase their revenue. When teams succeed, they enhance our national pastime. Play ball!


Appendix - PITCHf/x Metadata

The following definitions support the inning_9.XML file used earlier in this paper.

Field | This data value | Description
des | Foul ball | Pitch description
des_es | Foul ball | Pitch description, Spanish
ID | 507 | Unique pitch number of the game
type | S | B ball, S strike, X in play
tfs | 230849 | Event #
tfs_zulu | 2012-10-29T03:08:49Z | Starting date and time of event
x | 109.01 | x-pixel at home plate
y | 132.11 | y-pixel at home plate
sv_id | 121028_230849 | date/time stamp when PITCHf/x first detects the pitch, YYMMDD_hhmmss
start_speed | 94.4 | release point velocity, mph
end_speed | 86.0 | velocity crossing the front of home plate, mph
sz_top | 3.32 | distance from ground to the top of the strike zone, feet
sz_bot | 1.53 | distance from ground to the bottom of the strike zone, feet
pfx_x | 8.23 | horizontal movement of the pitch between the release point and home plate, inches
pfx_z | 10.3 | vertical movement of the pitch between the release point and home plate, inches
px | -0.315 | left/right distance of the pitch from the middle of the plate as it crossed home plate, feet
pz | 2.919 | height of the pitch as it crossed the front of home plate, feet
x0 | 2.562 | left/right distance of the pitch, measured at the initial point, feet
y0 | 50.0 | distance from home plate to measure the initial parameters, feet
z0 | 6.035 | height of the pitch measured at the initial point, feet
vx0 | -10.7 | x velocity of the pitch measured at the initial point, feet per second
vy0 | -137.96 | y velocity of the pitch measured at the initial point, feet per second
vz0 | -6.156 | z velocity of the pitch measured at the initial point, feet per second
ax | 15.708 | x acceleration of the pitch measured at the initial point, feet per second per second
ay | 33.556 | y acceleration of the pitch measured at the initial point, feet per second per second
az | -12.449 | z acceleration of the pitch measured at the initial point, feet per second per second
break_y | 23.7 | distance from home plate to the point in the pitch trajectory where it achieved its greatest deviation from the straight line between the release point and the front of home plate, feet
break_angle | -43.3 | angle from vertical to the straight line path from the release point to where the pitch crossed the front of home plate, as seen from the catcher’s/umpire’s perspective, degrees
break_length | 4.1 | greatest distance between the pitch's trajectory from the release point to the front of home plate, and the straight line from the release point to the front of home plate, inches
pitch_type | FF | most probable pitch type according to a neural net classification algorithm developed by Ross Paul of MLBAM - FF=Four-seam Fastball
type_confidence | 0.676 | value of the classification algorithm’s output node corresponding to the most probable pitch type, multiplied by 1.5 if the pitch is known by MLBAM as part of the pitcher’s repertoire
zone | 1 | location of the pitch based on the boxes into which the Gameday app divides the strike zone for its hot/cold zone graphics
nasty | 41 | how hard a pitch it is to hit, 0-100
spin_dir | 141.469 | direction of spin of the baseball, degrees
spin_rate | 2651.72 | measured as the number of rotations of the ball, rpm
cc | | Comments?
mt | | Unknown?
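As a worked example of how these fields relate, the sketch below treats the initial position, velocity, and acceleration values from the table as a constant-acceleration trajectory and checks the result against the table’s px, pz, start_speed, and end_speed. The constant-acceleration model and the position of the front of home plate (17 inches, about 1.417 feet, from the origin) are assumptions of this sketch; the function names are hypothetical.

```python
import math

# Values copied from the metadata table above.
PITCH = {
    "x0": 2.562, "y0": 50.0, "z0": 6.035,
    "vx0": -10.7, "vy0": -137.96, "vz0": -6.156,
    "ax": 15.708, "ay": 33.556, "az": -12.449,
}

PLATE_FRONT_Y_FT = 17.0 / 12.0   # assumption: front of home plate, in feet
FPS_TO_MPH = 3600.0 / 5280.0     # feet per second to miles per hour


def time_to_plate(p, y_plate=PLATE_FRONT_Y_FT):
    """Solve y0 + vy0*t + 0.5*ay*t^2 = y_plate for the first positive t.

    Assumes ay is non-zero, as it is for the pitch in the table.
    """
    a = 0.5 * p["ay"]
    b = p["vy0"]
    c = p["y0"] - y_plate
    disc = b * b - 4.0 * a * c
    return (-b - math.sqrt(disc)) / (2.0 * a)


def position(p, axis, t):
    """Position along one axis after t seconds of constant acceleration."""
    return p[f"{axis}0"] + p[f"v{axis}0"] * t + 0.5 * p[f"a{axis}"] * t * t


def speed_mph(p, t):
    """Speed in mph after t seconds of constant acceleration."""
    v = [p[f"v{axis}0"] + p[f"a{axis}"] * t for axis in "xyz"]
    return math.sqrt(sum(c * c for c in v)) * FPS_TO_MPH


t = time_to_plate(PITCH)
print(f"flight time from y0 to the plate: {t:.3f} s")
print(f"px = {position(PITCH, 'x', t):.3f} ft, pz = {position(PITCH, 'z', t):.3f} ft")
print(f"start speed = {speed_mph(PITCH, 0.0):.1f} mph, end speed = {speed_mph(PITCH, t):.1f} mph")
```

Under these assumptions the computed values agree with the table’s px (-0.315), pz (2.919), start_speed (94.4), and end_speed (86.0) to within rounding, which suggests the nine initial-condition fields are enough to rebuild the pitch’s path.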


Footnotes

1 http://espn.go.com/mlb/story/_/id/6908844/information-age-changing-way-game-played
2 http://espn.go.com/mlb/player/hotzones/_/id/5375/kevin-youkilis
3 http://www.fangraphs.com/graphs.aspx?playerid=1935&position=1B/3B&page=0&type=mini
4 http://blog.ceciliatan.com/?p=478
5 Adapted - http://www.ericcressey.com/troubleshooting-baseball-hitting-timing-is-not-always-the-problem
6 www.providencejournal.com/sports/red-sox/content/20120316-baseball-vision-when-20-20-eyesight-just-wont-cut-it.ece
7 “The Physics of baseball”, by Robert K. Adair, ISBN 0-06-008436-7, Page 39
8 http://www.bostonbaseball.com/whitesox/baseball_extras/physics.html
9 http://www.sportvision.com/baseball/fieldfx
10 http://baseball.physics.illinois.edu/FieldFX-TDR-GregR.pdf
11 http://www.brooksbaseball.net/pfxVB/pfx.php?s_type=2&sp_type=1&year=2012&month=8&day=4&batterX=&pitchSel=453329&game=gid_2012_08_04_minmlb_bosmlb_1/&prevGame=gid_2012_08_04_minmlb_bosmlb_1/&prevDate=84&inning1=y
12 http://princeofslides.blogspot.com/2011/05/sab-r-metrics-gif-movies-and-pitch.html
13 http://www.baseball-reference.com/players/r/riverma01.shtml
14 http://www.baseball-reference.com/players/m/martipe02.shtml
15 http://www.fangraphs.com/leaders.aspx?pos=all&stats=pit&lg=all&qual=0&type=1&season=2013&month=0&season1=2013&ind=0&team=0,ss&rost=0&age=&filter=&players=0
16 http://www.fangraphs.com/heatmap.aspx?playerid=844&position=P&pitch=FC
17 http://articles.latimes.com/2013/jul/29/sports/la-sp-0730-mariano-rivera-20130730
18 http://en.wikipedia.org/wiki/Elias_Sports_Bureau
19 http://sabr.org/sabermetrics
20 http://www.predictiveanalyticsworld.com/sanfrancisco/2012/presentations/pdf/Day2_1430_Vartanian.pdf
21 http://baseballanalysts.com/archives/2006/07/empirical_analy_1.php
22 http://media.hometeamsonline.com/photos/baseball/SOBRATO/TG_Strategy_Presentation.pdf
23 http://turbotodd.wordpress.com/2011/10/26/information-on-demand-2011-a-data-driven-conversation-with-michael-lewis-billy-beane/
24 http://www.baseball-reference.com/players/d/davisch02.shtml
25 http://deadspin.com/2013-payrolls-and-salaries-for-every-mlb-team-462765594
26 http://baseballplayersalaries.com/
27 http://www.sloansportsconference.com/wp-content/uploads/2012/02/98-Predicting-the-Next-Pitch_updated.pdf
28 http://sports.yahoo.com/fantasy/blog/roto_arcade/post/Been-Caught-Stealing-MLB-warns-Phillies-about-s?urn=fantasy,240535
29 http://www.geekwire.com/2013/major-league-baseball-dominates-big-data-mobile-game/
30 http://seanlahman.com/blog/wp-content/uploads/2013/08/SABR-43-presentation.pdf
31 http://espn.go.com/mlb/player/stats/_/id/1150/tony-gwynn
32 http://sportsillustrated.cnn.com/vault/article/magazine/MAG1115578/2/index.htm
33 http://espn.go.com/mlb/story/_/id/6908844/information-age-changing-way-game-played
34 http://www.sloansportsconference.com/?p=6137
35 http://espn.go.com/mlb/story/_/id/6908844/information-age-changing-way-game-played
36 http://www.gartner.com/it-glossary/big-data/
37 http://www.sloansportsconference.com/wp-content/uploads/2013/Slides/205/HP%20Vertica.pdf
38 http://www.sportvision.com/
39 “The Technology of Baseball” by Thomas K. Adamson. ISBN 978-1-4296-9955-6, P.15
40 http://www.bloomberg.com/news/2011-03-31/baseball-is-set-for-deluge-in-data-as-monitoring-of-players-goes-hi-tech.html
41 http://blogs.wsj.com/cio/2013/07/15/baseball-all-stars-data-gets-more-sophisticated-with-field-fx/
42 http://en.wikipedia.org/wiki/1st_&_Ten_(graphics_system)
43 http://www.dnallen.com/stuff/AllenSkillVLuck11.pdf
44 “Broadcasting Baseball: A History of the National Pastime on Radio and Television”, ISBN 978-0-7864-4644-5, P. 204
45 “Why a Curveball Curves”, ISBN 978-1-58816-794-1. Page 63, “The Physics of a Curveball” by Dr. Peter Brancazio, my professor at Brooklyn College, 1971. The Magnus effect is the top spin causing the ball to move downward in addition to gravity, back spin causing the ball to rise, all due to the ball’s interaction with the air and the wake it creates.
46 http://sportsvideo.org/main/blog/2010/10/tbs-sportvision-offer-hd-viewers-new-angles-with-pitchfx/
47 http://www.kbardesign.com/141266/1404601/motion-design/sportvision-baseball-presentation
48 http://www.hardballtimes.com/main/blog_article/ichiro-strike-three/
49 “Scorecasting: The Hidden Influences Behind How Sports Are Played and Games Are Won” by Tobias Moskowitz, ISBN 978-0-307-59180-7, P.14.
50 http://www.brooksbaseball.net/pfxVB/zoneTrack.php?&game=gid_2012_06_18_balmlb_nynmlb_1/&innings=yyyyyyyyy&month=06&day=18&year=2012
51 http://rayscoloredglasses.com/2013/05/09/velocity-concerns-for-rays-david-price-extend-beyond-his-fastball/
52 http://pitchfx.texasleaguers.com/pitcher/285079/?batters=A&count=AA&pitches=AA&from=6%2F18%2F2012&to=6%2F18%2F2012


53 http://pitchfx.texasleaguers.com/pitcher/285079/?batters=A&count=AA&pitches=AA&from=6/18/2012&to=6/18/2012
54 http://pitchfx.texasleaguers.com/readingcharts.php
55 http://www.fangraphs.com/pitchfx.aspx?playerid=8700&position=P
56 http://sportsillustrated.cnn.com/2011/writers/tom_verducci/04/12/fastballs.trackman/index.html
57 http://baseballanalysts.com/archives/fx_visualizatio_1/
58 An Introduction to TrackMan Baseball - http://baseballanalysts.com/archives/fx_visualizations/002587-print.html
59 http://www.fangraphs.com/pitchfxg.aspx?playerid=8700&position=P&season=2011&date=0&dh=0
60 http://wapc.mlb.com/play/?content_id=26750779&topic_id=&c_id=mlb&tcid=vpp_copy_26750779&v=3
61 http://www.kbardesign.com/141266/1400894/motion-design/pitchfx-data-visualization
62 http://msn.foxsports.com/mlb/player/miguel-cabrera/hotzone/140955?q=miguel-cabrera
63 http://pitchfx.texasleaguers.com/batter/408234/?pitchers=A&count=AA&pitches=AA&from=1/1/2012&to=12/31/2012
64 http://www.sportvision.com/news/research-confirms-hard-throwers-advantage
65 http://wapc.mlb.com/play/?c_id=mlb&content_id=20171503&topic_id=7417714&v=3
66 http://www.sportvision.com/news/baseballs-strike-zone-illusionist
67 http://www.sportvision.com/news/baseballs-strike-zone-illusionist
68 http://phys.csuchico.edu/baseball/talks/AAPT(Jul-2013)/slides.pdf
69 http://baseball.physics.illinois.edu/TrackingTechnologiesBaseball.pdf
70 http://www.sportvision.com/baseball/hitfx
71 http://www.sportvision.com/baseball/hitfx
72 http://sportsvideo.org/main/blog/2013/07/sportvisions-hit-speed-data-making-broadcast-debut-on-fox-sports-regional-networks/
73 http://seamheads.com/2010/05/09/meet-the-new-park-factors-part-ii/
74 http://www.beyondtheboxscore.com/2009/6/6/900817/graph-of-the-day-average-batted
75 http://baseballengineer.com/2010/04/13/meet-the-new-park-factors-–-part-ii/
76 http://www.hittrackeronline.com/detail.php?id=2012_3635&type=hitter
77 pitchfx.texasleaguers.com/pitcher/461833/?batters=A&count=AA&pitches=AA&from=4/1/2012&to=9/30/2013
78 http://espn.go.com/mlb/story/_/id/6908844/information-age-changing-way-game-played
79 http://www.billjamesonline.com/best_defensive_teams_of_the_decade/
80 http://www.baseballnation.com/2011/4/23/2128357/will-fieldf-x-go-public
81 http://www.businessinsider.com/fieldfx-sportvision-technology-baseball-tracking-2011-03
82 http://en.wikipedia.org/wiki/Rickey_Henderson
83 http://www.fangraphs.com/community/2010-pitchfx-summit-recap/
84 http://www.alliedvisiontec.com/apac/news/news-display/article/lets-play-ballbrmachine-vision-cameras-in-professional-baseball.html
85 http://www.baseballprospectus.com/article.php?articleid=11869
86 New Technologies in Baseball, Aug 7, 2010, Rand Pendleton, Sportvision Dir Video Broadcast Development
87 http://baseballanalysts.com/archives/fx_visualizatio_1/
88 http://www.baseballprospectus.com/article.php?articleid=11869
89 http://www.slideshare.net/timothyf/fisher-baseball-baseball-stats-overload?from_search=13
90 http://www.seanlahman.com/2013/08/baseball-in-the-age-of-big-data/
91 http://www.baseballprospectus.com/article.php?articleid=11869
92 http://www.kbardesign.com/141266/1429848/motion-design/baseball-reaction-time-concepts
93 www.bloomberg.com/news/2011-03-31/baseball-is-set-for-deluge-in-data-as-monitoring-of-players-goes-hi-tech.html
94 http://www.intelfreepress.com/news/moneyball-brings-big-data-to-silicon-valleys-minor-league/6239
95 http://en.wikipedia.org/wiki/Pythagorean_expectation
96 http://60ft6in.com/2012/12/09/technology-for-the-fans/
97 http://www.nytimes.com/2009/07/10/sports/baseball/10cameras.html?_r=1&
98 http://www.youtube.com/watch?v=HIJRZW3AoOw
99 http://www.fangraphs.com/library/business/revenue-sharing/
100 http://www.forbes.com/mlb-valuations/list/
101 http://espn.go.com/blog/statsinfo/tag/_/name/defensive-runs-saved
102 https://infocus.emc.com/william_schmarzo/using-the-big-data-strategy-document-to-win-the-world-series/
103 http://www.cnbc.com/id/38046122
104 http://articles.latimes.com/2011/oct/21/sports/la-sp-world-series-tv-20111022
105 http://www.tauntongazette.com/news/x1565405444/World-Series-a-boost-for-local-business
106 http://sports.newsday.com/long-island/data/baseball/mlb-salaries-2013/
107 http://www.baseball-reference.com/players/l/leecl02.shtml
108 http://www.fangraphs.com/statss.aspx?playerid=1636&position=P#battedball
109 http://www.cbssports.com/mlb/salaries
110 http://wiki.answers.com/Q/Which_countries_play_baseball
111 http://blogs.sas.com/content/cokins/2011/08/23/moneyball-turning-the-odds-on-the-casino-with-analytics/
112 http://www.minorleagueball.com/2013/1/28/3925786/2013-baseball-farm-system-rankings
113 http://progressive-economy.org/files/2012/05/fact512.opening.day_1.pdf
114 http://www.sloansportsconference.com/?p=7809
115 http://www.smartkage.com/#/services/professional
116 https://www.prbuzz.com/sports/114405-more-good-news-for-boston-area-baseball-players-and-fans-of-all-ages.html
117 http://sabr.org/analytics/speakers/2012
118 http://www.reuters.com/article/2011/08/25/idUS162922+25-Aug-2011+MW20110825
119 http://www.yarcdata.com/Solutions/baseball-analytics.php


120 http://vincegennaro.mlblogs.com/2013/04/22/clustering-pitchers-by-similarity-part-1/
121 http://www.yarcdata.com/Resources/video9.php
122 http://www.forums.mlb.com/n/pfx/forum.aspx?nav=messages&webtag=ml-washington&tid=38839
123 http://www.csnne.com/blog/red-sox-talk/red-sox-introduce-new-ticket-pricing-structure?p=ya5nbcs&ocid=yahoo
124 http://www.statista.com/topics/968/major-league-baseball/#chapter3
125 http://www.ocregister.com/articles/licensed-347956-products-major.html
126 http://www.askmen.com/sports/business/14_sports_business.html
127 http://washington.nationals.mlb.com/was/ticketing/ultimate_ballpark_access.jsp
128 http://emerging.uschamber.com/events/2013/bhs-big-data-and-why-it-matters
129 http://nationals.brochure-mlb.com/2013/faqs.php
130 http://newyork.yankees.mlb.com/mobile/attheballpark/index.jsp?c_id=nyy
131 http://www.nwherald.com/2013/07/12/cubs-get-signage-approval-ending-stadium-revenue-excuses/atyifb4/
132 http://www.baseball-reference.com/teams/NYM/attend.shtml
133 http://bats.blogs.nytimes.com/2013/03/05/attendance-and-revenue-fall-at-citi-field/?_r=0
134 http://www.marketwatch.com/story/baseball-prospectus-announces-multi-year-partnership-with-mlbam-2013-09-04
135 http://www.forbes.com/sites/mikeozanian/2013/03/27/baseball-team-valuations-2013-yankees-on-top-at-2-3-billion/
136 http://www.fastcompany.com/1822802/mlb-advanced-medias-bob-bowman-playing-digital-hardball-and-hes-winning
137 http://www.bloomberg.com/video/applying-big-data-thinking-to-sports-LZvCgdCFTxGYMtBXCVGm6g.html
138 http://sabr.org/latest/2013-sabr-analytics-conference-research-presentations - transcription of Vince Gennaro, "The Big Data Approach to Baseball Analytics"
139 http://www.forbes.com/sites/mikeozanian/2013/03/27/baseball-team-valuations-2013-yankees-on-top-at-2-3-billion/
140 http://www.bloomberg.com/news/2013-03-28/yankees-value-surges-to-2-3-billion-on-yes-deal-forbes-says.html
141 http://sportsbusinessnews.com/content/big-business-major-league-baseball-and-television
142 http://www.nytimes.com/2013/01/29/sports/baseball/dodgers-sweet-tv-deal-will-taste-bitter-to-fans.html
143 www.bloomberg.com/news/2010-10-03/fox-says-ad-slots-for-baseball-playoff-series-world-series-are-90-sold.html
144 http://www.adweek.com/news/television/fox-sells-out-mlb-all-star-game-141581
145 http://milwaukee.brewers.mlb.com/mil/sponsorship/demographics/index.jsp
146 http://www.boston.com/sports/baseball/redsox/articles/2007/04/25/sox_have_dice_k_but_rivals_reaping_ad_dollars/
147 http://www.forbes.com/sites/jasonbelzer/2013/02/26/automated-insights-poised-to-revolutionize-sports-media/
148 http://www.sportce.com/red-sox-beat-tigers-5-2-advance-to-world-series-automated-insights/
149 http://statsheet.com/sites
150 http://statsheet.com/about
151 http://www.youtube.com/watch?feature=player_embedded&v=M9oeJDkBjqA
152 http://upstart.bizjournals.com/companies/startups/2013/05/22/automated-insights-turns-data-to-stories.html?page=all
153 U.S. Patent US 20110246182A1
154 U.S. Patent US 20110246182A1
155 http://itknowledgeexchange.techtarget.com/cio/big-data-meets-automated-content-development/
156 http://www.globalaffairs.org/forum/threads/computerised-journalism.62888/
157 www.bloomberg.com/news/2011-03-31/baseball-is-set-for-deluge-in-data-as-monitoring-of-players-goes-hi-tech.html
158 http://wiki.answers.com/Q/How_many_colleges_are_there_in_the_US
159 http://www.7x7.com/tech-gadgets/lightspeed-venture-s-barry-eggers-how-big-data-disrupting-baseball
160 http://hadoop.apache.org/
161 http://pivotallabs.com/patrick-mcfadin-killer-apps-using-apache-cassandra/
162 http://resources.idgenterprise.com/original/AST-0082578_Turn_Big_Data_Inward_With_IT_Analytics.pdf
163 http://www.ft.com/cms/s/2/3f5cc88c-0b21-11e1-ae56-00144feabdc0.html#axzz1dkG8fZcd

EMC believes the information in this publication is accurate as of its publication date. The information is subject to change without notice.

THE INFORMATION IN THIS PUBLICATION IS PROVIDED “AS IS.” EMC CORPORATION MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND WITH RESPECT TO THE INFORMATION IN THIS PUBLICATION, AND SPECIFICALLY DISCLAIMS IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Use, copying, and distribution of any EMC software described in this publication requires an applicable software license.
