DATA SCIENCE AND THE BUSINESS OF MAJOR LEAGUE BASEBALL
Matthew Horton, Josh Hamilton, Aaron Owen, PhD
DATA SCIENCE AT MLB
THERE ARE 2 DISTINCT DATA SCIENCE GROUPS, FOCUSED ON DIFFERENT ASPECTS OF THE GAME
BUSINESS / FAN FOCUSED
WHERE DOES DATA SCIENCE FIT INTO THE ORGANIZATION AND WHO USES OUR WORK?
FUTURE SCHEDULE EVALUATION
CANDIDATE SCHEDULES ARE FOR 2 YEARS INTO FUTURE
CANDIDATE A, CANDIDATE B, CANDIDATE C
FUTURE SCHEDULE EVALUATION
MODEL TRAINING: 7 SEASONS OF GAME DATA
FEATURES: day of week, month, opponent, interleague/intraleague, game time, previous attendance, previous revenue, summer vacation dates, holidays, weather variables, multi-year aggregates, feature interactions, …
MODELS: SINGLE-GAME ATTENDANCE (LINEAR REGRESSION), SINGLE-GAME REVENUE (LINEAR REGRESSION)
FUTURE SCHEDULE EVALUATION
CANDIDATE SCHEDULES → TRAINED MODELS → PREDICTIONS
CANDIDATE A: PREDICTED LEAGUE-WIDE ATTENDANCE: X, PREDICTED LEAGUE-WIDE REVENUE: $X
CANDIDATE B: PREDICTED LEAGUE-WIDE ATTENDANCE: X, PREDICTED LEAGUE-WIDE REVENUE: $X
CANDIDATE C: PREDICTED LEAGUE-WIDE ATTENDANCE: X, PREDICTED LEAGUE-WIDE REVENUE: $X
COMMISSIONER’S SCHEDULING COMMITTEE’S CHOICE: CANDIDATE C
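The evaluation flow on the slides above (train single-game models, then total their predictions over each candidate schedule) can be sketched roughly as follows. This is a minimal illustration and not MLB's implementation: it assumes NumPy, uses only two of the listed features, and every function name, schedule, and number in it is invented for the example.

```python
import numpy as np

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

def encode(game):
    """Intercept, one-hot day of week (Mon as baseline), interleague flag."""
    row = [1.0]
    row += [1.0 if game["day"] == d else 0.0 for d in DAYS[1:]]
    row.append(1.0 if game["interleague"] else 0.0)
    return row

def fit_linear(games, targets):
    """Ordinary least squares fit of single-game targets on encoded features."""
    X = np.array([encode(g) for g in games])
    y = np.array(targets, dtype=float)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def predict_total(coef, schedule):
    """Sum single-game predictions over every game in a candidate schedule."""
    X = np.array([encode(g) for g in schedule])
    return float((X @ coef).sum())

# Invented training data standing in for past seasons of game data.
DAY_BUMP = {"Mon": 0, "Tue": 0, "Wed": 1000, "Thu": 1000,
            "Fri": 6000, "Sat": 10000, "Sun": 8000}
history = [{"day": d, "interleague": il} for d in DAYS for il in (False, True)]
attendance = [20000 + DAY_BUMP[g["day"]] + (2000 if g["interleague"] else 0)
              for g in history]
revenue = [30.0 * a for a in attendance]  # invented flat revenue per ticket

attendance_model = fit_linear(history, attendance)
revenue_model = fit_linear(history, revenue)

# Two tiny hypothetical candidate schedules (real ones cover whole seasons).
candidates = {
    "A": [{"day": "Tue", "interleague": False},
          {"day": "Wed", "interleague": False},
          {"day": "Thu", "interleague": False}],
    "B": [{"day": "Fri", "interleague": False},
          {"day": "Sat", "interleague": True},
          {"day": "Sun", "interleague": False}],
}
totals = {name: (predict_total(attendance_model, sched),
                 predict_total(revenue_model, sched))
          for name, sched in candidates.items()}
best = max(totals, key=lambda name: totals[name][1])  # rank by predicted revenue
```

Ranking by predicted revenue is only one possible rule; on the slides the final choice belongs to the scheduling committee, with the predictions as an input.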
SINGLE GAME TICKET DEMAND FORECASTING
SINGLE GAME TICKET DEMAND FORECASTING – MODEL
STEP 1: MODEL TRAINING
3 PREVIOUS SEASONS OF GAME DATA
FEATURES: number of days before game, day of week, month, opponent, interleague/intraleague, promo/no promo, …
MODEL: GRADIENT BOOSTED REGRESSOR
TARGET: NUMBER OF TICKETS SOLD
STEP 2: PREDICTIONS
CURRENT TICKET SALES FOR ALL GAMES → TRAINED MODEL → PREDICTIONS
[Figure: GAME 29, TICKET SALES vs. DAYS BEFORE GAME]
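The two steps above can be illustrated with a small hand-rolled gradient boosted regressor. This is a sketch under stated assumptions, not the model from the talk: it assumes squared-error loss and depth-1 trees (stumps), uses only two invented features (days before game, weekend flag), and trains on made-up ticket counts. In practice a library implementation such as scikit-learn's GradientBoostingRegressor would normally be used.

```python
def fit_stump(X, residuals):
    """Best single split (feature, threshold, left mean, right mean) by squared error."""
    best = None
    for j in range(len(X[0])):
        values = sorted({row[j] for row in X})
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [r for row, r in zip(X, residuals) if row[j] <= thr]
            right = [r for row, r in zip(X, residuals) if row[j] > thr]
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = (sum((r - lm) ** 2 for r in left)
                   + sum((r - rm) ** 2 for r in right))
            if best is None or sse < best[0]:
                best = (sse, j, thr, lm, rm)
    return best[1:]

def fit_boosted(X, y, rounds=200, lr=0.1):
    """Gradient boosting for squared error: each stump fits the current residuals."""
    base = sum(y) / len(y)
    preds = [base] * len(y)
    stumps = []
    for _ in range(rounds):
        residuals = [t - p for t, p in zip(y, preds)]  # negative gradient
        j, thr, lm, rm = fit_stump(X, residuals)
        stumps.append((j, thr, lm, rm))
        preds = [p + lr * (lm if row[j] <= thr else rm)
                 for row, p in zip(X, preds)]
    return base, lr, stumps

def predict(model, row):
    base, lr, stumps = model
    return base + sum(lr * (lm if row[j] <= thr else rm)
                      for j, thr, lm, rm in stumps)

# STEP 1: train on invented history; rows are [days_before_game, is_weekend].
X_train = [[d, w] for d in range(0, 31, 5) for w in (0, 1)]
y_train = [1000 - 20 * d + 300 * w for d, w in X_train]  # made-up tickets sold
model = fit_boosted(X_train, y_train)

# STEP 2: predicted ticket sales for a hypothetical upcoming weekend game.
curve = {d: predict(model, [d, 1]) for d in (30, 15, 0)}
```

Reading `curve` from 30 days out down to game day gives the kind of sales-versus-days-before-game trajectory the slide plots for a single game.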
SINGLE GAME TICKET DEMAND FORECASTING
PROMOTION OPTIMIZATION
PROMOTION SCHEDULE OPTIMIZATION - MODEL
STEP 1: MODEL TRAINING
3 PREVIOUS SEASONS OF GAME DATA
FEATURES: series start, day/night, day of week, month, opponent, promo type, …
MODEL: GRADIENT BOOSTED REGRESSOR
OUTPUT: REVENUE PREDICTION BASED ON CURRENT PROMOTION SCHEDULE
STEP 2: MONTE CARLO SIMULATION
PROMOTION SCHEDULE RANDOMIZED x10,000 → TRAINED MODEL → 10,000 REVENUE PREDICTIONS
[Figure: distribution of simulated REVENUE]
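A rough sketch of the Monte Carlo step above: randomize the promotion schedule 10,000 times and score each version with a revenue model. The slides do not say how schedules are randomized, so shuffling the same promotions across games is an assumption here, and `predict_revenue` is a hypothetical stand-in for the trained model with invented uplift values.

```python
import random

# Invented uplifts standing in for the trained revenue model; not values from the talk.
UPLIFT = {"none": 0.0, "fireworks": 0.10,
          "shirt_or_cap": 0.05, "bobblehead_or_figurine": 0.15}

def predict_revenue(game, promo):
    """Hypothetical model: promotions are assumed to help more on weekdays."""
    weekday_factor = 1.0 if game["weekend"] else 2.0
    return game["base_revenue"] * (1 + UPLIFT[promo] * weekday_factor)

def schedule_revenue(games, schedule):
    return sum(predict_revenue(g, p) for g, p in zip(games, schedule))

def simulate(games, original_schedule, n_sims=10_000, seed=0):
    """Shuffle the promotions across games n_sims times and score each schedule."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_sims):
        shuffled = original_schedule[:]
        rng.shuffle(shuffled)
        results.append((schedule_revenue(games, shuffled), shuffled))
    results.sort(key=lambda r: r[0])  # ascending by simulated revenue
    return results

games = [{"base_revenue": 100_000.0 + 10_000.0 * i, "weekend": i % 2 == 0}
         for i in range(8)]
original_schedule = ["fireworks", "none", "shirt_or_cap", "none",
                     "bobblehead_or_figurine", "none", "fireworks", "none"]

results = simulate(games, original_schedule)
original_total = schedule_revenue(games, original_schedule)
no_promo_total = schedule_revenue(games, ["none"] * len(games))
worst_decile = results[: len(results) // 10]
best_decile = results[-(len(results) // 10):]
```

Comparing `original_total` and `no_promo_total` against the simulated distribution, and contrasting where each promo type lands in `worst_decile` versus `best_decile`, mirrors the comparison the recommendation slides describe.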
PROMOTION SCHEDULE OPTIMIZATION - RECOMMENDATION
[Figure: SIMULATIONS by HYPOTHETICAL REVENUE, with ORIGINAL PROMOTION SCHEDULE and NO PROMOTIONS marked]
PROMOTION SCHEDULE OPTIMIZATION - RECOMMENDATION
[Figure: SIMULATIONS by HYPOTHETICAL REVENUE, with ORIGINAL PROMOTION SCHEDULE marked; WORST 10% OF SIMULATED PROMOTION SCHEDULES compared with BEST 10% OF SIMULATED PROMOTION SCHEDULES by promotion type: FIREWORKS, SHIRT OR CAP, BOBBLEHEAD -OR- FIGURINE, NO PROMOTIONS]
TEAM AVIDITY METRIC
Strong Mets Fan
FAN SEGMENTATION
Team Fan
LIFETIME VALUE
Ticketing LTV: $500, Shop LTV: $100, MLB.tv LTV: $100
Overall LTV: $700
PLAYER AVIDITY
Jacob deGrom, Pete Alonso, Mike Trout
TEAM AVIDITY – DEVELOPMENT
6 PREVIOUS YEARS OF FAN DATA
DATA SOURCES AND FEATURES
EXPLICIT SIGNALS: email opt-ins, ballpark app, …
ENGAGEMENT: MLB.TV streams, ticket scans, …
SHARE OF FAN’S SPEND: shop purchases, ticket purchases
TeamAvidity_{fan, team} = ExplicitSignals x W_ES + Engagement x W_E + Spend x W_S
SCORE AND RANK
STANDARDIZE AND SEGMENT: Weak, Moderate, Strong
Scores by team, one row per Fan ID:
ARI  ATL  BAL  BOS  CHC  CIN  CLE  COL  CWS  DET  HOU  KC   LAA  LAD  MIA  MIL  MIN  NYM  NYY  OAK  PHI  PIT  SD   SEA  SF   STL  TB   TEX  TOR  WAS  High Fav
0.00 0.00 0.12 0.16 0.00 0.00 0.05 0.00 0.06 0.04 0.00 0.04 0.04 0.00 0.00 0.00 0.05 0.03 0.19 0.07 0.05 0.00 0.00 0.04 0.00 0.00 0.09 0.09 1.00 0.02 1.00 TOR
0.00 0.07 0.00 0.00 0.94 0.06 0.06 0.09 0.05 0.00 0.00 0.00 0.00 0.00 0.10 0.16 0.00 0.05 0.00 0.00 0.05 0.13 0.00 0.00 0.03 0.06 0.00 0.00 0.00 0.00 0.94 CHC
0.00 0.04 0.15 0.86 0.02 0.00 0.00 0.00 0.05 0.04 0.03 0.03 0.04 0.00 0.04 0.00 0.00 0.00 0.09 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.19 0.06 0.12 0.00 0.86 BOS
0.34 0.43 0.22 0.35 0.31 0.29 0.27 0.40 0.18 0.14 0.29 0.21 0.35 0.31 0.28 0.31 0.28 0.81 0.31 0.25 0.31 0.19 0.33 0.16 0.24 0.29 0.10 0.21 0.74 0.42 0.81 NYM
0.06 0.21 0.33 0.72 0.11 0.12 0.12 0.06 0.12 0.17 0.03 0.19 0.17 0.09 0.15 0.06 0.13 0.34 0.81 0.19 0.14 0.06 0.03 0.07 0.07 0.11 0.21 0.00 0.28 0.21 0.81 NYY
0.00 0.00 0.09 0.13 0.02 0.00 0.05 0.00 0.00 0.04 0.01 0.04 0.07 0.00 0.04 0.00 0.06 0.03 0.81 0.06 0.00 0.01 0.00 0.00 0.00 0.02 0.03 0.00 0.18 0.06 0.81 NYY
[Figure: STANDARD DEVIATION axis, -3 to 3]
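The TeamAvidity formula and the standardize-and-segment step can be sketched as below. The slides name the weights W_ES, W_E, and W_S but give no values, so the weights, the +/-1 standard deviation cutoffs, the choice to standardize across fans of one team, and the five example fans are all assumptions made for illustration. The three scores in `fan_row` are taken from the NYM row of the table above.

```python
from statistics import mean, pstdev

# Hypothetical weights; the slide gives no values for W_ES, W_E, W_S.
W_ES, W_E, W_S = 0.2, 0.4, 0.4

def team_avidity(explicit_signals, engagement, spend_share):
    """TeamAvidity = ExplicitSignals x W_ES + Engagement x W_E + Spend x W_S."""
    return explicit_signals * W_ES + engagement * W_E + spend_share * W_S

def segment(z):
    """Assumed cutoffs: below -1 SD is Weak, above +1 SD is Strong."""
    if z < -1:
        return "Weak"
    if z > 1:
        return "Strong"
    return "Moderate"

# Invented inputs for one team: (explicit signals, engagement, share of spend).
fans = {
    "fan_a": (1.0, 0.9, 0.95),
    "fan_b": (0.0, 0.1, 0.0),
    "fan_c": (0.5, 0.5, 0.5),
    "fan_d": (0.5, 0.4, 0.6),
    "fan_e": (0.5, 0.6, 0.4),
}
scores = {fan: team_avidity(*inputs) for fan, inputs in fans.items()}

# Standardize across fans, then segment on the z-score.
mu, sigma = mean(scores.values()), pstdev(scores.values())
segments = {fan: segment((s - mu) / sigma) for fan, s in scores.items()}

# High / Fav columns: highest team score for a fan and the team it belongs to.
fan_row = {"ATL": 0.43, "NYM": 0.81, "TOR": 0.74}
fav = max(fan_row, key=fan_row.get)
high = fan_row[fav]
```

With these inputs `fan_a` comes out Strong, `fan_b` Weak, and the rest Moderate, and the favorite team for `fan_row` is NYM with a high score of 0.81, matching the table row.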
TEAM AVIDITY – USE CASES
IDENTIFY AND TARGET OUT-OF-MARKET FANS
HOME FIELD ADVANTAGE
[Figure: Fans vs. Opponent Fans (Opp.), axis 0% to 100%; bar values 97%, 96%, 95%, 95%, 94%, 93%, 93%, 92%, 92%, 92%, 91%, 88%, 86%, 85%, 85%, 84%, 80%, 78%]
OFTEN MOST PREDICTIVE FEATURE IN MODELS
FAN SEGMENTATION – MODEL
NETWORK ANALYSIS
3 PREVIOUS YEARS OF FAN DATA
DATA SOURCE: MLB.TV, Attended Games (participating teams per game)
FEATURES: Degree centrality, Clustering coefficient
ALL FANS → ROOKIES, TEAM FANS, VETERANS
ROOKIES: Casual or New Fan
TEAM FANS: Mostly Interested in a Single Team
VETERANS: Interested in Many Teams
FAN SEGMENTATION – USE CASES
ROOKIES, TEAM FANS, VETERANS
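The two network features named on the FAN SEGMENTATION – MODEL slide can be computed as in the sketch below. The slides do not describe how the network is built, so this assumes one graph per fan in which teams are nodes and an edge joins the two teams in any game the fan streamed or attended. The game lists are invented, and how the features map to ROOKIES, TEAM FANS, and VETERANS is not shown here because the slides do not specify it.

```python
from itertools import combinations

def build_graph(games):
    """Undirected graph: teams are nodes, each game adds an edge between its teams."""
    graph = {}
    for home, away in games:
        graph.setdefault(home, set()).add(away)
        graph.setdefault(away, set()).add(home)
    return graph

def degree_centrality(graph):
    """Degree divided by the maximum possible degree (n - 1)."""
    n = len(graph)
    if n < 2:
        return {team: 0.0 for team in graph}
    return {team: len(nbrs) / (n - 1) for team, nbrs in graph.items()}

def clustering_coefficient(graph):
    """Share of a node's neighbor pairs that are themselves connected."""
    out = {}
    for team, nbrs in graph.items():
        k = len(nbrs)
        if k < 2:
            out[team] = 0.0
            continue
        links = sum(1 for a, b in combinations(nbrs, 2) if b in graph[a])
        out[team] = 2 * links / (k * (k - 1))
    return out

def features(games):
    graph = build_graph(games)
    dc = degree_centrality(graph)
    cc = clustering_coefficient(graph)
    return {"n_teams": len(graph),
            "max_degree_centrality": max(dc.values()),
            "mean_clustering": sum(cc.values()) / len(cc)}

# Invented viewing histories for two hypothetical fans.
single_team_games = [("NYM", "ATL"), ("NYM", "PHI"), ("NYM", "WAS"), ("NYM", "MIA")]
many_team_games = [("NYY", "BOS"), ("BOS", "TB"), ("TB", "NYY"),
                   ("LAD", "SF"), ("SF", "SD"), ("SD", "LAD"), ("CHC", "STL")]

single_team = features(single_team_games)
many_team = features(many_team_games)
```

In this toy data the single-team history forms a star around NYM (maximum degree centrality 1.0, zero clustering), while the many-team history has no dominant hub and high clustering, which is the kind of contrast such features could expose.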
FAN LIFETIME VALUE (LTV) – MODEL
3 PREVIOUS YEARS OF FAN DATA
Business Lines and Features:
Ticketing: ticket spend, total num tickets, unused tickets, ticket resell ROI, …
Shop: shop spend, shop returns, shop unique products, …
MLB.tv: MLB.tv total mins watched, MLB.tv subscriber type, MLB.tv num cancels, MLB.TV num year subscriber, …
REPURCHASE Model (Gradient Boosted Classifier) → Probability of Repurchase [0, 1]
Potential Spend Model* (Gradient Boosted Regressor) → Predicted Potential Spend [0, ∞)
Probability of Repurchase [0, 1] X Predicted Potential Spend [0, ∞) = Predicted LTV
*only trained on fans that went on to spend again
FAN LIFETIME VALUE (LTV) – USE CASES
LTV SEGMENTATION
[Figure: POTENTIAL SPEND (Low to High) vs. PROBABILITY OF REPURCHASE (Low to High)]
EACH MLB FAN: [0, 1] X [0, ∞) for each of Ticketing LTV, Shop LTV, MLB.tv LTV → TOTAL LTV
EVALUATING MARKETING/ADVERTISING CAMPAIGN EFFICACY
POTENTIAL RESPONSE A/B TESTING
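The LTV arithmetic on these slides (probability of repurchase times predicted potential spend, per business line, then summed) is small enough to show directly. In this sketch the two model outputs are plain numbers rather than real classifier and regressor predictions. The probabilities and potential spends are invented so that the results reproduce the example fan elsewhere in the deck (Ticketing $500, Shop $100, MLB.tv $100, Overall $700), and the segmentation cutoffs are assumptions.

```python
BUSINESS_LINES = ["ticketing", "shop", "mlbtv"]

def line_ltv(p_repurchase, potential_spend):
    """Predicted LTV for one business line: probability [0, 1] x spend [0, inf)."""
    if not 0.0 <= p_repurchase <= 1.0:
        raise ValueError("probability of repurchase must be in [0, 1]")
    if potential_spend < 0:
        raise ValueError("potential spend must be non-negative")
    return p_repurchase * potential_spend

def fan_ltv(predictions):
    """Per-line LTV plus the overall total for one fan."""
    per_line = {line: line_ltv(*predictions[line]) for line in BUSINESS_LINES}
    per_line["overall"] = sum(per_line.values())
    return per_line

def ltv_quadrant(p_repurchase, potential_spend, p_cut=0.5, spend_cut=300.0):
    """Low/High on each axis of the segmentation plot; cutoffs are assumed."""
    p_label = "High" if p_repurchase >= p_cut else "Low"
    spend_label = "High" if potential_spend >= spend_cut else "Low"
    return p_label, spend_label

# Invented (probability of repurchase, predicted potential spend) pairs.
predictions = {
    "ticketing": (0.5, 1000.0),
    "shop": (0.25, 400.0),
    "mlbtv": (0.5, 200.0),
}
ltv = fan_ltv(predictions)
ticketing_quadrant = ltv_quadrant(*predictions["ticketing"])
```

Keeping the probability and the potential spend as separate outputs is what makes the two-axis segmentation possible: two fans with the same LTV can sit in different quadrants.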
PLAYER AVIDITY – MODEL
CURRENT FAN + PLAYER DATA → LATENT VARIABLE MODEL (EXPECTATION-MAXIMIZATION ALGORITHM)
INPUTS: fan’s team avidity, fan’s location, player’s MLB popularity, player’s team popularity, player’s performance
LATENT VARIABLE: FAN-PLAYER AVIDITY
PREDICTED vs. OBSERVED: All-Star Votes, Shop Sales, Website Views
PLAYER AVIDITY – USE CASES
IMPACT OF ROSTER CHANGES ON A CLUB’S FANBASE
CUSTOMIZED FAN CONTENT
TEAM AVIDITY METRIC
Strong Mets Fan
FAN SEGMENTATION
Team Fan
LIFETIME VALUE
Ticketing LTV: $500, Shop LTV: $100, MLB.tv LTV: $100
Overall LTV: $700
PLAYER AVIDITY
Jacob deGrom, Pete Alonso, Mike Trout
TICKET PACKAGE RENEWAL LEAD SCORE
Top 20% of fans likely to renew
SEASON TICKET HOLDER RISK
Not a season ticket holder
MLB.TV ENGAGEMENT CAMPAIGN
Moderately engaged, Received email last week
TARGETED TICKET GUIDE
Received notification about CLE vs. NYM series
OTHER FAN-BASED MODELING
LEAD SCORING: Fan ID, Model Vars → Model → Prob. → Decile (10 to 1); Probability of Upgrading or Renewing, 0 to 1
SEASON TICKET HOLDER RISK SCORE: Fan ID, Model Vars → Model → Prob. → Risk Segment (Low, Moderate, High); Probability of NOT Renewing, 0 to 1
MLB.TV ENGAGEMENT: Fan ID, Model Vars → Model → Engagement Segment (Unengaged, Moderately Engaged, Highly Engaged) → Recommended Game (NYM vs. ATL, CHC vs. STL, NYY vs. BOS, SF vs. LAD); Predicted Number of Days with a View, 0 to 7
TARGETED TICKET GUIDE: Fan ID, Ticket Buying History, Vars → Recommended Series (MIL vs. CHC, HOU vs. TEX, SEA vs. OAK, MIA vs. TB)
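Turning model probabilities into the deciles and risk segments shown above is a simple post-processing step, sketched here. The slides do not state which end of the decile scale is best or where the risk cutoffs sit, so treating decile 10 as the highest probability and cutting risk at 0.33 and 0.66 are assumptions, and the fan IDs and probabilities are invented.

```python
def assign_deciles(probabilities):
    """Map fan ID -> decile, assuming 10 = highest probability and 1 = lowest."""
    ranked = sorted(probabilities, key=probabilities.get)  # ascending probability
    n = len(ranked)
    return {fan: 1 + (i * 10) // n for i, fan in enumerate(ranked)}

def risk_segment(p_not_renewing, low_cut=0.33, high_cut=0.66):
    """Low / Moderate / High risk from the probability of NOT renewing."""
    if p_not_renewing < low_cut:
        return "Low"
    if p_not_renewing < high_cut:
        return "Moderate"
    return "High"

# Invented lead-scoring outputs: probability of upgrading or renewing per fan.
lead_probabilities = {f"fan_{i}": i / 20 for i in range(20)}
deciles = assign_deciles(lead_probabilities)
top_20_percent = sorted(fan for fan, d in deciles.items() if d >= 9)

# Invented season ticket holder risk outputs.
risk = {fan: risk_segment(p)
        for fan, p in {"fan_x": 0.1, "fan_y": 0.5, "fan_z": 0.9}.items()}
```

Under these assumptions, the "Top 20% of fans likely to renew" label on the example fan profile would correspond to deciles 9 and 10.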
QUICK SUMMARY
• OVERVIEW OF DATA SCIENCE AT MLB
• METRIC INTRODUCTIONS
• GAME
• FAN
BETTER SERVE THE 30 CLUBS & OUR MILLIONS OF FANS
Rate today’s session: Session page on conference website, O’Reilly Events App
QUESTIONS?
• OPEN POSITIONS: www.mlb.com/jobs
• TECH BLOG: https://technology.mlblogs.com/