DATA FARMING: METHODS FOR THE PRESENT, OPPORTUNITIES FOR THE FUTURE
Susan M. Sanchez
SEED Center for Data Farming
Operations Research Department
Naval Postgraduate School
ISIM 2017 Research Workshop, Durham, U.K.
Department of Defense Distribution Statement: Approved for public release; distribution is unlimited
Data Mining vs. Data Farming
• Miners seek valuable buried nuggets
  - Miners have no control over what’s there or how hard it is to separate it out
  - Data Mining seeks valuable information buried within massive amounts of data
• Farmers cultivate to maximize yield
  - Farmers manipulate the environment to their advantage: pest control, irrigation, fertilizer, etc.
  - Data Farming manipulates simulation models to advantage with designed experimentation
One way of thinking of big data: any data set that pushes against the limits of currently available analysis technology
Correlation ≠ Causation
“Wall Streeters have the fastest computers, most sophisticated software and biggest databases money can buy, and yet many failed to see the 2008 crash coming. The hope that Big Data will make economics and other social sciences truly scientific – that is, precise and predictive – remains, for now, a fantasy.”¹
“Harnessing vast quantities of data rather than a small portion, and privileging more data of less exactitude, opens the door to new ways of understanding. It leads society to abandon its time-honored preference for causality, and in many instances tap the benefits of correlation.”²
Simulators don’t have to choose!
Correlation = 0.947
¹ Horgan, J., 2014. “So Far, Big Data is Small Potatoes”
² Mayer-Schonberger, V. and K. Cukier, 2013. “Big Data: A Revolution That Will Transform How We Live, Work, and Think”
³ Vigen, T., 2014. “Spurious Correlations,” www.tylervigen.com
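To make the correlation-versus-causation point concrete, here is a minimal Python sketch. It is an illustration only, not taken from the talk or from the cited sources: the two series are made up, share nothing but an upward trend over time, and the helper names (`pearson`, `first_differences`) are hypothetical. The series correlate strongly in levels, yet their period-to-period changes are uncorrelated.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def first_differences(series):
    """Period-to-period changes, which remove a shared linear trend."""
    return [later - earlier for earlier, later in zip(series, series[1:])]


# Two made-up series (assumption): each is a time trend plus its own
# unrelated periodic wiggle. Neither series influences the other.
periods = range(16)
series_a = [t + (t % 3) for t in periods]
series_b = [2 * t + (2 * t) % 5 for t in periods]

r_levels = pearson(series_a, series_b)
r_changes = pearson(first_differences(series_a), first_differences(series_b))

print(f"correlation of levels:  {r_levels:.3f}")
print(f"correlation of changes: {r_changes:.3f}")
```

With observational data the analyst can only look for such artifacts after the fact. In a simulation experiment the factors are set by design, so a change in the response can be traced to a change in the inputs.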
Large-scale computational experiments are transformative
“Petaflop machines like Roadrunner have the potential to fundamentally alter science and engineering…[allowing scientists to] perform experiments that would previously have been impractical.”
The New York Times, June 9, 2008
Petaflop = 1 quadrillion ops/second
Cost of “Roadrunner” = $133 million

Experimentation is hard:
“2^100 is forever” – Maj Gen Jasper Welch
Even with today’s most powerful computers, brute force exploration of 100 variables at 2 levels for a simulation that runs in one second would take many times the age of the universe… so we need to be smart!

Moore’s Law is not enough!
The “curse of dimensionality” cannot be solved by hardware alone.

Data farming helps overcome the curse of dimensionality…
With large-scale efficient experimental designs, we generate “better big data” and regularly study hundreds of factors for longer-running simulations in hours, days, or weeks on high-performance computing clusters…
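The brute-force arithmetic can be checked with a short Python sketch. The age of the universe (about 13.8 billion years) is an assumed round figure, and the design is a plain random Latin hypercube chosen only to show how a space-filling design scales with the number of runs rather than with 2^k. It is not the nearly orthogonal construction used in the SEED Center's work, and the function names and the choice of 512 design points are hypothetical.

```python
import random

SECONDS_PER_YEAR = 365.25 * 24 * 3600
AGE_OF_UNIVERSE_YEARS = 1.38e10  # assumed approximate value


def brute_force_years(num_factors, levels=2, seconds_per_run=1.0):
    """Years needed to run every combination of factor levels once."""
    runs = levels ** num_factors
    return runs * seconds_per_run / SECONDS_PER_YEAR


def random_latin_hypercube(num_points, num_factors, rng):
    """Random Latin hypercube on the unit cube.

    Each factor's range is cut into num_points equal strata and every
    stratum is used exactly once, in a random order per factor.
    """
    columns = []
    for _ in range(num_factors):
        strata = list(range(num_points))
        rng.shuffle(strata)
        columns.append([(s + 0.5) / num_points for s in strata])
    # Transpose so that each row is one design point (one simulation run).
    return [list(point) for point in zip(*columns)]


years = brute_force_years(100)
ratio = years / AGE_OF_UNIVERSE_YEARS
print(f"Full 2^100 factorial at 1 s/run: {years:.2e} years")
print(f"That is about {ratio:.1e} times the assumed age of the universe")

design = random_latin_hypercube(512, 100, random.Random(1))
print(f"Latin hypercube: {len(design)} runs cover {len(design[0])} factors")
```

The number of design points here is set by the analyst's computing budget, not by the number of factors, which is why designed experiments remain feasible when full enumeration does not.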
Simulation is different
[Figure: response surface complexity plotted against number of factors. Labeled regions: physical experiments; efficient R5 FF and CCD; simulation experiments.]
What is?
What if?
What matters?
What could be?
What should be?
How might we get there?

Analyst may be more “expensive” than the data
• Don’t focus on keeping the #design points or #reps small, if you can make runs in a reasonable time
Make few assumptions
• Retain flexibility in the design and analysis, “explore” the output to gain understanding: we advocate space-filling designs
You can always double check!
• Confirmation runs of your simulation can be made to see how well your metamodels perform at previously untested design points
Statistical significance is NOT practical importance
• You may throw out many metamodel terms with low p-values without sacrificing explanatory power or affecting the decision
“think big!” – factors, features, flexibility
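The “double check” advice can be sketched in a few lines of Python. Everything here is hypothetical: `toy_simulation` stands in for a real simulation model, the metamodel is a deliberately simple straight line fitted by least squares, and the design and confirmation points are arbitrary. The sketch shows the workflow only: fit a metamodel on the design points, then make confirmation runs at untested points and compare.

```python
def toy_simulation(x):
    """Hypothetical stand-in for one simulation run (mildly curved response)."""
    return 3.0 + 2.0 * x + 0.5 * x ** 2


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Step 1: run the "simulation" at the design points and fit the metamodel.
design_points = [0.0, 0.25, 0.5, 0.75, 1.0]
responses = [toy_simulation(x) for x in design_points]
intercept, slope = fit_line(design_points, responses)

# Step 2: confirmation runs at points that were not in the design.
confirmation_points = [0.1, 0.6, 0.9]
errors = [
    toy_simulation(x) - (intercept + slope * x) for x in confirmation_points
]
max_abs_error = max(abs(e) for e in errors)

print(f"metamodel: y = {intercept:.4f} + {slope:.4f} x")
print(f"largest confirmation error: {max_abs_error:.4f}")
```

Whether an error of that size matters depends on the decision at hand. Here the straight line omits the curvature term, and the confirmation runs show how much that omission costs in practice.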
SEED student impact…

Saving money: Major Chris Nannini, USA
“Analysis Of The Assignment Scheduling Capability For Unmanned Aerial Vehicles (ASC-U) Simulation Tool.” M.S. in Operations Research
“If the Division Commanders want a UAV at their level and have nothing else, we ought to give it to them.” – LTG Scott Wallace, Commanding General, U.S. Army V Corps, cited in Sinclair (2005)
“The UAV modeling…harvested $6 billion in savings and 6,000 to 10,000 billets, that’s a brigade’s worth of soldiers. Over 20 years that allowed us to avoid a cost of $20 billion.” – Michael F. Bauman (2007), Director of the United States Army Training and Doctrine Command Analysis Center.

New methods: LTC Tom Cioppa, USA
“Efficient Nearly Orthogonal and Space-Filling Experimental Designs for High-Dimensional Complex Models” Ph.D. in Operations Research

Helping the Fleet: LT Chad Kaiser, USN
“Air Defense Against UAS Kamikaze Saturation Attack” M.S. in Operations Research
U.S. Navy TACBUL AD 09-01, “USN Surface Weapon System Capabilities and Limitations against low slow flyers (Helicopters/Small Aircraft/UAVs).”

Graphs from a few examples
• STORM – U.S. Navy campaign analysis model
  – ~40MB of input data spread over 150 input files
  – A single replication takes hours to complete, yields tens or hundreds of GBs of output data (mix of database fields, large flat files)
  – Graphs shown come from a notional training scenario
  – See “Improving U.S. Navy campaign analysis with big data” by Morgan, Schramm, Smith, Lucas, McDonald, Sanchez, Sanchez, & Upton (2017), forthcoming in Interfaces
• Fleet management model (MATLAB)
  – Australia using for its naval helicopter fleet (30 year lifetime)
  – Exploring how results depend on different policies
  – Graphs shown are based on notional data
  – Marlow, Sanchez, & Sanchez (2015 MODSIM, 2017 forthcoming).
Morgan et al.: Improving U.S. Navy Campaign Analyses with Big Data
Interfaces, Articles in Advance, pp. 1–17

…scenario to be readily customized for other entities and conditions in other campaigns.

Visual Summary Tools
The new quick-look dashboard (Figure 2) informs the user how often objectives are met in each instance and is the starting point for exploring the response space. It displays scores of output measures across dozens of runs at a glance. Each row describes a campaign objective specified by the user. Hyperlinks allow researchers to dynamically access other analytic artifacts described below. In this example, Blue carrier losses of 0 is defined as success, whereas Blue carrier losses of 1 or more is defined as failure. Each cell contains the number of occurrences of the condition for that replication. The green or red color indicates if the threshold condition was met (green) or not met (red).

A similar outlier dashboard (not shown) presents analysts with a color-coded map that identifies runs in which discordant data occur for user-specified outcomes.

It is also informative to see how the key metrics relate to each other. The correlation plot of key metrics (Figure 3) can be used to examine a user-specified subset of the metrics. This figure shows two very strong positive correlations and one very strong negative correlation. A few potential insights into this notional scenario follow. For example, the positive correlation between the number of Blue aircraft lost and the number of Red advanced surface-to-air missile (SAM) sites destroyed suggests that destroying Red SAM sites, an important objective, comes at a cost. Additionally, there is a negative correlation between the number of amphibious ship losses and the amount of time Blue has air supremacy. One possibility is that the longer Blue has air supremacy, the more protection the amphibious ships receive. We also observe that the length of time Blue is able to achieve and hold air supremacy is positively correlated with Red carrier losses. These last two correlations are consistent with the conventional wisdom about the importance of achieving early air supremacy.

Condition, Event, and Resource Heatmaps
What, when, and where certain events and conditions took place is critical to understanding a simulated…
Multiple responses…
Figure 2. (Color online) The Quick-Look Dashboard Shows, in Aggregate, How Often the User-Defined Success Metrics Are Met
[Figure: QuickLook Dashboard. Columns are replication numbers and rows are the user-defined criteria. Each cell shows the metric’s value at the end of the run, and the cell color indicates which user-specified thresholds for success are met. Each criteria name links to that metric’s partition tree analysis. Callouts on the display read Fail, Pass, and In jeopardy. Criteria shown: Blue_Carrier_Losses, Blue_SurfaceShip_Losses, Blue_Sub_Losses, Blue_Amphib_Losses, BlueAirSupremacy, BlueAirSuperiority, RedAirSupremacy, RedAirSuperiority, carthageESGHasEnteredMed, carthageCBGHasEnteredMed, carthageESGHasArrivedOffRome, expeditionaryOpsHaveBegun, carthageESGHasReturnedFromBeach, gibralterMinesweepingHasStarted, seabaseIsComplete, cruisersHaveArrivedAtSeabase, seabaseNWScreenIsComplete, seabaseNEScreenIsComplete, isRedCarrierDead, seabaseSWScreenIsComplete, isESGInPort.]
Notes. The figure also shows the worst or best performance against these metrics. The replication numbers are not in numeric order because a clustering algorithm groups red cells together, making the dashboard easier to read by presenting less of a checkerboard display.
21 responses indicate whether or not the naval campaign has gone well for the notional Carthage empire (“Blue” side)
Delving deeper
Morgan et al.: Improving U.S. Navy Campaign Analyses with Big Data
Interfaces, Articles in Advance, pp. 1–17
Figure 6. (Color online) Resource Heatmap: The Horizontal Axis Represents Time (i.e., Days in the Campaign)…
[Figure: resource heatmap of Blue AA missile inventory levels, with a count scale (500; 1,000) for the number of replications. Rows include Casablanca Naval Station, Carthage South Carrier Strike Group, Carthage North Carrier Strike Group, Carthage Naval Base, Carthage Expeditionary Strike Group, and Carthage CLF.]
Figure 7. (Color online) Multidimensional Scaling Depiction of the Separation Between Two Clusters, Where the Clusters Are Determined by … Key Metrics
[Figure: replications by cluster (clusters 1 and 2), labeled by replication number.]
[Figure: correlation plot of key metrics. Variables: SubLosses, SurfaceShipLosses, AmphibLosses, CarrierLosses, C2_RedAdvSAMSitesDead_count, ACLosses, C2_isRedCarrierDead_count, C2_BlueAirSupremacy_count.]