Data Farming: Methods for the Present, Opportunities for the Future

Susan M. Sanchez
SEED Center for Data Farming
Operations Research Department
Naval Postgraduate School

ISIM 2017 Research Workshop, Durham, U.K.

Department of Defense Distribution Statement: Approved for public release; distribution is unlimited

Data Mining vs. Data Farming

• Miners seek valuable buried nuggets
  - Miners have no control over what’s there or how hard it is to separate it out
  - Data Mining seeks valuable information buried within massive amounts of data
• Farmers cultivate to maximize yield
  - Farmers manipulate the environment to their advantage: pest control, irrigation, fertilizer, etc.
  - Data Farming manipulates simulation models to advantage with designed experimentation

One way of thinking of big data: any data set that pushes against the limits of currently available analysis technology

Correlation ≠ Causation

“Wall Streeters have the fastest computers, most sophisticated software and biggest databases money can buy, and yet many failed to see the 2008 crash coming. The hope that Big Data will make economics and other social sciences truly scientific – that is, precise and predictive – remains, for now, a fantasy.” [1]

“Harnessing vast quantities of data rather than a small portion, and privileging more data of less exactitude, opens the door to new ways of understanding. It leads society to abandon its time-honored preference for causality, and in many instances tap the benefits of correlation.” [2]

Correlation = 0.947 [3]

Simulators don’t have to choose!

[1] Hogan, J., 2014. “So Far, Big Data is Small Potatoes”
[2] Mayer-Schonberger, V. and K. Cukier, 2013. “Big Data: A Revolution That Will Transform How We Live, Work, and Think”
[3] Vigen, T., 2014. “Spurious Correlations,” www.tylervigen.com

Large-scale computational experiments are transformative

“Petaflop machines like Roadrunner have the potential to fundamentally alter science and engineering…[allowing scientists to] perform experiments that would previously have been impractical.”
The New York Times, June 9, 2008

Petaflop = 1 quadrillion ops/second
Cost of “Roadrunner” = $133 million

Experimentation is hard: “2^100 is forever” (Maj Gen Jasper Welch)

Even with today’s most powerful computers, brute force exploration of 100 variables at 2 levels for a simulation that runs in one second would take many times the age of the universe… so we need to be smart!

Moore’s Law is not enough!

The “curse of dimensionality” cannot be solved by hardware alone.

Data farming helps overcome the curse of dimensionality…

With large-scale efficient experimental designs, we generate “better big data” and regularly study hundreds of factors for longer-running simulations in hours, days, or weeks on high-performance computing clusters…

Simulation is different

[Figure: Number of Factors versus Response Surface Complexity, with regions labeled “Physical Experiments”, “Efficient R5 FF and CCD”, and “Simulation Experiments”]

Simulation is different

What is? What if? What matters? What could be? What should be? How might we get there?

“think big!”: factors, features, flexibility

Analyst may be more “expensive” than the data
• Don’t focus on keeping the #design points or #reps small, if you can make runs in a reasonable time

Make few assumptions
• Retain flexibility in the design and analysis, “explore” the output to gain understanding: we advocate space-filling designs

You can always double check!
• Confirmation runs of your simulation can be made to see how well your metamodels perform at previously untested design points
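
The points above (the cost of brute force, space-filling designs, and confirmation runs) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than material from the talk: toy_simulation is a hypothetical stand-in for a real simulation model, a plain random Latin hypercube is swapped in as one simple example of a space-filling design, the metamodel is assumed to be a main-effects linear regression, and the design sizes, the random seed, and the approximate age of the universe are arbitrary choices.

```python
import random

SECONDS_PER_YEAR = 365.25 * 24 * 3600
AGE_OF_UNIVERSE_YEARS = 1.38e10  # approximate, used only for scale


def brute_force_years(n_factors, levels=2, seconds_per_run=1.0):
    """Years needed to run every combination of a full factorial design."""
    return levels ** n_factors * seconds_per_run / SECONDS_PER_YEAR


def latin_hypercube(n_points, n_factors, rng):
    """Random Latin hypercube on [0, 1): each factor visits each of n_points bins once."""
    columns = []
    for _ in range(n_factors):
        bins = list(range(n_points))
        rng.shuffle(bins)
        columns.append([(b + rng.random()) / n_points for b in bins])
    return [list(row) for row in zip(*columns)]


def toy_simulation(x, rng):
    """Hypothetical stand-in for a simulation model: linear signal plus noise."""
    return 3.0 + 2.0 * x[0] - 1.5 * x[1] + 0.5 * x[2] + rng.gauss(0.0, 0.05)


def fit_linear_metamodel(design, responses):
    """Least-squares fit of y = b0 + sum(b_j * x_j) via the normal equations."""
    rows = [[1.0] + list(x) for x in design]
    p = len(rows[0])
    # Augmented matrix [X'X | X'y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * y for r, y in zip(rows, responses))] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(p):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [v - factor * w for v, w in zip(a[r], a[col])]
    return [a[i][p] / a[i][i] for i in range(p)]


def predict(coefs, x):
    """Metamodel prediction at design point x."""
    return coefs[0] + sum(b * v for b, v in zip(coefs[1:], x))


rng = random.Random(2017)
years = brute_force_years(100)

# Space-filling design, simulation runs, and metamodel fit.
design = latin_hypercube(n_points=65, n_factors=3, rng=rng)
responses = [toy_simulation(x, rng) for x in design]
coefs = fit_linear_metamodel(design, responses)

# Confirmation runs at previously untested design points.
confirm_points = latin_hypercube(n_points=20, n_factors=3, rng=rng)
errors = [abs(toy_simulation(x, rng) - predict(coefs, x)) for x in confirm_points]
max_error = max(errors)

print(f"Brute force, 100 factors at 2 levels: {years:.2e} years")
print(f"Multiples of the age of the universe: {years / AGE_OF_UNIVERSE_YEARS:.2e}")
print("Fitted coefficients:", [round(b, 2) for b in coefs])
print(f"Largest confirmation-run error: {max_error:.3f}")
```

In this toy setting 65 design points are enough to recover the main effects, and the confirmation runs give a direct check on the metamodel; a real study would choose the design and the metamodel form to suit the simulation at hand.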
