Data Farming: Methods for the Present, Opportunities for the Future

Susan M. Sanchez
SEED Center for Data Farming
Operations Research Department
Naval Postgraduate School

ISIM 2017 Research Workshop, Durham, U.K.

Department of Defense Distribution Statement: Approved for public release; distribution is unlimited

Data Mining vs. Data Farming

• Miners seek valuable buried nuggets
  - Miners have no control over what’s there or how hard it is to separate it out
  - Data Mining seeks valuable information buried within massive amounts of data
• Farmers cultivate to maximize yield
  - Farmers manipulate the environment to their advantage: pest control, irrigation, fertilizer, etc.
  - Data Farming manipulates simulation models to advantage with designed experimentation

One way of thinking of big data…any data set that pushes against the limits of currently available analysis technology

Correlation ≠ Causation

“Wall Streeters have the fastest computers, most sophisticated software and biggest databases money can buy, and yet many failed to see the 2008 crash coming. The hope that Big Data will make economics and other social sciences truly scientific – that is, precise and predictive – remains, for now, a fantasy.” (1)

“Harnessing vast quantities of data rather than a small portion, and privileging more data of less exactitude, opens the door to new ways of understanding. It leads society to abandon its time-honored preference for causality, and in many instances tap the benefits of correlation.” (2)

[Figure; surviving label: Correlation = 0.947] (3)

Simulators don’t have to choose!

(1) Hogan, J., 2014. “So Far, Big Data is Small Potatoes”
(2) Mayer-Schonberger, V. and K. Cukier, 2013. “Big Data: A Revolution That Will Transform How We Live, Work, and Think”
(3) Vigen, T., 2014. “Spurious Correlations,” www.tylervigen.com
Large-scale computational experiments are transformative

“Petaflop machines like Roadrunner have the potential to fundamentally alter science and engineering…[allowing scientists to] perform experiments that would previously have been impractical.”
The New York Times, June 9, 2008

Petaflop = 1 quadrillion ops/second
Cost of “Roadrunner” = $133 million

Experimentation is hard: “2^100 is forever” - Maj Gen Jasper Welch

Even with today’s most powerful computers, brute force exploration of 100 variables at 2 levels for a simulation that runs in one second would take many times the age of the universe… so we need to be smart!

Moore’s Law is not enough! The “curse of dimensionality” cannot be solved by hardware alone.

Data farming helps overcome the curse of dimensionality… With large-scale efficient experimental designs, we generate “better big data” and regularly study hundreds of factors for longer-running simulations in hours, days, or weeks on high-performance computing clusters…

Simulation is different

[Figure: response surface complexity plotted against number of factors, with regions labeled “Physical Experiments”, “Efficient R5 FF and CCD”, and “Simulation Experiments”]
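The brute-force claim on the “Experimentation is hard” slide can be checked with a few lines of arithmetic. The sketch below is only a back-of-the-envelope illustration: the age of the universe (about 13.8 billion years), the 365.25-day year, and the function names are assumptions introduced here, not figures or code from the talk.

```python
# Back-of-the-envelope check of the "2^100 is forever" claim.
# Assumed constants (not from the slides): universe age and year length.
SECONDS_PER_YEAR = 365.25 * 24 * 3600
AGE_OF_UNIVERSE_YEARS = 13.8e9


def full_factorial_runs(num_factors: int, levels: int = 2) -> int:
    """Number of design points in a full factorial: levels ** num_factors."""
    return levels ** num_factors


def years_to_run(num_runs: int, seconds_per_run: float = 1.0,
                 workers: int = 1) -> float:
    """Wall-clock years to finish num_runs, split evenly across workers."""
    return num_runs * seconds_per_run / workers / SECONDS_PER_YEAR


# 100 factors at 2 levels, one second per run, as on the slide.
runs = full_factorial_runs(100)
years = years_to_run(runs)
universe_ages = years / AGE_OF_UNIVERSE_YEARS

# Hypothetical hardware fix: one million machines running in parallel.
parallel_ages = years_to_run(runs, workers=1_000_000) / AGE_OF_UNIVERSE_YEARS

print(f"design points:            {runs:.3e}")
print(f"years on one machine:     {years:.3e}")
print(f"universe ages (1 worker): {universe_ages:.3e}")
print(f"universe ages (1e6 workers): {parallel_ages:.3e}")
```

Under these assumptions a single machine would need trillions of universe ages, and even a million parallel workers would still need millions of them, which is the sense in which hardware alone does not solve the problem.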
Simulation is different

What is? What if? What matters? What could be? What should be? How might we get there?

“think big!”: factors, features, flexibility

Analyst may be more “expensive” than the data
• Don’t focus on keeping the number of design points or replications small, if you can make runs in a reasonable time

Make few assumptions
• Retain flexibility in the design and analysis, “explore” the output to gain understanding: we advocate space-filling designs

You can always double check!
• Confirmation runs of your simulation can be made to see how well your metamodels perform at previously untested design points
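To make “space-filling designs” concrete, here is a minimal sketch of one simple kind: a random Latin hypercube. This is an illustrative stand-in, not necessarily the design family the talk has in mind; the function names, the seed, and the choice of 512 design points are all assumptions made for this example.

```python
import random


def latin_hypercube(num_points, num_factors, seed=0):
    """Random Latin hypercube on the unit cube.

    Each factor's range [0, 1) is cut into num_points equal bins, and
    every bin is sampled exactly once per factor.
    """
    rng = random.Random(seed)
    columns = []
    for _ in range(num_factors):
        bins = list(range(num_points))
        rng.shuffle(bins)  # random pairing of bins across factors
        columns.append([(b + rng.random()) / num_points for b in bins])
    # Transpose: one row per design point, one entry per factor.
    return [list(point) for point in zip(*columns)]


def scale_design(design, lows, highs):
    """Map unit-cube design points onto each factor's [low, high] range."""
    return [
        [lo + x * (hi - lo) for x, lo, hi in zip(point, lows, highs)]
        for point in design
    ]


# Hypothetical experiment size: 100 factors, 512 design points.
NUM_FACTORS = 100
NUM_POINTS = 512
design = latin_hypercube(NUM_POINTS, NUM_FACTORS)

print(f"{len(design)} design points cover {NUM_FACTORS} factors")
print(f"a 2-level full factorial would need {2 ** NUM_FACTORS:.3e} points")
```

Each factor is sampled across its whole range rather than only at two extremes, so the same runs can support several different metamodels. Confirmation runs at fresh points (for example, a second hypercube generated with a different seed) are one way to check how well a fitted metamodel predicts at untested design points.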
