CHANCE
Vol. 31, No. 2, 2018
Using Data to Advance Science, Education, and Society

Special Issue on Defense & National Security

Including...

Informing the Warfighter— Why Statistical Methods Matter in Defense Testing

Disease Surveillance: Detecting and Tracking Outbreaks Using Statistics



ARTICLES

4 Informing the Warfighter—Why Statistical Methods Matter in Defense Testing
  Laura J. Freeman and Catherine Warner

12 Disease Surveillance: Detecting and Tracking Outbreaks Using Statistics
  Ronald D. Fricker, Jr. and Steven E. Rigdon

23 Adversarial Risk Analysis
  David Banks

30 Adaptive Testing of DoD Systems with Binary Response
  Douglas M. Ray and Paul A. Roediger

38 Developments in the Statistical Modeling of Military Recruiting
  Samuel E. Buttrey, Lyn R. Whitaker, and Jonathan K. Alt

45 Data Farming: Reaping Insights from Simulation Experiments
  Susan M. Sanchez

53 Updated Guidelines, Updated Curriculum: The GAISE College Report and Introductory Statistics for the Modern Student
  Beverly L. Wood, Megan Mocko, Michelle Everson, Nicholas J. Horton, and Paul Velleman

COLUMNS

60 The Odds of Justice (Mary W. Gray, Column Editor)
  Whom Shall We Kill? How Shall We Kill Them?

62 Visual Revelations (Howard Wainer, Column Editor)
  Ancient Visualizations
  Howard Wainer and Michael Friendly

DEPARTMENTS

3 Editor's Letter

Abstracted/indexed in Academic OneFile, Academic Search, ASFA, CSA/Proquest, Current Abstracts, Current Index to Statistics, Gale, Google Scholar, MathEDUC, Mathematical Reviews, OCLC, Summon by Serial Solutions, TOC Premier, Zentralblatt Math.

Cover image: ThinkStock

AIMS AND SCOPE
CHANCE is designed for anyone who has an interest in using data to advance science, education, and society. CHANCE is a non-technical magazine highlighting applications that demonstrate sound statistical practice. CHANCE represents a cultural record of an evolving field, intended to entertain as well as inform.

EXECUTIVE EDITOR
Scott Evans, Harvard School of Public Health, Boston, Massachusetts ([email protected])

ADVISORY EDITORS
Sam Behseta, California State University, Fullerton
Michael Larsen, St. Michael's College, Colchester, Vermont
Michael Lavine, University of Massachusetts, Amherst
Dalene Stangl, Carnegie Mellon University, Pittsburgh, Pennsylvania
Hal S. Stern, University of California, Irvine

EDITORS
Jim Albert, Bowling Green State University, Ohio
Phil Everson, Swarthmore College, Pennsylvania
Dean Follman, NIAID and Biostatistics Research Branch, Maryland
Toshimitsu Hamasaki, Office of Biostatistics and Data Management, National Cerebral and Cardiovascular Research Center, Osaka, Japan
Jo Hardin, Pomona College, Claremont, California
Tom Lane, MathWorks, Natick, Massachusetts
Michael P. McDermott, University of Rochester Medical Center, New York
Mary Meyer, Colorado State University at Fort Collins
Kary Myers, Los Alamos National Laboratory, New Mexico
Babak Shahbaba, University of California, Irvine
Lu Tian, Stanford University, California

COLUMN EDITORS
Di Cook, Iowa State University, Ames (Visiphilia)
Chris Franklin, University of Georgia, Athens (K-12 Education)
Andrew Gelman, Columbia University, New York, New York (Ethics and Statistics)
Mary Gray, American University, Washington, D.C. (The Odds of Justice)
Shane Jensen, Wharton School at the University of Pennsylvania, Philadelphia (A Statistician Reads the Sports Pages)
Nicole Lazar, University of Georgia, Athens (The Big Picture)
Bob Oster, University of Alabama, Birmingham, and Ed Gracely, Drexel University, Philadelphia, Pennsylvania (Teaching Statistics in the Health Sciences)
Christian Robert, Université Paris-Dauphine, France (Book Reviews)
Aleksandra Slavkovic, Penn State University, University Park (O Privacy, Where Art Thou?)
Dalene Stangl, Carnegie Mellon University, Pittsburgh, Pennsylvania, and Mine Çetinkaya-Rundel, Duke University, Durham, North Carolina (Taking a Chance in the Classroom)
Howard Wainer, National Board of Medical Examiners, Philadelphia, Pennsylvania (Visual Revelations)

SUBSCRIPTION INFORMATION
CHANCE (ISSN: 0933-2480) is co-published quarterly in February, April, September, and November for a total of four issues per year by the American Statistical Association, 732 North Washington Street, Alexandria, VA 22314, USA, and Taylor & Francis Group, LLC, 530 Walnut Street, Suite 850, Philadelphia, PA 19106, USA.

U.S. Postmaster: Please send address changes to CHANCE, Taylor & Francis Group, LLC, 530 Walnut Street, Suite 850, Philadelphia, PA 19106, USA.

ASA MEMBER SUBSCRIPTION RATES
ASA members who wish to subscribe to CHANCE should go to ASA Members Only (www.amstat.org/membersonly), select the "My Account" tab, and then "Add a Publication." ASA members' publication period will correspond with their membership cycle.

SUBSCRIPTION OFFICES
USA/North America: Taylor & Francis Group, LLC, 530 Walnut Street, Suite 850, Philadelphia, PA 19106, USA. Telephone: 215-625-8900; Fax: 215-207-0050. UK/Europe: Taylor & Francis Customer Service, Sheepen Place, Colchester, Essex, CO3 3LP, United Kingdom. Telephone: +44-(0)-20-7017-5544; Fax: +44-(0)-20-7017-5198. For information and subscription rates, please email [email protected] or visit www.tandfonline.com/pricing/journal/ucha.

OFFICE OF PUBLICATION
American Statistical Association, 732 North Washington Street, Alexandria, VA 22314, USA. Telephone: (703) 684-1221. Editorial Production: Megan Murphy, Communications Manager; Valerie Nirala, Publications Coordinator; Ruth E. Thaler-Carter, Copyeditor; Melissa Gotherman, Graphic Designer.
Taylor & Francis Group, LLC, 530 Walnut Street, Suite 850, Philadelphia, PA 19106, USA. Telephone: (215) 625-8900; Fax: (215) 207-0047.

Copyright © 2018 American Statistical Association. All rights reserved. No part of this publication may be reproduced, stored, transmitted, or disseminated in any form or by any means without prior written permission from the American Statistical Association. The American Statistical Association grants authorization for individuals to photocopy copyrighted material for private research use on the sole basis that requests for such use are referred directly to the requester's local Reproduction Rights Organization (RRO), such as the Copyright Clearance Center (www.copyright.com) in the United States or The Copyright Licensing Agency (www.cla.co.uk) in the United Kingdom. This authorization does not extend to any other kind of copying by any means, in any form, and for any purpose other than private research use. The publisher assumes no responsibility for any statements of fact or opinion expressed in the published papers. The appearance of advertising in this journal does not constitute an endorsement or approval by the publisher, the editor, or the editorial board of the quality or value of the product advertised or of the claims made for it by its manufacturer.

RESPONSIBLE FOR ADVERTISEMENTS
For advertising inquiries, please contact [email protected]. Printed in the United States on acid-free paper.

WEBSITE
http://chance.amstat.org

EDITOR'S LETTER

Scott Evans

Dear CHANCE Colleagues,

Threats to national security come in many forms. In 2016, Russians hacked the United States election. On September 11, 2001, 19 militants associated with the extremist group al-Qaeda hijacked four airplanes, killing nearly 3,000 people. In the 1990s, President Clinton was convinced that the global spread of AIDS was reaching catastrophic dimensions and formally designated HIV as a threat to United States national security, since it could threaten the stability of foreign governments, touch off ethnic wars, and undo recent advancements in building free-market democracies abroad.

Defense and national security is the theme of this special issue of CHANCE. Six articles discuss various aspects of national security and the key role statistics is playing in addressing them. Dr. David Banks and Dr. Alyson Wilson served as guest editors for the special issue.

In the first article, "Informing the Warfighter—Why Statistical Methods Matter in Defense Testing," Laura Freeman and Catherine Warner discuss implementing statistical design and analyses in the evaluation of Department of Defense (DoD) operational systems. Ron Fricker and Steven Rigdon then discuss surveillance methods applied to detecting and tracking deadly diseases such as influenza (swine flu or bird flu), Ebola, Zika, and SARS. David Banks discusses how adversarial risk analysis, a modeling strategy that incorporates an opponent's reasoning, can be applied to a range of problems in counterterrorism. Douglas Ray and Paul Roediger then discuss adaptive testing of DoD systems with binary response. The evolution of statistical modeling of military recruiting is the topic of an article by Samuel Buttrey, Lyn Whitaker, and Jonathan Alt. Susan Sanchez discusses the use of data farming, which applies tools and techniques for the design and analysis of large simulation experiments to defense problems.

In an independent article, Beverly Wood, Megan Mocko, Michelle Everson, Nick Horton, and Paul Velleman evaluate clarifications and updates to the six recommendations for teaching from the original, foundational GAISE College Report. They consider evolutions affecting the teaching and practice of statistics, including the rise of data science, an increase in the number of students studying statistics, the increasing availability of data, and advances in science and technology. They discuss how the original recommendations can be clarified by acknowledging these developments.

In The Odds of Justice column, Mary Gray evaluates the death penalty and the role statistics is playing, and can play, in evaluating its appropriateness. In Visual Revelations, Howard Wainer and Michael Friendly take a historical look at visualization and the profound impact that visual communication has had, going back to ancient civilizations.

Scott Evans

Informing the Warfighter—Why Statistical Methods Matter in Defense Testing

Laura J. Freeman and Catherine Warner

The Department of Defense (DoD) acquires some of the world's most-complex systems. These systems push the limits of existing technology and span a wide range of domains. They provide the technology that enables the U.S. military to conduct operations. The diversity of system types includes fighter aircraft, submarines, stealth bombers, ground vehicles, transport planes, radars, radar jammers, satellites, data management tools, and enterprise management systems, among many others. They can be repairable or single-use; many are software-intensive.

Data and statistics are key to assessing these systems. Each one must be evaluated to determine whether it will be effective in enabling military users to accomplish missions. Early in a program's development, testing evaluates the performance of system components and sub-systems. An example of sub-system testing involves engine reliability for a new aircraft. Later, engineering prototypes of full systems are used to evaluate system performance and reliability, ensure they can operate with other systems, and understand cybersecurity vulnerabilities. For example, a new aircraft may be taken on a simple flight to establish the range at which its onboard sensors can see a threat in the distance.

This testing of sub-systems and engineering prototypes is called developmental testing. The goals of developmental testing include both improving system design and informing decisions about production readiness.

Operational testing is the next phase of testing. As required by law, this phase must be conducted before a system can either start full-rate production or be fielded to military users. Operational testing involves military operators using the representative production system to create realistic combat scenarios. The users complete missions in operationally realistic environments—they test the system in an environment as close as possible to what it would experience in real combat. For example, new tanks are tested in the desert under high temperatures and in the presence of a realistic

Figure 1. The diversity of defense systems (clockwise from top left): amphibious assault vehicle; SSN 774 Virginia Class submarine; MV-22 Osprey; F-22A advanced tactical fighter. Images obtained from the DOT&E Annual Report.

opposing force; conditions similar to combat in the Middle East.

The two primary areas of evaluation from an operational test are effectiveness and suitability. Operational effectiveness considers whether a unit equipped with a system can accomplish missions. Operational suitability is the degree to which a system can be satisfactorily placed in field use.

Operational testing is fundamentally different from developmental testing. Consider the example of testing a new family car model. Developmental testing might consist of a series of tests that assess engine reliability, fuel economy, radio system connectivity, Bluetooth connectivity, crash safety, etc. Operational testing, on the other hand, evaluates whether the family can use this car to get to the grocery store, work, and the beach for vacation. Can the family get to all these places reliably? Can Dad use the navigation system correctly to get to a new doctor?

While testing a new family car in this fashion might seem like an expensive endeavor, dozens of examples demonstrate how the operational context of a test was a key element in identifying problems before fielding military systems.

Examples include:

—The environment can degrade performance (sometimes more than expected). Testing an airborne mine detection system provides a good example. It was designed to be employed from a helicopter to detect, classify, and localize shallow moored mines and floating mines on the water's surface using a laser. The Navy performed developmental testing in clear water in the Gulf of Mexico near Panama City, and performance was acceptable.

The Navy then selected a location with poor water clarity for the operational test—conditions similar to the Persian Gulf. As expected, performance was worse than in clear water; more importantly, performance degraded far more than expected. This proved that the pre-test predictions and the tactics guides were inaccurate. Without testing in operational conditions, these limitations would not have been discovered until the system was used in combat.

—Numbers and configurations of users can degrade performance. Problems with networked radio systems frequently arise due to both the scale of operational units

(number of users) and operational environments (terrain), which can limit line of sight between radios. Testing with operationally realistic units who are conducting actual missions not only helps to reveal these problems, but also provides the ability to measure their impact on the mission outcome.

Another example is an aircraft tactical radio system. In this case, operational scenarios with multiple aircraft were necessary to discover problems, with multiple participants achieving a communication connection that was synchronized in real time. The operational test used a varied formation size. The problem occurred stochastically and as a function of the number of aircraft linking together. Previous testing of the link with only two aircraft did not reveal the problem; increasing the number of aircraft to operationally realistic configurations (four aircraft per mission) found that the issue occurred frequently.

—The operational environment induces reliability failure modes that previously were unknown. In the case of the Miniature Air-Launched Decoy (MALD), storing an expendable air-launched decoy in operational conditions—rainstorms in Guam—resulted in water getting into the fuel bladder, which ultimately resulted in the MALD failing to deploy. The discovery of a previously unknown failure mode was the direct result of putting the system into operational conditions not previously captured in laboratory testing.

Implementing Statistical Design and Analysis in Operational Testing

In recent years, DoD guidance has directed that a statistical approach be taken when testing and evaluating DoD systems in operational testing. Specifically, the initial guidance focused on using design of experiments (DoE) for test planning (Guidance on the use of DoE in OT&E, 2010).

This is a relatively new approach to designing operational tests. Common test design approaches used in the past included specialized or singular combat scenarios, changing one test condition at a time, replicating a single condition specified by requirements, conducting case studies, and avoiding control over test conditions (termed "free play").

While these historical approaches produced tests that were operationally realistic, they lacked the scientific process needed to ensure that testing was both efficient and able to characterize system performance in a diverse range of conditions. For complex defense systems, performance often depends on interactions between independent variables. Statisticians know that historical test strategies typically are inadequate to support the estimation of second-order or higher interactions.

In 1998, the National Research Council reviewed test strategies of the time and concluded, "Current practices in defense testing and evaluation do not take full advantage of the benefits available from the use of state-of-the-art statistical methodology," and that "[s]tate-of-the-art methods for experimental design should be routinely used in designing operational tests."

However, it was not immediately obvious how DoE would apply to operational testing. A unique aspect of operational testing, especially when considered in combination with the application of DoE, is that there often are many uncontrollable variables. Operational users—human beings—introduce variability; the operational environment introduces variability, such as changes in the weather; and the evolving context of the mission introduces variability.

Another aspect of operational testing worth highlighting is the criticality of human-systems interactions and their impact on mission accomplishment. Because hardware and software cannot accomplish missions alone, operators are a critical component of military systems. Systems that are overly complex introduce failures and force the services to invest in lengthy training programs to mitigate problems that arise because of poor interface design.

To address the challenges of applying DoE to operational testing, DOT&E and IDA developed numerous case studies to illustrate the benefits. These case studies have shown that statistical methodologies are essential to constructing defensible test programs. In spite of the diversity of system types, statistical tools are universally applicable to all systems in both developmental and operational testing.

Design of experiments provides the tools to span this complex area with a defensible approach. It is often a challenge to cover multiple missions and balance them with limited test resources. Statistical power analysis and other tools for assessing test adequacy have provided methods for ensuring that expensive tests will provide the information needed. Statistical models provide the tools to characterize outcomes across complex operational spaces and allow the data to inform that characterization.

Three examples show how statistical methods have provided:

1. Defensible rationales for test adequacy.

2. Efficient test plans.

3. The ability to characterize capability throughout operational conditions.

Example 1: Defensible Rationale for Test Adequacy—Long-range Anti-Ship Missile (LRASM)

A credible rationale for test adequacy is always important. Defense testing is expensive. Flying aircraft, firing missiles, and reserving space on test ranges all cost money. More than just the material costs, however, it also takes time and coordination to execute an operationally realistic test.

A heavy logistical burden is also associated with collecting every data point. Operational tests require the simultaneous coordination of test ranges, military service members who operate and maintain the systems, and the systems themselves. Therefore, once these resources come together, it is important to be sure that a test captures enough information to inform decision-makers on whether to acquire and field the system, while not over-testing. Researchers have to find the sweet spot of testing enough to inform the decision-maker while not spending any more resources than necessary, and must be able to find that spot consistently in hundreds of different systems.

Experimental design and statistical power calculations have proved to be essential tools for the DoD test community. In many cases, resources are extremely limited; using experimental design makes it possible to maximize the impact of those resources.

LRASM is a long-range, precision-guided anti-ship missile. It uses multiple sensors to find a target, a data link to communicate with the aircraft that launched the missile, and an enhanced Global Positioning System to detect and destroy specific targets within a group of numerous ships at sea.

As can be imagined, each of these technologies is expensive, and only a limited number of LRASMs will be produced. Missiles used for testing are not available for fielded use, making it important to use only the minimal number of missiles in testing needed to obtain important information.

A key operational contribution of LRASM is its ability to target a specific ship out of many at sea. To test that capability thoroughly requires multiple ships at sea, a range in which it is safe to launch missiles, and a launch platform for the weapon. This is no small feat to coordinate.

Due to the inherent costs and limitations of testing all weapons, modeling and simulation (M&S) often are used to fill gaps in knowledge from live testing. Sanchez (2017) discusses different design approaches for M&S in her article in this issue of CHANCE. M&S is also important here because the operational shots have a dual purpose: getting enough live data to both identify problems in operations and support the validation of the modeling and simulation.

The Navy was considering reducing the number of weapons for testing by half. Using experimental design techniques, the test team was able to show clearly that the proposed reduction excluded important aspects of the operational engagements that looked at different target ranges and aspect angles, which could affect success rates.

The reduction in shots could be mitigated by using M&S in those ranges and aspect angles, but doing so requires that the M&S be validated for operational evaluation. Specifically, the usefulness and limitations of the M&S should be characterized, and uncertainty should be quantified to the extent possible by comparing the M&S to live data.

Therefore, the test team conducted a statistical power calculation comparing free-flight data and M&S outcomes using Fisher's combined probability test. Their analysis showed that the planned free-flight program provided enough information to validate the M&S at an acceptable level of risk, but cutting the shots in half would not provide adequate information (adequate statistical power) to detect differences between free-flight testing and the M&S. Reducing the number of shots risked mischaracterizing the performance of the weapon in the operational test space by using an inadequately validated model.

Through statistical analysis techniques, the test team was able to discuss clearly the trade-space between the two test designs and determine that the existing design should not be reduced, because it provided a minimally adequate test for assessing weapon performance and validating the M&S.

Example 2: Efficient Test Plans that Cover the Operational Envelope—F-35 Joint Strike Fighter

Defense systems are often designed to be used in multiple missions, and each of those missions may cover a complex operating space. Being able to cover the full operating environment efficiently is a core challenge of planning a defensible operational test. Gauging the right amount of testing involves more than simply determining the number of test points; equally important is the placement of those points across the region of operability. Placement of the points is the most-important aspect of determining whether the testing will be adequate to support the goals of

Figure 2. F-35C in flight. Source: DOT&E Annual Report.

the analysis. The goal of collecting the right data in the right locations is to understand where systems work and to what extent.

The F-35 is a multi-role fighter aircraft being produced in three variants for the Air Force, Marine Corps, and Navy. Its multi-role nature covers many diverse missions, including air-to-surface attack, aerial reconnaissance, close air support, offensive counter air, defensive counter air, destruction and suppression of enemy air defenses, anti-surface warfare, cruise missile defense, and combat search and rescue.

The three aircraft variants, a range of potential ground and air threats, various weapon loads, the need to test during both day and night, the movement of potential targets, and the quality of information provided to the aircraft all further compound this complexity. Arguably, it is the most-complex mission and environment that ever has been considered in a single operational test.

The philosophies and methods inherent to experimental design provided the necessary framework for covering such a complex operational space defensibly. The overarching approach for the F-35 initial operational test, which remains to be conducted, was to create detailed test designs for evaluating each of the core mission areas by defining appropriate, measurable response variables that correspond to the operational effectiveness of each mission area.

The test team divided the operational space—using DoE concepts—into factors that would affect the response variables, such as the type of ground threat or the number and types of air threats, and varied those factors to ensure coverage of where the F-35 may be used in combat.

The test team also used the principles behind DoE to span the operational space efficiently. For example, they blocked out the design by dividing the threat continuum into categories and then correlated the threat coverage blocks with appropriate mission areas. They ensured coverage of key capabilities by focusing each capability assessment in the most-relevant mission area.

Finding, tracking, and engaging moving ground targets are covered in only two of the mission areas, but the performance assessment of the radar in these two mission areas will enable developing inferences across all of the mission areas.

Experimental design enabled the test team to adequately cover nine core mission areas, multiple operational capabilities, and multiple factors within each mission area in a combined total of 110 trials. While a very large test, this will provide information on F-35 capabilities and how those capabilities translate into operational outcomes.

Figure 3. A historical analysis approach: calculating average performance across each condition, or a global average.

By treating the variant of the F-35 as a factor (independent variable) in the test design, the testers also were able to leverage relevant performance and mission-level data across variants, resulting in a reduction in required sorties when compared to previous test designs on legacy platforms.

Example 3: Statistical Analysis to Characterize Capability in the Operational Envelope—Tracking a Moving Target

It is imperative not only to cover the operational space in testing, but also to analyze the resulting data appropriately to understand where systems work, to what extent, and how much precision there is in the conclusions. After conducting the test, statistical analysis methods provide a defensible data analysis approach.

These empirical models allow for objective conclusions based on observed data. Parametric regression methods allow maximizing the information gained from test data, while non-parametric methods can provide a robust assessment of the data that is free from model assumptions. Bayesian methods provide avenues for integrating additional sources of information.

The following example is for a system whose purpose is to maintain a lock on a moving target. If the system can maintain the track for the desired period of time, the test trial is scored as a success; if the system drops the track at any point, the test trial is scored as a failure. The purpose of the test was to characterize the probability of maintaining track across all the operating conditions.

The factors that drive the probability that the system is able to successfully maintain track include:

• Time of day (day/night)
• Target size (small/large)
• Target speed (slow/fast)

Figure 3 shows a traditional historical analysis in which average proportions were calculated for all conditions (far right, roll-up) and for common conditions (e.g., all trials against large targets). The selected variables and levels for the binning often are designated by test team expertise.
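The binned-average bookkeeping behind that traditional approach can be sketched with invented data (the eight trials below are hypothetical, chosen to echo Figure 3, not taken from the actual test):

```python
# Hypothetical track/no-track outcomes, one per condition:
# (time_of_day, target_size, target_speed, success)
trials = [
    ("day",   "small", "slow", 1), ("day",   "small", "fast", 1),
    ("day",   "large", "slow", 1), ("day",   "large", "fast", 1),
    ("night", "small", "slow", 1), ("night", "small", "fast", 1),
    ("night", "large", "slow", 1), ("night", "large", "fast", 0),
]

FACTORS = ("time_of_day", "target_size", "target_speed")

def roll_up(trials, **conditions):
    """Average success proportion over all trials matching the given conditions."""
    matches = [t[-1] for t in trials
               if all(t[FACTORS.index(name)] == level
                      for name, level in conditions.items())]
    return sum(matches) / len(matches)

print(roll_up(trials))                       # global roll-up: 7/8 = 0.875
print(roll_up(trials, target_size="large"))  # all large targets: 3/4 = 0.75
print(roll_up(trials, time_of_day="night",
              target_size="large", target_speed="fast"))  # the bad cell: 0.0
```

The headline roll-up of 0.875 looks reassuring, yet one cell (large, fast targets at night in this toy data) fails outright. That is exactly the kind of degradation that averaging conceals.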

Figure 4. A powerful analysis approach: logistic regression enables a full characterization of system performance.
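The kind of regression behind Figure 4 can be illustrated in miniature. The cell counts below are invented, the article's three factors are collapsed to two for brevity, and the fitting routine (plain gradient ascent on the binomial log-likelihood) is a sketch of the approach rather than the actual analysis, which would use standard statistical software and report confidence intervals:

```python
import math

# Invented aggregated outcomes: (night, large, successes, trials)
cells = [(0, 0, 18, 20), (0, 1, 17, 20), (1, 0, 16, 20), (1, 1, 5, 20)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit(cells, steps=20000, lr=0.01):
    """Maximize the binomial log-likelihood of
    p = sigmoid(b0 + b1*night + b2*large + b3*night*large)
    by gradient ascent; the gradient is the logistic score function."""
    b = [0.0, 0.0, 0.0, 0.0]
    for _ in range(steps):
        grad = [0.0] * 4
        for night, large, successes, n in cells:
            x = (1.0, night, large, night * large)
            p = sigmoid(sum(bj * xj for bj, xj in zip(b, x)))
            for j in range(4):
                grad[j] += (successes - n * p) * x[j]
        b = [bj + lr * gj for bj, gj in zip(b, grad)]
    return b

b = fit(cells)

def predict(night, large):
    return sigmoid(b[0] + b[1] * night + b[2] * large + b[3] * night * large)

# Saturated model: fitted probabilities reproduce the cell proportions.
for night, large in ((0, 0), (0, 1), (1, 0), (1, 1)):
    print(night, large, round(predict(night, large), 2))
```

Because four parameters fit four cells exactly, the fitted probabilities recover the observed proportions (0.90, 0.85, 0.80, and 0.25), and the strongly negative interaction coefficient (about -2 on the logit scale) flags the night/large degradation that a global average would hide.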

In Figure 4, the predicted probabilities of maintaining a track and corresponding confidence intervals are the output of a logistic regression analysis. Using logistic regression and model selection techniques lets testers distill the most-important results from the data. In the analysis of the tracking system, the logistic regression revealed a significant interaction between time of day and target size. This interaction results in poor performance in a specific set of conditions: large, fast targets at night.

It is clear from this analysis that the overall roll-up calculation provides little insight into system performance in operations. However, these global averages (and sometimes medians) are all too frequently used in defense analyses. It is important to highlight examples like these because they showcase why statistical analysis matters. Clearly, the traditional analysis of taking simple averages fails to identify an important performance degradation that occurs against large, fast-moving targets at night.

The statistical analysis also can provide insights for system development and future testing of the system. For example, developers might target future system changes to ensure that large, fast targets at night can be tracked as well as others. This also would be a focus of follow-up testing.

Advanced Analysis Methods to Meet Unique T&E Challenges

Balancing complex systems, environments, missions, and the need to account for human factors with the structured use of experimental design and statistical analysis results in many interesting research questions and advanced statistical analysis challenges. Statistical analyses that extend beyond straightforward regression/linear model analysis are slowly gaining traction in the DoD.

Examples include:

• Bayesian analysis methods (especially in a reliability context) allow leveraging information from multiple phases of testing while ensuring the results still reflect the operational reliability.

• Survival analysis and mixture distributions for performance measures such as detection range and time to detect allow incorporating information from continuous measures in cases where traditional pass/fail metrics (e.g., probability of detect) would have been the only measures previously considered.

• Generalized linear models and mixed models allow flexible analysis methodologies that truly reflect the character of the data.

However, these analysis methods require more than an introductory-level understanding of statistics. The DoD's testing professionals need the assistance of the statistical community in making these tools more accessible and recruiting individuals with strong statistical backgrounds into defense positions.

It is highly important for the statistical community to engage with defense testers to develop statistical analysis methods that meet the unique challenges of the operational environment.

As our systems become even more complex and leverage levels of autonomy, continuous and integrated testing will be necessary. The application of state-of-the-art statistical methodologies must continue to evolve to confront these new challenges. Seeing the importance of statistical methods in defense testing, elements of the DoD have started investing in statistical research through the Science of Test Research Consortium, which includes the Air Force Institute of Technology, Naval Postgraduate School, Arizona State University, Virginia Tech, North Carolina State University, Florida State University, University of Arkansas, and Rochester Institute of Technology.

We also have started a Statistical Engineering Collaboration with NASA to share best practices across organizations. We have designed a workshop to strengthen the community in statistical approaches to testing, evaluation, and modeling and simulation in defense and aerospace. The workshop also seeks to link practitioners and statisticians in a bi-directional exchange. The information exchange consists of practitioners providing challenging and interesting problems and statisticians providing training, consultation, and access to new ideas.

Finally, we are working to make statistics more accessible and specifically targeted to the DoD test community through an educational website (testscience.org). We encourage statisticians to consider careers in defense. There are opportunities for statisticians in defense test agencies, working in all phases of testing. The challenges are complex and we need statistical thinking to help us solve them.

Further Reading

DOT&E Website with Guidance Memos: http://www.dote.osd.mil/guidance.html.

Johnson, R.T., Hutto, G.T., Simpson, J.R., and Montgomery, D.C. 2012. Designed experiments for the defense community. Quality Engineering 24(1), 60–79.

Montgomery, D.C. 2008. Design and Analysis of Experiments. John Wiley & Sons.

National Research Council. 1998. Statistics, Testing, and Defense Acquisition: New Approaches and Methodological Improvements. Washington, DC.

Sanchez, S.M. 2018. Data Farming: Reaping Insights from Simulation Experiments. CHANCE 31(2).

Test Science Website: http://testscience.org/.

About the Authors

Laura Freeman is an assistant director of the Operational Evaluation Division at the Institute for Defense Analyses. She serves as the primary statistical advisor to the director of Operational Test and Evaluation (DOT&E) and leads the test science task, which is dedicated to expanding the use of statistical methods in the DoD test community. Freeman has a BS in aerospace engineering, an MS in statistics, and a PhD in statistics, all from Virginia Tech. Her PhD research was on design and analysis of experiments for reliability data.

Catherine Warner is director of the Centre for Maritime Research and Experimentation in the NATO Science and Technology Office. Previously she served as science advisor for the Director, Operational Test and Evaluation (DOT&E). She has been involved with operational testing and evaluation since 1991, when she became a research staff member at the Institute for Defense Analyses. Warner previously worked at the Lawrence Livermore National Laboratory. She earned both BS and MS degrees in chemistry from the University of New Mexico and San Jose State University, and both MA and PhD degrees in chemistry from Princeton University.

Disease Surveillance: Detecting and Tracking Outbreaks Using Statistics

Ronald D. Fricker, Jr. and Steven E. Rigdon

Throughout history, the human race has periodically been ravaged by disease. One of the most-extreme examples is the bubonic plague or Black Death pandemic of the 14th century. Caused by the Yersinia pestis bacteria, the plague is thought to have originated in Central Asia and was propagated via rat-borne fleas, where it spread through the Middle East and Europe, killing—by some estimates—60% of Europe's population.

Plague pandemics repeatedly returned through the 1800s and, in fact, the plague still occurs throughout the world, including the United States, although it is now treatable with antibiotics and only infects a small number of people.

Medical and public health advances have essentially eliminated the chances of another plague pandemic, but advances in transportation facilitate people (and animals, insects, and agriculture) moving around the world with greater efficiency than ever, so other diseases still pose a significant pandemic threat to human welfare. For example, recent disease outbreaks include:

• West Nile virus, which is transmitted by mosquitoes, arrived in the U.S. in 1999; through 2015, it infected 43,937 and killed 1,911 people.

• Severe acute respiratory syndrome (SARS), which is transmitted person-to-person via sneezing, touch, or other close contact, infected 8,098 people worldwide in 2003, of whom 774 died.

• The H1N1 "swine flu" pandemic, which was first detected in the United States in April 2009 and, from April 12, 2009–April 10, 2010, resulted in approximately 60 million flu cases, almost 275,000 hospitalizations, and more than 12,000 deaths in the United States alone.

Figure 1 (chart title): Overall and Infectious Disease Mortality Rates Have Leveled Off since the 1950s. Mortality rate per 100,000 population, 1900–2014.

Figure 1. Overall and infectious disease mortality in the United States from 1900 to 2014. The spike at 1918 in Figure 1 is the U.S. mortality rate during the Spanish flu outbreak, where roughly 1 person in 100 who contracted an infectious disease died, and during which the Spanish flu killed an estimated 675,000 people in the United States. Source: http://n.pr/2giRnOG.

• Ebola is a highly lethal virus, with fatality rates ranging from 25%–90%. It is spread through bodily fluids, which must enter the body through a mucous membrane (e.g., mouth, nose, eyes) or an opening in the skin. It is therefore not highly contagious, but is dangerously infectious, in the sense that very few virions are needed to infect a person who has been exposed.

• Most recently, the Zika virus, which is transmitted by Aedes aegypti, and likely Aedes albopictus, both species of mosquito, and can spread between humans via sexual contact or blood transfusion, or from pregnant mother to fetus. Although rarely fatal, Zika can cause microcephaly and other severe brain defects in babies born to women who had Zika during pregnancy.

Furthermore, the threat of bioterrorism makes timely and effective disease surveillance as much a national security priority as a public health priority.

As shown in Figure 1, due to significant improvements in medical and public health practice, mortality from infectious disease in the United States is significantly lower today than it was a century ago (although it has leveled off since the 1950s). While good news, this lower overall "background"

incidence of infectious disease does not eliminate the possibility that another disease pandemic might occur in the future.

For example, the highly pathogenic avian influenza (HPAI) or "bird flu" has the potential to cause a large-scale pandemic should it ever mutate into a form that is highly transmissible between humans. However, because the virus does not (yet) transmit efficiently between people, there has been no significant human outbreak.

Should the virus mutate, what could a bird flu outbreak look like? To put it in perspective, the 1918–1919 Spanish flu, which is the likely ancestor of the human and swine flu viruses, infected an estimated 500 million people, or about one-third of the world's population; the total number of deaths has been estimated at about 50 million and may have been as high as 100 million people.

Disease "Early Warning Systems"

Seeking to avoid another pandemic, public health agencies conduct disease surveillance, more formally called epidemiologic surveillance, by actively gathering and analyzing data related to human health and disease to provide early warnings of human health events and rapidly characterize disease outbreaks.

Epidemiologic surveillance has two main objectives: to enhance outbreak early event detection (EED) and to provide outbreak situational awareness (SA). The United States Centers for Disease Control and Prevention (CDC) defines them as:

• Early event detection is the ability to detect, at the earliest possible time, events that may signal a public health emergency.

• Situational awareness is the ability to use detailed, real-time health data to confirm or refute, and to provide an effective response to, the existence of an outbreak.

Situational awareness is essential for understanding when and where to intervene, as well as whether the intervention is having the desired effect. It is also focused on monitoring an outbreak's magnitude, geography, rate of change, and life cycle. Early event detection is critical for catching an outbreak as soon as possible, so medical and public health personnel can intervene before it grows into a pandemic. It comprises case and suspect case reporting by medical and public health professionals, along with statistical analysis of health-related data. It is the latter with which we are concerned here.

Statistical Methods Applied to Epidemiologic Surveillance

The use of statistical analyses to help determine the origin of disease goes back to the Soho epidemic of 1854 in England, when John Snow, MD, mapped cases of cholera to support his theory that the disease was transmitted by contaminated water. Figure 2 is Snow's now-famous plot, showing simultaneously the number of cholera deaths by city address and the locations of each of the city's water pumps. The result is a clear visual association between areas of higher death rates from cholera and specific water pumps, particularly the Broad Street pump.

Epidemiologists and public health professionals are often called upon to determine the cause or causes of a particular disease outbreak, much as Snow first did almost two centuries ago. A defining feature of such an investigation is that the outbreak has already been identified. This is often referred to as event-based surveillance.

In event-based surveillance, the investigation is a retrospective effort focused on trying to determine the cause of a known outbreak. In contrast, today's epidemiologic surveillance can also be a prospective exercise in monitoring populations for potential disease outbreaks by routinely evaluating data for evidence of an outbreak before the existence of a confirmed case and perhaps even before any suspicion of an outbreak.

Temporal Detection: Flu Outbreaks in Monterey, California

Syndromic surveillance is a specific type of prospective epidemiologic surveillance that is based on indicators of diseases and outbreaks, not confirmed cases, with the goal of detecting outbreaks before medical and public health personnel would otherwise note them. It is based on the notion of a syndrome, which is a set of non-specific pre-diagnosis medical and other information that may indicate the presence of a disease. It actively searches for evidence of possible outbreaks, perhaps even before there is any suspicion that an outbreak has occurred.

The motivating idea is that serious diseases often first manifest with more benign symptoms before they can be diagnosed for what they really are. For example, someone who contracts smallpox exhibits flu-like symptoms for the first week or two afterward. Thus, a widespread smallpox outbreak might first become evident as an increase in "influenza-like illness" (ILI) syndrome counts before a

Figure 2. Dr. John Snow's map of the 1854 London cholera epidemic. Pump locations are depicted by dots in the middle of streets and bars are proportional to the number of deaths at each address. In the center is the Broad Street pump. Source: http://bit.ly/2FNOnJz.

Figure 3. Percentage of patients diagnosed with ILI from September 28, 2008 (week 40), through January 2, 2010 (week 52) for: (1) the California Sentinel Provider system, (2) Monterey hospital emergency rooms, and (3) Monterey public health clinics. The diamonds represent laboratory-confirmed hospitalized cases of 2009 H1N1 in Monterey County, where each diamond represents one person and is plotted for the week the individual first became symptomatic due to 2009 H1N1 infection. Source: Hagen, et al., 2011.

clinician diagnoses the first smallpox case.

To illustrate syndromic surveillance, Figure 3 compares ILI syndrome counts to diagnosed case counts for the seasonal flu that occurred in Monterey, California, in early 2009 and then two subsequent H1N1 "swine flu" outbreaks. Those three outbreaks appear in gray, with superimposed time series of the weekly percentage of patients diagnosed with ILI from the California Sentinel Provider system, Monterey hospital emergency rooms, and Monterey public health clinics. Smoothing lines better show the underlying trends.

The outbreak periods are:

• Seasonal flu outbreak: December 12, 2008 (week 50)–February 13, 2009 (week 6)

• First H1N1 outbreak: April 6, 2009 (week 14)–May 8, 2009 (week 18)

• Second H1N1 outbreak: June 15, 2009 (week 24)–November 6, 2009 (week 44)

where the term "outbreak period" is defined as when the syndrome counts were increasing from their nominal state up to some peak.

At the bottom of Figure 3, the diamonds are laboratory-confirmed, hospitalized cases of 2009 H1N1 in Monterey County. Each diamond represents one person and is plotted for the week the individual first became symptomatic due to 2009 H1N1 infection.

During this same period, the Monterey County Health Department (MCHD) also conducted syndromic surveillance, by monitoring "chief complaint" data from the county's four hospital emergency rooms (ERs) and six public health clinics, as an early warning system for various types of disease outbreaks. A chief complaint is a brief written summary of the

Figure 4. Algorithm signal times. A vertical line "|" denotes a signal on a particular day and the heavier black bars indicate a sequence of daily signals. Source: Modified from Hagen, et al., 2011.

reason or reasons describing why an individual went to a medical facility. The MCHD monitored a number of syndromes, including ILI, gastrointestinal, upper respiratory, lower respiratory, and neurological.

To distill the chief complaints into syndrome indicators, the chief complaint text is searched and parsed for key words. Here, the ILI syndrome is defined as: ("fever" and "cough") or ("fever" and "sore throat") or ("flu" and not "shot"). Thus, if someone went to a Monterey County ER and said either that they had a fever and cough, or a fever and sore throat, or had the flu (but had not gotten a "flu shot"), that person is classified as having the ILI syndrome.

To monitor the population, one must first model the "normal" state of disease incidence or chief complaints in the population, which will fluctuate naturally, and which for the purposes of detecting outbreaks is just noise. Second, deviations from that normal background incidence must be monitored to look for outbreaks, part of which requires setting some sort of threshold level above which an algorithm will produce a signal. Key to note is that such a signal, like all statistical signals, could be either a true or false positive (just like the lack of a signal could be a true or false negative), so the signals produced by the algorithm will require investigation to determine whether it is a true positive, meaning an outbreak is occurring.

Without going into all of the details, there are any number of ways to model the normal state of the population. Then, borrowing from the statistical process monitoring literature, one algorithm useful for assessing whether an outbreak is occurring is the cumulative sum (CUSUM). It works by summing up observations appropriately and producing a signal when the sum exceeds some threshold.

Figure 4 shows the resulting CUSUM signals superimposed on the ILI chief complaint count time series. The circles are the aggregate daily ILI counts for Monterey County clinics and the black line is a locally weighted smoothing line to show the

THE NATIONAL NOTIFIABLE DISEASES SURVEILLANCE SYSTEM

One example of event-based surveillance is the United States Centers for Disease Control and Prevention (CDC) National Notifiable Diseases Surveillance System (NNDSS), which aggregates and summarizes data on specific diseases that healthcare providers are required by law to report to public health departments. The NNDSS can be used to illustrate the historical limits statistical method for disease detection.

Public health practitioners use the historical limits method to compare the observed incidence of a particular disease from a current time period to incidence data from equivalent historical periods. For example, the NNDSS uses it to characterize the current incidence of various reportable infectious and non-infectious diseases, such as anthrax, cholera, plague, polio, and smallpox—each week, each state reports counts of cases for each of the reportable diseases to the CDC. The CDC then compares the counts from the past month for each disease to historical average incidence rates from equivalent periods in the past five years.

One specific application, as shown in Figure S-1, is the CDC's Morbidity and Mortality Weekly Report (MMWR) "Notifiable Diseases and Mortality Tables," which plot various reportable diseases and show when the counts observed in the past four weeks exceed the average observed in the past five years, plus or minus two standard deviations. Figure S-1 shows that in the last week of 2015, the meningococcal disease count from the preceding four weeks was more than two standard deviations below its historical norm (which is evident in the graph because of the gray crosshatching on the bar associated with meningococcal disease).

To perform this calculation, the most recent four-week total for each of the notifiable diseases, T_{i,j,k} for reportable disease i, in week j, and year k, is compared to the mean number of cases reported for the same four-week period, the preceding four-week period, and the succeeding four-week period for the previous five years (three four-week windows in each of five years, giving 15 historical totals). Writing x_{i,j,k}^{(1)}, ..., x_{i,j,k}^{(15)} for those historical totals, their average is calculated as

\bar{x}_{i,j,k} = \frac{1}{15} \sum_{m=1}^{15} x_{i,j,k}^{(m)}

and its associated standard deviation as

s_{i,j,k} = \sqrt{ \frac{1}{14} \sum_{m=1}^{15} \left( x_{i,j,k}^{(m)} - \bar{x}_{i,j,k} \right)^2 }.

The historical upper limit (UL) and lower limit (LL) for reportable disease i, in week j, and year k, are then UL_{i,j,k} = \bar{x}_{i,j,k} + 2 s_{i,j,k} and LL_{i,j,k} = \bar{x}_{i,j,k} - 2 s_{i,j,k}. (The factor 2 used in the upper and lower limits is chosen because most of the probability from a distribution falls between the mean minus two times the standard deviation and the mean plus two standard deviations. For the normal distribution, this interval accounts for about 95% of the probability. Thus, using a factor of 2 makes it unlikely that a point outside the interval occurred by chance, reinforcing the hypothesis that a change occurred.) The part of the count that exceeds these limits is shaded in gray crosshatching.

Interestingly, for reasons that are not entirely clear to us, the bars plotted in Figure S-1 are T_{i,j,k} / \bar{x}_{i,j,k} plotted on a log scale, where the limits are then UL*_{i,j,k} = 1 + 2 s_{i,j,k} / \bar{x}_{i,j,k} and LL*_{i,j,k} = 1 - 2 s_{i,j,k} / \bar{x}_{i,j,k}. Plotting the ratio does make a direct comparison between the observed total and the historical average, but unless the total exceeds the upper or lower limit, it does not provide any information about how "far away" the observed total is from the average. In our opinion, the graph would be much easier to read and interpret if it plotted each of the most-recent four-week totals T_{i,j,k} in terms of the number of standard deviations they are above or below their associated historical average \bar{x}_{i,j,k}. Furthermore, a time series plot for each disease, much like a control chart from the statistical process monitoring literature, would provide additional information within which to put the observed deviations (and control limits) into a useful perspective.

Figure S-1. Figure I from "Notifiable Diseases and Mortality Tables" for week 52 of 2015. Source: www.cdc.gov/mmwr/preview/mmwrhtml/mm6452md.htm?s_cid=mm6452md_w.
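The sidebar's historical-limits rule, and the CUSUM described in the main text, each take only a few lines. A sketch using invented counts:

```python
import statistics

def historical_limits_signal(current_total, historical_totals):
    """Sidebar rule: flag a four-week total that falls outside the mean
    plus-or-minus two standard deviations of its (nominally 15)
    historical four-week totals."""
    xbar = statistics.mean(historical_totals)
    s = statistics.stdev(historical_totals)  # sample SD, n-1 divisor
    return not (xbar - 2 * s <= current_total <= xbar + 2 * s)

def cusum_signals(counts, mu0, k, h):
    """Upper CUSUM from the statistical process monitoring literature:
    accumulate exceedances of the baseline mean mu0 (less an allowance k)
    and signal whenever the running sum tops the threshold h."""
    s, out = 0.0, []
    for count in counts:
        s = max(0.0, s + (count - mu0 - k))
        out.append(s > h)
    return out

# Invented daily ILI chief-complaint counts: a stable baseline of 10,
# then a sustained jump to 15 mimicking an outbreak.
daily = [10] * 10 + [15] * 5
print(cusum_signals(daily, mu0=10, k=2, h=5))  # first signal on the
                                               # second elevated day
```

In practice mu0, k, and h would be tuned to each facility's own history and to the false-positive rate the health department can tolerate, since, as the main text notes, every signal triggers an investigation.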

Figure 5. SIDS rates in North Carolina from 1979–1983. Raw rates are shown in the top graph, and the rates smoothed by a conditional autoregressive model are shown in the bottom graph.

underlying trends. A signal on a particular day is denoted by a vertical line "|" and the heavier black bars indicate a sequence of daily signals.

In Figure 4, we see two things. First, the syndrome count trends match quite well with the three outbreak periods, with the black line clearly increasing in each of the three periods, which suggests that mining the chief complaint text for appropriate key words does provide information about disease outbreaks. Second, the algorithm routinely signals during the outbreak periods, as well as some times outside those periods. However, comparing the signals to the time series, we do see that the signals correspond well to increases in the daily ILI syndrome counts, where it could be that the syndromic surveillance is picking up additional information that was not present or was less-obvious in the diagnosed ILI time series of Figure 3.

Spatio-temporal Detection: Sudden Infant Death Syndrome and Lyme Disease

The foregoing analysis is an example of early event detection. However, to be useful for situational awareness, the data also must have a geographic component. Typically, such data are collected in each of several geographical units—states, counties, ZIP codes, or census tracts.

We would expect a higher correlation in the disease counts for regions that are close together than for those that are far apart. The problem is complicated by the irregular placement of the counties (or other geographically defined regions). For a rare disease, disease counts and rates may vary greatly even for regions that are close together. It is, thus, helpful to have a method of "smoothing" the disease rates across regions. For example, disease rates in neighboring counties for a given time period may be 4, 0, and 7 per 100,000. It seems reasonable that the estimate of 7 per 100,000 is probably too high, and the estimate of 0 per 100,000 is almost certainly too low. Since this phenomenon is common when the observed counts are small, we often look to smooth the estimates.

A choropleth map is a plot of the geographic regions, color-coded to indicate the value of some variable, such as the disease rate. Darker colors usually indicate higher rates of the disease. Figure 5 (top) shows the raw rates of sudden infant

Figure 6. Choropleth maps for Lyme disease rates in Connecticut from 1996 to 2015. Data source: http://www.ct.gov/dph/cwp/view.asp?a=3136&q=399694.

death syndrome (SIDS) for the counties in North Carolina over the years 1979 to 1983. These rates show quite a bit of randomness, but some regions with high rates seem to appear.

Statisticians and epidemiologists assign two sources of variability. One is the county-by-county variability, similar to the random error term in a linear model. The other involves the correlation between neighboring counties. (Counties are defined to be neighbors if they share a common border.) This correlation structure is similar to that of autoregressive time series models, where the current outcome depends on the outcome of the most recent time periods; in other words, the only "neighbors" of the current time are the most recent time periods. Spatial models are a bit more complicated because the neighbor structure is more complicated. Such models are called conditional autoregressive (CAR) models.

The bottom graph shows the SIDS rates that have been fit using a CAR model to account for both uncorrelated and correlated heterogeneity. Since the same values were used for the breaks (that is, the intervals that determine the shade of blue), the bottom graph shows the smoothing effect of the CAR model. Scotland County is the region with the highest SIDS rate. It is in the southern part of the state, southeast of Fayetteville and along the South Carolina state line. Knowing the regions with higher rates allows investigators to search for possible causes. CAR models are capable of including covariates, often demographic

Figure 7. Lyme disease rates for each county in Connecticut from 1996 to 2015. Data source: http://www.ct.gov/dph/cwp/view.asp?a=3136&q=399694.

characteristics of each county, in the model.

The North Carolina SIDS data are just a snapshot of one particular time period (1979–1983). Often, data are collected across counties for each time period (weekly, monthly, yearly). The surveillance problem now looks at changes across time and across the geographic regions. For illustration, we look at the occurrence of Lyme disease in Connecticut. Connecticut has only eight counties; only Delaware, DC, Hawaii, and Rhode Island have fewer.

Three of the counties are large, with nearly 1 million residents each: Fairfield, Hartford, and New Haven; the others—Litchfield, Middlesex, New London, Tolland, and Windham—have between 100,000 and 200,000 residents each.

Lyme disease is the result of infection by Borrelia burgdorferi, which is usually transmitted by the deer tick. Lyme disease is a flu-like illness, with symptoms that include fever, muscle pain, chills, and nausea. Some patients develop a rash that is shaped like a "bull's-eye." Without early detection and treatment, Lyme disease can become chronic. Incidence rates of Lyme disease are highest in the northeast. Only Maine, Pennsylvania, Rhode Island, and Vermont have incidence rates exceeding Connecticut's.

Figure 6 shows choropleth maps for the Connecticut counties for each of the years 1996 through 2015. Incidence rates are lower now than in the early 2000s. The high-incidence region seems to be in the eastern part of the state, particularly Tolland, Windham, and New London counties. These counties tend to be more rural and have more deer ticks.

Figure 7 shows time series plots for each of the eight counties. From this, we can see that the rates were high from 1998 through 2002, but dropped sharply in 2003. The rates remained relatively low except for 2009, when the rates spiked in the same three eastern counties. (Figure 7 shows more detail than Figure 6, but the geographic information is only in Figure 6.)

Epidemiologic surveillance also focuses on detecting chronic and non-communicable diseases. For instance, the National Cancer Institute surveils many cancers, including leukemia. Leukemia surveillance often looks for "hot spots" of the disease, where the rates are much higher than expected.

The CDC maintains surveillance for various forms of heart disease. The National Highway Traffic Safety Administration (NHTSA) routinely performs surveillance on traffic- and non-traffic-related deaths. In most states, foodborne diseases, such as Escherichia coli

(E. coli), are reportable diseases, and are monitored much like communicable diseases.

Conclusions

The two main aspects of disease surveillance are (1) early event detection and (2) situational awareness, and statistical methods have played and continue to play an important role in both.

In early event detection, the goal is to separate the signal from the noise. We want to know when the number of cases or the rate is higher than we would expect by chance. Knowing this allows investigators to look for causes. This is particularly important for foodborne diseases such as E. coli, where it is crucial to find the source of the infection so it can be eliminated.

In situational awareness, knowing the status of an infectious disease is important in investigating an outbreak. For diseases such as Lyme disease, which are not communicable from person to person, it is important to identify trends in the data, as well as the hot spots where the incidence rate is particularly high. This is also true for surveillance of other noncommunicable diseases such as cancer and heart disease, as well as for other causes of morbidity or mortality, such as car accidents.

Finally, surveillance need not involve the number or rate of cases. Surveillance of the prevalence of deer ticks may help predict an outbreak of Lyme disease. Google Flu Trends monitored the terms that people would search for on their search engine in an effort to monitor the incidence of influenza. For a while, it was able to detect outbreaks sooner than other methods of surveillance. As pointed out in an article in Science, though, Google Flu Trends significantly overpredicted the actual incidence of influenza in early 2013. As a result, Google Flu Trends is no longer operating.

Rat populations, which are related to plague (Yersinia pestis) outbreaks, can be monitored. Plague affects rats the same as humans and is communicated by fleas. Mary Dobson, in Disease: The Extraordinary Stories Behind History's Deadliest Killers, describes a particular type of surveillance done centuries ago: "In India and China, folk wisdom warns that when the rats start dying, it is time to flee." Disease surveillance has come a long way since then, but the spirit has remained the same: Know when diseases are coming and take the necessary steps to minimize the damage they can do.

Further Reading

Banerjee, S., Carlin, B.P., and Gelfand, A.E. 2015. Hierarchical Modeling and Analysis for Spatial Data, second edition. Boca Raton, FL: CRC Press.

Brookmeyer, R., and Stroup, D.F. 2004. Monitoring the Health of Populations: Statistical Principles and Methods for Public Health Surveillance. Oxford, UK: Oxford University Press.

Dobson, M. 2007. Disease: The Extraordinary Stories Behind History's Deadliest Killers. London, UK: Quercus.

Fricker, R.D., Jr. 2013. Introduction to Statistical Methods for Biosurveillance: With an Emphasis on Syndromic Surveillance. Cambridge, UK: Cambridge University Press.

Hagen, K.S., Fricker, R.D., Jr., Hanni, K., Barnes, S., and Michie, K. 2011. Assessing the Early Aberration Reporting System's Ability to Locally Detect the 2009 Influenza Pandemic. Statistics, Politics, and Policy 2(1).

Lawson, A.B., and Kleinman, K., eds. 2005. Spatial and Syndromic Surveillance for Public Health. Hoboken, NJ: Wiley.

Lombardo, J.S., and Buckeridge, D.L. 2007. Disease Surveillance: A Public Health Informatics Approach. London, UK: Wiley-Interscience.

M'ikanatha, N.M., Lynfield, R., Van Beneden, C.A., and de Valk, H., eds. 2013. Infectious Disease Surveillance. Wiley-Blackwell.

Rigdon, S.E., and Fricker, R.D., Jr. 2015. Health Surveillance. In Chen, D-G., and Wilson, J.R., eds., Innovative Statistical Methods for Public Health Data. Springer, 203–249.

Rigdon, S.E., and Fricker, R.D., Jr. 2018. Monitoring the Health of Populations by Tracking Disease Outbreaks and Epidemics: Saving Humanity from the Next Plague. Boca Raton, FL: Chapman and Hall/CRC.

Rogerson, P., and Yamada, I. 2009. Statistical Detection and Surveillance of Geographic Clusters. Boca Raton, FL: CRC Press.

About the Authors

Ronald D. Fricker, Jr. is professor and head of the Virginia Tech Department of Statistics, a Fellow of the American Statistical Association (ASA), and an elected member of the International Statistical Institute. Fricker's current research focuses on the performance of various statistical methods for use in disease surveillance and statistical process control methodologies more generally. He is the author of Introduction to Statistical Methods for Biosurveillance, published by Cambridge University Press, and has published in Statistics in Medicine, the Journal of the Royal Statistical Society, Environmental and Ecological Statistics, the Journal of Quality Technology, Quality Engineering, and Information Fusion.

Steven E. Rigdon is professor of biostatistics at Saint Louis University and Distinguished Research Professor Emeritus at Southern Illinois University Edwardsville. He is the author of Calculus, 8th and 9th editions, published by Pearson, and Statistical Methods for the Reliability of Repairable Systems, published by Wiley. His research interests include disease surveillance, quality surveillance and control, recurrent events, and statistical methods in sports. He is a Fellow of the ASA and is currently the editor of the Journal of Quantitative Analysis in Sports.

VOL. 31.2, 2018 22

Adversarial Risk Analysis

David Banks

Classical game theory assumes a player's opponent is Mr. Spock—hyper-rational and capable of almost limitless logical calculation. Classical risk analysis assumes that the opponent is Nature—non-strategic and acting without an agenda. Both theories seem obviously wrong for anticipating the actions of opponents in settings such as counterterrorism, environmental regulation, auctions, or even any of the standard tropes of game theory, such as the Prisoner's Dilemma, Chicken, or the Hawks-and-Doves game.

Adversarial risk analysis (ARA) offers an alternative. ARA involves building a model for the opponent's strategic reasoning, placing subjective distributions over all unknown quantities, and then making the decision that maximizes the player's expected utility under that model.

People implement ARA informally all the time. If I play chess with a 10-year-old child, I have a model for how sophisticated a player she is. I probably assume she thinks about three moves ahead, or possibly a little further if pieces are being captured, so I don't analyze the game as deeply or as competitively as I would if I were playing an adult at a chess club. That choice approximately maximizes my utility by balancing intellectual laziness against the desire to win.

ARA is clearly different from risk analysis, since it supposes that the opponent is intelligent and strategic. It has important differences from game theory, too. It does not require strong and implausible assumptions that the opponents have common knowledge (e.g., that all bidders in an auction know the distribution that each bidder has for each other bidder's true value for the item on offer). Nor does it seek an equilibrium solution—ARA supports a single player, rather than solving all participants' decision problems jointly.

ARA is a specialization of decision analysis, which was proposed simultaneously by Kadane and Larkey and by Howard Raiffa. Those authors held that it is better to maximize one's expected utility in adversarial situations than to seek a minimax equilibrium solution, and argued that a subjective Bayesian analysis was the best way to achieve that goal. However, neither paper provided any guidance about how to operationalize that process, and ARA provides that missing ingredient.

Decision analysis is controversial. Roger Myerson, one of the 2007 Nobel laureates in economics, gave the following criticism:

A fundamental difficulty may make the decision-analytic approach impossible to implement, however. To assess his subjective probability distribution over the other players' strategies, player i may feel that he should try to imagine himself in their situations. When he does so, he may realize that the other players cannot determine their optimal strategies until they have assessed their subjective probability distributions over i's possible strategies. Thus, player i may realize that he cannot predict his opponents' behavior until he understands what an intelligent person would rationally expect him to do, which is, of course, the problem that he started with. This difficulty would force i to abandon the decision-analytic approach and instead undertake a game-theoretic approach, in which he tries to solve all players' decision problems simultaneously.

John Harsanyi, a 1994 Nobel laureate in economics, also rejected decision analysis. His discussion of Kadane and Larkey argued that decision analysis violated the spirit of game theory, since probability assessment of opponents' actions ought to derive from an analysis of their rational behavior. He implied that such probability assessments should not depend upon their actual behavior in previous games, nor upon opinions about an opponent's motives, knowledge, or strategic sophistication.

Bombs on Trains

To make things concrete, and perhaps a little more exciting, suppose an FBI agent receives a tip that terrorists may be planning to place a bomb on a train. The agent has the option of putting undercover police on that train in an attempt to thwart the bombing. This discussion will support the FBI agent, and ARA neatly partitions the uncertainty into three separate components: aleatory, epistemic, and concept.

The aleatory uncertainty is the simplest. It is the uncertainty in the outcome that is conditional on the decisions made by the opponents, and it can be handled through standard risk analysis. (I do not mean to suggest that risk analysis is easy—it generally is not—but risk analysis is a mature methodology, and people know how to structure the problem.) In this case, the aleatory uncertainty arises after, say, the agent has assigned undercover officers to the train and the terrorist has decided to attempt the bombing. Will the bomb be discovered in time? If not, will it detonate? If it does, how many people will die? These questions involve only outcomes, and not strategic decision-making. The agent could assess the relevant probabilities of each outcome using historical data on how often undercover police are successful in preventing attacks, how often homemade bombs fail to detonate, and the empirical distribution of mortality counts after an explosion on a train.

Epistemic uncertainty concerns the FBI agent's knowledge about the terrorist's probabilities, capabilities, and utilities. In this example, the agent does not know whether the terrorist thinks he has a good chance of successfully smuggling the bomb onto the train or a poor chance, so the agent must place a subjective distribution over what he believes are the terrorist's probability assessments. Similarly, the agent does not know the amount and kind of explosive the terrorist may have, and thus must place a subjective distribution over the terrorist's capabilities. Finally, the agent does not know the terrorist's utility function. Does the terrorist see a successful train bombing as having great value or small value? Again, the agent must use subjective probabilities.

Of course, in some cases, it may not be too difficult to assess these subjective probabilities. Terrorists often announce their goals, which gives insight into their utility function. Their capabilities can be revealed by previous attacks. It is more problematic to infer their beliefs about the probability of success in an attack, but it seems safe to assume that anyone in that line of work has to be something of an optimist.

The most significant and novel problem is that of concept uncertainty. Concept uncertainty captures the fact that the agent does not know how the terrorist is making his decision. Perhaps the terrorist has studied game theory and thus seeks a Bayes Nash equilibrium. Perhaps the terrorist has read Kahneman and Tversky's work on patterns of irrational thinking, and thus is trying to exploit the FBI agent's own cognitive limitations. There are many other possibilities: The terrorist may act at random, or the terrorist may try to think one step or two steps ahead of the agent, and so forth. Banks, Rios, and Ríos Insua discuss many such possibilities.

In practice, the FBI agent will not know what kind of solution concept the terrorist will use in making his decisions, but the agent can place a mixture distribution over all the possibilities, with subjectively assessed probabilities for each mixture component, and use that to express his beliefs about how the terrorist is strategizing.

To make all this more concrete, suppose the FBI agent is considering two choices—whether to embed undercover police officers on the train, or not. The most-primitive solution is to apply simple risk analysis, implying that the agent assumes the terrorist is not being strategic. In that case, the agent calculates the expected loss from deploying police officers and the expected loss from not deploying police, and takes the decision that has the smallest loss.

Suppose the FBI agent thinks the informant's tip has probability 0.1 of being correct. Also suppose the known fixed cost of embedding police is $10,000. Based on historical data, the agent thinks that undercover police have a 90% chance of forestalling an explosion. The agent does not know how many people would die in a train explosion, but empirical data leads him to believe it follows a Poisson distribution with mean 15, and the agent assigns $1 million as the value of a statistical life. Then the expected cost of deploying the policemen is:

E[loss] = $10,000 + P[tip correct] × P[police fail] × E[number killed] × $1,000,000
        = $10,000 + (0.1)(0.1)(15)($1,000,000)

which comes to $160,000. And the expected cost of not deploying the policemen is

E[loss] = P[tip correct] × E[number killed] × $1,000,000
        = (0.1)(15)($1,000,000),

or $1.5 million. Clearly, in this analysis, the agent should deploy the police.
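This simple risk analysis can be reproduced in a few lines of Python. The figures below are exactly the inputs stated above (tip probability 0.1, a 90% chance that police forestall the attack, a Poisson mean of 15 deaths, $1 million per statistical life, and a $10,000 deployment cost), so this is just a numerical check, not new modeling:

```python
# Non-strategic (level-0) risk analysis for the train-bombing example,
# recomputed from the inputs stated in the text.
p_tip = 0.1           # probability the informant's tip is correct
p_police_fail = 0.1   # police forestall the attack with probability 0.9
mean_killed = 15.0    # Poisson mean for deaths in a successful bombing
vsl = 1_000_000       # value of a statistical life, in dollars
deploy_cost = 10_000  # fixed cost of embedding undercover police

loss_deploy = deploy_cost + p_tip * p_police_fail * mean_killed * vsl
loss_no_deploy = p_tip * mean_killed * vsl

print(f"deploy:    ${loss_deploy:,.0f}")     # $160,000
print(f"no deploy: ${loss_no_deploy:,.0f}")  # $1,500,000
```

Since the expected loss with deployment is a small fraction of the expected loss without it, the decision follows immediately.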

But now suppose the agent thinks the terrorist is a strategic thinker. The terrorist may decide whether or not to attempt the bombing based upon his analysis of the FBI agent's decision-making. Stahl and Wilson (1995) introduced the solution concept known as "level-k thinking." A level-0 thinker acts at random. A level-1 thinker assumes the opponent is a level-0 thinker, and maximizes expected utility under that model (which is exactly what the FBI agent did in the previous analysis). A level-2 thinker assumes his opponent is a level-1 thinker.

This chain can go as high as desired, but experimental evidence suggests that except in certain board games, such as chess, checkers, and Go, human beings rarely proceed beyond level-2 thinking. (Go, or Gomoku, is the ancient Chinese tactical game that was recently in the news because Google's computer program defeated the Korean grandmaster Lee Sedol.)

For a level-2 analysis, the agent assumes that the terrorist is a level-1 thinker, who models the agent as a level-0 thinker. Suppose the agent thinks the terrorist thinks the agent acts at random, and has probability 0.01 of deploying police on the train. In addition, the agent must model the terrorist—he believes that the terrorist will attack if and only if the probability of killing 20 or more people exceeds 0.5. The agent does not know the terrorist's beliefs about the distribution of the number of deaths from a successful attack, nor does he know what the terrorist thinks is the chance that a smuggled bomb will detonate. As a subjective Bayesian, though, the FBI agent can self-elicit these quantities.

Suppose the agent has done all this, and that the upshot is that the agent believes that, conditional on the tip being true, the probability of an attack on the train has the Beta(1, 99) distribution. This last assumption is fairly strong, but seems realistic—the mean probability of an attack is 0.01, and the standard deviation in that subjective belief is 0.0099.

With this framework, the agent's uncertainty generates a probability distribution over the probability p of an attack. He now uses this distribution to calculate his expected loss from this level-2 analysis. With deployment, it is:

E[loss] = $10,000 + P[tip correct] × E[p] × P[police fail] × E[number killed] × $1,000,000
        = $10,000 + (0.1)(0.01)(0.1)(15)($1,000,000)

which comes to $11,500. The expected cost of not deploying the policemen is

E[loss] = P[tip correct] × E[p] × E[number killed] × $1,000,000
        = (0.1)(0.01)(15)($1,000,000),

or $15,000. Again, the agent will decide to deploy the police, but after a more sophisticated analysis that models the terrorist's decision-making when deciding whether to mount an attack. That modeling induces a distribution over key components of the calculation that the agent must make.

The level-k analysis employs a kind of "I think that you think that I think that you think …" reasoning that is awkward to express in language. It is forcibly reminiscent of the exchange in The Princess Bride between Vizzini and the Dread Pirate Roberts:

Roberts: All right: Where is the poison? The battle of wits has begun. It ends when you decide and we both drink, and find out who is right and who is dead.

Vizzini: But it's so simple. All I have to do is divine from what I know of you. Are you the sort of man who would put the poison into his own goblet, or his enemy's? Now, a clever man would put the poison into his own goblet, because he would know that only a great fool would reach for what he was given. I'm not a great fool, so I can clearly not choose the wine in front of you. But you must have known I was not a great fool; you would have counted on it, so I can clearly not choose the wine in front of me.

Roberts: You've made your decision, then?

Vizzini: Not remotely. Because iocane comes from Australia, as everyone knows. And Australia is entirely peopled with criminals. And criminals are used to having people not trust them, as you are not trusted by me. So I can clearly not choose the wine in front of you.

Roberts: Truly, you have a dizzying intellect.

Despite the apparent complexity of the reasoning, the process is actually a simple mathematical recursion, and thus easy to program once the modeling choices have been made.

Besides level-k thinking, the agent might use other models for the strategic calculus of the terrorist. Perhaps the terrorist has studied game theory, and

Figure 1. Multi-Agent Influence Diagram for simultaneous Attack-Defend game.

seeks a Bayes Nash equilibrium. In that case, the agent must infer what common knowledge assumptions the terrorist is making, which (after extensive calculation) leads to a distribution over the terrorist's probability of attack. Then the agent consults his own beliefs about the probability of successful attack, the distribution for the number of casualties, and his loss function for the value of a statistical life to make the decision that maximizes expected utility.

This last step is not part of the standard Bayes Nash equilibrium calculation, but it is necessary because, short of telepathy, the common knowledge assumption is unreasonable—the agent's beliefs are unlikely to square with those that the terrorist assumes that they share.

In practice, the agent does not know which solution concept the terrorist uses. It could be level-0, level-1, level-2, a Bayes Nash equilibrium, or some other principle, but the agent could place subjective probabilities over each of these, generate a mixture distribution for the terrorist's choice, and then maximize expected utility against that mixture (which sounds, and is, pretty complex, but using classical game theory is also arduous in any realistic application).

This toy example of the FBI agent versus the terrorist is sometimes called a simultaneous Attack-Defend game, as discussed by Zhuang, Bier, and Alagoz. The structure of the relevant information can be represented through a Multi-Agent Influence Diagram (MAID), as shown in Figure 1. Invented by Koller and Milch, MAIDs were originally applied in the context of classical game theory, where they have many desirable properties. For ARA, MAIDs are useful because they clearly distinguish common knowledge from private information, and provide a blueprint for ARA modeling.

MAIDs use hexagons to represent utilities, rectangles to indicate decisions, and ovals to indicate probabilities. Each decision node is owned by one of the adversaries, called here the Defender and the Attacker. Similarly, each adversary has a utility node. Arrows indicate information flow through the system. In Figure 1, the Defender, D, has a decision box with arrows to three probability nodes. The shared probability node refers to the random outcome conditional on the decisions made by both opponents. The probability node containing Y_D represents the probabilities that the Defender assesses about events, and the probability node containing Y_A represents the probabilities assessed by the Attacker, A. The arrow from D to Y_A indicates that the Defender's decision and the Attacker's probabilities are both relevant to the calculation of the Attacker's utility.

There are, of course, many more complicated scenarios than the simultaneous Attack-Defend game. One important example is the sequential Defend-Attack game. In the context of the FBI agent and terrorist example, this would arise if, instead of placing undercover police officers on the train, the agent instead had the option of searching passengers and luggage before people board the train. In that case, the game is sequential, since the agent sets up the security screening process in advance, and the terrorist

Figure 2. The game tree for the sequential Defend-Attack game.

sees that he will be searched, and must then decide whether to proceed with an attack.

The game tree in Figure 2 provides the sequential structure. The FBI agent (F) moves first, deciding whether to implement passenger screening. The terrorist (T) observes this, and then decides whether to proceed with an attack. Conditional on those two choices, there is the random event S, indicating whether the attack was successful. Depending upon that outcome, each opponent receives some utility.

To make this example a bit more concrete, suppose the agent has a loss of $10 million if the terrorist detonates a bomb on the train, and suppose the annual cost of implementing and maintaining a screening system is $2 million. Also suppose that the agent believes that the annual probability of a bombing attempt without screening is 1/200, but that if the terrorist knows that luggage will be searched, the probability of an attempt drops to 1/2,000. Further suppose that the probability that an attempted bombing is successful when there is no screening is 1, while the probability that an attempt is successful when there is screening is 1/50. (All of these numbers are fabricated, but in practice, there are data and experience that would enable the agent to make such assessments.)

Under these assumptions, the FBI agent's expected loss from screening is:

E[loss] = $2,000,000 + P[attempt] × P[success | attempt] × ($10,000,000)
        = $2,000,000 + (1/2,000)(1/50)($10,000,000)
        = $2,000,100

Similarly, the expected loss from not screening is

E[loss] = P[attempt] × P[success | attempt] × ($10,000,000)
        = (1/200)(1)($10,000,000)
        = $50,000

This analysis suggests that setting up security screening is not helpful—the cost of the screening exceeds the expected value of the protection.

Note that the previous analysis assumed that if screening is in place, then the chance of an attempt is 1/2,000, while if screening is not used, the chance is 1/200. How might these numbers be assessed? One could assume that they result from historical data, but they also could be derived from an ARA of the terrorist's strategic thinking. For example, the agent may not know what the terrorist thinks is his chance of successfully smuggling a

bomb through the security screening, but based upon an informant or long experience, expresses his uncertainty over this probability as a Beta(4, 1) distribution (i.e., the terrorist is an optimist, since the mean of this beta is 4/(4+1) = 0.8). Similarly, the agent does not know the utility function that the terrorist holds, but suppose he believes that the train is one of the top two targets, and that those targets are approximately of equal value. If there is no screening, the softness of the train target increases the probability that it is preferred to 0.9.

This line of reasoning leads the agent to decide that the expected probability of an attack with screening is 1/2 × 4/5 = 0.4, a much larger value than used in the previous analysis. If the agent uses these new values, he finds that the expected loss from adopting screening is

E[loss] = $2,000,000 + P[attempt] × P[success | attempt] × ($10,000,000)
        = $2,000,000 + (0.4)(1/50)($10,000,000)
        = $2,080,000

Similarly, repeating the calculation without screening finds that the expected loss is

E[loss] = P[attempt] × P[success | attempt] × ($10,000,000)
        = (0.9)(1)($10,000,000)
        = $9,000,000

Now the expected loss from not screening is much larger than the expected loss from screening, and the agent should decide to impose passenger searches.

Obviously, the two toy examples discussed in this section are not intended to be at all realistic. Their purposes are to illustrate the kind of thinking that goes into the decision analysis that drives ARA, and to show the application to two different kinds of games—the simultaneous and the sequential.

The Big Picture

ARA applies to a much larger class of problems than counterterrorism. Examples that have been studied include:

• application of ARA to mitigate Somali piracy (Sevillano, Ríos Insua, and Rios, 2012)

• analysis of La Relance or the Borel game, a famous problem in the history of game theory (Banks, Petralia, and Wang, 2011)

• the theory of auctions (Au, 2014)

• convoy routing through a road network with IEDs (Wang and Banks, 2011)

Moreover, the ARA perspective is a natural way to model multiparty negotiations, say, between a company that pollutes, the EPA, and the people who live in the affected community. Each party has a different set of utilities, different beliefs about the goals of the other parties, and a different set of probabilities that drive the risk analysis.

The ability to handle multiparty conflicts is a special advantage of ARA. Consider the case of an auction with three bidders: Abelard, Balthazar, and Clytemnestra. Traditional game theory requires that every pair of bidders has the same belief about the distribution of the other bidder's bid, all players know what those distributions are, and all players know that these distributions are common knowledge. Without those strong assumptions, one cannot calculate the Bayes Nash equilibrium that is the standard solution in auction theory, but it is completely realistic to imagine that Abelard has one distribution for Clytemnestra's bid, and Balthazar has a different distribution—and it is certainly preposterous to assume that they all know what each other's distributions are.

The ARA framework allows the analyst to support one of the bidders; say, Clytemnestra. Clytemnestra has subjective distributions for Abelard's bid, Balthazar's bid, what Abelard thinks is the distribution for her bid, what Balthazar thinks is the distribution for her bid, what Abelard thinks is the distribution of Balthazar's bid, what Balthazar thinks is the distribution of Abelard's bid, what Abelard thinks Balthazar thinks is the distribution of her bid, and what Balthazar thinks Abelard thinks is the distribution of her bid. (Yes—this gets messy. But most of the difficulty is in the language that expresses it, since the logical nesting is actually straightforward.)

Given all this, ARA enables Clytemnestra to calculate the bid that will maximize her expected profit, given everything she believes about all the parties in the auction. This shows that ARA can solve a problem that is not amenable to treatment by the existing tools of game theory.

Another important contribution of ARA is that it allows for modeling irrationality. It is well known that people are illogical and their beliefs are often not self-consistent. People may be superstitious, are often over-optimistic, and tend to have many kinds of cognitive bias. Behavioral psychologists have done much to map out these consistent patterns of cognitive confusion. Kahneman's book Thinking, Fast and Slow is a remarkable catalog of such examples.
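As an illustration of how Clytemnestra's auction calculation might be organized, here is a minimal Monte Carlo sketch. Everything numeric in it is invented: her $100 valuation and the normal distributions standing in for her subjective beliefs about Abelard's and Balthazar's bids are assumptions for this sketch, not quantities from the article, and a full ARA would build those rival-bid distributions from the nested beliefs described above.

```python
# Hypothetical ARA for a sealed-bid, first-price auction: Clytemnestra
# picks the bid maximizing E[(v - bid) * 1{bid beats both rival bids}],
# with rival bids drawn from her subjective belief distributions.
import random

random.seed(1)
v = 100.0  # Clytemnestra's value for the item (assumed)

def rival_bids():
    """One draw of (Abelard's bid, Balthazar's bid) from Clytemnestra's
    subjective distributions (assumed normal for this sketch)."""
    return random.gauss(70, 10), random.gauss(75, 15)

draws = [rival_bids() for _ in range(20_000)]

def expected_profit(bid):
    """Monte Carlo estimate of Clytemnestra's expected profit at `bid`."""
    wins = sum(1 for a, b in draws if bid > a and bid > b)
    return (v - bid) * wins / len(draws)

# Search a grid of whole-dollar bids for the most profitable one.
best_bid = max(range(50, 100), key=expected_profit)
print("best bid:", best_bid)
```

The same structure carries over to more elaborate belief models: only `rival_bids` changes, while the expected-profit maximization stays the same.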

In the context of counterterrorism, it is clear that terrorists prefer anniversary attacks on specific dates, such as September 11 and April 19. This is irrational, since law enforcement knows this preference and thus national security is on higher alert, TSA inspections are more stringent, and police departments are staffed up. At some informal level, defense experts are applying a kind of ARA to model their opponents' thinking, and exploiting their cognitive biases to improve public safety.

Conclusions

Game theory and risk analysis are both poor guides to real-world conflicts. Adversarial risk analysis offers an alternative that seems both realistic and actionable. ARA is not simple, but neither are game theory or risk analysis. Attempting to forecast the decisions of a strategic opponent is inherently complex. Nonetheless, there are many situations in which it is critical to do so, and ARA avoids obvious problems that plague traditional methodologies.

Further Reading

Au, T. 2014. Topics in Computational Advertising. PhD thesis, Duke University.

Banks, D., Petralia, F., and Wang, S. 2011. Adversarial Risk Analysis: Borel Games. Applied Stochastic Models in Business and Industry 27: 72–86.

Banks, D., Rios, J., and Ríos Insua, D. 2015. Adversarial Risk Analysis. Boca Raton, FL: CRC Press.

Harsanyi, J. 1982. Subjective Probability and the Theory of Games: Comments on Kadane and Larkey's Paper. Management Science 28: 120–124.

Kadane, J.B., and Larkey, P.D. 1982. Subjective Probability and the Theory of Games. Management Science 28: 113–120; reply: 124.

Kahneman, D. 2011. Thinking, Fast and Slow. New York, NY: Macmillan.

Koller, D., and Milch, B. 2003. Multi-Agent Influence Diagrams for Representing and Solving Games. Games and Economic Behavior 45: 181–221.

Myerson, R. 1991. Game Theory: Analysis of Conflict. Cambridge, MA: Harvard University Press: 114–115.

O'Hagan, A., Buck, C., Daneshkhah, A., Eiser, J.R., Garthwaite, P.H., Jenkinson, D.J., Oakley, J.E., and Rakow, T. 2006. Uncertain Judgements: Eliciting Experts' Probabilities. Hoboken, NJ: Wiley.

Raiffa, H. 1982. The Art and Science of Negotiation. Cambridge, MA: Harvard University Press.

Sevillano, J.C., Ríos Insua, D., and Rios, J. 2012. Adversarial Risk Analysis: The Somali Pirates Case. Decision Analysis 9: 81–85.

Stahl, D., and Wilson, P. 1995. On Players' Models of Other Players: Theory and Experimental Evidence. Games and Economic Behavior 10: 218–254.

Wang, S., and Banks, D. 2011. Network Routing for Insurgency: An Adversarial Risk Analysis Framework. Naval Research Logistics 58: 595–607.

Zhuang, J., Bier, V., and Alagoz, O. 2005. Modeling Secrecy and Deception in a Multiple-Period Attacker-Defender Signaling Game. European Journal of Operational Research 203: 409–418.

About the Author

David Banks is a professor in the Department of Statistical Science at Duke University. He has been involved in research on bioterrorism, cybersecurity, and risk analysis, and is a long-standing member of the ASA's Section on Statistics in Defense and National Security.

Adaptive Testing of DoD Systems with Binary Response

Douglas M. Ray and Paul A. Roediger

The U.S. Army Armament Research, Development, and Engineering Center (ARDEC) is the Army's center for lethality, supporting the majority of armament systems for the U.S. military. ARDEC's professional statisticians, data scientists, and statistical engineers collaborate and consult with integrated product engineering teams on a wide variety of projects and programs spanning the product lifecycle, providing "cradle to grave" analytics support.

Design of Experiments (DoE) is a statistical data collection methodology that involves the systematic selection of factor-level combinations to best support a credible empirical model. This mathematical model, which translates input settings into output predictions, can be used for screening (to identify factors of interest), comparison, characterization, and optimization. Some unique challenges arise when the response data being collected are binary, but those challenges can be addressed effectively using modern DoE techniques. Approaching test activities as designed experiments, rather than pass-fail events, results in the collection of richer information for decision-making and insights about the system or process under study.

Imagine we are working for a company that is developing ruggedized smartphone cases for the new "xPhone" and there are several prototype design candidates—say, two different material densities at two thicknesses, giving us four total combinations to evaluate (let's call them A-1, A-2, B-1, and B-2). How could we evaluate the performance of the different designs—both relative to one another, and in terms of their ability to protect the phones at a specified drop height (say, 5 feet)? In this hypothetical example, the response data are binary, meaning that when we drop a phone, it either breaks or survives the fall.

One approach often employed in product verification is a "zero-failure" reliability test, which is derived from the binomial distribution. As Jovanovic and Levy say in "A Look at the Rule of Three," a simplification of a special case of the zero-failure reliability test is the "Rule of Three" for 95% confidence at some specified reliability level. For example, if the specified reliability we seek to demonstrate in testing is 90% (R = 0.90) with a 95% Lower Confidence Bound (LCB), then the Rule of Three approximation requires us to use 3/(1 − R) = 3/0.10 = 30 samples dropped at 5 feet for each smartphone case design.

This will be an expensive test, since it will result in potential destruction of many of the test samples. If our reliability requirement were 99%, 99.9%, 99.99%, or greater (instead of 90%) at a 95% confidence level (n = 300, 3,000, and 30,000, respectively), then the sample size quickly becomes infeasible for most applications. In addition, we would not have gained much insight about each design in terms of safety margin; in other words, we will have some limited information about performance at 5 feet, but no insight about breakage probability at other stimulus levels. Also, the test results may be ambiguous and not provide a clear path for decision-makers to select an optimal product design.

For example, what if the tests of all four designs result in zero breakages? An ideal outcome in some respects, but we will not be able to conclude whether any one design is better than another. Even if some of the four designs experienced a few breakages out of the 30 tests, we would be hard-pressed to establish a significant difference with statistical confidence. Often, all that can be learned when we design a test with only a pass/fail criterion is whether the product passed or failed the test.

An alternative to this approach is known in reliability engineering as a "Probit Test," as described by Prairie. This would involve spreading the 30 samples (or however many samples we can acquire for testing) across multiple levels of the drop-height stimulus, and then analyzing the resulting test data using Binary Logistic Regression or Probit Regression to develop a predictive model for drop height vs. Probability of Survival (non-breakage) for each of the four prototype design variants (see Figure 1).
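The zero-failure sample sizes quoted above are easy to check numerically. A minimal sketch (Python here, although the tools discussed in this article are in R; the function names are invented for illustration):

```python
import math

def zero_failure_n(reliability, confidence=0.95):
    # Exact zero-failure binomial test: smallest n such that observing
    # n survivals in n trials demonstrates R at the stated confidence,
    # i.e., the smallest n with reliability**n <= 1 - confidence.
    return math.ceil(math.log(1 - confidence) / math.log(reliability))

def rule_of_three_n(reliability):
    # Rule of Three approximation for 95% confidence: n ~ 3 / (1 - R)
    return 3 / (1 - reliability)

for r in (0.90, 0.99, 0.999):
    print(r, zero_failure_n(r), round(rule_of_three_n(r)))
```

The approximation slightly overstates the exact requirement (30 vs. 29 samples at R = 0.90), which is why it serves as a convenient, mildly conservative rule of thumb.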

VOL. 31.2, 2018 30

Figure 1. Drop height vs. P(survive drop) model for four candidate designs.
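Curves like those in Figure 1 come from fitting a two-parameter logistic model to the binary drop results. A minimal maximum-likelihood sketch (Python, standard library only; the drop data below are made up for illustration, with heights centered at the 5-foot spec):

```python
import math

def fit_logistic(x, y, iters=20000, lr=0.5):
    """Fit P(break) = 1/(1 + exp(-(a + b*x))) by gradient ascent on the
    log-likelihood; x is centered drop height, y is 1 = broke, 0 = survived."""
    a = b = 0.0
    n = len(x)
    for _ in range(iters):
        ga = gb = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(a + b * xi)))
            ga += yi - p          # gradient w.r.t. intercept a
            gb += (yi - p) * xi   # gradient w.r.t. slope b
        a += lr * ga / n
        b += lr * gb / n
    return a, b

# Hypothetical results: six drops at each of 3-7 ft (coded -2..2 around 5 ft),
# with breakage becoming more frequent at greater heights.
heights = [-2] * 6 + [-1] * 6 + [0] * 6 + [1] * 6 + [2] * 6
broke = [0] * 6 + [1] + [0] * 5 + [1] * 3 + [0] * 3 + [1] * 5 + [0] + [1] * 6
a, b = fit_logistic(heights, broke)
p5 = 1.0 / (1.0 + math.exp(-a))  # estimated P(break) at the 5-ft spec
```

A positive fitted slope confirms that breakage probability rises with drop height; the probit fit differs only in the link function.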

Assuming we tested in the correct stimulus region and that we collected high-quality data, this will provide much-more-useful information than the previous approach.

However, this approach assumes that we have some prior knowledge of a model—where the center of the curve lies, and how wide it is in terms of stimulus values corresponding to the all-break and all-survive thresholds. If we don't already know where this region lies, then it is possible to select stimulus levels that provide little to no useful information—a difficult position to defend when reporting test results.

This situation can arise if the analyst selects: (1) stimulus levels that are spread too far apart from one another; (2) stimulus levels that are too close together; or (3) a stimulus range that is too far above or too far below the stimulus region where the "true" curves occur. For example, if we had decided to spread our 30 samples across five drop heights, with six samples at each drop height centered at our spec of 5 feet, in 1-foot increments (3 ft, 4 ft, 5 ft, 6 ft, and 7 ft), we may end up with a highly unbalanced data set. Based on where our curves lie in Figure 1, we might see two design variants (B-1 and B-2) with only one or two failures out of 30 samples (occurring at 7 ft), and the other two (A-1 and A-2) with zero failures. Faced with this situation, regression analysis wouldn't be able to provide us with a useful predictive model.

The Probit Test approach is much more desirable when compared to the first if we have some prior knowledge of how the product will behave, but the generation of useless test data using this approach is a common occurrence.

A third approach would be an adaptive sensitivity test method. The Department of Defense has a rich history with adaptive sensitivity testing, going back to the "Bruceton Up-Down" method developed in 1948 by Dixon and Mood. Adaptive sensitivity testing is similar to the Probit Test in that it seeks to generate a set of data that supports a predictive regression model for the probability or quantile associated with various stimulus levels. Where it differs is that it adapts to the results as each data point is generated and analyzed, thereby significantly reducing the risk of generating useless or unbalanced data sets.
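The Bruceton up-down logic is simple to sketch: after a break, step the stimulus down; after a survival, step it up. A toy simulation (Python; the function names and the simulated phone-case population are invented for illustration):

```python
import random

def bruceton_up_down(start, step, n_trials, breaks_at):
    """Classic Bruceton up-down scheme: one unit per trial; step the
    stimulus down after a break and up after a survival."""
    level, results = start, []
    for _ in range(n_trials):
        broke = breaks_at(level)
        results.append((level, broke))
        level += -step if broke else step
    return results

# Simulated population: a case breaks when the drop height exceeds its own
# normally distributed strength threshold (mean 5 ft, sd 1 ft).
rng = random.Random(31)
history = bruceton_up_down(
    start=5.0, step=0.5, n_trials=30,
    breaks_at=lambda h: rng.gauss(5.0, 1.0) < h)
```

Because each level is one step from the last, the trials concentrate near the 50th percentile—good for estimating the mean, but less efficient for the tails, which is part of what the modern D-optimal and 3pod refinements discussed next improve.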

Modern sensitivity test methods (such as Neyer's D-optimal test or Wu and Tian's 3pod) require the use of a computer to execute real-time calculations that select stimulus levels for each sample based on previous response data.

As Kiefer describes, D-optimality is a computational design-generation method commonly used in DoE as an alternative to factorial-based classical designs. The D-optimality criterion seeks to maximize the determinant of the Fisher information matrix through optimal placement of points within the design space. The distinction in adaptive sensitivity test approaches is that this algorithm is executed after each individual sample has been tested.

Adaptive sensitivity tests

Adaptive sensitivity tests begin with an initial guess for the mean (where the curve will be centered, or the stimulus value corresponding to the 50th-percentile response) and the standard deviation—or sometimes the stimulus values associated with the upper and lower tails (the ≈0.05 and ≈0.95 response probabilities).

Typically, the algorithm's determination of the first several stimulus levels in testing is logic-based and seeks to identify a stimulus region that results in a mix of responses, often referred to as the zone of mixed results. At this point, both Neyer and 3pod implement a D-optimal procedure, recalculating the D-optimal point after every stimulus value, with each response recorded.

The interesting thing about this test approach is that even if the initial guess for the center or range (or both) of stimulus values is wrong, the test will adjust and adapt based on the previous responses, and will usually obtain a data set with overlap in five to 15 samples, which can then be analyzed using Binary Logistic Regression to develop a tentative predictive model.

Using D-optimality after achieving overlap means that the algorithm's placement of the rest of the points is optimized to provide the greatest improvement in predictive precision. Usually this means that the points will be alternately placed near the 17th and 83rd quantiles of the curve.

The 3pod procedure stands for three-phase optimal design of sensitivity experiments. It is similar in some ways to Neyer's D-optimal test procedure, mainly in the execution of phase 2, but contains a third phase, the Robbins-Monro-Joseph (RMJ) procedure, with what Wang, Tian, and Wu describe as the more-recently developed skewed RMJ option, a nonparametric quantile estimation method. A unique aspect of 3pod is flexibility and modularity: The three phases of 3pod can be configured to meet specific sensitivity test goals.

Using a modern adaptive sensitivity test method minimizes the risk of generating useless test data, making for more defensible testing. With 20–30 samples per smartphone case design, we can generate a rich data set, providing the ability to develop relatively precise models for each prototype design, which enables us to predict the probability of smartphone survival (reliability) at any stimulus level of interest at a specified confidence level.

Figure 1 shows that the means of designs A-1 and B-2 are the largest in terms of drop height (i.e., resistance to breakage from greater heights). In fact, the curves cross very close to their 50th percentiles, and B-2 is slightly better, based on the mean alone. However, the slope of the curve is determined by the standard deviation, where a smaller standard deviation (less system variability) results in a steeper curve. As the standard deviation approaches zero, the curve would approach a step function; the standard deviation is often just as important as the mean.

With this in mind, design A-1 is the clear winner. The mean is nearly as high as B-2's, but the standard deviation is smaller, leading to better overall performance up to approximately 9 feet. This means that at our specification of 5 feet, A-1 has much better reliability than B-2. Figure 2 shows the 90% two-tailed confidence intervals (or 95% lower confidence bound) for the prediction model for probability of survival, or reliability. At the 5-feet drop height specification, the model prediction point estimate is 0.9996, while the 95% LCB for reliability is 0.9831.

Using a modern adaptive test approach translates to improved decisions in terms of system reliability and performance, increased fidelity as to the true design margins relative to product specifications, and minimized risk of tests that generate low-utility data sets—while potentially reducing test quantities by an order of magnitude or more, thereby saving substantial test cost, schedule, and hardware.

One drawback of adaptive sensitivity test methods is that, currently, procedures are only available for experiments with a single factor. There might be the potential opportunity for an extension to the adaptive sensitivity test that can also treat the material thickness and material density of the smartphone case as two design factors, while the stimulus factor—drop height—is the "noise" factor. This becomes a three-factor, robust-parameter DOE problem, executed adaptively. As Ray, Roediger, and Neyer suggest, adaptive test approaches that can handle multiple design

Figure 2. Drop height vs. reliability for a 90% two-tailed confidence interval (95% LCB).
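The 17th/83rd-quantile placement mentioned above can be checked numerically. A minimal sketch (Python; a unit-logistic curve is assumed purely for illustration) compares the D-optimality criterion—the determinant of the Fisher information matrix—for two symmetric two-point designs:

```python
import math

def fisher_det(points, a=0.0, b=1.0):
    """Determinant of the 2x2 Fisher information matrix for a logistic model
    p(x) = 1/(1 + exp(-(a + b*x))) tested once at each design point."""
    i11 = i12 = i22 = 0.0
    for x in points:
        p = 1.0 / (1.0 + math.exp(-(a + b * x)))
        w = p * (1.0 - p)  # per-trial information weight
        i11 += w
        i12 += w * x
        i22 += w * x * x
    return i11 * i22 - i12 * i12

x83 = math.log(0.83 / 0.17)  # stimulus at the 83rd percentile of a unit logistic
print(fisher_det([-x83, x83]))   # points near the 17th/83rd quantiles
print(fisher_det([-0.5, 0.5]))   # points crowded near the median
```

Points near the outer quantiles carry more joint information about both the mean and the slope than points crowded at the median—or pushed far into the tails, where responses become nearly deterministic.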

factors, multiple stimulus variables, or some combination of the two would be a welcome tool in the analyst's toolkit.

The U.S. Army ARDEC statisticians have developed a customized sensitivity-testing implementation in R called Gonogo. With it, experimenters are equipped to perform four adaptive protocols in real time: the Neyer test; Wang, Tian, and Wu's newly revised 3pod2.0; and the historically important Bruceton and Langlie tests as shown in the DoD's MIL-STD-331D. The powerful and flexible 3pod2.0 test approach has been applied to dozens of armament-related systems over the past several years. By including the Neyer test, this comparably powerful procedure is now readily accessible without having to acquire a commercial license for SenTest.™

Gonogo provides graphics and tabular outputs, including sequential plots of the data and, when appropriate, confidence intervals computed via Fisher Matrix, GLM, or Likelihood Ratio methodologies. It includes a fair amount of documentation to give the user fast access to the procedures described.

Gonogo also includes a simulation suite of R functions and graphics. With it, users can study the performance of any of Gonogo's four adaptive procedures under various conditions of interest. The following is one such example.

Example: Suppose a study is to be conducted on the electrical sensitivity of an initiator being developed by the U.S. Army. The purpose of the study is to reduce the size and cost of the item's componentry by reducing the nominal mean (Vnom) and variance (σ²nom) of initiation voltage. Qualification testing of this next-generation initiator will be accomplished via 30-shot sensitivity tests, either Neyer or 3pod, having initial starting values of min = Vnom − 4σnom, max = Vnom + 4σnom, and σguess = σnom. The evaluation of all tests will depend on the estimation of two quantities:

1. Maximum No Fire Voltage (MNFV), the voltage having at most a 0.001 probability of firing (at 95% confidence), and

2. Maximum Allowable Safe Stimulus (MASS), the voltage having a 1-in-1-million probability of firing (point estimate).

The proposed qualification criterion will be: Max(MNFV, MASS) ≥ 500 volts.

The design team would like to know:

A. If Vnom can be reduced to 700 volts, how small does σnom have to be to ensure a 95% probability of qualifying the new initiator?

Figure 3. Test 42983 (a randomly generated 3pod test).

Table 1—90% 2-sided Confidence Interval Estimates (Computed via the GLM Methodology)

A Confidence Interval Calculation for Test 42983

            Stress (q)                        Probability (p)
        qlo        q          qhi         plo        p          phi
MASS    249.7810   455.2194   660.6579    0.000000   0.000001   0.213193
MNFV    408.6442   541.5460   674.4479    0.000000   0.001000   0.298157

B. Which test, Neyer or 3pod, is better suited for the job?

While a Gonogo simulation generates a single test at a time, generating many at a time will require writing your own customized, and usually brief, R script. Figure 3 depicts one test we came across in our first simulation study of 2,000 3pod tests of size 30 for σnom = 29.2.

Gonogo includes a handy function to compute confidence intervals—one bounding probability for a given stress level, and the other bounding stress for a given probability. For the 3pod test above, it returns MNFV and MASS estimates in the format shown in Table 1. The 2,000 MNFV and MASS estimates are plotted in Figure 4.

Repetition of the process that led to the .947 estimate for σnom = 29.2 was done 11 more times, yielding estimates having an aggregate mean of .9542. The entire 3pod case was completed after repeating this procedure for other values of σnom. A Neyer-case counterpart, subsequently completed in a similar manner, allowed eventual assembly of Table 2.
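Under the normal (probit) model used here, the point estimates behind MNFV and MASS are simply extreme quantiles of the fitted initiation-voltage distribution. A minimal sketch (Python's `statistics.NormalDist` standing in for the R tooling; note the article's MNFV additionally carries a 95% confidence bound, which this sketch omits):

```python
from statistics import NormalDist

def quantile_estimates(mu_hat, sigma_hat):
    """Stress levels at which a fitted normal model puts the firing
    probability at 0.001 (MNFV point estimate) and 1e-6 (MASS)."""
    fitted = NormalDist(mu_hat, sigma_hat)
    return fitted.inv_cdf(0.001), fitted.inv_cdf(1e-6)

# True curve from the example: Vnom = 700 volts, sigma_nom = 29.2
mnfv_pt, mass_pt = quantile_estimates(700.0, 29.2)
```

With the true parameters, both quantities sit comfortably above 500 volts, so most simulated 30-shot tests qualify; test 42983 in Table 1 is one whose estimates came out low.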

Figure 4. Pr[Qualifying] = 1894/2000 = 0.947

Table 2—Each P is an Average Obtained from 12 Simulations (2,000 Tests of Size 30 for Each)

P = Pr[Qualifying | Vnom = 700, σnom]

σnom     28.4    28.6    28.8    29.0    29.2    29.4    29.6    29.8    30.0    30.2
P3pod    .9609   .9614   .9582   .9572   .9542   .9490   .9501   .9458   .9420   .9395
PNeyer   .9611   .9581   .9566   .9545   .9532   .9497   .9475   .9455   .9407   .9388

This simple summary abated the team's biggest concern: that limitations imposed on σnom by our analysis could far exceed the new item's process-control capability. Table 2 also reveals that the two test protocols are comparable for this application.

Sensitivity testing has a wide variety of applications in the DoD, including testing energetic materiel to determine impact sensitivity thresholds; precision-guided munition fuze component reliability testing, where the voltage vs. P[initiation] predictions can then be rolled up to system-level reliability models via block diagrams; ballistic testing of soldier protective armor (by hand-loading cartridges to varying velocities); and ammunition penetration testing (often referred to as V50 testing).

The need for this method of testing in the DoD is driven by the types of systems in use, and the sometimes-destructive nature of the testing. However, sensitivity testing is still underused in private industry. One reason may be that the applications are less obvious, and none of the techniques have been adopted in DoE textbooks or commonly available software implementations.

A few application areas beyond the DoD use variations of some of the approaches discussed here, including psychoacoustics (e.g., hearing tests with different-volume "beeps") and dose-ranging in pharmaceutical research. Recently, the U.S. Army ARDEC consulted with Sartorius Stedim

Biotech (an international supplier of equipment and services to the biopharmaceutical industry) on implementing modern sensitivity testing procedures, and they have already applied 3pod to a variety of products using Gonogo. In one example, the adaptive process enabled their engineers to assess the fragility of containers protecting high-value product quickly, while consuming a reduced number of prototypes. Other applications now being investigated include estimating lifetimes for parts that cannot be inspected during use, and improved methods for determining the low-temperature characteristics of plastic film materials.

Although classical Design of Experiments has origins in agricultural applications, modern DoE is a powerful and flexible family of statistical tools with specialized techniques that can be adapted to a wide variety of applications: Mixture DoE for chemical formulations; Space-Filling Designs for computational modeling and simulation experiments, and Uncertainty Quantification (UQ) studies (see Sacks, Welch, Mitchell, and Wynn); optimal designs for complex real-world problems, such as test constraints or hard-to-change factors; and covering arrays for software testing (see Dalal and Mallows), to name a few.

DoE practitioners understand that it is generally preferred to capture continuous response measurements rather than binary response data, whenever possible. Adaptive sensitivity testing is a powerful tool that belongs in the toolkit of any DoE practitioner in industry, as well as the DoD. When continuous response measurement is not possible, these test methods are the best available tools for dealing with the challenges and risks inherent to binary response data collection when faced with real-world resource constraints.

Currently, statisticians from the U.S. Army ARDEC and the Air Force Institute of Technology (AFIT) are collaborating on developing a universally available R package to execute the U.S. Army ARDEC's customized sensitivity testing capability. To obtain the latest documentation and version of Gonogo, visit https://www2.isye.gatech.edu/~jeffwu.

Further Reading

Jovanovic, B.D., and Levy, P.S. 1997. A Look at the Rule of Three. American Statistician 51(2): 137–139.

Prairie, R.R. 1967. Probit Analysis as a Technique for Estimating the Reliability of a Simple System. Technometrics 9(2): 197–203.

Dixon, J.W., and Mood, A.M. 1948. A Method for Obtaining and Analyzing Sensitivity Data. Journal of the American Statistical Association 43: 109–126.

Neyer, B.T. 1994. A D-Optimality-Based Sensitivity Test. Technometrics 36: 61–70.

Wu, C.F.J., and Tian, Y.B. 2014. Three-phase Optimal Design of Sensitivity Experiments. Journal of Statistical Planning and Inference 149: 1–15.

Kiefer, J. 1959. Optimum Experimental Designs. Journal of the Royal Statistical Society, Series B 21: 272–319.

Wang, D., Tian, Y.B., and Wu, C.F.J. 2015. A Skewed Version of the Robbins-Monro-Joseph Procedure for Binary Response. Statistica Sinica 25(4): 1,679–1,689.

Ray, D.M., Roediger, P.A., and Neyer, B.T. 2014. Discussion of Three-phase Optimal Design for Sensitivity Experiments. Journal of Statistical Planning and Inference 149: 20–25.

Wang, D., Tian, Y.B., and Wu, C.F.J. Comprehensive Comparisons of Major Design Procedures for Sensitivity Testing. Journal of Quality Technology (to appear).

MIL-STD-331D, Appendix G. 2017. Statistical Methods to Determine the Initiation Probability of One-Shot Devices. Washington, DC: Department of Defense.

MIL-DTL-23659F. 2010. Initiators, Electric, General Design Specification for. Washington, DC: Department of Defense.

Neyer Software LLC. SenTest,™ Version 1.0. http://bit.ly/2FVN1fS.

Sacks, J., Welch, W.J., Mitchell, T.J., and Wynn, H.P. 1989. Design and Analysis of Computer Experiments. Statistical Science 4(4): 409–423.

Dalal, S.R., and Mallows, C.L. 1998. Factor-Covering Designs for Testing Software. Technometrics 40(3): 234–243.

About the Authors

Douglas Ray, PStat,® is the lead statistician at the U.S. Army ARDEC in Picatinny Arsenal, NJ. His work focuses on application of industrial statistics and analytics to armament systems spanning the engineering lifecycle; he is an experienced Design of Experiments (DOE) practitioner, and focuses on statistical quality control and process improvement, reliability data analysis, data mining analytics, and Uncertainty Quantification (UQ)/probabilistic optimization of computational models and simulations. Ray holds a BS in applied mathematics and an MS in statistical engineering, and is currently working toward his PhD in systems engineering analytics at Stevens Institute of Technology. He is an ASA Accredited Professional Statistician,™ Lean Six Sigma Black Belt, and ASQ Certified Reliability Engineer. He is also a combat veteran of the U.S. Army.

Paul Roediger is a subject matter expert (SME) in mathematics and statistics with UTRS, Inc. He retired from federal service in 2012 as lead statistician at the U.S. Army ARDEC, Picatinny Arsenal, NJ. He holds a BS degree in mathematics and MS degrees in both mathematics and statistics. While with UTRS, Inc. and supporting the U.S. Army ARDEC, he developed a suite of R programs to conduct, analyze, and simulate two of the most-modern sensitivity experiments available—Jeff Wu et al.'s 3pod and Barry Neyer's SenTest™ experiments.


Developments in the Statistical Modeling of Military Recruiting

Samuel E. Buttrey, Lyn R. Whitaker, and Jonathan K. Alt

Recruiting is an expensive, ongoing challenge for all of the U.S. military services. More than a quarter of the U.S. Department of Defense (DoD) budget—around $150 billion per year—goes to fund the pay and benefits of active and retired members of the military. Even modest gains in recruiting efficiency can translate to large dollar savings.

These gains in efficiency might come from a number of sources. First, smarter allocation of recruiters to territories might allow the services to use fewer recruiters. Second, close screening and adequate preparation of recruits could reduce the numbers of those who are unable to complete their terms of service; every time a military member drops out, the services are faced with the need to find, and train, his or her replacement. Third, the Army alone employs more than 9,000 recruiters. If the Army could recruit the same numbers using, say, 1,000 fewer recruiters, it could find itself with 1,000 fewer salaries and benefit packages to support—or, alternatively, the equivalent of two brand-new infantry brigade combat teams.

This article describes some of the research in attacking this first problem—devising strategies for more-efficient allocation of recruiting resources. The starting point is to define the problem of locating young people who qualify for, and are interested in, the military. (Simply using past numbers of recruits in a specific area can be problematic.) Determining this number of available people requires detailed data, some of which are hard to come by or incomplete.

Some of the factors that seem to be associated with the number of eligible and interested young people, and what data might be used to help measure those factors, are also featured. In real life, military recruiting commands are responsible for recruiting not only ordinary enlistees, but also officers, reservists, and members of specialty fields such as doctors, lawyers, and chaplains. This research focuses on the bulk of their mission: the enlisted service members who make up the major part of the active-duty armed forces. Consulting with both the Army's and the Navy's recruiting enterprises leads to discussing matters in terms of those service members.

Having the data in hand makes it possible to model the number of available personnel in a variety of ways. Some statistical models can be used to estimate market depth—that is, the number of young people both eligible for, and interested in, military service in a particular area. Knowing the market depth is not quite enough to predict the number of recruits, however. In general, recruiting requires effort by recruiters to identify potential service members, inform them of the benefits and costs of service, and process their paperwork. Increasing recruiter effort in an area (by assigning more recruiters, or by reducing the area assigned to a particular recruiter so he or she can focus on a more limited area) can be expected to produce more recruits from that area. The number of recruits from an area depends on how many young people live there, and also on how much recruiting effort is applied there.

Finally, estimates of market depth can be used as inputs for optimization-type models that determine the optimal locations for recruiters, and recruiter stations, under different sets of constraints and assumptions. These models are also constrained by real-life factors, such as driving distances and real-estate leases.

Eligibility and Propensity

The market depth of an area depends most importantly on the number of people who live there. Most specifically, the military needs young, healthy people who are eligible to join by satisfying a number of requirements. For example, prospective recruits have to be medically sound, have clean criminal records, perform adequately on the entrance examination, and so on. Rates of eligibility differ from place to place, with some places, for example, having higher obesity rates than others. It makes sense to focus recruiting efforts on places with high rates of eligibility.

Separately from eligibility, however, there is the question of propensity, or the rate at which eligible young people are willing to join the all-volunteer force. Propensity varies from place to place for a number of reasons. Anecdotes suggest, for example, that the Deep South is a more-fertile region for military recruiting than New England. This might partly have to do with societal attitudes toward military service. However, there must be other factors at work as well, which can be examined by considering the non-military options available to eligible young people: college and employment. Areas that have four-year colleges nearby will presumably produce fewer recruits, as will areas where the local job market is strong.

To some extent, the services compete against one another—in areas with a strong Army recruiting presence, it might be harder to find recruits for the Navy. There is also evidence that the military presence in an area, as represented by active-duty or veteran populations, has a positive effect on the services' recruiting efforts. This might be a function of community visibility of the base itself, a statement about veterans who settle in the area, or some other factor.

Beyond college and employment, other attributes of an area can be useful in statistical models of recruiting. These might be demographic, medical, economic, or educational.

Of course, in the end, recruiting is a decision made by one individual, not by an area, and many of the factors influencing recruiting are personal. Many recruits join out of a sense of purpose or patriotism; these people might be persuaded to join any of the services. Others look for specific job training that might only be available in one service, or might not happen to be available at all at the time a potential recruit is inquiring. Young people who lack the means to pay for college might sign up to take advantage of educational benefits through programs like the G.I. Bill. Moreover, enlistment bonuses are available for a number of military specialties; for many jobs, these might be $1,000 or $2,000, while for jobs that are particularly difficult to fill, these

bonuses can be in the tens of thousands of dollars. While we try to measure market depth as a function of local economic and demographic factors, we have to be aware that recruits are highly variable individuals, each with his or her own story.

As a final note, we focus here on recruiting brand-new service members for their first term of active-duty service. Presumably, the services could expect larger numbers of recruits if they were to offer larger bonuses or more-generous educational benefits. However, the services need some of these first-term service members to re-enlist, to fill the jobs set aside for more-experienced personnel and to fill the ranks of non-commissioned officers. Personnel who enlist primarily to earn a lump sum or to acquire college money would not be expected to re-enlist at high rates, so simply increasing first-term recruiting may not be all that the services need. Like so many targets of modeling, the underlying real-world problem here is complicated and multi-faceted.

Recruiting's Chicken-and-Egg Problem

One natural way of estimating how many recruits an area can produce is to look at the number produced by that area in the past. However, the number of recruits produced is certainly a function of recruiter effort—how many recruiters were assigned to an area, how hard those recruiters worked, how persuasive they were, and so on. The number of recruiters assigned to an area is, in turn, at least partly determined by the number of recruits produced by that area in the past. Military personnel assigned as recruiters will typically meet the recruiting goal assigned to them, regardless of the amount of effort it might take given the propensity of the young people in their area of responsibility. This gives rise to recruiting's "chicken-and-egg" problem: An area that has seen lots of past recruits is assigned more recruiters, who in turn produce more recruits, which makes the area seem particularly productive.

One way of avoiding the problem is by considering "leads." Recruiters typically start with a set of leads—names of youngsters who might be persuaded to enlist—that they amass in a number of ways. Some leads are referrals from other recruits, or young people who show interest at sporting events or other locations where recruiters maintain a presence. These might be considered "high-quality" leads, since the young person in question is already interested in military service.
Another source of high-quality leads is "walk-ins," the set of people who literally walk in to a recruiting station and begin the process on their own. Lower-quality leads might include, for example, the list of seniors graduating from a nearby high school. Both of these sorts

VOL. 31.2, 2018 40 11–58 3–10 1–2

Figure 1. ZIP Code centroids with 1–2 leads (gray), 3–10 leads (black), and 11 or more leads (red) in FY 2014 (Navy).
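ZIP centroids like those mapped in Figure 1 also drive the station assignments and driving-cost estimates discussed in the surrounding text. A toy sketch (Python; the stations, ZIP codes, and coordinates are invented, and plain Euclidean distance stands in for the road-network drive times the authors ultimately want):

```python
import math

# Hypothetical recruiting stations and ZIP centroids (plane coordinates).
stations = {"Station A": (30.0, -95.0), "Station B": (30.5, -95.8)}
zip_centroids = {"77001": (30.1, -95.1), "77450": (30.4, -95.7), "77880": (30.2, -95.4)}

def nearest_station(centroid):
    # Assign a ZIP centroid to the closest station by Euclidean distance.
    return min(stations, key=lambda name: math.dist(stations[name], centroid))

assignment = {z: nearest_station(c) for z, c in zip_centroids.items()}
```

In practice each Zip is assigned to exactly one station, so a rule like this—with road distance in place of straight-line distance—also yields each station's coverage cost.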

of leads are inherently "local," and at least to some extent the number of leads that a recruiter obtains—and the number of leads that can be converted into actual recruits—depends on his or her own effort and skill, again showing the chicken-and-egg problem associated with estimating the number of recruits. Disentangling the inherent propensity to join of youth in an area from the skill of the recruiters assigned there is not an easy task.

A second sort of lead is "national" rather than local. National leads come from young people making the first move, perhaps by filling out a web form or using a toll-free telephone number. National leads are much like walk-ins in that they are high-quality, self-generated leads. Areas with a lot of national leads, or a lot of leads per resident, can be assumed to be areas of high propensity. Of course, national leads can be expected to occur at a higher per capita rate in areas with no recruiting station nearby, which would normally be more rural ones.

Still, the use of national-level leads is one way to try to get around recruiting's chicken-and-egg problem, and in some of our work we have used national-level leads as a surrogate for propensity. For FY 2014, the ZIP codes with the top 5% of leads for one of the services (ZIP codes with 11 or more leads for that fiscal year) are shown in red in Figure 1.

ZIP Code-level Data

Recruiting stations are assigned sets of postal ZIP codes (Zips, for simplicity). Some Zips, often in rural areas, are not assigned to any station; in more-populous areas, each Zip will be assigned to (exactly) one recruiting station. Recruiters at that station are responsible for finding prospective recruits in their set of Zips, selling those recruits on service, and shepherding the recruit's application through the necessary steps.

There are some 35,000 Zips of interest in the USA, although thousands more have no actual residents because they represent businesses, post office boxes, universities, and so on. Since each Zip is assigned to a recruiting station, it is natural to compute market depth at the Zip level. Data describing demographic, educational, and other characteristics at the Zip level are publicly available.

Inevitably, though, these data have issues. For one thing, some are missing (some states do not appear in FBI crime reports). For another, some data are at the county level. Knowing which Zips appear in

CHANCE 41 which counties makes it possible to ZIP code-level estimates are Poisson model. If the size of the distribute county-level data to ZIP highly variable. set of unproductive ZIP codes codes, but this induces noise, as For example, market depth is is bigger than the usual Poisson well as correlation between nearby often higher in areas where lots model would support, better results Zips. An additional complication of veterans reside. If one Zip has can be achieved with a more- is that some Zips span county lines. a comparatively high number of complicated two-stage model. Even when the ZIP-level data veterans residing there, and an The so-called “zero-inflated” are in good shape, a number of adjacent one has a low number of model is one of these. The first practical considerations arise. First, veterans, a positive veteran effect stage tries to determine which ZIP recruiters have to be physically might still be expected in the sec- codes will never produce recruits; present in ZIP codes where they ond one, since increased veteran the second is a count distribu- recruit. They drive to high schools, presence might be an attribute tion-type model for the number event sites, recruits’ homes, and so of the whole area rather than of of recruits from those that will. on. When a particular ZIP code is individual Zips. In these cases, Under this model, some ZIP codes added a recruiter’s portfolio, it is the models might be improved will never produce recruits; others important to account for the cost, by constructing estimates based may produce recruits, but may not. to the recruiter, of covering that on smoothing those from several A related two-stage model called Zip. Large Zips, or those far from adjacent Zips. The mechanisms for the “hurdle” supposes that an area the recruiting station, result in these smoothing operations are a will not generate any leads until a long driving times for the recruiter, subject of ongoing work. 
minimum threshold of recruiting which reduces the amount of pro- effort is expended. ductive recruiting interactions that Modeling While development of these are possible. Market Depth models is continuing, both of these two-stage models, with a Earlier modeling efforts have Since the number of recruits in computed the Euclidean distance Negative Binomial count distri- a ZIP. This model assumes that bution, seem to perform better from the recruiting station to the the number of leads in any area than a simple Poisson model. geographical centroid of the Zip, is a Poisson random variable Figure 2 shows an example of and used that distance to evaluate whose expected value is a func- the output of these models; each the cost in time of assigning a Zip tion of some explanatory variables. ZIP code in the area of the Navy’s to a recruiting station. However, it These explanatory variables would Houston recruiting district is colored is clear that geographical features certainly include the number by the expected number of recruits such as mountains and lakes can of people living in the area (or according to one model. make driving times quite different a related, more-specific count from Euclidean distances. called the Quality Military Avail- Optimizing the Moreover, recruits are not able population, in which the Locations of uniformly distributed over ZIP Department of Defense attempts Recruiters codes; they concentrate in popula- to account for ineligibility). tion centers and, in particular, high The set of variables in the The final piece of the modeling schools and community colleges, model probably also would include effort takes, as input, the estimated so one area of ongoing research geographic and demographic values of market depth for each is determining the recruiter effort characteristics of the area, such ZIP code. 
The goal is to place needed to recruit in a ZIP code, as distance to the nearest public recruiters to cover the maximum given the road network and what- university and number of service number of eligible young people. ever indices of population density members who live nearby, as well However, just knowing the number are available. as socio-economic factors such of available young people in an Moreover, ZIP code-level as local unemployment and area is not enough to determine effects can differ substantially crime rates. optimal recruiter placement. This between neighboring Zips, some- Many ZIP codes have some is because we expect a diminishing times because adjacent Zips are residents, but never produce any rate of return as more recruiters quite different, and sometimes leads. Moreover, the variability are added—as more recruiters are because Zips have such low of the number of leads per ZIP added to an area, the recruiters populations that the corresponding code is greater than expected for a eventually will exhaust the market
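The zero-inflation idea can be made concrete with a minimal numerical sketch. The mixing probability and Poisson mean below are made-up values, not estimates from this project; the point is simply that a zero-inflated model can produce far more zero-lead ZIP codes than a plain Poisson model with the same count-stage mean.

```python
import math

def poisson_pmf(k, lam):
    """P(K = k) for a Poisson random variable with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def zip_pmf(k, pi, lam):
    """Zero-inflated Poisson: with probability pi, a ZIP code is
    structurally unproductive (always zero); otherwise its lead
    count is Poisson(lam)."""
    p = (1 - pi) * poisson_pmf(k, lam)
    if k == 0:
        p += pi  # add the structural zeros from the first stage
    return p

# A plain Poisson with mean 2 predicts about 13.5% zero counts,
# while zero-inflating it with pi = 0.4 predicts about 48%.
print(round(poisson_pmf(0, 2.0), 3))   # 0.135
print(round(zip_pmf(0, 0.4, 2.0), 3))  # 0.481
```

In practice these models are fit by maximum likelihood; the countreg vignette listed under Further Reading covers zero-inflated and hurdle fitting in R.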

Figure 2. Estimated Recruiting Depth in the Houston, TX, area. Colors show expected numbers of recruits according to one model. Photo courtesy of Daniel Ammons-Moreno, Richie Powell, and Gary Ton, Naval Recruiting Command; not a product of the Naval Recruiting Command.

It is difficult to estimate the extent of this diminishing return, since the services have no short-term interest in saturating an area with recruiters to try to measure this effect.

Another consideration is the quality of life for a recruiter. As an individual recruiter's workload increases, by assigning more area and more driving time, the recruiter can be expected to become less happy and less productive. It is important that models enforce a maximum workload per recruiter, where workload is measured in some reasonable way, such as expected driving time per week, rather than in recruits per month. Moreover, these models move recruiters instantly from station to station, but moving a recruiter in real life carries a cost. For a move of more than a short distance, the recruiter might need to move a residence and family. All of these considerations have to be accounted for in the optimization models.

For the moment, the number of recruits in an area is modeled as a function not only of market depth, but also of the number of recruiters assigned to that Zip. That number can be fractional: A recruiter who is assigned to exactly two ZIP codes might be assigned with weights (0.7, 0.3), say, indicating that he or she is to spend 70% of recruiting time in the first of the two Zips. The objective to be optimized is the expected number of recruits produced from the areas designated to be covered (and by the effort expended).

The current modeling effort does not consider the overall dollar cost associated with recruiting. This is partly because some elements of the cost arise from expenditures outside the recruiting commands' immediate control. For example, it might be effective for a service to abandon a particular recruiting station and place the recruiters from that station elsewhere. However, these spaces are often leased under multi-year agreements, so there is a cost, in terms of continued rent payments, incurred even by abandoned stations. Moreover, many recruiting stations are in buildings shared by two or more services, so the services might be interested in cooperative efforts regarding stations. Cost considerations will have to be taken up as the project continues.

There are several possible formulations of a program to optimize recruiter placement, depending on the set of constraints imposed. As a starting point, security concerns require that any recruiting station have at least two recruiters if it has any at all. In its most basic form, the optimization will serve to assign a fixed number of recruiters to existing recruiting stations, and recruiters to Zips, in the way that produces the globally largest expected number of recruits. Under this simple model, no existing recruiting stations would be abandoned and no new ones created.
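The effect of diminishing returns on fractional allocation can be sketched with a toy computation. The concave response curve below, depth * (1 - exp(-r)), and the depth values are hypothetical (the articles give no functional form); the sketch shows that with a concave, separable objective, repeatedly giving the next small increment of recruiter effort to the Zip with the largest marginal gain approximates the optimal fractional allocation.

```python
import math

def expected_recruits(depth, r):
    # Hypothetical concave response: marginal recruits shrink as
    # recruiter effort r saturates a Zip's market depth.
    return depth * (1.0 - math.exp(-r))

def allocate(depths, total_effort, step=0.1):
    """Greedy marginal allocation of fractional recruiter effort."""
    alloc = [0.0] * len(depths)
    for _ in range(round(total_effort / step)):
        gains = [expected_recruits(d, a + step) - expected_recruits(d, a)
                 for d, a in zip(depths, alloc)]
        alloc[gains.index(max(gains))] += step
    return alloc

depths = [120.0, 80.0, 20.0]   # made-up market-depth estimates per Zip
alloc = allocate(depths, 4.0)  # four recruiters' worth of effort
print([round(a, 1) for a in alloc])  # deeper Zips receive more effort
```

Because the objective is nonlinear, the real problem is harder than a plain mixed-integer linear program, which is the complication discussed below.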

Each recruiter is assumed to conduct a normal workload. If each recruiter were assigned to a fixed set of ZIP codes with no fractions, this would be a straightforward, if sizable, mixed-integer linear program. With fractional allocation, the problem becomes more complicated, because the diminishing returns associated with increasing numbers of recruiters mean that the relationship between the number of recruiters and the number of recruits is nonlinear. Still, this augmented problem can be solved in reasonable time with modern computing resources. Variations on this optimization can help determine the cost, in terms of expected number of recruits, associated with a reduction in the number of recruiters, if, for example, the services were to decide to shrink their strength by 10% over three years.

Of course, the optimal set of locations for recruiting stations today will not necessarily be the optimal set in 10 years' time, since demographics and local conditions change. A second optimization starts from a list of current recruiting stations, as well as potential locations for new ones. These potential locations might be drawn from a list of known, suitable sites that are available to lease. The optimization then assigns recruiters to stations, real or possible, selecting some and omitting others, once again assigning recruiters to Zips and maintaining at least two recruiters in every station. The results of this optimization can help recruiting planners project the best locations for stations in the near future.

A final optimization might place recruiting stations anywhere on the map to produce the largest possible expected number of recruits. If the optimal locations of stations under this model are similar to the existing placements, it might create confidence that the current assignment is adequate. If, however, the unconstrained model suggests vastly different placement from what is actually present, there might be an opportunity to gain insight from the differences.

The Future of Recruiting

This description of the approach, and of the work in this area, has assumed that the business of recruiting will go on much as it has in the past. In the services, many recruiters are drawn from regular service members whose past specialties might have been something quite different. After a stint as a recruiter, most of these personnel will return to their original jobs. Recruiting still depends, in large part, on in-person contacts between recruiters and potential recruits, and these models continue to expect that recruiters will be assigned to physical stations and to nearby ZIP codes.

The world is changing, though, and recruiting has an opportunity to change with it. Just as one example, the services could consider creating a single recruiting force for all branches. Such an effort might be staffed by active-duty members rotating through assignments, as service recruiting works now, but alternatively might be staffed by full-time recruiting professionals, career civilians, or a combination of these.

Since mobile communication is practically universal, the services also can expect more recruits to conduct their business—filling out forms and so on—using computer or mobile devices and on their own time. This might change the way the recruiters of the future allocate their own time.

The U.S. military, whether it shrinks or grows in the short term, will continue to require on the order of 100,000 or more eligible volunteers every year. The challenge of recruiting that many young people will continue to be considerable. This work to increase the efficiency of the process is ongoing and holds out the promise of real savings across the Department of Defense. Estimating the market depth is an important part of this effort.

Further Reading

Buttrey, S., and Whitaker, L. Estimating the Depth of the Navy Recruiting Market. DTIC number to be assigned.

Zeileis, A., Kleiber, C., and Jackman, S. Regression Models for Count Data in R. https://cran.r-project.org/web/packages/pscl/vignettes/countreg.pdf.

About the Authors

Samuel E. Buttrey is an associate professor in the Department of Operations Research at the Naval Postgraduate School. His current interests include computationally intensive methods for classification and clustering, particularly with categorical and text data. His book, A Data Scientist's Guide to Acquiring, Cleaning, and Managing Data in R, written with Lyn Whitaker, was recently published.

Lyn R. Whitaker is an associate professor in the Department of Operations Research at the Naval Postgraduate School. She is interested in scalable measures of inter-point dissimilarity in mixed categorical and numeric data, as well as machine learning.

Jonathan K. Alt served as an assistant professor in the Department of Operations Research at the Naval Postgraduate School at the time of this work and currently is a principal analyst with the U.S. Army Training and Doctrine Command Analysis Center, where he leads and contributes to applied research and analysis.

Data Farming: Reaping Insights from Simulation Experiments

Susan M. Sanchez

During the past two decades, there has been a quiet revolution in the way that insights are obtained from simulation models. Data farming is a suite of tools and techniques for designing and analyzing large-scale simulation experiments, and it has been a significant contributor to that revolution.

While data farming embraces the use of data mining, the metaphor of "farming" emphasizes that we also manipulate our simulation models prospectively to maximize insights, similarly to the way real-world farmers manipulate their environment to improve yields. We "grow" our data in a smart way, using designed experiments.

Design of experiments (DoE) is a well-established field in statistics, but its applications to simulation have provided both challenges and opportunities that are not present when conducting physical experiments. The added control we have in the virtual world means that it is possible to vary a much larger number of factors—and gain a much better understanding of how these affect the outcomes—during a simulation experiment than during a live experiment. This has driven a need for designs that are far larger than any used for physical experimentation.

Most of the examples that follow are oriented toward the U.S. Department of Defense (DoD), which has been both a major sponsor and a beneficiary of much of the development of data farming. However, all of these techniques have the potential to yield equivalent benefits in academic or industrial areas.

Designed Experiments: Planting Seeds for Successful Data Farming

The bread-and-butter of data farming is the use of large-scale designed experiments, but before discussing the large-scale issues, it might help to review a few key DoE principles for a small-scale experiment.

A good example is the game of "Capture the Flag." This game has two teams, each of which tries to sneak up and capture the flag of the opposing team. Consider a stochastic simulation of the game, and suppose two characteristics of a favorite team can be controlled: their speed and their stealth. Figure 1 shows the results of running many simulated games, where green circles, yellow triangles, and red squares represent good, intermediate, and bad outcomes (probabilities of winning), respectively.

The graphs in Figure 1(a) and (b) show the results of poor sampling plans. In (a), because speed and stealth have been changed simultaneously in a perfectly correlated manner, the factors are confounded. This means that although we can tell that the combination of high speed and high stealth results in a good outcome, there is no way of determining from the data whether both speed and stealth have to be high. In (b), speed and stealth change in a one-factor-at-a-time manner from the baseline, but that presents another problem: The results do not show any improvement that can be attributed to either factor, so it may appear that there is no way to win the game.

In contrast, (c) shows that while neither high speed nor high stealth can produce a win by itself, they result in a win when combined. Moreover, if we use quantitative probabilities of winning instead of simple color codes, then regression or ANOVA will allow us to estimate the effects of speed, stealth, and their interaction.

While (c) is definitely an improvement over (a) and (b), it does not provide information about what happens in the interior of the factor space. Both (d) and (e) do so, but with far different sampling requirements. This is where "smart" DoE comes into play. Both (c) and (d) are examples of factorial (gridded) designs; the number of design points (experimental combinations of factors) is equal to m^k, where m is the number of factor levels and k is the number of factors. The graph in (e) comes from a different type of design called an orthogonal Latin hypercube. It has only 17 design points, rather than the 17^2 = 289 of the design in (d), but still provides much of the same qualitative insight.
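A Latin hypercube is easy to generate. The sketch below builds a simple randomly permuted Latin hypercube, not the orthogonal variant used in the article: each factor's range is cut into n bins and each bin is used exactly once, so 17 points spread evenly across every factor's range, versus 17^2 = 289 points for the full grid.

```python
import random

def latin_hypercube(n_points, n_factors, seed=1):
    """Random Latin hypercube on [0, 1]^n_factors: every factor uses
    each of its n_points bins exactly once."""
    rng = random.Random(seed)
    columns = []
    for _ in range(n_factors):
        levels = list(range(n_points))
        rng.shuffle(levels)  # pair bins across factors at random
        columns.append([(lv + 0.5) / n_points for lv in levels])
    return list(zip(*columns))  # one (x1, ..., xk) tuple per design point

design = latin_hypercube(17, 2)
print(len(design))  # 17 design points instead of 17**2 = 289
```

Orthogonal Latin hypercubes additionally choose the permutations so the columns are uncorrelated, which improves the resulting regression estimates.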

Figure 1. Two poor designs and three informative designs for Capture the Flag, where good, intermediate, and bad outcomes are represented by green circles, yellow triangles, and red squares, respectively. Designs (c)–(e) indicate that some combination of speed and stealth is needed to win, but only the space-filling designs in (d) and (e) reveal that there is more than one winning option. Design (e) does so with far less sampling.

Although neither speed nor stealth is particularly effective on its own, there is a wide band of intermediate outcomes, and there are multiple ways to win the game. A statistical model of the response fit to the data in (e), using, for instance, regression or logistic regression, might be just as informative as one obtained from (d). In simulation, such a statistical "model of a model" is referred to as a metamodel.

One type of graph omitted from Figure 1 is that associated with running only a single design point, such as the baseline case (with low speed and low stealth). All we could do in this instance is use descriptive statistics to characterize the distribution of potential outcomes. A study of this type cannot yield any information about whether speed, stealth, or their interaction might affect who wins the game.

What happens when we study systems on a larger scale? If we shift from exploring k = 2 factors to large k, the efficiencies from moving away from factorial designs are even more pronounced. Factorial designs suffer from the curse of dimensionality. If we have 100 factors, and experiment with each at only two levels, that requires 2^100 design points. How big is that? If our simulation ran in a nanosecond, and we had started our experiment at the dawn of time, we would still be only a tiny fraction of the way through a single replication. However, using a space-filling design such as a Latin hypercube makes it possible to explore a hundred factors with a few hundred design points. Consequently, well-designed experiments take large-scale experimentation from the realm of the impossible to the realm of the practical.

Applications

Data farming can be applied to any type of computer model, provided it is either programmed with data farming in mind or incorporated into some kind of data farming wrapper. There are literally hundreds of interesting applications of data farming. A brief overview of a few types of data farming studies follows; there are many more examples at http://harvest.nps.edu.

Planning for the Future

A Danish proverb states that "It is difficult to make predictions, especially about the future." However, the combination of simulation and design of experiments provides the exciting potential to explore a broad variety of potential futures, and to make better choices regarding which one(s) to pursue. For instance, the DoD is continually assessing how to modernize its forces, respond to new types of threats, incorporate new commercial technologies into its operations, or anticipate what new systems its forces will need in the years to come.

Such systems will undergo developmental and operational testing. Elsewhere in this issue of CHANCE, Freeman and Warner describe how statistics and designed experiments help this process. If prototypes are available, simulation experiments can complement physical experiments. For example, a torpedo's path could be simulated under a variety of starting conditions (launch angle and velocity) and environmental conditions (current and water density). In turn, the results of live tests might provide useful information for assessing and improving the simulation model.

However, well before any prototypes are developed, it can be useful to consider how they might be used in actual operations. Simulation makes it possible to vary many more factors than can be explored in physical exercises, and does so at a tiny fraction of the cost.
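Returning to the curse-of-dimensionality arithmetic above, the claim is easy to check; the constants below (seconds per year, age of the universe) are rough figures:

```python
SECONDS_PER_YEAR = 3.156e7       # about 365.25 days
AGE_OF_UNIVERSE_YEARS = 1.38e10  # rough cosmological estimate

runs = 2 ** 100                  # two-level factorial in 100 factors
years_needed = runs * 1e-9 / SECONDS_PER_YEAR  # one nanosecond per run
fraction_done = AGE_OF_UNIVERSE_YEARS / years_needed
print(f"{years_needed:.1e} years needed; "
      f"fraction finished so far: {fraction_done:.1e}")
```

Even at a nanosecond per run, the full factorial would need on the order of 4e13 years; an experiment started at the dawn of time would be roughly 0.03% of the way through a single replication.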

One example, described in Sanchez, et al. (2012), involves the U.S. Army's planned use of unmanned aerial vehicles (UAVs, or drones) for surveillance purposes. The actual study was conducted back in 2006, when these technologies were just becoming available. The Army was contemplating the procurement of five classes of these drones, and asked the Training and Doctrine Command Analysis Center (TRAC) to develop recommendations.

Our part of the study involved a computer tool that employed discrete-event simulation, coupled with a rolling-horizon optimization approach, to determine a schedule for drone missions. Since the drones were not yet available, we set up a computer model to see how well they could be launched and, if need be, dynamically rescheduled to surveil different parts of the battlefield over a 15-day operation. More than 21,000 different potential targets popped up over the timeframe of the simulation.

The experiment involved 26 factors, including time horizon, airspeed, operating time, operating radius, and transition time for each of five classes of drones, as well as the optimization interval for rescheduling purposes. Our space-filling design had 272 design points. Although it took more than 700 hours of processing time to make the runs, we were able to "grow" the data in just a few days by using cluster computing.

The results of the UAV study as a whole were dramatic. The Army had been contemplating the procurement of five classes of drones, but ended up reducing that to three classes. The director of TRAC credited the UAV modeling effort with harvesting near-term savings of billions of dollars and thousands of billets, and a long-term cost avoidance of $20 billion. The study also provided useful information about the trade-offs between the optimization interval and the quality of the solution.

The use of drones is still of interest in today's DoD. Unmanned vehicles of many types—aerial, ground, surface, and underwater—have the potential to give the DoD new capabilities in a variety of ways. They can improve border security and critical infrastructure protection, protect and enhance the capabilities of those in uniform, help deliver water and food to civilians displaced by a natural disaster, and more. They are also changing the way our adversaries operate, so plans for the future have to take these rapidly changing technologies into consideration.

Using large-scale simulation experiments to assist in planning is not limited to exploring new technologies. The U.S. Navy Recruiting Command has, for years, used a planning tool called PRO Model (Planned Resource Optimization Model) to assist in determining how to invest a variety of resources, such as recruiting and retention incentives, to ensure the right mix of officers projected for the next two decades. Recruiting is important because without the right influx of junior officers today, the Navy will have difficulty achieving the right mix of senior officers in the future. In 2017, Hogarth embedded this tool in a data farming wrapper, PROM-WED (short for PRO Model With Experimental Design), to make it a more useful tool for Navy analysts.

A large-scale designed experiment allows analysts to explore a wide variety of future possibilities in a timely manner, rather than rely on a trial-and-error approach or one-at-a-time sampling to seek reasonable solutions. Decision factors in this model represent the numbers of Navy recruiters, as well as budgets for advertising, enlistment bonuses, and education incentives. Experiments also can incorporate market factors that affect recruiting, such as the sizes of various pools of applicants, unemployment rates, differences between military and private-sector pay, and cultural propensities for serving in the military.

Finding Robust Solutions in Uncertain Worlds

Military forces around the world are heavily involved in humanitarian assistance and disaster relief efforts. This is not a new situation, but has been true for some time. The U.S. Navy and Marine Corps are often among the first to provide assistance to foreign countries after a natural disaster such as a hurricane, earthquake, or typhoon. As noted by Gardner, their efforts may include reestablishing communication networks, establishing security, clearing debris to facilitate transportation, and providing logistics support and medical assistance before the arrival of other humanitarian assistance organizations.

These types of scenarios evolve quite rapidly and involve huge amounts of uncertainty for decision-makers, up to and including defining the case-specific goals of the mission. This makes them ideal for "robust analysis," which seeks to find solutions that work well across the broad variety of circumstances that might arise in practice.

An excellent example of robust disaster relief planning is a 2015 study by Li, who examined alternatives for Taiwan military disaster planners.
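The flavor of robust analysis can be shown with a tiny, entirely hypothetical payoff table (the plans, scenarios, and scores below are invented for illustration, not taken from Li's study): the plan that is best in expectation need not be the plan with the best worst case.

```python
# Rows: candidate mobilization plans; columns: four storm scenarios.
# Higher scores are better (all numbers are made up).
payoffs = {
    "small mobilization":  [9, 7, 2, 1],
    "medium mobilization": [7, 7, 6, 5],
    "large mobilization":  [6, 6, 6, 6],
}

best_on_average = max(payoffs, key=lambda p: sum(payoffs[p]) / len(payoffs[p]))
best_worst_case = max(payoffs, key=lambda p: min(payoffs[p]))
print(best_on_average)   # medium mobilization (highest mean score)
print(best_worst_case)   # large mobilization (never does badly)
```

A large designed experiment fills in such a table across hundreds of decision settings and scenarios, which is what makes trade-offs of this kind visible.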

Figure 2. Three views of humanitarian assistance operations. Graph (a) shows the relationship between expected casualties and budget, for three curves that correspond to different survival rates among those in immediate need of medical assistance. Graph (b) shows that the expected number of relief workers stabilizes early in the budget cycle, and that the probability of Typhoon Scenario 1 affects this response. The boxplots in (c) show there is little chance of using most of the mobilized workers, although Scenario 4 is less conservative than the others. Graphs (a) and (c) are adapted from Li (2015).

These planners must decide how many Taiwanese troops to mobilize in advance of a typhoon, and how to use information about the typhoon's intensity and direction of approach to improve the disaster relief efforts. Between 1958 and 2015, Taiwan experienced an average of 6.9 typhoons per year, including two classified as strong intensity. This number is expected to increase as climate change accelerates.

Li varied 33 factors in her model. These related to the total budget for humanitarian assistance activities, penalties for not transporting food and water to survivors, the survival rate (if rescued) for those who are injured and in need of medical assistance immediately after a typhoon passes, probabilities of occurrence of several types of typhoons, and numbers of available relief workers in different geographic regions.

A single run of her model simulated five types of storms, with different damage profiles, and used an embedded optimization approach to determine the combination of decisions that best addressed the disaster in expectation. Of course, the expected numbers of casualties due to unmet needs for medical assistance, food, and water were of primary interest, with low values being desirable. Other outputs indicated how many troops were actually mobilized for each type of storm and the total amount spent on humanitarian assistance. More-detailed information included the funding breakdown into different activities, such as expanding ramp space at the areas soon to be affected, as well as setting up temporary health-care facilities, warehouses, and mass housing shelters at locations to handle people displaced by the storm. Costs associated with transporting food, water, and displaced persons were also provided.

One of the joys of data farming is the data analysis. Stepwise regression and partition trees are quite useful in identifying the most-important factors from the large number varied in an experiment. Just by itself, this can be quite informative. For example, Li found that of the 33 factors in her experiment, only six appeared in a regression metamodel for any of her responses of interest. This, in turn, guided her graphical exploration of the data.

Three examples of interesting graphs appear in Figure 2. In (a), the expected casualties decrease sharply with budget and then level off. Different curves correspond to different survival rates among those in immediate need of medical assistance. Further inspection of the detailed simulation output indicates that as the maximum budget increases from $5 to $23 million, additional funds are spent to establish shelters capable of housing displaced people in preparation for typhoons, although this has no effect on reducing the expected number of casualties.

Slight improvements are achieved between budgets of $23 and $30 million to expand pre-positioned emergency health service capacity and commodity supplies.

The embedded optimization model uses the maximum budget as a constraint, and consequently has no incentive to keep expenditures lower than allowed. Future implementations could modify the optimization objective to encourage efficiency, but decision-makers shown the plot in Figure 2(a) could already determine that there is no benefit to spending $23 million instead of $5 million.

A different slice of the data appears in Figure 2(b). For comparison purposes, Taiwan used, on average, roughly 3,800 relief workers per typhoon between 2010 and 2014. This is slightly more than the average numbers from the simulation model, which may indicate that the simulation model is less conservative, or that its embedded optimization uses relief workers more effectively, or a combination of the two. Insights evident from this graph are that the expected number of relief workers used seems to stabilize around a maximum budget of $4 million, and that there is an interaction with the probability of occurrence of the most-severe typhoon (Scenario 1), which affects the entire mainland and several outlying islands.

In (c), a closer look at differences by typhoon scenario shows there is little chance of using most of the mobilized workers. This is particularly striking since Li's model already restricts the maximum allowable number of mobilized workers to far lower values (totals ranging from 24,719 to 29,349) than the historical average of nearly 41,000 relief workers mobilized with only 9% actually used. The total number of available workers varies by scenario, ranging from around 500 for Scenario 5 to more than 8,000 for Scenarios 1 and 2.

The amount of data is substantial: 512 design points, each simulating five typhoon scenarios. The insights that can be obtained from this mass of data are much richer than those from a small experiment.

Rapidly Developing New Concepts and Systems

New technologies may not be effective unless they are accompanied by new doctrine and tactics. Using simulation models as a springboard for this type of development can be orders of magnitude faster—and safer for the troops—than asking those in the field to develop suitable tactics. For example, Sickinger (2005) explores a spectrum of approaches for identifying hostile intent from small boat threats to a high-value vessel at sea.

This can be a difficult problem for naval vessels near ports and other high-traffic areas. Naval vessels are authorized to use lethal force on boats that intrude into their exclusion zones, but there is a trade-off between protecting the vessel and using lethal force on a fishing or tourist boat that might be unaware of the restrictions. Consequently, non-lethal approaches—such as long-range acoustic devices to provide audio warnings, optical dazzlers, and warning munitions—may both deter threats and reduce the risk of using lethal force unnecessarily.

Sickinger explores an agent-based model that determines intent by monitoring repeated incursions into the warning zone. Her experiment varies 29 factors, including capabilities of the non-lethal devices; rules of engagement for their employment; and characteristics and behaviors of threats, fishing boats, and recreational boats.

The benefits of stepwise regression and partition trees for identifying important factors have been mentioned, but partition trees may have an added benefit when communicating with a non-technical audience. Figure 3 shows an example of a partition tree for an important model output: the proportion of instances in which hostile ships are identified appropriately before reaching the exclusion zone and hence triggering the use of lethal weapons.

The concept of a partition tree is simple: Start with all the data in a single group, and then search over all factors and all settings to find the single split that provides the greatest explanatory power (i.e., the greatest improvement in R^2). A good split will take a large group of dissimilar data and break it into two groups that are internally less variable, but differ in their means.

The tree in Figure 3 does a good job of explaining the variation in the response (R^2 = 0.816) with only five splits: two involving a hostile capability, two involving rules of engagement, and one involving a non-lethal capability. The red leaves on the left show that there is little chance of identifying threats as hostile if they travel fast and if warning munitions cannot be used unless a high threshold of suspicion is met. The green leaves on the right show there are multiple ways of being successful against slower threats: either by relaxing the restrictions against the use of lethal weapons or by using short audio warnings so multiple warnings can be made in a shorter time span. The two yellow leaves have similar means; the rightmost leaf has more-predictable results, but the leftmost shows that the negative impact of high-speed threats can be partially mitigated by allowing earlier use of warning munitions.

When this information is coupled with similar analyses for other responses of interest, analysts and decision-makers can obtain a greater understanding of the approaches that are likely to work well.
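The single-split search at the heart of a partition tree fits in a few lines. The data below are invented to mimic the speed effect in Figure 3; the chosen split is the one that maximizes the improvement in R^2, or equivalently minimizes the within-group sum of squares.

```python
def sse(vals):
    """Sum of squared deviations from the group mean."""
    if not vals:
        return 0.0
    m = sum(vals) / len(vals)
    return sum((v - m) ** 2 for v in vals)

def best_split(xs, ys):
    """Search all thresholds of one factor for the split that gives
    the greatest improvement in R^2."""
    total = sse(ys)
    best = None
    for t in sorted(set(xs))[1:]:  # thresholds between observed values
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        r2 = 1.0 - (sse(left) + sse(right)) / total
        if best is None or r2 > best[1]:
            best = (t, r2)
    return best

# Hypothetical data: identification drops once threat speed is high.
speeds = [10, 15, 20, 25, 30, 35, 40, 45]
ident  = [0.90, 0.85, 0.90, 0.80, 0.30, 0.25, 0.20, 0.30]
threshold, r2 = best_split(speeds, ident)
print(threshold, round(r2, 2))  # splits at speed 30 with R^2 = 0.98
```

A full tree applies the same search recursively to each resulting group, over every factor, which is why only a handful of splits can explain most of the variation.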

CHANCE 49

Figure 3: Partition tree for the proportion of hostile ships that are identified appropriately before reaching the exclusion zone. Red, yellow, and green color coding classifies leaves as bad, intermediate, and good outcomes, respectively. Adapted from Sickinger (2005).

analysts and decision-makers can obtain a greater understanding of the approaches that are likely to work well.

Model Development, Verification, and Assessment

Data farming can also assist in the model development process in several ways, so it is worth thinking about using it throughout the modeling process. When coding a model of one's own, design it to be data-farmable from the outset. This can be accomplished by a few simple expedients:

• Make sure the model has a "batch mode"—one that can be run without human intervention and without a graphical user interface (GUI).

• Read model inputs from files—flat files, XML, or databases—or command-line arguments, rather than hard-coding them into the program or requiring inputs from a GUI.

Once the basic data farming structure is in place, designed experiments can be used to test features as an integral part of the model development process. This often speeds up the development process and increases model reliability, since bugs can usually be identified more readily and fixed by testing the model as you go along. Without this "test-as-you-go" approach or a massive sensitivity analysis experiment when a model appears to be complete, the unfortunate situation can arise where changes to the model's inputs break it.

Running a series of experiments in moving from a conceptual model to a computational model also helps establish credibility by encouraging engagement with subject matter experts and decision-makers along the way. The result will be a better model, and all stakeholders will end up with a better understanding of the complex system under investigation. As the earlier examples demonstrate, this ultimately leads to better decisions.
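The expedients above can be sketched in a few lines. The toy model below (a hypothetical illustration, not a model from the article; all parameter names and the response are invented) runs in batch mode with no GUI, reads one design point per row from a flat file, and reports one output row per design point:

```python
# Sketch of a "data-farmable" model structure (all names hypothetical):
# batch mode, no GUI, inputs read from a design file rather than hard-coded.
import csv
import io
import random

def simulate(arrival_rate: float, service_rate: float,
             replications: int, seed: int = 1) -> float:
    """Toy stochastic model: mean count of 'busy' periods across replications."""
    rng = random.Random(seed)
    results = []
    for _ in range(replications):
        p = arrival_rate / (arrival_rate + service_rate)
        busy = sum(1 for _ in range(100) if rng.random() < p)
        results.append(busy)
    return sum(results) / len(results)

def run_design(design_csv: str) -> list:
    """Batch mode: read one design point per row and run the model for each."""
    outputs = []
    for row in csv.DictReader(io.StringIO(design_csv)):
        y = simulate(float(row["arrival_rate"]), float(row["service_rate"]),
                     int(row["replications"]))
        outputs.append({**row, "mean_busy": y})
    return outputs

# A tiny three-point design; in practice this file would come from a
# design-of-experiments generator.
design = """arrival_rate,service_rate,replications
1.0,2.0,30
1.5,2.0,30
1.0,3.0,30
"""
for out in run_design(design):
    print(out["arrival_rate"], out["service_rate"], round(out["mean_busy"], 1))
```

Because nothing here requires a GUI or interactive input, the same function can be invoked thousands of times by a data-farming wrapper.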

Figure 4. Screenshot of a simulation of a virus on a network in NetLogo, as an example of data farming with a simple agent-based model. Source: Stonedahl and Wilensky, 2008.
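Models like the one in Figure 4 are typically data-farmed with a space-filling design rather than a full grid. As a rough illustration of the idea — a basic Latin hypercube, a much simpler relative of the nearly orthogonal Latin hypercubes discussed in this article, with invented factor names and ranges:

```python
# A basic Latin hypercube sample in pure Python (illustrative only; factor
# names and ranges are invented, and this is not a nearly orthogonal design).
import random

def latin_hypercube(n_points: int, factor_ranges: dict, seed: int = 0) -> list:
    """Each factor's range is cut into n_points equal bins; each bin is
    sampled exactly once, with the bin order shuffled per factor."""
    rng = random.Random(seed)
    columns = {}
    for name, (lo, hi) in factor_ranges.items():
        bins = list(range(n_points))
        rng.shuffle(bins)
        width = (hi - lo) / n_points
        columns[name] = [lo + (b + rng.random()) * width for b in bins]
    # Transpose the per-factor columns into a list of design points.
    return [{name: columns[name][i] for name in factor_ranges}
            for i in range(n_points)]

ranges = {"speed": (5.0, 40.0), "range": (0.5, 3.0), "delay": (0.0, 9.0)}
design = latin_hypercube(10, ranges)
```

Ten design points cover three factors here; a 10-level full factorial (gridded) design would need 1,000 runs.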

Data Farming and You

One of the beauties of data farming is that it can be easily applied in a variety of contexts. Once you have built a simulation model, "it's time to have the model work for you." Large-scale simulation experiments can help. The heavy lifting of constructing good designs has kept statisticians busy for years—and will continue to do so for the foreseeable future—but once designs become available, anyone can use them.

One easy way to try this out is to look at the agent-based models from the NetLogo community, as Wilensky has shown. The "behavior space" within the NetLogo platform will run a factorial (gridded) design for specified levels of each of the factors to vary. As described earlier, this built-in apparatus constrains the analyst to experimenting with a small number of factors and a small number of levels to avoid the curse of dimensionality. However, with a data farming wrapper, it is possible to run any design, such as a nearly orthogonal Latin hypercube.

Figure 4 shows a screenshot of a simulation of a virus spreading across a computer network. Brief examples of how to use NetLogo's behavior space, or a custom data-farming wrapper written by Upton, to conduct an experiment, are available at http://harvest.nps.edu.

Fundamentally, simulation is a powerful tool for developing insight and informing decisions. Building a stochastic simulation model and running it multiple times for a single configuration of inputs means limiting it to using descriptive statistics to answer "What is?" types of questions. If you run it for a few alternative configurations, you can start answering "What if?" questions—such as "What if my team were faster in playing Capture the Flag?" or "What if drones were able to

carry 25% more food to survivors of a typhoon?"

Data farming makes it possible to go beyond What is? and What if? to answer much richer questions, such as "What matters?," "How?," and "Why?" Doing so can result in valuable insights from the simulation experiment.

Acknowledgments

The capture-the-flag example is adapted from Sanchez and Wan (2015). These and several other references are available online at http://harvest.nps.edu.

This work was supported in part by the Office of the Secretary of Defense, Director of Operational Test and Evaluation (OSD DOT&E); Test Resource Management Center (TRMC); Science of Test research consortium; U.S. Army TRAC-Monterey; Joint Non-Lethal Weapons Directorate; Naval Postgraduate School's Consortium for Robotics and Unmanned Systems Education & Research (CRUSER); and Naval Research Program (NRP).

Disclaimer: The views expressed in this article are those of the authors and do not reflect the official policy or position of the United States Navy, Department of Defense, or U.S. Government.

Further Reading

Freeman, L.J., and Warner, C. 2018. Informing the warfighter—why statistical methods matter in defense testing. CHANCE 31(2):4–11.

Gardner, M.J. 2015. Investigating the naval logistics role in humanitarian assistance activities. MS in Operations Research. Monterey, CA: Naval Postgraduate School.

Hogarth, A.R. 2017. Improving Navy recruiting with the new Planned Resources Optimization Model with Experimental Design (PROM-WED). MS in Operations Research. Monterey, CA: Naval Postgraduate School.

Li, H. 2015. Improving the Taiwan military's disaster relief response to typhoons. MS in Operations Research. Monterey, CA: Naval Postgraduate School.

Lucas, T.W., Kelton, W.D., Sanchez, P.J., Sanchez, S.M., and Anderson, B.L. 2015. Changing the paradigm: Simulation, now a method of first resort. Naval Research Logistics 62(4):293–305.

Sanchez, S.M. 2015. Simulation experiments: better data, not just big data. In Yilmaz, L., Chan, W.K.V., Moon, I., Roeder, T.M.K., Macal, C., and Rossetti, M.D., eds., Proceedings of the 2015 Winter Simulation Conference, Piscataway, NJ: Institute of Electrical and Electronic Engineers, Inc., 800–811.

Sanchez, S.M., and Wan, H. 2015. Work smarter, not harder: A tutorial on designing and conducting simulation experiments. In Yilmaz, L., Steiger, M.E., Chan, W.K.V., Moon, I., Roeder, T.M.K., Macal, C., and Rossetti, M.D., eds., Proceedings of the 2015 Winter Simulation Conference, Piscataway, NJ: Institute of Electrical and Electronic Engineers, Inc., 1795–1809.

Sanchez, S.M., Lucas, T.W., Sanchez, P.J., Nannini, C.J., and Wan, H. 2012. Designs for large-scale simulation experiments, with applications to defense and homeland security. In Hinkelmann, K., ed., Design and Analysis of Experiments: Special Designs and Applications, 1st ed., v. 3. New York, NY: John Wiley & Sons, 413–441.

Sickinger, L. 2006. Effects of non-lethal capabilities in a maritime environment. MS in Operations Research. Monterey, CA: Naval Postgraduate School.

Stonedahl, F., and Wilensky, U. 2008. NetLogo Virus on a Network model. http://ccl.northwestern.edu/netlogo/models/VirusonaNetwork. Evanston, IL: Center for Connected Learning and Computer-Based Modeling, Northwestern University.

Upton, S.C. 2015. SimpleNetLogoFarmer. Software available on request.

Wilensky, U. 1999. NetLogo. http://ccl.northwestern.edu/netlogo/. Evanston, IL: Center for Connected Learning and Computer-Based Modeling, Northwestern University.

About the Author

Susan M. Sanchez is a professor in the Operations Research Department at the Naval Postgraduate School (NPS), with a joint appointment in the Graduate School of Business & Public Policy. She has a BSE in industrial & operations engineering from the University of Michigan, and an MS and PhD in operations research from Cornell University. Her research interests include the design and analysis of large-scale simulation experiments, robust design, and applied statistics, with application to military operations, manufacturing, and health care. She cofounded and serves as co-director of the NPS SEED Center for Data Farming (Simulation Experiments & Efficient Designs). Her honors include a Koopman Prize from the INFORMS Military Applications Society, a NATO Scientific Achievement Award, and recognition as a Titan of Simulation and an INFORMS Fellow.

Updated Guidelines, Updated Curriculum: The GAISE College Report and Introductory Statistics for the Modern Student

Beverly L. Wood, Megan Mocko, Michelle Everson, Nicholas J. Horton, and Paul Velleman

Since the 2005 American Statistical Association's (ASA) endorsement of the Guidelines for Assessment and Instruction in Statistics Education (GAISE) College Report, changes in the statistics field and statistics education have had a major impact on teaching and learning statistics. We now live in a world where "statistics—the science of learning from data—is the fastest-growing science, technology, engineering, and math (STEM) undergraduate degree in the United States," according to the ASA, and where many jobs demand an understanding of how to explore and make sense of data.

These are some other changes affecting the field, teaching, and practice of statistics.

• More students study statistics. Exposure to statistical thinking often begins in grades 6 through 8 as numerous states adopt the Common Core State Standards or similar sets of educational standards for mathematics. Other states might expose students to statistical ideas even earlier as part of the elementary school mathematics curriculum. Although the GAISE PreK-12 Report, developed by Christine Franklin and colleagues, presents a framework for teaching statistics at the pre-college level, the GAISE College Report recommendations can still be introduced at this level and then enhanced at the post-secondary level. This might be especially useful in light of the many students who will become elementary and secondary mathematics and statistics teachers.

• The growth in available data has made the field of statistics more salient and has provided rich opportunities to address important statistical questions, even in an introductory course.

• The discipline of "data science" has emerged as a field that encompasses elements of statistics, computer science, and domain-specific knowledge.

• More powerful and affordable technology options have become widely available.

• Alternative learning environments have become more common.

• Innovative ways to teach the logic of statistical inference have received increasing attention.

In addition, a number of reports endorsed by the ASA—available at www.amstat.org/asa/education/undergraduate-educators.aspx—have addressed aspects of statistics education, including:

• Curriculum guidelines for undergraduate programs in statistical sciences

• Statistical education of teachers

• Qualifications for teaching introductory statistics

The ASA's statement on p-values in the American Statistician also addresses teaching a core concept in an introductory course.

In light of these new reports and other changes and demands on the discipline, a group of volunteers led by Michelle Everson (Ohio State University) and Megan Mocko (University of Florida) has revised the 2005 GAISE College Report. The board of directors of the American Statistical Association endorsed the updated report in July 2016.

The executive summary of the recently endorsed Guidelines for Assessment and Instruction in Statistics Education College Report 2016 (reprinted in full in Table 1) explains why the original report needed revision and summarizes key changes in the new report.

Table 1—Executive Summary—Guidelines for Assessment and Instruction in Statistics Education College Report 2016 (endorsed by the ASA Board of Directors, July 2016)

In 2005, the American Statistical Association (ASA) endorsed the Guidelines for Assessment and Instruction in Statistics Education (GAISE) College Report. This report has had a profound impact on the teaching of introductory statistics in two- and four-year institutions, and the six recommendations put forward in the report have stood the test of time. Much has happened within the statistics education community and beyond in the intervening 10 years, making it critical to re-evaluate and update this important report.

The revised GAISE College Report takes into account the many changes in the world of statistics education and statistical practice since 2005 and suggests a direction for the future of introductory statistics courses. Our work has been informed by outreach to the statistics education community and by reference to the statistics education literature.

We continue to endorse the six recommendations outlined in the original GAISE College Report. We have simplified the language within some of these recommendations and re-ordered other recommendations to focus first on what to teach in introductory courses and then on how to teach those courses. We have also added two new emphases to the first recommendation. The revised recommendations are:

1. Teach statistical thinking.
   • Teach statistics as an investigative process of problem-solving and decision-making.
   • Give students experience with multivariable thinking.
2. Focus on conceptual understanding.
3. Integrate real data with a context and purpose.
4. Foster active learning.
5. Use technology to explore concepts and analyze data.
6. Use assessments to improve and evaluate student learning.

The revised GAISE report includes an updated list of learning objectives for students in introductory courses, along with suggested topics that might be omitted from or de-emphasized in an introductory course. In response to feedback from statistics educators, we have substantially expanded and updated some appendices. New appendices provide a history of the evolution of introductory statistics courses, examples involving multivariable thinking, and suggestions for ways to implement the recommendations in a variety of different learning environments.

Sharing insights on the committee's thoughts and assumptions will help shed additional light on the revision process and subsequent changes in the report. The key recommendations can lead to a discussion of future work and next steps.

Insights

Throughout the revision process, the committee remained aware of the tension between realistic expectations and aspirational goals, and the need to be more descriptive rather than prescriptive. There is dramatic variability in how introductory statistics is taught and in the curriculum that is delivered, and that variability is increasing. Instructors face multiple constraints when attempting to change their courses and curricula. While some instructors and programs may enjoy the freedom to completely restructure their courses from scratch, others may find that regular incremental review and change, combined with periodic assessment, is more feasible. The revised report supports either approach through a set of augmented appendices that include numerous examples of activities, assessments, and technology tools, all of which can be used in a variety of learning environments and settings. The GAISE guidelines can serve as an important guide in making sure that course changes remain on track with future trends in the discipline of statistics.

The revised report acknowledges that there is no single introductory statistics course, but rather a variety of courses reflecting the needs of particular institutions and particular students. Some of this variability is introduced by whether a course is discipline-specific or part of general education; whether it is a seminar, large lecture, or distance class; or whether algebra or calculus is a prerequisite. The GAISE College Report 2016 can usefully inform the pedagogy of any instructor. (In fact, the committee asserts that these fundamental ideas are also relevant for courses beyond the introductory one.)

Like the original report, the revision includes sections on Goals for Students and Recommendations for Teaching. The goals have been reorganized, but retain non-prescriptive advice while providing structure relevant for a wide variety of courses. The committee expanded the list to include remarks about data ethics and statistical software, acknowledging these areas in courses with a focus on contemporary statistical practice.

As a way of acknowledging the realities of limited time in the curriculum, a section in the report also addresses how instructors could make room for new content by looking at suggestions for topics that might be omitted. In an effort to help instructors meet the recommendations regardless of their learning environments, Appendix F, Learning Environments, offers methods for meeting the recommendations in diverse settings.

Key New Recommendations

One could argue that little has changed in terms of the recommendations in the revised GAISE College Report compared to the 2005 report developed by Joan Garfield and colleagues. It would also be reasonable to posit that everything has changed, given the dramatic shift in the landscape for statistics and data science. The truth is somewhere in between.

The six recommendations of the original report have held up well over time, so these have essentially stayed the same, with only small changes in wording and order. These recommendations appear in the revised report with additional resources from recent statistics education literature and specific suggestions for teachers, with frequent references to examples in the appendices.

The primary change has been the addition of two new emphases under the recommendation to teach statistical thinking; these two new emphases therefore receive extra detail. The first (Teach statistics as an investigative process of problem-solving and decision-making) begins with the statement, "Students should not leave their introductory statistics course with the mistaken impression that statistics consists of an unrelated collection of formulas and methods."

Some textbook authors have already begun to take a more holistic approach to statistics in the introductory—and potentially only—course in statistics. The emphasis on this aspect of teaching statistical thinking is meant to inspire more authors to do the same and to encourage instructors to be aware of the big-picture perspective that may be missing from their current courses. For example, rather than teach statistics as a list of topics that students might perceive as disconnected from one another, important content can be taught from the first day of class and repeated frequently through case studies or engaging examples that highlight the investigative nature of the discipline. Students can learn the process of beginning with a research question and crafting a plan to gather and explore relevant data to draw appropriate conclusions about the research

question. Instructors should be encouraged to think about how to incorporate this approach into their current curricula.

The second emphasis (Give students experience with multivariable thinking) includes the statement, "We must prepare our students to answer challenging questions that require them to investigate and explore relationships among many variables."

A stand-alone appendix on multivariable thinking provides specific examples of extensions from more typical content in an introductory course that can help instructors add some exposure without adding too much more content to an already-crowded syllabus. This does not necessarily require using advanced inferential techniques, but rather encourages students to explore beyond two variables, perhaps with visualization tools, stratification, or multiple regression as a descriptive approach.

As an example, if students are already exploring the relationship between two quantitative variables such as gas mileage and weight of a car, the students might have ideas about what other factors might affect these variables. They might wonder, or the instructor could posit, "Is there a difference if the car is made domestically or not?" This additional information can be explored by creating scatterplots for each group or overlaying them in some fashion.

Recommendation 1: Teach Statistical Thinking

An introductory course is often the only statistics course that a student takes. As such, it is important to think carefully about what the focus should be in this course: What should we teach, and what skills do we want our students to have when they leave the course? Will students use statistics in follow-up courses and careers, and will they be consumers of statistical information presented in the news and abounding in everyday life?

It is essential to work on the development of skills that will allow students to think critically about statistical issues and recognize the need for data, the importance of data production, the omnipresence of variability, and the quantification and explanation of variability. As part of the development of statistical thinking skills, it is crucial to focus on helping students become better-educated consumers of statistical information by introducing them to the basic language and fundamental ideas of statistics, and by emphasizing the use and interpretation of statistics in everyday life.

Instructors of statistics would be smart to emphasize the practical problem-solving skills that are necessary to answer statistical questions. We should model statistical thinking and the statistical analysis process for our students throughout a course, rather than present students with a set of isolated tools, skills, and procedures. Effective statistical thinking requires seeing connections among statistical ideas and recognizing that most statistical questions can be solved with a variety of procedures and that there is often more than one acceptable solution.

Instructors can illustrate statistical thinking by working through examples and explaining, along the way, the questions that come up and the processes involved in taking a problem from its conception to its conclusion. The instructor can ask things such as:

• "What is the question being asked?"

• "Why do we need data to answer this question?"

• "What about variation?"

• "How do we make decisions based on the data?"

• "How do we account for variability in that decision?" and

• "Are there limitations on our conclusions based on how our data were measured and/or collected?"

Wild and Pfannkuch give a detailed breakdown of statistical thinking in their 1999 paper. We also should teach students that the practical operation of statistics is to collect and analyze data to answer questions.

As a part of the overarching recommendation to teach statistical thinking, the introductory course could teach statistics as an investigative process of problem-solving and decision-making. All students also should receive experience with multivariable thinking in the introductory course. Rather than limiting the focus to single variables or relationships between two variables, present examples that illustrate how multiple variables relate to one another. This is especially important because the types of questions students are likely to encounter beyond the introductory course—perhaps within their own fields of study—could easily involve the interplay among many variables. An example described in the report in more detail involves average teacher salary and average SAT score across the 50 states, and how the relationship between these variables—and the resulting interpretation—changes when taking into account the proportion of students who take the SAT in different states.
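The kind of stratified exploration described in the car example can be carried out with any statistical software. A pure-Python sketch with invented data (in class, one would draw the two scatterplots rather than compute correlations):

```python
# Stratifying a two-variable relationship by a third, categorical variable
# (hypothetical illustrative data, not from the report):
# does the mileage-weight relationship differ for domestic vs. imported cars?

def correlation(xs, ys):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# (weight in 1000 lb, mpg, origin) -- invented values
cars = [
    (3.5, 18, "domestic"), (4.1, 15, "domestic"), (3.0, 22, "domestic"),
    (4.4, 14, "domestic"), (2.2, 31, "import"), (2.6, 27, "import"),
    (2.0, 34, "import"), (2.9, 25, "import"),
]

for origin in ("domestic", "import"):
    group = [(w, m) for w, m, o in cars if o == origin]
    r = correlation([w for w, _ in group], [m for _, m in group])
    print(origin, round(r, 2))
```

Comparing the per-group plots (or correlations) to the pooled one is exactly the step that moves students beyond two-variable thinking.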

Recommendation 2: Focus on Conceptual Understanding

Students have to be able to make decisions about the most-appropriate ways to visualize, explore, and, ultimately, analyze a set of data. It will not be helpful for them to know about the tools and procedures that can be used to analyze data if they do not first understand the underlying concepts.

Procedural steps too often claim students' attention. Teachers should instead direct their attention toward concepts. When students have a sound understanding of concepts such as variability, bias, randomness, distribution, and inference, they find it easier to use the necessary tools and procedures to answer particular questions about a data set. Students with a good conceptual foundation from an introductory course will be better prepared to study additional statistical techniques in a second course.

STATPREP: IMPROVING STATISTICS INSTRUCTION USING THE GAISE GUIDELINES

The recent report "A Common Vision for Undergraduate Mathematical Sciences Programs in 2025" (Saxe & Braddy, 2016), expressing the collective vision of the American Mathematical Association of Two-Year Colleges (AMATYC), American Mathematical Society (AMS), American Statistical Association (ASA), Mathematical Association of America (MAA), and Society for Industrial and Applied Mathematics (SIAM), and endorsed by the Conference Board of the Mathematical Sciences (CBMS), urged the mathematics and statistics community to modernize undergraduate curricula and pedagogies to improve student learning.

Among the challenges the report issued was a call for the community to close the ever-widening gap between the traditional introductory statistics courses taught at higher education institutions across the country and the data-driven workplace that today's college students will enter.

The National Science Foundation-funded StatPREP program was created to help address this ambitious agenda. The goal of StatPREP is to create a professional development program and regional communities of practice that support underserved statistics educators—adjuncts, community college instructors, and instructors outside math/stat departments.

Two-day summer workshops in urban centers across the USA will focus on the transition to data-centered teaching that relies on statistical, rather than mathematical, habits of mind. The project works to implement GAISE recommendations in statistical thinking, real data, and technology.

Recommendation 3: Integrate Real Data with a Context and a Purpose

It is crucial to use real data in context for teaching and learning statistics, both to give students experience with analyzing genuine data and to illustrate the usefulness and fascination of our discipline. Statistics can be thought of as the science of learning from data, so the context of the data becomes an integral part of the problem-solving experience. The introduction of a data set should include a context that explains how and why the data were produced or collected. Students should practice formulating strong questions and answering them appropriately, based on how the data were produced and analyzed.

Using real data sets of interest to students is an effective way to engage them in thinking about data and relevant statistical concepts. One way this could be accomplished is to gather data from students, either as part of a class activity or through a survey administered during the start of the course. This class-generated data could then be used to investigate relationships among variables or illustrate different ways to graph and explore particular types of variables. Other examples of real data can be found in online open data resources or perhaps through collaboration with a local researcher.

Recommendation 4: Foster Active Learning

Students play an active role in the learning process when they take responsibility for their learning. This might involve, for example, collaborating with peers by discussing or debating concepts and ideas, teaching their peers, or working through problem-solving exercises. Active learning methods allow students to discover, construct, and understand important statistical ideas and engage in statistical thinking. Other benefits include practice with using the language of statistics in communication and working in teams to solve problems.
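A small sketch of what can be done with class-generated survey data (all values invented for illustration): group a quantitative variable by a categorical one before graphing or discussing the comparison.

```python
# Hypothetical class survey records: (hours_sleep, commute_minutes, transport_mode).
survey = [
    (7.0, 10, "walk"), (6.5, 25, "bus"), (8.0, 15, "bike"),
    (5.5, 40, "car"), (7.5, 30, "bus"), (6.0, 20, "car"),
]

# Group a quantitative variable (commute time) by a categorical one (mode).
by_mode = {}
for _, minutes, mode in survey:
    by_mode.setdefault(mode, []).append(minutes)

# Mean commute time per transport mode -- a natural lead-in to side-by-side plots.
summary = {mode: sum(v) / len(v) for mode, v in by_mode.items()}
print(summary)
```

Even a tiny data set like this lets students move from "What is the typical commute?" to "How does it differ by mode, and why?"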

Activities provide teachers with a method of assessing student learning, provide feedback on how well students are learning, and shift their classroom role from lecturing to facilitating. Instructors should not underestimate the learning gains that can be achieved with activities or overestimate the value of lectures to convey information. Embedding even brief activities within lectures can break the natural occasional dips in attention experienced by passive or minimally engaged listeners.

Some rich activities can take an entire class session, but many valuable activities need not take much time. One active learning technique is known as think-pair-share. Students are presented with a question, asked to think about a response to the question, and then asked to discuss that response with a neighbor. The instructor might then call upon different pairs of students to share their responses with the class.

A think-pair-share discussion or prediction exercise may take only two or three minutes. Collecting on-the-spot data may take more time, but reaps benefits beyond the single activity that prompted the collection. Technology can also be used through a course management system or an audience response system to decrease the time required.

Recommendation 5: Use Technology to Explore Concepts and Analyze Data

Technology has changed the practice of statistics and, hence, should change what and how we teach. "Technologies" refer to a range of hardware and software that can do far more than simply handle the computational burden of analysis. Adopting the best available tools (subject to institutional constraints) lets students do analysis more easily, opening up time to focus on interpreting results and testing conditions rather than on computational mechanics. Technology should aid students in learning to think statistically and to discover concepts, sometimes through the use of simulation. It should also facilitate access to real (and possibly large) data sets, foster active learning, and embed assessment into course activities.

Statistics is practiced with computers, usually with specially designed computer software. Students should learn to use a statistical software package, if possible. Calculators can provide some limited functionality for smaller data sets, but their use should be supplemented with experience reading typical output from statistical software (e.g., Minitab, SPSS, R/RStudio). Regardless of the tools used, it is important to view the use of technology as not just a way to generate statistical output, but a way to explore conceptual ideas and enhance student learning.

Recommendation 6: Use Assessments to Improve and Evaluate Student Learning

Students will value what you assess; therefore, assessments have to be aligned with learning goals. Assessments need to focus on understanding key ideas, and not just on skills, procedures, and computed answers. Being able to calculate a p-value is not enough; students need to be able to draw conclusions about the research question from a p-value and also explain the reasoning process that leads from the p-value to the conclusion.

Useful and timely feedback is essential for assessments to lead to learning. Instructors should maximize opportunities to include formative assessments in their courses rather than focus exclusively on summative assessments. It is also worthwhile for instructors to review model assessments created by key leaders in statistics education, such as the Advanced Placement (AP) Statistics Exam, Comprehensive Assessment of Outcomes in Statistics (CAOS), and Level of Conceptual Understanding in Statistics (LOCUS).

Closing Thoughts

The six Recommendations for Teaching are essentially unchanged from the original GAISE report, but have more-robust explanations and discussion; in particular, they are supported by references to recent and current statistics education research literature, much of which did not exist in 2005.

The two new emphases given to the recommendation "Teach statistical thinking" may strike some readers as being among the aspirational aspects of the new report, even though they are supported by the statistics education community discourse over the previous decade. An intentional focus on the investigative nature of statistical study that leads to problem-solving and decision-making, as well as acknowledgment that many of those problems and decisions require handling multivariable data, brings the classroom experience closer to modern statistical practice.

To find out more about the original and revised reports, go to https://bit.ly/2GcuOvl.

There are many challenges inherent in changing curricula. In his closing rejoinder to the discussion of his provocative paper in the American Statistician, George Cobb stated that our thinking about the statistics curriculum has

VOL. 31.2, 2018 58 to go from the ground up, “start- meaning can be extracted from the Wasserstein, R.L., and Lazar, N.A. ing necessarily with alternatives data around them. 2016. The ASA’s statement on to the former consensus introduc- p-values: context, process, and tory course.” He called for all in Further Reading purpose. American Statisti- the profession to “experiment and cian 70(1):129–133. http://bit. evaluate, question everything, and Cobb, G.W. 2015. Mere renovation ly/1QCraDC. take nothing for granted.” is too little too late: We need Wild, C., and Pfannkuch, M. 1999. to rethink our undergraduate Cobb also noted that change Statistical Thinking in Empiri- curriculum from the ground up. is hard: “[C]hanging curriculum, cal Enquiry (with discussion). American Statistician 69(4):266– like moving a graveyard, depends International Statistical Review 282. http://bit.ly/2FEgk7u. on local conditions.” We concur, 67(3):223–265. delMas, R., Garfield, J., Ooms, A., and reiterate the importance of and Chance, B. 2007. Assess- faculty development efforts and ing students’ conceptual under- resources created by organizations standing after a first course in About the Authors like the ASA and the Consortium statistics. Statistics Education for the Advancement of Under- Research Journal 6:28–58. http:// Beverly L. Wood is an assistant professor and the graduate Statistics Education discipline chair for mathematics at Embry-Riddle Aeronautical bit.ly/2tNR1Kr. University’s Worldwide Campus. (CAUSE) to support instructors Franklin, C., Hartlaub, B., Peck, and the profession. R., Scheaffer, R., Thiel, D., and Megan Mocko is a master lecturer at the University of Florida in the Department of Statistics. Her focus is on The revised GAISE College Freier, K.T. 2011. AP statis- Report provides guidance for next teaching introductory statistics in multiple formats: face-to-face, tics: Building bridges between hybrid, and completely online. 
She studies different techniques steps along that path. It will take high school and college sta- to improve learning for students with learning disabilities extensive faculty development and tistics education. American and students in the online learning environment, as well as training for its substance to be real- Statistician 65:177–182. http:// teaching with technology. ized. Such development can help bit.ly/2DqH3yj. Michelle Everson is a teaching professor and instructors achieve the new learn- Gould, R. 2010. Statistics and the program specialist in the Department of Statistics at The ing outcomes emphasized in this modern student. International Ohio State University. Before this, from 2002 until 2014, report and highlight the impor- she was a member of the faculty in the Quantitative Methods Statistical Review 78:297–315. in Education program in the Department of Educational tance of teaching statistics in the Jacobbe, T., Whitaker, D., Case, C., Psychology at the University of Minnesota. She received an spirit of the GAISE recommen- and Foti, S. 2014. The LOCUS Outstanding Teaching award from the University of Minnesota dations. Rob Gould suggests that assessment at the college level: in 2009 and the ASA Waller Education Award for excellence more examples that resonate with in undergraduate teaching in 2011. From 2013 until 2015, Conceptual understanding in she was the editor of the Journal of Statistics Education, and modern students are needed. The introductory statistics. In Sus- co-chair of the ASA committee tasked with updating the field also needs additional efforts to tainability in statistics education. GAISE College Report. bridge the gap between statistical Proceedings of the Ninth Inter- Nicholas J. Horton is Beitzel Professor of education and statistical practice. national Conference on Teaching Technology and Society at Amherst College. 
He is active in Such initiatives will help ensure Statistics (ICOTS9, July 2014), efforts to ensure that students develop the capacity to “think that the introductory statistics Makar, K., de Sousa, B., and with data” in their statistics and data science courses. He course remains a vibrant choice for Gould, R., eds. 2014. http://bit. serves as a member of the National Academies Committee on Applied and Theoretical Statistics and chairs the Committee of students seeking to learn how ly/2peDzuk. Presidents of Statistical Societies.

Paul Velleman is professor emeritus of statistical sciences at Cornell University and the developer of the Data Desk statistics software, the DASL library of teaching datasets, and the ActivStats multimedia statistics e-book and associated online tools. He is the co-author of several statistics texts, including Intro Stats (5th edition).
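The report's twin emphases above — discovering concepts through simulation (Recommendation 5) and treating a p-value as the start of an argument rather than the answer (Recommendation 6) — can be illustrated with a few lines of code. The example below is ours, not the report's: a hypothetical classroom question about a possibly unfair coin, answered with a simulation-based two-sided p-value.

```python
import random

# Hypothetical classroom question (ours, not the report's): a coin lands
# heads 42 times in 100 spins -- is it plausibly fair?  A simulation-based
# p-value replaces the normal-approximation formula with direct resampling.
random.seed(1)

observed_heads = 42
n_spins = 100
n_sims = 10_000

# Simulate 100 fair-coin spins many times; count how often the result is
# at least as far from 50 as the observed 42 (a two-sided p-value).
extreme = 0
for _ in range(n_sims):
    heads = sum(random.random() < 0.5 for _ in range(n_spins))
    if abs(heads - 50) >= abs(observed_heads - 50):
        extreme += 1

p_value = extreme / n_sims
print(f"simulated two-sided p-value: {p_value:.3f}")
```

The number itself (roughly 0.13 for this setup) is only the start of the assessment the report has in mind: students should then explain that, if the coin were fair, a result at least this far from 50 heads would occur in roughly one of every seven or eight experiments, so these data offer little evidence against fairness.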

[The Odds of Justice] Mary W. Gray, Column Editor

Whom Shall We Kill? How Shall We Kill Them?

Supreme Court Justice Harry Blackmun once said, "I shall no longer tinker with the machinery of death." What about the use of statistics to address these questions?

A 2012 report from the National Research Council concluded that statistics concerning the deterrence, or lack thereof, of the death penalty should be ignored. Due to problems with the data—either those based on a panel study of comparable jurisdictions or longitudinal examinations of before-and-after effects of the imposition or abolishment of the death penalty—the conflicting results (yes, it was a deterrent; no, it was not; or there is clearly no effect) could not be supported.

The effects of the death penalty hardly being a situation for a clinical trial, it is unlikely that statistics will achieve any more-conclusive results on this subject. If only those executed can be guaranteed to be deterred, what is the justification for the sentence of death? "Closure" for relatives of victims is sometimes cited, but instances abound of unsuccessful pleas for revocation of the penalty from those most affected by the crimes. What then? Revenge? Perhaps, but why is it that a country that likes to portray itself as a bastion of human rights is the only western country with such violent state-administered revenge, ranking behind only China, Iran, Saudi Arabia, Iraq, Pakistan, and Egypt in the number of executions in 2017?

Since the Eighth Amendment to the U.S. Constitution forbids "cruel and unusual" punishment, perhaps statistics might fortify opposition based on the unusual nature of the penalty. On the other hand, 31 states still permit execution.

What else does the U.S. Constitution say? There is no explicit mention of the death penalty in the Constitution, but the Fifth Amendment uses the term "capital crime" in guaranteeing due process. However, the term is defined to be a crime so serious that it would justify the imposition of the death penalty—a somewhat circular result.

In addition to murder and treason, a variety of crimes have historically been considered "capital," but the definition has evolved, at least in the United States. Even constitutional "originalists," such as the late Justice Antonin Scalia, would not have maintained that many of the crimes punishable by execution at the time of the formation of the Constitution should be considered "capital" today—but how far has the definition evolved?

Although proponents of abolition of the death penalty rely on the constitutional prohibition of "cruel and unusual" punishment, there has been little parsing of whether a punishment must be both "cruel" and "unusual" to be forbidden, perhaps in the belief that the terms are redundant, but it is clear that forms of execution now considered cruel may well once have been quite common.

Death by firing squad (fusillade), hanging, or electrocution is still legal in some states, although lethal injection is currently the most common form of execution. Difficulties in obtaining preferred substances for such executions, botched attempts, and controversy about the pain and suffering involved have led to claims that the firing squad—the most common form of execution worldwide—is the least cruel, as asserted in Justice Sonia Sotomayor's dissent to the refusal of the U.S. Supreme Court to hear Arthur v. Alabama last year. Although the last such execution was in 2010, Utah recently reinstituted fusillade as an alternative should substances for lethal injection not be available.

If not deterrence, closure, or revenge, what about justice as justification for the existence of the death penalty? Surely justice demands equal treatment as guaranteed by the Fifth and 14th amendments? Statistics has played a role in guiding decisions about what constitutes equal treatment in employment, housing, education, and—what is relevant here—criminal justice.

With respect to imposition of the death penalty, the question of equality of justice arises in two ways: in the process employed—was jury selection discriminatory?—and in the outcome—what factors determined the penalty?

For many years, statistical arguments have been made regarding the fairness of a trial, including the selection of the jury members who choose to impose the death penalty. In some cases, courts have observed discrepancies, but usually some other factor also points to a source of discriminatory results, such as choosing potential jurors from segregated lists of taxpayers (white paper for whites, yellow paper for blacks), rather than to the statistics themselves.

For example, although the Supreme Court asserted that it was not necessary to its finding of discrimination, it found in Whitus v. Georgia that "assuming that 27% of the list was made up of the names of qualified Negroes, the mathematical probability of having seven Negroes on a venire (jury selection) of 90 is .000006." Other commentators have addressed the composition of the judicial system—how many minority prosecutors or judges? Regardless of whether their backgrounds unfairly influence the judgments, certainly an argument can be made for diversity in our justice system.

Without convincing evidence concerning the presence of imbalance in the process or direct evidence of discrimination in the judgment, statisticians seek to model the outcome. The favorite Simpson's paradox examples included in many elementary statistics texts illustrate the significance of the race of the victim and the race of the defendant in death penalty cases.

Analysts have constructed elaborate logistic models attempting to demonstrate the factors influencing the sentencing. In addition to aspects such as race, gender, age, religion, education, economic status, and previous criminal record of the defendant and of the victim, a model may include variables like the cruelty of the murder, whether there was an admission of guilt or contrition on the defendant's part, and the impact on the victim's family.

While such post facto analyses are interesting, courts have often remarked that the statistics tell us nothing about appropriate sentencing in any individual case (McCleskey v. Kemp).

Sentencing by algorithm in other cases is already in use. Courts in a number of states employ consultants to predict recidivism and hence the length of sentences of convicted felons. It is nothing new for pre-sentencing reports to aid judges in this responsibility, but should statistics decide in an individual case? If the defendant is a young minority male with less than a high school education who is homeless or from a part of the city with a high crime rate, should that call for a longer sentence if statistics show that these are factors traditionally associated with high recidivism rates? What is worse, what if only the consultant knows the factors being used in a patented algorithm not available for challenge by the defendant's attorney? Was race used? How about religion?

Are we responsible for the misuse of our discipline? We challenge studies of e-cigarettes financed by tobacco companies or similar situations because we see potential bias in the source. In sentencing felons, are the courts—our justice system—perpetuating this process? Should we take part? In particular, should algorithmic judgments influence juries and judges in the ultimate case of the death penalty?

There are other roles for an expert statistical witness in death penalty cases. Fairness dictates that only those capable of making the appropriate judgment (mens rea) should be held responsible for their crimes. Thus, the Supreme Court has found that the death penalty is impermissible for juveniles (Roper v. Simmons) and for some who are mentally impaired (Atkins v. Virginia) or mentally ill (Ford v. Wainwright).

The problem is to determine, in the latter case, the degree of diminished responsibility. Can the defendant assist in her/his own defense? Can she/he understand the meaning of being tried for murder and facing the death penalty? A recent Supreme Court ruling faulted measurement experts for the impreciseness of tests for mental impairment, ruling that an IQ score of at least 70 as the sole determining factor permitting a death penalty was unconstitutional.

Is a sentence of life imprisonment without the possibility of parole fair if the developmental immaturity or defect may disappear? Expansion of neuroscience techniques for identifying the effects of developmental immaturity or damage encompasses new areas for statistical help. Are we ready and willing to assist?

About the Author

Mary Gray, who writes The Odds of Justice column, is professor of mathematics and statistics at American University in Washington, DC. Her PhD is from the University of Kansas, and her JD is from Washington College of Law at American. A recipient of the Elizabeth Scott Award from the Committee of Presidents of Statistical Societies, she is currently chair of the American Statistical Association's Scientific and Public Affairs Advisory Committee. Her research interests include statistics and the law, economic equity, survey sampling, human rights, education, and the history of mathematics.
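The Whitus v. Georgia figure can be checked directly. The sketch below uses a simple binomial model — sampling with replacement from a list that is 27% Black, a simplification of the true hypergeometric draw — for the opinion's venire of 90:

```python
from math import comb

# Whitus v. Georgia: 7 Black members on a venire of 90, drawn from a list
# assumed to be 27% Black.  A simple binomial model (sampling with
# replacement -- the true draw is hypergeometric) gives the chance of a
# result this skewed.
n, p, observed = 90, 0.27, 7

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Probability of 7 or fewer Black venire members under the model.
prob = sum(binom_pmf(k, n, p) for k in range(observed + 1))
print(f"P(X <= 7) = {prob:.1e}")  # same order of magnitude as the court's .000006
```

Under this model the expected number of Black venire members is about 24, so a venire with only 7 is astronomically unlikely absent discrimination in the selection process — which is exactly the inference the Court drew.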
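The Simpson's paradox examples the column alludes to typically show that aggregating over the victim's race can reverse the apparent direction of a racial disparity in sentencing. The counts below are invented for illustration — they mimic the structure of the familiar textbook data but are not from any actual study:

```python
# Invented counts (for illustration only -- not from any actual study):
# (victim_race, defendant_race) -> (death_sentences, total_cases)
cases = {
    ("white", "white"): (19, 160),
    ("white", "black"): (11, 48),
    ("black", "white"): (0, 16),
    ("black", "black"): (6, 143),
}

def rate(defendant, victim=None):
    """Death-sentence rate for a defendant race, optionally within one victim-race stratum."""
    counts = [v for (vic, dfd), v in cases.items()
              if dfd == defendant and (victim is None or vic == victim)]
    deaths = sum(d for d, _ in counts)
    total = sum(t for _, t in counts)
    return deaths / total

# Aggregated over all cases, white defendants appear MORE likely to be
# sentenced to death...
print(rate("white"), rate("black"))  # ~0.108 vs ~0.089

# ...yet within EACH victim-race stratum, black defendants have the higher
# rate.  The victim's race is the lurking variable: death sentences are far
# more common when the victim is white, and white defendants' victims are
# mostly white.
for victim in ("white", "black"):
    print(victim, rate("white", victim), rate("black", victim))
```

Stratifying on the lurking variable by hand is what the elaborate logistic models mentioned in the column attempt to do systematically, with many more covariates at once.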

[Visual Revelations] Howard Wainer, Column Editor

Ancient Visualizations Howard Wainer and Michael Friendly

The graphic representation of scientific phenomena serves at least four purposes. One of these is standardizing phenomena in visual form; a second is communicating such phenomena more broadly, thus serving the cause of publicity to the scientific community. Graphics preserve what is ephemeral and distribute it to all who would view or purchase the volume, not just the lucky few who were in the right place at the right time with the right equipment to see the evidence directly with their eyes.

The graphic display of natural phenomena, however, was viewed as yet more. Étienne-Jules Marey, the French scientist who did much to develop the Graphical Method in the late 1800s, suggested in an accompanying note to his design of a portable polygraph that, through the use of graphics, scientists could reform the very essence of scientific research and scientific evidence: "The graphic method translates all these changes in activity of forces into an arresting form that one could call the language of the phenomena themselves …"

Such a language was, for Marey, universal in two senses. Graphical representation could cut across the artificial boundaries of natural languages to reveal nature to all people, and across disciplinary boundaries to capture phenomena as diverse as the pulse of a heart and the downturn of an economy. Pictures became more than merely helpful tools: They were the voice of nature herself.

The history of visualization is much longer. Some of the best-known very early examples of human visualizations are found in the Rouffignac caves in the south of France. On the cave walls are hundreds of remarkable drawings of animals, human-like figures, and abstract or geometric signs, which carbon dating has estimated to be 17,300 years old.

Although the Rouffignac Cave drawings are stylized, they are spectacular by any measure. What were the artists thinking? Why did they draw them? We can only speculate: perhaps a celebratory display of past hunting success, or a visual representation of group stories or myths. What does seem clear is that they had some symbolic meaning: The artist, having neither a written language nor number system, used the drawings/visualization to provide at least a rudimentary communication to any who viewed but were unable to see it first-hand.

An example of awakening power is the story of the wooly mammoth. The tale begins in 1801, when Charles Wilson Peale, a curator of one of the first museums of natural history in the U.S., put

Figure 1. One panel of a wooly mammoth, from the Rouffignac Cave. Photo courtesy of the Bradshaw Foundation.

a mammoth skeleton on display in Philadelphia's Philosophical Hall. The skeleton was a mixture of anatomy and guesswork, with missing bones replaced by wood or papier mâché. Among his other mistakes, Peale had the tusks on upside down (modern reconstructions have it right: upward-curling tusks, humped shoulders, downward-sloping spine).

Yet our modern conception was not obtained by studying bones, but by looking closely at the art of those who observed the animals first-hand. More than half of the hundreds of drawings left by Paleolithic artists on the walls of the Rouffignac Cave were of the wooly mammoth (see Figure 1). The value of such visual communications has sometimes proved to be profound. Through such depictions, the original artists, their bones long dust, have reached out to us across the centuries and aided modern understanding.

We needn't limit ourselves to implicit visual instruction from ancient observers, however. On the walls of a 4,000-year-old tomb in Beni-Hassan, an ancient Egyptian cemetery (about 20 kilometers south of modern-day Minya), are inscribed explicit visual wrestling instructions for how to throw one's opponent, that can be profitably studied even today (Figure 2). It

Note: This essay is an expansion of material from Chapter 1 of a new book by Friendly & Wainer scheduled for publication in April 2019.

Figure 2. Wrestling instructions from a drawing on the east wall of the tomb of Baqt at Beni-Hassan Cemetery (ca. 2000 BCE). Source: www.celtictigerswrestling.com/about-us/history-of-wrestling-2

tells a graphic story of wrestling techniques, animated in a way that can compete with modern visualizations. Who knew that there were so many different positions?

Consequently, through their visualizations, ancient artists—we might call them early data scientists—conveyed information across the ages, to those who were not able to be there and observe for themselves. These two examples provide a vivid reminder of one of the most important goals of scientific visualization.

Further Reading

Friendly, M., and Wainer, H. 2019. On the Origin of Graphic Species. Cambridge: Harvard University Press.

Hohn, D. 2017. When they roamed the Earth. New York Times Book Review.

Marey, E.-J. 1878. La Méthode Graphique dans les Sciences Expérimentales et Principalement en Physiologie et en Médecine. Paris: G. Masson.

About the Authors

Howard Wainer is a statistician and author who lives in Pennington, NJ. He has written this column since 1990.

Michael Friendly is a professor of psychology at York University in Toronto, Canada. He has long studied the history of data visualization.

STUDENTS: TAKE THE NEXT STEP AND JOIN THE AMERICAN STATISTICAL ASSOCIATION

GET INVOLVED Connect with other young professionals through the ASA Community and STATtr@k

LEARN Get online access to journals and subscriptions to Amstat News and Significance

ADVANCE Access networking and career opportunities

Student memberships are only $25! Join today at www.amstat.org/join.

HELP US RECRUIT THE NEXT GENERATION OF STATISTICIANS

The field of statistics is growing fast. Jobs are plentiful, opportunities are exciting, and salaries are high. So what's keeping more kids from entering the field? Many just don't know about statistics. But the ASA is working to change that, and here's how you can help:

• Send your students to www.ThisIsStatistics.org and use its resources in your classroom. It's all about the profession of statistics.

• Download a handout for your students about careers in statistics at www.ThisIsStatistics.org/educators.

If you're on social media, connect with us at www.Facebook.com/ThisIsStats and www.Twitter.com/ThisIsStats. Encourage your students to connect with us, as well.

Site features:

• Videos of young statisticians passionate about their work

• A myth-busting quiz about statistics

• Photos of cool careers in statistics, like a NASA biostatistician and a wildlife statistician

• Colorful graphics displaying salary and job growth data

• A blog about jobs in statistics and data science

• An interactive map of places that employ statisticians in the U.S.