CHANCE
Vol. 32, No. 3, 2019
Using Data to Advance Science, Education, and Society
Special Issue on ASTROSTATISTICS

Including...

Statistics for Stellar Systems: From Globular Clusters to Clusters of Galaxies

ARIMA for the Stars: How Statistics Finds Exoplanets

Special Issue on Astrostatistics

ARTICLES
4 Special Issue on Astrostatistics
  Jessi Cisewski-Kehe and Chad Schafer
6 Topology of Our Cosmology with Persistent Homology
  Sheridan B. Green, Abby Mintz, Xin Xu, and Jessi Cisewski-Kehe
14 Mapping the Large-Scale Universe through Intergalactic Silhouettes
  Collin A. Politsch and Rupert A.C. Croft
20 Just How Far Away is that Galaxy, Anyway? Estimating Galaxy Distances Using Low-Resolution Photometric Data
  Peter E. Freeman
27 Statistics for Stellar Systems: From Globular Clusters to Clusters of Galaxies
  Gwendolyn Eadie
35 ARIMA for the Stars: How Statistics Finds Exoplanets
  Eric D. Feigelson
41 Identifying Open Clusters With Extreme Kinematics Using PRIM
  Mark R. Segal and Jacob W. Segal
50 Time Series Clustering Methods for Analysis of Astronomical Data
  David J. Corliss

COLUMNS
59 The Odds of Justice: Alexa Did It!
  Mary W. Gray, Column Editor

DEPARTMENTS
3 Editor's Letter
62 Letter to the Editor: Reconsidering the Human Face as Box Plot
  David C. Hoaglin

Abstracted/indexed in Academic OneFile, Academic Search, ASFA, CSA/Proquest, Current Abstracts, Current Index to Statistics, Gale, Google Scholar, MathEDUC, Mathematical Reviews, OCLC, Summon by Serial Solutions, TOC Premier, Zentralblatt Math.

EXECUTIVE EDITOR
Scott Evans, George Washington University, Washington, D.C.
[email protected]

ADVISORY EDITORS
Sam Behseta, California State University, Fullerton
Michael Larsen, St. Michael's College, Colchester, Vermont
Michael Lavine, University of Massachusetts, Amherst
Dalene Stangl, Carnegie Mellon University, Pittsburgh, Pennsylvania
Hal S. Stern, University of California, Irvine

EDITORS
Jim Albert, Bowling Green State University, Ohio
Phil Everson, Swarthmore College, Pennsylvania
Dean Follman, NIAID and Biostatistics Research Branch, Maryland
Toshimitsu Hamasaki, Office of Biostatistics and Data Management, National Cerebral and Cardiovascular Research Center, Osaka, Japan
Jo Hardin, Pomona College, Claremont, California
Tom Lane, MathWorks, Natick, Massachusetts
Michael P. McDermott, University of Rochester Medical Center, New York
Mary Meyer, Colorado State University at Fort Collins
Kary Myers, Los Alamos National Laboratory, New Mexico
Babak Shahbaba, University of California, Irvine
Lu Tian, Stanford University, California

COLUMN EDITORS
Di Cook, Iowa State University, Ames (Visiphilia)
Chris Franklin, University of Georgia, Athens (K-12 Education)
Andrew Gelman, Columbia University, New York, New York (Ethics and Statistics)
Mary Gray, American University, Washington, D.C. (The Odds of Justice)
Shane Jensen, Wharton School at the University of Pennsylvania, Philadelphia (A Statistician Reads the Sports Pages)
Nicole Lazar, University of Georgia, Athens (The Big Picture)
Bob Oster, University of Alabama, Birmingham, and Ed Gracely, Drexel University, Philadelphia, Pennsylvania (Teaching Statistics in the Health Sciences)
Christian Robert, Université Paris-Dauphine, France (Book Reviews)
Aleksandra Slavkovic, Penn State University, University Park (O Privacy, Where Art Thou?)
Dalene Stangl, Carnegie Mellon University, Pittsburgh, Pennsylvania, and Mine Çetinkaya-Rundel, Duke University, Durham, North Carolina (Taking a Chance in the Classroom)
Howard Wainer, National Board of Medical Examiners, Philadelphia, Pennsylvania (Visual Revelations)

AIMS AND SCOPE
CHANCE is designed for anyone who has an interest in using data to advance science, education, and society. CHANCE is a non-technical magazine highlighting applications that demonstrate sound statistical practice. CHANCE represents a cultural record of an evolving field, intended to entertain as well as inform.

SUBSCRIPTION INFORMATION
CHANCE (ISSN: 0933-2480) is co-published quarterly in February, April, September, and November for a total of four issues per year by the American Statistical Association, 732 North Washington Street, Alexandria, VA 22314, USA, and Taylor & Francis Group, LLC, 530 Walnut Street, Suite 850, Philadelphia, PA 19106, USA.

U.S. Postmaster: Please send address changes to CHANCE, Taylor & Francis Group, LLC, 530 Walnut Street, Suite 850, Philadelphia, PA 19106, USA.

ASA MEMBER SUBSCRIPTION RATES
ASA members who wish to subscribe to CHANCE should go to ASA Members Only, www.amstat.org/membersonly, select the "My Account" tab, and then "Add a Publication." ASA members' publication period will correspond with their membership cycles.

SUBSCRIPTION OFFICES
USA/North America: Taylor & Francis Group, LLC, 530 Walnut Street, Suite 850, Philadelphia, PA 19106, USA. Telephone: 215-625-8900; Fax: 215-207-0050. UK/Europe: Taylor & Francis Customer Service, Sheepen Place, Colchester, Essex, CO3 3LP, United Kingdom. Telephone: +44-(0)-20-7017-5544; Fax: +44-(0)-20-7017-5198.

For information and subscription rates, please email [email protected] or visit www.tandfonline.com/pricing/journal/ucha.

OFFICE OF PUBLICATION
American Statistical Association, 732 North Washington Street, Alexandria, VA 22314, USA. Telephone: 703-684-1221. Editorial Production: Megan Murphy, Communications Manager; Valerie Nirala, Publications Coordinator; Ruth E. Thaler-Carter, Copyeditor; Melissa Gotherman, Graphic Designer. Taylor & Francis Group, LLC, 530 Walnut Street, Suite 850, Philadelphia, PA 19106, USA. Telephone: 215-625-8900; Fax: 215-207-0047.

Copyright ©2019 American Statistical Association. All rights reserved. No part of this publication may be reproduced, stored, transmitted, or disseminated in any form or by any means without prior written permission from the American Statistical Association. The American Statistical Association grants authorization for individuals to photocopy copyrighted material for private research use on the sole basis that requests for such use are referred directly to the requester's local Reproduction Rights Organization (RRO), such as the Copyright Clearance Center (www.copyright.com) in the United States or The Copyright Licensing Agency (www.cla.co.uk) in the United Kingdom. This authorization does not extend to any other kind of copying by any means, in any form, and for any purpose other than private research use. The publisher assumes no responsibility for any statements of fact or opinion expressed in the published papers. The appearance of advertising in this journal does not constitute an endorsement or approval by the publisher, the editor, or the editorial board of the quality or value of the product advertised or of the claims made for it by its manufacturer.

RESPONSIBLE FOR ADVERTISEMENTS
Send inquiries and space reservations to: [email protected].

Printed in the United States on acid-free paper.

WEBSITE
http://chance.amstat.org

Cover design: Melissa Gotherman

EDITOR'S LETTER

Scott Evans

Dear CHANCE Colleagues,

In 1961, the president of the United States, John F. Kennedy, wanted to land humans on the moon. For the next several years, NASA prepared to do so. On July 16, 1969, Apollo 11 blasted off carrying astronauts Neil Armstrong, Edwin "Buzz" Aldrin, and Michael Collins. Four days later, on July 20, Armstrong and Aldrin landed the lunar module, called the Eagle, on the moon. Astronaut Collins stayed in orbit, performing experiments and taking photos. Armstrong became the first human to step on the moon. He and Aldrin walked around for three hours, conducting experiments and picking up bits of moon dirt and rocks. They planted a United States flag and left a sign that read "Here men from the planet Earth first set foot upon the moon July 1969, A.D. We came in peace for all mankind." The two astronauts returned to orbit, joining Collins. On July 24, 1969, all three astronauts came back to Earth safely. President Kennedy's wish came true in less than 10 years.

As humanity has continued to explore space, collaboration between statisticians and astronomers has increased. This special issue of CHANCE is devoted to astrostatistics. I wish to thank the ASA's Astrostatistics Special Interest Group, which helped us develop this special issue. Special thanks to the guest editors for this issue: Jessi Cisewski-Kehe, assistant professor in the Department of Statistics and Data Science at Yale University and past chair of the Astrostatistics Special Interest Group, and Chad Schafer, associate professor in the Department of Statistics and Data Science at Carnegie Mellon University and chair of the Astrostatistics Special Interest Group.

Scott Evans

Special Issue on Astrostatistics
Jessi Cisewski-Kehe and Chad Schafer

Like most areas of science, astronomy has benefited from a tremendous increase in the amount of available data gathered in recent decades, from both ground- and space-based instruments. For example, the Sloan Digital Sky Survey (SDSS) commenced in 2000, at a time when studies were typically conducted with galaxy samples of sizes on the order of 1,000. By 2008, SDSS had produced a map of 930,000 galaxies, with high-resolution spectral information for each. By the time of its 15th data release in 2017, SDSS had a catalog of more than 2.5 million galaxies. The Large Synoptic Survey Telescope (LSST), currently under construction in Chile, is projected to yield a catalog of 10^10 galaxies by the time its 10-year survey ends in 2032.

Data gathered about astronomical objects take a range of forms, but at their basic level, they consist of intensity measurements, often observed at a range of different wavelengths, at different spatial positions, and at various points in time. It is the variability across objects, wavelengths, and space and time that encodes invaluable information regarding the formation, evolution, and current state of the universe and its components. Analyses are complicated, however, by observational limitations, including measurement error, contamination, and the inherent limitations of our Earth-centered frame of reference.

The role and participation of statisticians in astronomy has increased along with the sizes of the data sets and the complexity of the inference challenges, but further collaboration is needed to address the many open problems. A core group of statisticians is committed to this effort; they are working to make astronomical data analysis known to the broader statistical community and helping to bridge any language barriers that may exist.

It has been our experience that the perceived difficulty of the subject matter is exaggerated. We typically involve undergraduate and graduate students, with no background in astronomy, in fruitful projects.

With this in mind, this special issue of CHANCE seeks to provide some context for common themes in astrostatistics, the types of data encountered, and the important role that statistical methods can play. Methods well-established in statistics may be little-known or used in astronomy. Astronomical problems also present unique challenges that force the development of new methods, or at least push existing methods in interesting directions.

In this issue, Green, Mintz, Xu, and Cisewski-Kehe discuss one of the recurring challenges of working with modern astronomical data—namely, that the data in their raw form are complex and not amenable to classical statistical analysis. Astronomers have often used ad hoc compression of these data, extracting low-dimensional features that are believed to encode important information, but the potential loss of information is great. Ideas from topological data analysis, a new and growing area of study in statistics, have great promise for data-driven approaches to extracting important information from data such as the large-scale structure of the universe.

Politsch and Croft describe the challenges and potential of working with the Lyman-alpha forest, another complex data set that informs our understanding of the large-scale structure of the universe. Lyman-alpha forest data are particularly interesting because of the innovative approach for collecting these data. The light from distant quasars provides access to some aspects of the intergalactic medium, which can then be used to infer properties of the distribution of gas in regions that would otherwise be inaccessible.

Although modern astronomical data sets are massive, individual observations are often of low quality. Freeman provides an overview of the challenges of using low-resolution photometry to estimate a

fundamental property of a celestial object: its distance from us. On cosmological scales, distance is a proxy for time, so any study of the evolution of the universe relies on accurate distance measures, typically quantified via the redshift.

Eadie considers a similarly fundamental problem—that of estimating the mass of gravitationally bound dynamical systems like galaxies. Such estimates are crucial to understanding the nature of dark matter. How the positions and velocities of stars (called tracers) are used to constrain a system's mass illustrates a range of inference approaches in astronomy: These observations have a complex, physically motivated relationship with the quantity of interest. Estimation in such situations presents unique challenges, and has motivated the development of a range of novel techniques.

[Photos: Jessi Cisewski-Kehe, Guest Editor; Chad Schafer, Guest Editor]

A chief benefit of increased participation of statisticians in astronomy is the cross-pollination of ideas across fields: Statisticians work in a wide range of domains, of course, and experts in the development of statistical methods are always on the lookout for new areas of application for classic and novel approaches. Segal and Segal apply the Patient Rule Induction Method (PRIM), a less-known but potentially useful tool for supervised learning, to a classification problem in astronomy that involves identifying open clusters of stars using Gaia data.

Perhaps one of the most-exciting recent developments in astronomy is the discovery of a large number of exoplanets orbiting stars in the Milky Way, and statistical tools play a crucial role in making these discoveries. Feigelson provides an overview of the challenges of searching for the signatures of exoplanets in noisy time series, and discusses how classical time series models and modern machine learning can combine to develop improved approaches to this important problem.

Corliss demonstrates the potential of modern approaches to clustering in the classification of supernovae—the explosive deaths of stars—based on their observed time series. This work demonstrates that clustering, especially with data such as these, requires careful consideration of the choice of similarity measure.

With this issue as a starting point, there are many avenues for getting involved in astrostatistics.

• The American Statistical Association has an Astrostatistics Special Interest Group (https://community.amstat.org/astrostats/home) that welcomes and encourages the participation of statisticians and astronomers, including students, postdocs, researchers, faculty, and members of industry or government.

• The Astrostatistics and Astroinformatics Portal (ASAIP, https://asaip.psu.edu) is a central site that compiles information about the field relevant to astronomers, computer scientists, and statisticians. It includes information about papers, meetings, and other resources for both new and active researchers in astrostatistics.

• The Cosmostatistics Initiative (COIN, https://cosmostatistics-initiative.org) is an international and interdisciplinary community focused on developing stronger ties between the various fields related to astrostatistics. COIN also organizes "Residence Programs" where a small group of researchers in astronomy, computer science, cosmology, and statistics meets for a week in various destinations around the world to focus on solving several problems in the field while forming closer interdisciplinary ties. The Residence Programs have been quite productive and are a great way to make connections with those interested in astrostatistics.

• The International Astrostatistics Association (IAA, http://iaa.mi.oa-brera.inaf.it/IAA/home.html) seeks to bring together researchers from around the world who are interested in advancing the field of astrostatistics.

Involvement in any of these organizations can be a great way to learn more about astrostatistics. We hope many of you will join us in the exciting pursuit of exploring our universe.

Topology of Our Cosmology with Persistent Homology
Sheridan B. Green, Abby Mintz, Xin Xu, and Jessi Cisewski-Kehe

Originally a speculative branch of natural philosophy, cosmology is an ancient field concerned with answering some of the big-picture questions about the origin and evolution of our universe. Our modern scientific understanding of physical cosmology has been under development for roughly a century, beginning around the time that Albert Einstein was writing about the astronomical implications of his theory of general relativity. The discovery of the Cosmic Microwave Background (CMB) (the farthest structure whose light we can observe in the universe, which provides a snapshot of the universe only 400,000 years after its birth) in 1964 validated predictions of the original Big Bang model, leading to several additional decades of theoretical developments that have resulted in the current "standard model of cosmology." In this Lambda-Cold Dark Matter (ΛCDM) picture of the universe, the vast majority of matter is "dark" (i.e., it does not emit light and only interacts with regular matter via gravity) and the cosmos is expanding at an ever-increasing rate due to the abundance of dark energy (known as Λ), a mysterious substance with negative pressure that seems to permeate all of space.

In the past 20 years, the arrival of massive astronomical data sets that map out the large-scale structure (LSS) of the universe (e.g., the Sloan Digital Sky Survey [SDSS]), which is also referred to as the "cosmic web," has ushered in a data-driven, interdisciplinary era of "precision cosmology." The high nonlinearity of cosmic structure formation at and below megaparsec- (Mpc-) length scales precludes the ability to make many purely theoretical predictions of the consequences of ΛCDM (1 Mpc is approximately 3.25 million light years, and the largest galaxy clusters are on the order of Mpc). Fortunately, recent times have also brought about substantial increases in computational power, enabling predictions of ΛCDM to be made via cosmological N-body simulations (see Figure 1), one of the most-successful tools to date for making quantitative predictions of cosmic structure.

This abundance of data, both observed and simulated, has provided support for ΛCDM: Both theoretical and computational predictions of the model on large scales have been validated in statistical stress-tests against observations.

However, ΛCDM is not without its potential discrepancies and inconclusivities with the current generation of astronomical data. For instance, it remains unclear whether the structure of galaxy centers is consistent with ΛCDM, since there have been some hints that dark matter particles may actually scatter off each other (CDM is assumed to be collisionless), changing the central galactic density. We also have yet to verify whether Einstein's cosmological constant Λ, which is an unchanging property of space, does indeed describe dark energy or if the properties of dark energy instead vary across spacetime, as predicted by "quintessence" models.

Recently, a "crisis in cosmology" has been brewing, as the disagreement between the value of the Hubble constant, which describes the expansion rate of the universe, calculated from temperature fluctuations in the CMB, and the value calculated using Cepheid variable stars has risen to a 4.4σ tension. These stars, which are thousands of times brighter than the sun and oscillate periodically in brightness, are known as "standard candles" because their absolute brightness can be determined directly by measuring their periods. From the relationship between the absolute brightness and observed brightness of these stars, their distances can be determined, and that information can be used to trace the expansion history of the universe.

Various solutions to this apparent inconsistency with ΛCDM have been proposed, primarily including modifications to dark energy and the incorporation of the effects of the nonzero neutrino mass on large-scale structure.

To estimate free parameters and test ΛCDM, various model predictions are compared to astronomical observations and often combined using Bayesian inference

Figure 1. A simulation of the cosmic web under the ΛCDM paradigm. The large-scale structure in such simulations is both qualitatively and quantitatively consistent with the universe (right). The haloes within a thin slice of the simulation volume are shown as points representing individual dark matter haloes, and the color represents the density of the local matter field, with redder regions having higher density (left). The individual points of the 3D cosmological volume represent the locations of dark matter haloes.
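The Cepheid standard-candle reasoning described above amounts to a two-line calculation: the pulsation period gives the star's absolute brightness, and comparing absolute to observed brightness gives the distance. Below is a minimal Python sketch; the period-luminosity coefficients are illustrative stand-ins (real calibrations differ and carry uncertainties), while the distance-modulus formula itself is standard.

```python
import math

def cepheid_distance_pc(period_days, apparent_mag):
    """Estimate the distance (in parsecs) to a Cepheid from its pulsation
    period and apparent magnitude, via an illustrative period-luminosity
    relation and the standard distance modulus."""
    # Illustrative Leavitt-law coefficients (V band); published
    # calibrations differ and carry their own uncertainties.
    absolute_mag = -2.43 * (math.log10(period_days) - 1.0) - 4.05
    # Distance modulus: m - M = 5 log10(d / 10 pc)
    return 10 ** ((apparent_mag - absolute_mag + 5) / 5)
```

For example, a 10-day Cepheid observed at apparent magnitude 10.95 has, under these coefficients, absolute magnitude -4.05, so m - M = 15 and the inferred distance is 10^4 pc (10 kpc). Chains of such distance estimates are what anchor the Cepheid-based measurement of the Hubble constant discussed above.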

techniques. Such comparisons between theory and observation use summary statistics to reduce the massive observational data sets down to several physically interpretable quantities, which are also predicted by theoretical models. For example, one key summary statistic used for such parameter estimation is the galaxy two-point correlation function ξ(r), which describes the excess probability, relative to a Poisson distribution, of finding a pair of galaxies separated by a distance r. The value of r at which this function reaches a particular local maximum plays an important role as a "cosmic ruler," with a predicted length that depends on the cosmological model and its parameters ("baryon acoustic oscillations" are responsible for this peak in ξ(r)).

It is expected that the massive amounts of data that will soon become available from the upcoming generation of astronomical surveys (e.g., the Large Synoptic Survey Telescope [LSST] and the Dark Energy Spectroscopic Instrument [DESI]) will provide sufficient information to discriminate between ΛCDM and various alternative models, and make it possible to gain deeper insights into the fundamental nature of our evolving universe.

To exploit the observational data optimally and unlock the maximum information content available, efforts are underway to identify additional, complementary summary statistics that are particularly sensitive to the nuanced differences between various models (e.g., the effects of dark energy or massive neutrinos on LSS).

This article explores how ideas from topological data analysis (TDA) can be used to study properties of the cosmic web. TDA provides tools for analyzing the shape of data. These tools have potential for improving the understanding of the cosmic matter field and for finding physically interesting summaries of the cosmic web, which can help constrain some of the unknown quantities of the universe.

The Basics of Persistent Homology

Persistent homology is a tool of TDA that can be useful in settings with data characterized by holes, such as the cosmic web, because it produces summaries of different dimensional holes in a data set that are not easily available using other methods. Persistent homology is a framework for computing the homology of actual data. In terms of homology, we speak of these

different dimensional holes by the homology groups they generate. For example, 0-dimensional homology groups (H0) are generated by what in statistics would be considered to be clusters, one-dimensional homology groups (H1) are generated by loops, and two-dimensional homology groups (H2) are generated by things like the interior of 3D balls (such as a soccer ball).

There are higher-dimensional homology groups as well, but we need only consider H0, H1, and H2 for cosmology, since the universe only has three spatial dimensions. The H0 groups can be generated by clusters of dark matter haloes (as seen in cosmological simulations) or clusters of galaxies (as observed in galaxy surveys such as the SDSS). The H1 groups can be generated by the filaments (the weblike strings in Figure 1) that form loops, and the H2 groups can be generated by voids (the low matter density regions of the universe).

With some notion of a connection between homology and cosmology, it is possible to get a better idea of what persistent homology provides. Consider the simulated point-cloud data set in Figure 2(a). There are points randomly sampled on three loops with a little noise and additional points scattered away from the loops. Calculating the homology of the data set would reveal that there are 200 H0 generators (the 200 points sampled in the data set), and no H1 or H2 generators…but there appear to be at least three clear loops, suggesting there should be at least three H1 generators.

This is where the "persistent" in persistent homology comes into play. For persistent homology, we create some sort of intermediate structure that changes with a specified filtration parameter. There are different ways to do this. One approach involves growing balls around the points, where the filtration parameter is the radius of the ball; this is known as a Rips filtration. As the radius increases, eventually balls start to intersect each other. When two balls intersect, they join to form a single connected component (H0 generator). If a loop is present, such as those in Figure 2(a), eventually the balls connect in such a way that the loop forms. The radius of the ball (the filtration parameter value) when this loop forms is called the "birth" of the H1 generator.

As the balls continue to grow, eventually the loop gets filled in by the larger balls; the radius at which this occurs specifies the "death" of the H1 generator. The "persistence" of a feature is the difference between the birth and death times. Persistent homology keeps track of the birth and death times of the different homology group generators and then plots these values on a summary diagram known as a "persistence diagram." The Rips filtration persistence diagram for this data set is displayed in Figure 2(c). The red triangles indicate the birth and death times of the H1 generators, and the black circles indicate the birth and death times of the H0 generators. The persistence diagrams show that the further a generator is from the diagonal (the 45-degree line), the longer that feature lasts in the filtration.

Sometimes it makes sense to consider the features that last the longest to be topological signal (e.g., if larger, fixed structures are expected), while the features near the diagonal may be considered topological noise.

From the Rips persistence diagram of Figure 2(c), there are six H1 generators that seem well-separated from the diagonal. However, the corresponding data set shows that there seem to be only three clear loops. Additional loops can form in the Rips filtration due to the extra noise points and the layout of the loops.

Fortunately, other filtration methods tend to be more-robust to noise. One alternative filtration method uses the so-called "distance to a measure" (DTM) function, which can be thought of as estimating on a grid the average distance to some subset of the nearest neighbors within the original data set (requiring the selection of the number of nearest neighbors to use). Another alternative is to use a kernel density estimate to smooth over the data set, which requires selection of the bandwidth. DTM turns the point-cloud data into a function on a grid, so growing the balls for the filtration will no longer work. Instead of the radius of a ball, the filtration parameter is governed by a threshold parameter that defines lower-level sets, as explained below. As the threshold parameter increases, the lower-level sets are used to assess when homology group generators form and die.

Figure 2(b) displays an estimated DTM function. To get a sense of how, for example, an H1 generator can form using a DTM function, imagine setting a threshold value of around 0.4 (see the color bar on the right side of Figure 2(b)) and consider the lower-level sets (the grid locations that correspond to the DTM function values that are less than 0.4). The pixels that are "activated" at that threshold would form the three loops, as shown in Figure 3.

As the threshold increases, the lower-level sets will start to fill in the loops. The DTM persistence diagram is presented in Figure 2(d). The persistence
Figure 2. (a) A simulated data set. (b) An estimated DTM function for the data set displayed in (a). The corresponding persistence diagrams using a Rips filtration (c) and a DTM filtration (d). The red triangles represent the H1 generators and the black circles represent the H0 generators.
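To make the birth/death bookkeeping concrete, here is a minimal sketch of the H0 part of a Rips filtration, which reduces to single-linkage clustering: every point is born at scale zero, and a connected component dies at the pairwise distance where it first merges into another. This is illustrative only; real analyses use a TDA library (e.g., GUDHI or Ripser), and conventions differ on whether the filtration scale is the ball radius or the pairwise distance.

```python
import numpy as np

def h0_persistence(points):
    """Death times of the H0 generators of a Euclidean point cloud under
    a Rips filtration: single-linkage merge heights, computed by running
    Kruskal's algorithm with a union-find structure over the pairwise
    distances. Every point is born at scale 0; the last surviving
    component never dies, so n points yield n - 1 death times."""
    n = len(points)
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    parent = list(range(n))

    def find(i):
        # Find the root of i's component, with path compression.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Process candidate edges in order of increasing length.
    iu, ju = np.triu_indices(n, k=1)
    order = np.argsort(d[iu, ju])
    deaths = []
    for k in order:
        a, b = find(iu[k]), find(ju[k])
        if a != b:
            # Two components merge: record the scale as a death time.
            parent[a] = b
            deaths.append(float(d[iu[k], ju[k]]))
            if len(deaths) == n - 1:
                break
    return sorted(deaths)
```

For two tight, well-separated clusters, all but one death time is small; the single large death time records the scale at which the clusters merge, which is exactly the kind of point that sits far from the diagonal of a persistence diagram such as Figure 2(c).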

Figure 3. Lower-level set (in black) of the estimated DTM function from Figure 2(b) using a threshold of 0.4. Notice the three loops, indicating that their births occurred before 0.4 and their deaths will be after 0.4.
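The distance-to-measure (DTM) function and its lower-level sets, as in Figures 2(b) and 3, are straightforward to compute directly. Below is a minimal sketch in Python/NumPy, not the authors' code: the data (one noisy loop plus scattered points), the grid resolution, and the choice of k = 10 nearest neighbors are illustrative assumptions; only the threshold value 0.4 mirrors the discussion above.

```python
import numpy as np

def dtm(grid_points, sample, k=10):
    """Distance to measure: RMS distance from each grid point to its
    k nearest sample points (a noise-robust distance function)."""
    d = np.linalg.norm(grid_points[:, None, :] - sample[None, :, :], axis=2)
    knn = np.sort(d, axis=1)[:, :k]           # k smallest distances per grid point
    return np.sqrt(np.mean(knn ** 2, axis=1))

# toy point cloud: one noisy loop plus scattered background points
rng = np.random.default_rng(0)
theta = rng.uniform(0.0, 2.0 * np.pi, 200)
loop = np.c_[np.cos(theta), np.sin(theta)] + 0.05 * rng.normal(size=(200, 2))
scatter = rng.uniform(-2.0, 2.0, size=(20, 2))
sample = np.vstack([loop, scatter])

# evaluate the DTM on a regular grid, then take a lower-level set
xs = np.linspace(-2.0, 2.0, 80)
gx, gy = np.meshgrid(xs, xs)
grid = np.c_[gx.ravel(), gy.ravel()]
f = dtm(grid, sample, k=10).reshape(80, 80)
lower_level = f <= 0.4   # pixels "activated" at threshold 0.4 (cf. Figure 3)
print(lower_level.sum(), "of", lower_level.size, "pixels activated")
```

The DTM is a k-nearest-neighbor smoothing of the ordinary distance function, which is what makes the resulting lower-level sets robust to the scattered noise points.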

diagram only has three H1 generators (red triangles) that are well-separated from the diagonal, in contrast to the six in the Rips persistence diagram of Figure 2(c).

Persistence diagrams can be useful objects to consider for data visualization, but they have also been employed for inference or prediction in settings beyond astronomy and cosmology (e.g., brain artery trees, natural language processing, biology). Many of the inference and prediction methods developed use functional summaries of persistence diagrams.

The SCHU Method

As noted previously, the homology group generators, particularly H1 and H2, have connections to relevant cosmological structures. In previous work, a method called "Significant Cosmic Holes in Universe" (SCHU) was developed for finding statistically significant H1 generators (known as "filament loops") and H2 generators (cosmic voids) back in the original cosmological data volume, using techniques available in the R package called "TDA."

In SCHU, a three-dimensional grid is built around the data volume and a DTM filtration is constructed to obtain the persistence diagram. SCHU can then assign statistical significance to generators on persistence diagrams and also locate a representation of those features back in the data volume. More specifically, SCHU provides a p-value for each cosmic feature based on a bootstrap procedure, which is computationally intensive because each bootstrap sample requires its own DTM filtration computations.

Representations of cosmological structures as defined by the homology group generators are provided by SCHU as points on the DTM grid, forming filament loops or enclosing cosmic voids. Because the homology group generators are actually mathematical groups, unique representations are not available for the generators indicated on a persistence diagram. However, obtaining a representation of these features in the cosmological volume is important for understanding the potential scientific relevance of the objects. Other methods exist for identifying cosmological voids, so to compare the SCHU H2 voids to such methods, estimating their physical locations is essential. For a DTM persistence diagram, the representation of a generator using SCHU tends to be the inner contour of the smallest lower-level set that forms the homology group generator. The choice of the spatial resolution (i.e., the size of the grid cells) and the number of nearest neighbors used for computing the DTM function influences the appearance of the representations.

Using SCHU to Study Cosmological Simulations

The distribution of matter in the universe is visibly traced by galaxies, which are what are observed in galaxy surveys. In the ΛCDM model, galaxies are predicted to form within dark matter haloes. While there is not exactly a one-to-one mapping between the unseen dark matter haloes and the observed galaxies (the interested reader can look up "galaxy assembly bias," for example), the cosmic web can be studied using dark matter haloes (from large-scale cosmological simulations). Similar results can be expected when using galaxies (retrieved from real astronomical surveys).

This is helpful because it means being able to forgo running expensive, complicated cosmo-hydrodynamical simulations, which attempt to incorporate the many details of galaxy formation. Instead, the results of relatively cheap dark matter-only cosmological N-body simulations of enormous physical volumes can be used as the input to SCHU. The large volumes studied in such simulations make it possible to use the "linear theory" of structure formation, which is only valid on large scales (i.e., above Mpc scales), while also providing information about the evolution on smaller scales that would be otherwise unobtainable through analytical techniques. These simulations also allow easily specifying, and then varying, the cosmological model under which the universe evolves. Thus, in principle, we can run simulations with widely different models of dark energy or that span a large range of the ΛCDM parameter space, profiling how the topology (encoded in the persistence diagrams) depends on the cosmology.

Currently, various simulation databases cover interesting regions of cosmological model space. For example, the Dark Energy Universe Simulation Series (DEUSS) consists of several simulations with various models of dark energy, including ΛCDM and the Ratra-Peebles and SUGRA quintessence models. For each of these models, DEUSS contains several simulations that range over different spatial scales to capture the large spatial dynamic range over which dark energy acts. One of the standard ΛCDM simulations in DEUSS tracks the evolution of dark matter within a cube of side length 225 megaparsecs (i.e., about

Figure 4. (a) The persistence diagram and (b) Betti functions from the z = 0 snapshot of the 225 Mpc ΛCDM DEUSS simulation displayed in Figure 1.
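SCHU's bootstrap-based significance assessment can be sketched in simplified form. This is a stand-in for the actual implementation (which builds on the R package TDA), following the general confidence-band idea of Fasy et al. 2014 from the Further Reading: resample the catalog, recompute the DTM field each time, and flag diagram features whose persistence exceeds the resulting band width. The toy 2D catalog, grid size, number of neighbors, number of bootstrap samples, and the example diagram are all illustrative assumptions.

```python
import numpy as np

def dtm(grid, sample, k=8):
    """RMS distance from each grid point to its k nearest sample points."""
    d = np.linalg.norm(grid[:, None, :] - sample[None, :, :], axis=2)
    return np.sqrt(np.mean(np.sort(d, axis=1)[:, :k] ** 2, axis=1))

rng = np.random.default_rng(1)
halos = rng.uniform(0.0, 1.0, size=(300, 2))   # toy stand-in for a halo catalog
xs = np.linspace(0.0, 1.0, 30)
gx, gy = np.meshgrid(xs, xs)
grid = np.c_[gx.ravel(), gy.ravel()]
f_hat = dtm(grid, halos)

# bootstrap the catalog; record the sup-norm deviation of each DTM field
B = 100
sups = []
for _ in range(B):
    boot = halos[rng.integers(0, len(halos), len(halos))]
    sups.append(float(np.max(np.abs(dtm(grid, boot) - f_hat))))
c_alpha = float(np.quantile(sups, 0.95))       # 95% bootstrap band half-width

# call a diagram feature significant if its persistence (death - birth)
# exceeds the band width 2 * c_alpha
diagram = np.array([[0.05, 0.30], [0.10, 0.12], [0.02, 0.04]])  # toy (birth, death)
significant = (diagram[:, 1] - diagram[:, 0]) > 2.0 * c_alpha
print("c_alpha =", round(c_alpha, 3), "significant features:", significant.tolist())
```

The loop makes the cost structure described above concrete: every bootstrap replicate requires its own full DTM evaluation over the grid.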

1 millionth of the volume of the observable universe) for exploring the cosmological features using SCHU. Specifically, the input data to SCHU consist of a positional catalog of dark matter haloes identified at a particular time "snapshot" during the evolution (the haloes within a thin slice of the simulation volume are displayed in Figure 1). Motivated by the fact that the spatial extent of the largest haloes (analogous to galaxy clusters) is on the order of megaparsecs, and being primarily concerned with the much-larger filament loops and voids, a DTM with a spatial resolution for the grid of roughly one megaparsec can be employed.

Start by looking at the fully evolved simulation, which should be roughly consistent with a present-day universe (i.e., is consistent with the local environment at "redshift" z = 0) in a ΛCDM cosmology. Figure 4(a) shows the plot of the persistence diagram generated by SCHU for the H0, H1, and H2 of the cosmological simulation illustrated in Figure 1. The corresponding "Betti functions," which can be thought of as a functional summary of a persistence diagram, also can be displayed. For a particular dimension of homology group generators i, the ith Betti function βi(t) captures, as a function of the filtration threshold length scale t, the number of Hi group generators that have already been born by t and have yet to die off. In other words, βi(t) represents the number of points directly above a square of side length t extending from the origin of the persistence diagram.

Figure 4(b) shows the plot of the Betti functions computed from the SCHU persistence diagrams. Functional summaries like these Betti functions can be easier to work with than persistence diagrams, especially when attempting to compare the persistence diagrams of different simulations, as done below. The location of the Betti function peaks reveals the t at which the most homology group generators are active (i.e., born before and die after that t). Because we use the DTM filtration, the t can be understood as an average distance. In Figure 4(b), this suggests that the distances at which the H1 generators form and die off are generally smaller than the distances at which the H2 generators form and die off.

As described above, SCHU enables us to also visualize representations of these group generators in the cosmological volume. The most physically interesting generators in this setting are for H1 and H2 with the highest persistence (i.e., those that are furthest from the diagonal on the persistence diagram). Figure 5 plots the locations of the dark matter haloes and overlays the 10 cosmic voids and filament loops with the highest persistence in the simulation volume. The H2 cosmic voids highlight the low-density regions of the data volume and range in both size and shape.

There are other methods for locating cosmic voids presented in the literature, but there is not yet a consensus on a definition for cosmic voids. The H1 filament loops

Figure 5. The most-persistent (left) voids and (right) filament loops in the z = 0 snapshot of the 225 Mpc ΛCDM DEUSS simulation identified using SCHU.

are a new type of cosmological structure, and we are not aware of methods, other than the combination of persistent homology and SCHU, that would be able to find such information in LSS. The filament loops displayed in Figure 5, like the cosmic voids, vary in size and shape—some stretch and twist through a large portion of the LSS, while others are smaller and isolated to subsets of the cube. Visualizing the H1 and H2 structures can be useful for later analyses that may investigate, for example, how these structures form or properties of the local environment of such structures.

Another topic of interest in astronomy is how structure evolves across cosmic time. These TDA tools turn out to be useful summaries for comparing different snapshots of the LSS at different points in time. In particular, Betti functions allow for such comparisons. Figure 6 displays the βi(t) functions for the ΛCDM DEUSS simulation snapshots at z = 0, z = 1, and z = 4 (corresponding to present day, and roughly 8 or 12 billion years ago, respectively). That is, the z = 4 snapshot in Figure 6 is from the early (simulated) universe, whereas the z = 0 snapshot is analogous to the present day (and the z = 1 snapshot is somewhere in between).

Several things can be noted when comparing these different βi(t) functions from Figure 6. First, the Betti function values and peak heights are lower for the early universe snapshots (z = 4) compared to the later ones (z = 0 or z = 1) because the early universe contains fewer dark matter haloes. The locations of the βi(t) function peaks also differ between the z = 4 and z = 0 or z = 1 snapshots, but the z = 0 and z = 1 peaks have similar locations.

Consider Figure 6(c) to illustrate what may be happening here. The β2(t) functions at z = 0 and z = 1 are relatively similar, so we only need to focus on the differences between, say, the z = 0 and z = 4 β2(t) functions. Overall, fewer H2 generators have formed by z = 4, but also, most of the H2 generators formed and died off at higher filtration thresholds. The DTM filtration threshold, t, corresponds to an average comoving distance from a point to a fixed number of nearest-neighbor dark matter haloes.

At first glance, it may seem that this implies the early universe had larger voids, but this is because of the comoving distances: The universe is constantly expanding, so we often employ comoving coordinates, where distances are measured with respect to the expansion scale at present day. There were fewer dark matter haloes overall at z = 4, so the average comoving distance out to a fixed number of haloes is larger than at later times, when the overall halo clustering and abundance has strengthened. If the expansion scale factor were included, the β2(t) function for z = 4 would be shifted to the left of the z = 0 and z = 1 functions.

Concluding Remarks

Topological data analysis can be a useful framework for summarizing information in complicated data

Figure 6. Plots of the (a) β0(t), (b) β1(t), and (c) β2(t) functions at redshifts z = 0, z = 1, and z = 4 (corresponding to present day, and roughly 8 or 12 billion years ago, respectively) of the 225 Mpc ΛCDM DEUSS simulation.
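A Betti function like those plotted in Figure 6 is easy to compute once a persistence diagram is in hand: βi(t) counts the (birth, death) pairs with birth ≤ t < death. A small illustrative sketch follows; the toy diagram values are assumptions, not the DEUSS results.

```python
import numpy as np

def betti_function(diagram, ts):
    """beta(t): number of (birth, death) pairs alive at threshold t,
    i.e. already born (birth <= t) and not yet dead (t < death)."""
    b, d = diagram[:, 0], diagram[:, 1]
    return np.array([np.count_nonzero((b <= t) & (t < d)) for t in ts])

# toy H1 diagram: three prominent loops plus two short-lived features
h1 = np.array([[0.20, 1.50], [0.30, 1.20], [0.25, 1.40],
               [0.60, 0.65], [0.70, 0.72]])
ts = np.linspace(0.0, 2.0, 201)
beta1 = betti_function(h1, ts)
print("beta_1 at t = 1.0:", betti_function(h1, np.array([1.0]))[0])
```

Evaluated on a grid of thresholds, `beta1` is exactly the kind of curve that can be overlaid across simulation snapshots for comparison.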

such as the large-scale structure of the universe. The persistent homology summaries can be used to pull potentially relevant information from the cosmic web, which can then be visualized with Significant Cosmic Holes in the Universe (SCHU) or analyzed quantitatively using functional summaries of persistence diagrams like the Betti functions.

When exploring simulation snapshots of the universe at different cosmic times, these functional summaries can help to highlight differences in structure formation. While TDA provides a set of tools natural for cosmology, other areas in science with data characterized by loops and holes may also find it useful.

Further Reading

Fasy, B.T., Lecci, F., Rinaldo, A., Wasserman, L., Balakrishnan, S., and Singh, A. 2014. Confidence sets for persistence diagrams. The Annals of Statistics 42(6): 2301–2339.

Freedman, W.L. 2017. Cosmology at a crossroads. Nature Astronomy 1 (article id 0121).

Rasera, Y., Alimi, J-M., Courtin, J., Roy, F., Corasaniti, P-S., Fuzfa, A., and Boucher, V. 2010. Introducing the Dark Energy Universe Simulation Series. AIP Conference Proceedings 1241: 1134–1139. www.deus-consortium.org.

Springel, V., Frenk, C.S., and White, S.D. 2006. The large-scale structure of the Universe. Nature 440(7088): 1137–1144.

van de Weygaert, R., Vegter, G., Edelsbrunner, H., Jones, B.J., Pranav, P., Park, C., Hellwing, W.A., Eldering, B., Kruithof, N., Bos, E.G.P., and Hidding, J. 2011. Alpha, Betti, and the megaparsec universe: on the topology of the cosmic web. In Transactions on Computational Science XIV, 60–101. Springer-Verlag.

Wasserman, L. 2018. Topological data analysis. Annual Review of Statistics and Its Application 5: 501–532.

Xu, X., Cisewski-Kehe, J., Green, S.B., and Nagai, D. 2019. Finding cosmic voids and filament loops using topological data analysis. Astronomy and Computing 27: 34–52.

About the Authors

Jessi Cisewski-Kehe is an assistant professor in the Department of Statistics and Data Science at Yale University.

Sheridan B. Green is a PhD student in the Department of Physics at Yale University. He earned his BS in physics and mathematics in 2017 from the University of North Carolina. His research involves using numerical simulations to constrain cosmology and probe the nature of dark matter.

Abby Mintz is an undergraduate student studying astrophysics and statistics at Yale University. She has worked on projects on exoplanets, computational cosmology, and quasar absorption line spectroscopy.

Xin Xu is a PhD student in the Department of Statistics and Data Science at Yale University. She earned her BS in statistics in 2015 from Nankai University. She has worked on topological data analysis and astrostatistics.

Mapping the Large-Scale Universe through Intergalactic Silhouettes

Collin A. Politsch and Rupert A.C. Croft

Image courtesy of Getty Images

The majority of the atomic matter in the universe takes the form of a highly dilute gas that permeates the overwhelming volume of intergalactic space. Dubbed the intergalactic medium, this ubiquitous gas is too sparse to be observed directly, but its presence is embedded in the light of luminous background sources. Most notably, quasars—supermassive black holes that shine so brightly they can be seen billions of lightyears from Earth—allow scientists to indirectly study the structure of the intergalactic medium on vast cosmic scales.

As light travels from a distant quasar along its path to Earth, the intergalactic medium leaves an absorption signature in the light, marking the atomic elements that are present in the intervening intergalactic gas at each point along the light's path. This signature collectively reveals the presence of diffuse primordial hydrogen and helium residue in intergalactic space left over from the Big Bang, as well as a variety of metals occasionally ejected from galaxies by particularly forceful supernova explosions. However, the bulk of the intergalactic medium is composed of electrically neutral hydrogen gas, which marks its presence by absorbing a very specific wavelength of ultraviolet light: the so-called Lyman-alpha transition at ~1,216 Angstroms.

For this reason, the Lyman-alpha forest—the series of absorptions

Figure 1. Simulated electromagnetic spectrum of a quasar ~10.9 billion lightyears from Earth. The orange dashed curve shows the intensity of light at each wavelength at the time that it was first emitted by the quasar and the solid black curve shows the spectrum as observed from Earth, after neutral hydrogen absorption. The absorptions appear at many different wavelengths because of the continuous stretching of light waves traversing intergalactic space caused by the expansion of the universe (Bautista, et al. 2015).

originating from the Lyman-alpha transition—is the richest source of data for scientists to study the large-scale structure of the intergalactic medium.

The original detection of the Lyman-alpha forest dates back to 1970, but scientific studies of the forest did not mature until the early 1990s, with the advent of high-resolution spectrometers—instruments that connect to a telescope and record the intensity of the observed light as a function of its constituent wavelengths. In particular, the commissioning of the High Resolution Echelle Spectrometer (HIRES) on the Keck telescope in Mauna Kea, Hawai'i, marked the beginning of the golden age of Lyman-alpha forest cosmology.

Figure 1 shows a simulated electromagnetic spectrum of a quasar ~10.9 billion lightyears from Earth. The orange dashed curve shows the intensity of the light at each wavelength when it was emitted from the quasar and the black (wiggly) curve shows the intensity of the light as viewed from Earth. The decrease in the intensity of the quasar's light in this region of the electromagnetic spectrum is due to intergalactic neutral hydrogen gas partially absorbing the light passing through it. This phenomenon is akin to observing a distant lighthouse through a patchy cloud of fog. When seen through a dense patch, the light is dim. When seen through a relatively thin patch, the light is bright. The intergalactic "fog" in this case is too tenuous to be observed directly, but studying its matter density distribution is fortuitously made possible by analyzing the fraction of light that is absorbed along its journey to Earth.

The fraction of absorbed light is nonlinearly related to the neutral hydrogen density, but the relation is monotonic, which allows the absorption fraction to serve as a suitable proxy for the neutral hydrogen density. In practice, the intensity of the unabsorbed light originally emitted from the quasar (orange curve) is not known, and accurately estimating this curve represents a crucial statistical step in any Lyman-alpha forest analysis. The literature about this topic is extensive, but popular approaches include principal components analyses over a set of candidate functional shapes, interpolation of observed regions deemed to have minimal absorption, and low-order smooths of the observed spectrum subsequently scaled to match the cosmic mean absorption fraction.

As previously stated, neutral hydrogen gas absorbs at a fixed wavelength of ~1,216 Angstroms. Figure 1 should therefore provoke a couple of questions, such as Why do the absorptions appear at many different wavelengths? and, Why are those wavelengths significantly longer than 1,216 Angstroms?

The answer to both questions is that the universe is expanding. When the universe began with the Big Bang ~13.7 billion years ago, it was not only all matter contained in the infinitesimal speck

Figure 2. Simulated cubic volume of the universe's large-scale atomic structure (side length ~326 million lightyears). The distribution of galaxies in the universe (left) can be thought of as a sparse point process, while the intergalactic medium (right) has a continuous matter distribution permeating the entirety of intergalactic space. (Image: Cen and Ostriker, 2006)
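The monotonic, nonlinear relation between the absorbed fraction of light and the neutral hydrogen density can be made concrete with a toy calculation. One commonly used form is the fluctuating Gunn-Peterson approximation, in which the optical depth scales as a power of the gas density and the transmitted fraction is F = exp(-τ); the constants below are illustrative assumptions, not values from the article.

```python
import numpy as np

# Fluctuating Gunn-Peterson-style toy model (illustrative constants):
# optical depth tau grows as a power of the gas density, and the
# transmitted flux fraction is F = exp(-tau).
TAU0, BETA = 0.3, 1.6

def transmitted_fraction(density):
    return np.exp(-TAU0 * density ** BETA)

def density_from_fraction(F):
    # the relation is monotonic, so it can be inverted exactly
    return (-np.log(F) / TAU0) ** (1.0 / BETA)

density = np.linspace(0.1, 5.0, 50)       # gas density along a sightline
F = transmitted_fraction(density)
recovered = density_from_fraction(F)
print("round trip exact:", bool(np.allclose(recovered, density)))
```

Monotonicity is what matters: because denser gas always absorbs more, the observed absorption fraction can stand in for the unobservable density.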

that began an outward trajectory, but the fabric of space as well. Ever since that moment, space has continued to expand.

This expansion leads to a fascinating phenomenon known as redshift. Namely, when light travels through intergalactic space, the expansion of the space it is traveling through effectively causes the light itself to be continuously stretched to longer wavelengths. For a (fixed) absorption line such as Lyman-alpha, the consequence is that a forest of absorptions becomes inscribed in the light's spectrum similarly to how a seismograph operates. Although the pen of the seismograph is stationary, the paper continuously moves underneath it, recording the amplitude of the pen's oscillations as a one-dimensional curve. Here, redshift is the mechanism that moves the paper, and the resulting forest effectively provides a one-dimensional map of the neutral hydrogen density along the entire path from Earth to the quasar, with longer wavelengths corresponding to the neutral hydrogen density at more distant points along the path.

Another important observational phenomenon arises due to the vast distances quasars are observed at and the finite speed at which their light travels to us. Namely, the light we observe from quasars is actually extremely old; its age is directly related to the distance it traversed on its path to Earth. For example, an observed spectrum of a quasar 10.9 billion lightyears from Earth—such as that displayed in Figure 1—is in fact a picture of what that quasar looked like 10.9 billion years ago—6.4 billion years before the Earth even formed (with the various absorption lines being imprinted at different times during that period)!

Mapping the matter distribution of the intergalactic medium via Lyman-alpha absorptions in spectra of distant quasars is therefore akin to charting how the adolescent universe evolved into its present-day form. The past decade of Lyman-alpha forest observational cosmology has progressed to a point where it is now, in principle, possible to use the aggregate of currently available data to statistically reconstruct a continuous large-scale map of the intergalactic medium in all three spatial dimensions. Mapping the intergalactic medium in three dimensions can be viewed as an extension of charting the locations of the galaxies that the intergalactic medium envelopes, in the sense that the distribution of galaxies (a point process) and the distribution of intergalactic gas (a random field) together provide a complete picture of how atomic matter is distributed throughout the universe.

Scientifically, a large-scale intergalactic medium map would allow for testing cosmological models in an entirely new regime of both scale and age of the universe, potentially leading to a more-refined understanding of the parameters that govern the Big Bang cosmological model. Moreover, the scale and holistic nature of such a map could be used to locate never-before-seen regions

Figure 3. The locations of the ~300,000 quasars observed by the Baryon Oscillation Spectroscopic Survey (BOSS). The BOSS final data release is the most-prolific quasar catalog to date and constitutes a ~25% coverage of the sky (Alam, et al. 2015).
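The redshift bookkeeping described above amounts to one line of arithmetic: an absorption imprinted by gas at redshift z appears at an observed wavelength λ_obs = λ_rest × (1 + z). A small sketch, using the conventional rest wavelength 1215.67 Angstroms (which the article rounds to ~1,216):

```python
LYA_REST = 1215.67  # Lyman-alpha rest wavelength in Angstroms

def absorber_redshift(observed_wavelength):
    """Redshift of the gas that imprinted an absorption feature seen at
    this wavelength: lambda_obs = lambda_rest * (1 + z)."""
    return observed_wavelength / LYA_REST - 1.0

# longer observed wavelengths correspond to more-redshifted, more-distant gas
for lam in (3650.0, 4000.0, 4500.0):
    print(f"{lam:.0f} Angstroms -> absorber redshift z = {absorber_redshift(lam):.2f}")
```

Sweeping along the observed spectrum therefore sweeps along the sightline, which is precisely what makes the forest a one-dimensional map.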

of the universe that would be interesting to subsequently revisit and gather more detailed observations.

A simulated illustration of the universe's large-scale structure is depicted in Figure 2, via both its galactic point process (left) and its intergalactic random field (right). Both processes arise as a result of gravity and dark energy acting on the primordial matter fluctuations left over from the nearly perfect, scale-invariant Gaussianity of the Big Bang, as revealed by the Cosmic Microwave Background radiation (CMB). The mutual gravitational forces of the matter have subsequently pulled the universe's large-scale structure into a highly non-uniform weblike distribution—appropriately dubbed the cosmic web. Dark matter—matter that (thus far) only reveals its presence by way of its gravitational interaction with regular matter—is also a central acting force in the Big Bang cosmological model, and hydro-dynamical simulations suggest it also possesses a weblike distribution closely tracing that of the intergalactic medium and the galaxies.

The Lyman-alpha forests of quasars—and to a lesser extent, luminous star-forming galaxies—currently serve as our only window into glimpsing intergalactic neutral hydrogen. Therefore, our ability to produce high-fidelity maps of the intergalactic medium is intrinsically tied to how densely these objects populate the sky at each radial distance and how effective our telescopes are at detecting them.

From 2008 to 2014, the Baryon Oscillation Spectroscopic Survey (BOSS)—part of the Sloan Digital Sky Survey at Apache Point Observatory in Sunspot, NM—collected spectra for ~300,000 quasars. This is currently the most-prolific quasar catalog to date, and has been used to show that the structure of the intergalactic medium is consistent with Einstein's general theory of relativity and the standard model of Big Bang cosmology (ΛCDM). Figure 3 shows the locations of the BOSS quasars on the celestial sphere and Figure 4 shows them as a function of their distance from Earth (in red).

The BOSS catalog constitutes a coverage of approximately 25% of the sky and, in principle, allows for a three-dimensional statistical reconstruction of the intergalactic medium over the radial range of 10.4 to 11.7 billion lightyears—in principle because the statistical modeling step of producing a high-fidelity three-dimensional map from a Lyman-alpha data set of this magnitude is quite nontrivial, and therefore has not yet been accomplished. This task poses a unique statistical challenge for many reasons, but foremost among them are: the complex spatial dependence of the underlying matter distribution, the associated computational cost of

Figure 4. A slice of the radial distribution of observed BOSS sources, in terms of the lookback time to the object (analogously, the distance the light traveled to reach Earth). Earth is at the left vertex and the red pixels signify quasars observed by BOSS, while yellow pixels signify the locations of galaxies. The solid-black region represents the so-called Dark Ages between recombination and reionization when stars and galaxies had not yet formed. (Image: Anand Raichoor and the SDSS-IV/eBOSS collaboration. www.sdss.org.)

Figure 5. High-resolution 3D reconstruction of the intergalactic medium absorption field obtained via a Wiener filter on a set of 240 spectra of star-forming galaxies densely populated on a patch of sky with an area of 0.157 deg² and a radial distance range of ~10.6 billion lightyears to ~11.3 billion lightyears. The locations of the galaxies are indicated by the white pixels. (Image: Khee-Gan Lee and the CLAMATO collaboration; 3D video: www.youtube.com/watch?v=QGtXi7P4u4g)
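The Wiener filter underlying reconstructions like the one in Figure 5 has a compact linear-algebra form: with signal covariance C_s and noise covariance C_n, the reconstruction from data y is ŝ = C_s (C_s + C_n)⁻¹ y. Below is a toy one-dimensional sketch, not the CLAMATO pipeline; the squared-exponential signal covariance, its correlation length, and the noise level are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 120
x = np.linspace(0.0, 1.0, n)

# assumed prior covariance of the signal (squared-exponential, illustrative)
C_s = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / 0.05 ** 2)
sigma_n = 0.3
C_n = sigma_n ** 2 * np.eye(n)

# simulate a smooth "absorption field" and noisy observations of it
L = np.linalg.cholesky(C_s + 1e-8 * np.eye(n))
signal = L @ rng.normal(size=n)
y = signal + sigma_n * rng.normal(size=n)

# Wiener filter: s_hat = C_s @ (C_s + C_n)^{-1} @ y
s_hat = C_s @ np.linalg.solve(C_s + C_n, y)

raw_mse = float(np.mean((y - signal) ** 2))
filtered_mse = float(np.mean((s_hat - signal) ** 2))
print("raw MSE:", round(raw_mse, 3), " filtered MSE:", round(filtered_mse, 3))
```

The computational obstacle mentioned in the text is visible here: the solve costs O(n³), which becomes severe when n is the number of voxels in a survey-scale 3D map rather than 120 grid points.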

suitable spatial methods on such a large scale, and the highly unusual sampling design characterized by closely sampled observations along a collection of sparsely sampled line segments originating from a common vertex in three-dimensional space.

A number of methods have been proposed for producing point-estimate three-dimensional reconstructions, with Wiener filtering currently the most popular among them. However, the computational cost of proposed methods has thus far limited the reconstructed intergalactic medium maps to small fractions of the total volume for which Lyman-alpha data are available.

Figure 5 displays a Wiener-filtered three-dimensional reconstruction of the intergalactic medium within a cylindrical skewer covering a 0.157 deg² patch of the sky and spanning a radial distance range of ~10.6 billion lightyears to ~11.3 billion lightyears (K.G. Lee, et al. 2018). This particular map was not made with BOSS data, but rather a densely sampled collection of star-forming galaxy spectra collected by the recently commissioned Cosmos Lyman-alpha Mapping and Tomography Observations (CLAMATO) survey.

Certainly, the greatest challenge in this sort of mapping—as is the case in the majority of statistical analyses—is not in producing the point estimate itself, but rather, accompanying the point estimate with reliable inference. For any given structure appearing in the estimated map—e.g., a galaxy supercluster or a cosmic void—what is the probability that the structure indeed exists? Is there enough signal in the map to detect significant cross-correlations with other potential tracers of structure, such as the anisotropies of the CMB? Does the established inference adequately account for the uncertainty introduced by the sequential observational pipeline of analyses the data have already undergone? All of these are exceedingly difficult unanswered questions that rely on the development of rigorous statistical methods.

Beyond the groundbreaking work of the BOSS collaboration, the rapid influx of Lyman-alpha data is assured to continue into the foreseeable future. The next generation of the collaboration, the extended BOSS survey (eBOSS), began gathering data immediately after completing the previous phase, and is still carrying out its objective of adding ~500,000 new quasar spectra to its predecessor's catalog. The Dark Energy Spectroscopic Instrument (DESI) Survey, conducted on the Mayall 4-meter telescope at Kitt Peak National Observatory in Arizona, is set to begin a five-year data collection later this year, with a targeted sky area of 14,000 deg²—approximately 34% sky coverage.

The CLAMATO survey, using the Low Resolution Imaging Spectrograph (LRIS) on the Keck-I telescope at Mauna Kea, Hawai'i, is actively collecting spectra of bright star-forming galaxies over comparatively smaller regions than BOSS, eBOSS, and DESI, but with a much-more densely populated target selection in the sky, allowing for higher resolution mappings. Moreover, although not a Lyman-alpha survey, the Square Kilometer Array (SKA) has begun construction on the world's largest radio telescope, with which it will pioneer the collection of observational data from the so-called Dark Ages (shown in solid black in Figure 4), using an entirely different region of the electromagnetic spectrum—the 21 cm radio emission. Mapping this completely uncharted Universe will almost surely rely heavily on the same variety of statistical methods developed for Lyman-alpha forest analyses.

Further Reading

Alam, S., et al. 2015. The eleventh and twelfth data releases of the Sloan Digital Sky Survey: final data from SDSS-III. The Astrophysical Journal Supplement Series 219(1): (27 pp.).

Cen, R., and Ostriker, J.P. 2006. Where are the Baryons? II. Feedback Effects. The Astrophysical Journal 650(2): 560–572.

Cisewski, J., et al. 2014. Non-parametric 3D map of the intergalactic medium using the Lyman-alpha forest. Monthly Notices of the Royal Astronomical Society 440(3): 2599–2609.

Croft, R., et al. 2000. Towards a Precise Measurement of Matter Clustering: Lyα Forest Data at Redshifts 2–4. The Astrophysical Journal 581(1): 20–52.

Lee, K.G., et al. 2018. First Data Release of the COSMOS Lyman-Alpha Mapping and Tomography Observations: 3D Lyα Forest Tomography at 2.05 < z < 2.55. The Astrophysical Journal Supplement Series 237(2): (31 pp.).

McQuinn, M. 2016. The Evolution of the Intergalactic Medium. Annual Review of Astronomy & Astrophysics 1: 1–55.

About the Authors

Collin A. Politsch is a PhD candidate in the Department of Statistics & Data Science and the Machine Learning Department at Carnegie Mellon University. His research interests include spatio-temporal statistics, uncertainty quantification, and applications to astronomy and cosmology.

Rupert A.C. Croft is a professor of physics in the McWilliams Center for Cosmology, which is part of Carnegie Mellon University in Pittsburgh. He is a computational cosmologist and astrophysicist. He studies structure in the Universe on the largest scales, seeking to understand the cosmic mysteries of dark matter and dark energy.

Just How Far Away is that Galaxy, Anyway?

Estimating Galaxy Distances Using Low-Resolution Photometric Data

Peter E. Freeman

Figure 1. A portion of the night sky as observed by the Hubble Space Telescope. The majority of the objects in this image are galaxies, and for most, the image itself is the only information we have about them. How do we figure out how far away these galaxies are? (NASA Astronomy Picture of the Day. https://apod.nasa.gov/apod/ap140605.html)

How do we make sense of the observable universe? It is an incomprehensibly vast (and mostly empty) space containing perhaps hundreds of billions of galaxies. Luckily for us, it is also just simple enough that astrophysicists can create realistic ensembles of mock galaxies by using codes that run for months on each of thousands of nodes in a computer cluster.

Unlike real galaxies, we can trace these simulated galaxies through time, from their formation soon after the Big Bang to now. These ensembles allow us to make statistical inferences about the parameters that dictate the initial conditions and large-scale evolution of the universe, as well as about the smaller-scale physics of galaxy formation and evolution. However, to make such inferences, we need observed data.

Specifically, given images of galaxies (see Figure 1), we need to figure out how far away they are. Since the speed of light is finite, if we know the distances, we can place the galaxies in time: The farther away they are, the younger the universe was when they emitted the light that can be observed today. Once we have a set of simulated and observed galaxies from the same epoch of the universe…inference ensues.

VOL. 32.3, 2019 20

Figure 2. A theoretically generated, high-resolution galaxy spectrum (black) overlaid with two filter transmission curves (Sloan Digital Sky Survey green and red filters). For approximately 1% of the galaxies in the SDSS catalog, there are data akin to the black curve, whose spikes make redshift estimation relatively easy. For the other 99%, we have only photometric data. For instance, the photometric flux through the green filter is the summation over wavelength of the product of the spectrum and the green filter transmission curve. For SDSS, there are five filters (an ultraviolet filter to the left of green, green, red, and two infrared filters to the right of red): In photometry, all the detail in spectra is summarized in just five numbers. (http://classic.sdss.org/dr5/algorithms/spectemplates)
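The summation described in the caption fits in a few lines of Python. The toy spectrum and filter curve below are invented stand-ins, not the actual SDSS transmission curves:

```python
import numpy as np

# Toy spectrum and filter transmission sampled on a shared wavelength
# grid (nm). Both curves are invented stand-ins for those in Figure 2.
wavelength = np.linspace(400.0, 700.0, 301)
continuum = np.ones_like(wavelength)
h_alpha_spike = 0.5 * np.exp(-0.5 * ((wavelength - 656.0) / 2.0) ** 2)
spectrum = continuum + h_alpha_spike

# A toy "green" filter, most transparent near 500 nm.
green_transmission = np.exp(-0.5 * ((wavelength - 500.0) / 40.0) ** 2)

# The photometric flux through the filter: the sum over wavelength of
# (spectrum x transmission) -- one number per filter.
green_flux = float(np.sum(spectrum * green_transmission))
```

All of the spectral detail (continuum, dips, and spikes) collapses into this single number; with five filters, a galaxy is summarized by five such numbers.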

But, as is common in life, the devil is in the details. Just how do we figure out how far away galaxies are?

Before diving into these details, it is useful to talk about the expansion of the universe and an associated quantity: redshift.

Since the Big Bang, the universe has been expanding. At first, the rate of expansion slowed with time, but now the rate of expansion is increasing with time, due to the influence of dark energy, a theoretically still-unexplained phenomenon that appears as a form of anti-gravity. Photons emitted by astronomical sources have wavelengths that are tied to the scale of the universe. As the universe expands, the wavelengths get longer. In terms of the light that can be seen around us, it is as if it becomes less blue and more red; hence the historical term “redshift.” While we cannot lay down a tape measure to determine the distance to a galaxy, we can directly infer its redshift with the right observations and a little bit of luck. Given the redshift and some assumptions about cosmology, we can finally infer the galaxy’s distance.

What are the right observations? Figure 2 presents a theoretically generated galaxy spectrum that shows the intensity of light given off by the galaxy as a function of wavelength (400 to 700 nanometers—the range for which human eyes are most sensitive). While the actual details are, of course, more complicated, this spectrum can be considered as what we observe when we take the galaxy’s light and pass it through a prism.

The spectrum in Figure 2 exhibits three features: a smooth portion dubbed the continuum; dips relative to the continuum that represent absorption lines; and spikes above the continuum that represent emission lines. A “line” is an indicator of a chemical transition, with particular transitions mapping to particular wavelengths. For instance, the tallest spike in Figure 2 represents the so-called Hα transition at 656 nm, caused when electrons fall from the third energy level of hydrogen to the second energy level. (A dip at this wavelength would occur if, for instance, electrons moved up

from the second to third energy levels, because to have that happen the atoms would have to take in photons with wavelength 656 nm, leaving relatively fewer photons of that wavelength to travel to us.) The other observed dips and spikes are related to known transitions of electrons in hydrogen, nitrogen, oxygen, and sulfur atoms, all of which are present in the gas and stars of galaxies.

If we observe two or more lines in a spectrum, then redshift estimation is easy, because we can compare the ratios of wavelengths of observed lines to known ratios and thus identify the atomic transition responsible for each line individually. Once we know, for instance, that a particular line observed at 750 nm is Hα, we compute the ratio between that wavelength and the true wavelength of Hα (656 nm) and set that ratio equal to 1+z, where z symbolizes redshift; here, z = 0.143.

To return to the question of the right observations: The right ones are those that collect spectra with sufficiently high resolution to resolve two or more individual dips and spikes. Saying that we also need a “bit of luck” refers to the fact that not all galaxies have obvious dips and spikes, although most do.

You ask, “Well, what’s the issue here? Take spectra, find dips and spikes, infer redshift. Easy!”

The issue is that collecting spectral data is a time-intensive exercise. A typical SDSS spectrum shows the intensity of observed light in 3,500 separate bins, each of which is associated with a different wavelength. To have a sufficient number of photons land in each bin to clearly differentiate a galaxy from other sources in the sky around it might require observing that galaxy for hours. Even with clever spectrographs that observe hundreds, if not thousands, of galaxies at once, there simply is not enough time to record spectra for all of them. More than 200 million galaxies have been observed in images for SDSS in particular, but only for some 2 million do we have redshifts inferred from high-resolution spectra.

You may ask, “Is all hope lost?” The answer is an unequivocal no. Much statistical information is present in the 99% of galaxies without spectra, information astronomers and cosmologists would love to lay their hands on—and we can collect that information using photometry: the action of observing the same galaxy repeatedly with different filters in place, with each observation yielding a magnitude, a logarithmic measure of brightness that is believed to have first been used by the ancient Greek astronomer Hipparchus.

When the SDSS telescope observes the sky and takes images (instead of spectra, because sometimes it takes images, and sometimes it collects spectra), it does so with one of five filters placed in the light path. A given filter is made of a material that lets light of certain wavelengths pass, while absorbing light at other wavelengths. Overlaid on Figure 2 are the filter transmission curves for SDSS’s green and red filters. The heights of these curves show the relative transmission efficiencies of each filter as a function of wavelength. (SDSS’s three other filters include one in the ultraviolet regime, “to the left” of green in wavelength, and two in the infrared, “to the right” of red.)

The green filter is most-transparent around 500 nm, and totally opaque at 600 nm. The brightness of a galaxy observed by SDSS can thus be summarized with five numbers: one magnitude for each filter. One may think of photometry as very-low-resolution spectroscopy in which any dips and spikes that may be present are smeared out. Because of this smearing, one cannot simply look at the five numbers and infer a redshift by eye.

That’s where the computers come in. Over the last two decades, astronomers and statisticians have developed a myriad of algorithms to solve the so-called “photo-z problem”: predicting the redshifts of photometrically observed galaxies. We can split this myriad into two general classes: template-based algorithms, which compare idealized spectra to the observed data, one galaxy at a time, and so-called empirical algorithms, which learn statistical models relating magnitudes to redshifts given ensembles of training data.

A template is an artificially constructed, noiseless galaxy spectrum. Astronomers understand the physics of stars, and thus can construct accurate stellar spectra. Galaxy templates are, more or less, the cumulative spectra that result when adding stars together in different proportions. (“This template has lots of older and lighter stars and few younger and heavier stars; this other template has more younger and heavier stars; and this other template has relatively more gas, and …”)

To estimate redshift, one selects a template, redshifts it by some amount (think of taking the black curve of Figure 2 and shifting it to the right by some amount, while keeping the green and red curves in place), passes the redshifted spectrum through photometric filters to get five magnitudes, and then sees how well those five numbers match those observed for the galaxy in question.

The quality of match is usually measured via the normal, or Gaussian, likelihood function, with the errors in the likelihood function provided by the uncertainties in the observed magnitudes. Different templates and different
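The wavelength-ratio arithmetic described above fits in one line; a sketch (the function name is ours):

```python
H_ALPHA_REST_NM = 656.0  # rest-frame wavelength of the H-alpha line

def redshift_from_line(observed_nm, rest_nm=H_ALPHA_REST_NM):
    """Set observed/rest = 1 + z and solve for z."""
    return observed_nm / rest_nm - 1.0

# The article's example: an H-alpha line observed at 750 nm.
z = redshift_from_line(750.0)
```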

Figure 3. Estimated redshifts for 5,000 galaxies versus their observed redshifts. Multiple linear regression is used to relate five magnitude estimates for each galaxy to its observed redshift. (The adjusted R² is 0.866.) The red line is the locus of “perfect estimation.” (Data: Jeffrey Newman and Rongpu Zhou of the University of Pittsburgh.)
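The fit behind Figure 3 is ordinary least squares. A minimal sketch with synthetic stand-in data (the magnitudes, coefficients, and noise level below are invented; real SDSS magnitudes and redshifts are not this well behaved):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the Figure 3 data: five magnitudes per galaxy
# and a redshift that is (noisily) linear in them. All numbers invented.
n = 5000
mags = rng.normal(loc=20.0, scale=1.0, size=(n, 5))
true_coef = np.array([0.05, 0.04, 0.03, 0.02, 0.01])
z_obs = mags @ true_coef - 2.25 + rng.normal(scale=0.05, size=n)

# Multiple linear regression via least squares, with an intercept column.
X = np.column_stack([np.ones(n), mags])
beta, *_ = np.linalg.lstsq(X, z_obs, rcond=None)
z_pred = X @ beta
rmse = float(np.sqrt(np.mean((z_obs - z_pred) ** 2)))
```

Plotting z_pred against z_obs would give a Figure 3-style diagonal cloud, with the RMSE playing the role of the intrinsic scatter quoted in the caption.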

redshifts are used within a grid search until the template/redshift combination that maximizes the likelihood is found.

Template-based methods have the virtue of being straightforward to implement while not requiring any training data. However, they do suffer from a number of issues.

• Model misspecification: Do we have a directory of template spectra that spans all possible spectra?

• Computational intensity: If we have hundreds of template spectra in a directory, we cannot apply them all quickly when fitting any single galaxy. How do we choose the right subset of templates?

• Discreteness: Templates represent a discrete set of idealized galaxies. What if the best spectrum for a given galaxy is a linear combination of two, three, four, or more templates?

To avoid these issues, many astronomers turn to what they call empirical methods to estimate redshifts. In other words, instead of assuming complete knowledge of galaxy spectra, we let the data themselves inform the modeling process. “Empirical” is astronomer-speak for learning statistical models that relate a set of predictor variables (e.g., the magnitudes; alternatively, the colors, or differences in magnitudes, for each galaxy) to a response variable (the redshift).

Figure 3 displays the result of learning a multiple linear regression model that relates sets of five magnitudes to the known redshifts for 5,000 spectroscopically observed galaxies. The x- and y-axes in this figure represent the observed and predicted redshifts, respectively, and the diagonal red line is the line of “perfect estimation.”

Because the points largely follow the line, with some amount of intrinsic scatter (adjusted R² 0.866, RMSE 0.072), it can be inferred that a linear model does a good job of representing the underlying association between magnitudes and redshift. (There are issues, such as a substantial number of poorly estimated redshifts, that astronomers dub “catastrophic outliers,” and the fact that predictions are on average too high at low redshift and too low at high redshift, due to measurement error in the galaxy magnitudes. This latter issue is dubbed attenuation bias. On the whole, though, it is fair to say that the linear model represents the underlying association well.)

Once again, we reach what we think is a stopping point: “Even if you do not have spectra, you can
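The template/redshift grid search can be sketched as follows; minimizing the chi-square statistic is equivalent to maximizing the Gaussian likelihood. The two “templates” and the linear model_mags rule are invented toys, not real spectral templates:

```python
import numpy as np

# A toy "template" maps a redshift to five model magnitudes. In this
# invented rule, every filter brightens/dims linearly with redshift.
def model_mags(template, z):
    return template["base"] + template["slope"] * z

templates = [
    {"base": np.array([21.0, 20.5, 20.0, 19.8, 19.6]), "slope": 1.0},
    {"base": np.array([22.0, 21.0, 20.5, 20.0, 19.5]), "slope": 0.5},
]

def fit(observed, sigma, z_grid=np.linspace(0.0, 2.0, 201)):
    """Grid search: smallest chi^2 = largest Gaussian likelihood."""
    best = None
    for t in templates:
        for z in z_grid:
            chi2 = float(np.sum(((observed - model_mags(t, z)) / sigma) ** 2))
            if best is None or chi2 < best[0]:
                best = (chi2, t, z)
    return best

sigma = np.full(5, 0.1)                       # magnitude uncertainties
observed = model_mags(templates[0], 0.5)      # noiseless mock observation
chi2, best_t, best_z = fit(observed, sigma)
```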

apply multiple linear regression to photometric data and do a good job of predicting redshift. The end. Right?” No, not quite so fast.

Look back at Figure 1 and imagine that you are the astronomer. The image that you see is filled with galaxies that you would love to know the distances to, but you know that you have limited telescopic resources and thus cannot measure spectra for all of them. What will you do?

Chances are, you will focus your energies on the brighter galaxies, because they are the ones that have to be observed for the least amount of time to get a good spectral signal relative to the background noise. But galaxies that are brighter are generally galaxies that are either more massive or closer to us, or both, than the average galaxy in the image. Choosing the brighter galaxies for subsequent spectroscopic observations introduces selection bias into the redshift estimation process.

The brighter galaxies are not a random sample from the population of all the galaxies seen in Figure 1, and thus any model that is learned using these galaxies cannot necessarily be applied to the others. Stated another way, to apply your models to fainter galaxies, you would have to extrapolate them and, as we learn early in our statistics education, extrapolation is A Bad Thing to Do.

How best to deal with selection bias is an open research question. One possibility is to mitigate its effects by applying weights to the data used to train a regression model. Let’s say we have a training-set galaxy for which we have five magnitudes and a spectroscopically measured redshift. If there are many other galaxies without measured redshifts in the sample with similar magnitudes, we would give the training-set galaxy more weight when learning models. But while reweighting improves the quality of predictions for galaxies that have similar properties as the training-set galaxies, we still cannot (or, at least, should not) extrapolate our models.

Two other possibilities are data augmentation and active learning. The first involves increasing the amount of data used to train a model by resampling from the training data themselves, making use of the measurement error in their magnitudes, while the second involves increasing the amount of training data by collecting new spectra from judiciously selected galaxies—ones that give the most “bang” (increased accuracy and precision of estimates for fainter galaxies) for the “buck” (the amount of telescope time involved in collecting new spectra).

Both techniques expand the applicability of a regression model by expanding the domain of the predictor variables. They will become increasingly vital in the upcoming era of the Large Synoptic Survey Telescope (LSST), when the number of photometrically observed galaxies will increase by a factor of 100 with no commensurate increase in the number of galaxies with measured spectra.

Now that we have brought up LSST, we can step back to the idea that all we have to do is apply multiple linear regression to our magnitude and redshift data to create a good statistical model. The data shown in Figure 3 lie between redshifts 0 and 1, where redshift 0 is now and redshift 1 is equivalent to a light-travel time of nearly 8 billion years. (Redshift and light-travel time are directly but nonlinearly related: Redshift 2 gets us to nearly 10.5 billion years in the past, while redshift 3 is just over 11.5 billion years. In fact, the redshift diverges to infinity as we get closer and closer to the Big Bang some 13.7 billion years ago.)

LSST will sample data from well beyond a redshift of 1. When we analyze those data, we will run into two issues: Linear models will no longer work so well, and we will observe degeneracies—sets of magnitudes that can plausibly map to more than one redshift. (We will need computationally efficient algorithms that we can apply to the tens of billions of galaxies that LSST will observe, but that is an issue for another time and another place.)

Mitigating the first issue is relatively straightforward, since we can expand our suite of possible regression models to include nonlinear ones such as random forest and extreme gradient boosting. However, these models are still, in the end, regression models: Their aim is to estimate the average value of the redshift for a given set of magnitudes. When degeneracy is present, that average may not be meaningful (see Figure 4).

To understand how to tackle degeneracies requires introducing the idea of conditional density estimation (CDE). A conditional density is a probability density function of the form f(z|x), where z is the redshift and x represents all the predictor variables (e.g., the galaxy magnitudes or colors). The relationship between CDE and regression is that in the former, we estimate the function f(z|x) directly while making no assumptions about its shape, while in the latter, we estimate the expected value of z while assuming a function shape (e.g., a normal distribution with standard deviation σ, the typical assumption of linear regression).

Figure 4 shows an example of a conditional density estimate made using an extreme gradient boosting-based algorithm applied to simulated data from LSST’s Data Challenge 1 (DC1). The estimate indicates that given its
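The weighting idea can be sketched in one dimension. Everything here is a toy: one “magnitude” per galaxy, Gaussian samples standing in for the bright spectroscopic set and the fainter photometric set, and a simple neighbor-count weight of our own devising:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy samples: the spectroscopic training set is systematically brighter
# (smaller magnitude) than the full photometric sample.
train_mags = rng.normal(19.0, 0.5, size=200)    # bright, has redshifts
photo_mags = rng.normal(20.0, 1.0, size=5000)   # fainter, no redshifts

def neighbor_weights(train, target, radius=0.25):
    """Weight each training galaxy by how many target-sample galaxies
    have a similar magnitude, normalized so the mean weight is 1."""
    counts = np.array([np.sum(np.abs(target - t) < radius) for t in train])
    return counts / counts.mean()

w = neighbor_weights(train_mags, photo_mags)
```

Fainter training galaxies, which look more like the bulk of the photometric sample, receive larger weights; the weights would then be passed to a weighted regression fit.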

Figure 4. Example of a conditional density estimate (CDE) made by fitting an extreme gradient boosting-based estimator to data simulated for the Large Synoptic Survey Telescope Data Challenge 1 (LSST DC1). There are two viable redshifts associated with input predictor variables x (e.g., galaxy magnitudes). The regression estimate is the average of the CDE and is shown in red. It lies between the two modes and thus is not meaningful.
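A toy version of the situation in Figure 4, with modes placed near the two redshifts discussed in the text (the mixture weights and widths are invented):

```python
import numpy as np

z_grid = np.linspace(1.0, 1.9, 901)
dz = z_grid[1] - z_grid[0]

def gaussian(z, mu, sd):
    return np.exp(-0.5 * ((z - mu) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

# Invented bimodal conditional density f(z|x) with modes near 1.31 and 1.54.
density = 0.55 * gaussian(z_grid, 1.31, 0.03) + 0.45 * gaussian(z_grid, 1.54, 0.03)
density /= density.sum() * dz                    # normalize on the grid

mean_z = float((z_grid * density).sum() * dz)    # what a regression reports
mode_z = float(z_grid[np.argmax(density)])       # one genuinely viable redshift
```

The mean lands near 1.41, between the two modes, which is essentially the failure marked by the red line in Figure 4.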

observed data, there are two viable redshifts for the galaxy in question: It may be a fainter, closer-to-us galaxy at a redshift of approximately 1.31, the redshift of the left mode, or a brighter, farther-from-us one at approximately 1.54, the redshift of the right mode. Note the red line. Learning a regression model and applying it to this galaxy would predict a redshift of approximately 1.4—a value not associated with either mode.

The best empirical models, thus, are computationally efficient and produce conditional density estimates while dealing properly with selection bias (if we do not take care of that via data augmentation or active learning approaches).

But where do we go from here? There are many avenues to explore. Perhaps you will walk along these avenues in the future.

The first avenue uses more information than only magnitudes or colors. For instance, the galaxy images themselves contain a wealth of information. However, working with images is difficult. An example is one of the relatively large spiral galaxies shown in Figure 1. The first step is to extract a small, “postage-stamp image” that is centered on just that galaxy—an image that might have 64 pixels on a side. The galaxy would thus represent a single point in a 4,096-dimensional space.

Due to the so-called “curse of dimensionality,” it is not feasible to analyze galaxy data in spaces with thousands of dimensions. We need to greatly reduce the dimensionality. One way to do this is to compute summary statistics for the galaxy that encode how concentrated its light is, how clumpy it appears to be, etc. (Those versed in machine learning would call this “extracting image features.”) We can apply these summary statistics, which conventionally number around five to 10, as additional predictor variables in redshift estimation.

Another way to reduce dimensionality is to let an algorithm do it for us, by feeding images directly into a convolutional neural network (CNN) that extracts the image features on the fly. While the use of deep learning is intriguing and has the possibility to be a game changer in the era of LSST, it may be limited by a simple lack of training data. There are still only so many galaxies for which we know the redshift, and while that number will increase in the future, it will do so far more slowly than the overall number of observed galaxies.
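One concrete, deliberately simplified example of such a summary statistic is a concentration index: the fraction of a galaxy’s light inside a small radius relative to a larger one. The 64 × 64 “galaxies” below are synthetic:

```python
import numpy as np

# A 64x64 "postage stamp" coordinate grid and the distance of each pixel
# from the stamp's center.
n = 64
y, x = np.mgrid[0:n, 0:n]
r = np.hypot(x - (n - 1) / 2.0, y - (n - 1) / 2.0)

def concentration(image, r_map, r_inner=4.0, r_outer=16.0):
    """Light inside r_inner divided by light inside r_outer."""
    return float(image[r_map < r_inner].sum() / image[r_map < r_outer].sum())

# Two synthetic galaxies; the second has a steeper (more compact) profile.
diffuse = np.exp(-r / 4.0)
compact = np.exp(-r / 2.0)
c_diffuse = concentration(diffuse, r)
c_compact = concentration(compact, r)
```

Statistics like this one (and clumpiness, asymmetry, and so on) become a handful of extra predictor columns in the regression or CDE model.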

Traveling along the first avenue, there is more information than just magnitudes and galaxy appearance. For instance, one could try to mine information about a galaxy’s environment. If a galaxy sits in a part of space where the overall density of galaxies appears significantly higher than average, then that galaxy may be part of a galaxy cluster, and information about that cluster could be used to estimate the galaxy’s redshift. If we have spectra for one or more galaxies in the cluster, then that provides, in theory, approximate redshifts for all galaxies in the cluster.

The key word above is “may,” because even if a galaxy appears to lie in a high-density environment, it may be a chance superposition: The galaxy might lie far behind the cluster or far in front of it. At best, we can use the probability of cluster membership to inform CDE.

The second avenue to explore is not so much about how one estimates redshift, but whether redshift estimation can be combined with the estimation of other galaxy properties. Astronomers call this the “p(z,θ) problem,” but given the notation used above, what they really mean is that they want to estimate f(z,θ|x), where θ collectively denotes one or more of those other properties, such as mass or star-formation rate.

Currently, astronomers use physics-based codes to estimate these properties, so estimation is computationally expensive and prone to model misspecification (i.e., incomplete physics). In the upcoming era of LSST, it would be nice to bypass further implementation of such codes and use learned statistical models instead that are trained on galaxies for which we currently have property estimates.

Learning useful models will be tricky, though, because of estimator uncertainty. Spectroscopic redshifts may reasonably be viewed as “ground truth” because the “spiky” nature of the emission and absorption lines used to estimate them makes the redshift uncertainty generally negligible. However, the same cannot be said for other galaxy properties. For instance, short of putting a galaxy on a scale, any mass estimate will have significant uncertainty, both statistical and systematic. How does one properly estimate and then propagate uncertainties in derived galaxy properties to achieve optimal multivariate conditional density estimates? That is an open question awaiting a solution.

Selection bias, gathering more-informative predictor variables, the quantification and propagation of uncertainties, etc.: These are among the myriad challenges that astronomers and statisticians face when trying to answer the seemingly simple question of “Just how far away is that galaxy?”

Further Reading

Freeman, P.E., Izbicki, R., and Lee, A.B. 2017. A Unified Framework for Constructing, Tuning, and Assessing Photometric Redshift Density Estimates in a Selection Bias Setting. Monthly Notices of the Royal Astronomical Society 468:4556.

Large Synoptic Survey Telescope (LSST). www.lsst.org.

Pasquet, J., Bertin, E., Treyer, M., Arnouts, S., and Fouchez, D. 2019. Photometric Redshifts from SDSS Images Using a Convolutional Neural Network. Astronomy & Astrophysics 621:A26.

Pospisil, T., and Lee, A.B. 2018. RFCDE: Random Forests for Conditional Density Estimation. https://arxiv.org/abs/1804.05753.

Sloan Digital Sky Survey (SDSS). www.sdss.org.

About the Author

Peter E. Freeman is an assistant teaching professor in Carnegie Mellon University's Department of Statistics & Data Science. Since receiving his PhD in astronomy and astrophysics from the University of Chicago, he has concentrated his research in astrostatistics: the application of cutting-edge statistical methods to problems such as source detection, cosmic microwave background mapmaking, making sense of galaxy morphologies, and estimating redshifts from photometric data.

Image courtesy of Getty Images

Statistics for Stellar Systems: From Globular Clusters to Clusters of Galaxies

Gwendolyn Eadie

Astronomers often want to know the physical mass of astronomical systems that they study. Some reasons are related to understanding the system’s structure and composition (What is the fraction of binary stars in a globular cluster? Is there a central black hole at the center of a nuclear star cluster?), while others are related to answering fundamental questions of physics (How much dark matter is in a galaxy? How does dark matter affect the evolution of galaxies?).

Whatever scientific question astronomers are trying to answer, they seek a trustworthy estimate of the mass of the dynamical system. But how do we go about “weighing” something that floats in the vacuum of space and is millions or even billions of light years away? What kinds of data are useful for estimating the masses of dynamical systems, and what limitations, uncertainties, biases, and/or assumptions come into play?

Dynamical Systems in Astronomy

The universe is full of dynamical systems, such as globular clusters, nuclear star clusters at the centers of galaxies, spiral and elliptical galaxies, and galaxy groups and clusters. Some examples of dynamical systems from small to large scales are shown in Figure 1.

The range of scale of these systems is…astronomical! The elliptical galaxy at the top right of Figure 1 could be just one of the tiny-looking galaxies in the galaxy cluster, and the globular cluster at the top left is the smallest system shown—if it were living in the spiral galaxy (third from the left), it would barely be more than a point of light.

Globular clusters contain anywhere from tens of thousands to hundreds of thousands of stars, while spiral and elliptical galaxies can contain millions to trillions of stars or more. Galaxy clusters, such as MCS J0416.1–2403 shown

CHANCE 27 Figure 1. Examples of dynamical systems in astronomy. Top row, left to right: globular cluster M80 (NASA, Hubble Heritage Team, STScI, AURA), dwarf galaxy NGC 5264 (ESA/Hubble and NASA), UGC 12158 (NASA/ESA Hubble Space Telescope’s Advanced Camera for Surveys), and elliptical galaxy IC 2006 (ESA/Hubble & NASA images: Judy Schmidt and J. Blakeslee (Dominion Astrophysical Observatory)). Bottom: galaxy cluster, MCS J0416.1–2403 (ESA/Hubble, NASA, HST Frontier Fields Mathilde Jauzac (Durham University, UK, and Astrophysics & Cosmology Research Unit, South Africa) and Jean-Paul Kneib (École Polytechnique Fédérale de Lausanne, Switzer- land)). Note: These objects vary in size by many orders of magnitude.

Figure 2. The Swiss astronomer Fritz Zwicky (1898–1974) is credited with coining the term dark matter in 1933, when he used it to describe the unseen matter responsible for holding together the Coma Cluster of galaxies, shown here. The Coma Cluster contains more than 1,000 galaxies and more than 22,000 globular clusters (not visible in the image). (Image: NASA; ESA; J. Mack, STScI; J. Madrid, Australian Telescope National Facility.)

in Figure 1, can have tens to thousands of galaxies within them, with each of those galaxies harboring millions to trillions of stars. Galaxies also host gas, dust, and compact objects (such as black holes, neutron stars, and white dwarfs), and are understood to sit within an enormous cloud of dark matter: matter whose presence is inferred by gravitational interactions with ordinary matter, but which cannot yet be measured directly (see Figure 2).

The total matter in galaxies and galaxy clusters can be composed of as much as 90% dark matter. A galaxy cluster can be so massive that it creates a gravitational lens to background sources of light. In the image of the galaxy cluster at the bottom of Figure 1, it is possible to see the curved arcs of light around the cluster that are created by the gravitational lensing of light from background galaxies.

Determining exactly how much dark matter is in dynamical systems beyond the scale of globular clusters (which appear to lack dark matter entirely) is motivation for understanding the total mass of these systems. Despite differences in size and number of stars, the masses of all of these different dynamical systems can be estimated using the same underlying physical principles.

The self-gravity of a dynamical system, sometimes combined with rotational support, holds the system together in a tightly or loosely bounded state. A system’s compactness varies depending on the specific system, the environment in which it lives, and its past interactions with other systems.

The total gravitational potential of a dynamical system is directly related to its total mass, and also determines how objects like stars move within and around it. Thus, one approach to estimate mass is to assume a parameterized model for the gravitational potential, measure the positions and velocities of orbiting stars, and then use these data as kinematic tracers (or simply, tracers) to constrain the gravitational potential parameters. Once the total gravitational potential is estimated, the system’s mass can be estimated. Nonparametric estimators of mass have also been explored by teams of astrophysicists and statisticians (Wang, et al. 2008).

Data for Kinematic Tracers

The type and number of kinematic tracers depends on the type of dynamical system (Table 1); some systems have multiple tracer populations (e.g., galaxies host stars, a globular cluster population, a

Table 1—Dynamical Systems and Their Tracers

System                Example Tracers                   Approx. Number of Tracers
globular cluster      individual stars                  10,000s–100,000s
nuclear star cluster  individual stars                  10,000
Milky Way             outer (halo) stellar population   1,000s–10,000s
                      planetary nebulae (halo)          10s–?
                      globular clusters                 150
                      dwarf galaxies                    10–30
                      stellar streams                   10
galaxies              dwarf galaxies                    10s–100s
                      planetary nebulae                 100s–1,000s
                      globular clusters                 100s–1,000s
                      halo stars                        1,000s–10,000s
galaxy clusters       individual galaxies               10s–1,000s
                      globular clusters                 10,000s

A detailed branch of astronomy called astrometry specializes in the measurement, calibration, and uncertainty quantification of parallax, proper motions, and magnitudes (brightnesses) of stars.

dwarf galaxy population, etc.), while others have only a single population (e.g., globular clusters host stellar objects). As can be seen in the table, the size of the tracer population is sometimes on the order of only 10 tracers, but can also be as large as tens of thousands.

Every tracer in a dynamical system has three-dimensional position and velocity vectors with respect to the center of the system, making up what astrophysicists call six-dimensional phase-space. It takes all six phase-space components of a tracer to characterize its position and trajectory, but measuring all components is difficult and sometimes not even possible.

Take the velocity of a tracer. From the Earth-centered perspective, the velocity vector is measured in two parts: the velocity in our line of sight (v_los), and the projected velocity along the plane of the sky called the proper motion (µ). The former can readily be measured via the Doppler shift of known spectral lines; the amount of shift toward the red or blue end of the electromagnetic spectrum is a function of the tracer’s movement away or toward us. The proper motion, however, is harder to measure because the tracers are so incredibly far away.

In the Milky Way, useful tracers like stars and globular clusters can be more than 45,000 light years away. To measure their proper motion requires taking images of these tracers separated over a wide span of time to observe any movement. It is best if the proper motion is measured in reference to something stationary (or approximately stationary) compared to the tracer. For example, when measuring globular clusters, astrometrists often use background reference objects like distant galaxies and objects called quasars. Quasars produce incredible amounts of light, thought to be created by the accretion of gas onto supermassive black holes at the centers of extremely distant galaxies. For all intents and purposes, a background quasar is stationary with respect to a Milky Way globular cluster. These quasars are used as an anchor from which to measure the motions of stars within the cluster. The individual motions of stars can be used to study the globular cluster itself, or the mean motion of the stars can be used to estimate the proper motion of the whole cluster.

Proper motions of tracers in other galaxies are nearly impossible to obtain, and so for these systems, astronomers (and astrostatisticians) must come up with ways to deal with the incomplete velocity data.

The three-dimensional positions of tracers are a little easier to obtain, since their place in the sky is defined using an agreed-upon celestial grid of spatial coordinates called right ascension (RA) and declination. The distance to the tracer provides the third spatial component. In the Milky Way, the RA, declination, and distance make it possible to transform the tracer’s apparent

position to a coordinate system that is Galactocentric (Milky Way centered), by taking into account the sun's position in the Galaxy1 with respect to the Galactic center, the location of the North Galactic pole, the movement of the sun through the solar neighborhood, and the rotation of the disk.

The distances to individual stars in our Galaxy can be measured using their parallax. The parallax is the apparent change of an object's position on the sky over the course of half a year, as the Earth makes its way from one side of the sun to the other. Using a small-angle approximation, the parallax is inverted to provide a distance estimate. The parallaxes of tracers around other dynamical systems (e.g., other galaxies, galaxy clusters) are virtually undetectable. Other distance indicators are used instead, like standard variable stars whose overall intrinsic brightness has a known relationship to the period at which they change in brightness (e.g., RR Lyrae stars).

Figure 3. Galaxy NGC 7457. Based on observations made with the NASA/ESA Hubble Space Telescope and obtained from the Hubble Legacy Archive, a collaboration between the Space Telescope Science Institute (STScI/NASA), the Space Telescope European Coordinating Facility (ST-ECF/ESA), and the Canadian Astronomy Data Centre (CADC/NRC/CSA).

When it comes to studying dynamical systems outside the Milky Way, the available six-dimensional phase-space information is limited to three measured components: the two-dimensional projected distance and the line-of-sight velocity. The distances to the host dynamical systems are so great that obtaining high-precision distance measurements to their individual tracers is very difficult.

Statistical Inference Techniques for Dynamical Systems

Because of incomplete data, astronomers rely on point estimators to determine the mass of dynamical systems. A popular statistic for estimating the mass of a dynamical system with N tracers is the projected mass estimator first introduced by Bahcall & Tremaine (1981):

    M = [C_proj / (G N)] × Σ_i (v_los,i)² R_i,   (1)

where G is the gravitational constant, v_los,i is the line-of-sight velocity of the ith tracer, R_i is the projected distance of the tracer from the center of the dynamical system, and the sum runs over the N tracers.

This estimator is popular partly because it does not require proper motions. Instead, the analyst must decide on the value for the constant C_proj, which is determined by assuming something about the distribution of the tracers' orbits (i.e., are they mostly circular orbits, elliptical orbits, or some (an)isotropic mixture of orbits?).

The estimator in Equation 1 has some nice properties; it is unbiased and has finite variance. On the flip side, it makes the unrealistic assumption that all mass in the system is centered at a point and that the tracers are "test" particles orbiting around it. There is no way to include the shape of the dynamical system or the possibility of a non-spherical gravitational potential.

Having to choose a value for C_proj is also problematic, since it means deciding on a distribution for the type of tracer orbits (also known as the velocity anisotropy). Different velocity anisotropy assumptions can give wildly different results. An example is galaxy NGC 7457, for which the SLUGGS Survey has measured the line-of-sight velocities and projected distances of 40 globular clusters (Figure 3).
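Equation 1 is simple enough to compute directly. The sketch below applies it to hypothetical tracer data (the velocities, radii, and function name are invented for illustration) and confirms that the radial-orbit constant 32/π exactly doubles the isotropic-orbit (16/π) estimate, since the data enter only through the fixed sum Σ v²R:

```python
import math

G = 4.301e-6  # gravitational constant in kpc (km/s)^2 / Msun

def projected_mass(v_los, R, c_proj):
    """Projected mass estimator (Eq. 1): M = C_proj/(G N) * sum(v_i^2 * R_i)."""
    n = len(v_los)
    return c_proj / (G * n) * sum(v * v * r for v, r in zip(v_los, R))

# Hypothetical tracers: line-of-sight velocities (km/s), projected radii (kpc)
v_los = [120.0, -85.0, 60.0, -150.0, 95.0]
R = [1.2, 3.5, 0.8, 2.1, 4.0]

m_iso = projected_mass(v_los, R, 16 / math.pi)  # isotropic orbits
m_rad = projected_mass(v_los, R, 32 / math.pi)  # extreme radial orbits
print(f"isotropic: {m_iso:.2e} Msun, radial/isotropic ratio: {m_rad / m_iso:.1f}")
```

The ambiguity in C_proj is the point: the same data yield mass estimates differing by a factor of two under different orbit assumptions.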

1 Astronomers give the Milky Way a capital G to distinguish it from other galaxies.

Improved Data and Methods

Like so many disciplines, astronomy is experiencing a wave of Big Data. The European Space Agency's Gaia satellite, launched in 2013, has been performing a stellar census of the Milky Way ever since. On April 25, 2018, the Gaia Collaboration publicly released measurements of parallax and proper motions of more than 1 billion stars in the Galaxy. These data required advanced astrometry techniques and are revolutionizing the ability to study the stellar populations of the Milky Way (Figure 4).

The Gaia satellite has also made it possible to estimate the velocity of dynamical systems both within and just outside the Milky Way. For example, the proper motions of more than 150 Milky Way globular clusters and more than 10 dwarf galaxies have been estimated. Researchers have been using these tracer data to better understand the Milky Way's mass.

Astronomers talk about the velocity anisotropy of the tracer population, which is a combination of the variances (or "dispersions") of the velocity components of the tracer population. The velocity anisotropy parameter is given by

    β = 1 − (σ_θ² + σ_φ²) / (2σ_r²),   (2)

where the σ² values are the variances, computed over the N tracers, of each component of the velocity vectors in spherical coordinates.

Because astronomers deal with astronomically large objects, they often report mass in units of solar mass, M☉, which is equal to approximately 2 × 10³⁰ kg.

Estimators such as Equation 1 also do not take into account measurement uncertainty of varying degrees, and do not allow for multiple spatial distributions for different tracer populations (Table 1). The spatial distribution of tracer populations in the Milky Way is known to vary by population, even though every population is subject to the same total gravitational potential. There is little reason to think other galaxies would be different.

Astrophysicists have derived probability density functions (pdfs) for the energy, E, and angular momentum, L, of tracers in a dynamical system, assuming a model for the gravitational potential and a model for the spatial distribution of the tracers. Many of the details for deriving pdfs are covered in a well-known book, Galactic Dynamics, by Binney and Tremaine (2008). These pdfs are useful components of maximum likelihood and Bayesian analyses that aim to estimate model parameters given data. However, deriving pdfs for tracers requires some assumptions about the dynamical system.

To start, tracers are assumed to be at equilibrium. This might not be true if the system recently underwent an interaction with another dynamical system (e.g., a galaxy-galaxy collision).

Another assumption that helps simplify the mathematics (and statistics) is spherical symmetry. For some dynamical systems this is a good approximation (a globular cluster, for example, is almost perfectly spherical), but for spiral galaxies the assumption obviously breaks down. Only when considering the much-larger halo of dark matter in which the galaxy resides is spherical symmetry a more reasonable assumption. Some pdfs have been derived that allow for triaxial shapes, but their known mathematical forms are limited.

Recently, we used the globular cluster data in a hierarchical Bayesian analysis to estimate the Milky Way's total gravitational potential and mass. The hierarchical Bayesian method simultaneously accounts for measurement uncertainty, includes incomplete data (sampling the unknown components as parameters), and uses a pdf for the energies and angular momenta of the tracers based on a physical model.

The summarized results are shown in Figure 5 (adapted from Eadie and Jurić, 2019). Rather than a single point estimate for the mass, we obtain a posterior distribution and estimate the cumulative mass profile of the Milky Way with Bayesian credible regions (the gray areas in the figure).
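Equation 2 can be evaluated directly from the spherical velocity components of the tracers. A minimal sketch, with the function and data invented for illustration:

```python
import statistics

def velocity_anisotropy(v_r, v_theta, v_phi):
    """Velocity anisotropy (Eq. 2): beta = 1 - (s2_theta + s2_phi) / (2 * s2_r)."""
    s2_r = statistics.pvariance(v_r)
    s2_theta = statistics.pvariance(v_theta)
    s2_phi = statistics.pvariance(v_phi)
    return 1 - (s2_theta + s2_phi) / (2 * s2_r)

# Equal dispersions in all three components: an isotropic system, beta = 0.
v = [-50.0, -10.0, 0.0, 10.0, 50.0]
print(velocity_anisotropy(v, v, v))  # 0.0

# Radially dominated motions push beta toward 1; tangentially dominated, negative.
print(velocity_anisotropy([2 * x for x in v], v, v))  # 0.75
```

This is the quantity the hierarchical Bayesian analysis treats as an unknown parameter rather than an assumption baked into a constant like C_proj.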

Figure 4. An all-sky view of the Milky Way Galaxy created using data from the Gaia satellite's measurements of more than 1.7 billion stars. Note that this is not a picture, but a figure: dark portions in the plane of the Galaxy correspond to dusty regions that block light, and they appear dark in this image because fewer stars were detected in those regions (ESA/Gaia/DPAC).

Figure 5. A cumulative mass profile estimate of the Milky Way Galaxy (grey regions are 50, 75, and 95% credible regions) using Gaia DR2 data, supplemented with additional data from the HST Proper Motion project. Other studies' mass estimates are shown as points with error bars. (Adapted from Eadie & Jurić, 2019.)
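Credible regions like those in Figure 5 are read directly off posterior samples as quantiles. A toy sketch with invented Gaussian "posterior samples" (not the actual chains behind Figure 5):

```python
import random

random.seed(7)

# Toy posterior samples for a mass parameter (invented for illustration).
samples = sorted(random.gauss(1.0e12, 1.5e11) for _ in range(10_000))

def credible_interval(sorted_samples, level):
    """Central credible interval: equal posterior tail mass on each side."""
    n = len(sorted_samples)
    lo = sorted_samples[int(n * (1 - level) / 2)]
    hi = sorted_samples[int(n * (1 + level) / 2) - 1]
    return lo, hi

for level in (0.50, 0.75, 0.95):
    lo, hi = credible_interval(samples, level)
    print(f"{level:.0%} credible interval: [{lo:.2e}, {hi:.2e}] Msun")
```

By construction the intervals nest, which is why the credible regions in the figure appear as bands of increasing width around the same profile.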

The SLUGGS Survey (http://sluggs.swin.edu.au) is the Study of the Astrophysics of Globular Clusters in Extragalactic Systems (SAGES) Legacy Unifying Globulars and GalaxieS Survey. This observational program used the Subaru/Suprime-Cam imager on the Subaru Telescope and the Keck/DEIMOS spectrograph on the Keck II Telescope, both of which are near the summit of Mauna Kea on the Big Island of Hawai`i.

Use these data in Equation 1 and assume isotropic orbits (C_proj = 16/π), then do the same assuming extreme linear orbits (C_proj = 32/π). The former gives a total mass estimate of 5.68 × 10¹⁰ M☉, while the latter doubles this value. When no information is available about the tracer orbits, the mean of these two estimates is sometimes taken as the mass estimate.

With high-performance computing, Bayesian analyses can incorporate measurement uncertainties, include incomplete data, and simultaneously estimate unknown parameters such as the velocity anisotropy, using large data sets. Computer simulations of galaxies combined with model emulators can also enable approaches such as approximate Bayesian computation.

As these methods increase in popularity in astrophysics and the public release of astronomical data continues, the field of astrostatistics will continue to grow.
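One concrete example of the scrutiny these large data sets demand, elaborated in the next section, is distance estimation: inverting a noisy parallax is a nonlinear operation, and a measured parallax can even come out negative. A toy simulation with invented numbers, not Gaia data:

```python
import random

random.seed(1)

TRUE_PARALLAX = 0.5e-3  # arcsec; a star at 1/0.0005 = 2,000 pc
SIGMA = 0.3e-3          # measurement noise, arcsec

# Naively invert many noisy parallax measurements: d = 1 / parallax.
measured = [random.gauss(TRUE_PARALLAX, SIGMA) for _ in range(100_000)]
negative = sum(1 for p in measured if p <= 0)

print(f"{negative} of {len(measured)} simulated parallaxes are zero or negative;")
print("naive inversion turns those into unphysical 'distances'.")
```

A Bayesian treatment with a sensible distance prior avoids the unphysical values and propagates the uncertainty honestly.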

The Future of Big Data in Astronomy and Dynamical Systems

As missions like Gaia improve and increase data collection, the limitations to inference about the Milky Way and its dynamical systems have more to do with statistical techniques and physical models than with missing data.

At the same time, data from large surveys have to be analyzed with scrutiny. For example, Bailer-Jones, et al. (2018) have shown that a simple inversion of the parallaxes provided by Gaia can lead to biased and incorrect estimates of the distance to stars. Some of the parallaxes measured by Gaia are negative, and a simple inversion then implies a negative distance, which is entirely unphysical. It has therefore been recommended that inference techniques such as Bayesian analysis with prior information be used to fully account for uncertainty in the nonlinear transformation from parallax to distance (see Further Reading). In short, even with Big Data, we must be careful about our statistical procedures.

In the next decade or two, enormous amounts of data on other galaxies and galaxy clusters will become publicly available through other telescope surveys and space-based missions. However, incomplete data and measurement uncertainties will remain stumbling blocks to measuring the masses of these dynamical systems beyond our Galaxy. Statistical techniques that can deal with these issues will become paramount.

Further Reading

Bahcall, J.N., and Tremaine, S. 1981. The Astrophysical Journal, 244, 805. doi:10.1086/158756.

Bailer-Jones, C.A.L., Rybizki, J., Fouesneau, M., Mantelet, G., and Andrae, R. 2018. The Astronomical Journal, 156, 58. doi:10.3847/1538-3881/aacb21.

Binney, J., and Tremaine, S. 2008. Galactic Dynamics, 2nd ed. Princeton University Press.

Eadie, G., and Jurić, M. 2019. The Astrophysical Journal, 875, 159. doi:10.3847/1538-4357/ab0f97.

Gaia Collaboration, Brown, A.G.A., Vallenari, A., et al. 2018. Astronomy and Astrophysics, 616, A1. doi:10.1051/0004-6361/201833051.

Wang, X., Walker, M., Pal, J., Woodroofe, M., and Mateo, M. 2008. Journal of the American Statistical Association, 103, 1070. doi:10.1198/016214508000000652.

About the Author

Gwendolyn Eadie is an assistant professor of astrostatistics at the University of Toronto, jointly appointed between the Department of Astronomy & Astrophysics and the Department of Statistical Sciences. In 2018, she received the J.S. Plaskett Medal from the Canadian Astronomical Society for her PhD dissertation. She is also the program chair for the ASA Astrostatistics Interest Group and co-chair of the American Astronomical Society's Working Group on Astrostatistics & Astroinformatics.

Image courtesy of Getty Images

ARIMA for the Stars: How Statistics Finds Exoplanets Eric D. Feigelson

Planet, Planets Everywhere!

In the first century BC, the Roman poet Lucretius wrote, "you are bound to admit that in other parts of the universe there are other worlds inhabited by many different peoples and species of wild beasts." In the 17th century, Bernard de Fontenelle popularized the plurality of worlds with a best-seller among the nobility of Versailles. A generation ago, astronomer Carl Sagan entranced millions with lyrical phrases like "Think of how many stars, and planets, and kinds of life there may be in this vast and awesome universe."

From a scientific perspective, all of this was fantasy. While astronomers had established in the 19th century that the stars at night were luminous gaseous spheres like the sun, there was no evidence at all for planets orbiting the stars until 1995. That year, Swiss astronomers Michel Mayor and Didier Queloz measured subtle Doppler shifts in the wavelengths of spectral lines from the starlight of 51 Pegasi, a nearby sun-like star. The observed periodic sinusoidal pattern in the star's motion indicated the star is being pulled back and forth by an invisible Jupiter-like planet orbiting every 4.2 days.

Their discovery was revolutionary, giving birth to the field of exoplanetary astronomy that is now a significant segment of all astronomical research effort. Today, we know that most stars seen in the nighttime sky have planetary systems. Quantitative evaluations are still uncertain, but the best estimates are that stars typically have about five planets, and perhaps 1% to 10% of them have Earth-like planets in Earth-like orbits where life could exist on the surface. (This fraction is known as η⊕, voiced as "eta-Earth.") It can be inferred that there are hundreds of millions of habitable planets in the Milky Way galaxy,

Figure 1. Diagram of an exoplanet transiting a star and the dip in brightness that occurs every orbit (left). Transit light curves from two planets found early in the Kepler mission, one Neptune-size and the other Jupiter-size (right). Transits typically have amplitudes of 0.01% to 1%, so other sources of brightness variations often overwhelm the planetary signal. (NASA/Kepler mission.)

the closest only a few lightyears away. No evidence for extraterrestrial life (Lucretius's wild beasts) has emerged, but the discovery of ubiquitous exoplanets motivates powerful research efforts directed toward this goal.

Astrostatistics and Exoplanets

Orbital models of multiplanet systems rely on sophisticated Bayesian nonlinear regression procedures where astrophysical processes constrain the prior distributions of model parameters. In part, the link between astronomy and statistics arises from the overabundance of the underlying populations. Of the billions of planets in our galaxy, we have so far sampled only a few thousand. Statistics is crucial here; a few years ago, two research groups analyzing the same data set arrived at estimates that differed by a factor of 7 due to different statistical analysis procedures.

Statistical inference is also critical for the detection and characterization of exoplanets because the signals are so weak. One effective method for discovering exoplanets detects changes in the velocity of the star toward and away from Earth as the planet orbits the star. The Earth pulls the sun around at only 9 centimeters per second, so detecting its Doppler signature requires a spectrograph with better than 1-in-a-million measurement precision that is stable for years. This is still beyond current engineering capabilities, but the latest astronomical spectrographs can detect Jupiter-mass planets out to Earth-like orbits, or Earth-like planets out to Mercury-like orbits.

A second effective method for discovering planets is somewhat less demanding on our instruments. When an orbiting planet passes (or "transits") in front of a star, a small portion of the starlight is briefly eclipsed. For a Jupiter-size planet, the light is diminished by roughly 1%; for an Earth-size planet, the diminution is around 0.01%. Measuring star brightnesses (called "stellar photometry") to this precision is quite feasible today.

Figure 1 illustrates the dip in starlight produced by a planetary transit. In addition to various mountain-top telescopes, several satellite observatories are devoted to transit searches: the Kepler and TESS missions launched by the U.S. National Aeronautics and Space Administration and the Cheops mission from the European Space Agency.

Since only a small fraction of orbits produce transits when viewed from any particular direction, many stars must be photometrically monitored for months or years. A variety of large-scale transit surveys are now underway, from both space-based satellite observatories and ground-based mountaintop observatories.

The Challenge of Planetary Transit Detection

A decade ago, NASA launched the Kepler mission with high confidence it would find many Earth-like planets, thanks to its instrumental precision (around 0.003%) and planned four-year
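The 1% and 0.01% figures follow from geometry: the fractional dip is roughly the ratio of the planet's and star's projected disk areas. A quick check with rounded reference radii (values approximate; names ours):

```python
# Transit depth ~ (R_planet / R_star)^2: the ratio of projected disk areas.
R_SUN = 696_000.0     # km, approximate
R_JUPITER = 69_911.0  # km
R_EARTH = 6_371.0     # km

def transit_depth(r_planet, r_star=R_SUN):
    return (r_planet / r_star) ** 2

print(f"Jupiter-size: {transit_depth(R_JUPITER):.2%}")       # about 1%
print(f"Earth-size:   {transit_depth(R_EARTH) * 100:.4f}%")  # about 0.01%
```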

Figure 2. A four-year light curve from NASA's Kepler mission showing stellar brightness variations measured every half-hour during 2009–13. KID = Kepler identification number for this star; PDCFlux = star brightness after the "Pre-Search Data Conditioning" module is applied to reduce instrumental effects; IQR = interquartile range. In this star, the magnetic activity suddenly increased, producing quasi-periodic variations from rotationally modulated starspots after about two years of observation. A low-dimensional ARFIMA model produced near-Gaussian white noise residuals but, as with most stars, no transiting planet was found here. (This plot and the plots in Figure 3 were obtained by Caceres, et al. 2019.)

photometric survey of ~200,000 stars. It did identify several thousand larger planets (most Kepler discoveries are super-Earth-size planets with orbital periods between two and 100 days), but hopes were foiled for the smaller planets.

It was the stellar variability that caused the problem: More stars exhibited brightness changes than expected. These changes mostly arise from magnetic activity and convection, as seen on the sun, and are manifested as cool starspots, hot faculae, and flares. While some stellar time series (called "light curves" by astronomers) can be quiet, others show complex nonstationary variations. The temporal behaviors are not simple: stochastic autocorrelated variations from superposed microflares, quasi-periodicities as starspots rotate in and out of view, and explosive white light flares. The phenomena cannot be modeled realistically astrophysically and are thus unpredictable. Figure 2 shows a typical nonstationary Kepler light curve.

The typical procedure has two stages: reducing the unwanted stellar variations in the time domain and searching for periodic transits in the frequency domain. The initial reduction of stellar variability in the time domain is generally performed with nonparametric procedures: wavelet analysis (as in the official Kepler pipeline) and high-dimensional local regression procedures like Gaussian Processes regression are most often applied. A few researchers use advanced signal processing methods like Independent Component Analysis, correntropy, Empirical Mode Decomposition, or Singular Spectrum Analysis.

Once the star brightness changes have been reduced, statistical procedures for finding periodic dips in brightness due to a transiting planet are well-established. Fourier transforms are not efficient because the transit shape is not sinusoidal; a "box least-squares" (BLS) matched filter for transit-shaped dips is used instead. A BLS periodogram is constructed, light curves are "folded" (plotted modulo) to the most prominent spectral peak, and the folded curves are "vetted" by expert astronomers to make sure they exhibit properties expected of transiting planets.

This procedure is beset by four big problems:

1. It takes a time domain procedure that can treat a wide variety of unpredictable behaviors.

2. It requires a frequency domain procedure to find very faint transit-shaped periodic dips produced by small, Earth-sized planets. Periodograms have non-Gaussian power distributions that can easily give rise to spurious spectral peaks.

3. If the vetting stage is seen as a classification procedure, then the classes are badly imbalanced, with dozens of non-transiting cases for each transit. Even an occasional misidentification of a spectral peak with noise can flood the community with false alarms, wasting valuable telescope time for astronomical follow-up studies.

4. Some stars have periodic variability for other reasons and disguise themselves as transiting planets. These include (quasi-)periodic variations from convection or pulsations in some stars, and eclipsing binary star systems that contaminate the target star light curve. Light curves with real, but non-planetary, periodicities are called astronomical false positives.

TERMS AND DEFINITIONS

ARIMA: In AutoRegressive Integrated Moving Average (ARIMA) time series models, the AR component treats dependence on recently past brightness values, the MA component treats dependence on recently past brightness changes, and the I represents the differencing operation to reduce nonstationarity. Fractional differencing in ARFIMA treats long-memory correlations corresponding to the astronomers' 1/f "red noise." A periodic transit can be added as an exogenous variable to the aperiodic ARIMA model using the ARIMAX formalism. We calculated model fits using the forecast CRAN package in the R statistical software environment, developed by statistician Rob Hyndman and colleagues.

Box Least Squares and Transit Comb Filter: The BLS and TCF periodograms are frequency domain alternatives to the Fourier periodogram that are designed to detect the periodic dip and spike shapes expected from planetary transits, rather than the periodic sinusoidal shape of Fourier analysis.

Random Forest: RF is an effective classifier using ensembles of decision trees based on "features" drawn from the data set and analysis by the scientist. Important advantages of RF for this problem include: a metric-free design permitting features with different scales and units; self-validation using out-of-bag sampling in each tree of the forest; and balanced treatment of imbalanced training sets.

Time series diagnostics: At each stage of the time series analysis, the analyst can examine how close the data are to white (uncorrelated) Gaussian noise. These include the Anderson-Darling test for normality, the Ljung-Box portmanteau test for autocorrelation, and the augmented Dickey-Fuller test for stationarity.

Penn State's Approach: AutoRegressive Planet Search (ARPS)

We have tried a different approach to stellar variability reduction based on a classic technique from time series analysis known as Box-Jenkins analysis. A low-dimensional, often-linear model is constructed where the brightness is not a function of time, but of recent past brightness levels and past changes (see sidebar). These are known as ARIMA models and are generally fit by maximum likelihood estimation, with model complexity determined by minimizing the Akaike Information Criterion.

However, the Kepler light curves are often nonstationary, where the mean levels change quasi-periodically or in other peculiar ways (Figure 2). ARIMA models with a differencing operation treat many forms of nonstationarity. If long-memory processes are present, ARFIMA models can be used.

The ARIMA residuals are then searched for periodic transits, but the differencing operator converts a box-like dip into ingress and egress spikes. Gabriel Caceres, while a graduate student at Penn State, developed a matched filter for a periodic sequence of double-spikes; he called this the Transit Comb Filter (TCF). The TCF periodogram is then examined for peaks representing repeated transit-like variations in the ARIMA residuals.

The ARPS procedure then faces a Big Data classification problem, where the vast majority of Kepler light curves have no transit and only a small fraction show true planetary transits. Fortunately, NASA's Kepler Team recently released a "golden" list of transiting planet candidates, many confirmed with telescopic study by the broader astronomical community. These serve as a "planet training set" for a multivariate classifier, while random light curves (supplemented with sets of confirmed false positives like eclipsing binaries) serve as a "non-planet" training set.

We developed a list of several dozen "features" for a Random Forest multivariate classifier. The features are drawn from the original light curve, ARIMA residuals, TCF periodogram, folded light curve, and various time series diagnostics. Interestingly, the most important feature emerged from an ARIMAX fit that gave a unique measure of the statistical significance of the transit depth. ARIMAX analysis (see sidebar) is used most commonly in econometrics and had never before been applied to astronomical problems.

We optimized a Random Forest (RF) decision tree classifier using ROC curves. We carefully chose a threshold of RF probabilities on scientific grounds, balancing the recovery of true positives with the need to avoid the potentially overwhelming number of false alarms and false positives from the imbalanced classes.

Finally, we vetted the light curves satisfying the RF threshold for problems and identified promising cases. Several dozen
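The effect of the differencing operator on a box-shaped dip, and the folding step that follows, can be seen in a few lines. This is a toy illustration with a noiseless synthetic light curve, not ARPS code; a real BLS- or TCF-style search would scan many trial periods rather than use the known one:

```python
PERIOD, DURATION, DEPTH = 50, 3, 0.01

# Synthetic light curve: unit flux with a box-shaped dip once per period.
flux = [1.0 - DEPTH if t % PERIOD < DURATION else 1.0 for t in range(1000)]

# First differences (the "I" step of ARIMA): each box dip becomes a
# negative spike at ingress and a positive spike at egress -- the
# double-spike pattern the Transit Comb Filter is built to match.
diff = [b - a for a, b in zip(flux, flux[1:])]
ingress_steps = sum(1 for d in diff if d < 0)
egress_steps = sum(1 for d in diff if d > 0)
print(ingress_steps, egress_steps)  # one spike pair per fully observed transit

# Fold at the trial period and measure depth from in/out-of-transit means.
phases = [t % PERIOD for t in range(1000)]
in_t = [f for p, f in zip(phases, flux) if p < DURATION]
out_t = [f for p, f in zip(phases, flux) if p >= DURATION]
depth_est = sum(out_t) / len(out_t) - sum(in_t) / len(in_t)
print(f"recovered depth: {depth_est:.4f}")
```

With real stellar variability, the differencing and ARIMA steps come first precisely so that the folded residuals, not the raw flux, reveal dips this faint.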

Figure 3. Light curve of a star with strong, choppy autocorrelation, in units of brightness parts per million (ppm) (top). The TCF periodogram of the ARIMA residuals shows a promising peak at a period of 15.8 days (middle). The folded light curve, seen here as residuals of an ARIMAX model, shows the faint box-shaped dip expected from a planetary transit (bottom). The units of flux are ppm.

new planetary candidates emerge from this effort, most of them Earth-sized with very short orbital periods (0.2–10 days). These are planets so close to the host star that the rock surface itself is probably molten.

Figure 3 shows one of these promising discoveries. The middle panel shows a TCF spectral peak corresponding to an orbital period of 15.8 days, and the bottom panel shows a transit depth around 100 ppm, corresponding to a planet with 1% the radius of the host star. If real, this Earth-sized planet has a very hot, possibly molten surface.

The Future of ARPS

From the perspective of a time series analyst, our procedure may not be particularly innovative: Box-Jenkins ARIMA-type modeling has been widely used since the 1970s, the TCF can be viewed as a variant of Fourier analysis, and Random Forest classification has been successful in many fields since the 2000s. But in astronomy, this is quite an unusual approach; astronomers have weak education in applied statistical methodology and, in particular, are not familiar with ARIMA modeling.

The autoregressive planet search method outlined here can be developed further to improve the census of small planets, increasing

sensitivity and reducing false alarms and false positives. Nonlinear ARFIMA and GARCH models give better fits to stellar variations in some cases. Preprocessing by outlier removal can clean up some noisy periodograms. The TCF algorithm can be extended to treat gradual, rather than sudden, ingress and egress spikes. Newer machine learning classifiers like XGBoost can replace the Random Forest stage, and meta-classifiers might reduce the need for expert vetting. Research groups in Israel and the U.S. are investigating the effectiveness of Deep Learning neural networks where the full database of light curves, rather than selected features, is entered into the classifier.

We have also investigated applying the ARPS approach to ground-based transit surveys where light curves of millions of stars are constructed. The problem here is the irregular spacing of the light curve time series data. Typically, a star can be observed only about eight hours per day at night, and is visible for only about six out of 12 months per year. The daily gaps in observation can be mitigated by placing telescopes on different continents or at the South Pole. We have simulated planet detection from ground-based surveys of this type and happily find that ARIMA and TCF methods work even with 70%–90% "missing data."

The Future of Statistics and Exoplanets

Exoplanetary research is in its third decade, but the work is still in its early phases. Nearly every aspect of the observational studies requires advanced statistical methodology:

• Algorithms for reducing stellar variations, both from the observational conditions and intrinsic to the star;

• Accurate subtraction of stellar emission to reveal faint planetary emission in time series, astronomical spectra, and images;

• High-dimensional parametric modeling of multiplanetary orbits fitted to sparse data sets;

• Highly imbalanced classification problems from big surveys;

• Correcting observational selection biases to quantify underlying planetary populations; and

• Searching for biosignatures.

The field of astrostatistics must lead the way in future exoplanet studies, and this requires a combined effort of astronomers and statisticians. Astronomers traditionally have little training in statistics, and few statisticians are familiar with the scientific issues or are embedded in astronomical research teams. Astronomical projects fund "software" and "data analysis" computation, but not development of the methodology that underlies the computing effort, so cross-disciplinary collaboration is still rare.

The field of astrostatistics is constantly improving, with exponential growth in the use of machine learning and Bayesian approaches in the astronomical research literature. Many challenges are being effectively addressed, and many more will be faced in the future. If we ever find the extraterrestrial "wild beasts" predicted by Lucretius, it is likely that statistics will play a crucial role in the discovery.

Further Reading

Caceres, C.A., Feigelson, E.D., Babu, G.J., Bahamonde, N., Christen, A., Meza, C., and Curé, M. 2019. AutoRegressive Planet Search. Astronomical Journal, in press. arXiv:1901.05116, 1905.03766, and 1905.09852.

Feigelson, E.D., Babu, G.J., and Caceres, G.A. 2018. Autoregressive time series methods for time domain astronomy. Frontiers of Physics 6, #80. arXiv:1901.08003.

Haswell, C.A. 2010. Transiting Exoplanets. Cambridge, UK: Cambridge University Press.

Hyndman, R.J., and Athanasopoulos, G. 2018. Forecasting: Principles and Practice, 2nd edition. OTexts. otexts.com/fpp2.

About the Author

Eric Feigelson is professor of astronomy and astrophysics, and of statistics, at Penn State University. He has been working with G. Jogesh Babu and other statisticians for 35 years developing curriculum, offering summer schools, organizing meetings, and conducting research in astrostatistics.

Identifying Milky Way Open Clusters With Extreme Kinematics using PRIM

Mark R. Segal and Jacob W. Segal

The recent (April 25, 2018) second data release (DR2) of the European Space Agency satellite Gaia provides a dramatic increase in the extent and precision of stellar attributes for an unprecedented number of sources in the Milky Way (MW). For instance, data on position in the sky and proper motion are available for more than 1.3 billion stars, with a subset of more than 7 million of the brightest stars having radial velocity measurements. (Definitions of proper motion and other astronomical terms are provided in the glossary.)

Figure 1. Cartoon depicting a star's velocity components. A star's velocity, in three-dimensional space, has three components, one in each coordinate direction. (Brews Ohare. https://commons.wikimedia.org/wiki/File:Proper motion.JPG.)

Investigation of kinematics—combined proper motion and radial velocity—is central to our exploratory analyses. The cartoon in Figure 1 illustrates how these components yield total stellar velocities. Subsequently, we describe how to transform from the depicted sun-centric viewpoint to a Galactic Center viewpoint. The systematic study of stellar velocities has been one of the primary applications of the Gaia DR2 resource, with particular interest in the identification of hypervelocity stars (HVS). Rewinding the paths of some of these stars has enabled speculation regarding the mechanisms, often violent, by which the extreme speeds were attained.

Two of these components—depicted in Figure 1 as transverse velocity—can be obtained from the star's proper motion, µ, coupled with the distance to the star, d. In turn, this distance can be obtained as inverse parallax, if parallax is determined precisely enough. Proper motion measures angular change in position over time, with position measured on the celestial sphere using right ascension (akin to longitude) and declination (akin to latitude). The third component of velocity is radial velocity, measured along the line of sight. The magnitude of the three-component velocity vector is the total velocity, designated vtot and serving as the outcome for these analyses. The sun-centric perspective is transformed to a coordinate system with origin at the Galactic Center.

Another line of enquiry using Gaia DR2 data has been identifying open clusters (OCs). Open clusters comprise from dozens to thousands of stars, formed from the same molecular cloud around the same time and loosely bound by gravitational attraction. More than 1,000 OCs have been discovered in the MW, primarily located in the Galactic disk. This contrasts with the rarer (<100), larger, more-luminous, accretion-formed globular clusters that reside throughout the halo, velocity measurements of which have provided evidence for the existence of dark matter.

OCs provide markers of the formation and evolution of the MW. Their ages span the entire history of the Galactic disk, ranging from the young thin disk (our subsequent focus) to the old

thin disk components and their attendant life cycles. These are important for explaining the creation and evolution of the MW disk and, potentially, spiral galaxies in general. Further, the spatial distribution and motion of OCs can be informative with regard to perturbations acting on the structure and dynamics of the MW.

Beyond serving as such tracers of MW structure and history, OCs are intrinsically interesting targets. As homogeneous groups of stars with the same age and same initial chemical composition, they provide a useful testbed for studying stellar evolution.

Rather than search for individual HVS or OCs irrespective of velocity, these objectives can be combined by seeking groups of stars that exhibit (relatively) extreme velocities. The search focuses on the young thin disk because: (i) this region is where the bulk of OCs reside; (ii) the expanded Gaia DR2 data set has been relatively enriched in high-magnitude (faint) stars, formerly obscured by gas clouds in the young thin disk; and (iii) the restriction reduces the size of the search space, which has both computational and inferential benefits.

Since the stellar population of a galactic disk exhibits (fluid-like) dynamical rotation, with little random motion due to being bound by the disk's gravitational potential, it can be anticipated that extreme velocity groups therein will not be common. Such groups, if identified, may be especially revealing in terms of originating perturbations. None of this is to say that a broader, halo-wide search is not purposeful. Since the search process we use proceeds by successively removing the (extreme) groups detected, our approach can still recover conventional OCs. It may help to describe some current methods used for identifying OCs and discuss associated issues.

Existing Approaches to OC Detection

Using the Gaia DR2 data release for OC identification—which can be framed as clustering or local mode detection based on stellar density estimates—is exceedingly challenging. This is because of (i) the large total numbers of stars, as noted above; (ii) the relatively small numbers of stars constituting an OC; (iii) differential precisions with which stellar positions and velocities are measured; and (iv) the existence of extensive local MW structure, which makes the standard use of the uniform distribution inappropriate as a null referent distribution for declaring a local mode.

Indeed, some authors have ascribed OC discovery to "serendipity" resulting from close examination of small, pre-selected regions (often corresponding to known OCs) of the MW. Accordingly, they have invoked the need for data mining methods to facilitate detection in an unbiased way on a galactic scale, which motivates our work in part.

The necessity for using data mining was made explicit in other work, with the unsupervised clustering technique DBSCAN being a frontline technique. To address the concerns about local MW structure and computational burden, researchers applied the clustering algorithm separately to predefined regions that partition the Galactic disk, a strategy we also adopt.

As with all clustering methods, DBSCAN requires a measure of similarity or distance between objects (stars). Operating in the five-dimensional space comprising three parameters defining spatial position and two proper motion (right ascension, declination) parameters, previous researchers used Euclidean distance after standardizing each parameter separately. However, as acknowledged, this choice is one of convenience and, for example, disregards (i) the strong dependencies between proper motions and parallax, as have been accommodated in identifying HVSs, and (ii) the differing variation and precision of the component parameters.

Moreover, by combining spatial and velocity components, the topology induced by this metric can potentially group stars that are spatially distant, yet moving together. Identifying such groupings, which share a common origin but have become gravitationally unbound and are termed stellar associations, is of interest, but our focus here is on detecting spatially proximal clusters that also move together; i.e., OCs. We achieve this by separating spatial and velocity components. Further, we restrict to stars with radial velocity measurements, which enables derivation of total velocity per Figure 1. This permits recasting the problem in a supervised learning framework, emphasizing detection of clusters that are moving relatively rapidly or slowly.

We subsequently outline the patient rule induction method (PRIM)—the particular supervised methodology used—but first detail data sources and the preprocessing steps needed to obtain total velocity measurements in an appropriate reference frame (coordinate system).

Data Acquisition, Filtering, and Processing

The Gaia DR2 data set contains 7,183,262 stars with both radial velocity and astrometric parameters (position on the

Figure 2. Theoretic and observed rotation curves for the spiral Triangulum Galaxy (Messier 33). (Mario De . https://commons.wikimedia.org/w/index.php?curid=74398525.)

celestial sphere given by right ascension and declination, parallax, and proper motions). To obtain total velocities, it is necessary to convert parallax to distance so apparent motion of an object on the celestial sphere can be transformed to physical motion in space. While this conversion is a matter of simple inversion, errors in parallax measurement (including negatives!) make corresponding filtering important. We follow previous work in restricting to stars with relative parallax errors between 0 and 20%, which reduces the sample to 6,376,803.

We also inherit prior determinations, from a publicly available catalog, of total velocity (vtot) in a coordinate system with origin at the Galactic Center (GC). Deriving these total velocities involves correcting observed radial velocities and proper motions for the sun's position and motion relative to the GC.

We designate star position using Galactocentric (rectangular) coordinates (xGC, yGC, zGC) with the GC as origin, since these are readily obtained from right ascension, declination, and parallax using the standard transformation from spherical to Cartesian coordinates. The distance of a star from the GC, rGC, is then just Euclidean distance in this coordinate system. By construction, the xGC axis aligns the GC and the sun. By virtue of the sun-centric nature of Gaia DR2 sampling, and our focus on the young thin disk, this axis essentially coincides with the first principal component of the matrix of stellar positions in our filtered data set. Similarly, yGC and zGC correspond to the second and third principal components, respectively. This agreement has bearing on our subsequent analysis.

Various rules have been advanced to define the young thin disk. Here, we use the specification that it consists of stars within 100 pc (1 pc ≈ 3.26 light years) of the disk as measured by zGC. This restriction further reduces our data set to 1,702,932 stars, 27% of the (parallax) filtered sample.

The rotation curve of a spiral galaxy is a graph of stellar velocity versus radial distance from the GC, here rGC. It was through study of rotation curves, and attendant discrepancies between theoretic and observed curves, that the existence of dark matter was first postulated, as illustrated in Figure 2.

To detect OCs and kinematic groups with relatively extreme velocities requires accommodation for systematic velocity variation. We effect this control by stratifying into predefined regions based on rGC deciles and 5th and 95th quantiles. Such stratification limits the ability to detect clusters that straddle decile boundaries and, arguably, in view of the largely flat rotation curve for our Gaia DR2 subsample, is not necessary. However, by

Figure 3. Illustration of the PRIM algorithm. The outcome takes two values, 0 and 1, indicated by the blue and red points, respectively. The procedure starts with a rectangle (broken black lines) surrounding all of the data, then peels away points along one edge, by a prespecified proportion α (here, α = 0.1), with the edge chosen to maximize the mean of the points remaining in the box. Starting at the top left panel, this figure shows the sequence of peelings, until a pure red region is isolated in the bottom right panel. (Hastie, T.J., Tibshirani, R.J., and Friedman, J.H. 2009. The Elements of Statistical Learning. New York: Springer.)
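The preprocessing steps just described can be sketched compactly. The snippet below is an illustrative simplification, not the catalog pipeline the authors inherited: it uses Galactic longitude and latitude in place of the full right ascension/declination treatment, makes no correction for the Sun's motion, and all function names are invented for the example. The 4.74 km/s conversion constant (one arcsecond per year of proper motion at one parsec) and the 8,200 pc Sun-to-GC distance (the glossary's value) are standard figures.

```python
import math

SUN_TO_GC_PC = 8200.0        # Sun-Galactic Center distance (glossary value)
KMS_PER_ARCSEC_YR_PC = 4.74  # 1 arcsec/yr of proper motion at 1 pc ~ 4.74 km/s

def passes_parallax_filter(parallax, parallax_error):
    """Keep stars with positive parallax and relative error between 0 and 20%."""
    return parallax > 0 and 0 <= parallax_error / parallax < 0.20

def total_velocity(pm_arcsec_yr, parallax_arcsec, v_radial_kms):
    """v_tot from proper motion mu, distance d = 1/parallax, and radial velocity."""
    d_pc = 1.0 / parallax_arcsec                      # distance as inverse parallax
    v_transverse = KMS_PER_ARCSEC_YR_PC * pm_arcsec_yr * d_pc
    return math.hypot(v_transverse, v_radial_kms)     # magnitude of the 3-D velocity

def galactocentric_xyz(l_deg, b_deg, parallax_arcsec):
    """Spherical-to-Cartesian transform with the origin shifted from Sun to GC.
    Uses Galactic (l, b) for brevity; the article starts from right ascension
    and declination."""
    d = 1.0 / parallax_arcsec
    l, b = math.radians(l_deg), math.radians(b_deg)
    x = d * math.cos(b) * math.cos(l) - SUN_TO_GC_PC  # x axis aligns Sun and GC
    y = d * math.cos(b) * math.sin(l)
    z = d * math.sin(b)
    return x, y, z

def r_gc(x, y, z):
    """Distance from the Galactic Center: Euclidean norm in (xGC, yGC, zGC)."""
    return math.sqrt(x * x + y * y + z * z)
```

With these pieces, the rGC value drives the decile stratification, and z bounds the young-thin-disk cut (|z| < 100 pc).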

substantially reducing the computational burden, we are able to attempt permutation-based inference as a way to validate detected clusters.

The Patient Rule Induction Method

We have arrived at a setup where we have data on the location of each star in each decile stratum (xGC, yGC, zGC) in Galactocentric coordinates and its total velocity vtot. We treat velocity as an outcome and try to find local modes ("bumps")—velocity maxima or minima—in the feature space defined by position coordinates. Here, feature space is literally space!

Tree-based methods partition the feature space, using binary splits on individual features, into rectangular (box-shaped) regions, to make the outcome averages in each box as different as possible. Since partitioning into boxes is recursive, the rules defining the boxes can be represented by a binary tree. Such representations enhance interpretability because they mimic decision trees.

PRIM also finds box-based clusters in feature space, but seeks boxes in which the velocity outcome average is high or low. The search for outcome maxima—fast star groups—is conducted as follows, with obvious modifications for finding slow groups.

The first step commences with a box containing all the data. This box is slightly reduced by removing a small number of observations along one face of the box, with the face chosen for this peeling so the mean outcome of the reduced box is maximized. The number of observations considered for removal is governed by the peeling tuning parameter α; throughout we use the common default value α = 0.1. The peeling procedure is then repeatedly reapplied to the reduced box until the number of points therein reaches a prescribed minimum mass value. This minimum mass is specified as a proportion of the initial sample size; to elicit OCs with as few as 50 stars, we use a value of 0.0005.

There is provision for expanding the resultant solution box by reverse peeling according to a distinct pasting tuning parameter if such expansion increases the box mean; we have not deployed such expansion. Observations in the resultant box are then removed from the original data set and the entire procedure begun anew, generating a sequence of solution boxes.
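A bare-bones sketch of the peeling loop, together with the permutation check used to validate detected boxes, is given below. This is a toy rendering of the Friedman-Fisher algorithm, not the authors' implementation: the pasting step is omitted, ties are broken arbitrarily, and the default minimum mass is inflated to suit a small illustrative sample (the article uses 0.0005 on strata of roughly 170,000 stars).

```python
import random

def prim_peel(X, y, alpha=0.1, min_mass=0.05):
    """One PRIM peeling pass: shrink a box toward high-outcome points.

    X: list of feature tuples, e.g., (xGC, yGC, zGC); y: outcomes (vtot).
    Returns the indices of the points in the final box.
    """
    n, d = len(X), len(X[0])
    inside = list(range(n))
    min_points = max(1, int(min_mass * n))
    while len(inside) * (1 - alpha) >= min_points:
        k = max(1, int(alpha * len(inside)))
        best_mean, best_keep = None, None
        for j in range(d):                 # candidate faces: low/high side of each feature
            order = sorted(inside, key=lambda i: X[i][j])
            for keep in (order[k:], order[:-k]):
                m = sum(y[i] for i in keep) / len(keep)
                if best_mean is None or m > best_mean:
                    best_mean, best_keep = m, keep
        inside = best_keep                 # peel the face that maximizes the box mean
    return inside

def permutation_p_value(X, y, alpha=0.1, min_mass=0.05, n_perm=99, seed=0):
    """Validate the top box by re-running the full search on scrambled data:
    spatial positions stay fixed while outcome values are permuted over them."""
    rng = random.Random(seed)
    box = prim_peel(X, y, alpha, min_mass)
    observed = sum(y[i] for i in box) / len(box)
    hits = 0
    for _ in range(n_perm):
        y_perm = y[:]
        rng.shuffle(y_perm)                # permute velocities over fixed positions
        perm_box = prim_peel(X, y_perm, alpha, min_mass)
        if sum(y_perm[i] for i in perm_box) / len(perm_box) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)       # add-one corrected p-value
```

On a simulated stratum containing a genuinely fast-moving clump of stars, the permuted re-runs rarely reach the observed box mean, so the p-value is small; on pure noise the adaptive search still produces boxes, but the p-value is large.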

Figure 3 illustrates one round of this process. While this schematic also uses peeling parameter α = 0.1, our application differs in that (i) instead of two dimensions, we have three, corresponding to star position coordinates; (ii) instead of a binary outcome, there is a continuous (total velocity) outcome; and (iii) instead of 200 data points, there are, for example, 170,293 for each decile partition.

PRIM differs from tree-based partitioning methods in that the box definitions are not described by a binary tree. This makes interpretation of the collection of rules more difficult; however, by removing the binary tree constraint, individual rules are often simpler. Since we are interested in isolating select velocity clusters, as opposed to characterizing the velocity distribution of the entire MW, this latter simplicity is preferable.

More important for our purposes is the built-in patience of PRIM, reflected by each successive peel only removing a small portion (here, 10%) of the current sample, as opposed to the much-greater data fragmentation inherent to the top-down sample splitting that arises with tree-structured methods.

By conducting peeling with reference to the given coordinates (features), it is evident that PRIM box solutions will depend on the coordinate system chosen. To both improve performance and attain coordinate system invariance, executing a principal components rotation to principal axes has been proposed before applying PRIM. However, as discussed above, no such rotation is necessary because (xGC, yGC, zGC) already constitute principal axes for this Gaia DR2 subsample.

Like clustering algorithms, the PRIM procedure will generate a sequence of solution boxes regardless of the input data. Accordingly, methods are required to assess if the results are "real." If the focus is solely on the first solution box generated, then cross-validation approaches can be used to determine, on a predictive basis, an appropriate trade-off between box mean (average vtot for stars in the box) and mass (number of stars in the box) based on the sequence of peeling steps.

However, since we are interested in appraising all solution boxes as potential kinematic groups, an alternate approach is needed. It is crucial to account for the adaptive search underlying PRIM solutions in any such approach. We achieve these goals by use of permutation-based inference, reapplying PRIM to scrambled data sets where, within each rGC partition considered, we permute star velocity values over the (fixed) spatial positions.

Doing this a great many times generates the permutation distribution of the sequence of solution box mean velocity values; since we are re-applying the entire PRIM procedure, we generate multiple solution boxes per permutation. In addition, the individual permuted velocities of each solution box are retained, facilitating comparisons—for example, via t-tests that incorporate within-box velocity variation—with the velocities of the stars in the original (unpermuted) solution boxes. The main drawback to this strategy is computational, which can be redressed at least partly by parallelization.

Results

Figures 4–8 provide a brief but representative survey of our findings. The histogram panels depict results from the original and permuted PRIM analyses for selected rGC partitions. In each instance, we highlight findings for the top three (most extreme group velocity averages) boxes along with the median velocity box.

Figure 4 pertains to searching for groups of slow-moving stars in the rGC band defined from the 0th to 5th rGC quantiles, corresponding to 85,147 stars between 3,193 and 6,663 pc from the GC. The original PRIM analysis extracted a total of 1,432 boxes. The slowest observed group (box 1, red vertical line) contains 44 stars moving with an average total velocity of vtot = 195 km s−1. This is appreciably less than the total velocities of the slowest boxes obtained by PRIM after permutation (blue histogram). Further, using t-tests to compare the velocity distribution of the original slowest box with the velocity distributions of its permuted counterparts obtains significant results, with the p-value interquartile range (IQR) being (0.04, 0.08). The differences between original and permuted findings diminish when comparing second and third boxes, and are null by the median box. When we seek fast-moving groups in this rGC band, none of the results are statistically significant.

Indeed, for most bands, and for both fast- and slow-moving group detection, results are null. Figure 5 provides a representative illustration that pertains to searching for fast-moving groups in the fourth rGC decile (7,921 to 8,055 pc from the GC). As discussed above, such findings are expected in view of stars in the young thin disk being bound by the disk's gravitational potential. However, that does not mean that the groups (boxes) identified are not OCs; rather, they are just not distinct with respect to total velocity.

Figure 6 showcases significant findings for fast-moving groups from the 90th to 95th rGC quantile band, corresponding to stars between 8,828 and 9,349 pc from

Figure 4. PRIM results for select (top three, median) slow-moving groups (boxes) for the band between 0th and 5th rGC quantiles. The red vertical line indicates the observed average total velocity for the respective groups while the blue histograms give null distributions, as obtained by permuting velocities and re-applying the PRIM procedure.

Figure 5. As for Figure 4 for select fast-moving groups (boxes) for the band between 3rd and 4th rGC deciles.

Figure 6. As for Figure 4 for select fast-moving groups (boxes) for the band between 90th and 95th rGC quantiles.

the GC. The fastest observed group (from a total of 1,566 groups) contains 45 stars moving with an average total velocity of vtot = 258 km s−1 (box 1, red vertical line), notably faster than the total velocities of the fastest boxes obtained by PRIM after permutation (blue histogram). However, as expected, this group average velocity is dramatically less than the velocity (568 km s−1) of a recently characterized individual HVS ejected from the inner disk, while possibly the three fastest stars in the MW—white dwarf companions of supernovae—have velocities exceeding 1,000 km s−1. The t-test comparisons with total velocity distributions of permuted counterparts produce a p-value IQR of (0.01, 0.03). Again, as expected, differences between original and permuted findings diminish in progressing to comparing second and third boxes, and are null for the median box.

Figure 7 depicts the positions of these extreme (fast and slow) groups in the portion of the Galactic disk covered by our Gaia DR2 subsample. Because these arise in outlying quantiles, the box extent is greater than for discoveries in more-central, confined rGC bands. This is illustrated in Figure 8, which highlights the slowest box from the 8th decile (8,343 to 8,497 pc from the GC), containing 87 stars with average total velocity vtot = 215 km s−1, against a restricted portion of the Galactic plane.

Future Work

There are many possibilities for further study. In terms of PRIM, methods that use more-robust vtot summaries and simultaneously capture variation warrant consideration. Improvements in computational efficiency are important for enabling expanded search and more-effective permutation-based inference.

In addition to isolating (rare) extreme kinematic groups, the PRIM technique also delivers numerous candidate OCs. Techniques for prioritizing and interrogating these solutions deserve attention. The nature of such interrogation depends on available data. For example, to follow up on select HVS identified from Gaia DR2 total velocities, use of targeted high-resolution spectroscopy enabled determination of chemical abundance patterns which, in turn, were revealing about their origins. With putative kinematic groups and OCs, analogous follow-through is essential for verification beyond the purely statistical and for downstream interpretation. While some headway along these lines is possible using the photometric data available from Gaia DR2 and inspection of associated color-magnitude diagrams, detecting the characteristic signatures

Figure 7. Smoothed scatterplot of star density in the Galactic plane (zGC = 0) for the Gaia DR2 subsample. Highlighted are boxes indicating locations of the fastest (green) and slowest (red) kinematic groups in their respective rGC bands, depicted via semi-circles: 90th to 95th rGC quantile (purple) and 0th to 5th rGC quantile (orange). The fast group appears to extend beyond its band because the box is (i) projected onto the Galactic plane and (ii) a bounding box as opposed to the convex hull of contained points. Also shown (black rectangle) is the region expanded in Figure 8.

Figure 8. Smoothed scatterplot of star density in a restricted portion of our Gaia DR2 subsample. The box indicates the position of the slowest kinematic group in the band between the 70th and 80th rGC quantiles, depicted by the purple semi-circles.

of OCs from these diagrams can be challenging with small cluster sizes. Some preliminary attempts to impart an algorithmic basis (using artificial neural networks on a set of manually labeled diagrams) for such detections have been made—another area that warrants additional data analytic efforts.

Glossary of Astronomical Terms

These astronomical terms and constructs are described, where appropriate, from the standpoint of stars in the Milky Way, because they are the units of the analyses discussed here.

Celestial sphere: an abstract sphere that has an arbitrarily large radius and is concentric to the Earth, on which all stars in the sky can be conceived as being projected. Accordingly, a star's position can be described via two coordinates analogous to latitude (declination) and longitude (right ascension).

Color-magnitude diagram: a plot of stars' brightness versus color, usually for a cluster so the stars are all at approximately the same distance, from which various attributes can be inferred.

Declination: the angular distance of a star north or south of the celestial equator, analogous to latitude. Declination and right ascension, an east-west coordinate, together define the position of a star in the sky.

Galactic Center: the rotational center of the Milky Way, approximately 8,200 parsecs away from Earth, in the direction of the constellation Sagittarius.

Galactic disk: the (relatively) thin plane containing the bulk of the Milky Way's stars.

Galaxy rotation curve: plot of orbital speeds of visible stars in the galaxy versus their radial distance from the galaxy's center.

Globular clusters: a spherical collection of tightly gravitationally bound stars, residing in the halo, orbiting the Galactic Center.

Halo: an extended, roughly spherical component of a galaxy, comprising sparsely scattered older stars, globular clusters, and gas.

Hypervelocity stars: stars with velocities greatly exceeding the normal velocity of stars in a galaxy.

Open clusters: stellar groups comprising dozens to thousands of stars, formed from the same molecular cloud around the same time and loosely bound by gravitational attraction.

Parallax: difference in the apparent position of a star when viewed along two lines of sight, measured by the angle of inclination between these lines, from which distance to the star can be determined.

Photometry: measurement of the apparent brightness of stars, informative with respect to their temperature, distance, and age.

Proper motion: a star's angular change in position over time, measured in arc-seconds per year; it is a two-dimensional vector (since it excludes the component in the line of sight direction) defined by the angular changes per year in the star's right ascension and declination.

Radial velocity: the velocity of a star along the line of sight of an observer.

Right ascension: the angular distance of a star measured eastward along the celestial equator from the Sun at the Vernal equinox; analogous to longitude.

Stellar association: a loose star cluster containing up to 100 stars that share a common origin but, while gravitationally unbound, are still moving together through space.

Young thin disk: a structural component of the Milky Way composed of stars, gas, and dust, with a scale height of approximately 100 parsecs in the vertical axis perpendicular to the disk.

Further Reading

Cantat-Gaudin, T., Jordi, C., Vallenari, A., Bragaglia, A., Balaguer-Núñez, L., Soubiran, C., Bossini, D., Moitinho, A., Castro-Ginard, A., Krone-Martins, A., Casamiquela, L., Sordo, R., and Carrera, R. 2018. A Gaia DR2 view of the open cluster population in the Milky Way. Astronomy & Astrophysics 618, A93.

Castro-Ginard, A., Jordi, C., Luri, X., Julbe, F., Morvan, M., Balaguer-Núñez, L., and Cantat-Gaudin, T. 2018. A new method for unveiling open clusters in Gaia—New nearby open clusters confirmed by DR2. Astronomy & Astrophysics 618, A59.

Ferreira, F.A., Santos Jr., J.F.C., Corradi, W.J.B., Maia, F.F.S., and Angelo, M.S. 2019. Three new Galactic star clusters discovered in the field of the open cluster NGC 5999 with Gaia DR2. Monthly Notices of the Royal Astronomical Society. https://doi.org/10.1093/mnras/sty3511.

Friedman, J.H., and Fisher, N.I. 1999. Bump hunting in high-dimensional data. Statistics and Computing 9, 123–143.

Marchetti, T., Rossi, E.M., and Brown, A.G.A. 2018. Gaia DR2 in 6D: Searching for the fastest stars in the Galaxy. Monthly Notices of the Royal Astronomical Society. https://doi.org/10.1093/mnras/sty2592.

About the Authors

Mark Segal is a professor in and head of the Division of Bioinformatics in the Department of Epidemiology and Biostatistics at the University of California San Francisco. His current research focus is primarily in computational biology.

Jacob Segal is a senior studying astrophysics at the University of Colorado, Boulder. He works in a stellar variable research group.

Time Series Clustering Methods for Analysis of Astronomical Data

David J. Corliss

Cluster analysis, often referred to as segmentation in business contexts, is used to identify and describe subgroups of individuals with common characteristics that distinguish them from the rest of the population. While segments are often identified using static characteristics, evolving systems may be better described by how things change over time. A medical patient may be classified by the amount of time since an important event such as a diagnosis, economic activity may be segmented by stages in an economic cycle, and neighborhoods grouped by stages in generational evolution. In astrostatistics, this technique is used to classify a supernova by how the amount of light it produces changes over time. Depending on the method used, clustering time series data can identify a series of steps or phases within a time series, or identify groups of time series with a similar pattern (Similarity Analysis).

The use of cluster analysis in statistics to identify distinguishable subpopulations goes back to the 1930s, with a textbook by the behavioral psychologist Robert Tryon. Most statistical software systems support cluster analysis, with output typically including summary statistics on the final clusters, metadata on the iterations needed to create them, and an output data set with a field identifying the cluster for each record.

A Common Clustering Example: The Anderson Iris Data

Cluster analysis is often taught using a data set of iris flowers collected by the American botanist Edgar Anderson. These data are often known as the Fisher Iris Data, due to their later use for discriminant analysis by the statistician Ronald Fisher. Figure 1 shows a plot of these data, based on two characteristics that tend to be distinct by species: petal length and sepal width.

Cluster Analysis Applied to Time Series Data

While most people in statistics are familiar with clustering similar to this standard example, clustering of time series data can be very different. As one example, a dynamically evolving system often goes through a series of distinct steps or stages as it changes. Figure 2 gives the example of heat applied to ice to melt it, with the temperature recorded at periodic intervals. When the data in this time series are clustered, a succession of steps can be seen. First, the temperature of the ice increases. As the ice melts, the material remains at the freezing point. Once all the ice has melted, the resulting water is heated and the temperature begins to rise again.

As with other clustering applications, the data in each cluster share important characteristics that are distinct from the other clusters. In these time series data, however, these clusters are successive, representing stages in a process. The application of clustering methods to time series data can identify the steps in a process or stages in evolutionary development.

Time series cluster analysis can be applied to phenomena that repeat in a periodic manner, such as internet traffic over a 24-hour period or seasons in a year. However, not every event of interest has a fixed duration and some may not repeat at all. While some one-time events last for a standard length of time, the most-general case is with events that follow an exact sequence but vary in overall duration.

In astrophysics, the violent stellar eruptions known as High Velocity Absorption (HVA) events in some late B- and early A-type supergiant stars appear to follow a definite sequence but vary in duration by an order of magnitude or more. This work in time series cluster analysis originally was undertaken to identify the evolutionary stages of these one-time events of variable duration.

This analysis of events with variable duration re-normalizes the time values using the difference between two benchmark points in

VOL. 32.3, 2019 50

Figure 1. Typical clustering example with iris flowers observed by Edgar Anderson (1936) grouped by characteristics distinctive to each species.

Figure 2. Time series of melting water ice.
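The staged structure in a series like Figure 2 can be recovered with a small clustering sketch. The Python below is illustrative only — the temperature values are invented, the rate-of-change feature is weighted by a hand-picked factor so the melting plateau separates cleanly, and boundary points may land in an adjacent stage; the article's own analysis used SAS.

```python
# Hypothetical melting-ice time series: temperature at equal time steps.
temps = [-10, -8, -6, -4, -2] + [0] * 8 + [2, 4, 6, 8, 10, 12, 14, 16]

# Each observation becomes (temperature, scaled rate of change); the rate is
# weighted heavily so the melting plateau separates from the heating phases.
rates = [temps[1] - temps[0]] + [temps[i] - temps[i - 1] for i in range(1, len(temps))]
points = [(float(t), 20.0 * r) for t, r in zip(temps, rates)]

def kmeans(pts, k, iters=25):
    """Plain k-means with deterministic seeds: first, middle, and last point
    (assumes k == 3 for this illustration)."""
    centroids = [pts[0], pts[len(pts) // 2], pts[-1]]
    labels = []
    for _ in range(iters):
        labels = [
            min(range(k), key=lambda j: (p[0] - centroids[j][0]) ** 2
                                        + (p[1] - centroids[j][1]) ** 2)
            for p in pts
        ]
        for j in range(k):
            members = [p for p, lab in zip(pts, labels) if lab == j]
            if members:
                centroids[j] = (sum(m[0] for m in members) / len(members),
                                sum(m[1] for m in members) / len(members))
    return labels

stages = kmeans(points, 3)
```

The three clusters that come back correspond to the warming ice, the constant-temperature melt, and the warming water — successive stages rather than static segments.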

time. Often, as in the case of HVAs, the beginning and the end of the event provide the overall duration. The time data values are then transformed to a percent of the time from beginning to end. This analysis of HVAs extracts the initial and final dates and then merges them with each record. The percent of the total time elapsed from the beginning is calculated and then used as the time variable in the cluster analysis.

The number of clusters was identified by initially generating a large number of clusters and then

CHANCE 51

Figure 3. Testing point-in-time and rate-of-change variables in five HVA events. These plots test which combination of observed quantities provides the best separation of the clusters and thus defines the successive stages of these events more clearly, with distinct behaviors at each stage. The system with the most distinct clusters, using both point-in-time and rate-of-change variables (bottom plot), has the clusters in the order Phase 1 → Phase 2 → Phase 3 → Phase 4.

Figure 4. Time series cluster analysis of HVA events in the A0 supergiant HR 1040. The vertical scale is the number of standard deviations above the baseline mean during “quiescent” phases—with no stellar eruptions. The horizontal scale is time, given by the last five digits of the Julian date. This dating system, commonly used in astronomy time series analysis, avoids negative dates by recording the number of days since January 1, 4713 BCE, since there are few, if any, astronomical observations with individual dates before this point in time.
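The percent-of-duration time variable described in the text can be sketched in a few lines. Python is used here for illustration only — the article's pipeline used SAS, and the Julian dates and measured values below are invented.

```python
# Hypothetical event records: (Julian date of observation, measured value).
event = [(51000.0, 0.1), (51002.5, 0.8), (51005.0, 1.9), (51010.0, 0.0)]

def normalize_times(records):
    """Rescale observation times to percent of the event's total duration,
    using the first and last observations as the two benchmark points."""
    t0, t1 = records[0][0], records[-1][0]
    return [((t - t0) / (t1 - t0) * 100.0, v) for t, v in records]

scaled = normalize_times(event)  # times become 0.0, 25.0, 50.0, 100.0
```

After this transformation, events an order of magnitude apart in duration share a common 0–100 time axis and can be clustered together.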

Figure 5. Infrared light curves from five Type 1a supernovae.

“pruning”—combining clusters with similar values. HVA events appear to go through up to four distinct stages, so this was set as the number of clusters in the final run of the clustering algorithm. While this study used the SAS procedure FASTCLUS, other clustering algorithms or packages could be employed.

In this analysis, point-in-time data was used to calculate the rate of change of the variables since the previous observation. This case of HVAs in late B- and early A-type stars gave the clearest separation of clusters when both the observed values and their rates of change were included in the analysis (Figure 3).

The first stage in these events is characterized by a rapid increase in the quantity and velocity of material ejected from the star, while the absolute amounts remain fairly small (Cluster #1), followed by a rise to a sharp peak (#2), and then a rapid decrease of 60%–70% (#3). After an interval with little change, there is a final drop back down to zero intensity (#4). This sequence of evolutionary stages is seen in two HVAs in Figure 4.

Similarity Analysis: Classifying Time Series by Pattern

Similarity analysis is a statistical method for comparing and clustering different trends or patterns over time. It uses a distance measure to quantify the difference in form or shape between different time series: Two series get a small similarity distance if their pattern is similar, but receive a large number if the patterns are different. The mutual differences between many sequences or time series can be compiled in a similarity matrix, similar in form to a correlation matrix using multiple correlations.

When cluster methods are applied to a similarity matrix, time series with similar patterns are grouped together and placed into clusters distinct from others containing time series with different patterns. This allows data captured over time to be classified into groups. This example uses similarity analysis to classify supernovae by their light curves—by how the amount of light they produce changes over time.

Figure 5 shows five light curves, each demonstrating the rise and fall in brightness from a different Type Ia supernova observed in the Sloan Digital Sky Survey (SDSS) in infrared light. (David Cinabro of Wayne State University provided a copy of the source data.) In this example, similarity analysis reads and quantifies the differences in shape between the five lines and creates two groups. The first group, with lines 1 and 2, has a higher, faster peak and a secondary peak later on. The second group, with lines 3, 4, and 5, has a slower initial growth rate and no secondary peak.

The difference in shape can be captured and quantified by a process called Dynamic Time Warping (DTW). DTW involves drawing scale lines between two time series, resulting in a need for additional lines when a difference in shape occurs. The similarity distance is the sum of the lengths of the lines, so shapes with only a few

CHANCE 53 Figure 6. Dynamic time warping to calculate the similarity distance between two light curves, produced using the R package DTW.

Figure 7. Cost plot for the same two light curves, a second way of visualizing the similarity distance.

Table 1—Similarity Matrix for the Infrared Light Curves of Six Supernovae in the SDSS Data

      TS1   TS2   TS3   TS4   TS5   TS6
TS1     0   817   100   469   338   307
TS2   817     0   422   644   871   895
TS3   100   422     0    92   835   818
TS4   469   644    92     0    56   206
TS5   338   871   835    56     0   378
TS6   307   895   818   206   378     0
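Reading Table 1 programmatically confirms the pairs discussed in the text — the smallest off-diagonal entry identifies the most-similar pair of light curves. This is a Python sketch; the article's matrix itself was produced with SAS.

```python
labels = ["TS1", "TS2", "TS3", "TS4", "TS5", "TS6"]
M = [  # similarity distances from Table 1
    [0,   817, 100, 469, 338, 307],
    [817, 0,   422, 644, 871, 895],
    [100, 422, 0,    92, 835, 818],
    [469, 644, 92,   0,   56, 206],
    [338, 871, 835,  56,  0,  378],
    [307, 895, 818, 206, 378,  0],
]

# Scan the upper triangle for the closest and most-distant pairs.
pairs = [(M[i][j], labels[i], labels[j])
         for i in range(len(M)) for j in range(i + 1, len(M))]
closest, farthest = min(pairs), max(pairs)
# closest -> (56, 'TS4', 'TS5'): the pair the text describes as very similar
```

The same matrix is what a clustering routine would consume in step two of the classification procedure.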

minor differences have a smaller distance than time series with a greater difference in shape. The “warping” refers to non-linear re-scaling of the time axis to find the optimal (shortest) path independent of changes in speed. For example, DTW works on EKG heartbeat time series data, even though heartbeats can have different durations.

Two software packages capable of performing DTW are the SIMILARITY procedure in SAS, which was used in this study, and the DTW package in R. In addition to calculating the similarity distance, these also provide several important plots to visualize the data to facilitate analysis and interpretation.

A Warp Plot (Figure 6) shows the lines describing the difference in shape between two time series. The sum of the lengths of the line segments gives the similarity distance.

A Cost Plot (Figure 7) provides a different way of visualizing the difference in shape between two time series. Blue areas indicate areas of greater similarity, while red ones indicate more difference—for example, at the end of these two time series, where the upper one declines monotonically while the lower one shows a second peak. This cost plot compares the same two supernova events plotted in Figure 6.

Preparing the Data and Calculating the Similarity Distance

Some quantities are highly variable, making classification using cluster analysis more difficult. Fields with larger variances receive more weight in determining clusters. Also, if the average value of a field steadily increases or decreases over time, the weight given to that field will also change. In these circumstances, transforming a volatile field to the number of standard deviations above and below the mean before similarity analysis corrects for the weighting by normalizing each field to its total amount of variation in the data set.

Other data issues may arise. The time series of some quantities can be autocorrelated. For example, values in the short-term future may be correlated more strongly to values in the recent past than to those in the more-distant past. As in earlier examples, the time rate of change of a quantity may characterize the behavior better than values observed at one moment in time. Applications of similarity analysis in econometrics may involve prices that are subject to inflationary changes over time, requiring transformation to monetary values at a specific point in time (e.g., 2019 dollars).

Gasoline prices provide an excellent example of all of these data issues: They are highly volatile and often show autocorrelation over the short term and inflation effects over the long term. With proper preparation of the data, modeling the impact of events on a volatile and sensitive commodity such as gasoline can be performed using similarity analysis to compare the event to previous price shocks.

Classification using similarity analysis involves these steps:

1. Calculate the similarity distance between the time series and place it in a similarity matrix. Data visualizations support investigating the differences between time series.

2. Apply cluster analysis to the similarity matrix to identify clusters of time series with similar forms.

Table 1 is the similarity matrix containing the similarity distance
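The standardization transform described in the data-preparation discussion — each value expressed in standard deviations from the field's mean — can be written with the Python standard library (the article's own workflow used SAS):

```python
from statistics import mean, stdev

def standardize(values):
    """Express each value as the number of standard deviations above or
    below the field's mean, so highly volatile fields do not dominate
    the distance calculation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

z = standardize([10.0, 12.0, 14.0, 16.0, 18.0])  # mean 0, unit variance
```

After standardization every field contributes on the same scale, so no single volatile field dominates the clusters.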

Figure 8. Additional cluster characteristics. Supernovae 3 and 4, with a second IR peak, also have a higher peak in the UV. All flux values scaled to maximum flux in g (visible) = 100.

calculated for each pair from six supernova events in the Sloan Digital Sky Survey. The matrix shows the amount of difference in shape between any two time series. For example, TS4 and TS5 are very similar in shape, so they have a small number (56). In contrast, TS4 is very different from TS1, resulting in a very large number (469). A number of visualizations can be created at this stage as well, including warp and cost plots.

Clustering the Time Series by Similarity Distance

Once the similarity distances have been calculated, the time series can be clustered using standard methods. Developing the clusters uses all time series in the SDSS data with sufficient observations to describe the shape in detail—78 time series in all.

In the case of the infrared light curves from Type 1a supernovae, two distinct clusters can be seen. One cluster has a distinct second peak in the infrared some 35 days after the initial, maximum peak brightness. The other cluster does not have this second IR peak, fading away monotonically (Figure 5). The similarity analysis using all the time series with sufficient data yields two clusters: one for those with a secondary peak in the far infrared, one without.

Once the clustering is complete, additional characteristics of the clusters may be identified. The trellis plot in Figure 8 displays data for the five supernovae plotted earlier. In addition to the infrared light (z, in the column on the right) used for the similarity analysis, it plots four other colors of light: u (ultraviolet), g (green), r (red), and i (near infrared). When all the data are plotted in the panel, additional characteristics of the clusters can be identified. In this case, the g, r, and i columns are very similar. However, the ultraviolet u column on the left shows a large peak in the first cluster, with minimal increases in the u data for the other cluster. It may be that the mechanism or characteristics responsible for the secondary far infrared peak that first attracted attention are also responsible for the much-stronger ultraviolet radiation seen in this sub-type.

Similarity analysis is an emerging method for the classification of time series. It uses a distance measure to quantify the difference in form or shape between two series, with a lower number indicating greater similarity. The

Figure 9. Seasonality in gasoline prices. This set of clusters shows a post-holiday lull in the first week of the year (cluster #1), a winter season with low prices and abundant supply (#5), a run-up of prices and supply shortages from mid-March through the end of May when refineries are changing over from winter formulations to summer (#2), a summer driving season with significant supply but higher demand and occasional spikes (#3 and #6), and a gradual decline in prices from mid-September through December (#4).

Figure 10. U.S. recession and recovery data (quarterly GDP percent change).

similarity distances for a set of time series can be arranged into a similarity matrix. Classification of time series can be achieved using standard clustering methods on the similarity distance.

In the example, SDSS light curves from a set of Type 1a supernovae indicate the presence of two subtypes: one with a fast initial growth rate and a secondary peak in the infrared z-range, while the second subtype has a slower initial growth rate and no secondary infrared peak. Classification of the time series in this way led to the discovery of other differences between the two types, including a larger peak in the ultraviolet in the first sub-type.

Time Series Cluster Analysis: Other Examples

In astrostatistics, analytic methods often borrow from and contribute to other areas of research. These time series clustering examples, developed to study hot, massive stars, can contribute to other research. In econometrics, time series cluster analysis has been applied to the evolution of market shocks, identifying seasonality in gas prices (Figure 9). Longitudinal studies in time series cluster analysis can investigate seasonal patterns in weather over time due to climate change.

Similarity analysis has been used to classify cycles in

Table 2—Similarity Matrix of U.S. Recession/Recovery Cycles

         Cycle1  Cycle2  Cycle3  Cycle4  Cycle5  Cycle6  Cycle7  Cycle8
Cycle1     0     193.67  185.09  184.12  222.47  341.67  392.48  532.5
Cycle2   193.67    0      71.34  233.86   34.16  280.74   79.89  115.73
Cycle3   185.09   71.34    0     229.11   57.79  287.56  121.01  160.87
Cycle4   184.12  233.86  229.11    0     176.47  317.19  108.08  133.58
Cycle5   222.47   34.16   57.79  176.47    0     281.18   35.96   85.93
Cycle6   341.67  280.74  287.56  317.19  281.18    0     256.74  219.47
Cycle7   392.48   79.89  121.01  108.08   35.96  256.74    0      65.64
Cycle8   532.5   115.73  160.87  133.58   85.93  219.47   65.64    0

Cycle 6, a “Double Dip” recession, is found to be an outlier due to high similarity distances to all other events in this study.
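The outlier call for Cycle 6 can be checked directly from Table 2. This minimal Python scan (the article's clustering itself was done in SAS) flags the cycle whose nearest neighbor is farthest away — exactly the "cluster with very few members" signature noted in the best practices:

```python
# Similarity matrix from Table 2 (eight U.S. recession/recovery cycles).
labels = ["Cycle%d" % i for i in range(1, 9)]
M = [
    [0,      193.67, 185.09, 184.12, 222.47, 341.67, 392.48, 532.5],
    [193.67, 0,      71.34,  233.86, 34.16,  280.74, 79.89,  115.73],
    [185.09, 71.34,  0,      229.11, 57.79,  287.56, 121.01, 160.87],
    [184.12, 233.86, 229.11, 0,      176.47, 317.19, 108.08, 133.58],
    [222.47, 34.16,  57.79,  176.47, 0,      281.18, 35.96,  85.93],
    [341.67, 280.74, 287.56, 317.19, 281.18, 0,      256.74, 219.47],
    [392.48, 79.89,  121.01, 108.08, 35.96,  256.74, 0,      65.64],
    [532.5,  115.73, 160.87, 133.58, 85.93,  219.47, 65.64,  0],
]

# A cycle whose *nearest* neighbor is still far away has no natural cluster
# partner -- the signature of an outlier (a tiny, one-member cluster).
nearest = {labels[i]: min(M[i][j] for j in range(len(M)) if j != i)
           for i in range(len(M))}
outlier = max(nearest, key=nearest.get)  # -> 'Cycle6'
```

Cycle 6's closest neighbor is still 219.47 away, far larger than any other cycle's nearest-neighbor distance.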

unemployment and recession/recovery by their pattern over time, with “Double Dip” recessions as a cluster (Figure 10 and Table 2). Applications are also found in biostatistics. Growth rates of biological systems can be classified or segmented. Similarity analysis has been used to classify Ebola outbreaks in Africa, providing comparisons for future outbreaks by identifying similar events in the past.

Best Practices for Similarity Analysis

Best practices for using similarity analysis include:

• When clustering, examine both point-in-time variables and rate-of-change data.

• When data are “unevenly spaced” (that is, the time interval between observations is not constant), treat them as an evenly spaced time series with missing data and impute the missing values.

• Where highly volatile variables are present, consider applying a data transformation such as standardization before clustering.

• Watch out for outliers, which will appear as time series clusters with very few members.

• Once the data have been transformed and, in the case of similarity analysis, the similarity distances have been calculated, the standard clustering best practices still apply.

EXPLORE MORE

Joseph M. Hilbe wrote an article about astrostatistics for the January 2014 issue of Amstat News. Titled “Astrostatistics: The Re-Emergence of a Statistical Discipline,” it is an excellent place to begin to learn more.

The Center for Astrostatistics at Penn State University has a tremendous wealth of resources for this area of research. They maintain a web portal at asaip.psu.edu. Center leaders Eric Feigelson and G. Jogesh Babu have written a textbook, Introduction to Astrostatistics and R, available from Berkeley Center for Cosmological Physics.

About the Author

With a PhD in astrophysics from the University of Toledo, David J. Corliss performs time series analyses of evolving stars and stellar populations. He is active in the ASA, serving on the steering committee of the Conference on Statistical Practice and as past president of the Detroit chapter. Statistical methods from his astrophysics research also find application in Data for Social Good, including the monthly column he writes for Amstat News.

[The Odds of Justice] Mary W. Gray, Column Editor

Alexa Did It!

We previously examined the dangers inherent in the ubiquity of algorithms, particularly when their structure is difficult to discover and evaluate due to intellectual property protection. For example, how can a defense attorney ascertain the factors used in a secret sentencing algorithm? What if, instead, we are dealing with the results of what a machine has learned?

What, aside from a technical definition, is machine learning? Is it only new technology or does it raise questions about its form of life? Does it just broaden the scope and depth of issues such as privacy, security, and the unethical and unlawful use of data, or are we facing a fundamental change in species?

What is the character we assign to the “machines” of machine learning? We may have scoffed at the renegade “Hal” of 2001, but where would the responsibility lie for resulting actions were we faced with such a situation? Leaving aside the controversial question of when human personhood begins, we know that some decidedly non-human entities have been found by the U.S. Supreme Court to have certain “rights,” such as to contribute to a political campaign,1 but responsibilities? Maybe to pay taxes, but not to be guilty of murder?

Corporate structures in general may be complex and obscured, but at each step, at least in theory, we can discern a human hand. Now, though, we are talking about autonomous diagnostic and treatment tools, vehicles, identification and selection programs, weapons—who knows what next? How will the machine cope with the “unknown unknowns” of Donald Rumsfeld? What happens when adversarial attacks occur—a few pixels can change the world?

These questions go to the nature of machine learning, a nature that is also critical to the kind of intellectual property protection it receives. The U.S. Constitution provides for patent and copyright protection:

To promote the progress of science and useful arts, by securing for limited times to authors and inventors the exclusive right to their respective writings and discoveries.

But the history of exactly what protection should be accorded to new forms of science and useful

1 Citizens United v. Federal Election Commission, 558 U.S. 310 (2010).

CHANCE 59 arts is not straightforward. Thou- Six years later,5 the court decided extent of the patent explosion is sands of already-granted patents that merely tying an algorithm the suit, later settled, by Amazon of genes were invalidated when, to a process did not make the pro- against Barnes and Noble for in 2013, the Supreme Court ruled cess patent-eligible; rather, the infringement of Amazon’s “1-click” that human genes cannot be pat- requirement was whether a trans- shopping method patent. In fact, ented because DNA is a “product formation of an article from one many thought that the prolif- of nature.”2 But DNA manipulated state to another occurred. Aha! eration of patents threatened the in a lab, specifically complementary How about a computer program innovation the patent system was DNA (cDNA), remained eligible. using a well-known formula (the supposed to promote. This synthetic DNA is produced Arrhenius Equation) to calculate As the pendulum swung back, from the messenger RNA that pro- when rubber was transformed from State Street and its progeny were vides the instructions for making uncured to cured?6 superseded by In re Bilski,9 where proteins. Is there an analogy in A series of cases followed, the U.S. Court of Appeals for the artificial neural networks? based on the notion that a novel Federal Circuit Court rejected To achieve patent protection, algorithm combined with a trivial Bilski’s business method patent an invention or discovery must be physical step constitutes a novel because it was based on an abstract novel and non-obvious, but also physical device—that is, load an idea that would largely have pre- involve patentable subject matter; algorithm on a computing device empted hedging as a business prac- “a useful process, machine, manu- to get a new patent-eligible tice (back to “you can’t patent the facture, or composition of matter, machine. In re Lowry7 held that a quadratic formula” days). 
In CLS or any new and useful improve- data structure on a computer’s hard Bank International v. Alice Corp.,10 ment thereof may obtain a patent drive is similarly patent-eligible. a business method patent involv- therefore, subject to the conditions Although the terminology used ing a computer-implemented and requirements of this title.”3 in these cases was mathematical escrow service was rejected, lead- Is machine learning patent- formula or algorithm, other rulings ing to widespread invalidation of eligible? A patent gives its owner seemed to indicate that algorithms patents of software even though the legal right to exclude others that manipulated something other the Supreme Court never ruled from making, using, selling, and than numbers (e.g., images) might explicitly on software patents. importing an invention for a lim- be treated differently. But wait: In January of this year, ited period of years, in exchange Finally, however, the United the USPTO issued a new set of for publishing an enabling public States Patent and Trademark Office rules that would make it easier to disclosure of the invention. The (USPTO) issued new guidelines patent software! These, too, though. disclosure requirement can be cir- that led to granting patents to a will be tested in the courts. cumvented by relying instead on wide variety of software. We ask again, is machine trade secret law, where the essence Then, in State Street Bank v. learning software and if so, is it of the protection is secrecy. Signature, the Federal Circuit8 held patent-eligible? Exactly what The tale of patents of software that a numerical calculation that might the claims of the patent is a tangled one. In 1972, the produces a “useful, concrete and application be? 
Supreme Court decided that an tangible result”—e.g., a price—is If neither patent nor trade algorithm for correcting binary- patent-eligible, opening the door secret protection, how about coded decimal numbers into pure to a deluge of so-called “busi- relying on copyright law, which binary numbers was an abstract ness method patents.” Perhaps protects “original works of author- idea and thus not patent-eligible.4 the example that best shows the ship, fixed in a tangible medium

2 Association for Molecular Pathology v. Myriad Genetics, Inc., 569 U.S. 576 (2013).
3 35 U.S.C. §101.
4 Gottschalk v. Benson, 409 U.S. 63 (1972).
5 Parker v. Flook, 437 U.S. 583 (1978).
6 Diamond v. Diehr, 450 U.S. (1981).
7 32 F.3d 1579 (Fed. Cir. 1994).
8 149 F.3d 1398 (Fed. Cir. 1998).
9 545 F.3d 943 (Fed. Cir. 2008).
10 573 U.S. 206 (2014).

including literary, dramatic, musical, artistic, and other intellectual works.” However, copyrights provide much less protection than do patents; copyright covers only the expression, not the idea. While “tangible form” may be a problematic requirement here as well as in patent law, courts have had no trouble adapting intellectual property protection to other media not contemplated by the drafters of the Constitution.

Product of nature, process, abstraction, machine in the historic sense—whatever machine learning is, whether patent-eligible or not, the question becomes who owns the patent, copyright, or trade secret it represents—the one who had the idea, the one who structured the algorithm and its training, the one who wrote the code, or the “machine” itself and its learning? Millions, if not billions, can be at stake.

We can ask the same question about who is liable for the harm that might be caused when the machine acts as expected or intended, or when it does not. The wrong person is identified; the wrong treatment prescribed; the black ice is not recognized; a woman is thought to be a car; the applicant for housing, employment, or educational opportunity is rejected on a discriminatory basis; privacy or security is breached; the innocent is sentenced to death; the wrong house blown up. Is the developer of the algorithm, the organizer of the data, the physician or other actor who implements the output, the coder, or the machine itself liable?

A basic concept in law lets loss lie where it falls. The victim goes uncompensated unless it can be shown that the manufacturer acted negligently or unreasonably; that is, was “at fault.” But with complicated electronic gear, sophisticated vehicles, complex structures of any kind, it is unreasonable to expect that the end user can ascertain the source of the failure of a product. Then the manufacturer, architect, designer, etc., has to pay even though not at fault. This strict product liability holds the producer responsible when something goes wrong, even without showing that there was any negligence, as would be the case under contract law (and the existence of a contract is also not required in the case of product liability).

Once again, we ask whether software, and in particular machine learning, is a product. Generally, “product” is defined as any movable good—but what is a good? Usually some physical presence is assumed, so as a component in a physical product, software would probably be covered by product liability—but perhaps not standing alone. Liability for risk prediction, diagnosis, and treatment by machine learning has been discussed in the medical field, but not much elsewhere. Moreover, it may be difficult for end users to invoke products liability law because there are no common standards for industry practice in machine learning development. Is it a product and if so, where does the product liability lie?

It has been difficult to avoid introducing the very discriminatory profiling into algorithms that they were purportedly developed to prevent. To guard against this is even more difficult in machine learning. How can we program our ethical and legal principles into the algorithms and be certain the machine will be guided by them? At least recognition of the problem is a first step.

Nor has there been much introspection about the enhanced nature of privacy and security protection required in the face of machine learning. For example, can certain adversarial attacks on machine learning be prosecuted under the Computer Fraud and Abuse Act? How does synthesizing data to facilitate research and enhance privacy affect intellectual property protections? Given the nature of machine learning, who can be said to own the input and the output?

What can be agreed is that developers definitely have to consider the implications of the question of who or what is machine learning.

About the Author

Mary Gray, who writes The Odds of Justice column, is Distinguished Professor of Mathematics and Statistics at American University in Washington, DC. Her PhD is from the University of Kansas, and her JD is from Washington College of Law at American University. She received the Elizabeth Scott Award from the Committee of Presidents of Statistical Societies and the Karl Peace Award from the American Statistical Association. Her research interests include statistics and the law, economic equity, survey sampling, human rights, education, and the history of mathematics.

LETTER TO THE EDITOR

Reconsidering the Human Face as Boxplot
David C. Hoaglin

I welcome Sarjinder Singh’s creativity (“Box Plot Versus Human Face,” CHANCE 32(2):28–29), but the boxplot does not align with a human face as closely as he suggests, and it does not classify observations as outliers. In the standard boxplot (in his notation), the inner fences are:

LIF = Q1 − 1.5(Q3 − Q1)

UIF = Q3 + 1.5(Q3 − Q1)

and the outer fences are:

LOF = Q1 − 3(Q3 − Q1)

UOF = Q3 + 3(Q3 − Q1)
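The fences above turn into a short labeling routine, using Tukey's "fourths" as the quartile definition. This is an illustrative Python sketch, not code from the letter; the closing simulation is approximate and its share will vary with the random seed.

```python
import random

def fourths(x):
    """Tukey's fourths: depth f = (floor of the median's depth + 1) / 2,
    counted from each end of the sorted batch."""
    x = sorted(x)
    n = len(x)
    f = ((n + 1) // 2 + 1) / 2
    k = int(f)
    if f == k:  # integer depth: single order statistics
        return x[k - 1], x[n - k]
    return (x[k - 1] + x[k]) / 2, (x[n - k - 1] + x[n - k]) / 2

def label_values(data):
    """Label each value 'in', 'outside' (beyond an inner fence),
    or 'far out' (beyond an outer fence)."""
    q1, q3 = fourths(data)
    step = 1.5 * (q3 - q1)
    lif, uif = q1 - step, q3 + step          # inner fences
    lof, uof = q1 - 2 * step, q3 + 2 * step  # outer fences
    def one(v):
        if v < lof or v > uof:
            return "far out"
        if v < lif or v > uif:
            return "outside"
        return "in"
    return {v: one(v) for v in data}

# 100 lies beyond the upper outer fence of this small batch: "far out".
labels = label_values([1, 2, 3, 4, 5, 100])

# Null situation: share of standard-normal samples (n = 100) with at least
# one value beyond an inner fence; the letter reports about 53% for n = 100.
random.seed(42)
trials = 1000
hits = 0
for _ in range(trials):
    sample = [random.gauss(0.0, 1.0) for _ in range(100)]
    q1, q3 = fourths(sample)
    step = 1.5 * (q3 - q1)
    if min(sample) < q1 - step or max(sample) > q3 + step:
        hits += 1
share = hits / trials
```

The simulated share is a Monte Carlo estimate, so it lands near, not exactly at, the letter's tabulated percentage.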

(For Q1 and Q3 the boxplot uses a particular definition, known as the “fourths.”) Importantly, all the data values are real, but some may merit investigation. As in Professor Singh’s construction, the left and right whiskers end at the most-extreme data values that are inside the LIF and UIF, respectively. Data values outside the inner fences are plotted individually, and those beyond the outer fences receive larger plotting symbols.

Exploratory data analysis (EDA) does not automatically classify data values as “outliers.” It leaves that judgment to the analyst, after investigation. Data values between an inner fence and the corresponding outer fence are “outside,” and those beyond the outer fences are “far out.” Thus, Singh’s “mild outliers” are merely “outside” and his “extreme outliers” are “far out.”

It is helpful to have an idea of how often samples of well-behaved (i.e., normal) data contain observations that are outside or far out. In this null situation, the percentage of random samples that contain one

or more observations beyond the inner fences is 33% for n = 5, 20% for n = 10, 23% for n = 20, 36% for n = 50, and 53% for n = 100; and it approaches 100% as n becomes large. The irregularities in this sequence of percentages arise mainly from the definition of the fourths. For the outer fences, the corresponding percentages are 13.4%, 2.9%, 1.2%, 0.5%, and 0.2%. Thus, in well-behaved data, samples containing far out observations are fairly rare, but samples containing outside observations are fairly common.

Another characteristic of the boxplot is the percentage of observations that are outside (including far out), supplemented by the percentage of observations that are far out—again, in well-behaved data. The percentage of observations beyond the inner fences is 8.6% for n = 5, 2.8% for n = 10, 1.7% for n = 20, 1.15% for n = 50, 0.95% for n = 100, and 0.70% in a normal population. For the outer fences, the corresponding percentages are 3.3%, 0.36%, 0.074%, 0.011%, 0.002%, and 0.00023%. Thus, even normal distributions produce occasional far out observations.

In published articles, boxplots usually summarize part of a completed analysis. When they identify data values as outside, however, the analysis is generally at an early stage. Conceptually, an outlier comes from a different population or mechanism than the bulk of the data. In considering whether an outside data value may actually be an outlier, one can use a stem-and-leaf display or a dot plot to look closer at the whole batch of data. An isolated outside or, especially, far out data value may have helpful background information.

A batch may consist of distinct clusters; a boxplot may hint at such structure, but it is not designed to reveal it. If the outside data values appear at only one end of the boxplot (often the upper end), a transformation may make the batch more nearly symmetric, suggesting that the presence of outside data values is a consequence of the initial scale of the data.

Further Reading

Emerson, J.D., and Strenio, J. 1983. Boxplots and batch comparison. In Understanding Robust and Exploratory Data Analysis. Hoaglin, D.C., Mosteller, F., and Tukey, J.W., eds. New York: John Wiley & Sons, 58–96.

Frigge, M., Hoaglin, D.C., and Iglewicz, B. 1989. Some implementations of the boxplot. The American Statistician 43:50–4.

Hoaglin, D.C., Iglewicz, B., and Tukey, J.W. 1986. Performance of some resistant rules for outlier labeling. Journal of the American Statistical Association 81:991–9.

Tukey, J.W. 1977. Exploratory Data Analysis. Reading, MA: Addison-Wesley.

About the Author

David C. Hoaglin is an independent consultant based in Sudbury, Massachusetts.

