Springer Series in Astrostatistics
Asis Kumar Chattopadhyay Tanuka Chattopadhyay Statistical Methods for Astronomical Data Analysis Springer Series in Astrostatistics
Editor-in-chief: Joseph M. Hilbe, Jet Propulsion Laboratory, and Arizona State University, USA
Jogesh Babu, The Pennsylvania State University, USA Bruce Bassett, University of Cape Town, South Africa Steffen Lauritzen, Oxford University, UK Thomas Loredo, Cornell University, USA Oleg Malkov, Moscow State University, Russia Jean-Luc Starck, CEA/Saclay, France David van Dyk, Imperial College, London, UK
Springer Series in Astrostatistics, More information about this series at http://www.springer.com/series/1432 Springer Series in Astrostatistics
Astrostatistical Challenges for the New Astronomy: ed. Joseph M. Hilbe Astrostatistics and Data Mining: ed. Luis Manuel Sarro, Laurent Eyer, William O’Mullane, Joris De Ridder Asis Kumar Chattopadhyay • Tanuka Chattopadhyay
Statistical Methods for Astronomical Data Analysis
123 Asis Kumar Chattopadhyay Tanuka Chattopadhyay Department of Statistics Department of Applied Mathematics University of Calcutta University of Calcutta Calcutta, India Calcutta, India
ISSN 2199-1030 ISSN 2199-1049 (electronic) ISBN 978-1-4939-1506-4 ISBN 978-1-4939-1507-1 (eBook) DOI 10.1007/978-1-4939-1507-1 Springer New York Heidelberg Dordrecht London
Library of Congress Control Number: 2014945364
Springer Science+Business Media New York 2014 This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection with reviews or scholarly analysis or material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of this publication or parts thereof is permitted only under the provisions of the Copyright Law of the Publisher’s location, in its current version, and permission for use must always be obtained from Springer. Permissions for use may be obtained through RightsLink at the Copyright Clearance Center. Violations are liable to prosecution under the respective Copyright Law. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publica- tion does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. While the advice and information in this book are believed to be true and accurate at the date of pub- lication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com) To the Astrostatistics Community
Preface
Universe is deadly adventurous and man’s eternal quest to unfold its mysteries will never cease. Astronomy, perhaps the oldest observational science, has got its spectacular emergence with the advent of theoretical astrophysics and it has started to spread in India after some important contributions by Indian scientists in 1920s and later. Astronomy in the recent past has developed a lot with the launch of several missions like GALEX (Galaxy Evolution Ex- plorer), Kepler Space Telescope, Hubble Space Telescope (HST), etc. through which terabytes of data are available for preservation. Hence several virtual archives like SDSS (Sloan Digital Sky Survey), MAST (Multimission Archive at STSCI), Vizier, EDD, LEDA, Chandra, etc. have been developed to pre- serve the one-time snapshots of various astronomical events. During the last two decades galaxy formation theory and their related star formation histories have drawn interests among the astrophysicists to a great extent to uncover these mysteries using the reach treasure of virtual archives. While digging the pathway, a new branch ASTROSTATISTICS (or Statistical Astronomy) has emerged since 1980s. It is a blending of statistical analysis of astronomical data along with the development of new statistical techniques useful to analyze astrophysical phenomenon. The target is not only to explore the formation and evolutionary history of galaxies but also to uncover the unknown facts related to star formation, gamma ray bursts, supernova and other intrinsic variable stars. Where much data are not available, model- based approaches may be adopted. In this book we have tried to introduce “Astrostatistics” as a subject just like biostatistics. Through the various chapters we have discussed the basic concepts of both Astrophysics and Statistics along with the possible sources of Astronomical data. Subsequently we have entered into different types of applications of statistical techniques already developed or specifically introduced for astrophysical problems. We have discussed on techniques like regression, clustering and classification, missing data problems, simulation, data mining and time series analysis. Along with the discussion, specific examples are given so that readers can easily digest the method and can apply it to their own problem. Finally we have included a chapter on the use of R package which is an open access software. There we have included several examples which will be helpful for the readers. In the appendix we have included some astronomical data sets which we have used in our examples. We are indeed grateful to many persons and organizations for helping us during the preparation of the manuscript. In particular, we are grate- ful to Professor Ajit Kembhavi, Director, IUCAA, India, for continuous vii viii Preface encouragement and providing new ideas related to “fundamental plane” concept. We are also deeply indebted to Professor Sailajananda Mukherjee, Retired Professor and Former Head, Department of Physics, North Bengal University, India, for carefully reading the Astrophysical part of the book and suggesting significant improvements. We will always remember the cooper- ation and support we have received from Professor Joseph Hilbe, Emeritus Professor, University of Hawaii, USA. We will also remain grateful to Pro- fessors Jogesh Babu and Eric Feigelson, Pensylvania State University, USA, whose contributions inspired us to start work in this area. We offer our heartiest thanks to our collaborators Didier Fraix Burnet, Emmanuel Davoust, Margarita Sharina, Ranjeev Misra and Malay Naskar. We feel proud of our students Saptarshi Mondal, Tuli De, Abisa Sinha, Pradeep Karmakar and Bharat Warule for their sincere efforts and dedication. Being faculty members of Calcutta University, India, we are really grate- ful to our Vice-Chancellor Professor Suranjan Das and Pro-Vice-Chancellor (Academic) Professor D.J. Chattopadhyay for their continuous support. Finally we thank Mr. Aparesh Chatterjee for carefully preparing the typescript of this book.
Calcutta, India Asis Kumar Chattopadhyay August 15, 2014 Tanuka Chattopadhyay Contents
1 Introduction to Astrophysics ...... 1 1.1 Light and Radiation ...... 1 1.2 Sources of Radiation ...... 7 1.3 Brightness of Stars ...... 8 1.3.1 Absolute Magnitude and Distance ...... 9 1.3.2 Magnitude–Luminosity Relation ...... 10 1.3.3 Different Photometry Systems ...... 11 1.3.4 Stellar Parallax and Stellar Distances ...... 12 1.3.5 Doppler Shift and Stellar Motions ...... 14 1.4 Spectral Characteristics of Stars ...... 15 1.5 Spectral Features and Saha’s Ionization Theory ...... 17 1.6 Celestial Co-ordinate Systems ...... 23 1.7 Hertzsprung–Russel Diagram ...... 26 1.8 Stellar Atmosphere ...... 27 1.9 Stellar Evolution and Connection with H–R Diagram ... 41 1.10 Variable Stars ...... 53 1.11 Stellar Populations ...... 62 1.11.1 Galactic Clusters ...... 62 1.11.2 Globular Clusters ...... 63 1.11.3 Fragmentation of Molecular Clouds and Initial Mass Function (IMF) ...... 65 1.12 Galaxies ...... 69 1.13 Quasars ...... 79 1.14 Pulsars ...... 81 1.15 Gamma Ray Bursts ...... 84 Appendix ...... 85 Exercise ...... 87 References ...... 88 2 Introduction to Statistics ...... 91 2.1 Introduction ...... 91 2.2 Variable ...... 92 2.2.1 Discrete-Continuous ...... 92 2.2.2 Qualitative–Quantitative ...... 92 2.2.3 Cause and Effects ...... 93 2.3 Frequency Distribution ...... 93 2.3.1 Central Tendency ...... 93 2.3.2 Dispersion ...... 94 ix x Contents
2.3.3 Skewness ...... 94 2.3.4 Kurtosis ...... 95 2.4 Exploratory Data Analysis ...... 95 2.4.1 Histogram ...... 96 2.4.2 Box Plot ...... 96 2.5 Correlation ...... 97 2.5.1 Scatter Plot ...... 98 2.6 Regression ...... 99 2.7 Multiple Correlation ...... 102 2.8 Random Variable ...... 102 2.8.1 Some Important Discrete Distribution ...... 103 2.8.2 Some Important Continuous Distributions ..... 107 3 Sources of Astronomical Data ...... 109 3.1 Introduction ...... 109 3.2 Sloan Digital Sky Survey ...... 109 3.3 Vizier Service ...... 114 3.4 Data on Eclipsing Binary Stars ...... 114 3.5 Extra Galactic Distance Data Base (EDD) (edd.ifa.hawaii.edu/ index.html) ...... 115 3.6 Data on Pulsars ...... 115 3.7 Data on Gamma Ray Bursts ...... 115 3.8 Astronomical and Statistical Softwares ...... 116 Exercises ...... 117 4 Statistical Inference ...... 119 4.1 Population and Sample ...... 119 4.2 Parametric Inference ...... 120 4.2.1 Point Estimation ...... 121 4.2.1.1 Unbiasedness ...... 121 4.2.1.2 Efficiency ...... 122 4.2.1.3 Maximum Likelihood Estimator (MLE) ...... 123 4.2.2 Interval Estimation ...... 123 4.3 Testing of Hypothesis ...... 124 4.3.1 p-Value ...... 125 4.3.2 One Sample and Two Sample Tests ...... 126 4.3.3 Common Distribution Test ...... 128 4.4 Empirical Distribution Function ...... 128 4.5 Nonparametric Approaches ...... 130 4.5.1 Kolmogorov–Smirnov One Sample Test ...... 130 4.5.2 Kolmogorov–Smirnov Two Sample Test ...... 131 4.5.3 Shapiro–Wilk Test ...... 132 4.5.4 Wilcoxon Rank-Sum Test ...... 133 4.5.5 Kruskal–Wallis Two Sample Test ...... 134 Reference ...... 135 Contents xi
5 Advanced Regression and Its Applications with Measurement Error ...... 137 5.1 Introduction ...... 137 5.2 Simple Regression ...... 138 5.3 Multiple Regression ...... 138 5.3.1 Estimation of Parameters in Multiple Regression . 139 5.3.2 Goodness of Fit ...... 141 5.3.3 Regression Line Through the Origin ...... 142 5.4 Effectiveness of the Fitted Model ...... 142 5.5 Best Subset Selection ...... 143 5.5.1 Forward and Backward Stepwise Regression .... 144 5.5.2 Ridge Regression ...... 144 5.5.3 Least Absolute Shrinkage and Selection Operator (LASSO) ...... 145 5.5.4 Least Angle Regression (LAR) ...... 145 5.6 Multicollinearity ...... 146 5.7 Regression Problem in Astronomical Research (Mondal et al. 2010) ...... 147 5.7.1 Regression Planes and Symmetric Regression Plane ...... 149 5.7.2 The Symmetric Regression Plane with Intercept ...... 152 References ...... 154 6 Missing Observations and Imputation ...... 155 6.1 Introduction ...... 155 6.2 Missing Data Mechanism ...... 155 6.2.1 Missingness Completely at Random (MCAR) . . . 155 6.2.2 Missingness at Random (MAR) ...... 156 6.2.3 Missingness that Depends on Unobserved Predictors and the Missing Value Itself ...... 156 6.3 Analysis of Data with Missing Values ...... 156 6.3.1 Complete Case Analysis ...... 156 6.3.2 Imputation Methods ...... 157 6.3.2.1 Mean Imputation ...... 157 6.3.2.2 Hot Deck Imputation (Andridge and Little 2010) ...... 157 6.3.2.3 Cold Deck Imputation (Shao 2000) .... 158 6.3.2.4 Warm Deck Imputation ...... 159 6.4 Likelihood Based Estimation: EM Algorithm ...... 159 6.5 Multiple Imputation ...... 161 References ...... 162 xii Contents
7 Dimension Reduction and Clustering ...... 163 7.1 Introduction ...... 163 7.2 Principal Component Analysis ...... 164 7.2.1 An Example Related to Application of PCA (Babu et al. 2009) ...... 167 7.2.1.1 The Correlation Vector Diagram (Biplot) ...... 169 7.3 Independent Component Analysis ...... 172 7.3.1 ICA by Maximization of Non-Gaussianity ..... 175 7.3.2 Approximation of Negentropy ...... 176 7.3.3 The FastICA Algorithm ...... 176 7.3.4 ICA Versus PCA ...... 177 7.3.5 An Example (Chattopadhyay et al. 2013) ..... 179 7.4 Factor Analysis ...... 182 7.4.1 Method of Estimation ...... 185 7.4.2 Factor Rotation ...... 188 References ...... 190 8 Clustering, Classification and Data Mining ...... 193 8.1 Introduction ...... 193 8.2 Hierarchical Cluster Technique ...... 193 8.2.1 Agglomerative Methods ...... 194 8.2.2 Distance Measures ...... 194 8.2.3 Single Linkage Clustering ...... 195 8.2.4 Complete Linkage Clustering ...... 195 8.2.5 Average Linkage Clustering ...... 196 8.3 Partitioning Clustering: k-Means Method ...... 196 8.4 Classification ...... 197 8.5 An Example (Chattopadhyay et al. 2007) ...... 200 8.5.1 Cluster Analysis of BATSE Sample and Discriminant Analysis ...... 201 8.5.2 Cluster Analysis of HETE 2 and Swift Samples . . 204 8.6 Clustering for Large Data Sets: Data Mining ...... 207 8.6.1 Subspace Clustering ...... 207 8.6.2 Clustering in Arbitrary Subspace Based on Hough Transform: An Application (Chattopadhyay et al. 2013) ...... 209 8.6.2.1 Input Parameters ...... 211 8.6.2.2 Data Set ...... 211 8.6.2.3 Experimental Evaluation ...... 211 8.6.2.4 Properties of the Groups ...... 212 References ...... 215 Contents xiii
9 Time Series Analysis ...... 217 9.1 Introduction ...... 217 9.2 Several Components of a Time Series ...... 218 9.3 How to Remove Various Deterministic Components from a Time Series ...... 219 9.4 Stationary Time Series and Its Significance ...... 220 9.5 Autocorrelations and Correlogram ...... 220 9.6 Stochastic Process and Stationary Process ...... 221 9.7 Different Stochastic Process Used for Modelling ...... 223 9.7.1 Linear Stationary Models ...... 223 9.7.2 Linear Non Stationary Model ...... 227 9.8 Fitting Models and Estimation of Parameters ...... 228 9.9 Forecasting ...... 230 9.10 Spectrum and Spectral Analysis ...... 232 9.11 Cross-Correlation Function (wcross(θ)) ...... 235 References ...... 240 10 Monte Carlo Simulation ...... 241 10.1 Generation of Random Numbers ...... 242 10.2 Test for Randomness ...... 245 10.3 Generation of Random Numbers from Various Distributions ...... 246 10.4 Monte Carlo Method ...... 256 10.5 Importance Sampling ...... 258 10.6 Markov Chain Monte Carlo (MCMC) ...... 261 10.7 Metropolis–Hastings Method ...... 261 References ...... 275 11 Use of Softwares ...... 277 11.1 Introduction ...... 277 11.2 Preliminaries on R ...... 277 11.3 Advantages of R Programming ...... 278 11.4 How to Get R Under Ubuntu Operating System ...... 279 11.5 Basic Operations ...... 279 11.5.1 Computation ...... 279 11.5.2 Vector Operations ...... 280 11.5.3 Matrix Operations ...... 281 11.5.4 Graphics in “R” ...... 285 11.6 Some Statistical Codes in R ...... 292 Appendix ...... 303 About the Authors ...... 341 Index ...... 343 Chapter - 1
Introduction to Astrophysics
1.1 Light and Radiation
The most important carrier of information from all kinds of heavenly bodies is light. Light is an electromagnetic wave which is characterized by an elec- tric field E and magnetic field H, and the direction of its propagation is at right angles to both these fields. Unlike mechanical waves, e.g. sound waves on the ocean, it can propagate through empty space. So, if Z-direction is the direction of its propagation (Fig. 1.1),thenY and X directions are the directions of electric and magnetic fields. The wave motion can be described by a sine curve and the distance between two consecutive peaks measures the wavelength λ, which is an important characterization of the propagating light.
The number of oscillations of electromagnetic wave per unit time at a given point is called the frequency ν of the electromagnetic radiation so that
λν = c (1.1) where c is the speed of light and its value in vacuum is 3×108 m/s. The entire electromagnetic spectrum is shown in Fig. 1.2. It consists of Gamma rays, X- rays, ultraviolet, visible light, infrared, microwaves and radio waves of which visible range covers a very small window (400–700 nm, 1 nm = 10−7 cm). For detection of spectra in different regimes appropriate detectors are required, e.g. for visible light optical telescope is required whereas X-ray or infrared telescopes are required to detect X-ray and infrared rays from astronomical sources.
Laws of Radiation in Thermodynamic Equilibrium
When a system filled with gas and radiation is kept isolated from its sur- roundings, then the system will eventually have equal temperature at all points and we say that the system is in thermal equilibrium. This happens because of frequent collisions among the constituents of the system leading to exchange of energy which helps the system to reach rapidly to a state
© Springer Science+Business Media New York 2014 1 A.K. Chattopadhyay, T. Chattopadhyay, Statistical Methods for Astronomical Data Analysis, Springer Series in Astrostatistics 3, DOI 10.1007/978-1-4939-1507-1 1 2 1 Introduction to Astrophysics of thermal equilibrium. In other words when mean free path, which is the path between two successive collisions, becomes sufficiently small (in case of electromagnetic radiation it is the mean free path of photons), the system soon attains thermal equilibrium.
A system is said to be in mechanical equilibrium if the velocity distribution follows Maxwellian velocity distribution.
E
Direction of propagation H
Figure 1.1 Electric field, magnetic field and direction of propagation
A system is said to be in chemical equilibrium if chemical activities have no net change over time.
A system is said to be in thermodynamic equilibrium (TE) when it is simul- taneously in thermal, mechanical as well as in chemical equilibrium.
Black Body Radiation
According to Kirchoff’s law the ratio of the emissive power to its absorption power at a given temperature T is the same for all bodies. So this ratio is universal function of temperature T of the body and the frequency ν of radiation falling on it. A black body (BB) is something which absorbs all the radiation falling on it. So, absorption power of BB is unity. This shows that the emissivity is universal function of ν and T provided the body is in TE, having a constant temperature T .
Specific Intensity (Iν )
It is the amount of energy crossing per unit area of the emitting surface, placed perpendicular to the direction of propagation of light in unit time, per unit solid angle, per unit frequency interval. 1.1 Light and Radiation 3
Figure 1.2 Electromagnetic spectrum
If θ be the angle between the direction of propagation p and normaln ˆ to the surface of area dσ,then(Fig.1.3)
dE I = ν (1.2) ν dσ cos θdwdν
Mean Intensity (Jν )
It is the mean intensity obtained by averaging Iν over all directions. 2π π Iν dw 1 Jν = = Iν sin θdθdφ (1.3) dw 4π φ=0 θ=0
For isotropic radiation (independent of φ and θ)
Jν = Iν = Bν (1.4) which is true for BB radiation.
Flux (Fν )
It is the total amount of energy crossing per unit area in unit time, per unit frequency in all directions. 4 1 Introduction to Astrophysics
dω p dσ n dθ
Figure 1.3 Propagation of electromagnetic radiation