ASTRONOMY 630: Numerical and statistical methods in astrophysics
CLASS NOTES
Spring 2020
Instructor: Jon Holtzman
1 Introduction
What is this class about? What is statistics? Read and discuss Feigelson & Babu 1.1.2 and 1.1.3. The use of statistics in astronomy may not be as controversial as it is in other fields, probably because astronomers feel that there is basic physics that underlies the phenomena we observe.

What is data mining and machine learning? Read and discuss Ivezić 1.1 (Astroinformatics).

What are some astronomy applications of statistics and data mining? Classification, parameter estimation, comparison of hypotheses, absolute evaluation of hypotheses, forecasting, finding substructure in data, finding correlations in data. Some examples:
• Classification: what is the relative number of different types of planets, stars, galaxies? Can a subset of observed properties of an object be used to classify the object, and how accurately? e.g., emission line ratios
• Parameter estimation: Given a set of data points with uncertainties, what are the slope and amplitude of a power-law fit? What are the uncertainties in the parameters? Note that this assumes that the power-law description is valid.
• Hypothesis comparison: Is a double power-law better than a single power-law? Note that hypothesis comparisons are trickier when the number of parameters is different, since one must decide whether the fit to the data is sufficiently better given the extra freedom in the more complex model. A simpler comparison would be single power-law vs. two constant plateaus with a break at a specified location, both with two parameters. But a better fit does not necessarily guarantee a better model!
• Absolute evaluation: Are the data consistent with a power-law? Absolute assessments of this sort can be more problematic than hypothesis comparisons.
• Forecasting of errors: How many more objects, or what reduction of uncertainties, would allow single and double power-law models to be clearly distinguished? Need to specify goals, and assumptions about data. Common need for observing proposals, grant proposals, satellite proposals ...
• Finding substructure: 1) spatial structure, 2) structure in distributions, e.g., separating peaks in RVs or MDFs, 3) given a multi-dimensional data set, e.g., abundance patterns of stars, can you identify homogeneous groups, e.g., chemical tagging?
• Finding correlations among many variables: given multi-dimensional data, are there correlations between different variables, e.g., HR diagram, fundamental plane, abundance patterns
More examples in Ivezić et al. 1.1.

Note the general explosion of the field in the last decade (e.g., CCA and other institutes). Familiarity with these topics is critical, and mastery even better.

Clearly, numerical/computational techniques are an important component of astrostatistics. A previous course, ASTR 575, attempted (with poor success?) to combine these....

Initial anticipation (!) of ASTR 630 class topics:
• Intro: usage of data/statistical analysis in astronomy
• Basic probability and statistics. Conditional probabilities. Bayes theorem
• Statistical distributions in astronomy. Point estimators: bias, consistency, efficiency, robustness
• Random number generators. Simulating data.
• Fitting models to data. Parametric vs non-parametric models. Frequentist vs Bayesian. Least squares: linear and non-linear. Regularization.
• Bayesian model analysis. Marginalization. MCMC.
• Handling observational data: Correlated and uncorrelated errors. Forward modeling.
• Determining uncertainties in estimators/parameters: bootstrap and jackknife.
• Hypothesis testing and comparison
• Non-parametric modeling
• Issues in modeling real data. Forward modeling. Data driven analysis
• Clustering analysis. Gaussian mixture models. Extreme deconvolution. K-means.
• Dimensionality reduction: principal components, t-SNE, etc.
• Deep learning and neural networks
• ?Fourier analysis (probably not)
Logistics: Canvas, class notes, grading TBD (homework, no exams?)

Resources:
• Ivezić et al.: Statistics, Data Mining, and Machine Learning in Astronomy (2014), with Python
• Bailer-Jones: Practical Bayesian Inference (2017), with R
• Feigelson & Babu: Modern Statistical Methods for Astronomy: With R Applications (2012)
• Robinson : Data Analysis for Scientists and Engineers (2016)
• Press et al., Numerical Recipes (3rd edition, 2007)
Web sites:
• California-Harvard Astrostatistics Collaboration
• Center for Astrostatistics at Penn State: note annual summer school in Astrostatistics, also Astronomy and Astroinformatics portal
• International Computing Astronomy Group
2 Probability, basic statistics, and distributions
Why study probability?
• Probability and statistics: in many statistical analyses, we are talking about the probability of different hypotheses being consistent with data, so it is important to understand the basics of probability.
• In many circumstances, we describe variation in data (either intrinsic or experimental) with the probability that we will observe different outcomes
Different approaches to probability: classical, frequentist, subjective, axiomatic (see here):
• classical: calculation of the ratio of favourable to total outcomes. Does not involve experiment. Computed by counting $N_E$ favourable events out of $N$ total possible events; all outcomes must be equally likely. Cannot handle an infinite number of possible outcomes. Examples: coins, dice, cards.
• frequentist: perform an experiment $n$ times, testing for the occurrence of event $E$, and define:
$$P(E) = \lim_{n\rightarrow\infty} \frac{n(E)}{n}$$
In practice, we must estimate this from a finite number of trials; we postulate that $n(E)/n$ approaches a limit as $n$ goes to infinity. Not possible for single events! Example: a loaded die (a simulation of this convergence is sketched after this list).
• subjective: judgement based on intuition; a combination of rational statistics, personal observations, and, potentially, superstition. Easily biased by non-rational use of data. In real-world situations, often all we have! Example: weather.
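To illustrate the frequentist definition, here is a minimal Python sketch estimating the probability of rolling a six with a loaded die from an increasing number of trials (the specific bias of the die is an assumed number, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(42)

# A loaded die: face 6 comes up with probability 0.3 (assumed for illustration)
faces = np.arange(1, 7)
probs = [0.14, 0.14, 0.14, 0.14, 0.14, 0.30]

for n in (100, 10_000, 1_000_000):
    rolls = rng.choice(faces, size=n, p=probs)
    # frequentist estimate: n(E)/n for the event E = "rolled a six"
    print(f"n = {n:>9}: P(six) ~ {np.mean(rolls == 6):.4f}")
```

As n grows, the estimate n(E)/n converges toward the true value of 0.3.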
Axiomatic approach: provide a logically consistent set of axioms to work with probabilities, however they are determined. Define a random experiment H (with nondeterministic outcomes), a sample description space Ω (the set of all outcomes), and an event E (a subset of the sample description space which satisfies certain constraints). Consider the following axioms. For any event E:
$$0 \le P(E) \le 1$$
$$P(E') = 1 - P(E)$$
where $E'$ is the complement of E (i.e., not E). For mutually exclusive (no overlap) outcomes only:
$$P(A {\rm\ or\ } B) \equiv P(A \cup B) = P(A) + P(B)$$
If $E_i$ are the set of mutually exclusive and exhaustive outcomes (covers all possibilities), then:
$$\sum P(E_i) = 1$$
These imply, for non mutually exclusive outcomes:
$$P(A \cup B) = P(A) + P(B) - P(A \cap B)$$
where $P(A \cap B) \equiv P(A {\rm\ and\ } B) \equiv P(A, B)$ in different notations, used in different books. Note the graphical (Venn diagram) representation.

If events A and B are independent, i.e., one has no effect on the probability of the other, then
$$P(A \cap B) = P(A) P(B)$$
This serves as a definition of independence. Note that if more than two conditions are considered (e.g., A, B, C), then independence requires that all pairs of conditions be independent.

Caution: addition of probabilities only for mutually exclusive events! Often, it is convenient to think about the complement of events when approaching problems, and couch the problem in terms of "and" rather than "or".

Example: If there is a 20% chance of rain today and a 20% chance of rain tomorrow, what is the probability that it will rain in the next two days?

Example: we draw 2 cards from a shuffled deck. What is the probability that they are both aces given a) we replace the first card, and b) we don't?

Example: consider a set of 9 objects of three different shapes (square, circle, triangle) and three different colors (red, green, blue). What is the probability we will get a blue object? A triangle? What is the probability that we will get a blue triangle? Are these independent?
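As a check on the first two examples, here is a short sketch working them out numerically; the answers follow directly from the rules above:

```python
# Rain: P(rain in next two days) = 1 - P(no rain today AND no rain tomorrow),
# assuming the two days are independent
p_rain = 1 - (1 - 0.2) * (1 - 0.2)
print(p_rain)  # 0.36, not 0.4: naive addition only works for exclusive events

# Two aces, with replacement: the draws are independent
print((4 / 52) * (4 / 52))  # ~0.0059

# Two aces, without replacement: second draw is conditional on the first
print((4 / 52) * (3 / 51))  # ~0.0045
```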
Understand and be able to use basic axioms of probability.
2.1 Conditional probabilities

Sometimes, however, we have the situation where the probability of an event A is related to the probability of another event B, so if we know something about the other event, it changes the probability of A. This is called a conditional probability, P(A|B), the probability of A given B. If P(A|B) = P(A), then the events are independent.
What are some astronomy examples? Example: Imagine we have squares, circles and triangles again, but only blue triangles. What is the probability we will have a triangle if we only look at blue objects? Thinking about this in the classical approach:
$$P({\rm blue}) = \frac{n({\rm blue})}{n}$$
$$P({\rm blue} \cap {\rm triangle}) = \frac{n({\rm blue\ triangles})}{n}$$
$$P({\rm triangle}\,|\,{\rm blue}) = \frac{n({\rm blue\ triangles})}{n({\rm blue})}$$
Generalizing to the axiomatic representation:
$$P({\rm triangle}\,|\,{\rm blue}) = \frac{P({\rm blue} \cap {\rm triangle})}{P({\rm blue})}$$
which serves as a definition of conditional probability, P(A|B), the probability of A given B:
$$P(A|B) = \frac{P(A \cap B)}{P(B)}$$
Note that, by the same reasoning:
$$P(B|A) = \frac{P(A \cap B)}{P(A)}$$
But, certainly, $P(A|B) \neq P(B|A)$! Given the definitions,
$$P(B|A) P(A) = P(A|B) P(B)$$
which is known as Bayes theorem. Note that this is a basic result that is not controversial, but some applications of Bayes theorem can be: the Bayesian paradigm extends this to the idea that the quantity A or B can also represent a hypothesis, or model, i.e., one talks about the relative probability of different models.

If we consider a set of k mutually exclusive events $B_i$ that cover all possibilities (e.g., red, green and blue), then the probability of some other event A is:
$$P(A) = P(A \cap B_1) + P(A \cap B_2) + \ldots + P(A \cap B_k)$$
$$P(A) = P(A|B_1)P(B_1) + P(A|B_2)P(B_2) + \ldots + P(A|B_k)P(B_k)$$
Bayes theorem then gives:
$$P(B_i|A) = \frac{P(A|B_i)P(B_i)}{P(A)} = \frac{P(A|B_i)P(B_i)}{P(A|B_1)P(B_1) + P(A|B_2)P(B_2) + \ldots + P(A|B_k)P(B_k)}$$
This is useful for addressing many probability problems.

Example (fake!): say we can classify spiral galaxies into either star forming galaxies or AGN based on line ratios, since almost all AGN have $R > R_{crit}$. Let A represent when $R > R_{crit}$. However, the test is not perfect: if you observe an AGN, $P(A|AGN) = 0.9$, i.e., 10% of AGN fail the test (false negatives). At the same time, 1% of SF galaxies also satisfy A: $P(A|SF) = 0.01$ (false positives). Also, there are many more star forming galaxies than AGN: $P(AGN) = 0.01$. If you observe a galaxy with $R > R_{crit}$, what is the probability that it is an AGN? Note the significant problem posed by false positives for a test for a rare condition.

Review: What is $P(A \cup B)$? How is it related to P(A) and P(B)? What is $P(A \cap B)$? How is it related to P(A) and P(B)? What is P(A|B)? What is Bayes theorem?

Example: Monty Hall, or three doors, problem. You're on a game show where you are asked to choose a door from 3 doors, labelled A, B, and C, with a car behind one of them. After you make your choice, the host, who knows where the car is, opens one of the remaining doors to show nothing, and asks whether you want to change your guess. Statistically, does it increase your chances to change or not?
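A minimal sketch working out the AGN example with Bayes theorem (the numbers are the ones given above):

```python
# P(AGN | A) via Bayes theorem, with A = "R > Rcrit"
p_A_given_agn = 0.9   # true positive rate
p_A_given_sf = 0.01   # false positive rate
p_agn = 0.01          # prior: AGN are rare
p_sf = 1 - p_agn

# Total probability of passing the test (law of total probability)
p_A = p_A_given_agn * p_agn + p_A_given_sf * p_sf

print(p_A_given_agn * p_agn / p_A)  # ~0.48
```

Even with a 90% true positive rate, fewer than half of the galaxies that pass the test are actually AGN, because AGN are rare.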
Know what conditional probabilities are. Know and understand Bayes theorem, and be able to apply to probability problems.
2.2 Random variables and probability distributions

When we talk about uncertainty in science, we are often talking about quantities whose values result from a measurement of a quantity that has some random variations. These variations may result from variations of the quantity across a population (e.g., masses of stars), or they may result from uncertainties in the measurement of the quantity, or both. These quantities are called random variables.

Random variables can be discrete or continuous: discrete random variables take on specific distinct values, whereas a continuous random variable can take on any value. In either case, we describe the probability of getting a value with a probability distribution function (PDF). A normalized probability distribution (or density) function has the property:
$$\int p(x) dx = 1$$
for a continuous function, or
$$\sum p(x_i) = 1$$
for a discrete function. A distribution function that cannot be normalized is called improper.

Distributions are often described (incompletely) by a few characteristic numbers, e.g., identifying the location and width of the distribution. Often, moments of the PDF are used:
• Mean, or expectation value: $\mu = \int x P(x) dx = \langle x \rangle$
• Variance: $\sigma^2 = \int (x-\mu)^2 P(x) dx = \int x^2 P(x) dx - \mu^2$ (why?)
• Standard deviation: $\sigma = \sqrt{\rm variance}$
• Skewness: $\frac{1}{\sigma^3} \int (x-\mu)^3 P(x) dx$ (beware multiple definitions): measures asymmetry
• Kurtosis: $\frac{1}{\sigma^4} \int (x-\mu)^4 P(x) dx$ : measures strength of wings relative to peak
• See here for some graphical illustrations of moments
Other descriptors can also be used
• Median: central value
• Mode: most common value (for discrete quantities)
• Mean absolute deviation
• percentiles: if one integrates the PDF, this gives the cumulative distribution function (CDF), i.e., the fraction of time a variable falls below a given value. This function ranges from 0 to 1. If one inverts this function, you get the quantile function, which gives the value below which some fraction of the values lie. The median is the quantile function evaluated at 0.5. A descriptor of the width of the PDF is given by the interquartile range, the difference between the quantile function evaluated at 0.75 and at 0.25: $q_{75} - q_{25}$
Note that if you have some function of the random variable, then the expectation value of that function is:
$$\langle f(x) \rangle = \int f(x) P(x) dx$$
example: stellar masses and luminosities.

Some useful relations for the expectation and variance operators:
$$E(x) \equiv \frac{\sum x_i}{N}$$
$$var(x) \equiv \frac{\sum (x_i - E(x))^2}{N}$$
$$E(x + k) = E(x) + k$$
$$E(x + y) = E(x) + E(y)$$
$$E(kx) = k E(x)$$
$$E(E(x)) = E(x)$$
$$var(k + x) = var(x)$$
$$var(kx) = k^2 var(x)$$
For independent variables:
$$E(xy) = E(x) E(y)$$
$$var(x + y) = var(x) + var(y)$$
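A short sketch verifying the sample moments and a couple of these operator identities with random draws (the Gaussian parameters are arbitrary choices for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=100_000)
k = 3.0

# Sample moments: mean ~ 5, variance ~ 4; skewness and kurtosis ~ 0 for a Gaussian
print(np.mean(x), np.var(x))
print(stats.skew(x), stats.kurtosis(x))  # kurtosis() already subtracts 3

# Operator identities: E(kx) = k E(x) and var(kx) = k^2 var(x)
print(np.mean(k * x), k * np.mean(x))
print(np.var(k * x), k**2 * np.var(x))
```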
Understand the concept of a probability distribution function (PDF), and its first several moments: mean, standard deviation, skewness, kurtosis. Know what is meant by an expectation value.
2.3 Estimators

In most cases, we need to estimate parent quantities from a finite sample. Various estimators can be used for different quantities. Ideally, we want estimators that are unbiased, consistent, efficient, and robust.
• unbiased: the expectation value of the estimator is equal to the quantity being estimated
• consistent: with more data, it converges to the true value
• efficient: makes good use of the data, giving a low variance about the true value of the quantity
• robust: isn't easily thrown off by data that violate your assumptions about the PDF, e.g., by non-Gaussian tails of the error distribution
It is not always possible for any single estimator to be optimal in all of these characteristics!

Given the definition of the mean and standard deviation as moments of the PDF, one might consider using sample moments to estimate these from a finite sample. In this case, the probability of each point is equal to 1/N: the PDF is just sampled by the relative number of points at each observed value. So, for the mean, one would use:
$$\bar{x} = \sum P_i x_i = \frac{\sum x_i}{N}$$
(a familiar result!). To check if this is biased, we want to compute its expectation value, the average of this over multiple samples:
$$\langle \bar{x} \rangle = \frac{1}{N} \sum \langle x_i \rangle = \mu$$
so it is unbiased.

What about the variance of this estimator? Using the operators above, this is found to be:
$$\langle (\bar{x} - \mu)^2 \rangle = \frac{\sigma^2}{N}$$
(see, e.g., Bailer-Jones 1.3 and 2.3, also here) and gives us the uncertainty in the mean. (Note we previously got this result in ASTR 535 via propagation of uncertainties):
$$\bar{x} = \frac{\sum x_i}{N}$$
$$\sigma_{\bar{x}}^2 = \frac{1}{N^2} \sum \sigma_x^2 = \frac{\sigma_x^2}{N}$$
This result demonstrates that more data reduces the uncertainty! Note that this requires an estimator for the variance.

Extending to an estimator for the variance, one might expect to use:
$$S(x) = \sum P_i (x_i - \bar{x})^2 = \frac{1}{N} \sum (x_i - \bar{x})^2$$
However, let’s calculate whether this is biased, by comparing it to the true variance. The expectation value of S(x) is (see here): N − 1 < S(x) >= σ2 N This arises because we have to use the sample mean rather than the true mean, so that removes one degree of freedom; note the case of only having one point (for which we clearly can’t get an estimate of the variance), or two points, for which our variance n estimate is clearly too low. So if we multiply our biased estimator by n−1 , then it would be unbiased, so an unbiased estimator of the sample variance is: 1 Sˆ(x) = Σ(x − x¯)2 N − 1 i which is a common result. Note, however, that the square root of this is not an unbiased estimator of the standard deviation, because the expectation value of the square root of a distribution is not equal to the square root of the expecation value of the distribution! This leads to the expression for the uncertainty in the mean, using our unbiased estimator for the variance: s ˆ s S(x) σˆ 1 2 SEM = = √ = Σ(xi − x¯) N N N(N − 1) One can also derive the uncertainty of the sample standard deviation: it is ap- proximately σˆ(x) p2(N − 1) (Bailer-Jones, section 2.4) Estimator of the skew: 1 γ = Σ(x − µ)3 Nσ3 i and kurtosis: 1 κ = Σ(x − µ)4 − 3 Nσ4 i Note that, while these estimators are unbiased and consistent, they are not neces- sarily the most efficient, nor are they necessarily robust. There is so single estimator that is guaranteed to be unbiased, most efficent, and most robust! Other estimators of the mean: midrange (average of lowest and highest point). See Bailer-Jones 2.3 for a demonstration that, for a uniform distribution, midrange is more efficient than mean! Astr 630 Class Notes – Spring 2020 13
The sample mean and variance are sensitive to outliers. The median is more robust, but has a larger uncertainty: its variance is $\sim \frac{\pi}{2} \frac{\sigma^2}{N}$, which is larger than the uncertainty in the mean: you would need more points to get the same uncertainty out of the median as from the mean.

For a more robust estimator of the variance, the mean absolute deviation (MAD, average of the absolute value of the difference between observation and mean or median) is often used, but note that it is a biased estimator. Another robust estimate is the interquartile range, $q_{75} - q_{25}$. Note that, for a Gaussian distribution, $\sigma = 0.7413 (q_{75} - q_{25})$.

A common method to get a robust and efficient estimator of the mean is to use so-called σ-clipping: determine a robust mean and variance, use these to reject outliers, then apply an unbiased and efficient estimator to the remaining points; a comparison of these estimators is sketched below.

Other estimators include least-squares estimators and maximum likelihood estimators. A least-squares estimator of the mean, for example, is the value $\hat{\mu}_{LS}$ that minimizes:
$$\sum (x_i - \hat{\mu}_{LS})^2$$
A maximum likelihood estimator is the value that maximizes
$$\prod_i P(x_i | \hat{\mu}_{MLE})$$
where Π represents a product, in this case, of the probabilities of each individual point.

The relative quality of different estimators depends on the distribution of the quantity for which the estimator is sought. For many simple distributions, the different estimators, e.g., for the mean, yield the same quantity, but this is not always the case. The unbiased estimator that is most efficient, i.e., gives the lowest variance, is called the minimum variance unbiased estimator (MVUE). Under some conditions, one can derive the minimum possible variance of an estimator: this is called the Cramér-Rao bound (see Feigelson & Babu 3.4.3).
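A sketch comparing the mean, median, and a σ-clipped mean on data with outliers. The contamination model and clipping threshold are assumed choices; `astropy.stats.sigma_clip` is a ready-made implementation, but the loop below shows the idea:

```python
import numpy as np

rng = np.random.default_rng(2)

# Gaussian data with true mean 10, contaminated by 5% large outliers (assumed)
x = rng.normal(10.0, 1.0, size=1000)
x[:50] += rng.normal(50.0, 10.0, size=50)

def sigma_clipped_mean(data, nsigma=3.0, niter=5):
    """Iteratively reject points more than nsigma robust-sigmas from the median."""
    for _ in range(niter):
        center = np.median(data)
        # robust width from the interquartile range: sigma ~ 0.7413 * (q75 - q25)
        q25, q75 = np.percentile(data, [25, 75])
        sigma = 0.7413 * (q75 - q25)
        data = data[np.abs(data - center) < nsigma * sigma]
    return np.mean(data)  # efficient estimator applied to the cleaned sample

print(np.mean(x))             # pulled high by the outliers
print(np.median(x))           # robust
print(sigma_clipped_mean(x))  # robust and efficient
```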
Understand the concept of an estimator and the concepts of bias, consistency, efficiency, and robustness. Know the expressions for unbiased estimators of mean, standard deviation, and standard uncertainty of the mean.
2.4 Multivariate distributions

Imagine there is some joint probability distribution function of two variables, p(x, y) (see ML 3.2; imagine, for example, that this could be the distribution of stars as a function of effective temperature and luminosity!). Then, thinking about slices in each direction to get to the probability at a given point, we have:
$$p(x, y) = p(x|y) p(y) = p(y|x) p(x)$$
This gives:
$$p(x|y) = \frac{p(y|x) p(x)}{p(y)}$$
which is Bayes theorem. Bayes theorem also relates the conditional probability distribution to the marginal probability distribution:
$$p(x) = \int p(x|y) p(y) dy$$
$$p(y) = \int p(y|x) p(x) dx$$
so
$$p(x|y) = \frac{p(y|x) p(x)}{\int p(y|x) p(x) dx}$$
The covariance characterizes the degree to which multiple variables are correlated. For a 2D distribution, it is defined by:
$$Cov(x, y) = \int\!\!\int P(x, y) (x - \langle x \rangle)(y - \langle y \rangle)\, dx\, dy$$
$$Cov(x, y) = \langle xy \rangle - \langle x \rangle \langle y \rangle$$
For a finite sample, the sample covariance is (analogous to the sample variance above):
$$\widehat{Cov}(x, y) = \frac{1}{N-1} \sum (x_i - \bar{x})(y_i - \bar{y})$$
The variances and covariances can be summarized in a covariance matrix, in which the diagonal elements are the variances of the individual variables, and the off-diagonal elements are the covariances. By definition, this is a symmetric matrix.

If multiple variables are combined to construct some new quantity, e.g., z = f(x, y), the variance in this quantity is related to the variance in the input variables by:
$$\sigma^2(z) = \left(\frac{\partial f}{\partial x}\right)^2 \sigma_x^2 + \left(\frac{\partial f}{\partial y}\right)^2 \sigma_y^2 + 2 \frac{\partial f}{\partial x} \frac{\partial f}{\partial y} Cov(x, y)$$
This is the standard formula for error propagation. Note that the covariance can be negative, leading to reduced scatter in the derived quantity. Also be cautious about the use of this formula: it strictly only applies in the limit of small uncertainties, and perhaps even more importantly, the shape of the distribution of uncertainties in the final quantity may not be the same as the shape of the distribution in the input variables.

The error propagation formula can be generalized in matrix form for an arbitrary number of variables. Given the covariance matrix $C_x$ and the vector of partial derivatives, D:
$$\sigma^2(f) = D^T C_x D$$
Some examples of covariance or lack of it (see the sketch after this list):

• Variance from observational uncertainties
  – Consider multiple observations of an object in 2 bandpasses
  – Consider color as a function of magnitude
  – Consider the case of comparing two independent analyses of the same set of objects

• Variance from physical correlations
  – Consider measurements of cluster distances using main sequence fitting
  – Consider measurements of stellar parameters from spectra
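A minimal sketch of the color example: for a color z = x − y formed from two magnitudes, the propagation formula gives σ²(z) = σ²_x + σ²_y − 2 Cov(x, y), which we can check against a Monte Carlo draw (the uncertainties and covariance below are assumed numbers):

```python
import numpy as np

rng = np.random.default_rng(3)

# Assumed covariance matrix for two magnitudes x and y:
# sigma_x^2 = 0.04, sigma_y^2 = 0.09, Cov(x, y) = 0.03 (positively correlated)
C = np.array([[0.04, 0.03],
              [0.03, 0.09]])

# Error propagation for the color z = f(x, y) = x - y: D = (df/dx, df/dy)
D = np.array([1.0, -1.0])
print(D @ C @ D)  # D^T C D = 0.04 + 0.09 - 2*0.03 = 0.07

# Monte Carlo check
xy = rng.multivariate_normal([15.0, 14.0], C, size=200_000)
z = xy[:, 0] - xy[:, 1]
print(z.var())  # ~0.07: positive covariance reduces the scatter in the color
```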
Understand the concept of covariance and its mathematical definition. Know the expression for error propagation.
2.4.1 Correlation coefficients

Often the covariance is normalized by the standard deviation in each variable, leading to the correlation coefficient:
$$\rho(x, y) = \frac{Cov(x, y)}{\sigma_x \sigma_y} = \frac{\langle xy \rangle - \langle x \rangle \langle y \rangle}{\sigma_x \sigma_y}$$
which is 1 for perfectly correlated variables, −1 for perfectly anti-correlated variables, and 0 for uncorrelated variables.

This provides for a quantitative measurement of the amount of correlation between two quantities. This correlation coefficient is sometimes known as Pearson's correlation coefficient.
Note that a correlation between variables does not necessarily imply causation!
For a finite sample size, one can calculate the sample correlation coefficient, r. The quantity $r \sqrt{\frac{n-2}{1-r^2}}$ is distributed as a t-distribution (with n − 2 degrees of freedom), so you can calculate the significance of a given r. However, this does not allow for including information about uncertainty on measurements (as far as I can tell!). In that case, it is perhaps better to perform a fit to the data and assess the uncertainties on the fit coefficients.

Pearson's correlation coefficient tests for linear relations between quantities. A perfect nonlinear correlation will not give a value of one! Other tests of correlation include the nonparametric correlation tests: Spearman's and Kendall's. Spearman's correlation coefficient is Pearson's coefficient as applied to the rank of the variables, rather than the value of the variable itself. The rank of a variable is its sequence number in the sorted values.
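A short sketch contrasting Pearson's and Spearman's coefficients on a perfect but nonlinear (monotonic) relation, using scipy (the cubic relation is an arbitrary example):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
x = rng.uniform(0.0, 10.0, size=500)
y = x**3  # perfect monotonic, but nonlinear, relation

r, p_r = stats.pearsonr(x, y)
rho, p_rho = stats.spearmanr(x, y)
print(r)    # < 1: Pearson only measures *linear* correlation
print(rho)  # = 1: the ranks are perfectly correlated
```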
3 Common Distributions
• uniform: equal probability of obtaining all numbers between some limits
• binomial distribution: gives the probability for events that have two possible outcomes, e.g., a coin toss. The binomial distribution gives the probability that you'll get a certain number of a particular outcome, k, in n trials, given that the probability of observing the desired outcome in a single event is p (note that the probability of observing the other outcome is 1 − p). Consider the (unlikely!) case when the first k events are all the desired outcome, followed by n − k events with the other outcome. What is the probability of this?
$$p^k (1-p)^{n-k}$$
But there are many other ways that we can get k events. To get the total probability we just need to multiply this probability by the number of different combinations that give k of the desired outcome. How many are there?

First, consider the number of different permutations. For a set of n objects or trials, what is the number of different orders in which they can appear? This is given by n!. Now consider the number of different permutations if you draw k objects or trials. This is given by $\frac{n!}{(n-k)!}$.

Perhaps better: we want to know the number of different ways that k events can be placed into a sequence of n total events. Take the first of the k events. How many positions out of n can it go into? How about the second one? The third? So, for k events, the number of permutations in n slots is $\frac{n!}{(n-k)!}$.

Finally, consider the number of combinations there are if you draw k objects or trials, independent of the order. The number of permutations of the selected trials is k!, so the number of combinations is
$$\frac{n!}{k!(n-k)!}$$
Combining results gives us the binomial distribution:
$$P(k|n, p) = \frac{n!}{k!(n-k)!} p^k (1-p)^{n-k}$$
where P(k|n, p) is the probability of observing k outcomes given n trials, with p being the probability of getting the desired outcome in a single trial. Sometimes
a special notation is used for the number of combinations:
$$P(k|n, p) = \binom{n}{k} p^k (1-p)^{n-k}$$
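A quick check of the formula against scipy's implementation (the coin-toss numbers are arbitrary):

```python
from math import comb

from scipy import stats

# Probability of k = 7 heads in n = 10 tosses of a fair coin (p = 0.5)
n, k, p = 10, 7, 0.5

# Direct evaluation of the binomial formula
print(comb(n, k) * p**k * (1 - p)**(n - k))  # ~0.117

# Same result from scipy.stats
print(stats.binom.pmf(k, n, p))              # ~0.117
```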