ASTRONOMY 630: Numerical and statistical methods in astrophysics

CLASS NOTES

Spring 2020

Instructor: Jon Holtzman

1 Introduction

What is this class about? What is astrostatistics? Read and discuss Feigelson & Babu 1.1.2 and 1.1.3. The use of statistics in astronomy may not be as controversial as it is in other fields, probably because astronomers feel that there is basic physics that underlies the phenomena we observe. What is data mining and machine learning? Read and discuss Ivesic 1.1 (Astroinformatics). What are some astronomy applications of statistics and data mining? Classification, parameter estimation, comparison of hypotheses, absolute evaluation of hypotheses, forecasting, finding substructure in data, finding correlations in data. Some examples:

• Classification: what is the relative number of different types of planets, stars, galaxies? Can a subset of observed properties of an object be used to classify the object, and how accurately? e.g., emission line ratios

• Parameter estimation: Given a set of data points with uncertainties, what are the slope and amplitude of a power-law fit? What are the uncertainties in the parameters? Note that this assumes that the power-law description is valid.

• Hypothesis comparison: Is a double power-law better than a single power-law? Note that hypothesis comparisons are trickier when the number of parameters is different, since one must decide whether the fit to the data is sufficiently better given the extra freedom in the more complex model. A simpler comparison would be single power-law vs. two constant plateaus with a break at a specified location, both with two parameters. But a better fit does not necessarily guarantee a better model!

• Absolute evaluation: Are the data consistent with a power-law? Absolute assessments of this sort can be more problematic than hypothesis comparisons.

• Forecasting of errors: How many more objects, or what reduction of uncertainties, would allow single and double power-law models to be clearly distinguished? One needs to specify goals, and assumptions about the data. A common need for observing proposals, grant proposals, satellite proposals ...

• Finding substructure: 1) spatial structure, 2) structure in distributions, e.g., separating peaks in RVs or MDFs, 3) given a multi-dimensional data set, e.g. abundance patterns of stars, can you identify homogeneous groups, e.g. chemical tagging?

• Finding correlations among many variables: given multi-dimensional data, are there correlations between different variables, e.g., HR diagram, fundamental plane, abundance patterns

More examples in Ivesic et al 1.1. Note the general explosion of the field in the last decade (e.g., CCA and other institutes). Familiarity with these topics is critical, and mastery even better. Clearly, numerical/computational techniques are an important component of astrostatistics. A previous course, ASTR 575, attempted (with poor success?) to combine these.... Initial anticipation (!) of ASTR 630 class topics:

• Intro: usage of data/statistical analysis in astronomy

• Basic probability and statistics. Conditional probabilities. Bayes theorem

• Statistical distributions in astronomy. Point estimation: bias, consistency, efficiency, robustness

• Random number generators. Simulating data.

• Fitting models to data. Parametric vs non-parametric models. Frequentist vs Bayesian. Least squares: linear and non-linear. Regularization.

• Bayesian model analysis. Marginalization. MCMC.

• Handling observational data: Correlated and uncorrelated errors. Forward mod- eling.

• Determining uncertainties in estimators/parameters: bootstrap and jackknife.

• Hypothesis testing and comparison

• non-parametric modeling

• Issues in modeling real data. Forward modeling. Data driven analysis

• Clustering analysis. Gaussian mixture models. Extreme deconvolution. K-means.

• Dimensionality reduction: principal components, tSNE, etc.

• Deep learning and neural networks

• Fourier analysis? (probably not)

Logistics: Canvas, class notes, grading TBD (homework, no exams?)

Resources:

• Ivesic et al: Statistics, Data Mining, and Machine Learning in Astronomy (2014), with Python

• Bailer-Jones: Practical Bayesian Inference (2017), with R

• Feigelson & Babu: Modern Statistical Methods for Astronomy: With R Applications (2012)

• Robinson: Data Analysis for Scientists and Engineers (2016)

• Press et al., Numerical Recipes (3rd edition, 2007)

Web sites:

• California-Harvard Astrostatistics Collaboration

• Center for Astrostatistics at Penn State: note annual summer school in Astrostatistics, also Astronomy and Astroinformatics portal

• International Computing Astronomy Group

2 Probability, basic statistics, and distributions

Why study probability?

• Probability and statistics: in many statistical analyses, we are talking about the probability of different hypotheses being consistent with data, so it is important to understand the basics of probability.

• In many circumstances, we describe variation in data (either intrinsic or experimental) with the probability that we will observe different outcomes

Different approaches to probability: classical, frequentist, subjective, axiomatic (see here):

• Classical theory: calculation of the ratio of favourable to total outcomes. Does not involve performing an experiment. Computed by counting N_E favourable events out of N total possible events, where all outcomes must be equally likely. Cannot handle an infinite number of possible outcomes. Examples: coins, dice, cards.

• Frequentist: perform an experiment n times, testing for the occurrence of event E, and define:

P(E) = lim_{n→∞} n(E)/n

We must estimate this from a finite number of trials; we postulate that n(E)/n approaches a limit as n goes to infinity. Not possible for single events! Example: a loaded die.

• Subjective: judgement based on intuition; a combination of rational statistics, personal observations, and, potentially, superstition. Easily biased by non-rational use of data. In real-world situations, often all we have! Example: weather.

Axiomatic approach: provide a logically consistent set of axioms to work with probabilities, however they are determined. Define a random experiment H (with nondeterministic outcomes), a sample description space Ω (the set of all outcomes), and an event E (a subset of the sample description space which satisfies certain constraints). Consider the following axioms: For any event E,

0 ≤ P(E) ≤ 1

P(E′) = 1 − P(E)

where E′ is the complement of E (i.e., not E). For mutually exclusive (no overlap) outcomes only:

P(A or B) ≡ P(A ∪ B) = P(A) + P(B)

If the E_i are a set of mutually exclusive and exhaustive outcomes (covering all possibilities), then:

Σ P(E_i) = 1

These imply, for non-mutually-exclusive outcomes:

P(A ∪ B) = P(A) + P(B) − P(A ∩ B)

where P(A ∩ B) ≡ P(A and B) ≡ P(A, B) in different notations, used in different books. Note the graphical (Venn diagram) representation. If events A and B are independent, i.e., one has no effect on the probability of the other, then

P(A ∩ B) = P(A)P(B)

This serves as a definition of independence. Note that if more than two events are considered (e.g., A, B, C), then mutual independence requires more than all pairs being independent: the joint probability of every subset of events must factorize. Caution: addition of probabilities applies only for mutually exclusive events! Often, it is convenient to think about the complement of events when approaching problems, and couch the problem in terms of "and" rather than "or".

Example: If there is a 20% chance of rain today and a 20% chance of rain tomorrow, what is the probability that it will rain in the next two days?

Example: we draw 2 cards from a shuffled deck. What is the probability that they are both aces given a) we replace the first card, and b) we don't?

Example: consider a set of 9 objects of three different shapes (square, circle, triangle) and three different colors (red, green, blue). What is the probability we will get a blue object? A triangle? What is the probability that we will get a blue triangle? Are these independent?
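The card and rain examples can be checked directly with the multiplication rule; a short Python sketch (the independence of the two rain days is an assumption of the example):

```python
# Probability that two cards drawn from a 52-card deck are both aces.
# (a) with replacement: the two draws are independent
p_with = (4 / 52) * (4 / 52)

# (b) without replacement: P(ace, ace) = P(ace) * P(ace | first was ace)
p_without = (4 / 52) * (3 / 51)

print(p_with)     # ≈ 0.00592
print(p_without)  # ≈ 0.00452

# Rain example: easiest via the complement,
# P(rain in next two days) = 1 - P(no rain both days)
p_rain = 1 - (1 - 0.2) * (1 - 0.2)
print(p_rain)     # ≈ 0.36 (assuming the two days are independent)
```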

Understand and be able to use basic axioms of probability.

2.1 Conditional probabilities

Sometimes, however, we have the situation where the probability of an event A is related to the probability of another event B, so if we know something about B, it changes the probability of A. This is called a conditional probability, P(A|B), the probability of A given B. If P(A|B) = P(A), then the events are independent.

What are some astronomy examples? Example: Imagine we have squares, circles and triangles again, but only blue triangles. What is the probability we will have a triangle if we only look at blue objects? Thinking about this in the classical approach:

P(blue) = n(blue)/n

P(blue ∩ triangle) = n(blue triangles)/n

P(triangle|blue) = n(blue triangles)/n(blue)

Generalizing to the axiomatic representation:

P(triangle|blue) = P(blue ∩ triangle)/P(blue)

which serves as a definition of conditional probability, P(A|B), the probability of A given B:

P(A|B) = P(A ∩ B)/P(B)

Note that, by the same reasoning:

P(B|A) = P(A ∩ B)/P(A)

But, certainly, P(A|B) ≠ P(B|A)! Given the definitions,

P(B|A)P(A) = P(A|B)P(B)

which is known as Bayes theorem. Note that this is a basic result that is not controversial, but some applications of Bayes theorem can be: the Bayesian paradigm extends this to the idea that the quantity A or B can also represent a hypothesis, or model, i.e. one talks about the relative probability of different models. If we consider a set of k mutually exclusive events B_i that cover all possibilities (e.g., red, green and blue), then the probability of some other event A is:

P(A) = P(A ∩ B_1) + P(A ∩ B_2) + ... + P(A ∩ B_k)

P(A) = P(A|B_1)P(B_1) + P(A|B_2)P(B_2) + ... + P(A|B_k)P(B_k)

Bayes theorem then gives:

P(B_i|A) = P(A|B_i)P(B_i) / P(A) = P(A|B_i)P(B_i) / [P(A|B_1)P(B_1) + P(A|B_2)P(B_2) + ... + P(A|B_k)P(B_k)]

This is useful for addressing many probability problems. Example (fake!): say we can classify spiral galaxies into either star-forming galaxies or AGN based on line ratios, since almost all AGN have R > Rcrit. Let A represent the event R > Rcrit. However, the test is not perfect: if you observe an AGN, P(A|AGN) = 0.9, so 10% of AGN fail the test (false negatives). At the same time, 1% of SF galaxies also satisfy A: P(A|SF) = 0.01 (false positives). Also, there are many more star-forming galaxies than AGN: P(AGN) = 0.01. If you observe a galaxy with R > Rcrit, what is the probability that it is an AGN? Note the significant problem posed by false positives for a test for a rare condition.

Review: What is P(A ∪ B)? How is it related to P(A) and P(B)? What is P(A ∩ B)? How is it related to P(A) and P(B)? What is P(A|B)? What is Bayes theorem?

Example: the Monty Hall, or three doors, problem. You're on a game show where you are asked to choose a door from 3 doors, labelled A, B, and C, with a car behind one of them. After you make your choice, the host, who knows where the car is, opens one of the remaining doors to show nothing, and asks whether you want to change your guess. Statistically, does it increase your chances to change or not?
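The (fake) AGN classification example can be worked through directly with Bayes theorem; a Python sketch using the hypothetical rates from the example:

```python
# Hypothetical numbers from the AGN/line-ratio example in the notes.
p_A_given_AGN = 0.90   # true positive rate: P(A|AGN)
p_A_given_SF = 0.01    # false positive rate: P(A|SF)
p_AGN = 0.01           # prior fraction of AGN
p_SF = 1 - p_AGN

# Law of total probability for P(A), then Bayes theorem for P(AGN|A):
p_A = p_A_given_AGN * p_AGN + p_A_given_SF * p_SF
p_AGN_given_A = p_A_given_AGN * p_AGN / p_A
print(p_AGN_given_A)   # ≈ 0.476: fewer than half the positives are AGN!
```

Even with a 90% true positive rate, the rarity of AGN means a positive test is more often wrong than right.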

Know what conditional probabilities are. Know and understand Bayes theorem, and be able to apply to probability problems.

2.2 Random variables and probability distributions

When we talk about uncertainty in science, we are often talking about quantities whose values result from a measurement of a quantity that has some random variations. These variations may result from variations of the quantity across a population (e.g., masses of stars), or they may result from uncertainties in the measurement of the quantity, or both. Such quantities are called random variables. Random variables can be discrete or continuous: discrete random variables take on specific distinct values, whereas a continuous random variable can take on any value. In either case, we describe the probability of getting a value with a probability distribution function (PDF). A normalized PDF (or probability density function) has the property:

∫ p(x) dx = 1

for a continuous function, or

Σ p(x_i) = 1

for a discrete function. A distribution function that cannot be normalized is called improper. Distributions are often described (incompletely) by a few characteristic numbers, e.g., identifying the location and width of the distribution. Often, moments of the PDF are used:

• Mean, or expectation value: µ = ∫ x P(x) dx = <x>

• Variance: ∫ (x − µ)² P(x) dx = ∫ x² P(x) dx − µ² (why?)

• Standard deviation: σ = √(variance)

• Skewness: (1/σ³) ∫ (x − µ)³ P(x) dx (beware multiple definitions): measures asymmetry

• Kurtosis: (1/σ⁴) ∫ (x − µ)⁴ P(x) dx: measures the strength of the wings relative to the peak

• See here for some graphical illustrations of moments

Other descriptors can also be used

• Median: central value

• Mode: most common value (for discrete quantities)

• Mean absolute deviation

• Percentiles: if one integrates the PDF, this gives the cumulative distribution function (CDF), i.e. the fraction of the time a variable falls below a given value. This function ranges from 0 to 1. If one inverts this function, you get the quantile function, which gives the value below which some fraction of the values lie. The median is the quantile function evaluated at 0.5. A descriptor of the width of the PDF is given by the interquartile range, the difference between the quantile function evaluated at 0.75 and at 0.25: q75 − q25.

Note that if you have some function of the random variable, then the expectation value of that function is:

<f(x)> = ∫ f(x) P(x) dx

Example: stellar masses and luminosities. Some useful relations for the expectation and variance operators:

E(x) ≡ Σ x_i / N
var(x) ≡ Σ (x_i − E(x))² / N
E(x + k) = E(x) + k
E(x + y) = E(x) + E(y)
E(kx) = k E(x)
E(E(x)) = E(x)
var(k + x) = var(x)
var(kx) = k² var(x)

For independent variables:

E(xy) = E(x) E(y)
var(x + y) = var(x) + var(y)
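These operator relations are easy to verify numerically; a Python sketch (the sample size, distributions, and constant k are arbitrary illustration choices):

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(3.0, 2.0, size=100_000)
y = rng.normal(-1.0, 0.5, size=100_000)  # independent of x
k = 4.0

# Exact operator identities (hold to floating-point precision):
assert np.isclose(np.mean(k * x), k * np.mean(x))   # E(kx) = k E(x)
assert np.isclose(np.var(k + x), np.var(x))         # var(k + x) = var(x)
assert np.isclose(np.var(k * x), k**2 * np.var(x))  # var(kx) = k^2 var(x)

# Identities that require independence (only approximate for a
# finite sample, since the sample covariance is not exactly zero):
print(np.var(x + y), np.var(x) + np.var(y))     # close, not exact
print(np.mean(x * y), np.mean(x) * np.mean(y))  # close, not exact
```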

Understand the concept of a probability distribution function (PDF), and its first several moments: mean, standard deviation, skewness, kurtosis. Know what is meant by an expectation value.

2.3 Estimators

In most cases, we need to estimate parent quantities from a finite sample. Various estimators can be used for different quantities. Ideally, we want estimators that are unbiased, consistent, efficient, and robust.

• unbiased: the expectation value of the estimator is equal to the quantity being estimated

• consistent: with more data, it converges to the true value

• efficient: makes good use of the data, giving a low variance about the true value of the quantity

• robust: isn't easily thrown off by data that violate your assumptions about the PDF, e.g., by non-Gaussian tails of the error distribution

It is not always possible for a single estimator to be optimal in all of these characteristics! Given the definition of the mean and standard deviation as moments of the PDF, one might consider using sample moments to estimate these from a finite sample. In this case, the probability of each point is equal to 1/N: the PDF is just sampled by the relative number of points at each observed value. So, for the mean, one would use:

x̄ = Σ P_i x_i = Σ x_i / N

(a familiar result!). To check whether this is biased, we want to compute its expectation value, the average of this over multiple samples:

<x̄> = <(1/N) Σ x_i> = (1/N) Σ <x_i> = µ

so it is unbiased. What about the variance of this estimator? Using the operators above, this is found to be:

<(x̄ − µ)²> = σ²/N

(see, e.g., Bailer-Jones 1.3 and 2.3, also here) and gives us the uncertainty in the mean. (Note we previously got this result in ASTR 535 via propagation of uncertainties):

x̄ = Σ x_i / N

σ²_x̄ = (1/N²) Σ σ²_x = σ²_x / N

This result demonstrates that more data reduce the uncertainty! Note that this requires an estimator for the variance. Extending to an estimator for the variance, one might expect to use:

S(x) = Σ P_i (x_i − x̄)² = (1/N) Σ (x_i − x̄)²

However, let's calculate whether this is biased, by comparing it to the true variance. The expectation value of S(x) is (see here):

<S(x)> = ((N − 1)/N) σ²

This arises because we have to use the sample mean rather than the true mean, which removes one degree of freedom; note the case of having only one point (for which we clearly can't get an estimate of the variance), or two points, for which our variance estimate is clearly too low. So if we multiply our biased estimator by N/(N − 1), it becomes unbiased, so an unbiased estimator of the sample variance is:

Ŝ(x) = (1/(N − 1)) Σ (x_i − x̄)²

which is a common result. Note, however, that the square root of this is not an unbiased estimator of the standard deviation, because the expectation value of the square root of a distribution is not equal to the square root of the expectation value of the distribution! This leads to the expression for the uncertainty in the mean, using our unbiased estimator for the variance:

SEM = √(Ŝ(x)/N) = σ̂/√N = √( Σ (x_i − x̄)² / (N(N − 1)) )

One can also derive the uncertainty of the sample standard deviation: it is approximately

σ̂(x) / √(2(N − 1))

(Bailer-Jones, section 2.4). Estimator of the skew:

γ = (1/(Nσ³)) Σ (x_i − µ)³

and kurtosis:

κ = (1/(Nσ⁴)) Σ (x_i − µ)⁴ − 3

Note that, while these estimators are unbiased and consistent, they are not necessarily the most efficient, nor are they necessarily robust. There is no single estimator that is guaranteed to be unbiased, most efficient, and most robust! Other estimators of the mean: midrange (average of lowest and highest point). See Bailer-Jones 2.3 for a demonstration that, for a uniform distribution, the midrange is more efficient than the mean!
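A quick simulation illustrates the bias of the naive variance estimator and the effect of the N/(N − 1) correction; a Python sketch (the choices of N, σ², and the number of trials are arbitrary; numpy's ddof keyword switches between the two estimators):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2 = 4.0       # true variance of the parent distribution
N = 5              # small sample size makes the bias obvious
trials = 200_000   # many samples, to average over

samples = rng.normal(0.0, np.sqrt(sigma2), size=(trials, N))
S_biased = samples.var(axis=1, ddof=0)    # divides by N
S_unbiased = samples.var(axis=1, ddof=1)  # divides by N - 1

print(S_biased.mean())    # ≈ (N-1)/N * sigma2 = 3.2
print(S_unbiased.mean())  # ≈ sigma2 = 4.0
```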

The sample mean and variance are sensitive to outliers. The median is more robust, but has larger uncertainty: its variance is ∼ (π/2)(σ²/N), which is larger than the uncertainty in the mean: you would need more points to get the same uncertainty out of the median as from the mean. For a more robust estimator of the variance, the mean absolute deviation (MAD, the average of the absolute value of the difference between observation and mean or median) is often used, but note that it is a biased estimator. Another robust estimate is the interquartile range, q75 − q25. Note that, for a Gaussian distribution, σ = 0.7413 (q75 − q25). A common method to get a robust and efficient estimator of the mean is to use so-called σ-clipping: determine a robust mean and variance and use these to reject outliers, then use an unbiased and efficient estimator on the remaining points. Other estimators include least-squares estimators and maximum likelihood estimators. A least-squares estimator of the mean, for example, is the value µ̂_LS that minimizes:

Σ (x_i − µ̂_LS)²

A maximum likelihood estimator is the value that maximizes

Π P(x_i | µ̂_MLE)

where Π represents a product, in this case of the probabilities of each individual point. The relative quality of different estimators depends on the distribution of the quantity for which the estimator is sought. For many simple distributions, the different estimators, e.g., for the mean, yield the same quantity, but this is not always the case. The unbiased estimator that is most efficient, i.e., gives the lowest variance, is called the minimum variance unbiased estimator (MVUE). Under some conditions, one can derive what the minimum variance of an estimator is: this is called the Cramer-Rao bound (see Feigelson & Babu 3.4.3).
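The efficiency penalty of the median relative to the mean for Gaussian data (the factor ∼ π/2 quoted above) can be checked by simulation; a Python sketch with arbitrary sample sizes:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100
trials = 50_000

samples = rng.normal(0.0, 1.0, size=(trials, N))  # true variance = 1
means = samples.mean(axis=1)
medians = np.median(samples, axis=1)

print(means.var())                  # ≈ sigma^2 / N = 0.01
print(medians.var())                # ≈ (pi/2) sigma^2 / N ≈ 0.0157
print(medians.var() / means.var())  # ≈ pi/2 ≈ 1.57 for Gaussian data
```

For heavy-tailed (outlier-prone) distributions the comparison can reverse, which is the point of robust estimators.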

Understand the concept of an estimator and the concepts of bias, consistency, efficiency, and robustness. Know the expressions for unbiased estimators of mean, standard deviation, and standard uncertainty of the mean.

2.4 Multivariate distributions

Imagine there is some joint probability distribution function of two variables, p(x, y) (see ML 3.2; imagine, for example, that this could be the distribution of stars as a function of effective temperature and luminosity!). Then, thinking about slices in each direction to get to the probability at a given point, we have:

p(x, y) = p(x|y)p(y) = p(y|x)p(x)

This gives:

p(x|y) = p(y|x)p(x)/p(y)

which is Bayes theorem. Bayes theorem also relates the conditional probability distribution to the marginal probability distribution:

p(x) = ∫ p(x|y)p(y) dy

p(y) = ∫ p(y|x)p(x) dx

so

p(x|y) = p(y|x)p(x) / ∫ p(y|x)p(x) dx

The covariance characterizes the degree to which multiple variables are correlated. For a 2D distribution, it is defined by:

Cov(x, y) = ∫∫ P(x, y)(x − <x>)(y − <y>) dx dy

Cov(x, y) = <xy> − <x><y>

For a finite sample, the sample covariance is (analogous to the sample variance above):

Ĉov(x, y) = (1/(N − 1)) Σ (x_i − x̄)(y_i − ȳ)

The variances and covariances can be summarized into a covariance matrix, in which the diagonal elements are the variances of the individual variables, and the off-diagonal elements are the covariances. By definition, this is a symmetric matrix. If multiple variables are combined to construct some new quantity, e.g. z = f(x, y), the variance in this quantity is related to the variance in the input variables by:

σ²(z) = (∂f/∂x)² σ²_x + (∂f/∂y)² σ²_y + 2 (∂f/∂x)(∂f/∂y) Cov(x, y)

This is the standard formula for error propagation. Note that the covariance can be negative, leading to reduced scatter in the derived quantity. Also be cautious about the use of this formula: it strictly only applies in the limit of small uncertainties, and perhaps even more importantly, the shape of the distribution of uncertainties in the final quantity may not be the same as the shape of the distribution in the input variables. The error propagation formula can be generalized in matrix form for an arbitrary number of variables. Given the covariance matrix C_x and the vector of partial derivatives D:

σ²(f) = Dᵀ C_x D

Some examples of covariance or lack of it:

• Variance from observational uncertainties
  – Consider multiple observations of an object in 2 bandpasses
  – Consider color as a function of magnitude
  – Consider the case of comparing two independent analyses of the same set of objects

• Variance from physical correlations
  – Consider measurements of cluster distances using main sequence fitting
  – Consider measurements of stellar parameters from spectra
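The matrix form of error propagation is compact to evaluate numerically; a Python sketch for a hypothetical f(x, y) = xy with assumed (made-up) variances and covariance:

```python
import numpy as np

# Error propagation for f(x, y) = x * y, in the small-uncertainty limit.
x, y = 10.0, 5.0
C = np.array([[0.04, 0.01],   # covariance matrix: [var(x),   cov(x,y)]
              [0.01, 0.09]])  #                    [cov(x,y), var(y)  ]

# Vector of partial derivatives: df/dx = y, df/dy = x
D = np.array([y, x])

var_f = D @ C @ D             # sigma^2(f) = D^T C_x D
print(var_f)                  # ≈ 25*0.04 + 100*0.09 + 2*50*0.01 = 11.0
```

Expanding the matrix product reproduces the two-variable formula above term by term.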

Understand the concept of covariance and its mathematical definition. Know the expression for error propagation.

2.4.1 Correlation coefficients

Often the covariance is normalized by the standard deviation in each variable, leading to the correlation coefficient:

ρ(x, y) = Cov(x, y) / (σ_x σ_y) = (<xy> − <x><y>) / (σ_x σ_y)

which is 1 for perfectly correlated variables, −1 for perfectly anti-correlated variables, and 0 for uncorrelated variables. This provides a quantitative measurement of the amount of correlation between two quantities. This correlation coefficient is sometimes known as Pearson's correlation coefficient.

Note that a correlation between two variables does not necessarily imply causation!

For a finite sample size, one can calculate the sample correlation coefficient, r. The quantity r√((n − 2)/(1 − r²)) is distributed as a t-distribution, so you can calculate the significance of a given r. However, this does not allow for including information about uncertainties on the measurements (as far as I can tell!). In that case, it is perhaps better to perform a fit to the data and assess the uncertainties on the fit coefficients. Pearson's correlation coefficient tests for linear relations between quantities. A perfect nonlinear correlation will not give a value of one! Other tests of correlation include the nonparametric correlation tests: Spearman's and Kendall's. Spearman's correlation coefficient is Pearson's coefficient as applied to the ranks of the variables, rather than the values of the variables themselves. The rank of a variable is its sequence number in the sorted values.
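The difference between Pearson's and Spearman's coefficients shows up clearly for a perfect but nonlinear correlation; a self-contained Python sketch implementing both from their definitions (the choice of y = x³ is an arbitrary monotonic example):

```python
import numpy as np

def pearson(x, y):
    # Pearson's r: sample covariance normalized by the standard deviations
    xm, ym = x - x.mean(), y - y.mean()
    return (xm * ym).sum() / np.sqrt((xm**2).sum() * (ym**2).sum())

def spearman(x, y):
    # Spearman's coefficient: Pearson's r applied to the ranks
    # (double argsort gives the rank of each value; assumes no ties)
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    return pearson(rx, ry)

x = np.linspace(1.0, 10.0, 50)
y = x**3            # perfectly correlated, but nonlinearly

print(pearson(x, y))   # noticeably less than 1: Pearson only sees *linear* relations
print(spearman(x, y))  # ≈ 1.0: the ranks are perfectly correlated
```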

3 Common Distributions

• uniform: equal probability of obtaining all numbers between some limits

• binomial distribution: gives the probability for events that have two possible outcomes, e.g., a coin toss. The binomial distribution gives the probability that you'll get a certain number of a particular outcome, k, in n trials, given that the probability of observing the desired outcome in a single event is p (note that the probability of observing the other outcome is 1 − p). Consider the (unlikely!) case when the first k events are all the desired outcome, followed by n − k events with the other outcome. What is the probability of this?

p^k (1 − p)^(n−k)

But there are many other ways that we can get k events. To get the total probability we just need to multiply this probability by the number of different combinations that give k of the desired outcome. How many are there? First, consider the number of different permutations. For a set of n objects or trials, what is the number of different orders in which they can appear? This is given by n!. Now consider the number of different permutations if you draw only k of the n objects or trials. This is given by n!/(n − k)!. Perhaps better: we want to know the number of different ways that k events can be placed into a sequence of n total events. Take the first of the k events: how many positions out of n can it go into? How about the second one? The third? So, for k events, the number of permutations in n slots is n!/(n − k)!. Finally, consider the number of combinations there are if you draw k objects or trials, independent of the order. The number of permutations of the selected trials is k!, so the number of combinations is

n! / (k!(n − k)!)

Combining results gives us the binomial distribution:

P(k|n, p) = n!/(k!(n − k)!) p^k (1 − p)^(n−k)

where P(k|n, p) is the probability of observing k outcomes given n trials, with p being the probability of getting the desired outcome in a single trial. Sometimes

a special notation is used for the number of combinations:

P(k|n, p) = C(n, k) p^k (1 − p)^(n−k)

where C(n, k) = n!/(k!(n − k)!) is "n choose k". Plot of binomial distribution. Examples:

– How many poker hands are there (without a draw)? – What’s the probability of getting 3 heads in 10 coin tosses? – What’s the probability of getting 3 sets of doubles in 3 tosses of a pair of dice? – Assume that 0.1% of the objects within a subsample are quasars. You examine 100 spectra from the subsample in detail. What is the probability that one of the spectra will be that of a quasar? – What is the probability that three will be quasars?

Mean of the binomial distribution can be derived and is:

µ = np

Variance of the binomial distribution can also be derived:

σ2 = np(1 − p)

See here for the derivations.
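A minimal Python sketch of the binomial distribution, checking the coin-toss example above and the mean and variance formulas directly from the pmf (the choice n = 10, p = 0.3 for the moment check is arbitrary):

```python
from math import comb

def binomial_pmf(k, n, p):
    # P(k | n, p) = C(n, k) p^k (1 - p)^(n - k)
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Probability of getting 3 heads in 10 coin tosses:
print(binomial_pmf(3, 10, 0.5))   # 120/1024 ≈ 0.117

# Check mean = np and variance = np(1 - p) by summing over the pmf:
n, p = 10, 0.3
mean = sum(k * binomial_pmf(k, n, p) for k in range(n + 1))
var = sum(k**2 * binomial_pmf(k, n, p) for k in range(n + 1)) - mean**2
print(mean, var)                  # ≈ 3.0 and ≈ 2.1, i.e. np and np(1-p)
```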

• Poisson distribution: consider situations in which we count objects, like detected photons. This is a one-sided proposition: we know the number we detect, but not the number we don't detect! However, the distribution of the number we detect can be derived from the binomial distribution. Consider the case where we split the detection interval into very small pieces so that there are a very large number n of "trials". The probability, p, of detecting an event in a trial is very small, but from collecting data over a period, we can estimate λ = np, the mean number of detections. The binomial distribution is then

P(r | λ/n, n) = n!/(r!(n − r)!) (λ/n)^r (1 − λ/n)^(n−r)

In the limit of n → ∞,

n!/(n − r)! → n^r

and

(1 − λ/n)^(n−r) → (1 − λ/n)^n → e^(−λ)

(a definition of e). In this limit, we get the Poisson distribution:

P(r|λ) = λ^r e^(−λ) / r!

Using the results for the binomial distribution in the limit p << 1:

µ = np = λ

σ² = np = λ

which is the key result for a Poisson process. Plot of Poisson distribution. Example of possible interest: derivation of the gain from a stack of images, considering what is a robust estimator of the variance (but not the standard deviation!).
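A Python sketch checking both the binomial-to-Poisson limit and the mean = variance = λ property (λ, k, and the sample sizes are arbitrary illustration choices):

```python
import numpy as np
from math import comb, exp, factorial

# Poisson as the limit of the binomial: large n, small p, fixed lambda = n*p.
lam, k = 3.5, 2
n = 100_000
binom = comb(n, k) * (lam / n)**k * (1 - lam / n) ** (n - k)
poisson = lam**k * exp(-lam) / factorial(k)
print(binom, poisson)   # nearly identical

# Simulated Poisson counts have mean = variance = lambda:
rng = np.random.default_rng(2)
counts = rng.poisson(lam, size=200_000)
print(counts.mean(), counts.var())   # both ≈ 3.5
```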

• Gaussian (normal) distribution Many physical variables and some instrumentation uncertainties are distributed according to a Gaussian, or normal, distribution:

P(x|µ, σ) = (1/√(2πσ²)) e^(−(x−µ)²/(2σ²))

This is a symmetric function with width characterized by σ. Note that astronomers often characterize the width by the full width at half maximum, FWHM = 2σ√(2 ln 2) = 2.354σ. The integral of a Gaussian is called the error function (sometimes written erf(x)). This yields the well-known properties that, for a Gaussian, 68% of the points should fall within plus or minus one σ of the mean, and 95.4% within plus or minus two σ. The Gaussian may be common because of the central limit theorem, which says that the sum of many independent variables will have a distribution that is

Gaussian in form, regardless of the distributions of each of the independent variables. A related distribution is the lognormal distribution, which is a Gaussian distribution in the logarithm of a variable, x = log y. This would arise from the central limit theorem for variables that are combined via multiplicative operations. An astronomical example might be a Gaussian distribution of magnitudes, e.g., in globular cluster luminosities.
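The central limit theorem is easy to demonstrate by simulation; a Python sketch summing uniform deviates (the number of terms and trials are arbitrary) and checking the Gaussian 1σ and 2σ fractions:

```python
import numpy as np

rng = np.random.default_rng(3)

# Sum 20 independent uniform deviates per trial: by the central limit
# theorem the sums should be nearly Gaussian, even though each
# individual variable is uniform.
sums = rng.uniform(0.0, 1.0, size=(100_000, 20)).sum(axis=1)

mu, sigma = sums.mean(), sums.std()
print(np.mean(np.abs(sums - mu) < sigma))      # ≈ 0.68 (Gaussian 1-sigma)
print(np.mean(np.abs(sums - mu) < 2 * sigma))  # ≈ 0.95 (Gaussian 2-sigma)
```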

• Beta distribution (useful distribution that gives values between 0 and 1)....

• χ2 distribution

• Student's t-distribution

Understand what uniform, binomial, Gaussian, and Poisson distributions look like. Ideally, know the formulae; certainly, know the mean and standard deviation for Gaussian and Poisson distributions.

4 Data simulation

Generating artificial data can be a powerful way to test your data analysis techniques, and is highly recommended. Note that getting good results from simulated data is a necessary, but not suffi- cient, condition for testing your results: real data may contain things that you are not simulating because you might not be aware of them.

4.1 Random number generation

In many simulation cases, you may wish to simulate some population of objects, drawing your simulated sample from some underlying distribution. Perhaps you want to see whether you can recover the underlying distribution from samples of different sizes. Be aware that making random samples is not always the best approach! Carefully consider whether randomness is required. An example I have run into is comparing actual color-magnitude diagrams with model CMDs: if you make random samplings to make a model CMD, you have to use a very large number of stars to randomly sample low-density (short-lifetime) regions in the CMD. But one can do a statistical comparison with a model CMD that is constructed analytically from stellar isochrones and an IMF! To make simulated samples requires the ability to draw a "random" sample from some distribution, so we need to understand what it means to generate a random number with a computer, and how to generate random numbers from distributions that may not be built in to standard libraries. Random numbers also play a significant role in numerical simulations. There exists considerable literature on the subject, with significant improvements over the past several decades, because random number generation is a critical component of encryption algorithms. Be aware that lousy random number generators exist! So if it is important, research your random number generator; see, e.g., the diehard tests. There is a good discussion of random number algorithms in Numerical Recipes. A simple example of a random number algorithm – but not a good one! – is the linear congruential generator, which generates a sequence of integers x_i using the recurrence relation

x_{i+1} = mod(a x_i + b, m) = (a x_i + b) mod m

where a, b, and m are large integers. Clearly, this requires a starting point, the "seed", and this is generally true of all computer random number generators. The most common modern random number generator is the Mersenne Twister algorithm.
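A Python sketch of a linear congruential generator (the constants are common textbook choices from Numerical Recipes; this is for illustration only, not for real use):

```python
def lcg(seed, a=1664525, b=1013904223, m=2**32):
    # Linear congruential generator: x_{i+1} = (a*x_i + b) mod m.
    # A poor generator, shown only to illustrate the recurrence and the
    # role of the seed.
    x = seed
    while True:
        x = (a * x + b) % m
        yield x / m  # scale the integer to a float in [0, 1)

gen1 = lcg(seed=42)
gen2 = lcg(seed=42)
draws = [next(gen1) for _ in range(5)]
# Deterministic: the same seed always reproduces the same "random" sequence.
assert draws == [next(gen2) for _ in range(5)]
print(draws)
```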

As is evident here, the computer random number generator is really a "pseudo-random" number generator, because computers are deterministic. If you start from the same point, you get the same sequence! Most random number generators have a method for choosing the starting point, e.g., the clock time (perhaps in milliseconds), so you get a different sequence each time you call the random number generator. Although this sounds good, it is actually recommended that you set the seed of the random number generator yourself, as you may actually want repeatability: as you modify your code and assess improvements, you don't want to have to worry about whether the improvements (or lack of them) come from a different set of inputs or from the different code. Of course, you also probably want to check your code with a variety of "random" sequences.

The lowest-level random number generators give uniform deviates, i.e., equal probability of results in some range (usually 0 to 1 for floats). E.g., in Python: random.random, random.seed, numpy.random.random, numpy.random.seed, and the general random number class numpy.random.RandomState. Exercise: generate sets of increasingly large numbers of uniform deviates, and plot them.

For many common distributions, accurate and fast ways of generating random deviates have been developed. Python implementations: numpy.random.normal, numpy.random.poisson; more generally, the numpy RandomState object. But what about for others? Consider the cumulative probability distribution function, which is the integral of the PDF. What does it look like?

Transformation method: consider the cumulative distribution of the desired function. This is a function that varies between 0 and 1. Choose a uniform random deviate between 0 and 1, look up what value of x this corresponds to in the cumulative distribution, and you have your deviate drawn from your function! This does require that you can integrate your function, and then invert that integral.
See NR 7.3.1. In-class exercise: generate random deviates for a "triangular" distribution: p(x) ∝ x for 0 <= x < 1, and 0 otherwise. What is the constant of proportionality that makes this a probability distribution? Use the relation to generate random deviates, and plot them.

What if you can't integrate and invert your function? Rejection method: choose a comparison function c(x) that you can integrate and invert and that is everywhere higher than your desired distribution f(x). Draw a random deviate x from the comparison function as before. Then choose a uniform deviate between 0 and c(x): if it is larger than f(x), reject your value and start again! This method requires two random deviates for each attempt, and the number of attempts before you get a deviate in your desired distribution depends on how close your comparison function is to your desired function. See NR 7.3.2 for a graphical representation of the rejection method.
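A minimal sketch of the transformation method applied to the triangular distribution above: the normalized PDF is p(x) = 2x, so the CDF is F(x) = x², and its inverse is F⁻¹(u) = √u:

```python
import numpy as np

rng = np.random.default_rng(0)

# p(x) = 2x on [0, 1): the normalized triangular distribution.
# CDF: F(x) = x^2, so the inverse is F^{-1}(u) = sqrt(u).
u = rng.random(100_000)   # uniform deviates on [0, 1)
x = np.sqrt(u)            # deviates distributed as p(x) = 2x
```

A quick sanity check is to compare the sample moments against the analytic values ⟨x⟩ = 2/3 and ⟨x²⟩ = 1/2.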

Alternatively, if your function is hard to integrate/invert, just integrate it numer- ically, tabulate it, and use interpolation to do the inversion.
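A sketch of this numerical approach (the function and grid here are illustrative): tabulate the CDF by cumulative summation on a grid, then invert it by interpolation:

```python
import numpy as np

def sample_tabulated(pdf, xgrid, n, rng):
    """Draw n deviates from an arbitrary 1D pdf: integrate it numerically
    on xgrid, then invert the tabulated CDF by linear interpolation."""
    cdf = np.cumsum(pdf(xgrid))
    cdf = cdf / cdf[-1]               # normalize so the CDF ends at 1
    u = rng.random(n)
    return np.interp(u, cdf, xgrid)   # inverse lookup: u -> x

# check against the analytic answer for p(x) = 2x (inverse CDF = sqrt(u))
rng = np.random.default_rng(1)
x = sample_tabulated(lambda t: 2 * t, np.linspace(0, 1, 2001), 100_000, rng)
```

The grid spacing sets the accuracy of the tabulated inversion, so use a grid fine enough for the structure in your PDF.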

Understand how to use and implement the transformation method for getting deviates from any function that is integrable and invertible.

There are lots of other clever ways to generate deviates from desired functions; see Numerical Recipes, chapter 7. Two common distributions are the Gaussian and the Poisson:

P(x, µ, σ) = 1/(σ√(2π)) · exp[−(x − µ)²/(2σ²)]

P(x, µ) = µ^x exp(−µ) / x!

Possible in-class exercises:

• simulate some data, e.g. a linear relation (x=1,10, y=x) with Gaussian (mean=0, sigma=various) scatter. Plot it up.

• take a uniform distribution of stellar magnitudes, e.g., imagine a set of standard stars. Simulate the magnitude difference between observed and input mags, for different choices of zeropoint: m = −2.5 log(e/s) + z

• CMD from isochrones with Poisson scatter in colors and magnitudes.

Know how to use canned functions for generating uniform deviates, Gaussian deviates, and Poisson deviates.

4.2 Data simulation

There are three basic components in data simulation: an underlying physical model, adding the effects from observation (e.g., atmospheric and instrumental effects), and finally, adding noise. Common simulations include imaging and spectroscopic simulations.

4.2.1 Underlying physical model

For imaging simulations, you might model an underlying distribution of objects (e.g., the number density profile of stars in a star cluster, galaxies in a galaxy cluster, etc.), as well as the luminosity distribution of individual objects (point sources for stars, extended objects with radial brightness profiles for galaxies, etc.). For spectral simulations, you might use synthetic spectra from stellar atmosphere calculations, population synthesis calculations, emission line calculations, etc. The underlying image/spectrum should be sampled at sufficiently high resolution to accommodate modeling whatever it is you might hope to extract from the simulated data. This is probably a higher resolution than you will eventually provide the simulated data at.

4.2.2 Effects of observation

For imaging observations, you want to account for smearing of the spatial distribution, e.g., from seeing, diffraction, aberrations, etc., which might be a function of the location in the field of view. This is done using a characterization of the point spread function (PSF), which gives the flux distribution of the system for a point source. You might also want to consider the light loss as a function of bandpass due to absorption in the Earth's atmosphere and, perhaps, the addition of sky "background".

For spectral observations, you want to account for spectral smearing due to the slit width, diffraction, seeing, etc. This is done using a characterization of the line spread function (LSF), which gives the flux distribution with wavelength of a monochromatic source. Note that the LSF may or may not take a simple analytical form, although often, a simple form like a Gaussian might be used. In practice, one would derive an LSF from observation of a monochromatic source, and recognize that it might be a function of wavelength. (Aside: note that there might be more than one thing contributing to the width of lines, e.g., the instrumental profile and also a velocity distribution of the object(s) doing the emitting.) Sample the smoothed spectrum at the desired scale of the simulated instrument: you may need to interpolate.

4.2.3 Noise

Add a background spectrum if significant. Add noise: Poisson for the signal (source + background), Gaussian for readout noise.
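A minimal sketch of this noise model for a simulated spectrum (the count levels and read noise here are illustrative numbers, not from any particular instrument):

```python
import numpy as np

rng = np.random.default_rng(3)

# noiseless simulated spectrum in counts, plus a sky background
source = np.full(2048, 500.0)   # e.g. 500 counts/pixel from the object
sky = 50.0                      # background counts/pixel
read_noise = 5.0                # readout noise, Gaussian sigma in counts

# Poisson noise on the total detected signal (source + background),
# then Gaussian noise for the readout
observed = rng.poisson(source + sky) + rng.normal(0.0, read_noise, source.size)

# sky-subtracted spectrum; note the Poisson noise from the sky remains
spectrum = observed - sky
```

Note that even after subtracting the sky level, its Poisson fluctuations remain in the noise budget of the spectrum.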

4.3 Convolution

The process of smoothing an input by an instrumental profile (PSF or LSF) is a process of convolution. Convolution of two functions is defined as:

g(x) ∗ h(x) = ∫ g(x′) h(x − x′) dx′

Usually, convolution is seen in the context of smoothing, where h(x) is a normalized function (with a sum of unity), centered on zero; convolution is the process of running this function across an input function to produce a smoothed version. Note that the process of convolution can be computationally expensive, especially in multiple dimensions; at each point in a data series, you have to loop over all of the points in the n dimensions that contribute to the convolution integral. In some cases it can be advantageous to consider convolution in the Fourier domain, because of the convolution theorem, which states that convolving two functions in physical space is equivalent to multiplying the transforms of the functions in Fourier space. Multiplication of two functions is faster than convolution!

Note that convolution does not change the sampling of the underlying function, and that the smoothing function must have the same "scale" as the raw function. Of course, after smoothing, one may find that one would not want to sample the smoothed function at a fine spacing; so, e.g., if one was designing a spectrograph, you'd choose the dispersion to match the sampling of the smoothed spectrum. But to simulate spectra, you need to start with underlying spectra at sufficient sampling so that the intrinsic features are well sampled.

Numerical routines exist to do simple convolution, e.g. in Python: numpy.convolve, scipy.signal.convolve; but note that these generally assume a constant smoothing kernel, which might not be the case for real astronomical instruments. If you are coding your own convolution, you can think of the convolution in two ways:

• for each input pixel, the flux gets smeared into a range of output pixels, with different amounts according to the kernel. In this case, the edges of the input will get smeared somewhat outside of the original range, so you have to check for that and either ignore those pixels or produce an output array that is larger (that's the full option in numpy.convolve)

• for each output pixel, the flux comes from a range of input pixels. The input pixel with the shortest wavelength that contributes adds flux according to the rightmost element of the kernel; pixels of larger wavelength contribute according to the elements moving leftward in the kernel. This is the "reversed kernel" part.

You can do this with an explicit loop (that's the slow option) or with a numpy array multiplication (the fast option), where you multiply all of the input pixels that contribute to the output pixel by the reversed kernel and sum them up.
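The output-pixel view can be sketched as follows (a minimal implementation for the fully-overlapping "valid" region only, with a vectorized inner product for each output pixel); it should match the library routine:

```python
import numpy as np

def convolve_loop(signal, kernel):
    """Direct convolution, 'valid' region only: each output pixel is the
    inner product of the overlapping input pixels with the *reversed* kernel."""
    n, m = len(signal), len(kernel)
    rev = kernel[::-1]
    out = np.zeros(n - m + 1)
    for i in range(n - m + 1):
        out[i] = np.sum(signal[i:i + m] * rev)   # vectorized inner product
    return out

# a delta function smeared by a normalized 3-pixel smoothing kernel
signal = np.array([0.0, 0.0, 1.0, 0.0, 0.0, 0.0])
kernel = np.array([0.25, 0.5, 0.25])
smooth = convolve_loop(signal, kernel)
reference = np.convolve(signal, kernel, mode='valid')
```

Smearing a delta function like this simply reproduces the (reversed) kernel, which is a handy test of any convolution code.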

Discussion of sampling theorem and critical sampling...?

4.4 Cross correlation

4.5 Forward vs inverse modeling

Understand how you can go about simulating data. Know what convolution is and how to implement/use it. Know what the convolution theorem is.

5 Statistical inference

Let us turn to statistical inference, the process of drawing conclusions from data. This is often done in the context of fitting a model to data.

5.1 Overview: frequentism vs Bayesian

Given a set of observations/data, one often wants to summarize and get at the underlying physics by fitting some sort of model to the data. The model might be an empirical model or it might be motivated by some underlying theory. In many cases, the model is parametric: there is some number of parameters that specify a particular model out of a given class. The general scheme for doing this is to define some merit function that is used to determine the quality of a particular model fit, and choose the parameters that provide the best match, e.g. a minimum deviation between model and data, or a maximum probability that a given set of parameters matches the data. When doing this one also wants to understand something about how reliable the derived parameters are, and also about how good the model fit actually is, i.e. to what extent it is consistent with your understanding of the uncertainties on the data.

There are two different "schools" of thought about model fitting: frequentist and Bayesian. In the frequentist approach (sometimes called the classical approach), one considers how frequently one might expect to observe a given data set given some underlying model (i.e., P(D|M), the probability of observing a data set given a model); if very infrequently, then the model is rejected. The model that "matches" the observed data most frequently (based on some calculated statistic) is viewed as the correct underlying model, and as such gives the best parameters, along with some estimate of their uncertainty. Here, uncertainty refers to the range of different observed data sets that are consistent with a given model.

The Bayesian approach considers the probability that a given model is correct given a set of data (i.e. P(M|D), the probability of a model given a dataset). It allows for the possibility that there is external information that may prefer one model over another, and this is incorporated into the analysis as a prior.
It considers the probability of different models, and so talks about probability distribution functions of parameters. Examples of priors: fitting a Hess diagram with a combination of SSPs, with external constraints on allowed ages; fitting UTR data for low count rates in the presence of readout noise. The frequentist paradigm was mostly used in astronomy up until fairly recently, but the Bayesian paradigm has become increasingly widespread. In many cases, they can give the same result, but with somewhat different interpretation; in some cases, results can differ.

Bayesian vs. Frequentist Statistics

Suppose we have measured the mean mass of a sample of G stars, by some method, and say: at the 68% confidence level the mean mass of G stars is a ± b. What does this statement mean?

Bayesian answer: There is some true mean mass, M, of G stars, and there is a 68% probability that a − b < M < a + b. More pedantically: the hypothesis that the true mean mass, M, of G stars lies in the range a − b to a + b has a 68% probability of being true. The probability of the hypothesis is a real-numbered expression of the degree of belief we should have in the hypothesis, and it obeys the axioms of probability theory.

In classical or frequentist statistics, a probability is a statement about the frequency of outcomes in many repeated trials. With this restricted definition, one can't refer to the probability of a hypothesis: it is either true or false. One can refer to the probability of data if a hypothesis is true, where probability means the fraction of time the data would have come out the way it did in many repeated trials. So the statement means something like: if M = a, we would have expected to obtain a sample mean in the range a ± b 68% of the time.

This is the fundamental conceptual difference between Bayesian and frequentist statistics. Bayesian: evaluate the probability of a hypothesis in light of data (and prior information). Parameter values or the probability of truth of a hypothesis are random variables; data are not (though they are drawn from a pdf). Frequentist: evaluate the probability of obtaining the data; more precisely, the fraction of times a given statistic (such as the sample mean) applied to the data would come out the way it did in many repeated trials, given the hypothesis or parameter values. Data are random variables; parameter values or the truth of hypotheses are not.

An opinion: The Bayesian formulation corresponds better to the way scientists actually think about probability, hypotheses, and data.
It provides a better conceptual basis for figuring out what to do in a case where a standard recipe does not neatly apply. But frequentist methods sometimes seem easier to apply, and they clearly capture some of our intuition about probability.

Criticism of the Bayesian approach: The incorporation of priors makes Bayesian methods seem subjective, and it is the main source of criticism of the Bayesian approach. If the data are compelling and the prior is broad, then the prior doesn't have much effect on the posterior. But if the data are weak, or the prior is narrow, then it can have a big effect. Sometimes there are well-defined ways of assigning an uninformative prior, but sometimes there is genuine ambiguity. Bayesian methods sometimes seem like a lot of work to get to a straightforward answer. In particular, we sometimes want to carry out an absolute hypothesis test without having to enumerate all alternative hypotheses.

Criticism of the frequentist approach: it doesn't correspond as well to scientific intuition. We want to talk about the probability of hypotheses or parameter values. The choice of which statistical test to apply is often arbitrary. There is not a clear way to go from the result of a test to an actual scientific inference about parameter values or the validity of a hypothesis. Bayesians argue that frequentist methods obtain the appearance of objectivity only by sweeping priors under the rug, making assumptions implicit rather than explicit. The frequentist approach relies on hypothetical data as well as the actual data obtained. The choice of hypothetical data sets is often ambiguous, e.g., in the stopping problem. Sometimes we do have good prior information. It is straightforward to incorporate this in a Bayesian approach, but not so in a frequentist one. Frequentist methods are poorly equipped to handle nuisance parameters, which in the Bayesian approach are easily handled by marginalization.
For example, the marginal distribution of a parameter x,

p(x) = ∫ p(x, a, b, c) da db dc

can only exist if x is a random variable.

In practice, frequentist analysis yields parameters and their uncertainties, while Bayesian analysis yields probability distribution functions of parameters. The latter is often more computationally intensive to calculate. Bayesian analysis includes explicit priors. For some more detailed discussion, see http://arxiv.org/abs/1411.5018, https://en.wikipedia.org/wiki/Bayesian probability, Tom Loredo's Bayesian Reprints.

In the context of Bayesian analysis, we can talk about the probability of models, and we have:

P(M|D) = P(D|M) P(M) / P(D)

More accurately, we might want to talk about the probability of the parameters of a given model, i.e.

P(θ|D,M) = P(D|θ,M) P(θ|M) / P(D|M)

where we might also consider different models with different parameters.

P(M|D) is called the posterior. This gives the probability distribution of the model or model parameters.

P(D|M) is the likelihood, the probability that we would observe the data given a model. Calculation of the likelihood, P(D|M), is sometimes straightforward, sometimes difficult. The background information may specify assumptions like a Gaussian error distribution, or independence of data points. An important aspect of the Bayesian approach: only the actual data enter, not hypothetical data that could have been taken. All the evidence of the data is contained in the likelihood.

P(M) is the prior. In the absence of any additional information, this is equivalent for all models (i.e. a noninformative prior). But what "equivalent" means if the model has parameters is ambiguous: for example, note that a uniform prior is not necessarily invariant to a change in variables, e.g. uniform in the logarithm of a variable is not uniform in the variable! The Bayesian would say that even a "non-informative" prior is itself a prior.
In the noninformative case, the Bayesian result is the same as the frequentist maximum likelihood. In the Bayesian analysis, however, we'd calculate the full probability distribution function for the model parameter, µ. More on that later. Note that a uniform prior in any variable is an improper probability, i.e., it does not normalize to unity. Such a prior causes problems for interpretation of the posterior, because the posterior then cannot be normalized. If the posterior can be normalized, this allows one to consider not just the peak of the posterior PDF (the most likely model), but also the mean, interquartile range, etc. When the PDF is normalizable, Bayesian analysis often talks about the "credible region," which is comparable to the "confidence intervals" used in frequentist analysis.

P(D) is the evidence. In many cases, this can just be viewed as a normalization constant so that the posterior sums to unity. However, it can be important if we consider the comparison of different models, where we might distinguish between parameters of one model vs. those of another; in this case we might write P(θ|M,D), etc., where θ are the parameters of model M. The global likelihood of the data, P(D|M), is the sum (or integral) over all hypotheses. Often P(D|M) doesn't matter: in comparing hypotheses or parameter values, it cancels out. When needed, it can often be found by requiring that P(M|D) integrate (or sum) to one, as it must if it is a true probability. The Bayesian approach forces specification of alternatives to evaluate hypotheses. Frequentist assessment tends to do this implicitly via the choice of statistical test.

Understand the basic conceptual differences between a frequentist and a Bayesian analysis. Know Bayes’ theorem. Understand the practical differences.

A simple example of where one might see the value of the Bayesian approach: stellar parallaxes with uncertainties. See Bailer-Jones 3.5. Consider the situation where we have a measurement of a parallax: how do you get the distance? Now, however, consider that the measurement has some uncertainty, so we want to determine the range of distances that are consistent with the observation. An initial approach might be to give the most probable distance, and estimate its uncertainty by propagation of uncertainties:

d = 1/p

σ_d² = (1/p²)² σ_p²

However, it's clear that if the uncertainty in p is Gaussian, the uncertainty in d is not. Furthermore, for the most distant objects, it's not impossible that one would measure a negative parallax: what do you do with that, recognizing that it does contain some information? Bailer-Jones Fig 3.4 demonstrates the situation: the real uncertainties are not Gaussian. And even with the real uncertainties, you can end up with negative distances! Posing this as a Bayesian inference problem allows an alternate path.

P(d|p) = P(p|d) P(d) / P(p)

If we use a uniform prior (all distances are equally likely), then this is an improper prior. In this case, we can't normalize the posterior probability distribution function, so all we can really use from it is the maximum of P(d|p), which will be at d = 1/p (Bailer-Jones 3.5). But a uniform prior is not realistic! For one, we know we can't have negative distances, so at a bare minimum, one wants to set P(d) = 0 for d < 0. But this doesn't help that much, as we still have an unnormalizable posterior. But consider that there must be some selection function to enable a parallax measurement at all, e.g., a star must be brighter than some limit for it to be detected and measured. Given that we know that stars have maximum intrinsic luminosities, this yields a maximum distance. Furthermore, we know that stars are not uniformly distributed in space. Finally, the volume element for a star grows as d². For the case of a uniform prior in r but with a maximum distance, r_lim, see Bailer-Jones 3.6.
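A minimal sketch of the idea on a numerical grid (the parallax, its uncertainty, and the distance limit here are illustrative values, not from any catalog): combine a Gaussian likelihood P(p|d) with a uniform-space-density prior, d² truncated at some d_lim, and normalize the posterior numerically:

```python
import numpy as np

# measured parallax and Gaussian uncertainty (illustrative numbers):
# p = 10 +/- 3 mas, so the naive distance is 1/p = 100 pc
p_obs, sigma_p = 0.010, 0.003           # arcsec

d = np.linspace(0.1, 1000.0, 10000)     # distance grid (pc)
likelihood = np.exp(-0.5 * (p_obs - 1.0 / d) ** 2 / sigma_p ** 2)  # P(p|d)

# prior: uniform space density (volume element grows as d^2),
# truncated at an assumed maximum distance d_lim = 500 pc
prior = d ** 2
prior[d > 500.0] = 0.0

posterior = likelihood * prior
posterior = posterior / (posterior.sum() * (d[1] - d[0]))  # normalize to unit area

d_peak = d[np.argmax(posterior)]        # most probable distance
```

With this prior the posterior mode lands beyond the naive 1/p = 100 pc, illustrating how the d² volume factor pulls the inferred distance outward.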

6 Maximum likelihood and least squares

Let's start with an analysis of the likelihood, which is a common part of a frequentist analysis and also a component of a Bayesian analysis. Given a set of data and some model (with parameters), one considers the probability that the set of data would be observed if the model is correct, and we will prefer the set of parameters that maximizes this probability. Consider the example of fitting a straight line to a set of data. How is the best fit line usually determined? Where does this come from? Consider that observed data points have some measurement uncertainty, σ. For now consider that all points have the same uncertainty, i.e. homoscedastic uncertainties, and that the uncertainties are distributed according to a Gaussian distribution.

What is the probability of observing a given data value from a known function y = ax + b, given an independent variable xi, and a known Gaussian distribution of uncertainty in y? What’s the probability of observing a series of two independently drawn data values? The probability of observing a series of N data points is, given some model y(xi|aj), where the aj represent a set of M parameters:

P(D|M) ∝ ∏_{i=0}^{N−1} exp[ −(y_i − y(x_i|a_j))² / (2σ²) ] Δy

where the a_j are the parameters of the model, and Δy is some small range in y. Maximizing the probability is the same thing as maximizing the logarithm of the probability, which is the same thing as minimizing the negative of the logarithm, i.e. minimize:

−log P(D|M) = ∑_{i=0}^{N−1} (y_i − y(x_i|a_j))² / (2σ²) − N log(Δy) + const

If σ is the same for all points, then this is equivalent to minimizing

∑_{i=0}^{N−1} (y_i − y(x_i|a_j))²

i.e., least squares.

Understand how least squares minimization can be derived from a maximum like- lihood consideration in the case of normally-distributed uncertainties.

Consider a simple application: multiple measurements of some quantity with the same uncertainty, σ, on all points (homoscedastic uncertainties), so the model is y(x_i) = µ, where we want to determine the most probable value of the quantity, i.e. µ. What's the answer? Minimizing gives:

d(−log P(D|M))/dµ ∝ −2 ∑_{i=0}^{N−1} (y_i − µ) = 0

µ = ∑ y_i / N

which is a familiar result! This validates the concept of using the calculation of the sample mean. What about the uncertainty on the mean? Just use error propagation: for µ = f(y_0, y_1, ...),

σ_µ² = (∂µ/∂y_0)² σ_{y_0}² + (∂µ/∂y_1)² σ_{y_1}² + ...

σ_µ² = ∑ σ²/N² = σ²/N

σ_µ = σ/√N

What about the case of unequal uncertainties on the data points, i.e. heteroscedastic uncertainties? Then

−log P(D|M) = ∑_{i=0}^{N−1} (y_i − y(x_i|a_j))² / (2σ_i²) + const

Here one is minimizing a quantity X²:

X² = ∑_{i=0}^{N−1} (y_i − y(x_i|a_j))² / σ_i²

which is often referred to as χ², but perhaps incorrectly. An important thing about χ² is that the probability of a given value of χ², given N data points and M parameters, can be calculated analytically; it is called the chi-square distribution for ν = N − M degrees of freedom. See, e.g., scipy.stats.chi2, e.g. the cumulative density function (cdf), which you want to be in some range, e.g., 0.05–0.95. (Note, however, that there may be some question whether the X² as defined above necessarily follows the statistical χ² distribution; see the discussion in Feigelson & Babu, and also the bottom of https://asaip.psu.edu/forums/astrostatistics-forum/811087908. The maximum likelihood part is still fine; it's just the evaluation of the quality of the fit using a χ² test that might not be.) Often, people qualitatively judge the quality of a fit by the reduced χ², or χ² per degree of freedom:

χ²_ν = χ² / (N − M)

which is expected to have a value near unity (but really, you should calculate the probability of χ², since the spread in χ²_ν depends on ν: the standard deviation of χ² is √(2ν)). It is important to recognize that this analysis depends on the assumption that the uncertainties are distributed according to a normal (Gaussian) distribution!
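A minimal sketch of evaluating a fit this way with scipy.stats.chi2 (the X², N, and M values here are made up for illustration):

```python
import numpy as np
from scipy.stats import chi2

# suppose a fit gives X^2 = 25.0 with N = 30 points and M = 2 parameters
X2 = 25.0
nu = 30 - 2                   # degrees of freedom

p_lower = chi2.cdf(X2, nu)    # probability of getting a value <= X^2
reduced = X2 / nu             # reduced chi^2; expect ~1 for a good fit
spread = np.sqrt(2 * nu)      # standard deviation of chi^2 for nu dof
```

Here the cdf value falls comfortably inside a range like 0.05–0.95, so one would not reject the fit on this basis.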

Using χ2 as a method for checking uncertainties, or even determining them (at the expense of being able to say whether your model is a good representation of the data!). See ML 4.1

Know specifically what χ2 is and how to use it. Know what reduced χ2 is, and understand degrees of freedom.

Back to the case of unequal uncertainties on the data points. For our simple model of measuring a single quantity, we have:

d(−log P(D|M))/dµ ∝ −2 ∑ (y_i − µ)/σ_i² = 0

∑ y_i/σ_i² − µ ∑ 1/σ_i² = 0

µ = (∑ y_i/σ_i²) / (∑ 1/σ_i²)

i.e., a weighted mean. Again, you can use error propagation to get (work not shown):

σ_µ = (∑ 1/σ_i²)^(−1/2)

Let's move on to a slightly more complicated example: fitting a straight line to a set of data. Here, the model is

y(xi) = a + bxi where we now have two parameters (a and b) and χ2 is:

χ² = ∑_{i=0}^{N−1} (y_i − a − b·x_i)² / σ_i²

It should be clear that this is a quadratic surface as a function of the two parameters, a and b. Again, we want to minimize χ², so we take the derivatives with respect to each of the parameters and set them to zero:

∂χ²/∂a = −2 ∑ (y_i − a − b·x_i)/σ_i² = 0

∂χ²/∂b = −2 ∑ x_i (y_i − a − b·x_i)/σ_i² = 0

Separating out the sums, we have:

∑ y_i/σ_i² − a ∑ 1/σ_i² − b ∑ x_i/σ_i² = 0

∑ x_i y_i/σ_i² − a ∑ x_i/σ_i² − b ∑ x_i²/σ_i² = 0

or

a·S + b·S_x = S_y

a·S_x + b·S_xx = S_xy

where the various S are a shorthand for the sums:

S = ∑ 1/σ_i²

S_x = ∑ x_i/σ_i²

S_xy = ∑ x_i y_i/σ_i²

S_xx = ∑ x_i²/σ_i²

S_y = ∑ y_i/σ_i²

which is a set of two equations with two unknowns. Note that you could equivalently write the two equations in matrix format:

[ S    S_x  ] [ a ]   [ S_y  ]
[ S_x  S_xx ] [ b ] = [ S_xy ]

where you can note that the square matrix is the sum of the product of derivatives with respect to each parameter, and the column vector is the sum of the product of the derivative with respect to each parameter times the observed value. Solving algebraically, you get:

a = (S_xx·S_y − S_x·S_xy) / (S·S_xx − S_x²)

b = (S·S_xy − S_x·S_y) / (S·S_xx − S_x²)

Know how to derive the least squares solution for a fit to a straight line.

We also want the uncertainties in the parameters. Again, use propagation of errors to get (work not shown!):

σ_a² = S_xx / (S·S_xx − S_x²)

σ_b² = S / (S·S_xx − S_x²)

Finally, we will want to calculate the probability of getting χ² for our fit, in an effort to understand 1) whether our uncertainties are not properly calculated and normally distributed, or 2) whether our model is a poor model. Note that there can be covariances between the parameters! Consider the case where the x values are far from the origin. The covariance is given by:

σ_ab = −σ_b² x̄

It can be eliminated by shifting the coordinate system so that x̄ = 0.
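The S-sum solution above can be sketched directly in Python (function and variable names are illustrative); fitting noiseless data on an exact line should recover the parameters:

```python
import numpy as np

def fit_line(x, y, sigma):
    """Weighted least-squares fit of y = a + b*x via the S-sums."""
    w = 1.0 / sigma ** 2
    S, Sx, Sy = w.sum(), (w * x).sum(), (w * y).sum()
    Sxx, Sxy = (w * x * x).sum(), (w * x * y).sum()
    delta = S * Sxx - Sx ** 2
    a = (Sxx * Sy - Sx * Sxy) / delta
    b = (S * Sxy - Sx * Sy) / delta
    sig_a = np.sqrt(Sxx / delta)   # uncertainty in the intercept
    sig_b = np.sqrt(S / delta)     # uncertainty in the slope
    return a, b, sig_a, sig_b

# exact line y = 1 + 2x: the fit should recover a = 1, b = 2
x = np.arange(10, dtype=float)
a, b, sig_a, sig_b = fit_line(x, 1.0 + 2.0 * x, np.full(10, 0.5))
```

A more realistic test would add Gaussian scatter to y and check that the recovered parameters agree with the inputs within the quoted uncertainties.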

6.1 General linear fits

We can generalize the least squares idea to any model that is some linear combination of terms that are a function of the independent variable, e.g.

y = a_0 + a_1 f_1(x) + a_2 f_2(x) + a_3 f_3(x) + ...

Such a model is called a linear model, because it is linear in the parameters (but not at all necessarily linear in the independent variable!). The model could be a polynomial of arbitrary order, but could also include trigonometric functions, etc. We write the model in a simple form:

y(x) = ∑_{k=0}^{M−1} a_k f_k(x)

where there are M parameters. We will still have N data points and, as you might imagine, we require N > M. We can write the merit function as:

X² = ∑_{i=0}^{N−1} [ (y_i − ∑_{k=0}^{M−1} a_k f_k(x_i))² / σ_i² ]

Then, minimizing X2 leads to the set of M equations:

0 = ∑_{i=0}^{N−1} (1/σ_i²) [ y_i − ∑_{j=0}^{M−1} a_j f_j(x_i) ] f_k(x_i)

where k = 0, ..., M − 1. Note that the f_k(x) are the derivatives of the function y(x) with respect to each of the k parameters. Separating the terms and interchanging the order of the sums, we get:

∑_{i=0}^{N−1} y_i f_k(x_i)/σ_i² = ∑_{j=0}^{M−1} ∑_{i=0}^{N−1} a_j f_j(x_i) f_k(x_i)/σ_i²

Define:

α_jk = ∑_{i=0}^{N−1} f_j(x_i) f_k(x_i)/σ_i²

β_j = ∑_{i=0}^{N−1} f_j(x_i) y_i/σ_i²

then we have the set of equations:

∑_{j=0}^{M−1} α_jk a_j = β_k

for k = 0, ..., M − 1. The form of these equations can be simplified even further by casting them in terms of the design matrix, A, which consists of N measurements of M terms:

A_ij = f_j(x_i)/σ_i

with N rows and M columns. Along with the definition:

b_i = y_i/σ_i

we have:

α = Aᵀ·A

which is an M by M matrix,

β = Aᵀ·b

which is a vector of length M, and

Aᵀ·A·a = Aᵀ·b

where the dots are for the matrix operations that produce the sums. This is just another notation for the same thing, introduced here in case you run across this language or formalism. In fact, this can be even further generalized to allow for covariances, σ_ij, in the data:

A_ij = f_j(x_i)

C_ii = σ_i²

C_ij = σ_ij

b_i = y_i

Aᵀ·C⁻¹·A·a = Aᵀ·C⁻¹·b

For a given data set, we can calculate α and β either by doing the sums or by setting up the design matrix and using matrix arithmetic. We then need to solve the set of equations for the a_k.
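The design-matrix route can be sketched with numpy (the function names and the quadratic test model are illustrative); α = AᵀA, β = Aᵀb, and the parameters follow from α⁻¹β:

```python
import numpy as np

def linear_lsq(x, y, sigma, funcs):
    """General linear least squares via the design matrix A_ij = f_j(x_i)/sigma_i.
    Returns the best-fit parameters and the covariance matrix C = (A^T A)^-1."""
    A = np.column_stack([f(x) / sigma for f in funcs])
    b = y / sigma
    alpha = A.T @ A              # normal-equations matrix
    beta = A.T @ b
    C = np.linalg.inv(alpha)     # covariance matrix
    a = C @ beta
    return a, C

# quadratic fit: y = a0 + a1*x + a2*x^2, with noiseless test data
x = np.linspace(-1, 1, 21)
y = 0.5 - 1.0 * x + 3.0 * x ** 2
a, C = linear_lsq(x, y, np.full(x.size, 0.1),
                  [lambda t: np.ones_like(t), lambda t: t, lambda t: t ** 2])
```

For production problems you would typically solve the normal equations (or use an SVD) rather than explicitly inverting α, as discussed below; the explicit inverse is kept here because its diagonal gives the parameter variances.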

Know the equations for a general linear least squares problem, and how they are derived. Know what a design matrix is, and how to formulate the problem using matrix notation.

6.2 Solving linear equations This is just a linear algebra problem, and there are well-developed techniques (see Numerical Recipes chapter 2). Most simply, we just invert the matrix α to get

a_k = ∑_j (α⁻¹)_kj β_j

or, with the design matrix formalism

a = (Aᵀ·A)⁻¹ (Aᵀ·b)

A simple algorithm for inverting a matrix is called Gauss-Jordan elimination, but there are others. In particular, NR recommends the use of singular value decomposition for solving all but the simplest least-squares problems; this is especially important if your problem is nearly singular, i.e. where two or more of the equations may not be totally independent of each other. Fully singular problems should be recognized and redefined, but it is possible for non-singular problems to encounter singularity under some conditions, depending on how the data are sampled. See NR 15.4.2 and Chapter 2! As an aside, note that it is possible to solve a linear set of equations significantly faster than by inverting the matrix, through various types of matrix decomposition, e.g., LU decomposition and Cholesky decomposition (again, see NR Chapter 2). There may be linear algebra problems that you encounter that only require the solution of the equations and not the inverse matrix. However, for the least squares problem, we often actually do want the inverse matrix C = α⁻¹, because its elements have meaning. Propagation of errors gives:

σ^2(a_k) = C_kk

i.e., the diagonal elements of the inverse matrix. The off-diagonal elements give the covariances between the parameters, and the inverse matrix, C, is called the covariance matrix. A Python implementation of matrix inversion and linear algebra in general can be found in scipy.linalg; linear least squares is implemented in scipy.linalg.lstsq(), but this is really just an implementation of the matrix formulation. For the specific linear case of a polynomial fit, you can use numpy.polyfit(). For the formulation of a design matrix for a polynomial fit, see numpy.vander(). Linear least squares is also implemented in astropy in the modeling module using the LinearLSQFitter class (which uses numpy.linalg); astropy has a number of standard models that you might use, or you can provide your own custom model.

from astropy.modeling import models, fitting
fit_p = fitting.LinearLSQFitter()
p_init = models.Polynomial1D(degree=degree)
pfit = fit_p(p_init, x, y)
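For the polynomial case, a minimal sketch using numpy.polyfit with weights and the returned covariance matrix; the data values here are made up for illustration:

```python
import numpy as np

# Sketch: polynomial (linear-in-parameters) fit with numpy.polyfit,
# returning the parameter covariance matrix. Data are illustrative.
rng = np.random.default_rng(0)
x = np.linspace(0., 10., 100)
sigma = 0.3
y = 0.5 + 1.5 * x + rng.normal(0., sigma, x.size)

# w gives 1/sigma weights; cov='unscaled' returns C = alpha^-1 directly
coef, cov = np.polyfit(x, y, 1, w=np.full(x.size, 1.0 / sigma),
                       cov='unscaled')
perr = np.sqrt(np.diag(cov))   # 1-sigma parameter uncertainties
print(coef, perr)              # coef ~ [1.5, 0.5] (highest power first)
```

Note that numpy.polyfit returns coefficients with the highest power first, and that the default cov=True rescales the covariance by the reduced χ2, while cov='unscaled' uses the supplied weights as absolute uncertainties.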

Astropy has a number of common models, but you can also define your own model using models.custom_model, by supplying a function that returns values and a function that returns derivatives with respect to the parameters. Let's do it: simulate a data set, fit a line, and output parameters, parameter uncertainties, χ2, and the probability of χ2. See leastsquares.ipynb

Be able to computationally implement a linear least squares solution to a problem.

6.3 Linear problems with lots of parameters

Not all linear problems are just functional fits with small numbers of parameters. A good example is spectral extraction for a multiobject fiber spectrograph. How is this problem formulated in the context of least squares? What does the α matrix look like? What needs to be solved? Note the spectro-perfectionism example! Note that for problems with large numbers of parameters, the linear algebra can become very computationally expensive. In many cases, however, the matrix that represents these problems may only be sparsely populated, i.e., a so-called sparse matrix (see Numerical Recipes 2.7, Figure 2.7.1). In such cases, there are methods that allow the equations to be solved significantly more efficiently. See LAPACK, etc.
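A minimal sketch of solving a sparse linear system with scipy.sparse; the tridiagonal matrix here is a toy stand-in, not a real extraction problem:

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import spsolve

# Sketch: solving a large banded (sparse) linear system, as arises in
# e.g. spectral extraction. The tridiagonal matrix is illustrative.
n = 10000
A = diags([np.full(n - 1, -1.0), np.full(n, 4.0), np.full(n - 1, -1.0)],
          offsets=[-1, 0, 1], format='csc')
b = np.ones(n)
x = spsolve(A, b)              # far cheaper than dense inversion for sparse A
print(np.allclose(A @ x, b))   # check the solution
```

A dense 10000x10000 inversion would need ~800 MB and far more time; the sparse solver exploits the band structure.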

6.4 Nonlinear fits

However, not all fits are linear in the parameters ... which of the following are not?

• y = a_0 x + a_1 e^{−x}

• y = a_0 (1 + a_1 x)

• y = a_0 sin(x − a_1)

• y = a_0 sin^2 x

• y = a_0 e^{−(x−a_1)^2/a_2^2}

In a linear fit, the χ2 surface is parabolic in parameter space, with a single minimum, and linear least squares can be used to determine the location of the minimum. In the non-linear case, however, the χ2 surface can be considerably more complex, and there is a (good) possibility that there are multiple minima, especially when there are multiple parameters, so one needs to be concerned about finding the global minimum and not just a local minimum; see here for some examples. Because of the complexity of the surface, the best fit can't be found with a simple calculation as it could be for the linear fit. A conceptually simple approach would be a grid search, where one simply tries all combinations of parameters and finds the one with the lowest χ2. This can be extremely computationally expensive, especially for problems with more than a few parameters! One is also forced to decide on a step size in the grid, although you might imagine successively refining the grid as you proceed. Better approaches attempt to use the χ2 values to find the minimum more efficiently. They can generally be split into two classes: those that use the derivative of the function and those that don't. If you can calculate derivatives of χ2 with respect to your parameters, then this provides information about how far you can move in parameter space towards the minimum. If you can't calculate derivatives, you can evaluate χ2 at several different locations, and use these values to try to work your way towards the minimum. Around the final minimum, the χ2 surface can be approximated as a parabola, and it is possible to correct the solution to the minimum if one can arrive at a set of parameters near the final minimum, where the parabolic approximation is good.
Instead of solving for the parameters, one solves for corrections to a set of initial parameters, where the data matrix is now the difference between the observed data and the model with the set of initial parameters. For details, see NR 15.5 or Robinson Chapter 6. One arrives at a set of equations closely parallel to the equations for the linear case:

Σ_{l=0}^{M−1} α_kl δa_l = β_k

similar to what we had for the linear problem, but note that the data array (β_k) now involves the difference between the data and the current guess model, and we are calculating corrections to the current guess parameters rather than being able to solve directly for the parameters.

β_k = −(1/2) ∂χ2/∂a_k = Σ_{i=0}^{N−1} [(y_i − y(x_i|a))/σ_i^2] ∂y(x_i|a)/∂a_k

α_kl = (1/2) ∂^2χ2/∂a_k∂a_l = Σ_{i=0}^{N−1} (1/σ_i^2) [∂y(x_i|a)/∂a_k · ∂y(x_i|a)/∂a_l − (y_i − y(x_i|a)) ∂^2y(x_i|a)/∂a_k∂a_l]

The matrix α_kl is known as the curvature matrix. In most implementations, the second derivative term is dropped because it is negligible near the minimum, where the sum over the y_i − y(x_i|a) terms should cancel; see NR 15.5.1 for a partial explanation. The standard approach then uses:

α_kl = Σ_{i=0}^{N−1} (1/σ_i^2) ∂y(x_i|a)/∂a_k · ∂y(x_i|a)/∂a_l

To implement this, you choose a starting guess of parameters, solve for the corrections δa_l to be applied to the current set of parameters, and iterate until you arrive at a solution that does not change significantly.

Far from the solution, this method can be significantly off in providing a good correction, and, in fact, can even move parameters away from the correct solution. In this case, it may be necessary to simply head in the steepest downhill direction in χ2 space, which is known as the method of steepest descent. Note that while this sounds like a reasonable thing to do in all cases, it can be very inefficient in finding the final solution (see, e.g. NR Figure 10.8.1). A common algorithm switches between the method of steepest descent and the parabolic approximation and is known as the Levenberg-Marquardt method. This is done by using a modified curvature matrix:

Σ_{l=0}^{M−1} α′_kl δa_l = β_k

where

α′_jj = α_jj (1 + λ)

α′_jk = α_jk (for j ≠ k)

When λ is large, this gives a steepest descent method, with a larger λ corresponding to a smaller step; when λ is small, it approaches the parabolic method. This leads to the following recipe:

1. choose a starting guess of the parameter vector (a) and calculate χ2(a)

2. calculate correction using model with small λ, e.g., λ = 0.001, and evaluate χ2 at new point

3. if χ2(a + δa) > χ2(a), increase λ and try again

4. if χ2(a + δa) < χ2(a), decrease λ, update a, and start the next step

5. stop when convergence criteria are met

6. invert the curvature matrix to get the parameter uncertainties

This method, along with several others, is implemented in scipy.optimize.curve_fit(), where one provides input data and a model function. The Levenberg-Marquardt method is also implemented in astropy in the modeling module using the LevMarLSQFitter class, which is used identically to the linear least squares fitter described above. Two key issues in nonlinear least squares are finding a good starting guess and a good convergence criterion. You may want to consider multiple starting guesses to verify that you're not converging to a local minimum. For convergence, you can look at the amplitude of the change in the parameters or the amplitude of the change in χ2.
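A hedged sketch of a nonlinear fit with scipy.optimize.curve_fit (which uses Levenberg-Marquardt by default for unbounded problems); the Gaussian model, noise level, and starting guess are illustrative:

```python
import numpy as np
from scipy.optimize import curve_fit

# Sketch: nonlinear least squares with curve_fit. The model is the
# Gaussian y = a0 exp(-(x-a1)^2/a2^2); all data values are illustrative.
def gauss(x, a0, a1, a2):
    return a0 * np.exp(-(x - a1)**2 / a2**2)

rng = np.random.default_rng(1)
x = np.linspace(-5., 5., 200)
y = gauss(x, 2.0, 0.5, 1.0) + rng.normal(0., 0.05, x.size)

p0 = [1.5, 0.0, 1.5]     # starting guess matters!
popt, pcov = curve_fit(gauss, x, y, p0=p0,
                       sigma=np.full(x.size, 0.05), absolute_sigma=True)
perr = np.sqrt(np.diag(pcov))   # parameter uncertainties from covariance
print(popt)                      # ~ [2.0, 0.5, 1.0]
```

Try a starting guess far from the truth (e.g., a1 = 4) to see convergence to a wrong answer, illustrating the local-minimum issue discussed above.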

Example: fitting the width and/or center of a Gaussian profile. Example: crowded field photometry: given an observation of a field with multiple stars that are sufficiently close to one another that light from different stars overlaps, how do we infer the brightnesses of individual stars? The basic idea is that we can measure the shape (PSF) of an isolated star and make the assumption that all stars have the same shape; we will also assume that we can identify distinct stars, but not their exact positions. Given these, we can formulate this as a parameterized inference problem. The observed data is the distribution of intensities across all of the pixels. What is the model? See leastsquares.ipynb.

Understand the concepts of how a nonlinear least squares fit is performed, and the issues with global vs local minima, starting guesses, and convergence criteria. Know how to computationally implement a nonlinear least squares solution.

6.5 Multivariate fits

For many problems, we may have data in more than one dimension, i.e., a multivariate problem, where we might want to look for relations between the different variables. In this case, rather than fitting a line to a paired set of points, we might want to fit a plane to a series of data points f(x, y), or even work in some higher number of dimensions. Note that the general linear least squares formulation can be applied to problems with multiple independent variables, e.g., fitting a surface to a set of points; simply treat x as a vector of data, and the formulation is exactly the same. See leastsquares.ipynb.
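A minimal sketch of a plane fit, assuming a model f(x, y) = a0 + a1 x + a2 y with made-up data:

```python
import numpy as np

# Sketch: multivariate linear least squares. The design matrix just
# gains a column per independent variable. All values are illustrative.
rng = np.random.default_rng(2)
x = rng.uniform(0., 1., 200)
y = rng.uniform(0., 1., 200)
f = 1.0 + 2.0 * x - 3.0 * y + rng.normal(0., 0.1, 200)

A = np.column_stack([np.ones_like(x), x, y])   # design matrix [1, x, y]
a, *_ = np.linalg.lstsq(A, f, rcond=None)
print(a)   # ~ [1, 2, -3]
```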

Understand how to approach a multivariate fitting problem.

6.6 Least squares with uncertainties on both axes

It is sometimes the case that one might want to look for a relationship between two variables, each of which has measurement uncertainty or intrinsic scatter. Standard least squares treats one variable as independent (and error-free) and one as dependent. Note the Tully-Fisher example. If you fit the model as x(y) rather than y(x), you do not in general get the same solution. There are several possible ways to deal with this situation. One is to do the forward and inverse fits and take the line that bisects them. Another is to fit for the minimum least-squares deviation perpendicular to the fitted relation (see Figure).

See Feigelson and Babu 7.3.2. Note the discussion there about astronomers inventing a statistic ... which turns out to be OK in this case! Alternatively, use a Bayesian approach ....

6.7 Uncertainties and the Cramer-Rao bound

The matrix α that we constructed for the linear least squares problem is a special case of the Hessian matrix, which is defined by the second partial derivatives of the log-likelihood with respect to the parameters of the model:

H_jk(a) = ∂^2 log L / ∂a_j ∂a_k

In the case of a linear model, the second partial derivatives are just products of the first partial derivatives. The Hessian gives information about the curvature of the likelihood surface. The inverse of the Hessian is the covariance, or error, matrix, which gives the uncertainties on the parameters: the diagonal elements are σ^2(a_j), and the off-diagonal elements are the covariances. The Fisher information matrix, I(a), is related to the Hessian: the observed information is the negative of the Hessian evaluated at the maximum likelihood estimate, and the expected information is the expectation value of the negative of the Hessian, i.e., the Hessian averaged over many data sets. The latter is important because it sets the minimum variance that one can expect for derived parameters. This is known as the Cramer-Rao bound: the minimum variance is the inverse of the expected information. The basic idea is straightforward: if the curvature is large, then the peak of the likelihood can be determined to higher accuracy than if the curvature is smaller. The Cramer-Rao bound seems to be coming up with increasing frequency in discussions and in the literature, often in the context of whether people are extracting maximal information from existing data, e.g., abundances and RV examples.

Understand the basic concept of the Cramer-Rao bound. Know what the Hessian matrix and the Fisher information matrix are.

6.8 Scatter, parameter uncertainties and confidence limits

Fitting provides us a set of best-fit parameters but, because of uncertainties and the limited number of data points, these will not necessarily be the true parameters. One generally wants to understand how different the derived parameters might be from the true ones. Scatter in the data may arise from measurement uncertainties, but also from intrinsic scatter in the relationship that is being determined. In the presence of (unknown) intrinsic scatter, the rms around the fit will always be larger than that expected from measurement uncertainty alone. The covariance matrix gives information about the uncertainties on the derived parameters in the case of well-understood uncertainties that are strictly distributed according to a Gaussian distribution. These can be used to provide confidence levels on your parameters; see NR 15.6.5. In some cases, however, you may not be in that situation. Given this, it is common to derive parameter uncertainties by numerical techniques. A straightforward technique, if you have a good understanding of your measurement uncertainties, is to use Monte Carlo simulation. In this case, you simulate your data set multiple times, derive parameters for each simulated data set, and look at the range of fit parameters as compared with the input parameters. To be completely representative of your uncertainties, you would need to draw the simulated data sets from the true distribution, but of course you don't know what that is (it's what you're trying to derive)! So we take the best fit from the actual data as representative of the true distribution, and hope that the difference between the parameters derived from the simulated sets and the input parameters is representative of the difference between the parameters from the actual data set and the true parameters. I believe this procedure is known as the parametric bootstrap.
If you don't have a solid understanding of your data uncertainties, then Monte Carlo will not give an accurate representation of your parameter uncertainties. In this case, you can use multiple samples of your own data to get some estimate of the parameter uncertainties. A common technique is the non-parametric bootstrap, where, if you have N data points, you make multiple simulations using the same data, drawing N data points from the original data set with replacement (i.e., the same data point can be drawn more than once), derive parameters from the multiple simulations, and look at the distribution of these parameters. Another technique is called jackknife resampling. Here, one removes a single data point from the data set each time, recalculates the derived parameters, and looks at the distribution of these. However, you may need to be careful about the interpretation of the confidence intervals determined by any of these techniques, which are based on a frequentist interpretation of the data. For these calculations, note that confidence intervals will change for different data sets. The frequentist interpretation is that the true value of the parameter will fall within the confidence levels at the frequency specified by your confidence interval. If you happen to have taken an unusual (infrequent) data set, the true parameters may not fall within the confidence levels derived from this data set! There is some interesting discussion about the misapplication/misconception of the bootstrap method by astronomers in Laredo 2012 (which addresses several other misconceptions as well). The specific misconception here is "variability = uncertainty" ... See Feigelson and Babu 3.6.2, 3.7.1, and 7.3.3 for additional discussion. See the fitting uncertainties notebook.
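The non-parametric bootstrap described above can be sketched as follows (straight-line fit with illustrative data; the number of resamples is arbitrary):

```python
import numpy as np

# Sketch: non-parametric bootstrap uncertainty on a straight-line fit.
# Data set and number of resamples are illustrative choices.
rng = np.random.default_rng(3)
N = 50
x = rng.uniform(0., 10., N)
y = 1.0 + 2.0 * x + rng.normal(0., 1.0, N)

slopes = []
for _ in range(1000):
    idx = rng.integers(0, N, N)      # resample indices WITH replacement
    p = np.polyfit(x[idx], y[idx], 1)
    slopes.append(p[0])
slope_err = np.std(slopes)           # bootstrap estimate of slope uncertainty
print(slope_err)                     # compare to sigma/sqrt(sum((x-xbar)^2))
```

The jackknife would instead loop over i, dropping point i each time, and scale the scatter of the N leave-one-out estimates appropriately.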

Understand how uncertainties can be estimated from least-squares fitting via the covariance matrix, Monte Carlo simulation, and the bootstrap method, and how these techniques work.

6.9 Minimization of other merit functions

Applicability of least squares is limited to the case where the uncertainties are Gaussian. For other models, one wants to maximize the likelihood (or, usually, minimize −log(likelihood)). If the likelihood can be formulated analytically, this can be accomplished by numerical methods for minimizing a function with respect to its parameters.

6.9.1 A minimizer without derivatives

For some problems, the model may be sufficiently complex that you cannot take the derivative of the model function with respect to the parameters. One example might be the case of fitting model spectra to an observed spectrum. It is always possible to compute derivatives numerically and use these; however, this is not always the best solution, and one may want to consider methods of finding the minimum in parameter space that do not require derivatives. A common minimization method that does not use derivatives is the "downhill simplex" or Nelder-Mead algorithm. For this technique, the function is initially evaluated at M + 1 points (where M is the number of parameters). This makes an M-dimensional figure, called a simplex. The point with the largest value is found, and a trial evaluation is made at a value reflected through the opposite face of the simplex. If this is smaller than the next-highest value of the original simplex, a value twice the distance is chosen and tested; if not, then a value half the distance is tested, until a better value is found. This point replaces the initial point, and the simplex has moved a step; if a better value can't be found, the entire simplex is contracted around the best point, and the process tries again. The process is then repeated until a convergence criterion is met. See NR Figure 10.5.1 and section 10.5. The simplex works its way down the function surface, expanding when it can take larger steps, and contracting when it needs to take smaller steps. Because of this behavior, it is also known as the "amoeba" algorithm. Python/scipy implementation of Nelder-Mead
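A minimal sketch of derivative-free minimization of −log(likelihood) with scipy.optimize.minimize using Nelder-Mead; the sinusoidal model and data values are illustrative:

```python
import numpy as np
from scipy.optimize import minimize

# Sketch: Nelder-Mead (downhill simplex) minimization of -log(likelihood).
# Model y = a0 sin(x - a1) with Gaussian errors; all values illustrative.
rng = np.random.default_rng(4)
x = np.linspace(0., 10., 100)
y = 3.0 * np.sin(x - 1.0) + rng.normal(0., 0.2, 100)

def neg_log_like(a):
    # Gaussian errors: -log L is chi^2/2 up to an additive constant
    return 0.5 * np.sum((y - a[0] * np.sin(x - a[1]))**2 / 0.2**2)

res = minimize(neg_log_like, x0=[2.5, 0.8], method='Nelder-Mead')
print(res.x)   # ~ [3.0, 1.0]
```

No derivatives of the model are needed, only function evaluations, which is the point of the method.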

6.9.2 Minimizers with derivatives

https://scipy-lectures.org/advanced/mathematical_optimization/index.html
https://docs.scipy.org/doc/scipy/reference/tutorial/optimize.html
https://www.benfrederickson.com/numerical-optimization/
https://docs.mantidproject.org/v3.7.1/concepts/FittingMinimizers.html

6.9.3 scipy routines for minimization Python/scipy summary of optimization routines in general scipy.optimize.minimize() Note that, for methods that use derivatives, you can supply these analytically or you can have the routine calculate them numerically. The latter is generally compu- tationally more expensive. Note that I’ve just presented a few possibilities for minimizing non-linear func- tions. There are many more and a very large literature on the subject. If you find yourself having a non-trivial non-linear problem, you are highly advised to consider different possible approaches, as some may lead to significantly more efficient solutions than others! See the minimization notebook

6.9.4 Genetic algorithms

Another set of methods that does not involve derivatives is so-called genetic algorithms; see e.g. this article. The basic idea is to start with lots of different locations in parameter space, evaluate the likelihood, and then "breed" the parameters of the best solutions with each other to find better solutions; see e.g., Charbonneau 1995, see also PIKAIA. The basic steps involve encoding the parameter information and then the operations of crossover and mutation; see here for some elaboration. These seem to be most useful for exploring parameter spaces that are very complex with multiple minima, and may provide an efficient way to find a global minimum. From what I read, it seems that they may not be particularly efficient in homing in on an accurate location for the global minimum once it is located.
https://towardsdatascience.com/genetic-algorithm-implementation-in-python-5ab67bb124a6
A variant, differential evolution: https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.differential_evolution.html

Understand the concepts of how to formulate a maximum likelihood problem as a problem in minimization. Be aware of and know how to use some minimization routines.

7 Bayesian inference

Review material above; also see astroML, NR, and http://arxiv.org/abs/1411.5018. As discussed above and in additional references, Bayesian analysis has several significant differences from the frequentist/likelihood approach. First, Bayesian analysis calculates the probability of the model given the data, rather than the probability of the data given a model.

P(M|D) = P(D|M) P(M) / P(D)

where P(M|D) is the posterior PDF, P(D|M) is the likelihood, P(M) is the prior, and P(D) is the evidence. The frequentist approach concentrates on the likelihood, P(D|M), while the Bayesian approach allows for the possibility of specifying an explicit prior on the model parameters. The Bayesian analysis may give the same result as the frequentist analysis for some choices of the prior, but it makes the prior explicit, rather than hidden. However, there is a fundamental difference in the interpretation of probabilities. If we just consider the likelihood, P(D|M), this is interpreted as the frequency with which the observed data set will be obtained for a particular model. The model itself is fixed. In the Bayesian interpretation, the probability P(M|D) represents information about the state of our knowledge, so we can meaningfully discuss the probability of models (or their parameters). The evidence can be viewed as a normalization for the problem of determining model parameters, but is important when it comes to comparing different models. If we consider that a model may have some number of parameters, θ, and explicitly express the prior as being based on some independent information, I, then we have:

P(M, θ|D, I) = P(D|M, θ, I) P(M, θ|I) / P(D|I)

This formalism makes it explicit that we can also consider different models with different parameters. The Bayesian analysis produces a posterior probability distribution function of the model parameters. The frequentist analysis produces a most probable set of parameters and a set of confidence intervals. In this regard, the Bayesian analysis provides additional information with the full posterior PDF. In some cases, the PDF may be simplified, e.g., by specifying the maximum posterior probability (MAP), the posterior mean, or a credible region/interval, i.e., the range in the posterior PDF that encompasses the probability within some level, which is a similar concept to a confidence interval, but again, with a different interpretation. The Bayesian analysis can be computationally more challenging because you may actually have to numerically compute the probability for different model parameters. This is essentially performing a grid search of the full parameter space, which can be computationally intractable for large numbers of parameters. Fortunately, computational techniques have been developed to search through the full parameter space concentrating only on the regions with non-negligible probability; as discussed below, Markov Chain Monte Carlo is foremost among such techniques.
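As a toy sketch of the brute-force grid evaluation just described, assuming a Gaussian likelihood with known σ and a uniform prior on the mean (all values illustrative):

```python
import numpy as np

# Sketch: posterior for the mean of a Gaussian with known sigma,
# evaluated by brute force on a parameter grid with a uniform prior.
rng = np.random.default_rng(5)
sigma = 1.0
data = rng.normal(2.0, sigma, 20)

mu = np.linspace(0., 4., 401)       # grid over the parameter
logL = np.array([-0.5 * np.sum((data - m)**2) / sigma**2 for m in mu])
post = np.exp(logL - logL.max())    # uniform prior; subtract max to avoid overflow
post /= post.sum() * (mu[1] - mu[0])   # normalize (this plays the role of the evidence)
print(mu[np.argmax(post)])             # MAP estimate: ~ the sample mean
```

With one parameter this is trivial; the cost grows exponentially with the number of parameters, which is the motivation for MCMC discussed below.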

Understand the basic concepts of Bayesian inference and be able to supply and use a Bayesian formulation of a problem. Understand the terms posterior probability distribution function, prior, likelihood, and evidence. Know what the maximum posterior probability means and what the credible region means.

7.1 The prior

Priors in Bayesian analysis reflect external knowledge about parameters that exists independent of the data being analyzed. If there is no external knowledge, then you want to use a noninformative prior. Often, this is just a statement that all parameter values are equally likely. However, beware that model parameters could be expressed with a transformation of variables, and a flat distribution in one variable may not be a flat distribution under a variable transformation, which could lead to different results. The prior often only makes a significant difference when the likelihood is broad, i.e., for a small amount of data or large scatter/uncertainty. In real world problems, parameters may describe physical quantities, and this can help to determine an appropriate prior. For example, if the parameter being inferred is a location parameter, one would want the inferred value of the parameter to be insensitive to the choice of origin of the coordinate system, i.e., under x1 = x0 + c. We want the probability of the parameter to be the same in either coordinate system:

P (x0)dx0 = P (x1)dx1 = P (x0 + c)d(x0 + c) = P (x0 + c)dx0

The only way for this to be true is if P (x) is constant, i.e., a uniform prior.

But the situation is different for a scale parameter, i.e., one that describes the size of something (or, e.g., the width of a distribution). In this case, we want the inferred value to be independent of the units used. If we have x1 = a x0, then

P (x0)dx0 = P (x1)dx1 = P (ax0)d(ax0) = P (ax0)adx0

To satisfy this, P(x) must be proportional to 1/x, since

(1/x0) dx0 = (1/(a x0)) a dx0

This is equivalent to P(ln x) = constant, i.e., a uniform prior in ln x. These are special cases of what is known as the Jeffreys prior. In general, the Jeffreys prior is a prior that ensures that the inferred value of a parameter is invariant to any smooth change of variables. The Jeffreys prior is proportional to the square root of the determinant of the Fisher information matrix, which we have mentioned previously as the expectation value of the second derivative of the likelihood. See Bailer-Jones 5.3.2 for more detail. Note that a situation that appears to call for a uniform prior may not best be handled that way, e.g., consider the fitting parameters for a straight line, in particular the slope (taken from http://jakevdp.github.io/blog/2014/06/14/frequentism-and-bayesianism-4-bayesian-in-python/#The-Prior). See the example in https://arxiv.org/pdf/1411.5018.pdf. Interestingly, this turns out to be a problem with a significant amount of historical controversy; see, e.g., Jaynes 1999. Real world priors incorporate independent information about a model or parameters. Note that the word "prior" really just means independent, as this information does not necessarily have to be obtained temporally before the information provided by a separate data set. In some cases, one may have some information about the prior distribution without knowing the shape of this distribution, e.g., you might have some prior information about the mean value of a parameter. In this case, one can use the principle of maximum entropy (see Ivesic et al 5.2.2) to derive a prior. If the mean of a parameter is known, then the prior is

P(θ|µ) = (1/µ) exp(−θ/µ)

If the mean and variance are known, then the maximum entropy prior is a Gaussian with those values of mean and variance.
For some distributions of the likelihood, there are certain prior distributions that yield a posterior PDF that has the same functional form as the prior PDF. These are called conjugate priors. They are of interest because in such cases it can be possible to calculate the posterior PDF algebraically. For example, if the likelihood function is a Gaussian, a Gaussian prior is a conjugate prior because the posterior PDF is also a Gaussian, with mean:

µ_0 = (µ_p/σ_p^2 + x̄/s^2) / (1/σ_p^2 + 1/s^2)

where x̄ and s^2 are the mean and variance of the likelihood, and with variance of the posterior:

σ_0^2 = (1/σ_p^2 + 1/s^2)^{-1}

This case demonstrates the basic interplay between the prior and the likelihood: if the width of the likelihood is smaller than that of the prior, then the mean of the posterior is closer to the mean of the likelihood, and vice versa. In the real world, priors are determined by independent information. If there is no independent information, you might consider how you can use the data to set the prior. Clearly, you have to be careful of circular reasoning: you cannot start with a uniform prior, determine the posterior from the data, and then use this as the prior! But you can parameterize the posterior from an initial iteration with an uninformative prior, and use this parameterization as a prior for a subsequent calculation of the posterior. This is known as empirical Bayes, and the parameters of the initial posterior are known as hyperparameters. Example: baseball batting averages. Example: Feuillet et al age analysis: based on observed quantities such as Teff, log g, and absolute magnitude, given stellar models, one can estimate the age of a star, i.e., determine a posterior PDF for the age. However, this depends on some prior assumptions, e.g., the underlying distribution of stellar ages and masses. If one derives the age posterior PDF for a significant number of objects, these inform the underlying distribution. In this case, we parameterize the distribution of ages based on the individual PDFs (with a new set of hyperparameters), and then use this parameterization as the prior. This is an approximation to a hierarchical Bayes analysis. If we parameterize the prior with a set of parameters, α, then:

P (θ, α|D) ∝ P (D|θ)P (θ|α)P (α) A hierarchical model would marginalize (see below!) over the set of hyperparam- eters α.
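The conjugate Gaussian update above can be checked with a few lines of arithmetic; the prior and likelihood values here are made up:

```python
# Sketch: conjugate-prior update for a Gaussian likelihood with a
# Gaussian prior, using the posterior mean and width formulas from
# the notes. All numbers are illustrative.
mu_p, sigma_p = 0.0, 2.0      # prior mean and width
xbar, s = 1.0, 0.5            # likelihood mean and width (from the data)

mu0 = (mu_p / sigma_p**2 + xbar / s**2) / (1.0 / sigma_p**2 + 1.0 / s**2)
sigma0 = (1.0 / sigma_p**2 + 1.0 / s**2) ** -0.5

# s < sigma_p, so the posterior mean is pulled close to xbar
print(mu0, sigma0)
```

Since the likelihood (width 0.5) is much narrower than the prior (width 2), the posterior mean lands near x̄, illustrating the interplay described above.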

Understand the concept of a prior and how it might be specified. Know what an uninformative prior is, and what to use for uninformative priors in the case of location and scale parameters. Understand the concept of a Jeffreys prior. Know

what a conjugate prior is and why they can be useful. Understand the concepts of empirical and hierarchical Bayesian analysis.

7.2 Marginalization

For multi-dimensional problems, the posterior PDF gives the probability as a function of all of the parameters. For many problems, only some of the parameters are of interest, while others fall into the category of so-called nuisance parameters, which may be important for specifying an accurate model (which is fundamentally important!), even though they may not be of interest for the scientific question you are trying to answer. In such cases, Bayesian analysis uses the concept of marginalization, where one integrates over the dimensions of the nuisance parameters to provide probability distribution functions of only the parameter(s) of interest. If you have a multivariate posterior probability distribution, P(x, y, z, ...|D), then the marginal PDF for variable x is just

P(x|D) = ∫ P(x, y, z, ...|D) dy dz ...
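A toy sketch of marginalization on a grid, using a made-up correlated two-parameter posterior:

```python
import numpy as np

# Sketch: marginalizing a 2-d grid posterior P(x, y | D) over y by
# summing along that axis. The correlated Gaussian here is illustrative.
x = np.linspace(-3., 3., 121)
y = np.linspace(-3., 3., 121)
X, Y = np.meshgrid(x, y, indexing='ij')
post = np.exp(-0.5 * (X**2 + Y**2 + X * Y))   # toy correlated posterior
post /= post.sum()                            # normalize on the grid

p_x = post.sum(axis=1)     # marginal P(x | D): "integrate" out y
print(x[np.argmax(p_x)])   # peak of the marginal, here at x = 0
```

Note that p_x alone no longer shows the x-y correlation visible in the full 2-d grid, which is exactly the caveat discussed below.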

Marginalization is used to give probability distribution functions of individual parameters, i.e., without regard to the values of other parameters. However, one needs to beware that parameter values may be correlated with each other, and marginalization can hide such correlations. Of course, one can choose to marginalize only over some subset of variables to retain some covariance information. Marginalization will generally broaden the posterior PDF over the case where the other parameters are known perfectly. Example: inferring the parameters of a Gaussian. Say we have a set of measurements of some quantity in a population where this quantity is distributed normally, but with unknown mean and standard deviation. We want to be able to infer these quantities in the Bayesian framework, so we want

P(µ, σ|D) ∝ P(D|µ, σ) P(µ, σ)

The likelihood, P(D|µ, σ), is

P(D|µ, σ) = Π_i (1/(σ √(2π))) exp(−(x_i − µ)^2/(2σ^2))

Factoring out the terms that are the same for all data points:

P(D|µ, σ) = (1/(σ^N (2π)^(N/2))) exp(−Σ(x_i − µ)^2/(2σ^2))

We can write the sum in terms of the sample mean and (biased) sample variance:

x̄ = (1/N) Σ x_i

V = (1/N) Σ (x_i − x̄)^2

to get

P(D|µ, σ) = (1/(σ^N (2π)^(N/2))) exp(−N((x̄ − µ)^2 + V)/(2σ^2))

What about the prior? µ is an example of a location parameter, and σ an example of a scale parameter, so the prior P(µ, σ) is:

P(µ, σ) = 1/σ

Combining these gives the multivariate posterior PDF shown in Bailer-Jones 6.1, for one particular choice of x̄, V, and N. If we were just interested in µ, we would marginalize over σ. It turns out that this integral can be done analytically, and yields a distribution called a Student's t distribution, where

P(t) = [Γ((ν+1)/2) / (√(νπ) Γ(ν/2))] (1 + t^2/ν)^{−(ν+1)/2}

where ν is the number of degrees of freedom (N − 1), and

t = (x̄ − µ) / √(V/(N − 1))

For large N, this approaches a Gaussian (Bailer-Jones 6.2). For σ, the marginal distribution is described by a different distribution (Bailer-Jones 6.3, 6.4).

In the frequentist analysis, we estimated µ and σ and their uncertainties, but we did not characterize the full distributions. For these analytic cases, we can choose N, V, and x̄. For a randomly chosen distribution of points, see bayes.ipynb. Note a more detailed discussion of Bayesian inference of the mean, variance, and standard deviation of a Gaussian distribution here.

Example: Gaussian plus background, with Poisson noise. See bayes.ipynb, also Bailer-Jones 6.5.

Displaying multivariate results: corner plots
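The Gaussian example above can be sketched numerically: evaluate the unnormalized posterior on a (µ, σ) grid and marginalize by direct summation. The data set, grid ranges, and resolution below are illustrative choices, not taken from the notes.

```python
import numpy as np

# Grid-based marginalization of the posterior P(mu, sigma | D) for a
# Gaussian with unknown mean and standard deviation, using the likelihood
# in terms of xbar and V and the 1/sigma prior from the notes.
rng = np.random.default_rng(42)
data = rng.normal(loc=5.0, scale=2.0, size=20)   # illustrative data set
N = len(data)
xbar, V = data.mean(), data.var()                # biased sample variance

mu = np.linspace(3.0, 7.0, 400)
sigma = np.linspace(0.5, 5.0, 400)
MU, SIG = np.meshgrid(mu, sigma, indexing="ij")

# log posterior up to a constant: log likelihood plus log of the 1/sigma prior
logpost = (-N * np.log(SIG)
           - N * ((xbar - MU) ** 2 + V) / (2 * SIG ** 2)
           - np.log(SIG))
post = np.exp(logpost - logpost.max())           # subtract max to avoid underflow
post /= post.sum() * (mu[1] - mu[0]) * (sigma[1] - sigma[0])

# marginalize: integrate (sum) over the nuisance parameter
p_mu = post.sum(axis=1) * (sigma[1] - sigma[0])      # P(mu | D)
p_sigma = post.sum(axis=0) * (mu[1] - mu[0])         # P(sigma | D)

mu_hat = mu[np.argmax(p_mu)]
print(mu_hat)    # should land close to the sample mean
```

The marginal p_mu computed this way can be compared directly against the analytic Student's t result.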

Marginalization can also be used to derive the probability distribution of some desired quantity that depends on other quantities, given the PDF of those other quantities:

m = m(x, y, z)

Given P(x, y, z), you can construct P(m, x, y, z) and marginalize over x, y, z.

Example: determining distances of stars given a set of observations, e.g. of spectroscopic parameters and/or apparent magnitudes. Here, stellar evolution gives predictions for observable quantities as a function of age, mass, and metallicity; in conjunction with an initial mass function, you get predictions for the number of stars for each set of parameters. Given some priors on age, mass, and/or metallicity, you could compute the probability distribution of a given quantity by marginalizing over all other parameters. Observational constraints would modify this probability distribution. Distance depends on apparent magnitude and observed stellar parameters (ignoring extinction here). Split into distance as a function of absolute magnitude, and absolute magnitude as a function of the observed stellar parameters:

P(d|m, obsTe, obslogg, obs[Fe/H], isochrones) = P(d|Mx, m) P(Mx|obsTe, obslogg, obs[Fe/H])

Probability of absolute mag given observed parameters:

P(Mx|obsTe, obslogg, obs[Fe/H]) = P(Mx|Te, logg, [Fe/H]) P(Te, logg, [Fe/H]|obsTe, obslogg, obs[Fe/H])

P(Te, logg, [Fe/H]|obsTe, obslogg, obs[Fe/H]) = G(obsTe, Te, σ(Te)) G(obslogg, logg, σ(logg)) G(obs[Fe/H], [Fe/H], σ([Fe/H])) P(Te, logg, [Fe/H])

Probability of absolute mag given true parameters comes from stellar models:

P(Mx|Te, logg, [Fe/H]) = ∫ P(Mx, Te, logg, [Fe/H]) dTe dlogg d[M/H]
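A sampling-based sketch of deriving the PDF of a quantity from the PDFs of the quantities it depends on: draw the underlying quantities from their joint PDF, evaluate the derived quantity for each draw, and histogram the results. The toy distance-modulus example below uses illustrative Gaussian PDFs and ignores extinction and priors.

```python
import numpy as np

# Marginalization with samples: the histogram of the derived quantity,
# evaluated over draws from the joint PDF of the inputs, is its marginal PDF.
rng = np.random.default_rng(2)
m_app = rng.normal(10.0, 1.0, 100000)   # apparent magnitude samples (illustrative)
M_x = rng.normal(2.0, 0.3, 100000)      # absolute magnitude samples (illustrative)

# distance in pc from the distance modulus m - M = 5 log10(d) - 5
d = 10 ** ((m_app - M_x + 5) / 5)
hist, edges = np.histogram(d, bins=100, density=True)   # marginal PDF of d
print(np.median(d))    # near 10**(13/5), i.e. about 400 pc
```

In the full stellar-distance problem the samples would be drawn from the isochrone-based joint PDF rather than independent Gaussians, but the marginalization step is the same.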

Understand what marginalization means, and why and when it can be used.

7.3 Markov Chain Monte Carlo (MCMC)

Calculating a multi-dimensional probability distribution function, or a marginal probability distribution from it, can be very computationally intensive because of the curse of dimensionality. One technique for multi-dimensional integration is Monte Carlo integration. Here, you just choose a (large) number of points at random within some specified volume

(limits in multiple dimensions), sample the value of your function at these points, and estimate the integral as:

∫ f dV ≈ V ⟨f⟩

where

⟨f⟩ = (1/N) Σ f(x_i)

We could use this to calculate marginal probability distribution functions, but it is likely to be very inefficient if the probability is small over most of the volume being integrated over. Also, for most Bayesian problems, we do not have a proper probability density function because of an unknown normalizing constant; all we have is the relative probability at different points in parameter space. To overcome these problems, we would instead like to place points with density proportional to the probability distribution at each point; we can then calculate the integral by summing up the number of points. This is achieved by a Markov Chain Monte Carlo analysis. Here, unlike plain Monte Carlo, the points we choose are not statistically independent, but are chosen such that they sample the (unnormalized) probability distribution function in proportion to its value. This is achieved by setting up a Markov Chain, a process where the value of a sampled point depends only on the value of the previous point. To get the Markov Chain to sample the (unnormalized) PDF, π, the transition probability has to satisfy:

p(x₂|x₁) / p(x₁|x₂) = π(x₂) / π(x₁)

where p is the transition probability to go from one point to another. Once the Markov Chain has converged and run for some time, the PDF is just determined by the number of points at each set of parameters. To do a marginal PDF, we just sum the number of points in the chain at the different parameter values.

There are multiple schemes to generate such a transition probability. A common one is the Metropolis-Hastings algorithm. In this algorithm, from a given point x₁, generate a candidate step x_c from some proposal distribution, q(x_c|x₁), that is broad enough to move around in the probability distribution in steps that are neither too large nor too small. If the candidate step falls at a larger value of the probability function, accept it, and start again. If it falls at a smaller value, accept it with probability

π(x_c) q(x₁|x_c) / [π(x₁) q(x_c|x₁)]

For a symmetric Gaussian proposal distribution, note that the q values cancel out.

For a simple example, see mcmc.ipynb. Note that you need to choose a starting point. If this is far from the maximum of the probability distribution, it may take some time to get to the equilibrium distribution. This leads to the requirement of a burn-in period for MCMC algorithms. You also have to be careful about choosing an appropriate proposal distribution. You need to have enough independent samples to get a good estimate of the PDF, but also to map out the PDF well. This is related to the choice of step size: with too small steps, the points will be correlated and you will have less information for the amount of calculation; if the steps are too large, you may not sample the PDF finely enough. Clearly, if you just take more steps, you can improve things independently of the step size, but this can lead to very long computation times! For a multi-parameter problem, you could provide a step size that depends on the values of the parameters by giving a covariance matrix for the proposal distribution. So, while straightforward in principle, in practice there are a number of choices that need to be made: starting guesses, step sizes, burn-in period, and number of steps to run. Given that you may have a multi-parameter problem, this is a lot of adjustable parameters! To help determine whether an MCMC run is giving a good representation of the posterior PDF, there are a number of diagnostics that you can (and should) use:

• looking at chains: inspection of the values of the parameters at each step of the chain can provide information about the burn-in and appropriate step size. You are looking for a chain that looks like random noise around a converged value.

• acceptance fraction: you can determine what fraction of the proposed steps were accepted. If this is very large, the step size is too small; if very small, a smaller step size might be needed. A general rule of thumb seems to be to shoot for an acceptance fraction of 25-50%.

• auto-correlation of chains:
You can cross-correlate the chain against itself to determine the degree to which steps are independent. You would like the auto-correlation length to be small, so that you are getting independent samples of the posterior PDF. If it is large, then you can increase the step size, or you can thin the chain, i.e., by taking every Nth point (although this means you are taking lots of extra steps at the price of calculation time!). See this paper for some practical considerations and cautionary notes on MCMC. There are many implementations of MCMC, and Metropolis-Hastings certainly may not be the best choice: one limitation is that of fixed step size. There are also a number of adjustable parameters, in particular, the choice of step sizes (proposal distribution) for each of the parameters. Other common algorithms are Gibbs sampling, affine invariance sampling, etc.
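As a concrete sketch of the algorithm (separate from the mcmc.ipynb example), a minimal Metropolis-Hastings sampler with a symmetric Gaussian proposal might look like the following; the target function, starting point, step size, and burn-in length are all illustrative choices.

```python
import numpy as np

# Minimal Metropolis-Hastings sampler with a symmetric Gaussian proposal,
# so the q terms cancel and the acceptance ratio is just pi(xc)/pi(x).
# log_prob only needs to be known up to an additive constant.
def metropolis(log_prob, x0, step, nsteps, rng):
    x = np.atleast_1d(np.asarray(x0, dtype=float))
    lp = log_prob(x)
    chain, naccept = [], 0
    for _ in range(nsteps):
        xc = x + rng.normal(scale=step, size=x.size)   # candidate step
        lpc = log_prob(xc)
        if np.log(rng.uniform()) < lpc - lp:           # accept/reject
            x, lp = xc, lpc
            naccept += 1
        chain.append(x.copy())                         # repeat x if rejected
    return np.array(chain), naccept / nsteps

# sample an unnormalized standard Gaussian, starting away from the peak
rng = np.random.default_rng(1)
chain, facc = metropolis(lambda x: -0.5 * np.sum(x**2), [3.0],
                         step=1.0, nsteps=20000, rng=rng)
burned = chain[2000:]    # discard the burn-in
print(burned.mean(), burned.std(), facc)
```

Plotting the chain values against step number shows the burn-in from the starting point at 3.0, and the acceptance fraction illustrates the diagnostic described above.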

Implementations of MCMC are available: see, e.g., emcee (paper), and pymc, among others. This blog post compares three packages. Note that emcee uses an algorithm that is different from Metropolis-Hastings, designed to function more efficiently in a case where the PDF may be very narrow, leading to very low acceptance probabilities with Metropolis-Hastings. Partly this is accomplished by using a number of walkers, i.e. multiple different Markov chains, but these are not independent: knowledge from the other walkers is used (somehow) to help explore the parameter space more efficiently. Note that since the walkers are not independent, the chain from a single walker does not sample the PDF proportionally; the combined chain from all of the walkers is what needs to be used. See mcmc.ipynb for a simple example of emcee.

Understand what Markov Chain Monte Carlo analysis accomplishes, and at least qualitatively how it works. Be able to implement an MCMC analysis of a problem.

7.4 Bayesian/MCMC examples

Working example of MCMC implementation: fitting a line (but note that this is just an example: it does not need to be done numerically!). See the Jupyter notebook from Dan Foreman-Mackey: git clone https://github.com/LocalGroupAstrostatistics2015/MCMC.git

Example: fitting a straight line including determination of uncertainties (Bailer-Jones 9.1): even if uncertainties on data points are unknown, we can consider them as a parameter.

Example: outlier model and rejection (Bailer-Jones 9.3). Mixture model:

Ltotal = (1 − p) Lmodel + p Loutlier

where we will determine p, the outlier probability. We need to specify a model for the outliers to be able to calculate Loutlier. What we choose might depend on what we think the nature of the outliers is: if they come from poor measurements (with underestimated uncertainties), the distribution might be centered around the model, but with larger scatter, while if they are just junk, the distribution might just be broad or uniform. We then determine the posterior PDF for the parameters and the outlier fraction, p. Once we have these, we can look for outliers by checking, for each point, the ratio of the outlier likelihood to the total likelihood. See mcmc.ipynb

Example: uncertainties on both coordinates (Bailer-Jones 9.4). Now to calculate the likelihood we want L(x, y|θ, φ), where x, y are the observed coordinates, θ are the parameters, and φ is the noise model for both coordinates. If we write the true coordinates, x⁰, y⁰, in terms of the observed ones:

x⁰ = x + ε_x

y⁰ = y + ε_y

L(x, y|θ, φ) = ∫∫ P(x, y|x⁰, y⁰, φ) P(x⁰, y⁰|θ) dx⁰ dy⁰

For each point x_i, y_i, we integrate over x⁰ and y⁰. However, y⁰ given x⁰ is just the value of the function given the parameters θ. We use the noise model to weight the different x⁰ and to evaluate the probability of getting y_i given y⁰. See mcmc.ipynb
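The mixture likelihood for the outlier model above can be sketched as follows. The broad Gaussian outlier component, its width (10× the measurement uncertainty), and the test data are all illustrative assumptions, not choices made in the notes.

```python
import numpy as np

# Mixture log-likelihood for the outlier model:
#   L_total = (1 - p) * L_model + p * L_outlier
# The outlier component here is an assumed broad Gaussian with width
# 10 * sigma; a uniform distribution would be another reasonable choice.
def log_like_mixture(params, x, y, sigma):
    a, b, p = params                 # line y = a + b*x, outlier fraction p
    if not 0.0 <= p <= 1.0:
        return -np.inf
    resid = y - (a + b * x)
    sig_out = 10.0 * sigma
    l_model = np.exp(-0.5 * (resid / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    l_out = np.exp(-0.5 * (resid / sig_out) ** 2) / (sig_out * np.sqrt(2 * np.pi))
    return np.sum(np.log((1 - p) * l_model + p * l_out))

# a clean line plus one junk point: a nonzero outlier fraction is preferred
x = np.arange(10.0)
y = 1.0 + 2.0 * x
y[5] += 30.0
print(log_like_mixture([1.0, 2.0, 0.1], x, y, 1.0))
print(log_like_mixture([1.0, 2.0, 0.0], x, y, 1.0))   # the outlier kills this
```

In a full analysis this log-likelihood (plus priors) would be handed to an MCMC sampler, and the posterior on p and on the per-point outlier-likelihood ratios would flag the discrepant point.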

7.5 Predictive probabilities

One use of statistical analysis is to make predictions for independent observations. Instead of just using the maximum likelihood parameters to make such a prediction (which doesn't take uncertainties of the parameters into account), we can consider the posterior predictive PDF, P(y_p|x_p, D). Given that we have the posterior PDF P(θ|D), we can marginalize over the parameters θ to get:

P(y_p|x_p, D) = ∫ P(y_p, θ|x_p, D) dθ

P(y_p|x_p, D) = ∫ P(y_p|x_p, θ) P(θ|D) dθ

which is the integral of the likelihood times the posterior PDF. In other words, we are just taking the weighted average of the likelihoods over the different parameter values.
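This weighted average is easy to approximate with posterior samples. In the sketch below the "posterior" samples for a straight-line model are drawn analytically for illustration; in practice they would come from an MCMC chain, and all the numbers are illustrative.

```python
import numpy as np

# Posterior predictive by Monte Carlo: average the likelihood of a new
# point over samples of the parameters drawn from the posterior.
rng = np.random.default_rng(0)
a_samp = rng.normal(1.0, 0.1, 5000)    # posterior samples of the intercept
b_samp = rng.normal(2.0, 0.05, 5000)   # posterior samples of the slope
sigma = 0.5                            # known measurement uncertainty

def predictive(yp, xp):
    mu = a_samp + b_samp * xp          # model prediction for each sample
    like = np.exp(-0.5 * ((yp - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return like.mean()                 # P(yp | xp, D)

yp_grid = np.linspace(15, 27, 200)
p = np.array([predictive(yp, 10.0) for yp in yp_grid])
print(yp_grid[np.argmax(p)])           # peaks near a + b*10 = 21
```

The resulting predictive distribution is broader than the likelihood alone, because the parameter uncertainties are folded in.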

Understand what predictive probabilities are and how to calculate them.

8 Model comparison and number of parameters

In many circumstances, we may not know in advance what the correct, or best, parameterization of a problem might be, i.e., what is the best model. If you have a good understanding of the uncertainties in the data (observational and/or intrinsic), you can try different models and rule some out, e.g., on the basis of χ² where that applies. However, this may not be useful if you have multiple models that are consistent with the data: χ² only has the capability to reject models. Note that one can consider comparisons for two different situations: nested models, in which the simpler models are subsets of the more complex ones, e.g., a polynomial model with increasing numbers of terms, and non-nested models, which have distinct functional forms. Comparison of nested models may be easier than comparing non-nested models.

A basic principle that is generally applied is one of simplicity, i.e., Occam's Razor: a less complex model that fits the data is preferred to a more complex model that fits the data equally well (but not necessarily preferred to a more complex model that fits the data better!). Practically speaking, this means preferring a model with fewer free parameters. In the presence of scatter in the data, one also does not want to allow a model so much flexibility that it fits the scatter as part of the model. This is known as overfitting. In contrast, underfitting does not allow enough flexibility in the model to fit the underlying trend.

Another way to think about this is to consider the bias and variance in model fitting. The least squares metric that we are minimizing is the sum of the squares of the residuals. It can be shown that:

RMS² = bias² + variance²

where the bias is the expectation value of the difference between model and data. Consider a case where you do a fit on a data set and consider its application to other sets of data drawn from the same distribution. For these you can calculate the mean offset of the data from the model, which is a measure of the bias, and the spread of these residuals, which is the variance. If you have a high bias, this suggests underfitting, while a low bias with high variance suggests overfitting. Fits to polynomials of different orders provide a good example. See modelcomparison.ipynb. Note how important it can be to understand sources of uncertainty.

8.0.1 Information Criteria

Some simple criteria can be used to compare models. The Akaike Information Criterion (AIC) compares models by their likelihood, but with an additional term that penalizes a model for the number of parameters:

AIC = −2 ln L_max + 2J

where J is the number of parameters. For small sample sizes, the AIC can prefer models with more parameters, so there is a correction for small samples:

AIC = −2 ln L_max + 2J + 2J(J+1)/(N − J − 1)

One would prefer the model with the smaller AIC. The Bayesian Information Criterion (BIC) is another similar metric for comparing models, with a different penalization for the number of parameters:

BIC = −2 ln L_max + J ln N
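A sketch comparing the AIC and BIC for polynomial fits of different order. For Gaussian errors with known σ, −2 ln L_max equals χ²_min up to a constant that cancels when comparing models, so χ² is used directly; the data set and range of orders are illustrative.

```python
import numpy as np

# AIC and BIC for least-squares polynomial fits with known Gaussian noise.
rng = np.random.default_rng(3)
sigma = 1.0
x = np.linspace(-1, 1, 40)
y = 1.0 + 2.0 * x + rng.normal(0, sigma, x.size)   # true model is linear

aics, bics = [], []
for deg in range(5):
    coeffs = np.polyfit(x, y, deg)
    chi2 = np.sum(((y - np.polyval(coeffs, x)) / sigma) ** 2)
    J = deg + 1                                    # number of parameters
    aics.append(chi2 + 2 * J)                      # -2 ln L_max + 2J, up to a constant
    bics.append(chi2 + J * np.log(x.size))         # -2 ln L_max + J ln N, same constant

print(int(np.argmin(aics)), int(np.argmin(bics)))  # lowest value wins
```

Note that higher orders always lower χ², but the penalty terms keep them from automatically winning.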

8.0.2 Cross validation

Cross validation testing: determine parameters from a subset of the data and apply the derived fit to another subset of the data. One way to apply this is to split the sample into 3 parts: a fit sample (≳50% of points), a cross-validation sample, and a test sample. To use these to choose the optimal number of parameters: calculate the rms for the cross-validation sample for various different models and choose the model that gives the minimum rms in the cross-validation sample. Check using the test sample. See Fig 8.14 in Ivesic et al. There are various ways to implement cross validation, e.g. leave-one-out, in which one does the cross-validation N times, leaving one point out each time and determining the training and cross-validation errors by averaging across the multiple samples, and similar variations with different subsample sizes.
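A leave-one-out sketch for choosing a polynomial order; the data set, noise level, and range of orders are illustrative.

```python
import numpy as np

# Leave-one-out cross-validation: fit on N-1 points, record the error on
# the held-out point, and average over all N choices of held-out point.
rng = np.random.default_rng(5)
x = np.linspace(-1, 1, 30)
y = 1.0 + 2.0 * x - 1.5 * x**2 + rng.normal(0, 0.3, x.size)  # quadratic truth

def loo_rms(deg):
    errs = []
    for i in range(x.size):
        mask = np.arange(x.size) != i
        c = np.polyfit(x[mask], y[mask], deg)        # fit without point i
        errs.append((y[i] - np.polyval(c, x[i])) ** 2)
    return np.sqrt(np.mean(errs))

rms = [loo_rms(d) for d in range(8)]
print(int(np.argmin(rms)))   # low orders underfit; high orders inflate the error
```

The cross-validation rms drops as underfitting is relieved and rises again as overfitting sets in, which is exactly the behavior shown in Fig 8.14 of Ivesic et al.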

8.0.3 Cross validation and learning curves

You can also use cross-validation to help decide whether more data is likely to improve the model, vs. needing to try a new model, if you have some idea of the expected uncertainties: if the rms from the cross-validation sample approaches that of the fit sample, then additional similar data is not likely to help. If the observed scatter is larger than expected, increase the model complexity. See Fig 8.15 in Ivesic et al

Understand the concepts of underfitting and overfitting. Know what cross- validation is and how to implement it. Know what the AIC and BIC are and how they are used.

8.1 Regularization

The presence of noise in data can lead to overfitting. While one might choose a simpler model, some circumstances may require a large number of parameters. Another way to avoid overfitting is to require some level of "smoothness" in the derived model. This can be done by adding constraints to the derived parameter set. For example, in the case of polynomial fitting, one can require that the parameters (of normalized variables) be kept small, so long as the data are still fit. Constraining parameters can be done with what is known as regularization. Here one adds a term to the likelihood that penalizes more complex models before solving for the parameters. In the case of noiseless data, this will not produce an unbiased, minimum variance model, but it can be more physical in the presence of noise. A simple method of regularization is called ridge regularization, or L2 regularization (note the definitions of the L2, Σ(y_i − f(x_i))², and L1, Σ|y_i − f(x_i)|, loss functions). This penalizes models on the basis of

Σ a_j²

so that we now minimize:

Σ (y_i − f(x_i|a_j))² + λ Σ a_j²

where λ is the regularization parameter that has to be chosen. For ridge regression, the solution to the least squares problem is simple: using the formalism above, it is given by

a = (AᵀA + λI)⁻¹ Aᵀy

where I is the identity matrix (ones on the diagonal). Note that, as a practical issue for regularization, one wants all of the parameters to have comparable amplitudes, so one generally works in standardized variables, i.e. an independent variable that has its mean subtracted and its amplitude scaled to the variation in the variable:

x_s = (x − x̄)/σ_x

See Bailer-Jones 12.2 and fit.ipynb

Another regularization implementation is called LASSO (least absolute shrinkage and selection operator), or L1, regularization. Here one penalizes models on the basis of

Σ |a_j|

This has the effect of preferring models where some parameters are eliminated entirely over reducing the amplitude of all parameters. See Ivesic Figure 8.3 for a graphical representation of ridge and LASSO regularization, and Ivesic Figure 8.4. Examples of the usage of least squares fitting and regularization: The Cannon
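A sketch of the closed-form ridge solution above, using a polynomial basis in the standardized variable; the data, basis order, and value of λ are illustrative.

```python
import numpy as np

# Ridge (L2) regression via a = (A^T A + lam*I)^{-1} A^T y for a
# polynomial design matrix in the standardized variable x_s.
rng = np.random.default_rng(7)
x = rng.uniform(-2, 2, 25)
y = np.sin(x) + rng.normal(0, 0.1, x.size)

xs = (x - x.mean()) / x.std()             # standardized variable
A = np.vander(xs, N=6, increasing=True)   # 5th-order polynomial basis

def ridge(A, y, lam):
    return np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ y)

a_ols = ridge(A, y, 0.0)   # lam = 0 recovers ordinary least squares
a_reg = ridge(A, y, 1.0)   # the penalty shrinks the coefficients
print(np.sum(a_ols**2), np.sum(a_reg**2))
```

The regularized coefficient vector always has a smaller L2 norm than the unregularized one, at the cost of a slightly larger residual: that is the bias introduced in exchange for lower variance.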

Understand the concept of regularization and why and when it might be used. Know what L2 and L1 regularization are.

8.2 Bayesian model selection

The Bayesian framework has a direct way to compare two different models for a given set of data, one that differs significantly from the frequentist concept of model selection. In the frequentist paradigm, one uses likelihoods to reject a model, independently of consideration of other models. In the Bayesian framework, you don't reject a model unless there is an alternative model as well: if there is only one model that you have, you cannot reject it.

See the interesting discussion in Bailer-Jones 10.6, in particular the discussion of falsification. Example: drawing aces from a deck of cards. In the Bayesian framework, we can compare the probability of different models. The basic Bayesian formulation is:

P(M|D) = P(D|M) P(M) / P(D)

To calculate P(D), we need to sum over all possible models. In some cases, this may be possible, e.g., if the "models" comprise two mutually exclusive hypotheses. We have seen this before in some basic probability problems, where the "models" were just two complementary hypotheses, e.g., whether a person has a disease or not. However, in the more general case, specifying all possible models may not be possible. We can still compare the probability of two different models, though, because P(D) cancels out. To compare two different classes of models, we calculate the posterior odds ratio:

R ≡ P(D|M₁) P(M₁) / [P(D|M₂) P(M₂)]

where P(M₁) and P(M₂) are the prior probabilities of the different models. If these are equal, then the odds ratio is given by the Bayes factor:

BF ≡ P(D|M₁) / P(D|M₂)

When we talk about comparison of models, recognize that a given model can have a set of parameters that might take on different values; we have been discussing how to get the posterior PDF for such parameters. We want to compare the ability of different models, possibly with different numbers and characters of parameters, to fit a given data set; for example, we previously considered polynomial fits to data with different order polynomials. For a model with parameters, we have

P(θ|M, D) = P(D|θ, M) P(θ|M) / P(D|M)

The denominator P(D|M) is called the evidence, and this is what enters into the odds ratio. We have previously ignored it by just normalizing the posterior PDF. To compare models, we want to calculate it by integrating over all possible choices of parameters:

P(D|M) = ∫ P(D|θ, M) P(θ|M) dθ

Given this definition, the evidence is sometimes called the marginal likelihood or the global likelihood. It is the integral of the likelihood over all possible parameters, weighted by the prior on the parameters. To calculate the evidence, we must use normalized likelihoods and priors. Given the calculation of an odds ratio, one may be able to prefer one model over another. Of course, there is some judgement involved about what odds ratio represents a significant preference for one model over another: typically, people do not claim a "strong" discrimination unless the odds ratio is greater than 10 (or less than 0.1), with a "decisive" discrimination when the odds ratio is greater than 100.

One nice thing about using the Bayesian odds ratio is that it naturally accounts for the possibility that the models have different levels of complexity, e.g., different numbers of parameters. For models with more parameters, the maximum likelihood of the best fitting model will be larger than the maximum likelihood of the best fitting model with fewer parameters; however, the marginal likelihood will not necessarily be larger, since one has to integrate the likelihood over all sets of parameters, weighted by the prior on the parameters. This naturally penalizes models with more parameters if the prior is sufficiently broad. On the other hand, this also means that the choice of prior can significantly affect the odds ratio.

Example: is a coin fair (Bailer-Jones 11.2)? Given the occurrence of r heads (or tails) in a sample of n coin tosses, how can we assess whether the coin is fair? Consider the comparison of two models: one (M1) with a fair coin (p = 0.5) and one (M2) with an unfair coin with unknown probability p of getting heads in a single toss. In both cases, the number of heads is given by the binomial distribution, but in M1, p is fixed at 0.5, while in M2 we integrate over all choices of p given some prior distribution:

P(D|M) = P(r|n, M) = ∫ [n!/(r!(n−r)!)] p^r (1−p)^(n−r) P(p|M) dp

For M1, the prior is a delta function at p = 0.5, while for M2 we adopt a uniform prior. Doing the integrals, you get

P(D|M₁) = [n!/(r!(n−r)!)] 0.5^r 0.5^(n−r)

i.e., the binomial probability with p = 0.5, P_bin(r|0.5, n). For M2,

P(D|M₂) = [n!/(r!(n−r)!)] × [r!(n−r)!/(n+1)!] = 1/(n+1)

(work not shown, see B-J!), so the Bayes factor is

B = (n + 1) P_bin(r|0.5, n)
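The Bayes factor just derived is easy to evaluate directly; the particular (r, n) values below are just a sketch.

```python
import math

# Bayes factor for the coin comparison: M1 is a fair coin (delta prior at
# p = 0.5), M2 has a uniform prior on p; B = (n + 1) * P_bin(r | 0.5, n).
def bayes_factor(r, n):
    pbin = math.comb(n, r) * 0.5 ** n   # binomial probability with p = 0.5
    return (n + 1) * pbin

print(bayes_factor(78, 156))   # ~10 even with exactly half heads
print(bayes_factor(1, 10))     # < 1: the lopsided outcome favors M2
```

Note how slowly the evidence for fairness accumulates: a perfectly even outcome only reaches B = 10 at n = 156.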

For a uniform prior, the Bayes factor for different r and n is given in Bailer-Jones 11.1. Note that you'd need 156 tosses, even with getting exactly half heads, to prefer M1 by a factor of 10!

Example: is a linear fit better than a constant (Bailer-Jones 11.3)? Consider the data set shown in Bailer-Jones 11.2. How can we determine whether the data warrant a linear slope (or higher order) as compared with just a constant? Compare two models: a constant, or a linear slope. To do the comparison, one needs to choose priors for the parameters. For the intercept, BJ chooses a Gaussian centered on 0 with a standard deviation of 1. For the slope (for M2), he chooses a prior uniform in α, where slope = tan α. He also fits for the scatter in the points, using a uniform prior between log 0.5 and log 2 (note that it is critical to use a proper prior to be able to calculate the evidence!). See Bailer-Jones 11.3 for representations of the classes of models. Calculating the evidence, BJ gets

log P (D|M1) = −8.33

log P (D|M2) = −8.44

log B12 = 0.11

B12 = 1.3 i.e., no strong discrimination of models. Repeated tests with different samples of this size give different values, but always no strong discrimination. Modifying priors does not have too big of an effect in most cases, but if σ is forced to be small, then it does. On the other hand, a larger data set of 50 points does discriminate:

log P (D|M1) = −33.87

log P (D|M2) = −29.37

log B12 = −4.50

B12 = 3.15e−5

Limitations of Bayesian model comparison: it can depend significantly on the choice of prior, so one may need to consider the sensitivity of the result to different choices of prior. Also, it may be hard to compute the evidence. In difficult situations, one can fall back on the AIC, BIC, and/or cross validation, as discussed previously.

Understand how Bayesian model comparison works. Know what the odds ratio and Bayes factors are and how to calculate them. Understand what the evidence is.

8.3 Bayesian implementation of regularization

9 Nonparametric models

Sometimes we may want to characterize the relationship between quantities without specifying a mathematical relation with parameters, e.g. we just want some table of values that "smoothly" passes through a series of points. The simplest example of this is just a histogram, giving the frequency of observing some given quantity. However, when you make a histogram you have to make a choice of bin sizes and locations, and these choices can affect how you might interpret your results; see, e.g. Bailer-Jones 7.1. Typically, one probably wants to choose a bin size that is some fraction of the "scale" of the distribution, and one choice for this is given by "Scott's rule":

∆ = 3.5σ / N^(1/3)

However, this can still have sensitivity to bin placement (see, e.g. Bailer-Jones 7.2), and there could also be a varying "scale" across the distribution. For some additional information on histogram parameters, and access to some code for easily applying these, see astroML. One method implemented there is that of Bayesian blocks; see Scargle et al 2013 and here.

Going beyond histograms leads to the concept of kernel density estimation. The basic idea is that each point contributes over a range of values as specified by the kernel function. A simple kernel might be a rectangular box, while a smoother kernel might be a Gaussian. Alternatively, one might choose not a fixed width kernel, but one that adapts to the number of points near a given value, e.g. a "k-nearest neighbors kernel". See Bailer-Jones 7.3.

The same ideas can be applied to a set of data relating one quantity to another. The histogram analogy takes average values within bins. The kernel-density analogy takes a running mean across bins of fixed width. A k-nearest neighbors approach takes a running mean across a fixed number of points.

With nonparametric characterization, we cannot evaluate the quality of the characterization by comparing the observed distribution with a model distribution.
We can, however, compare two different sets of data to determine whether they are consistent with the hypothesis of being drawn from the same distribution. The most popular test to do this is the Kolmogorov-Smirnov (K-S) test. This works by comparing the cumulative distribution functions of the two data sets, from which one determines the maximum deviation between the two CDFs. It turns out that it is possible to derive the probability of obtaining this maximum difference under the null hypothesis that the two distributions are the same. This probability depends on the number of data points used to make the distributions: if the number of points is small, a larger deviation between the CDFs will still be consistent with the null hypothesis. An approximation for a relatively large number of data points is:

D_KS = C(α) / √n_e

where n_e is the "effective" number of data points:

n_e = N₁N₂ / (N₁ + N₂)

and C(α) is a value that depends on the desired rejection level α, e.g. C(0.05) = 1.36 and C(0.01) = 1.63. The K-S test can also be used to compare a distribution with some proposed reference distribution function in a "one-sample" K-S test. For Python implementations, see scipy.stats.ks_2samp and scipy.stats.kstest. Example: galaxy luminosity functions.
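A sketch of the two-sample K-S statistic and the large-sample threshold above; in practice one would use scipy.stats.ks_2samp, which also returns the p-value. The distributions and sample sizes here are illustrative.

```python
import numpy as np

# Two-sample K-S statistic: the maximum distance between the two empirical
# CDFs, compared against the large-sample threshold C(alpha)/sqrt(n_e).
def ks_2sample(d1, d2):
    d1, d2 = np.sort(d1), np.sort(d2)
    allv = np.concatenate([d1, d2])
    # empirical CDFs of both samples, evaluated at every data point
    cdf1 = np.searchsorted(d1, allv, side="right") / d1.size
    cdf2 = np.searchsorted(d2, allv, side="right") / d2.size
    return np.max(np.abs(cdf1 - cdf2))

rng = np.random.default_rng(11)
a = rng.normal(0.0, 1.0, 500)
b = rng.normal(0.0, 1.0, 400)   # drawn from the same distribution as a
c = rng.normal(0.5, 1.0, 400)   # shifted distribution

ne = 500 * 400 / (500 + 400)    # effective number of points
thresh = 1.36 / np.sqrt(ne)     # C(0.05) = 1.36
print(ks_2sample(a, b), ks_2sample(a, c), thresh)
```

The shifted sample exceeds the 5% threshold while the matched samples typically do not, illustrating how the rejection level scales with n_e.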

Understand the issues with using histograms. Know what kernel density estima- tion is and how to use it. Understand how to compare two distributions using a K-S test.