Machine Learning

Quantitative Analyst Program
Session 2, Oct. 14th, 2019
Roy Chen Zhang
In collaboration with McGill Investment Club and Desautels Faculty of Management
Session Content at [roychenzhang.com/teaching]

Outline
1 Another Brief Case Study
  1.1 "John Ioannidis
  1.2 ...and the Cross-Section of Expected Returns"
2 Statistical Foundations
  2.1 Sampling
  2.2 Moments and Cumulants
3 Distributions and Correlations
  3.1 Distributions
  3.2 Correlations
4 Testing and Validation
  4.1 Hypothesis Testing
  4.2 The Mistake Everyone Made
5 Standardization and Normalization
  5.1 Normalization
  5.2 Standardization

QAP 2019-2020 Roy Chen Zhang (roychenzhang.com)

Section 1: Another Brief Case Study

Another Brief Case Study: 1
John Ioannidis

In 2005, physician John Ioannidis, then a professor at the University of Ioannina Medical School, made an outrageous assertion: that most research findings in all of science were, in fact, false.

His paper, titled "Why Most Published Research Findings Are False", quickly gained traction in the scientific community, soon becoming the most downloaded paper in the Public Library of Science.

This marked the beginning of a field called 'meta-research', or research on research itself.

Questions:
• Why might researchers be motivated to fake results?
• What do you think he noticed?

Another Brief Case Study: 2
...and the Cross-Section of Expected Returns

Academic finance's own epiphany in meta-research came in the form of a 2015 paper, titled '...and the Cross-Section of Expected Returns'.

In their paper, Campbell R. Harvey, Yan Liu, and Heqing Zhu sum up the nature and implications of their findings in one short sentence:

"We argue that most claimed research findings in financial economics are likely false."
Questions:
• Who are the consumers of financial research? (Think about the players from last time.)
• Who might be most affected by this finding? Why?

Another Brief Case Study: 3
Finding the Truth

• In reality, what thousands of scholars had failed to account for over decades was a fundamental statistical detail: the multiple testing problem.
• The more tests one performs on a particular set of data, the more likely it is that a false positive occurs at least once. For example, across 20 independent tests at the 5% significance level, the chance of at least one false positive is 1 − 0.95^20 ≈ 64%.
• Hence, using the same significance level across an arbitrary number of tests makes the results look much rosier than reality.

Questions:
• Can you think of a real-world example of multiple testing?
• What defenses might one have against bad statistics?

Key Takeaways

What can we learn from this? As always, it comes down to your frame of reference:

The Academic Perspective
The only way to be confident in a finding is to arrive at the same result in several different ways. Much more support is needed for replication studies in research.

The Industry Perspective
Research may not be completely reliable, or may present conflicting conclusions. Apply research findings with discretion.

The Student Perspective
Statistics is critical in finance! It's important to learn how to accurately interpret data and draw conclusions.

Section 2: Statistical Foundations

Sampling
Not just an amuse-bouche

The first thing to keep in mind is that, most of the time, one cannot observe the entire population at once: there will always be exclusions.
Hence, we have to assume that our sample is representative of the real-world population.

Common Sampling Methods:
• Random Sampling - By Chance
• Systematic Sampling - By Order
• Stratified Sampling - By Types
• Cluster Sampling - By Groups

It's important to sample objectively and evenly; otherwise, your results may be skewed by the sample itself.

Moments
Like | Comment

Moments are a way to characterize the shape of a particular set of data along a horizontal axis. The general formulation for the k-th raw sample moment is given by:

$$\frac{1}{n}\sum_{i=1}^{n} X_i^{k} \tag{1}$$

This raw moment forms the basis for many more advanced metrics. Next we will discuss how to create more useful things from simple raw moments.

Question to Consider: Does a 0-th moment exist?

Cumulants
A word you didn't know you needed

By transforming raw moments, we obtain cumulants that are perhaps more familiar to our eyes:

1st Cumulant/1st Moment

$$\frac{1}{n}\sum_{i=1}^{n} X_i = \mu \tag{2}$$

2nd Cumulant/2nd Central Moment

$$\frac{1}{n}\sum_{i=1}^{n} (X_i - \mu)^2 = \sigma^2 \tag{3}$$

Are these known by other names?

The Answers:
Real winners don't scroll ahead to look at this slide

1st Cumulant: Mean

$$\frac{1}{n}\sum_{i=1}^{n} X_i = \mu \tag{4}$$

The mean ($\mu$) is the arithmetic average of all of the data in a particular sample. This is also commonly known as the average.

2nd Cumulant: Variance

$$\frac{1}{n}\sum_{i=1}^{n} (X_i - \mu)^2 = \sigma^2 \tag{5}$$

The variance ($\sigma^2$) is a measure of the amount of variation within a dataset. Its square root, the standard deviation ($\sigma$), is another very useful metric because it is expressed in the same units as the data.
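The raw moment, mean and variance formulas above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names and the small dataset are made up for the example, and the variance uses the 1/n convention from the slides rather than the 1/(n-1) sample correction.

```python
from typing import Sequence


def raw_moment(xs: Sequence[float], k: int) -> float:
    """k-th raw sample moment: (1/n) * sum of x_i ** k."""
    return sum(x ** k for x in xs) / len(xs)


def mean(xs: Sequence[float]) -> float:
    """The mean is simply the 1st raw moment."""
    return raw_moment(xs, 1)


def variance(xs: Sequence[float]) -> float:
    """2nd central moment with the 1/n convention used in the slides."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


# Hypothetical data, chosen only so the arithmetic is easy to follow.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

m = mean(data)        # 5.0
var = variance(data)  # 4.0
std = var ** 0.5      # 2.0

# The variance can also be built purely from raw moments:
# variance = (2nd raw moment) - (1st raw moment) ** 2
var_from_raw = raw_moment(data, 2) - raw_moment(data, 1) ** 2

print(m, var, std, var_from_raw)
```

The last line of the computation illustrates what "transforming raw moments" means in practice: the variance is recovered from the first two raw moments alone.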
Cumulants Part 2: Statistic Boogaloo

In the formulas below, $\mu$ is the mean and $\sigma$ is the standard deviation.

3rd Standardized Cumulant/3rd Standardized Moment

$$\frac{1}{n}\sum_{i=1}^{n} \left(\frac{X_i - \mu}{\sigma}\right)^3 = \gamma \tag{6}$$

4th Standardized Cumulant/4th Standardized Moment

$$\frac{1}{n}\sum_{i=1}^{n} \left(\frac{X_i - \mu}{\sigma}\right)^4 - 3 = \kappa - 3 \tag{7}$$

or:

$$\frac{1}{n}\sum_{i=1}^{n} \left(\frac{X_i - \mu}{\sigma}\right)^4 = \kappa \tag{8}$$

What about these cumulants?

Cumulants Part 2: The Answers

These are perhaps a little more obscure:

Skewness

$$\frac{1}{n}\sum_{i=1}^{n} \left(\frac{X_i - \mu}{\sigma}\right)^3 = \gamma \tag{9}$$

The skewness is a measure of the degree of asymmetry in the distribution. A positive skewness means that the right tail is longer: values above the mean are more extreme than values below it, and vice versa.

Cumulants Part 2: The Answers

4th Standardized Cumulant: Kurtosis

$$\frac{1}{n}\sum_{i=1}^{n} \left(\frac{X_i - \mu}{\sigma}\right)^4 - 3 = \kappa - 3 \tag{10}$$

or:

$$\frac{1}{n}\sum_{i=1}^{n} \left(\frac{X_i - \mu}{\sigma}\right)^4 = \kappa \tag{11}$$

The kurtosis is a measure of how heavy the tails of a distribution are, not of its central tendency. A higher kurtosis implies that more of the variation comes from rare, extreme values. Equation (10), with 3 subtracted, is the excess kurtosis, which is zero for a normal distribution; equation (11) is the plain kurtosis, which equals 3 for a normal distribution.

Section 3: Distributions and Correlations

Probability Distributions
What are they?

Probability distributions describe how the values of random variables are spread.

For Example: The number of heads in a sequence of coin tosses (with equal odds of heads or tails) follows the binomial distribution.

Many assume that the means of large samples from a population follow a normal distribution, but that is not always the case!

The features of these distributions are very well known and can be used to draw inferences about the population.
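The coin-toss example above can be simulated to try out the skewness and kurtosis formulas from earlier in this section. The sketch below is illustrative only: the function names, the seed, and the choice of 20 tosses repeated 10,000 times are assumptions made for the example, not part of the original session.

```python
import random
from typing import Sequence


def standardized_moment(xs: Sequence[float], k: int) -> float:
    """(1/n) * sum of ((x_i - mean) / std) ** k, using 1/n throughout."""
    n = len(xs)
    m = sum(xs) / n
    s = (sum((x - m) ** 2 for x in xs) / n) ** 0.5
    return sum(((x - m) / s) ** k for x in xs) / n


def skewness(xs: Sequence[float]) -> float:
    """3rd standardized moment."""
    return standardized_moment(xs, 3)


def kurtosis(xs: Sequence[float]) -> float:
    """4th standardized moment (equals 3 for a normal distribution)."""
    return standardized_moment(xs, 4)


def excess_kurtosis(xs: Sequence[float]) -> float:
    """Kurtosis minus 3, so that a normal distribution scores 0."""
    return kurtosis(xs) - 3.0


# Count heads in 20 fair coin tosses, repeated 10,000 times.
rng = random.Random(0)  # fixed seed so the run is repeatable
heads = [sum(rng.random() < 0.5 for _ in range(20)) for _ in range(10_000)]

# A fair coin is symmetric, so the skewness should come out near zero.
print("skewness:", skewness(heads))
print("excess kurtosis:", excess_kurtosis(heads))

# A lopsided dataset for contrast: one large value stretches the right tail.
lopsided = [1.0, 1.0, 1.0, 1.0, 10.0]
print("skewness of lopsided data:", skewness(lopsided))
```

The lopsided dataset gives a positive skewness, since its single large value sits far above the mean while nothing sits equally far below it.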
Normal Distributions
The Grandfather of them all

The normal distribution is the most widely used probability distribution in the financial industry. It is a bell-shaped curve with identical mean, median and mode.

Standard notation is $N(\mu, \sigma^2)$, with $\mu$ denoting the mean and $\sigma^2$ denoting the variance of the distribution.

Log-Normal Distributions
Remember to Log Out!

The Log-Normal distribution is another extremely common distribution in finance, especially in time-series analysis and in modeling prices (which usually cannot be negative). It can be thought of as $\mathrm{LNormal}(\mu, \sigma^2) = e^{N(\mu, \sigma^2)}$, where $N$ is again the normal distribution.

The term "Log-Normal" comes from the fact that $\ln(\mathrm{LNormal}(\mu, \sigma^2)) = N(\mu, \sigma^2)$: the log of this distribution is normal.

Poisson Distributions
Seems a bit 'Fishy'...

Unlike the previous distributions, the Poisson distribution gives the probability of a given number of events occurring in a fixed interval of time, provided these events occur at a constant average rate and independently of prior events.

Since the count x can only take non-negative integer values, this is a discrete distribution.

Examples may include:
• Solar flares that can be observed from Earth in a year
• Customers ordering at Quesada between 10 and 11 am
• Bottles of diet coke drunk by a certain professor (allegedly)

Weibull Distributions
Pronounced 'Way-Bull'

In a similar vein, the Weibull distribution is also used to model the time until an event occurs, and is commonly used to model failure rates (such as that of a manufactured automobile part) over time.
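The distributions above can be explored numerically with Python's standard library. This is a rough sketch under stated assumptions: the parameter values (`mu`, `sigma`, the Poisson rate, the Weibull scale and shape) are hypothetical, and `poisson_pmf` is a helper written for this example using the standard Poisson probability formula, which the slides do not state.

```python
import math
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable

# Log-normal: exponentiating a normal draw gives strictly positive values,
# and taking the log of the samples recovers the normal parameters.
mu, sigma = 0.05, 0.2  # hypothetical parameters of the underlying normal
prices = [rng.lognormvariate(mu, sigma) for _ in range(20_000)]
log_prices = [math.log(p) for p in prices]
log_mean = statistics.fmean(log_prices)  # should be close to mu
log_std = statistics.pstdev(log_prices)  # should be close to sigma


def poisson_pmf(k: int, rate: float) -> float:
    """P(X = k) for a Poisson count with the given average rate."""
    return rate ** k * math.exp(-rate) / math.factorial(k)


# Poisson: probabilities over the counts 0, 1, 2, ... add up to 1.
rate = 4.0  # hypothetical: four customers per hour on average
total_probability = sum(poisson_pmf(k, rate) for k in range(60))

# Weibull: random.weibullvariate(alpha, beta) takes the scale first,
# then the shape. Here the samples stand in for part lifetimes in hours.
scale, shape = 1000.0, 1.5  # hypothetical failure-time parameters
lifetimes = [rng.weibullvariate(scale, shape) for _ in range(20_000)]
mean_lifetime = statistics.fmean(lifetimes)

print(min(prices) > 0, log_mean, log_std)
print(total_probability)
print(mean_lifetime)
```

Sampling with a fixed seed keeps the example repeatable; with a different seed the estimates would shift slightly but should stay close to the parameters used to generate them.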
