Quantitative Analyst Program

Session 2, Oct. 14th, 2019

Roy Chen Zhang

In collaboration with McGill Investment Club and Desautels Faculty of Management
Session Content at [roychenzhang.com/teaching]

Outline

1 Another Brief Case Study
1.1 "John Ioannidis
1.2 ...and the Cross-Section of Expected Returns"
2 Statistical Foundations
2.1 Sampling
2.2 Moments and Cumulants
3 Distributions and Correlations
3.1 Distributions
3.2 Correlations
4 Testing and Validation
4.1 Hypothesis Testing
4.2 The Mistake Everyone Made
5 Standardization and Normalization
5.1 Normalization
5.2 Standardization

QAP 2019-2020 Roy Chen Zhang (roychenzhang.com)

Section 1:

Another Brief Case Study

Another Brief Case Study: 1 John Ioannidis

In 2005, physician John Ioannidis, then a professor at the University of Ioannina Medical School, made an outrageous assertion: that most research findings in all of science were, in fact, false. His paper, titled "Why Most Published Research Findings Are False", quickly gained traction in the scientific community, soon becoming the most downloaded paper in the Public Library of Science. This marked the beginning of a field called ’meta-research’, or research on research itself.

Questions:

• Why might researchers be motivated to fake results?
• What do you think he noticed?

A Brief Case Study: 2 ...and the Cross-Section of Expected Returns

Finance academia’s own epiphany in meta-research came in the form of a 2015 paper titled ’...and the Cross-Section of Expected Returns’. In it, Campbell R. Harvey, Yan Liu, and Heqing Zhu sum up the nature and implications of their findings in one short sentence:

"We argue that most claimed research findings in financial economics are likely false."

Questions:

• Who are the consumers of financial research? (Think about the players from last time)
• Who might be most affected by this finding? Why?

A Brief Case Study: 3 Finding the Truth

• What thousands of scholars had failed to notice for decades was a fundamental statistical detail: the multiple testing problem.
• The more tests one performs on a particular set of data, the more likely it is that a false positive happens at least once.
• Hence, using the same significance level across an arbitrary number of tests makes the numbers look much rosier than reality.
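As a worked illustration of the second point, assume (this is an assumption, not something the slides state) that the m tests are independent and that each uses the same significance level α. The chance of at least one false positive is then:

```latex
P(\text{at least one false positive}) = 1 - (1 - \alpha)^{m}
% Example with \alpha = 0.05 and m = 20 independent tests:
% 1 - 0.95^{20} \approx 0.64
```

So with 20 tests at the usual 5% level, a spurious "finding" is more likely than not.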

Questions:

• Can you think of a real-world example of multiple testing?
• What defenses might one have against bad ?

Key Takeaways
What can we learn from this?

As always, it comes down to your frame of reference:

The Academic Perspective
The only way to be certain of a finding is to arrive at the same result in several different ways. Much more support is needed for studies in research.

The Industry Perspective
Research may not be completely reliable, or may present conflicting conclusions. Apply research findings with discretion.

The Student Perspective
Statistics are critical in finance! It’s important to learn how to accurately interpret data and draw conclusions.

Section 2:

Statistical Foundations

Sampling
Not just an amuse-bouche

The first thing to keep in mind is that most of the time one cannot observe the entire population at the same time: there will always be exclusions. Hence we have to assume that our sample is representative of the real-world population.

Common Sampling Methods:

• Random Sampling - By Chance
• Systematic Sampling - By Order
• Stratified Sampling - By Types
• Cluster Sampling - By Groups

It’s important to sample objectively and evenly; otherwise your results may be skewed by the sample itself.
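The four sampling methods can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the population of 100 stocks, the sector labels, and all function names are hypothetical and not from the session material.

```python
import random

def random_sample(population, n, rng):
    """Random sampling: every element has the same chance of selection."""
    return rng.sample(population, n)

def systematic_sample(population, step, start=0):
    """Systematic sampling: take every `step`-th element, in order."""
    return population[start::step]

def stratified_sample(population, key, n_per_stratum, rng):
    """Stratified sampling: draw the same number from every type (stratum)."""
    strata = {}
    for item in population:
        strata.setdefault(key(item), []).append(item)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, min(n_per_stratum, len(members))))
    return sample

def cluster_sample(population, key, n_clusters, rng):
    """Cluster sampling: pick whole groups at random, keep all their members."""
    clusters = {}
    for item in population:
        clusters.setdefault(key(item), []).append(item)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [item for c in chosen for item in clusters[c]]

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: 100 stocks labelled (id, sector).
sectors = ["Tech", "Energy", "Banks", "Retail"]
stocks = [(i, sectors[i % 4]) for i in range(100)]

simple = random_sample(stocks, 8, rng)
systematic = systematic_sample(stocks, step=10)
stratified = stratified_sample(stocks, key=lambda s: s[1], n_per_stratum=2, rng=rng)
clustered = cluster_sample(stocks, key=lambda s: s[1], n_clusters=2, rng=rng)
```

Note how the stratified sample is guaranteed to contain every sector, while the cluster sample contains only two sectors but every stock within them.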

Moments
Like | Comment

Moments are a way to characterize the shape of a particular set of data along a horizontal axis. The general formulation for the k-th raw sample moment is given by:

$$\frac{1}{n}\sum_{i=1}^{n} X_i^{k} \qquad (1)$$

This raw moment forms the basis for many more advanced metrics. Next we will discuss how to create more useful things from simple raw moments.

Question to Consider: Does a 0-th moment exist?
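A minimal sketch of the raw-moment formula in Python, using made-up return data (the numbers and the function name `raw_moment` are illustrative assumptions). It also shows one way a "more useful thing" can be built from raw moments: the variance equals the second raw moment minus the square of the first.

```python
def raw_moment(xs, k):
    """k-th raw sample moment: (1/n) * sum of x_i ** k."""
    return sum(x ** k for x in xs) / len(xs)

returns = [0.01, -0.02, 0.03, 0.00, -0.01]  # made-up data

m1 = raw_moment(returns, 1)          # first raw moment
m2 = raw_moment(returns, 2)          # second raw moment
variance_from_moments = m2 - m1 ** 2  # built purely from raw moments
```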

Cumulants
A word you didn’t know you needed

By transforming raw moments, we can obtain cumulants that are perhaps more familiar to our eyes:

1st Cumulant/1st Moment

$$\frac{1}{n}\sum_{i=1}^{n} X_i = \mu \qquad (2)$$

2nd Cumulant/2nd Central Moment

$$\frac{1}{n}\sum_{i=1}^{n} (X_i - \mu)^{2} = \sigma^{2} \qquad (3)$$

Are these known by other names?

The Answers:
Real winners don’t scroll ahead to look at this slide

1st Cumulant:

$$\frac{1}{n}\sum_{i=1}^{n} X_i = \mu \qquad (4)$$

The mean (µ) is the arithmetic average of all of the data in a particular sample. This is also commonly known as the average.

2nd Cumulant:

$$\frac{1}{n}\sum_{i=1}^{n} (X_i - \mu)^{2} = \sigma^{2} \qquad (5)$$

The variance (σ²) is a measure of the amount of variation within a dataset. The standard deviation (σ), its square root, is another very useful metric.
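As a quick check of the two formulas above, here is a minimal sketch on a made-up dataset. It uses the 1/n (population) form written on the slide; Python's `statistics.pvariance` and `statistics.pstdev` use the same convention, whereas `statistics.variance` divides by n - 1 instead.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up sample

n = len(data)
mu = sum(data) / n                          # mean, as in the 1st cumulant
var = sum((x - mu) ** 2 for x in data) / n  # variance, 1/n form
sigma = math.sqrt(var)                      # standard deviation

# The standard library agrees with the hand-rolled versions.
same_mean = math.isclose(mu, statistics.fmean(data))
same_var = math.isclose(var, statistics.pvariance(data))
same_sigma = math.isclose(sigma, statistics.pstdev(data))
```

For this dataset the mean is 5, the variance is 4, and the standard deviation is 2.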

Cumulants Part 2:
Boogaloo

3rd Cumulant/3rd Standardized Moment

$$\frac{1}{n}\sum_{i=1}^{n}\left(\frac{X_i - \mu}{\sigma}\right)^{3} = \gamma \qquad (6)$$

4th Cumulant/4th Standardized Moment

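$$\frac{1}{n}\sum_{i=1}^{n}\left(\frac{X_i - \mu}{\sigma}\right)^{4} - 3 = \kappa - 3 \qquad (7)$$

or:

$$\frac{1}{n}\sum_{i=1}^{n}\left(\frac{X_i - \mu}{\sigma}\right)^{4} = \kappa \qquad (8)$$

What about these cumulants?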

Cumulants Part 2: The Answers

These are perhaps a little more obscure:

3rd Cumulant:

$$\frac{1}{n}\sum_{i=1}^{n}\left(\frac{X_i - \mu}{\sigma}\right)^{3} = \gamma \qquad (9)$$

The skewness (γ) is a measure of the degree of asymmetry in the distribution. A positive skewness means that deviations above the mean are more extreme than those below it (a longer right tail), and vice-versa.

Cumulants Part 2: The Answers

4th Cumulant:

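$$\frac{1}{n}\sum_{i=1}^{n}\left(\frac{X_i - \mu}{\sigma}\right)^{4} - 3 = \kappa - 3 \qquad (10)$$

or:

$$\frac{1}{n}\sum_{i=1}^{n}\left(\frac{X_i - \mu}{\sigma}\right)^{4} = \kappa \qquad (11)$$

The kurtosis (κ) is a measure of the tail weight of a set of data. A higher kurtosis implies that more of the variation comes from rare, extreme deviations; this often goes hand in hand with a sharper central peak. The version with 3 subtracted is the excess kurtosis, which is 0 for a normal distribution.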
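Both skewness and kurtosis are standardized moments, so a single helper covers them. The sketch below is illustrative only: the function name and the two tiny datasets are assumptions chosen so the results can be checked by hand.

```python
import math

def standardized_moment(xs, k):
    """k-th standardized moment: mean of ((x - mu) / sigma) ** k, 1/n form."""
    n = len(xs)
    mu = sum(xs) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in xs) / n)
    return sum(((x - mu) / sigma) ** k for x in xs) / n

symmetric = [-2, -1, 0, 1, 2]     # mirror image around its mean
right_tailed = [0, 0, 0, 0, 10]   # one large positive outlier

skew_sym = standardized_moment(symmetric, 3)       # no asymmetry
skew_right = standardized_moment(right_tailed, 3)  # long right tail
kurt_sym = standardized_moment(symmetric, 4)       # kappa
excess_kurt_sym = kurt_sym - 3                     # kappa - 3
```

The symmetric dataset has zero skewness, while the single large outlier gives the second dataset a skewness of 1.5.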

Section 3:

Distributions and Correlations

Probability Distributions
What are they?

Probability distributions describe how the values of random variables are spread.

For Example: The number of heads in a sequence of coin tosses (with equal odds of heads or tails) follows the binomial distribution.

Many assume that large samples of the population follow a normal distribution, but that is not necessarily the case! The features of these distributions are very well known and can be used to draw inferences about the population.
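The coin-tossing example can be made concrete with the binomial probability mass function. This is a minimal sketch; the choice of 10 tosses is an arbitrary assumption.

```python
from math import comb

def binomial_pmf(k, n, p=0.5):
    """Probability of exactly k heads in n tosses with P(heads) = p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

n_tosses = 10
pmf = [binomial_pmf(k, n_tosses) for k in range(n_tosses + 1)]

p_five_heads = pmf[5]   # the single most likely outcome
total = sum(pmf)        # probabilities over all outcomes sum to 1
```

Even the most likely outcome, exactly 5 heads, happens only about a quarter of the time (252/1024).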

Normal Distributions
The Grandfather of them all

The normal distribution is the most widely used in the financial industry. It is a bell-shaped curve with identical mean, median, and mode. Standard notation is N(µ, σ²), with µ denoting the mean and σ² denoting the variance.

Log-Normal Distributions
Remember to Log Out!

The Log-Normal distribution is another extremely common distribution in finance, especially in time-series analysis and in modeling prices (which usually cannot be negative). It can be thought of as LNormal(µ, σ²) = e^{N(µ, σ²)}, where N is again the Normal Distribution. The term "Log-Normal" comes from the fact that ln(LNormal(µ, σ²)) = N(µ, σ²), or that the log of this distribution is Normal.
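The relationship between the two distributions can be checked by simulation. In this minimal sketch the parameters (µ = 0, σ = 0.25), the seed, and the sample size are arbitrary assumptions.

```python
import math
import random
import statistics

rng = random.Random(42)   # fixed seed for reproducibility
mu, sigma = 0.0, 0.25     # hypothetical parameters

normal_draws = [rng.gauss(mu, sigma) for _ in range(10_000)]
prices = [math.exp(z) for z in normal_draws]   # log-normal: e ** Normal
log_prices = [math.log(p) for p in prices]     # taking logs recovers the Normal

all_positive = all(p > 0 for p in prices)
log_mean = statistics.fmean(log_prices)
log_sd = statistics.pstdev(log_prices)
```

Every simulated price is positive, and the logs have mean and standard deviation close to the chosen µ and σ.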

Poisson Distributions
Seems a bit ’Fishy’...

Unlike the previous distributions, the Poisson distribution gives the probability of a number of events occurring over a certain time if these events occur at a constant rate and independently of prior events. Since the count x can only be a non-negative integer, this is a discrete distribution. Examples may include:

• Solar flares that can be observed from Earth in a year
• Customers ordering at Quesada between 10 and 11 am
• Bottles of diet coke drunk by a certain professor (allegedly)
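A minimal sketch of the Poisson probability mass function, applied to the customer example. The average rate of 3 customers per hour is a made-up assumption.

```python
import math

def poisson_pmf(k, lam):
    """Probability of exactly k events when the average count is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

lam = 3.0  # hypothetical: 3 customers per hour on average

p_zero = poisson_pmf(0, lam)                               # an empty hour
p_at_most_5 = sum(poisson_pmf(k, lam) for k in range(6))   # 0 to 5 customers
```

With this rate, an hour with no customers at all happens about 5% of the time.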

Weibull Distributions
Pronounced ’Way-Bull’

In a similar vein, the Weibull distribution is also used to model time with respect to independent occurrences, but it is commonly used to model time until failure (such as that of a manufactured automobile part). As time is positive, but not an integer, this distribution is continuous. Contrast this with the Poisson, which is discrete.

Pop Quiz: Why might this distribution be useful in Finance applications?
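To make the failure-rate reading concrete, here is a minimal sketch of the Weibull survival and hazard functions in the common shape/scale parameterization. The parameter values are arbitrary assumptions.

```python
import math

def weibull_survival(t, shape, scale):
    """P(T > t): probability the part is still working at time t."""
    return math.exp(-((t / scale) ** shape))

def weibull_hazard(t, shape, scale):
    """Instantaneous failure rate at time t, given survival up to t."""
    return (shape / scale) * (t / scale) ** (shape - 1)

# shape = 1: constant failure rate (no ageing)
constant_early = weibull_hazard(1.0, shape=1.0, scale=2.0)
constant_late = weibull_hazard(5.0, shape=1.0, scale=2.0)

# shape > 1: failure rate rises with age (wear-out)
wearout_early = weibull_hazard(1.0, shape=2.0, scale=2.0)
wearout_late = weibull_hazard(2.0, shape=2.0, scale=2.0)

still_working = weibull_survival(2.0, shape=1.0, scale=2.0)
```

The shape parameter controls whether failures become more or less likely as time passes, which is what makes the distribution flexible for time-to-event modeling.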

Uniform Distributions
Just like the first day of private school

The Uniform Distribution refers to a distribution in which numbers within a certain range are equally likely, and none outside it can occur. If you asked a computer to randomly generate whole numbers between 1 and 100, each number would have a 1-in-100 chance of being chosen, while 101 could not be an output. This is a uniform distribution. Humans, on the other hand...
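The 1-to-100 example can be simulated directly. This is a minimal sketch; the seed and the number of draws are arbitrary assumptions.

```python
import random
from collections import Counter

rng = random.Random(7)  # fixed seed for reproducibility

# randint includes both endpoints, so every integer 1..100 is possible.
draws = [rng.randint(1, 100) for _ in range(100_000)]
counts = Counter(draws)

smallest, largest = min(draws), max(draws)
distinct_values = len(counts)
least_common = min(counts.values())
most_common = max(counts.values())
```

Each value shows up roughly 1,000 times out of 100,000 draws, and nothing outside the range ever appears.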

Extreme Value Theory
Tails Are Important Too!

Most modeling places a large amount of emphasis on the centre of a distribution and neglects the tails. For risk management, it is critical that we model tails correctly, in addition to the centres.

For Example: In 2008, the financial industry mis-estimated the tails of the probability distributions used to model subprime mortgage defaults, so few foresaw the eventual crash and global financial crisis.

Generalized Extreme Value (GEV)
AKA the Fisher-Tippett distributions

GEV distributions are purpose-built for the modeling of tails. The theory behind GEV distributions is that each subsequent value has a smaller chance of being the maximum possible value. They come in 3 different flavours depending on the shape parameter: Gumbel, Fréchet, and (reversed) Weibull.
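One way to see the three flavours side by side is through the GEV cumulative distribution function, where the sign of a shape parameter selects the type (zero for Gumbel, positive for Fréchet, negative for reversed Weibull). The sketch below uses the textbook form of the CDF; the function name and the parameter values are illustrative assumptions, not taken from the slides.

```python
import math

def gev_cdf(x, shape, loc=0.0, scale=1.0):
    """CDF of the generalized extreme value distribution."""
    z = (x - loc) / scale
    if shape == 0.0:
        return math.exp(-math.exp(-z))       # Gumbel type
    t = 1.0 + shape * z
    if t <= 0.0:
        # Outside the support: below the lower bound (shape > 0)
        # or above the upper bound (shape < 0).
        return 0.0 if shape > 0 else 1.0
    return math.exp(-t ** (-1.0 / shape))    # Frechet / reversed Weibull

x = 5.0  # a "large" outcome, in units of the scale parameter
tail_gumbel = 1.0 - gev_cdf(x, shape=0.0)
tail_frechet = 1.0 - gev_cdf(x, shape=0.5)    # heavy tail
tail_weibull = 1.0 - gev_cdf(x, shape=-0.5)   # bounded tail
```

For the same large outcome, the Fréchet type assigns the most tail probability and the reversed Weibull type assigns none, because its support is bounded above.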

Peaks over Threshold (POT)
Hey, It’s Legal Now

• One might ask: ’But why not just throw away the data you don’t care about?’
• This is exactly the philosophy of POT, Peaks-Over-Threshold.
• The key here is to model only extreme events, with one frequency distribution and one intensity distribution.
• Anticipated events could be random, or non-random.
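The "throw away what you don't care about" step can be sketched as follows. Everything here is an assumption for illustration: the simulated losses stand in for real data, and the 95th-percentile threshold is an arbitrary choice. The sketch stops at summarizing how often and by how much the threshold is exceeded; fitting actual distributions to those two quantities is left out.

```python
import random

rng = random.Random(1)  # fixed seed for reproducibility

# Hypothetical daily losses; exponential draws stand in for real data.
losses = [rng.expovariate(1.0) for _ in range(5_000)]

# Keep only the observations above the empirical 95th percentile.
threshold = sorted(losses)[len(losses) * 95 // 100]
excesses = [x - threshold for x in losses if x > threshold]

exceedance_rate = len(excesses) / len(losses)   # how often: frequency
mean_excess = sum(excesses) / len(excesses)     # how large: intensity
```

Only about 5% of the data survives the cut, which is the point: the model is fitted to the extremes alone.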

Other Types of Distributions
The Possibilities are Truly Endless!

Covariation
Data of a Feather Flock Together

• Another important consideration when analyzing distributions is that some variables may be related.
• When one variable goes down a lot, the other may also do the same.

Covariance

$$\mathrm{Cov}(X,Y) = \frac{1}{n}\sum_{i=1}^{n} (X_i - \mu_X)(Y_i - \mu_Y) \qquad (12)$$

Correlation
One Way a Number Can Start a Relationship ;)

• The correlation, on the other hand, is simply the covariance scaled by the standard deviations of the two variables.
• The magnitude and direction of correlation are standardized, so it is comparable across datasets.

Correlation

$$\rho_{X,Y} = \frac{\mathrm{Cov}(X,Y)}{\sigma_X \sigma_Y} \qquad (13)$$

A high correlation (in absolute value) suggests a substantial linear relationship, while a low correlation suggests that there may not be one.
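Covariance and correlation can be computed directly from their definitions. This is a minimal sketch using the 1/n form and made-up data; the function names are illustrative.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def covariance(xs, ys):
    """Cov(X, Y) in the 1/n form."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def correlation(xs, ys):
    """Covariance divided by the product of the standard deviations."""
    sx = math.sqrt(covariance(xs, xs))  # Cov(X, X) is the variance of X
    sy = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (sx * sy)

x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]     # moves with x
y_down = [10, 8, 6, 4, 2]   # moves against x

cov_up = covariance(x, y_up)
rho_up = correlation(x, y_up)
rho_down = correlation(x, y_down)
```

The covariance of 4 depends on the units of the data, while the correlations of +1 and -1 do not, which is what makes correlation comparable across datasets.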

Section 4:

Testing and Validation

Validity
What is true and what is false?

• When analyzing data, we usually seek to confirm a conclusion, or hypothesis.
• The null hypothesis refers to the conclusion that we cannot observe a meaningful relationship in the data.
• Thus deciding whether or not to reject the null hypothesis is very important.
• We usually anchor this decision to a certain level of confidence and a probability distribution.
• A 95% confidence level, for example, means that if the null hypothesis is true, we would wrongly reject it only 5% of the time: about 1 in 20 such tests yields a false positive.

Hypothesis Testing
Measuring our Confidence

• To start with a hypothesis test, we need a test statistic to describe the relationship with an underlying distribution.
• This is usually formulated as the magnitude of what you are trying to measure, over how certain you are about it.

For example, consider a simple z test statistic:

$$z_{\mu} = \frac{\mu}{\sigma_{\mu}} \qquad (14)$$

To Consider: If the actual mean was 0, how likely is it that we observe a sample mean of µ?

Hypothesis Testing
Tying it all together

• A test statistic needs to be combined with a distribution to get a probability.
• Another way to frame this is: ’is the null hypothesis within the x% confidence interval of the value we observed?’
• In our example, you would take the z score, and then compare it to the critical threshold from your selected distribution.
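The steps above can be sketched end to end for a two-sided z test of "the true mean is 0". Several things here are assumptions for illustration: the returns are made up, the uncertainty of the mean is taken to be the standard error (standard deviation divided by the square root of n), and the standard normal is used as the reference distribution. With a sample this small, a t distribution would usually be the more careful choice.

```python
import math
from statistics import NormalDist

# Made-up daily returns for a hypothetical strategy.
sample = [0.012, -0.004, 0.009, 0.015, -0.002,
          0.007, 0.011, 0.003, -0.006, 0.010]

n = len(sample)
mean = sum(sample) / n
sd = math.sqrt(sum((x - mean) ** 2 for x in sample) / n)  # 1/n form
std_error = sd / math.sqrt(n)   # assumed meaning of the mean's uncertainty

z = mean / std_error            # magnitude over uncertainty

# Combine the statistic with a distribution to get a threshold and a probability.
critical = NormalDist().inv_cdf(0.975)            # 95% confidence, two-sided
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
reject_null = abs(z) > critical
```

Here z is roughly 2.5, which is beyond the critical value of about 1.96, so the null hypothesis of a zero mean would be rejected at the 95% level.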

One-Sided Hypothesis Testing
Inequalities

• The same can be done for one-sided hypotheses, where we try to verify an inequality instead of an equality.
• In these cases, you only care about one of the tails, since the other one would still confirm the inequality.
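The difference between looking at one tail and looking at both can be seen with a single hypothetical z score. The value 1.80 and the 5% significance level are assumptions chosen for illustration.

```python
from statistics import NormalDist

z = 1.80       # hypothetical test statistic
alpha = 0.05   # 95% confidence
std_normal = NormalDist()

# Two-sided: "is the mean different from 0?" uses both tails.
p_two_sided = 2 * (1 - std_normal.cdf(abs(z)))

# One-sided: "is the mean greater than 0?" uses the upper tail only.
p_one_sided = 1 - std_normal.cdf(z)

reject_two_sided = p_two_sided < alpha
reject_one_sided = p_one_sided < alpha
```

The same evidence is significant for the one-sided question but not for the two-sided one, so the choice of hypothesis has to be made before looking at the data.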

The Multiple Testing Problem
What the whole world almost missed

• Intuitively, would the chance of a false positive go up if you conduct a lot of tests on the same data?
• If you said ’Yes’, then you also understand the multiple testing problem.
• While statistics in finance have been adjusted for sample size for decades, it was not until quite recently that adjusting for the number of tests became mainstream.
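A small simulation makes the intuition concrete. In this sketch every null hypothesis is true, so any "significant" result is a false positive. The Bonferroni correction used for the adjusted threshold (dividing the significance level by the number of tests) is one common adjustment chosen here for illustration; the slides do not name a specific method, and the test count, trial count, and seed are arbitrary assumptions.

```python
import random
from statistics import NormalDist

rng = random.Random(2019)  # fixed seed for reproducibility
alpha = 0.05
n_tests = 20       # hypothetical: 20 strategies tested together
n_trials = 2_000   # number of repeated experiments

z_plain = NormalDist().inv_cdf(1 - alpha / 2)
z_bonferroni = NormalDist().inv_cdf(1 - alpha / (2 * n_tests))

def any_false_positive(threshold):
    """Run n_tests pure-noise tests; report whether any looks significant."""
    return any(abs(rng.gauss(0, 1)) > threshold for _ in range(n_tests))

rate_plain = sum(any_false_positive(z_plain) for _ in range(n_trials)) / n_trials
rate_bonferroni = sum(any_false_positive(z_bonferroni) for _ in range(n_trials)) / n_trials
```

With the unadjusted threshold, most experiments produce at least one false discovery; with the stricter threshold the rate falls back to around 5%.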

Section 5:

Standardization and Normalization

The Set-up
Why transform your data?

• While it is simple to compare data reported in the same sample, it often becomes a little more involved to compare across samples.
• Think about reporting heights: if one sample is reported in feet and inches, but the other in cm, it becomes hard to say if one person in sample A is taller than another in sample B.
• The same intuition applies in standardization and normalization: comparability should be preserved.

The Mars Climate Orbiter Disaster (launched Dec. 11, 1998)
A NASA review board found that the problem was in the thruster software. One piece of software calculated thruster impulse in pound-force seconds. A second piece of code assumed it was in the metric unit, newton-seconds, killing the mission on a day when engineers had expected to celebrate.

Normalization
But Who is Really ’Normal’ Nowadays?

• The first way to transform data is to Normalize it, which is also the simplest way.
• This simply involves scaling the data set so that the range is the same across all sets.
• Normalizing enables us to compare the median, mean, and mode across two different datasets.
• But other differences may still remain.
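A minimal sketch of normalization, assuming the common min-max convention of rescaling every dataset to the range [0, 1] (the slides only say the range should be the same across sets). The heights are made up and echo the cm versus inches example.

```python
def min_max_normalize(xs):
    """Rescale so the smallest value maps to 0 and the largest to 1."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

heights_cm = [150.0, 165.0, 180.0, 195.0]      # made-up sample in cm
heights_in = [h / 2.54 for h in heights_cm]    # same people, in inches

norm_cm = min_max_normalize(heights_cm)
norm_in = min_max_normalize(heights_in)
```

After normalization the two versions of the sample line up, even though the raw numbers were in different units.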

Standardization
How We Survive Hammami’s Finals

• Standardizing is another common way of ensuring comparable data.
• This involves subtracting the mean of the data set and then dividing by the volatility (standard deviation).
• Standardization lets us adjust for outliers and still compare cumulants objectively.
• But this yields ranges, medians, and modes that may not be the same.
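The two steps, subtracting the mean and dividing by the standard deviation, can be sketched as follows. The dataset is made up, and the 1/n form of the standard deviation is assumed to match the earlier formulas.

```python
import math

def standardize(xs):
    """Subtract the mean, then divide by the standard deviation (1/n form)."""
    n = len(xs)
    mu = sum(xs) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in xs) / n)
    return [(x - mu) / sigma for x in xs]

scores = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up data with mean 5 and stdev 2
z_scores = standardize(scores)

z_mean = sum(z_scores) / len(z_scores)
z_sd = math.sqrt(sum((z - z_mean) ** 2 for z in z_scores) / len(z_scores))
```

The standardized data has mean 0 and standard deviation 1, but its range (here from -1.5 to 2) is whatever the data dictates, unlike with normalization.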

Recap:
Or should I say... reQAP?

1 Another Brief Case Study
1.1 "John Ioannidis
1.2 ...and the Cross-Section of Expected Returns"
2 Statistical Foundations
2.1 Sampling
2.2 Moments and Cumulants
3 Distributions and Correlations
3.1 Distributions
3.2 Correlations
4 Testing and Validation
4.1 Hypothesis Testing
4.2 The Mistake Everyone Made
5 Standardization and Normalization
5.1 Normalization
5.2 Standardization
