Advanced Quantitative Analysis: Methodology for the Social Sciences

Thomas Plümper
Professor of Quantitative Social Research
Vienna University of Economics
[email protected]


Credits

- transparencies from Vera Troeger’s advanced regression course
- transparencies from Richard Traunmueller’s causal inference course

- Wikipedia
- Institute for Digital Research and Education http://stats.idre.ucla.edu/stata/dae/
- Stata Corp.

- FiveThirtyEight https://fivethirtyeight.com/


Structure: Estimation (Scale of the Depvar) and Complications

[Overview table. Rows: estimators by scale of the dependent variable (OLS; Probit/Logit; Multinomial; Ordered; Poisson/Negative Binomial; Survival). Columns: complications (Errors/Residuals/Significance; Coefficients/Effects; Functional Form; Conditionality; Selection/Truncation/Censoring; Dynamics; Heterogeneity; Spatial Dependence). Cells indicate the chapters (3 to 13) that cover each combination.]


ToC

Chapter 1: Empirical Research and the Inference Problem
Chapter 2: Probabilistic Causal Mechanisms and Modes of Inference
Chapter 3: Statistical Inference and the Logic of Regression Analysis
Chapter 4: Linear Models: OLS
Chapter 5: Minor Complications and Extensions
Chapter 6: More Complications: Selection, Truncation, Censoring
Chapter 7: Maximum Likelihood Estimation of Categorical Variables
Chapter 8: ML Estimation of …
Chapter 9: Dynamics and the Estimation …
Chapter 10: Temporal Heterogeneity
Chapter 11: Causal Heterogeneity
Chapter 12: Spatial Dependence
Chapter 13: The Analysis of Dyadic Data
Chapter 14: Effect Strengths and Cases in Quantitative Research


Literature: useful textbooks (among others)


Chapter 1

Cohen, M.F., 2013. An introduction to logic and scientific method. Read Books Ltd.
Curd, M. and Cover, J., 1998. Philosophy of science: The central issues.

Chapter 2

Pearl, J., 2009. Causality. Cambridge University Press.
Morgan, S.L. and Winship, C., 2007. Counterfactuals and causal inference: Methods and principles for social research. Cambridge University Press.

Chapter 3

Leamer, E.E., 1978. Specification searches: Ad hoc inference with nonexperimental data (Vol. 53). John Wiley & Sons.
Nichols, A., 2007. Causal inference with observational data. Stata Journal, 7(4), p.507.
Neumayer, E. and Plümper, T., 2017. Robustness Tests for Quantitative Research. Cambridge University Press.

Chapter 4

Kennedy, P., 2003. A guide to econometrics. MIT press.


Chapter 5

Beck, N. and Jackman, S., 1998. Beyond linearity by default: Generalized additive models. American Journal of Political Science, pp.596-627.
Schmidt, C.O., Ittermann, T., Schulz, A., Grabe, H.J. and Baumeister, S.E., 2013. Linear, nonlinear or categorical: how to treat complex associations in regression analyses? Polynomial transformations and fractional polynomials. International Journal of Public Health, 58(1), pp.157-160.

Chapter 6

Heckman, J.J., 1976. The common structure of statistical models of truncation, sample selection and limited dependent variables and a simple estimator for such models. In Annals of Economic and Social Measurement, Volume 5, Number 4 (pp. 475-492). NBER.
Amemiya, T., 1973. Regression analysis when the dependent variable is truncated normal. Econometrica: Journal of the Econometric Society, pp.997-1016.
Heckman, J.J., 1977. Sample selection bias as a specification error (with an application to the estimation of labor supply functions).

Chapter 7


Long, J.S. and Freese, J., 2006. Regression models for categorical dependent variables using Stata. Stata press.

Chapter 8

King, G., 1989. Variance specification in event count models: From restrictive assumptions to a generalized estimator. American Journal of Political Science, pp.762-784.
King, G., 1989. Event count models for international relations: Generalizations and applications. International Studies Quarterly, 33(2), pp.123-147.
Mullahy, J., 1997. Heterogeneity, excess zeros, and the structure of count data models. Journal of Applied Econometrics, pp.337-350.

Chapter 9

De Boef, S. and Keele, L., 2008. Taking time seriously. American Journal of Political Science, 52(1), pp.184-200.
Judson, R.A. and Owen, A.L., 1999. Estimating dynamic panel data models: a guide for macroeconomists. Economics Letters, 65(1), pp.9-15.
Plümper, T. and Troeger, V.E., 2018. Not so harmless after all: The fixed effects model and dynamic misspecification. Political Analysis.


Chapter 10

Toyoda, T., 1974. Use of the Chow test under heteroscedasticity. Econometrica: Journal of the Econometric Society, pp.601-608.
Stock, J.H., 1994. Unit roots, structural breaks and trends. Handbook of Econometrics, 4, pp.2739-2841.
Perron, P., 2006. Dealing with structural breaks. Palgrave Handbook of Econometrics, 1(2), pp.278-352.

Chapter 11

Nickell, S., 1981. Biases in dynamic models with fixed effects. Econometrica: Journal of the Econometric Society, pp.1417-1426.

Bell, A. and Jones, K., 2015. Explaining fixed effects: Random effects modeling of time-series cross-sectional and panel data. Political Science Research and Methods, 3(1), pp.133-153.

Clark, T.S. and Linzer, D.A., 2015. Should I use fixed or random effects?. Political Science Research and Methods, 3(2), pp.399-408.


Chapter 12

Franzese Jr, R.J. and Hays, J.C., 2008. Interdependence in comparative politics: Substance, theory, empirics, substance. Comparative Political Studies, 41(4-5), pp.742-780.

Neumayer, E. and Plümper, T., 2017. W. Political Science Research and Methods.

Chapter 13

Neumayer, E. and Plümper, T., 2010. Spatial effects in dyadic data. International Organization, 64(1), pp.145-166.
Neumayer, E. and Plümper, T., 2019. Dyadic data analysis. In: Franzese et al. (eds.), Handbook of Research Methods.
Ross, M.H. and Homer, E., 1976. Galton's problem in cross-national research. World Politics, 29(1), pp.1-28.

Chapter 14

King, G., Tomz, M. and Wittenberg, J., 2000. Making the most of statistical analyses: Improving interpretation and presentation. American journal of political science, pp.347-361.


Hanmer, M.J. and Ozan Kalkan, K., 2013. Behind the curve: Clarifying the best approach to calculating predicted probabilities and marginal effects from limited dependent variable models. American Journal of Political Science, 57(1), pp.263-277.
Plümper, T. and Neumayer, E., 2019. Effect size analysis. Unpublished manuscript.
Williams, R., 2012. Using the margins command to estimate and interpret adjusted predictions and marginal effects. Stata Journal, 12(2), p.308.



Chapter 1: Empirical Research and the Inference Problem


What is Science? When is Research Scientific?


Science is a Methodology (or perhaps many, but that is not the point here)


The Logic of Science

“Science is a public process. It uses systems of concepts called theories to help interpret and unify observation statements called data; in turn the data are used to check or ‘test’ the theories. Theory creation may be inductive, but demonstration and testing are deductive, although, in inexact subjects, testing will involve statistical inference. Theories that are at once simple, general and coherent are valued as they aid productive and precise scientific practice.” David F. Hendry 1980


The Scientific Method

The scientific method has been invented to eliminate or at least largely reduce the influence of priors, beliefs, preferences, and interests on scientific results.


The Scientific Method

“The scientific method is a body of techniques for investigating phenomena, acquiring new knowledge, or correcting and integrating previous knowledge. To be termed scientific, a method of inquiry is commonly based on empirical or measurable evidence subject to specific principles of reasoning. The Oxford Dictionaries Online define the scientific method as "a method or procedure that has characterized natural science since the 17th century, consisting in systematic observation, measurement, and experiment, and the formulation, testing, and modification of hypotheses".”

Wikipedia ‘Scientific Method’ 24.02.2017


The Scientific Method (two parallel routes)

Route 1:
- observation of real-world phenomena
- identification of a puzzle or a question to which the answer appears to be unknown
- formulation of an ad hoc explanation
- identification of a case or a set of cases to which the explanation applies
- collection of data to explore the phenomenon
- generalization of findings with respect to the causal mechanism, effect strengths, and the population
- results in a theory that explains the selected cases

Route 2:
- logical deduction of predictions from assumptions to formulate a potential causal mechanism
- development of predictions into hypotheses
- identification of the population of cases to which the ‘theory’ applies
- development of a model that explains the variation of outcomes in the population of cases
- collection of data that matches the model
- test of the predictions of the theory embedded in a model, using a random draw of cases from the population
- generalization of sample results to the population
- results in a tested theory, verified or falsified for the chosen empirical model and the sample; effects are average treatment effects for the sample



What then is the Purpose of (Social) Science?


What then is the Purpose of (Social) Science?

Science aims at the simplification of reality to sets of relevant causal mechanisms which explain (social) phenomena. -> Theories

Social Science Theories
A theory is a generalized explanation of how nature and culture determine social behaviour and outcomes.

Every theory simplifies reality to what the scientist perceives as relevant.
Elements:
- assumptions
- causal mechanism(s)
- prediction


Theory

A theory is a theory if and only if it can be wrong.

Necessary Conditions for a Theory
− generalizations over a category of phenomena
− predictions over outcomes given a state or a change of state
− consistency
In other words, if something cannot be expressed by a formal model, it is not a theory. More importantly, theories must be falsifiable. A set of assumptions and their derivations that is formulated in a way that cannot be falsified is useless.


Verification

Falsifiability is a necessary condition for theories.

Yet, the verification of predictions does not render a theory correct.


Example

In 1726, Jonathan Swift, following Johannes Kepler, predicted the existence of two Mars moons. In 1877 these two moons were actually detected. One of these moons is now called Swift, the other one Voltaire (scientifically, they are called Phobos and Deimos), because Voltaire made an identical prediction in 1750.

Causal Mechanism and Prediction
Well, Kepler predicted two moons because it was already known that the Earth has one moon and Jupiter four moons. Since Kepler believed in the symmetry of a god-given universe, he predicted the existence of two Mars moons.

Indeed, according to Kepler, the next planet, Saturn, should have 8 moons – it has five, Uranus should have 64 moons – and has five as well. Hence, no symmetry. Correct prediction (in one case), but wrong theory nevertheless.


What can we learn here: Induction and Underdetermination

Induction is not deductively valid. Pretty much everybody agrees.

All empirical evidence is consistent with more than one theory. Some right, some wrong. (Duhem and Quine)

Any phenomenon can be explained by a multiplicity of theories. How, then, can data ever be sufficient to prove a theory right?


The Status and Relevance of Empirical Research

Empirical research aims at testing the validity of a theory. (It can also be used to develop a theory, but this theory is then untested)


What are Scientists interested in?
- maximizing life-time utility (income, social status, attention, and so on)
- getting tenure
- getting cited
- publications in a certain type of journal (a book with a very good publisher)

- results consistent with their ideological priors
- results consistent with their previous results
- results that attract media attention


What Scientists ought to be interested in

- formulation of consistent theories
- reliable tests of the validity of a consistent theory

and besides:

- concept development
- data collection
- and so on…


Simplicity

Ultimately, (social) scientists are interested in theories that are simultaneously as simple and as general as possible.

It follows:
1. A simpler theory is better than a more complicated theory which does not explain more.
2. An equally simple theory that explains more is better than a theory that explains less.
3. A more complicated theory that explains more is not per se better than a less complicated theory that explains less.

‘More’: more cases, more phenomena, …

Social scientists need to develop theories and test them (test generalizations of the theory). BUT: keep in mind that theories need to simplify.

Testing theories means testing whether the predictions of the theory are correct, not whether the assumptions are ‘true’.


How can we know whether a theory makes valid predictions?


Popper: Falsifiability

Popper forcefully states that scientists cannot prove that a theory makes valid predictions; one can only falsify a theory.

Popper uses the term falsifiability with two different meanings:

1. Falsifiability is a logical property of statements which requires that scientific statements logically imply at least one testable prediction.

2. Falsifiability is a normative construct, telling scientists that a test of a theory should try to refute it.

There is no relevant dissent with the first meaning, but the prescriptive meaning has led to huge controversies.


Popper on Popper

1. It is easy to obtain confirmations, or verifications, for nearly every theory.
2. Confirmation should only count if we should have expected an event which would have refuted the theory.
3. Every good scientific theory forbids certain things to happen.
4. A theory which is not refutable by any conceivable event is non-scientific.
5. Every genuine test of a theory is an attempt to falsify it.
6. Confirming evidence should not count except when it is the result of a genuine test of the theory (i.e. it is an unsuccessful falsification attempt).
7. Some genuinely testable theories, though falsified, are being upheld by their admirers. This destroys or at least lowers the theory's scientific status.


Pauli: Not even wrong

The 'scientific method' is to replace belief by empirical testing. Scientists test the predictions of theories, the causal mechanisms of theories, and sometimes the assumptions underlying theories in a way that allows statements about the validity of the inference made on the basis of the analysis. A theory is scientific if it allows the derivation of hypotheses (predictions) which are testable in principle. A theory that cannot be tested is "not even wrong". (Wolfgang Pauli)

To make the logic of 'not even wrong' as clear as possible: It is the purpose of science to develop theories which can be wrong. In fact, if the probability that a theory is wrong becomes too small, the theory is trivial. If an argument cannot be falsified in principle, it is not a theory.


On Naïve Falsification

Thomas Kuhn and Imre Lakatos:

Abandoning a theory if it makes a single false prediction would eliminate too much good research.
Lakatos: one cannot test a theory in isolation. One always tests multiple auxiliary assumptions at the same time.
Thomas Kuhn (1962): Actual scientists do not refute a theory simply because it makes false predictions (or even worse: one false prediction).


Deterministic versus Probabilistic Theories

The huge majority of (social science) theories is not deterministic but probabilistic.

This implies that we cannot falsify a theory in Popper’s sense. Rather, we have to show that on average the theory’s predictions are wrong.

But let’s not talk about paradigmatic change and scientific revolutions here…


Can probabilistic theories be tested?

Of course, but scientists need to agree socially on a certain threshold that tells us when the empirical evidence contradicts the probabilistic predictions derived from a theory 'too much'.


Probabilistic Theories and Verification

Verification is logically impossible. (Popper)

But is it?


Verification and Falsification in a Bayesian Approach to Science

“Verification is logically impossible.” (page 13)

Bayesians assume that
i) knowledge about the validity of a theory is on a continuum from 0 to 1, where 0 is certainty that a theory is wrong and 1 is certainty that a theory is correct;
ii) plausible empirical analyses and evidence change the degree to which we believe that a theory is correct;
iii) empirical research makes a larger contribution the larger the effect of the findings on our perception that a theory is correct.
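As a minimal numeric sketch of this Bayesian view, Bayes' theorem updates the probability that a theory is correct after observing a test result. All numbers below (the prior and the two likelihoods) are hypothetical illustrations, not values taken from the text.

```python
# Minimal sketch of Bayesian belief updating about a theory.
# All numbers are hypothetical illustrations.

prior = 0.5            # prior belief that the theory is correct
p_pass_if_true = 0.9   # assumed P(test passed | theory correct)
p_pass_if_false = 0.3  # assumed P(test passed | theory wrong)

# Bayes' theorem: P(theory | pass) = P(pass | theory) * P(theory) / P(pass)
p_pass = p_pass_if_true * prior + p_pass_if_false * (1 - prior)
posterior = p_pass_if_true * prior / p_pass
print(f"Belief after one successful test: {posterior:.2f}")   # 0.75

# A run of successful tests pushes the belief towards (but never exactly to) 1.
belief = prior
for _ in range(5):
    p_pass = p_pass_if_true * belief + p_pass_if_false * (1 - belief)
    belief = p_pass_if_true * belief / p_pass
print(f"Belief after five successful tests: {belief:.3f}")
```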




Summary

1. Science is dominantly interested in the formulation of valid theories.
2. Empirical research explores the validity of theories.
3. The scientific method has been invented to eliminate the influence of priors, beliefs, and ideology on scientific results.
4. Methodology develops the elements of a toolbox which aim at maximizing the validity of inferences.
5. Methodology requires rules about the interpretation of empirical results.
6. Examples include Popper, Lakatos, statistical inference, Bayesian philosophy of science, and so on…
7. These approaches disagree on the status of falsification and verification.


Chapter 2: Probabilistic Causal Mechanisms and Modes of Inference


Causality and Causal Inference



What is Causality?

At the beginning… …all events were predetermined by gods.

Consider natural disasters as an example:

- acts of god until the Lisbon tsunami in 1755
- acts of nature until the late 20th century
- acts of human maladaptation


Causality in Disaster Damage and Mortality

Ted Steinberg (2013): “Calling a disaster an (…) act of nature is a distraction. It is a result of poor planning and a lack of preparation.”

The World Bank and the United Nations (2010: 1): “Disasters expose the cumulative implications of many earlier decisions, some taken individually, others collectively. (…) Prevention is possible (…) [though] many measures – private and public – must work well together for effective prevention.”

Plümper, Neumayer, Quiroz (2017): “What we call natural disasters are in fact disasters allowed for or at least exacerbated by human action. True, some hazards are completely unforeseeable or so extreme that no human action could prevent them from turning into disasters. For the most part, however, hazards only turn into disasters where humans have made insufficient efforts at prevention, mitigation, preparation and adaptation.”


Hume ‘Treatise of Human Nature’ (1739)

“We remember to have seen that species of object we call flame, and to have felt that species of sensation we call heat. We likewise call to mind their constant conjunction in all past instances. Without any farther ceremony, we call the one cause and the other effect, and infer the existence of the one from that of the other.”

Thus, for Hume, causality is perfect (deterministic) association.


From Hume’s Perfect Association to Popper’s Falsification

If Hume is correct that causality exists if and only if two ‘events’ are perfectly associated, then Popper is correct in arguing that a single refutation falsifies the hypothesis that the cause causes the effect.

However, Hume is wrong:

First, perfect association does not need to indicate causality. Two factors could be independent but perfectly predetermined by a third factor. Second, even if perfect association indicated causality, this would not mean that perfect association was a necessary condition.


Another View: Does Correlation Imply Causality?

Pearson: “I interpret Galton to mean that there was a category broader than causation, namely correlation, of which causation was only the limit.”

In logic, the technical use of the word "implies" means "is a sufficient condition for". This is the meaning intended by statisticians when they say causation is not certain. Indeed, p implies q has the technical meaning of the material conditional: if p then q, symbolized as p → q. That is, "if circumstance p is true, then q follows." In this sense, it is always correct to say "Correlation does not imply causation."


Tufte on Pearson’s Logic

Edward Tufte, in a criticism of the brevity of "correlation does not imply causation", deprecates the use of "is" to relate correlation and causation (as in "Correlation is not causation"), citing its inaccuracy as incomplete. While it is not the case that correlation is causation, simply stating their nonequivalence omits information about their relationship. Tufte suggests that the shortest true statement that can be made about causality and correlation is one of the following:

"Empirically observed covariation is a necessary but not sufficient condition for causality." "Correlation is not causation but it sure is a hint."


Bullet in the Head

Does bullet in the head cause death?

Medicinenet.com: “How does a bullet damage flesh and organs? Damage to the body from a bullet is caused in two ways. The first type of injury is caused by the direct blow or crush of the bullet. Whatever gets in its way is damaged, and this bullet track causes a permanent cavity. If the bullet yaws, the energy transfer increases and the cavity becomes larger. The second injury type is caused by the shock waves of the bullet. The tissue surrounding the bullet track becomes caught up in a temporary vacuum that can be as much as 40 times as large as the bullet itself. This tissue cavity gets stretched and deformed and then reforms itself numerous times, like ripples in the water, until the tissue cavity returns to normal position. With this type of injury, the higher the velocity of the bullet, the larger the cavity of tissue that is at risk for damage.”


Surviving a Gunshot

Rudi Dutschke
On April 11, 1968, Dutschke was shot in the head by a young anti-communist, Josef Bachmann. Dutschke survived the assassination attempt, and he and his family went to the United Kingdom in the hope that he could recuperate there. Dutschke continued to suffer health problems. He died on 24 December 1979 in Århus, Denmark. He had an epileptic seizure while in the bathtub and drowned.

Gabrielle Giffords
On January 8, 2011, Giffords was a victim of an assassination attempt at a Safeway supermarket where she was meeting publicly with constituents. She was critically injured by a gunshot wound to the head. Giffords was later brought to a rehabilitation facility in Houston, Texas, where she recovered some of her ability to walk, speak, read and write. On January 22, 2012, Giffords announced her resignation from her congressional seat in order to concentrate on recovering from her wounds.


Does Bullet in the Head Cause Death?

There are two perspectives here:

1) A bullet in the head may cause death. The relation is probabilistic – the probability of surviving a shot in the head is low, but not nil.

2) A bullet in the head does not cause death. What causes death is the cavity formed in the brain tissue, internal bleeding, or the shock.


Bullet in the Leg

Bullet in the leg may cause death:

Police Report: “A 33-year-old Wilmington man found shot in the leg died as a result of his injuries, city authorities announced Monday. According to Wilmington Police, the victim was located in an alleyway in the 2300 block of North Pine Street, on Sunday, January 22, 2017, at approximately 8:27 a.m., unresponsive and bleeding from the upper, right leg.”

Does a bullet in the leg cause death? 1) Yes, in a probabilistic sense, with a low probability of death. 2) No, bleeding to death causes death.


Does ‘Smoking Kill’ and how do we know?


Causality as Regularity… is dead:

- causality exists without regularity
- not everything that is fairly regular is necessarily causal


Causality as Counterfactual

A counterfactual conditional (abbreviated CF) is a conditional containing an if-clause which is contrary to fact. The term counterfactual was coined by Nelson Goodman in 1947, extending Roderick Chisholm's (1946) notion of a "contrary-to-fact conditional".

In 1748, when defining causation, David Hume referred to a counterfactual case: "… we may define a cause to be an object, followed by another, and where all objects, similar to the first, are followed by objects similar to the second. Or in other words, where, if the first object had not been, the second never had existed …" — David Hume, An Enquiry Concerning Human Understanding

Heckman (2005: 1): “Causality is a very intuitive notion that is difficult to make precise without lapsing into tautology.” He argues that two concepts are central for a scientific definition of causality: a set of possible outcomes and manipulation of one (or more) of the determinants.


David Lewis
David Lewis (1973) proposed the following definition of the notion of causal dependence: “An event E causally depends on C if, and only if, (i) if C had occurred, then E would have occurred, and (ii) if C had not occurred, then E would not have occurred.”

Judea Pearl
Pearl defines counterfactuals directly in terms of a "structural equation model". Given such a model, the sentence "Y would be y had X been x" (formally, X = x > Y = y) is defined as the assertion: If we replace the equation currently determining X with a constant X = x, and solve the set of equations for variable Y, the solution obtained will be Y = y. This definition has been shown to be compatible with the axioms of possible world semantics and forms the basis for causal inference in the natural and social sciences, since each structural equation in those domains corresponds to a familiar causal mechanism that can be meaningfully reasoned about by investigators. https://www.microsoft.com/en-us/research/video/tutorial-session-b-causes-and-counterfactuals-concepts-principles-and-tools/?from=http%3A%2F%2Fresearch.microsoft.com%2Fapps%2Fvideo%2Fdefault.aspx%3Fid%3D206977
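A minimal sketch of Pearl's 'replace the equation and solve' recipe in a toy linear structural model. The equations, coefficients, and background values below are invented purely for illustration, not taken from Pearl's text.

```python
# Toy structural equation model (all coefficients are hypothetical):
#   Z := U_z
#   X := 2*Z + U_x
#   Y := 3*X + Z + U_y
# The counterfactual "Y had X been x" is obtained by replacing the
# equation for X with the constant X = x and solving the rest.

def solve(u_z, u_x, u_y, do_x=None):
    z = u_z
    x = 2 * z + u_x if do_x is None else do_x   # intervention replaces the X equation
    y = 3 * x + z + u_y
    return x, y

# Background conditions (exogenous terms) of one particular unit:
u_z, u_x, u_y = 1.0, 0.5, -0.2

x_obs, y_obs = solve(u_z, u_x, u_y)            # what actually happened
x_cf, y_cf = solve(u_z, u_x, u_y, do_x=0.0)    # "Y had X been 0" for the same unit
print(f"observed:       X={x_obs}, Y={y_obs}")
print(f"counterfactual: X={x_cf}, Y={y_cf}")
```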


Lewis’ and Pearl’s definitions are inherently deterministic, but perhaps not necessarily so.


Rubin-Neyman-Holland (Potential Outcomes)

The Rubin causal model (RCM), also known as the Neyman–Rubin causal model, is an approach to the statistical analysis of cause and effect based on the framework of potential outcomes, named after Donald Rubin. The approach was later extended into a general framework for thinking about causation in both observational and experimental studies.

The potential outcomes framework is based on the idea of potential outcomes and the assignment mechanism: every unit has different potential outcomes depending on their "assignment" to a condition. Potential outcomes are expressed in the form of counterfactual conditional statements, which state what would be the case conditional on a prior event occurring. For instance, a person would have a particular income at age 40 if they had attended a private college, whereas they would have a different income at age 40 had they attended a public college.


The Fundamental Problem of Causal Inference

One cannot observe unit i given treatment and unit i not given treatment at the same time.

Since it is not possible to observe both potential outcomes for the same unit at the same time, it is only possible to observe different outcomes for treatment and no treatment for the same unit at different times, for different units at the same time, or for different units at different times.

Thus, for exact causal inference, one of the potential outcomes is always missing.

The fundamental problem of causal inference makes observing causal effects impossible. However, this does not make causal inference impossible. (hence one should not call it the fundamental problem of causal inference in the first place…)
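A small simulation sketch of this setup (all numbers are invented for illustration): each unit has two potential outcomes, only one of which is ever observed, yet under random assignment the treated-control difference in means recovers the average effect.

```python
import numpy as np

# Purely illustrative simulation of the potential outcomes framework.
rng = np.random.default_rng(0)
n = 10_000

y0 = rng.normal(10, 2, n)          # potential outcome without treatment
y1 = y0 + 3                        # potential outcome under treatment (true effect = 3)

treat = rng.integers(0, 2, n)      # random assignment
y_obs = np.where(treat == 1, y1, y0)   # only one potential outcome is observed per unit

# The unit-level effect y1 - y0 is never observable from y_obs alone,
# but the difference in means estimates the average treatment effect.
ate_hat = y_obs[treat == 1].mean() - y_obs[treat == 0].mean()
print(f"estimated ATE: {ate_hat:.3f} (true ATE: 3)")
```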


Causal Inference is not equal to the Observation of Causal Effects

A typical error of interpreting causal inference as if it was causal effect observation: “A randomized experiment works by assigning people randomly to treatments (in this case, public or private college). Because the assignment was random, the groups are (on average) equivalent, and the difference in income at age 40 can be attributed to the college assignment since that was the only difference between the groups.”

But it is not the only difference between the groups, because the groups are of finite size and therefore randomization of the treatment does not perfectly balance the groups.

The assignment mechanism is the explanation for why some units received the treatment and others the control. However, note that randomization only asymptotically generates treatment and control groups with identical properties.
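A short simulation sketch of this point (the covariate and sample sizes are invented for illustration): with a finite sample, randomly assigned groups differ on a pre-treatment covariate, and the imbalance shrinks only as n grows.

```python
import numpy as np

# Illustrative only: randomization balances covariates in expectation, not in any finite sample.
rng = np.random.default_rng(1)

def mean_imbalance(n, reps=2_000):
    """Average absolute difference in a covariate's mean between randomized groups."""
    diffs = []
    for _ in range(reps):
        x = rng.normal(0, 1, n)            # a pre-treatment covariate
        treat = rng.integers(0, 2, n)
        if treat.sum() in (0, n):          # skip degenerate assignments
            continue
        diffs.append(abs(x[treat == 1].mean() - x[treat == 0].mean()))
    return np.mean(diffs)

for n in (20, 200, 2_000):
    print(f"n={n:5d}: average covariate imbalance = {mean_imbalance(n):.3f}")
```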


Rubin on the Intuitive Interpretation of Treatment Effects

Intuitively, the causal effect of one treatment, E, over another, C, for a particular unit and an interval of time from t1 to t2 is the difference between what would have happened at time t2 if the unit had been exposed to E initiated at t1 and what would have happened at t2 if the unit had been exposed to C initiated at t1:

1) 'If an hour ago I had taken two aspirins instead of just a glass of water, my headache would now be gone,' or 2) ‘Because an hour ago I took two aspirins instead of just a glass of water, my headache is now gone.'

Note that only statement 1 assumes external validity of the claim. Statement 2 assumes internal validity of the claim.

Note also that internal and external validity are not identical and (at the very least) independent of each other.


Internal and External Validity

Internal Validity: the obtained effect of x on y for sample k is the true effect of x on y for sample k (plus or minus some stochastic sampling error).

External Validity: the obtained effect of x on y in sample k is the true effect of x on y in the population P.

Neumayer and Plümper (2017) define internal validity as an estimated effect that is correct for the units in the sample. An identified average treatment effect is thus not necessarily internally valid if causal heterogeneity exists. In fact, the average treatment effect does not need to validly describe the true effect of x on y for any i.
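A small sketch of the last point (the unit-level effect sizes are hypothetical): with two equally large groups of units whose true effects are +4 and -2, the average treatment effect of about +1 describes no individual unit.

```python
import numpy as np

# Illustrative: under causal heterogeneity the ATE may describe no single unit.
rng = np.random.default_rng(2)
n = 10_000

group = rng.integers(0, 2, n)                    # two equally likely types of units
unit_effect = np.where(group == 1, 4.0, -2.0)    # hypothetical unit-level effects

y0 = rng.normal(0, 1, n)
y1 = y0 + unit_effect
treat = rng.integers(0, 2, n)
y_obs = np.where(treat == 1, y1, y0)

ate_hat = y_obs[treat == 1].mean() - y_obs[treat == 0].mean()
print(f"estimated ATE: {ate_hat:.2f}")           # close to +1
print("share of units whose true effect equals the ATE:",
      np.isclose(unit_effect, 1.0).mean())       # 0.0
```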


Population and Sample

Neumayer and Plümper 2017:

"A population is a set of cases or subjects (such as individuals, groups, institutions, countries etc.). It exists if and only if its subjects can be distinguished from other subjects that do not belong to the population (Ryder 1964). Definitions of a population must therefore, implicitly at least, justify the set of cases included given the causal mechanism studied. A case is the unit of analysis that can but need not be observed more than once, e.g. over time. A sample ought to be a strict subset of the population. Samples can be selected or random. To be a perfect random sample, all cases included in the population need to have an identical a priori probability of being drawn into the sample while all cases that do not belong to the population have an a priori probability of zero of being drawn into the sample.”


The Potential Outcomes Approach focuses on the identification of causal effects through the equalization of treatment and control groups. According to proponents, equalization can be achieved through

- randomization

and perhaps

- matching
- regression discontinuities
- selection models
- instrumental variable models

None of the above works perfectly, though.

We return to these issues in term 2.


Pearl on the Potential Outcome Approach

A major shortcoming of RCM is that all assumptions and background knowledge pertaining to a given problem must first be translated into the language of counterfactuals (e.g., ignorability) before analysis can commence. In SEM, by comparison, Pearl (2000) and Heckman (2008) hold that background knowledge is expressed directly in the vocabulary of ordinary scientific discourse, invoking cause-effect relationships among realizable, not hypothetical variables.

Neumayer and Plümper on the Potential Outcome Approach

Causality may exist where manipulation of causes is impossible and it may exist without change. For example, a perfectly stable equilibrium that resists change will have causes. A black hole does not emit light and never will, but this state is caused by its gravitational force. In other words, causality exists beyond the realm of causes that can be manipulated.


More Concerns

Causality can exist even though treatments cannot be randomized and treatment and control group cannot be perfectly matched.

Hence, identification of causal effects is different from the existence of causality.

Causality is more than a treatment effect.

Causality does not have a single effect.

Causal effects are conditioned by other factors.

Causal effects vary over time in two different senses:

- the effect strength changes in the short run
- the effect strength …


Causality is more than the Treatment Effect

Causal inference consists of five distinct elements:

1. the identification of a causal relation between two variables (cause and effect);
2. the estimation or computation of the strength of the effect;
3. the identification and understanding of the causal mechanism;
4. the generalization of the estimated effect to all cases included in the sample;
5. the generalization from the observed cases to the set of cases defined as the population.


Causality: The Case of the Effect of Higher Education on Wages

It seems plausible to argue that higher education has a positive effect on the expected utility of most people, because individuals voluntarily enrol at a university.

It seems also plausible that university education does not have a positive utility for some people, because some individuals voluntarily leave the university without a degree.

Examples: Steve Jobs, Bill Gates, Paul Allen, Oprah Winfrey, Michael Dell, Mark Zuckerberg, Larry Ellison, Daniel Ek, but also Anke Engelke, Guenther Jauch, Uli Hoeness, …

A huge element of this utility seems to be the degree, because few students stay forever without finishing and few stay for a 2nd or 3rd degree.

Experiment: One could randomly give away degree certificates for half of the enrolled student population and compare the share of students that take courses among the group with and the group without degree. One could (in the longer run) compare income levels of those finishing and those leaving without degree.


Conditional Effects of Higher Education depend on

- individual characteristics
- cohort (year of degree)
- jobs (sector) choice
- country of employment
- access to network
- the qualification of others
- the individual characteristics of others
- the degree-giving university
- the degree
- the number of years until degree
- extracurricular experience
- internships

Hence, the effect of higher education on the income of Steve Jobs differs from the effect of higher education on the income of, say, yours sincerely. So what does the Average Treatment Effect tell us beyond that it is positive? If it is plus 4000 dollars per year, does it tell us that we will earn 4000 dollars more if we study? What if we finished a degree in Antarctic Society at the University of Incomprehensive Gibberish?


Are Social Scientists Interested in Treatment Effects?

Or in the Conditionality of Causal Mechanisms?

Or both?


From Causation to Causal Mechanism: The Case of Aspirin

Wikipedia on the History of Aspirin

1. Medicines made from willow and other salicylate-rich plants appear in clay tablets from ancient Sumer as well as the Ebers Papyrus from ancient Egypt.

2. Hippocrates referred to the use of salicylic tea to reduce fevers around 400 BC; such remedies were part of the pharmacopoeia of Western medicine in classical antiquity and the Middle Ages.

3. Willow bark extract became recognized for its specific effects on fever, pain and inflammation in the mid-eighteenth century.

4. By the nineteenth century pharmacists were experimenting with and prescribing a variety of chemicals related to salicylic acid, the active component of willow extract.

5. In 1853, chemist Charles Frédéric Gerhardt treated acetyl chloride with sodium salicylate to produce acetylsalicylic acid for the first time.

6. In 1897, scientists at the drug and dye firm Bayer began investigating acetylsalicylic acid as a less- irritating replacement for standard common salicylate medicines, and identified a new way to synthesize it. By 1899, Bayer had dubbed this drug Aspirin and was selling it around the world.


7. In the 1960s and 1970s, John Vane (Nobel Prize for Medicine) and others discovered the basic mechanism of aspirin's effects.

Aspirin causes several different effects in the body, mainly the reduction of inflammation, analgesia (relief of pain), the prevention of clotting, and the reduction of fever. Much of this is believed to be due to decreased production of prostaglandins and TXA2. Aspirin's ability to suppress the production of prostaglandins and thromboxanes is due to its irreversible inactivation of the cyclooxygenase (COX) enzyme. Cyclooxygenase is required for prostaglandin and thromboxane synthesis. Aspirin acts as an acetylating agent where an acetyl group is covalently attached to a serine residue in the active site of the COX enzyme. This makes aspirin different from other NSAIDs (such as diclofenac and ibuprofen), which are reversible inhibitors. However, other effects of aspirin, such as uncoupling oxidative phosphorylation in mitochondria, and the modulation of signaling through NF-κB, are also being investigated.

Do we want to know these mechanisms, or are we satisfied to learn that willow bark reduces pain in 83 percent of the cases in our sample of patients with mild headache?


Neumayer and Plümper 2017 on the Causality of Aspirin

The identification of a causal effect and an unbiased estimate of its strength differs from understanding the causal mechanism – the chain of events that ultimately brings about the effect. Consider the causal effect of Aspirin on headache. The headache does not disappear because a patient swallowed an Aspirin pill. It disappears because the pill has an ingredient, salicylic acid, stopping the transmission of the pain signal to the brain. Consequently, Aspirin does not eliminate the origin of the pain but prevents the brain from noticing the pain. The molecules of salicylic acid attach themselves to COX-2 enzymes, which blocks these enzymes from creating those chemical reactions that will eventually be perceived as ‘pain’. Clearly, identifying causation – the pain disappears after taking a pill – is distinct from understanding causal mechanisms.


Causal Effects: Some Possibilities

Probabilistic:

- causal effects of interest in the social science are probabilistic

Lags and Leads:

- effects can occur simultaneously with causes
- effects can occur delayed
- in social sciences, effects can occur before a ‘cause’ because human beings have expectations
- these lags and leads do not need to be homogeneous, in fact, they probably are not

Effect Function: effects can have very different functional forms

- the effect of higher education on income should often increase over time (it may even be negative for a few years)


- the effect of a pill of Aspirin on pain perception slowly increases and then slowly declines

- the effect of a bullet in the head is often but not always instantaneous, and then it is a constant (unless one believes in the concept of rebirth)

- effect function can be very complex

Thus, periodization matters.

Conditionality

- all effects in the social science are conditioned by other factors
- conditionalities can easily be heterogeneous


Two Perspectives on Causality

Neumayer and Plümper (2017), Table 1: Concepts of Causality and the Social World (traditional concept of causality vs. the data-generating process in the social sciences)

- causal effect: deterministic vs. probabilistic
- strength of causal effect: homogeneous and unconditional vs. heterogeneous and conditional
- dynamics of causality: determined by the causal mechanism vs. influenced by agents' autonomous decisions
- sequence of causality: cause precedes effect vs. distorted by rational expectations
- effect on non-treated: none (homogeneous) vs. possible, due to effects on expectations (placebo, nocebo) and to spill-overs


Summary

Understanding causal mechanisms is the main task of science.

In the social sciences, the causal mechanisms of interest are probabilistic, conditional, and heterogeneous.


Chapter 3: Statistical Inference and the Logic of Regression Analysis


The Goal of Statistical Inference

Wikipedia: “Inferential statistical analysis infers properties about a population: this includes testing hypotheses and deriving estimates. The population is assumed to be larger than the observed data set; in other words, the observed data is assumed to be sampled from a larger population.”

Sampling here means: a perfect random draw.


The Goal of Statistical Inference

In regression analysis, statistical inference also makes (implicit) assumptions about the homogeneity of cases/observations:

An empirical model is correctly specified if and only if the unexplained part of the variance of the dependent variable is stochastic.


The Error Process

Regression analysis allows for the analysis of non-deterministic processes. However, this strength requires that researchers make strong assumptions about the nature of the stochastic element.


The Error Process

Randomness has properties

- a mean of zero (positive and negative random effects cancel each other out)
- a variance larger than zero (this only means that stochastic errors do exist)
- a normal distribution
- asymptotically uncorrelated with structure (but correlations ≠ 0 in finite data)
- constant variation (homoscedasticity)


Regression Analysis

Regression Analysis is a tool that splits the variance of the dependent variable into an explained part and an unexplained part, where the unexplained part shall have the properties of stochastic error processes.

Regression analysis guarantees (by construction) that residuals (estimated errors)

- have a mean of 0
- are (approximately) normally distributed
- have a variance > 0
- are uncorrelated with the regressors

Regression analysis cannot (usually) guarantee that residuals

- are uncorrelated with the true model
- are homoscedastic
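A short numerical sketch of the 'by construction' properties (simulated data; the data-generating process and all numbers are invented for illustration): whatever the true model, OLS residuals have mean zero and are exactly uncorrelated with the included regressors, but not with omitted parts of the true model.

```python
import numpy as np

# Illustrative: OLS residuals are orthogonal to the included regressors by construction,
# even when the fitted model is wrong (here the true DGP is quadratic).
rng = np.random.default_rng(3)
n = 1_000

x = rng.normal(0, 1, n)
y = 1 + 2 * x + 1.5 * x**2 + rng.normal(0, 1, n)   # true model includes x^2

X = np.column_stack([np.ones(n), x])               # fitted model omits x^2
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta_hat

print(f"mean of residuals:            {resid.mean():.2e}")                 # ~0 by construction
print(f"corr(residuals, x):           {np.corrcoef(resid, x)[0, 1]:.2e}")  # ~0 by construction
print(f"corr(residuals, omitted x^2): {np.corrcoef(resid, x**2)[0, 1]:.2f}")  # clearly non-zero
```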


Model Specification

The assumptions about the error process are correct if and only if the model is correctly specified – that is if the empirical model matches the true data generating process.

If the model is not correctly specified, the unexplained part of the variance of the dependent variable cannot be assumed to have the properties assumed for the error process.


Quotes

Box and Draper: “All models are wrong, but some are useful”.

Martin Feldstein: “In practice all econometric specifications are necessarily false models”.

Luke Keele: “Statistical models are always simplifications, and even the most complicated model will be a pale imitation of reality”.

Peter Kennedy: “It is now generally acknowledged that econometric models are false and there is no hope, or pretense, that through them truth will be found.”

Neumayer and Plümper: “The major problem with regression analysis of observational data, broadly defined, is that in order to produce unbiased and generalizable estimates, the estimation model must be correctly specified, the estimator must be unbiased given the data at hand and the estimation sample must be randomly drawn from a well-specified population. Social scientists know this ideal is simply unachievable. Empirical models of real world phenomena are hardly ever – we would say: never – correctly specified.”


Fisher-Significance

Is it possible to generalize the regression results for the sample under observation to the universe of cases (the population)?

Can we draw conclusions for individuals, countries, or time points beyond the observations in our data set?

Fisher-significance pretends to answer exactly these questions.

Fisher-significance compares the coefficient and the standard errors and derives the level of significance from the t-distribution.

Common wisdom suggests that if a coefficient is significant (p-value < 0.10, 0.05, 0.01), then the variable is significant and the hypothesis confirmed. If the p-value is larger than the assumed threshold, the variable is insignificant and the hypothesis falsified.


The t-test

T-test for significance: testing the H0 (null hypothesis) that beta equals zero: H0: beta = 0; HA: beta ≠ 0. The test statistic follows a Student's t distribution under the null. t is the critical value of a t-distribution for a specific number of observations and a specific level of significance; the convention is a significance level of 5% (2.5% on each side of the t-distribution for a two-sided test). The tail probability associated with the observed test statistic is the p-value.

$$
t = \frac{\hat{\beta}}{SE(\hat{\beta})} \sim t_{n-2},
\qquad
SE(\hat{\beta}) = \sqrt{\frac{SSR/(n-2)}{\sum_{i=1}^{n}\left(x_i-\bar{x}\right)^2}}
$$


The (Misleading) Logic of p-value

Compute from the observations the observed value tobs of the test statistic T.

Calculate the p-value. This is the probability, under the null hypothesis, of sampling a test statistic at least as extreme as that which was observed.

Reject the null hypothesis, in favor of the alternative hypothesis, if and only if the p-value is less than the significance level (the selected probability threshold).


Type-I and Type-II Errors

                      hypothesis wrong     hypothesis correct
reject hypothesis     correct              Type I error
accept hypothesis     Type II error        correct


However, this interpretation of the significance test is wrong.

The significance test assumes a correct model specification. If a model is misspecified, both the coefficient and the standard error may be biased and thus the p-value is wrong, too.

The ‘significance’ threshold is arbitrary and can neither be used to verify nor to falsify a theory/hypothesis.

The p-value does not measure the probability that beta=0.

Instead, it computes the probability that random draws of errors from a standard normal distribution account for the covariation of x and y if the point estimate of beta equals the truth.
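A simulation sketch of what the p-value does measure (the setup is invented for illustration): when the null is true and the model is correctly specified, p-values are uniformly distributed, so about 5 percent of them fall below 0.05 by chance alone.

```python
import numpy as np
from scipy import stats

# Illustrative: distribution of p-values when the null hypothesis (beta = 0) is true.
rng = np.random.default_rng(5)
n, reps = 100, 5_000
p_values = []

for _ in range(reps):
    x = rng.normal(0, 1, n)
    y = 2 + rng.normal(0, 1, n)          # y does not depend on x: beta = 0 is true
    x_dev = x - x.mean()
    beta = (x_dev @ (y - y.mean())) / (x_dev @ x_dev)
    resid = y - (y.mean() - beta * x.mean()) - beta * x
    se = np.sqrt((resid @ resid) / (n - 2) / (x_dev @ x_dev))
    p_values.append(2 * stats.t.sf(abs(beta / se), df=n - 2))

p_values = np.array(p_values)
print(f"share of p-values below 0.05: {np.mean(p_values < 0.05):.3f}")  # ~0.05
```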


Jeremy Samuel Faust, Slate Magazine

“The popular understanding of the low p-value (the common cutoff being less than 0.05) is that the data attached to it must be true. That’s a false understanding. The p-value actually means that the data in question has less than a 5 percent chance of being the result of chance if the underlying experimental null hypothesis is true. If you don’t know what that means, don’t refer to p-values.”


ASA Statement on Significance

“The p-value was never intended to be a substitute for scientific reasoning,” said Ron Wasserstein, the ASA’s executive director. “Well-reasoned statistical arguments contain much more than the value of a single number and whether that number exceeds an arbitrary threshold. The ASA statement is intended to steer research into a ‘post p<0.05 era.’”

P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.

Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.

A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.

By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.


Statistical Inference After Fisher-Significance

Unknown territory…

Effect strength is the most likely contender.

And think about it: Which treatment effect do you prefer (assuming more is better as in a lottery win)?

a) Beta = 100 with a standard error of 40.
b) Beta = 120 with a standard error of 80.
c) Beta = 150 with a standard error of 120.
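A tiny sketch comparing the three options, treating each estimate as a normal distribution centred on the point estimate with the reported standard error (a simplifying assumption, not a claim from the text):

```python
from scipy import stats

# Illustrative comparison of the three estimates above, treating each as a
# normal distribution centred on the point estimate with the reported SE.
options = {"a": (100, 40), "b": (120, 80), "c": (150, 120)}

for name, (beta, se) in options.items():
    p_positive = 1 - stats.norm.cdf(0, loc=beta, scale=se)
    t_ratio = beta / se
    print(f"option {name}: beta={beta:3d}, se={se:3d}, t={t_ratio:.2f}, "
          f"P(effect > 0) = {p_positive:.3f}")
```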


Option a versus Option c

[Figure: two density curves plotted over the range 0 to 200, one centred at 100 with standard error 40 (option a) and one centred at 150 with standard error 120 (option c).]


Alternatives: Effect Strength

Effect strength is important information.

However, social scientists also need to know how certain (or uncertain) the strength of an effect is.

Regression analyses do not generate truth, but uncertain information.

This uncertainty needs to be analysed and taken into consideration when interpreting results (go back to Bayesian Philosophy of Science).


Bayesian Models

The Bayesian principle relies on Bayes' theorem, which states that the probability of B conditional on A is the joint probability of A and B divided by the probability of A. Bayesian econometricians assume that coefficients in the model have prior distributions.

Priors are subjective. Flat priors are uninformative. Substantive priors are contested.


Robustness

One way to deal with uncertainty is to conduct robustness tests (plausible variation of model specifications).

Neumayer and Plümper: “Robustness testing allows researchers to explore the stability of their estimates to alternative plausible model specifications. In other words: robustness tests analyze the variation in estimates resulting from model uncertainty. To be sure, model uncertainty is but one source potentially leading to wrong inferences. Other important inferential threats result from sampling variation and from lack of perfect fit between the assumptions an estimator makes and the true data-generating process. In our view, model uncertainty has the highest potential to invalidate inferences, which makes robustness testing the most important way in which empirical researchers can improve the validity of their inferences.”
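A minimal sketch of such a robustness exercise in code (simulated data and arbitrary specification choices, not a prescribed protocol): the coefficient of interest is re-estimated under plausible alternative specifications and the spread of estimates is inspected.

```python
import numpy as np

# Illustrative robustness exercise: how stable is the coefficient on x across
# plausible alternative model specifications? (Simulated data, invented controls.)
rng = np.random.default_rng(6)
n = 500

z1 = rng.normal(0, 1, n)                  # a confounder
z2 = rng.normal(0, 1, n)                  # an irrelevant control
x = 0.5 * z1 + rng.normal(0, 1, n)
y = 1 + 2 * x + 1 * z1 + rng.normal(0, 1, n)

def ols_coef_on_x(regressors):
    X = np.column_stack([np.ones(n)] + regressors)
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    return beta[1]                        # coefficient on x (second column)

specifications = {
    "baseline (x only)": [x],
    "x + z1":            [x, z1],
    "x + z2":            [x, z2],
    "x + z1 + z2":       [x, z1, z2],
}
for label, regs in specifications.items():
    print(f"{label:18s}: beta_x = {ols_coef_on_x(regs):.3f}")
```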


Summary

Social scientists strive for certainty about the causal inferences they make, but this aim will remain an illusion.

Instead of making inferential decisions based on an arbitrary threshold, social scientists should learn to cope with model uncertainty and develop strategies, including research designs, which reduce uncertainties.

Important:

- internal and external validity
- probabilistic theories and statistical inference
- counterfactuals
- model uncertainty
- case versus mean effect
- effect strengths


Chapter 4: Linear Models: OLS


Simple Estimation Models: OLS

What do regression analyses do?

- separate information from noise (or better: the explained part of y's variance and the unexplained part of y's variance)

- estimate a slope (and a standard error) for each regressor in the model, which explains deviations from ȳ by deviations from x̄.


Relationship / Correlation between Variables in a Multivariate Model

Regression models help investigate bivariate and multivariate relationships between variables, where we can hypothesize that one variable depends on another variable or a combination of other variables.

Relationships between variables in political science and economics are not exact (unless true by definition); they most often include a non-structural or random component, due to the probabilistic nature of theories and hypotheses in the social sciences, measurement error, etc.

Regression analysis makes it possible to find average relationships that may not be obvious from just "eye-balling" the data: it requires the explicit formulation of the structural and random components of a hypothesized relationship between variables.


Conditions for OLS

continuous dependent variable

Example: per capita income

Default functional form: linearity


What OLS does…

$$y_i = \alpha + \beta x_i + \varepsilon_i$$

$$\sum_{i=1}^{n}\left(Y_i - \hat{Y}_i\right)^2 = \sum_{i=1}^{n}\hat{\varepsilon}_i^{\,2} \;\rightarrow\; \min$$

$$\hat{\alpha} = \bar{y} - \hat{\beta}\bar{x}, \qquad \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i, \qquad \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

$$\hat{\beta} = \frac{\sum_{i=1}^{n}\left(y_i-\bar{y}\right)\left(x_i-\bar{x}\right)}{\sum_{i=1}^{n}\left(x_i-\bar{x}\right)^2} = \frac{\mathrm{Cov}(x,y)}{\mathrm{Var}(x)}, \qquad \hat{\beta} = (X'X)^{-1}X'y$$


Error process:
- mean 0
- min(variation)
- normal distribution

Structure: the coefficient is calculated by dividing the covariance (y,x) by the variance (x):

$$\hat{\beta}_{yx} = \frac{\sum_{i=1}^{n}\left(x_i-\bar{x}\right)\left(y_i-\bar{y}\right)}{\sum_{i=1}^{n}\left(x_i-\bar{x}\right)^2}$$
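A short sketch verifying this on simulated data (the true intercept and slope are arbitrary): the slope computed as Cov(x, y)/Var(x) coincides with the matrix formula (X'X)⁻¹X'y.

```python
import numpy as np

# Illustrative: the bivariate OLS slope equals Cov(x, y) / Var(x),
# and matches the general matrix formula (X'X)^{-1} X'y.
rng = np.random.default_rng(7)
n = 1_000
x = rng.normal(0, 1, n)
y = 3 + 1.0 * x + rng.normal(0, 1.5, n)   # arbitrary true intercept 3, slope 1

beta_cov = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)

X = np.column_stack([np.ones(n), x])
beta_matrix = np.linalg.solve(X.T @ X, X.T @ y)   # [intercept, slope]

print(f"slope via Cov/Var:           {beta_cov:.4f}")
print(f"slope via (X'X)^-1 X'y:      {beta_matrix[1]:.4f}")
print(f"intercept (y_bar - b*x_bar): {y.mean() - beta_cov * x.mean():.4f}")
```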


[Scatter plot of y against x.]


[Scatter plot of y against x with fitted line, y = a + b*x: intercept 2.973 ± 0.019, slope 0.985 ± 0.017, residual sum of squares 22.45, Pearson's r 0.963, R² 0.928, adjusted R² 0.927.]


[Scatter plot of y against x.]


[Scatter plot of y against x with fitted line, y = a + b*x: intercept 2.955 ± 0.032, slope 0.974 ± 0.029, residual sum of squares 62.37, Pearson's r 0.905, R² 0.819, adjusted R² 0.818.]


[Scatter plot of y against x.]


[Scatter plot of y against x with fitted line, y = a + b*x: intercept 2.910 ± 0.063, slope 0.949 ± 0.058, residual sum of squares 249.47, Pearson's r 0.719, R² 0.517, adjusted R² 0.515.]


Parameter Estimate with more than one Regressor

yi0   1 x i 1   2 x i 2   i

n n n n 2  xxi2 2  xxyy i 1  1 i    xxxx i 1  1 i 2  2  xxyy i 2  2  i   ˆ i1 i  1 i  1 i  1 1  n n n 2 22  xi1 x 1  x i 2  x 2   x i 1  x 1 x i 2  x 2  i1 i  1 i  1

ˆ  X'' X1 X y

The estimated betas have partial effect or ceteris paribus interpretations. Example: multiple regression coefficients tell us what effect an additional year of education has on personal income if we hold social background, intelligence, sex, number of children, marital status and all other factors constant that also influence personal income.

Of course, we assume that the model is correctly specified… (which may not be true).
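A sketch of the partial-effect interpretation on simulated data (coefficients and variable names invented for illustration): the multiple-regression coefficient on x1 equals the slope from regressing y on the part of x1 that is not explained by x2, the partialling-out logic behind the formula above.

```python
import numpy as np

# Illustrative: the coefficient on x1 in a multiple regression equals the slope
# obtained after "partialling out" x2 (Frisch-Waugh-Lovell logic).
rng = np.random.default_rng(8)
n = 2_000

x2 = rng.normal(0, 1, n)
x1 = 0.6 * x2 + rng.normal(0, 1, n)            # x1 and x2 are correlated
y = 1 + 2.0 * x1 - 1.0 * x2 + rng.normal(0, 1, n)

# Full multiple regression.
X = np.column_stack([np.ones(n), x1, x2])
beta_full = np.linalg.lstsq(X, y, rcond=None)[0]

# Partialling out: residualize x1 on x2, then regress y on that residual.
Z = np.column_stack([np.ones(n), x2])
x1_resid = x1 - Z @ np.linalg.lstsq(Z, x1, rcond=None)[0]
beta_partial = (x1_resid @ y) / (x1_resid @ x1_resid)

print(f"coefficient on x1, multiple regression: {beta_full[1]:.4f}")
print(f"coefficient on x1, partialled out:      {beta_partial:.4f}")
```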

© Thomas Plümper 2017 - 2018 113

Standard Error and R² in Multiple Regression

$$\mathrm{Var}(\hat{\beta}_1) = \frac{\hat{\sigma}^2}{SST_1\,(1 - R_1^2)}, \qquad \hat{\sigma}^2 = \frac{1}{n-k-1}\sum_{i=1}^{n}\hat{\varepsilon}_i^2$$

$$SST_1 = \sum_{i=1}^{n}(x_{i1} - \bar{x}_1)^2, \qquad R_1^2 = \frac{SSE}{SST} = \frac{\sum_{i=1}^{n}(\hat{x}_{i1} - \bar{x}_1)^2}{\sum_{i=1}^{n}(x_{i1} - \bar{x}_1)^2} \;\text{ for the regression of } x_{i1} \text{ on } x_{i2}$$

$$SE(\hat{\beta}_1) = SD(\hat{\beta}_1) = \frac{\hat{\sigma}}{\sqrt{SST_1\,(1 - R_1^2)}}$$

© Thomas Plümper 2017 - 2018 114

Model Misspecification

In econometric theory, errors are assumed to be a random draw from a normal distribution with mean=0 and some positive variance.

One would therefore expect that the residuals ('estimated' errors), which by design have a mean of 0, have no structure. Hence, structure in the residuals indicates model misspecification.

But note: any structure in the residuals has a positive probability of being caused by chance.
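One common way to eyeball such structure, sketched here in Stata with the bundled auto data as a stand-in example:

* residual-versus-fitted plot after OLS
sysuse auto, clear
quietly regress price mpg weight
rvfplot, yline(0)        // visible structure in the residuals hints at misspecification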

© Thomas Plümper 2017 - 2018 115

Gauss-Markov Conditions in Econometrics

Linearity: The dependent variable is assumed to be a linear function of the variables specified in the model. (of course: for all continuous variables, there exists a transformation that satisfies this assumption)

Strict Exogeneity: For all observations, the expectation—conditional on the regressors—of the error term is zero. (this assumes that the model is correctly specified!)

Full Rank: No perfect correlation. As the correlation between x and z increases, the efficiency of the estimate declines. If the correlation=1 or -1, no coefficient can be computed.

Spherical Errors: The outer product of the error vector must be spherical. (this is the consequence if the model is correctly specified)

© Thomas Plümper 2017 - 2018 116

Gauss-Markov 1: Homoscedasticity

- constant error variance
- deviation: heteroscedasticity

© Thomas Plümper 2017 - 2018 117

Gauss-Markov 2: Uncorrelated errors (in any dimensions)

Problem: correlated errors

- correlated with a regressor
- correlated over time (serial correlation)
- correlated across space (spatial correlation)
- correlated with an unobserved variable (undetectable?)

© Thomas Plümper 2017 - 2018 118

Violations of Gauss-Markov Conditions tell us that there probably is a problem, but they do not identify the problem.

(Econometric theory demonstrates that tests identify known misspecifications).

Causes of Gauss-Markov Violations

- sample selection
- omitted variable(s)
- wrong regressor(s)
- misspecified functional form
- conditionality and unit heterogeneity
- structural change
- misspecified dynamics
- misspecified spatial dependence

© Thomas Plümper 2017 - 2018 119

Other Stuff

Total Sum of Squares (SST):
$$SST = \sum_{i=1}^{n}(y_i - \bar{y})^2$$

Explained (Estimation) Sum of Squares (SSE):
$$SSE = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2$$

Residual Sum of Squares or Sum of Squares Residuals (SSR):
$$SSR = \sum_{i=1}^{n}\hat{\varepsilon}_i^2 = \sum_{i=1}^{n}\left(y_i - \hat{\alpha} - \hat{\beta}x_i\right)^2$$

Goodness of Fit (R2):

© Thomas Plümper 2017 - 2018 120

$$R^2 = \frac{\sum_{i=1}^{n}(\hat{Y}_i - \bar{Y})^2}{\sum_{i=1}^{n}(Y_i - \bar{Y})^2} = \frac{SSE}{SST} = 1 - \frac{SSR}{SST}$$

© Thomas Plümper 2017 - 2018 121

Properties of R²

The R2 is an estimate.

0 ≤ R² ≤ 1; often the R² is multiplied by 100 to give the percentage of the sample variation in y that is explained by x. If the data points all lie on the same straight line, OLS provides a perfect fit to the data; in this case the R² equals 1 or 100%. A value of R² that is nearly equal to zero indicates a poor fit of the OLS line: very little of the variation in y is captured by the variation in the ŷ (which all lie on the regression line). R² = (corr(y, ŷ))². The R² follows a complex distribution which depends on the explanatory variables.

Adding further explanatory variables increases the R² (it can never decrease). We thus usually use the adjusted R², which penalizes for the number of variables. Adding a regressor with a t-value > 1 increases the adjusted R².

The R² can have a reasonable size in spurious regressions if the regressors are non-stationary. Linear transformations of the regression model change the value of the R² coefficient. The R² is not bounded between 0 and 1 in models without intercept.
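A small Stata check of the R² = (corr(y, ŷ))² identity, using the bundled auto data as a stand-in example:

sysuse auto, clear
quietly regress price mpg weight
display "R2 from regress  = " e(r2)
predict yhat, xb
quietly correlate price yhat
display "corr(y,yhat)^2   = " r(rho)^2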

© Thomas Plümper 2017 - 2018 122

Finite Sample Properties

Unbiasedness: the estimated coefficient is on average identical to the true coefficient.

Efficiency: the estimated coefficient on average deviates relatively little from the true coefficient.

Root Mean Squared Error: the average expected deviation from the true coefficient, combining inefficiency and bias.

Infinite Sample Properties

Consistency: asymptotic unbiasedness (meaning: the estimator is biased for all real-world problems)

© Thomas Plümper 2017 - 2018 123

Bias

[Figure: sampling density of the estimated coefficient b1]

© Thomas Plümper 2017 - 2018 124

Efficiency

[Figure: sampling densities of the estimated coefficients b1 and b2]

© Thomas Plümper 2017 - 2018 125

Trade-Off Bias Efficiency

With real world data researchers sometimes have only the choice between a biased but efficient and an unbiased but inefficient estimator. Then another criterion can be used to choose between the two estimators, the root mean squared error (RMSE). The RMSE is a combination of bias and efficiency and gives us a measure of overall performance of an estimator. However, relative RMSEs really depend on the assumed DGPs.
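A sketch of how bias and RMSE can be inspected by simulation in Stata; the data-generating process (y = 1 + 1·x + e), the sample size and the number of replications below are arbitrary assumptions:

capture program drop mcdgp
program define mcdgp, rclass
    clear
    set obs 100
    generate x = rnormal()
    generate y = 1 + 1*x + rnormal()
    regress y x
    return scalar b = _b[x]
end
simulate b = r(b), reps(500) nodots: mcdgp
summarize b
display "bias = " r(mean) - 1
display "RMSE = " sqrt((r(mean) - 1)^2 + r(Var))    // approximately: bias and variance combined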

© Thomas Plümper 2017 - 2018 126

Four Lessons to be Learned

1. Consistency is irrelevant for selecting an estimator.
2. Sometimes an unbiased estimator performs worse than a biased, but more efficient, estimator.
3. Unbiasedness is important but overrated.
4. Efficiency is important and underrated.

Choosing the best estimator for the data at hand is a question to which econometric theory has surprisingly few answers. Econometric patches are rarely the solution, even though econometricians routinely suggest them.

© Thomas Plümper 2017 - 2018 127

Chapter 5: Minor Complications and Extensions

© Thomas Plümper 2017 - 2018 128

Minor Complications and Extensions

OLS is a linear estimator.

© Thomas Plümper 2017 - 2018 129

Linearity

OLS is a linear estimator: y is a linear function of x.

This assumption does not always make sense…

© Thomas Plümper 2017 - 2018 130

© Thomas Plümper 2017 - 2018 131

Observe that democracy has no significant linear effect on government spending, but the estimates become significant once we introduce the squared term.

[Figure: government consumption in % of GDP against degree of democracy (0–10)]

© Thomas Plümper 2017 - 2018 132

Issues

One or more explanatory variables have a non-linear effect on the dependent variable: estimating a linear model would lead to wrong and/or insignificant results.

© Thomas Plümper 2017 - 2018 133

Test-based Approaches:

The Ramsey RESET F-test gives a first indication for the whole model. But careful: the Ramsey test responds to many model misspecifications.

Tests of whether parameters become significant after transformation.

In Stata, one can use acprplot to verify the linearity assumption against an explanatory variable – though this is just “eye-balling”.
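A sketch of these test-based checks in Stata, using the bundled auto data as a stand-in example (estat ovtest runs the Ramsey RESET test after regress):

sysuse auto, clear
quietly regress price mpg weight
estat ovtest               // H0: no omitted higher-order terms of the fitted values
acprplot mpg, lowess       // augmented component-plus-residual plot for one regressor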

© Thomas Plümper 2017 - 2018 134

Transformations

Log

- Log-log models; interpretation: elasticities
- various exponential transformations (the transformed variable should be strictly positive…)
- roots

Box-Cox transformation (approximates a Normal distribution)

and numerous other, more complex transformations
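A minimal log-log sketch in Stata, using the bundled auto data as a stand-in example; the slope reads as an elasticity:

sysuse auto, clear
generate lnprice  = ln(price)
generate lnweight = ln(weight)
regress lnprice lnweight    // percent change in price associated with a one-percent change in weight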

© Thomas Plümper 2017 - 2018 135

Data-Mining 1: Polynomial Models of Non-Linear Relations

There is no reason why scholars are restricted to estimating full sets of polynomial models that include all polynomial terms up to the highest degree. For example, the polynomial regression model
$$y = a + b_1 x + b_2 x^2 + b_3 x^9 \qquad (1)$$
allows the same number of inflection and turning points as
$$y = a + b_1 x + b_2 x^2 + b_3 x^3, \qquad (2)$$
but more skew of the functional form. It typically remains unclear which of these two similar models approximates the true functional form better and, more importantly, which one offers the best robustness test model for the more parsimonious baseline model.
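A sketch of how such (possibly incomplete) polynomial sets can be estimated with Stata's factor-variable notation; the simulated data-generating process is a hypothetical illustration:

clear
set obs 1000
set seed 42
generate x = runiform()*10
generate y = 1 + 0.8*x - 0.05*x^2 + rnormal()
regress y c.x##c.x                     // x and x^2
regress y c.x c.x#c.x c.x#c.x#c.x      // adds x^3; higher or skipped powers are specified the same way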

© Thomas Plümper 2017 - 2018 136

Data-Mining 2: Semi-Parametric Models

Semi-parametric models transform continuous variables into a set of dummy variables.

The estimated parameters of these dummy variables, included together in an estimation model, allow an entirely flexible functional form. Semi-parametric models can retrieve almost any functional form if the number of categories is sufficiently large.

Even structural breaks in the functional form are possible. This flexibility comes with costs.

Most importantly, flexibility runs the risk of over-fitting the data.

In smaller datasets, a trade-off occurs between the number of categories and the number of observations in each category. The optimal number of categories chosen depends on three factors: the number of observations, the distribution of the variable, and the degree of functional form flexibility the researcher wishes to achieve. As always, flexibility conflicts with parsimony.
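A sketch of the dummy-set approach in Stata, with hypothetical simulated variables standing in for the per capita income and CO2 emissions discussed below; the number of bins is an arbitrary choice:

clear
set obs 2000
set seed 5
generate gdppc = runiform()*80                       // hypothetical per capita income, $1,000
generate co2pc = 0.8*gdppc - 0.006*gdppc^2 + rnormal(0, 3)
egen inc_bin = cut(gdppc), group(10) icodes          // 10 equally sized income categories
regress co2pc i.inc_bin                              // one coefficient per bin; no functional form imposed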

© Thomas Plümper 2017 - 2018 137

Example: Per Capita Income and CO2 Emissions – Environmental Kuznets Curve

The theory of the ‘Environmental Kuznets Curve’ postulates an inverse-U shaped functional form between per capita emissions and per capita income (Grossman and Krueger 1995). It predicts that emissions increase proportionally with economic activities but that demand for environmental quality increases as societies become richer. Governments, thus, increasingly internalize environmental damage, which lets emissions per unit of economic activity decline. However, since CO2 emissions neither damage the environment immediately nor locally, the validity of the environmental Kuznets curve for CO2 emissions has been questioned. Baseline model m1 regresses CO2 per capita emissions on per capita income in thousands of 2005 constant Dollars in a global sample over the period 1960 to 2012.

© Thomas Plümper 2017 - 2018 138

Prediction: inverted U-shaped relation between per capita income and emissions per capita. Mechanism: the logic of development.

- typically, the theory also refers to environmental regulation…

- the empirical literature remains inconclusive, particularly on CO2 emissions
- however: it is easy to find empirical support for the prediction at first glance

© Thomas Plümper 2017 - 2018 139

Bivariate Relations

[Figure: scatter plot of CO2 emissions per capita against per capita income (in $1,000)]

© Thomas Plümper 2017 - 2018 140

Table 12: Higher-Degree Polynomials Tests

               m1           m2           m3            m4
constant       0.990**      0.339        -0.320        -0.569
               (0.403)      (0.368)      (0.433)       (0.449)
x              0.343***     0.548***     0.842***      1.011***
               (0.0497)     (0.0626)     (0.119)       (0.165)
x2                          -0.00452**   -0.0188***    -0.0330**
                            (0.00180)    (0.00573)     (0.0141)
x3                                       0.000150***   0.000504
                                         (5.55e-05)    (0.000375)
x4                                                     -2.56e-06
                                                       (2.78e-06)
RMSE           4.535        4.381        4.295         4.284
R2             0.472        0.508        0.527         0.530

Note: Dependent variable is CO2 per capita emissions. OLS estimation. x is per capita income in thousand USD. Year fixed effects included. Standard errors clustered on countries in parentheses. N = 7135. * statistically significant at .1, ** at .05, *** at .01 level.

© Thomas Plümper 2017 - 2018 141

Figure 12: Higher Degree Polynomial Test 1

[Figure: fitted CO2 emissions per capita against per capita income ($1,000)]

Note: grey-shaded area represents confidence interval of baseline model

© Thomas Plümper 2017 - 2018 142

Figure 13: Higher Degree Polynomial Test 2

[Figure: fitted CO2 emissions per capita against per capita income ($1,000)]

Note: grey-shaded area represents confidence interval of baseline model

© Thomas Plümper 2017 - 2018 143

Semi-Parametric Estimates

[Figure: conditional effect on CO2 against gdppc]

© Thomas Plümper 2017 - 2018 144

Context Conditionality

All causal effects in the social sciences are conditional.

That does not mean that social scientists model conditionality.

© Thomas Plümper 2017 - 2018 145

Consider the Effect of Higher Education on Income …

What are the conditions?

© Thomas Plümper 2017 - 2018 146

Conditions for the Effect of Higher Education on Income:

- individual traits
- country
- time
- university
- degree scheme
- intelligence
- diligence, …

© Thomas Plümper 2017 - 2018 147

Modelling Conditionalities

Two explanatory variables may not only have direct effects on the dependent variable but also a combined effect.

This is usually modelled as a multiplicative interaction effect:
$$y_i = \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{1i} x_{2i} + \varepsilon_i$$

Interpretation: combined effect: b1*SD(x1)+b2*SD(x2)+b3*SD(x1*x2)

Example: monetary policy of currency union has a direct effect on monetary policy in outsider countries but this effect is increased by import shares.
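A sketch of such a multiplicative interaction in Stata with a hypothetical simulated data-generating process; margins traces how the effect of x1 changes with x2:

clear
set obs 1000
set seed 7
generate x1 = rnormal()
generate x2 = rnormal()
generate y  = 1 + 0.5*x1 + 0.5*x2 + 0.8*x1*x2 + rnormal()
regress y c.x1##c.x2                    // constituent terms plus the product term
margins, dydx(x1) at(x2 = (-2(1)2))     // conditional effect of x1 at selected values of x2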

© Thomas Plümper 2017 - 2018 148

Social Science versus Health Language

The notion of "interaction" is closely related to that of "moderation" that is common in social and health science research: the interaction between an explanatory variable and an environmental variable suggests that the effect of the explanatory variable has been moderated or modified by the environmental variable.

© Thomas Plümper 2017 - 2018 149

Rules of Thumb

Brambor, Clark and Golder (2006) recommend always including the constituent terms, even if theory predicts that x cannot have an effect on y if z equals zero (and vice versa), because the theory may be wrong.

Kam and Franzese (2009) suggest excluding the constituent terms if both theory predicts the unconditional effects of the variables to be zero and their estimated coefficients are close to zero with small standard errors – but not if they are merely statistically insignificant due to large standard errors. Researchers may leave out the constituent terms if they believe that the ensuing inefficiency of including both terms disturbs inferences more than the potential bias resulting from excluding them (Kam and Franzese 2009: 102).

Neumayer and Plümper argue that these uncertainties should be dealt with by two variants of the conditionality test rather than by rules of thumb. In the first variant, if the constituent terms are excluded in the baseline model, the robustness test model with the constituent terms included should find the baseline model to be robust if the assumption of no unconditional effects holds true. In the second variant, if the constituent terms are included in the baseline model, the robustness test model excludes these.

© Thomas Plümper 2017 - 2018 150

Example 1: Migrant Telephone Calls

Perkins and Neumayer (2013) analyze the effect of foreign migrants residing in a country on bilateral international telephone traffic (measured in duration of telephone calls) between the host and the resident country in a global sample of undirected country dyads.

Perkins and Neumayer (2013) estimate a gravity-type log-log model.

A naïve baseline model would estimate the migrant effect as being unconditional. The estimated elasticity equals 0.3. A ten percent increase in the combined migrant stock increases bilateral telephony by three percent.

Assessing whether the assumption of an unconditional effect appears to be robust, we include an interaction effect between the dyadic sum of migrants and the dyadic sum of per capita incomes. In other words, we test whether the migrant effect is conditioned by the combined per capita income in both countries since migrants residing in relatively richer countries might face lower opportunity costs for calling back home.

© Thomas Plümper 2017 - 2018 151

Figure 14: Conditionality Test

[Figure: estimated elasticity of migrant stock against log of dyadic sum of per capita income]

Note: grey-shaded area represents confidence interval of baseline model

© Thomas Plümper 2017 - 2018 152

Example 2: The Effect of Centralization on Unemployment is Conditioned by the Central Bank

© Thomas Plümper 2017 - 2018 153

Example 3: Education, Ideology and Climate Change Beliefs

By IChiloe - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=48240215

© Thomas Plümper 2017 - 2018 154

Reporting Interactions

© Thomas Plümper 2017 - 2018 155

© Thomas Plümper 2017 - 2018 156

Reporting Interactions

© Thomas Plümper 2017 - 2018 157

Are Interaction Effects Real?

Interaction effects research relies on strong assumptions:

Symmetry and linearity in effect strengths.

E=mc2 is deemed impossible by assumption.

© Thomas Plümper 2017 - 2018 158

Non-Linearity

Figure 15: Non-linear Conditionality Test

[Figure: estimated elasticity of migrant stock against log of dyadic sum of per capita income, non-linear specification]

Note: grey-shaded area represents confidence interval of baseline model

© Thomas Plümper 2017 - 2018 159

Semi-Parametric Approach to Interactions

- allows data-mining of functional forms
- transformation of 2 (or more) continuous variables into combination dummies

Example (combination dummies numbered 1–15 for values of x1 in rows and x2 in columns):

        0.2   0.4   0.6
0.2      1     2     3
0.4      4     5     6
0.6      7     8     9
0.8     10    11    12
1.0     13    14    15

Advantage: tests the symmetry and linearity assumptions of the interaction; full flexibility. However: boundaries are arbitrary, and each group should have a sufficient number of observations.

© Thomas Plümper 2017 - 2018 160

Non-Linearity versus Conditionality: An Unsolvable Issue?

A generated example demonstrating the close link between functional form and conditional effect. Assume the true data-generating process to be
$$y = x + x^2 + z + z^2 + \varepsilon,$$
where ε is a normally distributed error term. Estimating a model that does not account for the non-linear functional form of x and z and instead includes an interaction term between the two variables results in spurious evidence for a conditional relationship between x and z (see table A1, model 1). In contrast, if we allow for non-linear effects of x and z, then the coefficient of the interaction term becomes indistinguishable from zero (model 2).
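A sketch reproducing the logic of this generated example in Stata; seed and sample size are arbitrary:

clear
set obs 1000
set seed 11
generate x = rnormal()
generate z = rnormal()
generate y = x + x^2 + z + z^2 + rnormal()
regress y c.x##c.z                              // squares omitted: the interaction appears significant
regress y c.x c.x#c.x c.z c.z#c.z c.x#c.z       // squares included: the interaction term collapses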

© Thomas Plümper 2017 - 2018 161

Table A1: Functional Form Test: Quadratic Relationship

               model 1      model 2
x              0.923***     0.984***
               (0.242)      (0.0351)
x2                          1.030***
                            (0.0281)
z              1.084***     0.993***
               (0.192)      (0.0172)
z2                          1.008***
                            (0.00596)
x·z            1.924***     -0.0138
               (0.126)      (0.0210)
constant       3.024***     -0.0399
               (0.203)      (0.0447)
N              1,000        1,000
R2             0.539        0.978

© Thomas Plümper 2017 - 2018 162

The opposite also holds: if the true data-generating process includes an interaction term not included in the model, researchers may find evidence for a non-linear functional form where the data-generating process is linear. In other words, misspecified or ignored conditionality can lead to wrong inferences about linearity. To demonstrate this, we now specify the true relationship of x and z on y to be
$$y = x + x \cdot z + z + \varepsilon.$$
Model 3 reported in table A2 erroneously suggests a non-linear relationship of x on y and z on y, respectively, due to the failure of accounting for the conditional relationship of x and z. When the conditional relationship is adequately accounted for in model 4, the ‘evidence’ for a non-linear effect of x on y and z on y disappears.

© Thomas Plümper 2017 - 2018 163

Table A2: Functional Form Test: Bi-Linear Conditional Relationship.

               model 3      model 4
x              0.912***     0.984***
               (0.0981)     (0.0351)
x2             0.846***     0.0298
               (0.0645)     (0.0281)
z              0.999***     0.993***
               (0.0539)     (0.0172)
z2             0.186***     0.00758
               (0.0167)     (0.00596)
x·z                         0.986***
                            (0.0210)
constant       -0.596***    -0.0399
               (0.0699)     (0.0447)
N              1,000        1,000
R2             0.772        0.920

© Thomas Plümper 2017 - 2018 164

Outliers

Outliers are observations that a model does not fit well.

Reasons:

- observation does not belong to the population
- systematic or random measurement error
- causal heterogeneity

Outliers can be ‘eyeballed’, but tests also exist.

© Thomas Plümper 2017 - 2018 165

Terms

Let’s begin our discussion of robust regression with some terms in linear regression.

Residual: The difference between the predicted value (based on the regression equation) and the actual, observed value.

Outlier: In linear regression, an outlier is an observation with a large residual. In other words, it is an observation whose dependent-variable value is unusual given its values on the predictor variables. An outlier may indicate a sample peculiarity or may indicate a data entry error or other problem.

Leverage: An observation with an extreme value on a predictor variable is a point with high leverage. Leverage is a measure of how far an independent variable deviates from its mean. High leverage points can have a great amount of effect on the estimate of regression coefficients.

Influence: An observation is said to be influential if removing the observation substantially changes the estimate of the regression coefficients. Influence can be thought of as the product of leverage and outlierness.

Cook’s distance (or Cook’s D): A measure that combines the information of leverage and residual of the observation.

© Thomas Plümper 2017 - 2018 166

Stata rreg

Stata’s rreg command implements a version of robust regression. It first runs the OLS regression, gets the Cook’s D for each observation, and then drops any observation with a Cook’s distance greater than 1. Then an iteration process begins in which weights are calculated based on absolute residuals. The iterating stops when the maximum change between the weights from one iteration to the next is below tolerance. Two types of weights are used. In Huber weighting, observations with small residuals get a weight of 1; the larger the residual, the smaller the weight. With biweighting, all cases with a non-zero residual get down-weighted at least a little. The two different kinds of weight are used because Huber weights can have difficulties with severe outliers, and biweights can have difficulties converging or may yield multiple solutions. Using the Huber weights first helps to minimize problems with the biweights. You can see the iteration history of both types of weights at the top of the robust regression output. Using the Stata defaults, robust regression is about 95% as efficient as OLS (Hamilton, 1991). In short, the most influential points are dropped, and then cases with large absolute residuals are down-weighted.

Problem: This procedure potentially ‘solves’ model misspecification by changing the sample and thus largely reducing the generalizability of results to the population.

© Thomas Plümper 2017 - 2018 167

Example

For our data analysis below, we will use the crime data set. This dataset appears in Statistical Methods for Social Sciences, Third Edition by Alan Agresti and Barbara Finlay (Prentice Hall, 1997). The variables are state id (sid), state name (state), violent crimes per 100,000 people (crime), murders per 1,000,000 (murder), the percent of the population living in metropolitan areas (pctmetro), the percent of the population that is white (pctwhite), percent of population with a high school education or above (pcths), percent of population living under poverty line (poverty), and percent of population that are single parents (single). It has 51 observations. We are going to use poverty and single to predict crime.

© Thomas Plümper 2017 - 2018 168

regress crime poverty single

      Source |       SS       df       MS              Number of obs =      51
-------------+------------------------------           F(  2,    48) =   57.96
       Model |  6879872.44     2  3439936.22           Prob > F      =  0.0000
    Residual |   2848602.3    48  59345.8813           R-squared     =  0.7072
-------------+------------------------------           Adj R-squared =  0.6950
       Total |  9728474.75    50  194569.495           Root MSE      =  243.61

------------------------------------------------------------------------------
       crime |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     poverty |   6.787359   8.988529     0.76   0.454    -11.28529    24.86001
      single |   166.3727   19.42291     8.57   0.000     127.3203     205.425
       _cons |  -1368.189   187.2052    -7.31   0.000     -1744.59   -991.7874
------------------------------------------------------------------------------

lvr2plot, mlabel(state)
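Continuing this example, a sketch of the robust-regression counterpart; the weight-variable name w is an arbitrary choice:

rreg crime poverty single, genwt(w)    // Huber/biweight iterations; genwt() stores the final weights
sort w
list sid state w in 1/5                // the most strongly down-weighted states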

© Thomas Plümper 2017 - 2018 169

Conclusion

It is easily possible to estimate complex functional relations in the world of linear models. This merely requires a transformation of the variables.

Yet, flexible functional forms are not likely to identify the true functional relation. There always is an argument for simplicity…

It is also possible to estimate conditional relations. However, most social scientists rely on the default assumptions of linearity and symmetry.

It is not always possible (without theoretical knowledge) to distinguish complex functional forms from conditionalities.

© Thomas Plümper 2017 - 2018 170

Chapter 6: More Complications: Selection, Truncation, Censoring

Truncated variables: only observations are used that are larger or smaller than a certain value. Example: analysis of the determinants of poverty – only poor people are sampled.

Censored variables: values above or below a certain threshold cannot be observed. Example: income categories.

Selection: Cases are selected based on a criterion that is correlated with the dependent variable: e.g. eastern European countries decide whether to apply for EU membership based on the actual probability of accession.

© Thomas Plümper 2017 - 2018 171

Truncation versus Censoring

Consider the following Distribution: 1 1.25 2 4 5

Censoring: some observations will be censored, meaning that we only know that they are below (or above) some bound. This can for instance occur if we measure the concentration of a chemical in a water sample. If the concentration is too low, the laboratory equipment cannot detect the presence of the chemical. It may still be present though, so we only know that the concentration is below the laboratory's detection limit.

If the detection limit is 1.5, so that observations that fall below this limit are censored, our example data set would become: <1.5 <1.5 2 4 5

Truncation: the process generating the data is such that it only is possible to observe outcomes above (or below) the truncation limit. This can for instance occur if measurements are taken using a detector which only is activated if the signals it detects are above a certain limit. There may be lots of weak incoming signals, but we can never tell using this detector.

If the truncation limit is 1.5, our example data set would become 2 4 5

© Thomas Plümper 2017 - 2018 172

Truncation

In statistics, truncation results in values that are limited above or below, resulting in a truncated sample.

© Thomas Plümper 2017 - 2018 173

[Figure: y against x, showing the true regression line, the truncation threshold, and the regression line for the truncated sample]

© Thomas Plümper 2017 - 2018 174

Estimation

Latent variable model: $y = \alpha + x\beta + \varepsilon, \quad \varepsilon \mid x \sim \mathrm{Normal}(0, \sigma^2)$

For truncation from above, we need y given x and c:

2 f y|, xi  g y| x , c , y c  i i 2 i F cii|, x  f(.) – normal denisty (PDF) with mean  x i and variance 2 

Density of y given x, divided by the probability that y = c given x – “renormalisation”

© Thomas Plümper 2017 - 2018 175

Examples

Example 1. A study of students in a special GATE (gifted and talented education) program wishes to model achievement as a function of language skills and the type of program in which the student is currently enrolled. A major concern is that students are required to have a minimum achievement score of 40 to enter the special program. Thus, the sample is truncated at an achievement score of 40.

Example 2. A researcher has data for a sample of Americans whose income is above the poverty line. Hence, the lower part of the distribution of income is truncated. If the researcher had a sample of Americans whose income was at or below the poverty line, then the upper part of the income distribution would be truncated. In other words, truncation is a result of sampling only part of the distribution of the outcome variable.

© Thomas Plümper 2017 - 2018 176

Example 1:

Description of the data

Let’s pursue Example 1 from above. We have a hypothetical data file, truncreg.dta, with 178 observations. The outcome variable is called achiv, and the language test score variable is called langscore. The variable prog is a categorical predictor variable with three levels indicating the type of program in which the students were enrolled.

Truncated regression

Below we use the truncreg command to estimate a truncated regression model. The i. before prog indicates that it is a factor variable (i.e., a categorical variable), and that it should be included in the model as a series of indicator variables. The ll() option in the truncreg command indicates the value at which the left truncation takes place. There is also a ul() option to indicate the value of the right truncation, which was not needed in this example.

truncreg achiv langscore i.prog, ll(40) (note: 0 obs. truncated)

Fitting full model:

Iteration 0: log likelihood = -598.11669 Iteration 1: log likelihood = -591.68374 Iteration 2: log likelihood = -591.31208 Iteration 3: log likelihood = -591.30981

© Thomas Plümper 2017 - 2018 177

Iteration 4: log likelihood = -591.30981

Truncated regression
Limit:   lower =         40                     Number of obs =        178
         upper =       +inf                     Wald chi2(3)  =      54.76
Log likelihood = -591.30981                     Prob > chi2   =     0.0000

------------------------------------------------------------------------------
       achiv |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   langscore |   .7125775   .1144719     6.22   0.000     .4882168    .9369383
             |
        prog |
          2  |   4.065219   2.054938     1.98   0.048     .0376131    8.092824
          3  |  -1.135863   2.669961    -0.43   0.671    -6.368891    4.097165
             |
       _cons |   11.30152   6.772731     1.67   0.095     -1.97279    24.57583
-------------+----------------------------------------------------------------
      /sigma |   8.755315    .666803    13.13   0.000     7.448405    10.06222
------------------------------------------------------------------------------

The output begins with a note indicating that zero observations were truncated. This is because our sample contained no data with values less than 40 for achievement. The note is followed by the iteration log, which gives the values of the log likelihoods starting with a model that has no predictors. The last value in the log is the final value of the log likelihood and is repeated below.

The header information is provided next. On the left-hand side are the lower and upper limits of the truncation and a repeat of the final log likelihood. On the right-hand side, the number of observations used (178) is given, along with the Wald chi-square with three degrees of freedom. The Wald chi-square is what

© Thomas Plümper 2017 - 2018 178 you would get if you used the test command, after estimating the model, to test that all the coefficients are zero. Finally, there is a p-value for the chi-square test. As a whole, this model is statistically significant. In the table of coefficients, we have the truncated regression coefficients, the standard error of the coefficients, the Wald z-tests (coefficient/se), and the p-value associated with each z-test. By default, we also get a 95% confidence interval for the coefficients. With the level() option you can request a different confidence interval.

The ancillary statistic /sigma is equivalent to the standard error of estimate in OLS regression. The value of 8.76 can be compared to the standard deviation of achievement, which was 8.96. This shows a modest reduction. The output also contains an estimate of the standard error of /sigma, as well as a 95% confidence interval for this value.

© Thomas Plümper 2017 - 2018 179

Censored Data

In statistics, engineering, economics, and medical research, censoring is a condition in which the value of a measurement or observation is only partially known.

For example, suppose a study is conducted to measure the impact of a drug on mortality rate. In such a study, it may be known that an individual's age at death is at least 75 years (but may be more). Such a situation could occur if the individual withdrew from the study at age 75, or if the individual is currently alive at the age of 75.

Left censoring – a data point is below a certain value but it is unknown by how much. Interval censoring – a data point is somewhere on an interval between two values. Right censoring – a data point is above a certain value but it is unknown by how much.

© Thomas Plümper 2017 - 2018 180

Tobit Estimator

Why not OLS?

1. prediction of negative values
2. RHS variables do not exert a constant marginal effect
3. the expected value of y conditional on X (y|x) cannot be linear in X if a non-trivial number of observations of y are zero but X varies considerably. Since the number of zeros is inflated, y cannot be conditionally normally distributed.

Tobit model: based on latent variable model, y* is iid.

$$y^* = \alpha + x\beta + \varepsilon, \qquad \varepsilon \mid x \sim \mathrm{Normal}(0, \sigma^2)$$

The observed variable is equal to the latent variable when the latent variable is larger than zero, but zero when the latent variable is negative. The Tobit model combines a linear model (OLS) for the positive values and a probit model for the zeros.

© Thomas Plümper 2017 - 2018 181

Example

Dahlberg, M. and Johansson, E., 2002. On the vote-purchasing behavior of incumbent governments. American political Science review, 96(01), pp.27-40.

“A couple of months before the Swedish election in 1998, the incumbent government distributed 2.3 billion SEK to 42 out of 115 applying municipalities. This was the first wave of a four-year long grant program intended to support local investment programs aimed at an ecological sustainable development.”

© Thomas Plümper 2017 - 2018 182

Descriptive Stats

© Thomas Plümper 2017 - 2018 183

Results

© Thomas Plümper 2017 - 2018 184

Results

© Thomas Plümper 2017 - 2018 185

More Examples

Example 1. In the 1980s there was a federal law restricting speedometer readings to no more than 85 mph. So if you wanted to try and predict a vehicle’s top-speed from a combination of horse-power and engine size, you would get a reading no higher than 85, regardless of how fast the vehicle was really traveling. This is a classic case of right-censoring (censoring from above) of the data. The only thing we are certain of is that those vehicles were traveling at least 85 mph.

Example 2. A research project is studying the level of lead in home drinking water as a function of the age of a house and family income. The water testing kit cannot detect lead concentrations below 5 parts per billion (ppb). The EPA considers levels above 15 ppb to be dangerous. These data are an example of left-censoring (censoring from below).

Example 3. Consider the situation in which we have a measure of academic aptitude (scaled 200-800) which we want to model using reading and math test scores, as well as, the type of program the student is enrolled in (academic, general, or vocational). The problem here is that students who answer all questions on the academic aptitude test correctly receive a score of 800, even though it is likely that these students are not “truly” equal in aptitude. The same is true of students who answer all of the questions incorrectly. All such students would have a score of 200, although they may not all be of equal aptitude.

© Thomas Plümper 2017 - 2018 186

Description of the data

Let’s pursue Example 3 from above. We have a hypothetical data file, tobit.dta with 200 observations. The academic aptitude variable is apt, the reading and math test scores are read and math respectively. The variable prog is the type of program the student is in, it is a categorical (nominal) variable that takes on three values, academic (prog = 1), general (prog = 2), and vocational (prog = 3). Let’s look at the data. Note that in this dataset, the lowest value of apt is 352. No students received a score of 200 (i.e. the lowest score possible), meaning that even though censoring from below was possible, it does not occur in the dataset.

© Thomas Plümper 2017 - 2018 187

Tobit regression Below we run the tobit model, using read, math, and prog to predict apt. The ul( ) option in the tobit command indicates the value at which the right-censoring begins (i.e., the upper limit). There is also a ll( ) option to indicate the value of the left-censoring (the lower limit) which was not needed in this example. The i. before prog indicates that prog is a factor variable (i.e., categorical variable), and that it should be included in the model as a series of dummy variables. Note that this syntax was introduced in Stata 11. tobit apt read math i.prog, ul(800)

Tobit regression                                  Number of obs   =        200
                                                  LR chi2(4)      =     188.97
                                                  Prob > chi2     =     0.0000
Log likelihood = -1041.0629                       Pseudo R2       =     0.0832

------------------------------------------------------------------------------
         apt |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        read |   2.697939    .618798     4.36   0.000     1.477582    3.918296
        math |   5.914485   .7098063     8.33   0.000     4.514647    7.314323
             |
        prog |
          2  |  -12.71476   12.40629    -1.02   0.307    -37.18173     11.7522
          3  |   -46.1439   13.72401    -3.36   0.001     -73.2096   -19.07821
             |
       _cons |    209.566   32.77154     6.39   0.000     144.9359    274.1961
-------------+----------------------------------------------------------------
      /sigma |   65.67672   3.481272                      58.81116    72.54228
------------------------------------------------------------------------------
  Obs. summary:          0  left-censored observations
                       183     uncensored observations
                        17 right-censored observations at apt>=800

© Thomas Plümper 2017 - 2018 188

• The final log likelihood (-1041.0629) is shown at the top of the output; it can be used in comparisons of nested models, but we won’t show an example of that here.
• Also at the top of the output we see that all 200 observations in our data set were used in the analysis (fewer observations would have been used if any of our variables had missing values).
• The likelihood ratio chi-square of 188.97 (df=4) with a p-value of 0.0001 tells us that our model as a whole fits significantly better than an empty model (i.e., a model with no predictors).
• In the table we see the coefficients, their standard errors, the t-statistic, associated p-values, and the 95% confidence interval of the coefficients. The coefficients for read and math are statistically significant, as is the coefficient for prog=3. Tobit regression coefficients are interpreted in a similar manner to OLS regression coefficients; however, the linear effect is on the uncensored latent variable, not the observed outcome. See McDonald and Moffitt (1980) for more details.
  o For a one unit increase in read, there is a 2.7 point increase in the predicted value of apt.
  o A one unit increase in math is associated with a 5.91 unit increase in the predicted value of apt.
  o The terms for prog have a slightly different interpretation. The predicted value of apt is 46.14 points lower for students in a vocational program (prog=3) than for students in an academic program (prog=1).
• The ancillary statistic /sigma is analogous to the square root of the residual variance in OLS regression. The value of 65.67 can be compared to the standard deviation of academic aptitude, which was 99.21, a substantial reduction. The output also contains an estimate of the standard error of /sigma as well as the 95% confidence interval.

© Thomas Plümper 2017 - 2018 189

Selection

Selection bias is the selection of individuals, groups or data for analysis in such a way that proper randomization is not achieved, thereby ensuring that the sample obtained is not representative of the population intended to be analyzed. It is sometimes referred to as the selection effect. The phrase "selection bias" most often refers to the distortion of a statistical analysis, resulting from the method of collecting samples.

© Thomas Plümper 2017 - 2018 190

Estimation: Heckman Selection Model

Truncation is a special case of selection – exclusion of cases is non-random: Estimation of linear models is biased if exclusion criterion is correlated with DV (endogenous selection)

In case of endogenous selection: estimation adds a selection equation – 1 if y is observable and 0 otherwise (again latent variable model):

y x  , E  | x  0

1st stage: selection probit

$$s^* = z\gamma + u, \qquad s = \begin{cases} 1 & \text{if } z\gamma + u \geq 0 \\ 0 & \text{else} \end{cases}$$

In case of truncation, selection equation equals c (threshold).

2nd stage – latent variable model: y* - latent continuous variable:

© Thomas Plümper 2017 - 2018 191

y if s  0 yx* '; y   i i i 0 else

Heckman selection model: 1st stage – probit, 2nd stage – linear normal or probit

Estimation: 2 stage or full MLE (better for correct SEs)
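A sketch of both estimation routes in Stata, using a hypothetical simulated selection process (variable names, coefficients and the error correlation are illustrative assumptions only):

clear
set obs 2000
set seed 8
generate z = rnormal()                          // enters only the selection equation
generate x = rnormal()
matrix C = (1, .5 \ .5, 1)
drawnorm u e, corr(C)                           // correlated selection and outcome errors
generate s = (0.5 + 1*z + u > 0)                // selection indicator
generate y = 1 + 1*x + e
replace  y = . if s == 0                        // outcome observed only for selected cases
heckman y x, select(s = z x)                    // full maximum likelihood
heckman y x, select(s = z x) twostep            // two-step variant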

Estimation of the second stage is conditioned on the outcome of the first stage: As the conditional mean function we get:

* * E yi s i0  E  x i '   i z i '    i  0  E yi s i 0  x i '  E   i  i   z i '   with the elements of the error term being

    i2 i i , with i~NVE 0,  i ;  i  i   0 

© Thomas Plümper 2017 - 2018 192

Example: EU Enlargement

Plümper, T., Schneider, C.J. and Troeger, V.E., 2006. The politics of EU eastern enlargement: Evidence from a Heckman selection model. British Journal of Political Science, 36(01), pp.17-38.

The eastern enlargement of the European Union is a twofold process, in which governments of transition countries decide whether or not to apply for membership and in turn EU members decide whether or not to accept these applicants. The authors argue that the level of democracy and the extent of market reforms together determine the first decision, while the second decision is largely determined by the EU observing the reform process in applicant countries imposed by the acquis communautaire conditionality.

© Thomas Plümper 2017 - 2018 193

© Thomas Plümper 2017 - 2018 194

Results

© Thomas Plümper 2017 - 2018 195

Chapter 7: Maximum Likelihood Estimation of Categorical Variables

© Thomas Plümper 2017 - 2018 196

Scales

scale      basic empirical observation –    permissible statistics               allowed operations
           determination of:
nominal    equality                         number of cases,                     = ≠
                                            contingency correlation
ordinal    greater or less                  all the above plus:                  = ≠ > <
                                            percentile
interval   equality of intervals or         all the above plus:                  = ≠ > < + -
           differences                      mean, standard deviation,
                                            rank correlation,
                                            product-moment correlations
ratio      equality of ratios               all the above plus:                  = ≠ > < + - · :
                                            coefficient of variation

© Thomas Plümper 2017 - 2018 197

Limited Dependent Variables: Categorical Data

A dependent variable is called limited dependent if it is not a continuous variable.

In this case, estimation by OLS is suboptimal and likely to be biased.

Models with limited dependent variables are most often estimated with maximum likelihood estimation (MLE).

Examples for limited DV: dichotomous (binary choice) variables, discrete choice (ordered or multinomial) variables, count variables.

© Thomas Plümper 2017 - 2018 198

Discrete Variables

In contrast, a discrete variable over a particular range of real values is one for which, for any value in the range that the variable is permitted to take on, there is a positive minimum distance to the nearest other permissible value. The number of permitted values is either finite or countably infinite. Common examples are variables that must be integers, non-negative integers, positive integers, or only the integers 0 and 1.

Thus, a discrete variable cannot take on all values within the limits of the variable.

Examples: survey responses, gender, disaster deaths

© Thomas Plümper 2017 - 2018 199

Definitions

Discrete choice variables:

Binary variables: nominal, either 1 or 0: yes/no, war, turnout;

Multinomial variables: more than 2 categories that cannot be ordered: party identification (more than 2 parties)

Ordered choice variables: more than 2 categories that can be ordered: scales (e.g. 1-strongly agree to 5 strongly disagree)

Count variables: discrete – specific distribution, positive values, number of wars/ terrorist attacks in a year

Truncated variables: only observations are used that are larger or smaller than a certain value: analysis of the determinants of poverty – only poor people are analysed

Censored variables: values above or below a certain threshold cannot be observed: income categories

© Thomas Plümper 2017 - 2018 200

Discrete Choice Models: Why not OLS?

The distribution of the error (disturbance term) would consist of just two specific values; it would be neither continuous nor normally distributed.

Therefore, standard errors of the betas would be wrong and significance tests invalid (nothing new here....)

yi0   1 x i 1   i   i  y i   0   1 x i 1

11   0   1xi 1

20   0   1xi 1

In addition, the two values of the error term would change with X, so that the distribution of the error term is heteroskedastic as well.

A linear probability model would predict values larger than 1 and smaller than 0.

© Thomas Plümper 2017 - 2018 201

Maximum Likelihood

The problem with the error term is solved by employing an estimation technique called maximum likelihood instead of least squares – maximum likelihood maximizes a likelihood function instead of minimizing the sum of the squared residuals.

The maximum likelihood estimator of beta is the value of beta that maximizes the likelihood function – therefore we write L (the likelihood function) as a function of beta.

Maximum likelihood principle: out of all the possible values for beta, the value that makes the likelihood of the observed data largest should be chosen – given the distribution of the random sample.

Usually it is more convenient to work with the log-likelihood function which is obtained by taking the natural log of the likelihood function – here one can use the fact that the log of the product equals the sum of the logs: log(u*v)=log u + log v and the second law: log u^r = r*log u.

© Thomas Plümper 2017 - 2018 202

Maximum Likelihood Discrete Choice Models: Logit and Probit

The problem of predicted values larger than 1 and smaller than 0 is solved by applying the cumulative distribution function (CDF) of a normal or logistic distribution to the dependent variable.

Difference between logit and probit models:

Probit model: the error term is assumed to be normally distributed.
Logit model: the error term is assumed to follow a logistic distribution.

© Thomas Plümper 2017 - 2018 203

Terminology

PDF vs. CDF: Probability Density Function vs. Cumulative Distribution Function of a variable:

PDF: For a continuous variable, the probability density function (pdf) describes the relative likelihood that the variate takes the value x. CDF: The cumulative distribution function (cdf) is the probability that the variable takes a value less than or equal to x. The CDF is the antiderivative or integral of the PDF, and the PDF is the derivative of the CDF. Example: normal distribution: PDF:

$$f(x) = \frac{e^{-(x-\mu)^2 / 2\sigma^2}}{\sigma\sqrt{2\pi}} = F'(x)$$

CDF:
$$F(x) = \int_{-\infty}^{x} f(t)\, dt$$

© Thomas Plümper 2017 - 2018 204

Visualization

© Thomas Plümper 2017 - 2018 205

Logistic MLE

xs / e PDF: f x , , s 2 ,  mean location , s  0 scale xs / se1  1 CDF: F x  1 e xs / for 0 and s 1, standard logistic distribution 11 PDF: f x  1   F x 1  F x 11eexx   1 CDF: F x  1 e x

The logistic distribution is rather inflexible but more popular because it is easier to compute. The multivariate normal has nicer characteristics but is difficult to compute.

© Thomas Plümper 2017 - 2018 206

More Visualizations: PDFs

© Thomas Plümper 2017 - 2018 207

More Visualization: CDFs

© Thomas Plümper 2017 - 2018 208

Binary Choice

The problem: binary dependent variable

© Thomas Plümper 2017 - 2018 209

[Figure: binary dependent variable A plotted against B]

© Thomas Plümper 2017 - 2018 210

What happens if we estimate a linear relation?

© Thomas Plümper 2017 - 2018 211

[Figure: linear fit of the binary variable A on B (Plot A). Intercept 0.46215 ± 0.03762, slope 0.27992 ± 0.03687, residual sum of squares 19.70, Pearson's r 0.573, R² 0.328, adj. R² 0.322]

What is wrong (suboptimal)?

© Thomas Plümper 2017 - 2018 212

The Solution: Non-linear curve fit

© Thomas Plümper 2017 - 2018 213

Probit likelihood:
$$y_i = \Phi(x_i'\alpha) + u_i$$
Log-likelihood:
$$\ln l_i = \begin{cases} \ln \Phi(x_i'\alpha) & \text{if } y_i = 1 \\ \ln\left[1 - \Phi(x_i'\alpha)\right] & \text{if } y_i = 0 \end{cases}$$

Logit likelihood:
$$y_i = \frac{1}{1 + \exp(-x_i'\alpha)} + u_i$$
Log-likelihood:
$$\ln l_i = \begin{cases} -\ln\left[1 + \exp(-x_i'\alpha)\right] & \text{if } y_i = 1 \\ -x_i'\alpha - \ln\left[1 + \exp(-x_i'\alpha)\right] & \text{if } y_i = 0 \end{cases}$$

© Thomas Plümper 2017 - 2018 214

The Logic of Binary Choice Models

First we assume that the dependent variable has not only two outcomes (0 and 1) but continuous normally distributed values between -∞ and ∞ – this is called the latent variable or the latent continuous model:

yx ' i i i

Where  is a latent, continuous dependent variable and is and independently, identically, normally yi i distributed error term. For the latent dependent variable there exist an observation rule:

 1if yi  0, yi   0 otherwise

© Thomas Plümper 2017 - 2018 215

The probability to observe a 1 for observation (individual) i given the probit specification (normal cdf) is:

 Pry i 1  Pry i  0  Prx' i    i  0  Pr    x '  ;   ii 

ii    Pr   xii '    1  Pr    x '      

1    xii '    x '   is thecdf of the N 0,1 random var iable   i  a a    d   Pr   a 

© Thomas Plümper 2017 - 2018 216

The likelihood function is the product of all possible outcomes and their assigned probabilities:
$$L = \prod_{i=1}^{n} \Pr(y_i = 0)^{1-y_i}\, \Pr(y_i = 1)^{y_i} = \prod_{i=1}^{n} \left[1 - \Phi(x_i'\alpha)\right]^{1-y_i} \Phi(x_i'\alpha)^{y_i}$$
with the log-likelihood given by:
$$\ln L = \sum_{i=1}^{n} \Big\{ (1 - y_i)\ln\left[1 - \Phi(x_i'\alpha)\right] + y_i \ln \Phi(x_i'\alpha) \Big\}$$
Then the log-likelihood function is maximized with respect to α:
$$\frac{\partial \ln L}{\partial \alpha} = \sum_{i=1}^{n} \frac{\phi(x_i'\hat{\alpha})\, x_i}{\Phi(x_i'\hat{\alpha})\left[1 - \Phi(x_i'\hat{\alpha})\right]}\left[y_i - \Phi(x_i'\hat{\alpha})\right] = 0$$
and solved for $\hat{\alpha}$.

© Thomas Plümper 2017 - 2018 217

Interpretation

The effect of one variable depends on the cumulative effect of other variables.

This is a general property of non-linear models.

Direction of effects can be obtained from the coefficient’s sign (but note that this does not necessarily hold for interaction effects).

It is possible to compute the marginal effect and the aggregate effect of a variable (using the delta method).

© Thomas Plümper 2017 - 2018 218

Marginal Effects

$$\Pr(y_i = 1) = \Phi(x_i'\alpha)$$

If $x_{ik}$ is continuous: derivative or elasticity for comparative statics:

$$\frac{\partial \Pr(y_i = 1)}{\partial x_{ik}} = \phi(x_i'\alpha)\,\alpha_k$$

elasticity:

$$\frac{1}{n}\sum_{i=1}^{n} \frac{\partial \Pr(y_i = 1)}{\partial x_{ik}} = \frac{1}{n}\sum_{i=1}^{n} \phi(x_i'\alpha)\,\alpha_k$$

Average change of the probability that y_i=1 if x_ik changes by one unit.
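A sketch of this average marginal effect in Stata, with a hypothetical simulated probit data-generating process:

clear
set obs 1000
set seed 9
generate x = rnormal()
generate y = runiform() < normal(0.5 + 1*x)    // normal() is the standard normal CDF
probit y x
margins, dydx(x)                               // average change in Pr(y=1) for a unit change in x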

© Thomas Plümper 2017 - 2018 219

We want to estimate the effect of one specific x on the probability of success of y. But this is complicated by the non-linear nature of the used CDF: Partial effect:

Pr yx 1|    x '   x j j phi is the normal PDF (probability density function). In the logit and probit case the normal and logistic cdf are strictly increasing, therefore the PDF (and therefore phi) is larger than zero for all x‘alpha. Thus, the partial effect of x_j on PR[y=1] depends on all x through the positive quantity phi(x‘alpha), which means that the partial effect always has the same sign as the estimated coefficient.

The partial effect changes for different values of x_j and all other x.

Yet, the relative effects of any two continuous explanatory variables do not depend on x: the ratio of the partial effects for x_j and x_h is α_j/α_h.

© Thomas Plümper 2017 - 2018 220

The Logic of Logit and Probit Models

exp 1xi 1   2 x i 2  ...   k x ik  Pr yXi  1|  1 exp  1xi 1   2 x i 2  ...   k x ik 

Thus, logit and probit models estimate the probability that the dependent variable is 1.

The exponentiated beta gives the so-called odds ratio, which is often reported as the result of a logit model. Example: assume y – doing a masters degree after college (1) or not (0); x – sex, female (0) or male (1). Observation: on average 50 of 100 male students do a masters and 20 of 100 female students do a masters; thus, men are 2.5 times as likely to do a masters. Odds ratio:
$$\frac{0.5 / 0.5}{0.2 / 0.8} = \frac{0.5 \cdot 0.8}{0.2 \cdot 0.5} = \frac{0.4}{0.1} = 4$$

Men have 4 times the odds of doing a masters degree. Odds ratios seem to overstate the relative positions; therefore log-odds are normally used. Log-odds are the natural logarithm of odds ratios and are what logit regression models compute; log-odds are more symmetric and correct the overstatement.

© Thomas Plümper 2017 - 2018 221
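A sketch mimicking the masters-degree example above with hypothetical simulated data; the or option reports exponentiated coefficients:

clear
set obs 200
set seed 3
generate male    = _n > 100
generate masters = runiform() < cond(male, 0.5, 0.2)   // 50% of men, 20% of women
logit masters male          // coefficient: log-odds ratio
logit masters male, or      // exponentiated: odds ratio, in expectation close to 4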

Goodness of Fit

Pseudo R² (McFadden):

R² = 1- |Log Likelihood(ur)| / |Log Likelihood(0)|

The pseudo R² is calculated from the ratio of the log-likelihood of the full model to that of the empty model that includes only an intercept.

But also compute the correctly predicted 1’s…

…and that exactly is the problem.

© Thomas Plümper 2017 - 2018 222

The Problem with Binary Choice Models: Logit and the Problem of War

© Thomas Plümper 2017 - 2018 223

Interaction Effects in Binary Choice Models

Complicated because the effects depend on the value/effect of other regressors.

The intuition from linear models does not extend to non-linear models. Example: a probit model with two continuous explanatory variables that have an interaction effect:

E y| x1 , x 2 , X   1 x 1   2 x 2   12 x 1 x 2  X     

The interaction effect is the cross derivative of the expected value of y:

E y| x , x , X  x   x   x x  X      1 2  1 1 2 2 12 1 2    However, most applied researchers instead compute the marginal effect of the interaction term which is not equal to the true interaction effect

   '   xx 12  12

© Thomas Plümper 2017 - 2018 224

Implications

There are 4 implications:
1. The interaction effect can be non-zero even though the coefficient of the IA effect is zero;
2. the significance of the IA effect cannot be tested with a simple t-test;
3. as in all non-linear models, the IA effect is conditional on all other independent variables;
4. the IA effect may have different signs for different values of covariates.

© Thomas Plümper 2017 - 2018 225

Effect Strengths of Interactions in Non-Linear Models

General non-linear models: F is a known function that is twice differentiable:
$$E(y \mid x, \beta) = F(x, \beta)$$

The IA effect is found by computing cross derivatives – the IA effect of x_1 and x_2 on y is:

$$\mu_{12} = \frac{\partial^2 F(x, \beta)}{\partial x_1\, \partial x_2}$$
which is estimated by
$$\hat{\mu}_{12} = \frac{\partial^2 F(x, \hat{\beta})}{\partial x_1\, \partial x_2}$$
$\hat{\beta}$ is a consistent estimator of $\beta$. The standard error of the estimated IA effect $\hat{\mu}_{12}$ is found by applying the Delta method, which gives the asymptotic variance:

$$\hat{\sigma}^2_{\hat{\mu}_{12}} = \left[\frac{\partial}{\partial \beta}\,\frac{\partial^2 F(x, \hat{\beta})}{\partial x_1\, \partial x_2}\right]' \hat{\Sigma}_{\hat{\beta}} \left[\frac{\partial}{\partial \beta}\,\frac{\partial^2 F(x, \hat{\beta})}{\partial x_1\, \partial x_2}\right]$$

© Thomas Plümper 2017 - 2018 226

Extensions: Conditional Logit

The utility for each alternative depends on attributes of that alternative, interacted perhaps with attributes of the person.

Extensions: Mixed Logit

Mixed logit is a fully general statistical model for examining discrete choices. The motivation for the mixed logit model arises from the limitations of the standard logit model. The standard logit model has three primary limitations, which mixed logit solves: "It obviates the three limitations of standard logit by allowing for random taste variation, unrestricted substitution patterns, and correlation in unobserved factors over time." Mixed logit can also utilize any distribution for the random coefficients, unlike probit which is limited to the normal distribution. It has been shown that a mixed logit model can approximate to any degree of accuracy any true random utility model of discrete choice, given an appropriate specification of variables and distribution of coefficients."

© Thomas Plümper 2017 - 2018 227

More than 2 Ordered Categories: Ordered-Logit/Ordered-Probit

When a DV has more than 2 categories and the values assigned to each category have a meaningful sequential order, e.g. higher values indeed indicate "more", ordered logit or probit models provide useful estimation procedures.

Ordered logit and probit models are generalized probit and logit models which allow for more than one outcome.

The proportional-odds ordered logit model is so called because, if we consider the odds odds(k) = P(Y ≤ k) / P(Y > k), then odds(k1) and odds(k2) have the same ratio for all independent variable combinations. Proportional odds assumption: ordered logit is equal to k sets of binary regressions with the critical assumption that the slope parameters are identical across all regressions.
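A minimal ordered-logit sketch in Stata, using the bundled auto data (the 1-5 repair record) as a stand-in example:

sysuse auto, clear
ologit rep78 mpg foreign              // cutpoints reported as /cut1 ... /cut4
predict p1 p2 p3 p4 p5, pr            // predicted probability of each of the five categories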

© Thomas Plümper 2017 - 2018 228

Visualization

[Figure: D plotted against E]

© Thomas Plümper 2017 - 2018 229

Cutpoints

In ordered logit, an underlying score is estimated as a linear function of the independent variables and a set of cutpoints. The probability of observing outcome i corresponds to the probability that the estimated linear function, plus random error, is within the range of the cutpoints estimated for the outcome:

U_j is assumed to be logistically distributed in ordered logit. In either case, we estimate the coefficients beta_1,…, beta_k together with the cutpoints k_1,…,k_k-1, where k is the number of possible outcomes. K_0 is taken as -∞ and k_k is taken as +∞. All of this is a direct generalization of the ordinary two-outcome logit model.

© Thomas Plümper 2017 - 2018 230

Example

RISK FACTORS ASSOCIATED WITH BUS ACCIDENT SEVERITY IN THE UNITED STATES: A GENERALIZED ORDERED LOGIT MODEL

Sigal Kaplan Department of Transport, Technical University of Denmark, Kgs. Lyngby, Denmark [email protected]

Carlo Giacomo Prato Department of Transport, Technical University of Denmark, Kgs. Lyngby, Denmark [email protected]

Recent years have witnessed a growing interest in improving bus safety operations worldwide. While in the United States buses are considered relatively safe, the number of bus accidents is far from being negligible, triggering the introduction of the Motor-coach Enhanced Safety Act of 2011. The current study investigates the underlying risk factors of bus accident severity in the United States. A generalized ordered logit model is estimated in order to account for the ordered nature of severity, while allowing the violation of the proportional odds assumption across severity categories. Data for the analysis are retrieved from the General Estimates System (GES) database for the years 2005-2009. Results show that accident severity increases: (i) for young bus drivers under the age of 25; (ii) for drivers beyond the age of 55, and most prominently for drivers over 65 years old; (iii) for female drivers; (iv) for very high (over 65 mph) and very low (under 20 mph) speed limits; (v) at intersections; (vi) because of inattentive and risky driving.

© Thomas Plümper 2017 - 2018 231

Extract from Results

Table 2 Estimation results of generalized ordered logit model for bus accident severity

                                                     Threshold between:
Variable                Category              0 and 1     1 and 2     2 and 3     3 and 4
Bus driver's gender     Male                  -0.281***   -0.164***   -0.422***   -1.471***
                        Female (a)            -           -           -           -
Bus driver's age        < 25                   0.026       0.149***    0.431***    0.525**
                        25-34                  0.530***    0.130***    0.013       1.152***
                        35-44 (a)             -           -           -           -
                        45-54 (a)             -           -           -           -
                        55-64                  0.198***    0.081***    0.352***    0.029
                        > 64                   0.126***    0.611***    0.816***    0.504***
Bus driver's behaviour  No charged offense (a) -          -           -           -
                        Drowsy                 0.883***    0.505***    0.669***    3.491***
                        Distracted             0.237***    0.181***    0.096***    1.018***
                        Speeding               0.769***    0.634***    0.227***    3.450***
Bus service type        Other bus (a)         -           -           -           -
                        School bus            -0.415***   -0.356***   -0.167***    0.261**
Bus vehicle type        Regular bus (a)       -           -           -           -
                        Van bus                0.165***    0.075***   -0.012      -0.615***

© Thomas Plümper 2017 - 2018 232

Generalized Models

© Thomas Plümper 2017 - 2018 233

Multiple Unsystematic Categories: Multinomial Logit

Examples:

- the choice of a party in an election
- choosing a car brand: American, European, or Asian cars
- the choice of an anchor currency

Else?

© Thomas Plümper 2017 - 2018 234

In the multinomial logit model, we estimate a set of coefficients, beta(1), beta(2), and beta(3), corresponding to each outcome:

For identification, one of the 3 coefficient vectors needs to be set to zero; that is, to make a choice between 2 options, we need to assume that a third option is irrelevant for the comparison of the two. Hence, the choice between 3 options is the cumulative probability distribution of the choices between three independent (!) pairs of two options.
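A minimal Stata sketch, assuming an unordered dependent variable brand (1 = American, 2 = European, 3 = Asian) and regressors x1 and x2 (all names hypothetical):

mlogit brand x1 x2, baseoutcome(1)   // coefficients of the base outcome are set to zero for identification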

© Thomas Plümper 2017 - 2018 235

Independence from Irrelevant Alternatives

McFadden and Train

The IIA assumption can be relaxed somewhat by using the nested logit model.

© Thomas Plümper 2017 - 2018 236

Example: Choice of Anchor Currency

This paper adopts and develops the ‘fear of floating’ theory to explain the decision to implement a de facto peg, the choice of anchor currency among multiple key currencies and the role of central bank independence for these choices. We argue that since exchange rate depreciations are passed through into higher prices of imported goods, avoiding the import of inflation provides an important motive to de facto peg the exchange rate in import-dependent countries. This study shows that the choice of anchor currency is determined by the degree of dependence of the potentially pegging country on imports from the key currency country and on imports from the key currency area, consisting of all countries which have already pegged to this key currency. The fear of floating approach also predicts that countries with more independent central banks are more likely to de facto peg their exchange rate since independent central banks are more averse to inflation than governments and can de facto peg a country’s exchange rate independently of the government.

© Thomas Plümper 2017 - 2018 237

Empirical Background

© Thomas Plümper 2017 - 2018 238

© Thomas Plümper 2017 - 2018 239

© Thomas Plümper 2017 - 2018 240

Example: Alvarez and Nagler on Vote Choice Models

“We think that if we have some theoretical reason to believe voters do not obey the IIA axiom, then it is important to correctly specify how voters perceive their choices in the systematic component of our models.“

© Thomas Plümper 2017 - 2018 241

Conclusion

Categorical data analysis has become a flexible tool over recent years. As in all maximum likelihood models, the reliability of these models depends on the likelihood function. For probit and logit models, the likelihood function implies strong assumptions:

- monotonicity
- the marginal effect is strongest in the middle of the distribution

© Thomas Plümper 2017 - 2018 242

Chapter 8: ML Estimation of Count Data

© Thomas Plümper 2017 - 2018 243

Count Data

In statistics, count data is a statistical data type, a type of data in which the observations can take only the non-negative integer values {0, 1, 2, 3, ...}, and where these integers arise from counting rather than ranking. The statistical treatment of count data is distinct from that of binary data, in which the observations can take only two values, usually represented by 0 and 1, and from ordinal data, which may also consist of integers but where the individual values fall on an arbitrary scale and only the relative ranking is important.

© Thomas Plümper 2017 - 2018 244

Famous (and infamous) Examples:

Examples of events that may be modelled as a Poisson distribution include:

The number of soldiers killed by horse-kicks each year in each corps in the Prussian cavalry. This example was made famous by a book of Ladislaus Josephovich Bortkiewicz (1868–1931).

The number of yeast cells used when brewing Guinness beer. This example was made famous by William Sealy Gosset (1876–1937).

The number of phone calls arriving at a call centre within a minute. This example was made famous by A.K. Erlang (1878 – 1929).

Internet traffic.

The number of goals in sports involving two competing teams.

The number of deaths per year in a given age group.

The number of jumps in a stock price in a given time interval.

The number of mutations in a given stretch of DNA after a certain amount of radiation.

© Thomas Plümper 2017 - 2018 245

The proportion of cells that will be infected at a given multiplicity of infection.

The arrival of photons on a pixel circuit at a given illumination and over a given time period.

The targeting of V-1 flying bombs on London during World War II investigated by R. D. Clarke in 1946.

© Thomas Plümper 2017 - 2018 246

OLS-Analysis of Count Data

Until very recently, methodologists argued that instead of using ‘complex’ ML estimation procedures such as Poisson or negative binomial models, researchers may take the log of the (dependent) count variable and analyze the resulting variable using OLS.

However, the log of a count variable is still discrete:

[Figure: histograms of the original counts (left) and of the log of the counts (right); vertical axes show frequency/count]

© Thomas Plümper 2017 - 2018 247

Poisson Distribution

The distribution was first introduced by Siméon Denis Poisson (1781–1840) and published, together with his probability theory, in 1837 in his work Recherches sur la probabilité des jugements en matière criminelle et en matière civile ("Research on the Probability of Judgments in Criminal and Civil Matters").

The work theorized about the number of wrongful convictions in a given country by focusing on certain random variables N that count, among other things, the number of discrete occurrences (sometimes called "events" or "arrivals") that take place during a time-interval of given length. The result had been given previously by Abraham de Moivre (1711) in De Mensura Sortis seu; de Probabilitate Eventuum in Ludis a Casu Fortuito Pendentibus in Philosophical Transactions of the Royal Society, p. 219. This has prompted some authors to argue that the Poisson distribution should bear the name of de Moivre.

A practical application of this distribution was made by Ladislaus Bortkiewicz in 1898 when he was given the task of investigating the number of soldiers in the Prussian army killed accidentally by horse kicks; this experiment introduced the Poisson distribution to the field of reliability engineering.

© Thomas Plümper 2017 - 2018 248

Poisson Models

Poisson is a distribution with mean=variance=λ.

Importantly, the Poisson distribution assumes that mean=variance.

© Thomas Plümper 2017 - 2018 249

Assumptions

The Poisson distribution is an appropriate model if the following assumptions are true.

K is the number of times an event occurs in an interval and K can take values 0, 1, 2, …

The occurrence of one event does not affect the probability that a second event will occur. That is, events occur independently.

The rate at which events occur is constant. The rate cannot be higher in some intervals and lower in other intervals.

Two events cannot occur at exactly the same instant.

The probability of an event in a small interval is proportional to the length of the interval.

© Thomas Plümper 2017 - 2018 250

Simple Maths

The probability of observing k events in an interval is given by the equation

P(k events in interval) = λ^k e^(−λ) / k!

where

- λ is the average number of events per interval
- e is the number 2.71828... (Euler's number), the base of the natural logarithm
- k takes values 0, 1, 2, …
- k! = k × (k − 1) × (k − 2) × … × 2 × 1 is the factorial of k.
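A minimal Stata sketch of this formula and of a Poisson regression, assuming a count dependent variable count and regressors x1, x2 (all names hypothetical):

display poissonp(2, 0)       // Pr(k = 0) when lambda = 2
display poissonp(2, 3)       // Pr(k = 3) when lambda = 2
poisson count x1 x2          // Poisson regression: log(lambda) is modelled as a linear function of x1 and x2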

© Thomas Plümper 2017 - 2018 251

Violations

Overdispersion: Variance > Mean

Underdispersion: Variance < Mean

Overdispersion is a very common feature in applied data analysis because in practice, populations are frequently heterogeneous (non-uniform) contrary to the assumptions implicit within widely used simple parametric models.

© Thomas Plümper 2017 - 2018 252

Poisson Estimates and Overdispersion

Overdispersion is often encountered when fitting very simple parametric models, such as those based on the Poisson distribution. The Poisson distribution has one free parameter and does not allow for the variance to be adjusted independently of the mean. The choice of a distribution from the Poisson family is often dictated by the nature of the empirical data. For example, Poisson regression analysis is commonly used to model count data.

If overdispersion is a feature, an alternative model with additional free parameters may provide a better fit.

© Thomas Plümper 2017 - 2018 253

Negative Binomial: Estimation of Overdispersed Count Data

In the case of count data, a mixture model like the negative binomial can be proposed instead, in which the mean of the Poisson distribution can itself be thought of as a random variable drawn – in this case – from the gamma distribution, thereby introducing an additional free parameter (note the resulting negative binomial distribution is completely characterized by two parameters).

More clearly:

The negative binomial model allows analyzing overdispersed count data. It does so by adding gamma-distributed unobserved heterogeneity to the Poisson model: each unit's Poisson mean is multiplied by a gamma-distributed error term. This mixing spreads the distribution out and introduces an additional overdispersion parameter (alpha) to be estimated.
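A minimal Stata sketch, assuming a count dependent variable count and regressors x1, x2 (all names hypothetical):

poisson count x1 x2          // imposes mean = variance
nbreg count x1 x2            // negative binomial; the LR test of alpha = 0 at the bottom signals overdispersion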

© Thomas Plümper 2017 - 2018 254

NegBin Distributions

© Thomas Plümper 2017 - 2018 255

Conditional Negative Binomial

Occasionally, it is not possible that the count is 0. For example, if we count the number of items in a supermarket trolley at the checkout, a ‘0’ is extremely unlikely simply because a customer that does not buy anything would not queue.

Of course, in these cases one could reduce the counted number by 1 and use a standard Poisson or negative binomial model.

One could also use the conditional negative binomial or, as it is also called, zero-truncated negative binomial.

© Thomas Plümper 2017 - 2018 256

Zero-truncated negative Binomial

Since the ZTP is a truncated distribution with the truncation stipulated as k > 0, one can derive the probability mass function g(k; λ) from a standard Poisson distribution f(k; λ) as follows:

g(k; λ) = P(X = k | X > 0) = f(k; λ) / (1 − f(0; λ)) = (λ^k e^(−λ)) / (k! (1 − e^(−λ))) = λ^k / ((e^λ − 1) k!)

The mean is

E[X] = λ e^λ / (e^λ − 1) = λ / (1 − e^(−λ))

and the variance is

Var[X] = E[X] (1 + λ − E[X]) = λ e^λ (e^λ − λ − 1) / (e^λ − 1)²

© Thomas Plümper 2017 - 2018 257

Examples

Example 1.

A study of length of hospital stay, in days, as a function of age, kind of health insurance and whether or not the patient died while in the hospital. Length of hospital stay is recorded as a minimum of at least one day.

Example 2.

A study of the number of journal articles published by tenured faculty as a function of discipline (fine arts, science, social science, humanities, medical, etc). To get tenure faculty must publish, therefore, there are no tenured faculty with zero publications.

(actually not exactly true…)

Example 3.

A study by the county traffic court on the number of tickets received by teenagers as predicted by school performance, amount of driver training and gender. Only individuals who have received at least one citation are in the traffic court files. http://stats.idre.ucla.edu/stata/dae/zero-truncated-poisson-regression/

© Thomas Plümper 2017 - 2018 258

Let’s look at the data.

use http://www.ats.ucla.edu/stat/stata/dae/ztp, clear

summarize stay

Variable |     Obs        Mean    Std. Dev.       Min        Max
---------+-------------------------------------------------------
    stay |    1493    9.728734    8.132908          1         74

histogram stay, discrete

tnbreg stay age i.hmo i.died, ll(0)

© Thomas Plümper 2017 - 2018 259

Fitting truncated Poisson model:

Iteration 0: log likelihood = -6908.7992 Iteration 1: log likelihood = -6908.7991

Fitting constant-only model:

Iteration 0: log likelihood = -4817.852 Iteration 1: log likelihood = -4778.7604 Iteration 2: log likelihood = -4770.8734 Iteration 3: log likelihood = -4770.848 Iteration 4: log likelihood = -4770.848

Fitting full model:

Iteration 0: log likelihood = -4755.5912 Iteration 1: log likelihood = -4755.2798 Iteration 2: log likelihood = -4755.2796

Truncated negative binomial regression          Number of obs   =       1493
Truncation point: 0                             LR chi2(3)      =      31.14
Dispersion     = mean                           Prob > chi2     =     0.0000
Log likelihood = -4755.2796                     Pseudo R2       =     0.0033

------------------------------------------------------------------------------
        stay |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |  -.0156929    .013107    -1.20   0.231    -.0413822    .0099964
       1.hmo |  -.1470576   .0592161    -2.48   0.013     -.263119   -.0309962
      1.died |  -.2177714   .0461605    -4.72   0.000    -.3082442   -.1272985
       _cons |   2.408328    .071982    33.46   0.000     2.267245     2.54941
-------------+----------------------------------------------------------------
    /lnalpha |  -.5686389   .0551506                     -.6767321   -.4605457
-------------+----------------------------------------------------------------
       alpha |   .5662957   .0312316                      .5082753    .6309393
------------------------------------------------------------------------------
Likelihood-ratio test of alpha=0: chibar2(01) = 4307.04  Prob>=chibar2 = 0.000

© Thomas Plümper 2017 - 2018 260

Zeros from Two Sources: Zero-inflated Negative Binomial / Poisson

An often misused model:

Assumption: 0 are generated by two different processes:

Classical Example 1: A park with a fishing opportunity: count the number of fish caught at the exit gate… 0’s result either from not having fished, or from having fished but not having caught anything.

Example 2: The number of insurance claims within a population for a certain type of risk would be zero- inflated by those people who have not taken out insurance against the risk and thus are unable to claim.

© Thomas Plümper 2017 - 2018 261

The Zero-Inflation Correction

The zinb model has two parts, a negative binomial count model and the logit model for predicting excess zeros.

The zero-inflated Poisson (ZIP) or ZINB models employ two components that correspond to two zero generating processes. The first process is governed by a binary distribution that generates structural zeros. The second process is governed by a Poisson distribution that generates counts, some of which may be zero. The two model components are described as follows:

Pr(y_j = 0) = π + (1 − π) e^(−λ)

Pr(y_j = h_i) = (1 − π) (λ^(h_i) e^(−λ)) / h_i! ,   h_i ≥ 1

Very similar to the Heckman model: one equation for the probability of fishing, one for the count…

© Thomas Plümper 2017 - 2018 262

Example: Fishing
from http://stats.idre.ucla.edu/stata/dae/zero-inflated-negative-binomial-regression/

zinb count child i.camper, inflate(persons) vuong zip

Fitting constant-only model:

Iteration 0: log likelihood = -519.33992 Iteration 1: log likelihood = -471.96077 Iteration 2: log likelihood = -465.38193 Iteration 3: log likelihood = -464.39882 Iteration 4: log likelihood = -463.92704 Iteration 5: log likelihood = -463.79248 Iteration 6: log likelihood = -463.75773 Iteration 7: log likelihood = -463.7518 Iteration 8: log likelihood = -463.75119 Iteration 9: log likelihood = -463.75118

Fitting full model:

Iteration 0: log likelihood = -463.75118 (not concave) Iteration 1: log likelihood = -440.43162 Iteration 2: log likelihood = -434.96651 Iteration 3: log likelihood = -433.49903 Iteration 4: log likelihood = -432.89949 Iteration 5: log likelihood = -432.89091 Iteration 6: log likelihood = -432.89091

© Thomas Plümper 2017 - 2018 263

Zero-inflated negative binomial regression        Number of obs   =        250
                                                  Nonzero obs     =        108
                                                  Zero obs        =        142

Inflation model = logit                           LR chi2(2)      =      61.72
Log likelihood  = -432.8909                       Prob > chi2     =     0.0000

------------------------------------------------------------------------------
       count |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
count        |
       child |  -1.515255   .1955912    -7.75   0.000    -1.898606   -1.131903
    1.camper |   .8790514   .2692731     3.26   0.001     .3512857    1.406817
       _cons |   1.371048   .2561131     5.35   0.000     .8690758    1.873021
-------------+----------------------------------------------------------------
inflate      |
     persons |  -1.666563   .6792833    -2.45   0.014    -2.997934   -.3351922
       _cons |   1.603104   .8365065     1.92   0.055     -.036419    3.242626
-------------+----------------------------------------------------------------
    /lnalpha |   .9853533     .17595     5.60   0.000     .6404975    1.330209
-------------+----------------------------------------------------------------
       alpha |   2.678758   .4713275                      1.897425    3.781834
------------------------------------------------------------------------------
Likelihood-ratio test of alpha=0: chibar2(01) = 1197.43  Pr>=chibar2 = 0.0000
Vuong test of zinb vs. standard negative binomial: z = 1.70  Pr>z = 0.0444

© Thomas Plümper 2017 - 2018 264

Hurdle Model

Stata: Hurdle models concern bounded outcomes. For instance, how much someone spends at the movies is bounded by zero. In this sense, hurdle models are much like tobit models. They differ in that hurdle models provide separate equations for the bounded and the unbounded outcomes, whereas tobit models use the same equation for both. Hurdle models assume the unbounded outcomes are the result of clearing a hurdle. When the hurdle is not cleared, bounded outcomes result.

© Thomas Plümper 2017 - 2018 265

Money Spent in the Cinema

. churdle linear money dating teenager nkids, select(newborn hours weekends) ll(0)

Cragg hurdle regression                           Number of obs   =     10,000
                                                  LR chi2(3)      =    8775.37
                                                  Prob > chi2     =     0.0000
Log likelihood = -20230.563                       Pseudo R2       =     0.2408

------------------------------------------------------------------------------
       money |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
money        |
      dating |   15.07349   .2602275    57.92   0.000     14.56345    15.58353
    teenager |   3.055787   .1502961    20.33   0.000     2.761212    3.350362
       nkids |    14.9045   .1299277   114.71   0.000     14.64984    15.15915
       _cons |   14.98066    .045653   328.14   0.000     14.89118    15.07014
-------------+----------------------------------------------------------------
selection_ll |
     newborn |  -.1832054   .0408579    -4.48   0.000    -.2632854   -.1031254
       hours |  -.0476496   .0063111    -7.55   0.000     -.060019   -.0352802
    weekends |  -.4235522   .0788783    -5.37   0.000    -.5781509   -.2689536
       _cons |   .2977912   .0285355    10.44   0.000     .2418626    .3537199
-------------+----------------------------------------------------------------
     lnsigma |
       _cons |   1.100659   .0097069   113.39   0.000     1.081634    1.119684
-------------+----------------------------------------------------------------
      /sigma |   3.006146   .0291802                      2.949494    3.063885
------------------------------------------------------------------------------

© Thomas Plümper 2017 - 2018 266

Chapter 9: Dynamics and the Estimation

© Thomas Plümper 2017 - 2018 267

Data Structure

Cross-Section: >30 observations, one period

Time-Series: one unit, many periods (approx. 30 as a minimum requirement)

Panel: many observations, many periods

few observations, >30 periods (pooled time-series analysis) > 30 observations, few periods (repeated cross-section) > 30 observations, >30 periods (dynamic panel) few observations, few periods (problematic: cross-section with deflated standard errors)

© Thomas Plümper 2017 - 2018 268

Effect Dynamics

Effects are ‘dynamic’ if they can be observed in more than one period. Obviously, dynamics depend on the specification of ‘periods’.

© Thomas Plümper 2017 - 2018 269

Effect Onset

Effect onset can be

- immediate
- delayed
- anticipated (in the social sciences), that is, occurring before the ‘cause’ because actors respond to expectations.

© Thomas Plümper 2017 - 2018 270

Lags and Leads

Periodization of data also determines the number of periods between stimulus and effect:

- contemporaneous - lags - leads

© Thomas Plümper 2017 - 2018 271

Effect Functions

Effects have functional forms because the strength of the effect may change over time.

- increasing: education on income

- decreasing:

Effect lengths can vary from ‘short’ to ‘infinite’.

© Thomas Plümper 2017 - 2018 272

Dynamic Specification Choices

1. Determine the temporal unit (period) of analysis.

2. Determine whether the cause (treatment) resembles an ‘impulse’ or a ‘process’. A cause is best modelled as an impulse if it starts and stops in the same period. A cause should be modelled as a process if the beginning and the end of treatment occur in different periods.

3. Effects can be ‘immediate’, ‘delayed’ or ‘anticipated’. An effect is immediate if and only if it occurs in the same period as the cause. An effect is delayed if it does not occur in the same period as the cause but in later periods. In the social sciences, actors can also anticipate the occurrence of a cause and respond before the cause occurs. Hence, effects can occur in periods before the cause.

4. Similar to causes, effects too can be like an impulse or a process. An effect is like an impulse if the beginning and end of the effect happen in the same period, whereas an effect is a process if the end is in a later period than the start.

5. If effects are processes, the strength of the effect can evolve over several periods.

6. Any of the dynamic properties of effects can be heterogeneous across units.

© Thomas Plümper 2017 - 2018 273

The (Unfortunate) Practice of Dynamic Specifications

- periods are usually arbitrary, at times the number of periods is maximized (which also maximizes temporal dependence)

- dynamic specifications are not derived from theory, but aim at eliminating serially correlated errors

- often, one dynamic functional form is assumed for all regressors

- the potential for dynamic heterogeneity is ignored

- serially correlated errors are ‘treated’ by econometric patches (variables which do not exist in the data-generating process, but which reduce or eliminate serially correlated errors, regardless of the model misspecification from which these correlated errors emanate)

© Thomas Plümper 2017 - 2018 274

Periodization

The length of periods usually depends on how the data are observed: public data: annual; finance data: high frequency (daily, hourly, …). Is this the correct periodization?

© Thomas Plümper 2017 - 2018 275

Periodization: Annual Data versus Legislative Period Data

© Thomas Plümper 2017 - 2018 276

© Thomas Plümper 2017 - 2018 277

What does this tell us?

On an annual basis, left incumbents do not increase social spending; over the entire legislative period, they do…

Possible?

© Thomas Plümper 2017 - 2018 278

The Problem of ‘Too Many’ Periods

- the more often one observes outcomes (and causes), the larger c.p. the effect of the past on outcomes. - the shorter each period, the larger the serial correlation of data

The Problem of ‘Too Few Periods’

- the measurement of variables becomes imprecise (too many changes per period) - too few observations to specify dynamics

© Thomas Plümper 2017 - 2018 279

Temporal Dependence and Serial Correlation of Errors

Data for one unit (country, region, individual) over a certain period of time (days, month, years)

Main problem: observations of the dependent variable are not independent of each other. Observation in t0 depends on observation in t-1,t-2…:

Errors may not be independent of each other, too.

Note: serially correlated errors violate Gauss-Markov conditions, they indicate dynamic model misspecification.

In turn, the absence of serial correlation of errors does not validate the dynamic specification.

One can adjust for this problem as long as rho is larger than −1 and smaller than 1.

Non-stationarity: rho ≥ 1 or rho ≤ −1 (explosive processes).

Note: Non-stationarity is an asymptotic problem, non-stationarity over a finite number of periods may simply be a property of the data.

© Thomas Plümper 2017 - 2018 280

Autocorrelation

The error term in t1 is dependent on the error term in t0.

Compute the autocorrelation coefficient by regressing residuals on lagged residuals:

ε_it = ρ ε_i,t−1 + ν_it
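A minimal Stata sketch, assuming a panel with unit id, time variable year, depvar y and regressor x (all names hypothetical):

xtset id year
reg y x
predict e, residuals
reg e L.e, noconstant        // the slope is the estimated autocorrelation coefficient rho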

Not controlling for autocorrelation violates one of the basic assumptions of OLS and may bias the estimation of the beta coefficients.

The residual of a regression model picks up the influences of those variables affecting the DV that have not been included in the regression equation. Thus, persistence in excluded variables is the most frequent cause of autocorrelation. But there are other causes…

Positive autocorrelation: rho is positive: it is more likely that a positive value of the error-term is followed by a positive one and a negative by a negative one.

Negative autocorrelation: rho is negative: it is more likely that a positive value of the error-term is followed by a negative one and vice versa.

In ‘real data’ autocorrelation tends to be positive.

© Thomas Plümper 2017 - 2018 281

Unit Root

Stationarity: a time series is said to be (weakly) stationary if its expected value and population variance are independent of time and if the population covariance between its values at time t and t+s depends only on s but not on time. In the autoregressive model y_t = β y_{t−1} + ε_t with β = 1, y_t depends only on its own past and the accumulated random errors; this relationship is therefore called a random walk. The contribution of each innovation is permanently built into the time series. Because the series incorporates the sum of the shocks, it is said to be integrated (it is non-stationary). By contrast, when β < 1 the contribution of each shock to the series is exponentially attenuated and eventually becomes negligible.

Note that autoregressive coefficients larger than 1 are possible in the short run, but lead to explosive time series (and thus predictions of infinity) when they are time-invariant.

© Thomas Plümper 2017 - 2018 282

Unit Root

The expected value and population variance of y_t do not have unconditional meanings. If the expectations are taken at time 0, the expected value at any future time t is independent of t (it is always equal to the initial value y_0), but the variance increases with time.

Random walks can also have a “drift”, in which case both the expected value and the variance depend on t.

A time series following a random walk has a single UNIT ROOT since beta=1.

Consequences of non-stationarity: OLS estimates and test statistics are inconsistent and lead to biased estimates and wrong inference (error mean and/or variance are time dependent) if inferences generalize over T.
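A minimal Stata sketch of a standard unit-root check, the augmented Dickey-Fuller test (variable names hypothetical):

tsset year
dfuller y, lags(1) trend       // H0: unit root; rejecting H0 indicates stationarity (around a deterministic trend)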

© Thomas Plümper 2017 - 2018 283

Explosive Time Series Do Exist

© Thomas Plümper 2017 - 2018 284

How to Specify the Dynamics of an Empirical Model

Two options:

- model the dynamics correctly

- use a dynamic specification that eliminates serial correlation of errors

Is it possible/likely to model dynamics correctly?

© Thomas Plümper 2017 - 2018 285

Simple Dynamics: Lagged Dependent Variable, ARMA, VAR

LDV: include a lagged dependent variable

y_it = β x_it + ρ y_i,t−1 + ε_it

LDV and FE

y_it = β x_it + ρ y_i,t−1 + u_i + ε_it

LDV, FE and Period FE (Beck and Katz)

y_it = β x_it + ρ y_i,t−1 + u_i + p_t + ε_it

ARMA (p, q) (Box and Jenkins 1976): Autoregressive moving average models: AR = lagged dependent variable of order p, MA = lagged error term of order q; can be used univariate or multivariate; estimated by maximum likelihood.

y_t = ρ_1 y_{t−1} + ρ_2 y_{t−2} + … + ρ_p y_{t−p} + Σ_{k=1}^{K} β_k x_kt + ε_t

ε_t = ν_t + θ_1 ν_{t−1} + θ_2 ν_{t−2} + … + θ_q ν_{t−q}

© Thomas Plümper 2017 - 2018 286

VAR: Vector auto-regression model (for stationary time series): simultaneous equations modelling – more that 1 endogenous variable. All endogenous variables are modelled as function of own lags and lags of other endogenous variables plus some exogenous variables (X).

y_1t = α_1 + ρ_111 y_1,t−1 + ρ_121 y_2,t−1 + ρ_112 y_1,t−2 + ρ_122 y_2,t−2 + Σ_{k=1}^{K} β_k x_kt + ε_1t

y_2t = α_2 + ρ_211 y_1,t−1 + ρ_221 y_2,t−1 + ρ_212 y_1,t−2 + ρ_222 y_2,t−2 + Σ_{k=1}^{K} β_k x_kt + ε_2t

Note: lags tend to be correlated with one another -> reduced efficiency

The above models assume identical dynamic processes for all regressors.
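A minimal Stata sketch of these common dynamic specifications, assuming a panel with unit id, time variable year, depvar y and regressor x (all names hypothetical):

xtset id year
reg y L.y x                    // pooled LDV model
xtreg y L.y x, fe              // LDV plus unit fixed effects
xtreg y L.y x i.year, fe       // LDV plus unit and period fixed effects
* for a single tsset time series, ARMA and VAR models can be fit with, e.g.,
* arima y x, ar(1) ma(1)   and   var y1 y2, lags(1/2)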

© Thomas Plümper 2017 - 2018 287

Interpretation of LDV Models

The interpretation of the LDV as measure of time-persistency is misleading:

The LDV captures the average dynamic effect; this can be shown with Cochrane-Orcutt and distributed lag models.

LDV models assume that effects are monotonically increasing at a diminishing rate if ρ<1 and an increasing rate if ρ>1. Needless to say that the increase is constant if ρ=1.

Accordingly, with LDV the effect of x on y after k periods is

β x_it + ρ β x_it + ρ² β x_it + … + ρ^k β x_it

Note that the long-term effect of x on y can be significant even if the beta is not.

© Thomas Plümper 2017 - 2018 288

Prais-Winsten Models

Prais-Winsten models the serial correlation in the error term – regression results for the X variables are more straightforwardly interpretable:

y_it = β x_it + ε_it   with   ε_it = ρ ε_i,t−1 + ν_it

The VC matrix of the error term is

        | 1          ρ          ρ²         …   ρ^(T−1) |
        | ρ          1          ρ          …   ρ^(T−2) |
  Ω  =  | ρ²         ρ          1          …   ρ^(T−3) |
        | …          …          …          …   …       |
        | ρ^(T−1)    ρ^(T−2)    ρ^(T−3)    …   1       |

The matrix is stacked for N units. Diagonals are 1. Prais-Winsten is estimated by GLS. It is derived from the AR(1) model for the error term. The first observation is not dropped (unlike LDV models).

© Thomas Plümper 2017 - 2018 289

Interpretation:

Instead of just computing y_it − ρ y_i,t−1 as in LDV models, Prais-Winsten models also compute x_it − ρ x_i,t−1. This usually gives larger coefficients, but often similar effects.

The ρ comes from an auxiliary regression:

ε_it = ρ ε_i,t−1 + ν_it

A Cochrane-Orcutt transformation is applied for observations t = 2, …, n:

y_it − ρ y_i,t−1 = β (x_it − ρ x_i,t−1) + ν_it

while the transformation for t = 1 is

√(1 − ρ²) y_1 = β √(1 − ρ²) x_1 + √(1 − ρ²) ε_1
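A minimal Stata sketch for a tsset time series with depvar y and regressor x (all names hypothetical):

tsset year
prais y x                      // Prais-Winsten GLS with AR(1) errors; keeps the first observation
prais y x, corc                // Cochrane-Orcutt variant; drops the first observation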

© Thomas Plümper 2017 - 2018 290

Distributed Lag Model

The simplest form is Cochrane-Orcutt – the dynamic structure of all independent variables is captured by one parameter, either in the error term or as an LDV. If dynamics are that simple, LDV or Prais-Winsten is fine and saves degrees of freedom. Problem: if theory predicts different lags for different right-hand side variables, then a misspecified model necessarily leads to bias. Test down – start with a relatively large number of lags for potential candidates:

y_it = β_1 x_it + β_2 x_i,t−1 + β_3 x_i,t−2 + … + β_(n+1) x_i,t−n + ε_it

Testing down does not need to give a correct final model.

© Thomas Plümper 2017 - 2018 291

Error Correction Models

ECM were originally developed for non-stationary data but are equally useful for stationary time series data, especially for ‘long-memoried’ time series. ECM have the advantage of estimating both long-term and short-term effects at the same time. This allows consideration of theories where the dynamic effects include both short-term shocks and longer equilibrium forces.

The bivariate single equation ECM looks as follows:

ΔY_t = α_0 + α_1 Y_{t−1} + β_0 ΔX_t + β_1 X_{t−1} + ε_t

Current changes in Y are a function of current changes in X (the first difference of X) and the degree to which the two series are outside of their equilibrium in the previous time period. Specifically, beta_0 captures any immediate effect that X has on Y, described as a contemporaneous or short-term effect. The coefficient beta_1 reflects the equilibrium effect of X on Y. It is the causal effect that occurs over future time periods, often referred to as the long-term effect that X has on Y. Finally, the long-term effect occurs at a rate dictated by the value of alpha_1.
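A minimal Stata sketch of this single-equation ECM for a tsset series (variable names hypothetical):

tsset year
reg D.y L.y L.x D.x            // coefficient on D.x = beta0 (short run), on L.x = beta1, on L.y = alpha1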

© Thomas Plümper 2017 - 2018 292

More Complicated Stuff: GARCH Models

Model for Dynamics that Affect Error Variance

The variance of the disturbance term is time dependent.

Estimation: squared forecasted residuals

Autoregressive conditional heteroskedasticity (ARCH) in the residuals (Engle 1982). This particular specification of heteroskedasticity was motivated by the observation that in many financial time series, the magnitude of residuals appeared to be related to the magnitude of recent residuals. ARCH in itself does not invalidate standard LS inference. However, ignoring ARCH effects may result in a loss of efficiency. ARCH is typically present when T is very large – high-frequency data (daily or weekly). ARCH might be a problem even though there is no autocorrelation left in the error term, e.g. when variables are differenced. ARCH models: Engle (1982); GARCH models: Bollerslev (1986).

Variants: E-Garch (Exponential Garch), T-Arch (Threshold Arch), Arch-M (Arch in Mean).

© Thomas Plümper 2017 - 2018 293

GARCH Estimation

ARCH and GARCH models do not only estimate the conditional mean function, but also a conditional variance function: the variance is a function of the size of prior unanticipated innovations. The GARCH model (Bollerslev 1986) also includes lagged values of the conditional variance – the GARCH (m, k) model is given by:

y_it = β x_it + ε_it

σ²_it = γ_0 + γ_1 ε²_i,t−1 + … + γ_m ε²_i,t−m + δ_1 σ²_i,t−1 + … + δ_k σ²_i,t−k

Stability condition: Sum of ARCH and GARCH terms has to be <=1 to avoid explosive variance processes.
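A minimal Stata sketch for a tsset time series with depvar y and regressor x (all names hypothetical):

tsset year
arch y x, arch(1) garch(1)     // GARCH(1,1): variance depends on the lagged squared residual and the lagged variance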

© Thomas Plümper 2017 - 2018 294

Variants

ARCH-M: ARCH-in-Mean models allow the conditional variance or the conditional standard deviation of the series to influence the conditional mean: this is particularly convenient for modelling the risk/return relationship in financial series: the riskier an investment, c.p., the higher its expected return. Asymmetric ARCH models: E-GARCH and T-GARCH allow a different impact of positive and negative innovations in the error term.

T-GARCH:

σ²_t = γ_0 + γ_1 ε²_{t−1} + γ_2 D_{t−1} ε²_{t−1} + δ σ²_{t−1}

where D_{t−1} = 1 for negative innovations and 0 otherwise.

E-GARCH:

log σ²_t = γ_0 + γ_1 (ε_{t−1}/σ_{t−1}) + γ_2 |ε_{t−1}/σ_{t−1}| + δ log σ²_{t−1}

In all these cases ARCH is treated merely as a nuisance, but it is possible to include substantive explanatory variables on the right-hand side of the conditional variance equation.

© Thomas Plümper 2017 - 2018 295

Example: Dependence of Monetary Policy on Monetary Policy in Key Currency Countries

Dependent variable: changes of real interest rates of non-EMU EU countries (Den, Swe, UK)

                                                   Model 5         Model 7
                                                   unweighted      trade weighted
Mean Equation:
Intercept                                          -0.041*         -0.039*
                                                   (0.022)         (0.022)
Δ Real Interest Rate Germany, 80-90                 0.060           0.087
                                                   (0.108)         (0.113)
Δ Real Interest Rate Germany/Euro-zone, 90-94       0.045           0.038
                                                   (0.081)         (0.079)
Δ Real Interest Rate Euro-zone, 94-99               0.243***        0.247***
                                                   (0.092)         (0.092)
Δ Real Interest Rate Euro-zone, 99-02               0.356***        0.355***
                                                   (0.106)         (0.103)
Δ Real Interest Rate Euro-zone, 02-05               0.494***        0.615***
                                                   (0.099)         (0.103)
Δ Real Interest Rate USA, 80-90                    -0.054          -0.049
                                                   (0.049)         (0.044)
Δ Real Interest Rate USA, 90-94                     0.132**         0.125**
                                                   (0.058)         (0.057)
Δ Real Interest Rate USA, 94-99                     0.068           0.066
                                                   (0.052)         (0.048)
Δ Real Interest Rate USA, 99-02                     0.024           0.018
                                                   (0.021)         (0.021)
Δ Real Interest Rate USA, 02-05                     0.027*          0.013
                                                   (0.016)         (0.020)
MA 1 (ε_{t−1})                                     -0.016          -0.021
                                                   (0.040)         (0.040)
Variance Equation:
Intercept                                           0.0004          0.0003
                                                   (0.001)         (0.001)
ARCH 1 (ε²_{t−1})                                   0.062***        0.063***
                                                   (0.014)         (0.015)
GARCH 1 (σ²_{t−1})                                  0.936***        0.935***
                                                   (0.012)         (0.013)

© Thomas Plümper 2017 - 2018 296

‘Solutions’ to Non-stationary Processes

- do not generalize in T.

- difference the data (be aware: this changes the hypothesis tested)

- Co-Integration

© Thomas Plümper 2017 - 2018 297

Co-Integration Relation

[Figure: time series of govcons and spend, 1960–2000]

Co-integration explains the common variation between two time series.

© Thomas Plümper 2017 - 2018 298

Summary

                   dynamics       number of dynamic   trend         dynamic      structural
                                  parameters                        variables    breaks
LDV                homogenous     1                   de-trending   none         no control
period FE          none           0                   de-trending   none         exogenous
Prais-Winsten      homogenous     1                   de-trending   none         no control
Distributed Lag    heterogenous   >1                  de-trending   none         no control
Error Correction   homogenous     2                   de-trending   yes          partly endogenous
GARCH              homogenous     2                   de-trending   none         endogenous

© Thomas Plümper 2017 - 2018 299

Conclusion

The number of dynamic specifications is high, but ALL these specifications are econometric patches – social scientists do not even try to model dynamics correctly (theoretically informed).

The quality (appropriateness) of these patches is usually judged by the elimination of serial correlation.

It is likely that these econometric patches eliminate too much of the dynamics in the model. That is: dynamics caused by the substantive factors are eliminated by the dynamic specifications, too.

© Thomas Plümper 2017 - 2018 300

Chapter 10: Temporal Heterogeneity

© Thomas Plümper 2017 - 2018 301

What is Temporal Heterogeneity?

- change in parameter size over time

“Social systems change over time. Even relatively stable societies look different when we revisit them after a period of, say, 40, 50, or 60 years. In the 1960s, some US states still insisted on different compartments for white and black bus passengers; in the 2010s, the US had its first black president. Leading positions in business and society were almost exclusively held by men, while the Nordic countries will probably be the first to reach gender parity in a few years. In 1954 Germany won the football world championship, while in 2014… Well, some things never change.”

© Thomas Plümper 2017 - 2018 302

Theories of Structural Change

Theories of structural change have relied on cultural, religious, economic, scientific and technological forces, and on ideas.

Heraclian theories suggest that change is endemic. Societies can only prevail by changing all the time.

Hegelian and Marxist theories suggest that change occurs in response to a ‘conflict’ between antagonistic factors: ideas in Hegel’s understanding, social classes in Marx’s perspective.

Darwinist (evolutionary) theories suggest that change occurs randomly (like mutations) and only survives if it proves to be advantageous.

With the exception of Marxist and Hegelian theories, which predict sudden, abrupt and comprehensive changes, most theories perceive change as the rule and an ongoing process. Changes are frequent, evolutionary and slow rather than revolutionary.

© Thomas Plümper 2017 - 2018 303

Examples

For example, international relations scholars have argued that the end of the Cold War increased the global system’s propensity for militarized conflict which was previously kept in check by the existence of two opposing superpowers (Baldwin 1995).

Many social scientists believe that the digital revolution will change the causal relation between economic growth and demand for labour.

© Thomas Plümper 2017 - 2018 304

Structural Change and Causal Inference

“Temporal heterogeneity in cause-effect relationships cannot be observed, it needs to be inferred and crucially depends on the estimation model and not on observable changes in outcomes. Even the set of causal mechanisms may change. As society evolves, some causal mechanisms may disappear while others are born. The various ‘logics of social interaction’ do not persist forever, at least not with the same strengths. The possibility and likelihood of change in causal mechanisms is one of the fundamental differences between the social and the physical world.”

© Thomas Plümper 2017 - 2018 305

Modi

- structural break - evolutionary change (trend) - temporary crisis

© Thomas Plümper 2017 - 2018 306

Example 1: Evolutionary Process

Table 15: Trended Effect Test

                               m1            m2
                               baseline      trended effect of GDP p.c.     rho
alcohol consumption            -0.117*       -0.109*                        0.926
                               (0.0139)      (0.0131)
lung cancer mortality          -1.785*       -1.856*                        0.954
                               (0.316)       (0.303)
external cause mortality       -5.153*       -5.250*                        0.920
                               (0.255)       (0.269)
GDP p.c.                        0.103*        0.167*
                               (0.00571)     (0.0143)
GDP p.c. * year trend                        -0.00167*
                                             (0.000409)
constant                        75.28*        74.54*
                               (0.411)       (0.382)
R2                              0.853         0.857
period fixed effects            yes           yes

Note: Dependent variable is life expectancy at birth. N=1,328. Robust standard errors in parentheses. * statistically significant at .01 level.

© Thomas Plümper 2017 - 2018 307

Example 2: Government Spending over Time

Garrett/Mitchell: Typically, parameter heterogeneity is ignored, despite numerous theoretical arguments claiming (in our case) that left policies have changed over time and that governments dominated by left parties no longer implement a Keynesian macroeconomic policy.

How to estimate? Interact x with period dummies, e.g. xtreg y x*period1 x*period2 … x*periodT, and then do a Chow-test for the differences between the period estimates (see the sketch below).
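A minimal Stata sketch, assuming a panel with unit id, time variable year, depvar y, regressor x, and a variable period identifying sub-periods (all names hypothetical):

xtset id year
xtreg y c.x##i.period, fe      // main effect of x plus period-specific deviations of the slope
testparm c.x#i.period          // joint (Chow-type) test that the slope of x is constant across periods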

© Thomas Plümper 2017 - 2018 308

                               (1)              (2)              (3)
Left Cabinet Portfolios        -0.0356          -0.0031          -0.0355
                               (0.0042) ***     (0.0023)         (0.0041) ***
Left 1966-1970                                   0.0204           0.0198
                                                (0.0046) ***     (0.0045) ***
Left 1971-1975                                   0.0343           0.0334
                                                (0.0044) ***     (0.0042) ***
Left 1976-1980                                   0.0417           0.0406
                                                (0.0047) ***     (0.0047) ***
Left 1981-1985                                   0.0451           0.0459
                                                (0.0050) ***     (0.0051) ***
Left 1986-1990                                   0.0445           0.0444
                                                (0.0051) ***     (0.0051) ***
Left 1991-1994                                   0.0408           0.0384
                                                (0.0069) ***     (0.0066) ***
Christian Democrat             -0.0078          -0.0261          -0.0183
Portfolios (CDEM)              (0.0040) *       (0.0064) ***     (0.0063) ***
CDEM 1966-1970                                   0.0079           0.0024
                                                (0.0059)         (0.0058)
CDEM 1971-1975                                   0.0142           0.0099
                                                (0.0081) *       (0.0077)
CDEM 1976-1980                                   0.0304           0.0261
                                                (0.0071) ***     (0.0068) ***
CDEM 1981-1985                                   0.0108           0.0120
                                                (0.0090)         (0.0088)
CDEM 1986-1990                                   0.0113           0.0119
                                                (0.0078)         (0.0078)
CDEM 1991-1994                                   0.0157           0.0198
                                                (0.0079) **      (0.0079) **

© Thomas Plümper 2017 - 2018 309

[Figure: conditional effect of LEFT and CDEM on government spending, by period (61-65 to 91-94)]

© Thomas Plümper 2017 - 2018 310

Example 3: Structural Break: The End of Cold War on Life Expectancy

                                       baseline      shock or structural break
                                                     in Eastern Europe
GDP p.c.                               0.103*
                                      (0.00571)
GDP p.c. (other country years)                       0.102*            0.950
                                                    (0.00562)
GDP p.c. (1990-1996 post-Soviet)                     0.792*            0.000
                                                    (0.220)
1990-1996 post-Soviet dummy                         -2.784*
                                                    (0.601)

© Thomas Plümper 2017 - 2018 311

Example 4: Monetary Policy Autonomy

Because the Euro was phased in gradually, the specification distinguishes five time periods:

In July 1990, the EMU countries fully liberalized capital accounts vis-à-vis each other. In January 1994, central banks of the EMU began to coordinate and harmonize interest rate policies more closely. In January 1999, the EMU countries fixed their exchange-rate and introduced the Euro. In January 2002 the Euro became the only means of payment in all EMU countries.

© Thomas Plümper 2017 - 2018 312

Dependent variable: changes of real interest rates of non-EMU EU countries (Den, Swe, UK)

                                                   Model 1        Model 2
Intercept                                          -0.047**       -0.169
                                                   (0.022)        (0.134)
Level of Real Interest Rate (DNK, SWE, UK)          0.021***       0.042***
                                                   (0.008)        (0.012)
Δ Real Interest Rate Germany, 80-90                 0.017         -0.028
                                                   (0.105)        (0.109)
Δ Real Interest Rate Germany/Euro-zone, 90-94       0.088          0.066
                                                   (0.079)        (0.077)
Δ Real Interest Rate Euro-zone, 94-99               0.267***       0.264***
                                                   (0.090)        (0.090)
Δ Real Interest Rate Euro-zone, 99-02               0.355***       0.350***
                                                   (0.107)        (0.106)
Δ Real Interest Rate Euro-zone, 02-05               0.540***       0.627***
                                                   (0.097)        (0.105)
Exchange rate towards DM/EURO                                      0.026
                                                                  (0.018)
Growth (Den, Swe, UK)                                              0.013*
                                                                  (0.007)
Growth Germany/Eurozone                                           -0.006
                                                                  (0.006)
Unemployment (Den, Swe, UK)                                       -0.024**
                                                                  (0.012)
Chi²-test difference of emu-coef 80-90=99-02        5.10           6.31
(p>Chi²)                                           (0.024)        (0.012)
Chi²-test difference of emu-coef 90-94=99-02        4.03           4.70
(p>Chi²)                                           (0.045)        (0.030)
Chi²-test difference of emu-coef 90-94=02-05       13.00          18.49
(p>Chi²)                                           (0.0003)       (0.000)
Chi²-test difference of emu-coef 94-99=02-05        4.27           6.90
(p>Chi²)                                           (0.039)        (0.009)
N                                                   906            900

© Thomas Plümper 2017 - 2018 313

Estimation

Interaction of substantive variable with period dummies:

Evolutionary Change and Temporary Shock

y_it = β_1 x_it + β_2 (x_it · p_1) + β_3 (x_it · p_2) + … + β_k (x_it · p_{k−1}) + ε_it

where p_1 .. p_{k−1} indicate periods.

Structural Break

y_it = β_1 x_it + β_2 (x_it · p) + ε_it

where p indicates all periods after (or before) the break.

See Chow-test…

© Thomas Plümper 2017 - 2018 314

Conclusion

Temporal Heterogeneity is important:

Failure to control for temporal heterogeneity biases estimates of dynamics.

Biased estimates of dynamics bias all other coefficients.

© Thomas Plümper 2017 - 2018 315

Chapter 11: Causal Heterogeneity

© Thomas Plümper 2017 - 2018 316

Heterogeneity as Limit to Poolability?

Some (especially qualitative) researchers argue that quantitative analyses do not work properly, because “cases are too different”.

In principle, the argument holds that different cases should not be included in a single estimation because of unit heterogeneity.

What is unit heterogeneity?

© Thomas Plümper 2017 - 2018 317

Unit Heterogeneity, Causal Heterogeneity, Temporal Heterogeneity

For econometricians, unit heterogeneity means that the expected residuals differ across units given a certain model. Unit heterogeneity is thus a consequence of model misspecification.

© Thomas Plümper 2017 - 2018 318

Alternative Perspectives: Causal Heterogeneity

Causal heterogeneity implies that effect strengths differ across units and/or periods.

Causal heterogeneity is a real phenomenon, but not an observable phenomenon.

Clearly: estimates of causal heterogeneity differ across models with different specification.

© Thomas Plümper 2017 - 2018 319

Types of Heterogeneity

- effect strengths - conditionalities - dynamics - lags and leads - trends - spatial dependence

And how social scientists discuss heterogeneity:

- omitted time-invariant variables

© Thomas Plümper 2017 - 2018 320

Textbook Solutions

1. Different intercepts: first difference models, fixed effects model

2. Different coefficients: random coefficient model, SUR model, interaction effects

3. Time-dependent slopes: interaction of time with the variable; interaction of a period dummy with the variable

4. Different lag structures: no textbook solution available

5. Different dynamics: no textbook solution available

© Thomas Plümper 2017 - 2018 321

When ‘Solutions’ Determine the Choice of Problems: Fixed Effects

According to econometric textbooks, fixed effects models eliminate unobserved heterogeneity.

© Thomas Plümper 2017 - 2018 322

Unobserved Heterogeneity

Whether this is right or wrong depends on the definitions of ‘eliminate’ and ‘unobserved heterogeneity’:

What fixed effects models do is to eliminate all cross-sectional variation (between variation):

- this eliminates bias that results from strictly time-invariant omitted effects

- it does not eliminate bias that results from time-varying omitted effects

- it eliminates all between-variation, not just the endogenous between-variation

© Thomas Plümper 2017 - 2018 323

Fixed Effects Models are Biased

- in the presence of omitted time-varying variables

- in the presence of omitted time-invariant variables with time-varying effects

- in the presence of any other form of model misspecification (most importantly: dynamic misspecification)

© Thomas Plümper 2017 - 2018 324

The Time-Invariant Variables Case for Fixed Effects is Flawed

Justification for fixed effects is the existence of unobservable time-invariant factors:

- culture - geography - institutions - genetic predisposition

Which of the above variables is time-invariant?

© Thomas Plümper 2017 - 2018 325

None of the above factors are asymptotically time-invariant, because of

- tectonic plate movement (quite slow) - cultural change (which is quite rapid) - change in institutions, change in the function of institutions (also quite rapid)

© Thomas Plümper 2017 - 2018 326

Time-invariant Variables have time-varying Effects

Time invariant variables do not justify fixed effects models.

Only time-invariant effects could possibly justify fixed effects models.

But effects are not time-invariant…

Consider the influence of distance on trade…

Depends on transportation technology, tariffs, infrastructure, …

© Thomas Plümper 2017 - 2018 327

The Simple Maths of Fixed Effects

y_it = Σ_{k=1}^{K} β_k x_kit + u_i + ε_it        (1)

u_i is the unit-specific effect; it is time-invariant.

The fixed effects transformation can be obtained by first averaging equation (1) over T (z_mi denotes time-invariant variables):

ȳ_i = Σ_{k=1}^{K} β_k x̄_ki + Σ_{m=1}^{M} γ_m z_mi + ε̄_i + u_i

with ȳ_i = (1/T) Σ_{t=1}^{T} y_it,   x̄_ki = (1/T) Σ_{t=1}^{T} x_kit,   ε̄_i = (1/T) Σ_{t=1}^{T} ε_it

© Thomas Plümper 2017 - 2018 328

K yit y i   k x kit  x ki   it   i  u i  u i k1 K yxi t   k ki t   i t k1

1 NN    FEX i 'X i   X i ' y i  i 1   i 1  1 NTNT     xit 'x it    x it ' y it  i 1 t  1   i  1 t  1 
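A minimal Stata sketch of the within transformation by hand, assuming a panel with unit id, time variable year, depvar y and regressor x (all names hypothetical):

xtset id year
xtreg y x, fe                  // within (fixed effects) estimator
egen ybar = mean(y), by(id)    // unit-specific means
egen xbar = mean(x), by(id)
gen ydm = y - ybar             // demeaned variables
gen xdm = x - xbar
reg ydm xdm                    // reproduces the FE slope (standard errors differ by a df correction)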

© Thomas Plümper 2017 - 2018 329

Properties of the FE Model

What it does: - drops all (endogenous and exogenous) between-variation from the estimation - thus eliminates bias from unobserved time-invariant variables with constant effects

What it does not: - Does not control away other problems of unit heterogeneity: unobserved time varying variables, slope heterogeneity, unit specific dynamics and lag structures, omitted time-invariant variables with time-varying effects

Small Sample Properties: depend on share of between-variation to total variation: the smaller the worse the small sample properties of the FE estimator.

© Thomas Plümper 2017 - 2018 330

FE does not Just Eliminate Unobserved Time-Invariant Heterogeneity

[Figure: kernel density of fixed effects; black: true FE, grey: estimated FE]

© Thomas Plümper 2017 - 2018 331

What is FE good for?

Overall, FE biases inferences in favor of explanatory variables that have lots of within-variation and little between-variation.

It is also a viable estimator and an alternative to FD models when theories predict short-term adjustments.

However, both FE and FD estimators analyze absolute changes. It is often more appropriate to analyze relative changes. But this is a theoretical question.

Interpretation: Deviations from the unit mean in x explain deviations from the unit mean in y.

This is NOT the effect of x on y unless level effects do not exist.

No doubt, FE models are largely overused.

© Thomas Plümper 2017 - 2018 332

Random Effects Model

E(ε_it | x_it, u_i) = 0

E(u_i | x_i) = E(u_i) = 0

y_it = Σ_{k=1}^{K} β_k x_kit + ω_it   with   ω_it = u_i + ε_it

β̂_RE = ( Σ_{i=1}^{N} X_i' Ω̂⁻¹ X_i )⁻¹ ( Σ_{i=1}^{N} X_i' Ω̂⁻¹ y_i )

        | σ²_u + σ²_ε   σ²_u          σ²_u          …   σ²_u        |
        | σ²_u          σ²_u + σ²_ε   σ²_u          …   σ²_u        |
  Ω  =  | σ²_u          σ²_u          σ²_u + σ²_ε   …   σ²_u        |
        | …             …             …             …   …           |
        | σ²_u          σ²_u          σ²_u          …   σ²_u + σ²_ε |

© Thomas Plümper 2017 - 2018 333

Properties of the RE Model

Minor efficiency gains relative to pooled OLS, potentially major efficiency gains compared to FE.

Does not account for omitted variables; it just moves the time-invariant residuals into a normal distribution.

© Thomas Plümper 2017 - 2018 334

Hausman-Test

The Hausman test tests whether two models give statistically different results, not whether one should use FE instead of RE.

- Test for differences in FE and RE estimates - False Logic: since the RE estimator is biased if unit specific effects are correlated with any of the RHS variables, differences between FE and RE estimates are interpreted as evidence against the random effects assumption of strict exogeneity.

Assumes FE is unbiased (which is the case if and only if omitted time-invariant effects are the only model misspecification). However, variants of the Hausman test allow for serial correlation, non-stationarity, heteroskedasticity and other violations of GM assumptions.

Problem: one needs to know the model misspecification…

Probability in Real Data: 0
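A minimal Stata sketch, assuming a panel with unit id, time variable year, depvar y and regressor x (all names hypothetical):

xtset id year
xtreg y x, fe
estimates store fe
xtreg y x, re
estimates store re
hausman fe re                  // tests whether the FE and RE coefficients differ systematically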

© Thomas Plümper 2017 - 2018 335

FE or RE?

Is that the question???

Clark, T.S. and Linzer, D.A., 2015. Should I use fixed or random effects?. Political Science Research and Methods, 3(02), pp.399-408.

“Our simulations reveal that the Hausman test is not a reliable tool for identifying bias in typically-sized samples; nor does it aid in evaluating the balance of bias and variance implied by the two modeling approaches. As we point out, “testing” for bias in the random effects model implicitly assigns infinite weight to bias at the expense of any possible benefits due to variance reduction. We see no reason why one should not be willing to accept some degree of bias in the parameter estimates if it is accompanied by a sufficient gain in efficiency.”

Perhaps RE or pooled OLS is the question… usually, these estimators give very similar estimates. RE just has a better reputation among researchers who do not fully understand the differences between FE, FD, RE and pooled OLS.

© Thomas Plümper 2017 - 2018 336

First Difference Models

Econometrically, first differencing has the same effects, advantages and problems as FE.

Again: does your theory predict level effects or effects in changes?

K yit y it1   k x kit  x kit1   it   it1   u i  u i k1 K

 yxi t   k  ki t   i t k1

First differencing removes the unit specific effects ui.

FE and FD estimates are identical when T = 2; otherwise they differ.
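A minimal Stata sketch of the first-difference estimator using time-series operators (all names hypothetical):

xtset id year
reg D.y D.x                    // first differencing removes the unit-specific effects u_i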

© Thomas Plümper 2017 - 2018 337

Fixed Effects with Dynamic Misspecification

Bias in fixed effects models can result, inter alia, from omitted time-varying variables, from omitted trends, a misspecified lag structure, and other – more complex – dynamic misspecifications. Since social scientists often rely on standard dynamic specifications rather than on explicitly modelling the dynamics, bias may be reduced, but is unlikely to disappear.

(Plümper and Troeger 2017)

© Thomas Plümper 2017 - 2018 338

Table 1: Bias over all Experiments

Econometric Specification       Bias: pooled OLS               Bias: FE
                                mean     min      max          mean     min      max
No Dynamics                     0.377    0.035    0.665        0.620    0.060    1.118
LDV                             0.315    0.009    0.753        0.580    0.048    1.133
Arellano-Bond (A-B)             -        -        -            0.597    0.000    1.393
Prais-Winsten GLS               0.547    0.028    1.321        0.612    0.042    1.182
Period Fixed Effects            0.335    0.007    0.662        0.563    0.001    1.116
LDV + Period Fixed Effects      0.316    0.001    0.749        0.546    0.000    1.131
A-B + Period Fixed Effects      -        -        -            0.569    0.000    1.386
ADL                             0.295    0.044    0.487        0.473    0.002    1.006

© Thomas Plümper 2017 - 2018 339

Findings

The fixed effects estimator is consistent in the presence of omitted variables with time-invariant effects. It is not consistent in the presence of dynamic misspecification.

The fixed effects estimator deals with one problem and one problem only: its consistency depends on the strong assumption of the strict absence of any specification error other than omitted constant variables with effects that are entirely independent of time.

Dynamic misspecification does not merely render the fixed effects model biased. Instead we demonstrate in this article that the fixed effects estimator amplifies the bias from dynamic misspecification relative to estimators that do not shelter the estimation from the between-variation.

The increase of bias from dynamic misspecification potentially reaches the point where the combined bias from omitted time-invariant variables and dynamic misspecification of OLS estimates becomes smaller than the bias of the fixed effects model from dynamic misspecification alone.

© Thomas Plümper 2017 - 2018 340

In General

These results have rather general implications for econometric research: Misspecifications of the empirical model are not necessarily additive so that solving one problem does not strictly improve the overall performance of the estimator. Quite the contrary is true: Model misspecifications interact with each other so that accounting for one problem by an econometric solution may actually exacerbate the overall bias and therefore increase the probability of wrong inferences.

© Thomas Plümper 2017 - 2018 341

Chapter 12: Spatial Dependence

© Thomas Plümper 2017 - 2018 342

What is Spatial Dependence?

From an econometric point of view, spatial dependence is called spatial correlation and means the correlation of residuals across space.

From a substantive perspective, spatial dependence implies that units of observation are not independent of each other: the outcome in unit i depends on the outcome in unit j (spatial-y) or on determinants in unit j (spatial-x).

© Thomas Plümper 2017 - 2018 343

Substance versus Nuisance

Treating spatial interdependence as “nuisance” – spatial dependence is relegated to the stochastic component of the regression model and standard error estimates are corrected to account for non-spherical disturbances (OLS with PCSE)

Treating the interdependence as “substance” – spatial dependence is modelled using spatial lags as right-hand side variables. Studies using this approach often use theoretically informed spatial weights (e.g. trade, distance) to generate spatial lags and test hypotheses about strategic interdependence or the diffusion of economic policies.

- It is better not to confine the spatial dependence to the error term, because it is likely to be ignored when it comes to interpreting the regression results. If spatial dependence is overlooked, studies will be biased toward finding domestic, internal factors to be more important than international, external ones.

© Thomas Plümper 2017 - 2018 344

Three Variants

- spatial-error models: the outcome in unit i depends on a spatial error structure (usually the assumption is that not the outcome but the errors are spatially correlated)

- spatial-y models: the outcome in unit i depends on the outcome in unit -i (or j)

- spatial-x models: the outcome in unit i depends on determinants in unit -i.

© Thomas Plümper 2017 - 2018 345

Modelling Spatial Lags

© Thomas Plümper 2017 - 2018 346

© Thomas Plümper 2017 - 2018 347

Spatial-x and Spatial-ε

© Thomas Plümper 2017 - 2018 348

Example 1: Tax Competition

© Thomas Plümper 2017 - 2018 349

© Thomas Plümper 2017 - 2018 350

Example 2: Counterterrorism

© Thomas Plümper 2017 - 2018 351

Estimation

Spatial-y and spatial-ε models suffer from endogeneity:

If y_i depends on y_j, then y_j is also likely to depend on y_i.

The endogeneity problem can be solved within a simple estimation (i.e. OLS) that uses instruments for the endogenous spatial lag. Instruments are (a rare situation!) known: instrument y_j by x_j.

For most dependent variables, spatial ML estimators exist.

Spatial ML tends to give extremely similar point estimates with lower standard errors, that is: spatial ML is more efficient than spatial IVLS (or S-OLS).

Spatial-x estimates do not (at least not obviously) suffer from endogeneity.

© Thomas Plümper 2017 - 2018 352

The Endogeneity Problem and the Size of the Bias

Endogeneity – simultaneity bias

Y_1 = \beta_1 X_1 + \rho_{12} Y_2 + \varepsilon_1   (1)
Y_2 = \beta_2 X_2 + \rho_{21} Y_1 + \varepsilon_2   (2)

The left-hand side of (1) is on the right-hand side of (2) and vice versa: country 1 affects country 2, but country 2 also affects country 1.

The size of the bias can be derived analytically: OLS estimates of the interdependence parameter \rho are inflated by a term proportional to \rho_{21}\,\mathrm{Var}(\varepsilon_1), relative to the variance of the spatial lag (which involves \mathrm{Var}(\varepsilon_1) and \mathrm{Var}(\varepsilon_2)). This in turn induces an attenuation bias in the estimate of \beta_1.

Typically OLS estimates of spatial lag models will tend to over-estimate the importance of interdependence and underestimate the importance of other factors.

© Thomas Plümper 2017 - 2018 353

OLS estimates that ignore interdependence – that omit spatial lags – will suffer the converse omitted-variable bias, which is equal to:

\hat{\beta}_1 = \beta_1 + \frac{\rho_{12}\,\rho_{21}\,\beta_1}{1 - \rho_{12}\,\rho_{21}}

© Thomas Plümper 2017 - 2018 354

The Causal Mechanism is in the Spatial-Weight

In spatial econometrics, W refers to the matrix that weights the value of the spatially lagged variable of other units. As unimportant as it may appear, W specifies, or at least ought to specify, why and how other units of analysis affect the unit under observation.

© Thomas Plümper 2017 - 2018 355

Geography as the Origin of Spatial Dependence

Anselin:

Spatial dependence is present “whenever correlation across cross-sectional units is non-zero, and the pattern of non-zero correlations follows a certain spatial ordering”.

Neumayer and Plümper

Geographical proximity is not the causal mechanism that causes spatial dependence; contact (or interaction) is. Space is not only “more than geography” (Beck et al. 2006): spatial dependence is clearly not caused by geography, proximity and contiguity themselves. Rather, spatial dependence is caused by contact, connections, transactions, interactions, and relations. Employing geographical proximity thus rests on nothing more than the functionalist assumption that proximity is correlated with contact intensity or contact frequency. A-theoretical connectivity variables such as geographical proximity therefore typically cannot provide insights into the true causal mechanism of spatial dependence and are often ineligible for the purpose of testing theories of spatial dependence.

© Thomas Plümper 2017 - 2018 356

The Standard Spatial Model

 wikt yXy     it kt it it , (1) k  wikt k

where iN1,2,.., , tT1,2,.., , kN1,2,.. . Notation is standard so that yit is the value of the dependent variable in unit i at time t, and

 wikt y  kt (2) k  wikt k

is a row-standardized spatial lag variable, X it is a vector of unit specific variables influencing yit , and it is an identically and independently distributed (i.i.d.) error process.
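For cross-sectional data that have already been spset, Stata's Sp commands offer one way to estimate this model by maximum likelihood. A minimal sketch (variable names illustrative):

* build a row-standardized contiguity matrix and estimate the spatial-y model by ML
spmatrix create contiguity W, normalize(row)
spregress y x1 x2, ml dvarlag(W)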

© Thomas Plümper 2017 - 2018 357

Parameter

The spatial autoregression parameter \rho represents the estimated degree of spatial dependence. The spatial effect variable (2) consists of the product of two elements. The first element is the NT×NT block-diagonal row-standardized spatial weighting matrix W, which measures the relative connectivity between N units i (call them recipients of the spatial stimulus) and N units k (call them senders of the spatial stimulus) in T time periods in the off-diagonal cells of the matrix, as represented by the connectivity variable w_{ikt}, which takes on strictly non-negative values only (Anselin 2002: 258).

© Thomas Plümper 2017 - 2018 358

The Spatio-Temporal Lag Model

The spatio-temporal-lag model in matrix notation:

y = \rho W y + \phi M y + X\beta + \varepsilon

where

- y is an NT×1 vector of cross sections stacked by periods
- \rho is the spatial autoregressive coefficient
- W is an NT×NT block-diagonal spatial-weighting matrix
- Wy is the spatial lag; it directly reflects the dependence of each unit i’s outcome on unit j’s outcome
- \phi is the temporal autoregressive coefficient
- M is an NT×NT matrix with ones on the minor diagonal and zeros elsewhere
- My is simply a (first-order) temporal lag

© Thomas Plümper 2017 - 2018 359

The conditional likelihood function for the spatio-temporal-lag model, which assumes the first observation in each unit to be non-stochastic, is a straightforward extension of the standard spatial-lag likelihood function, which, in turn, adds only one mathematically and conceptually small complication (albeit a computationally intense one) to the likelihood function for the standard linear-normal model (OLS). To see this, start by rewriting the spatial-lag model with the stochastic component on the left:

\varepsilon = y - \rho W y - \phi M y - X\beta

Assuming i.i.d. normality, the likelihood function for ε is then just the typical linear-normal one:

which, in this case, will produce a likelihood in terms of y as follows:

This still resembles the typical linear-normal likelihood, except that the transformation from \varepsilon to y is not by the usual factor 1, but by the Jacobian determinant |I - \rho W|. Written in (N×1) vector notation, the spatio-temporal-model conditional likelihood is most conveniently separable into parts:

where

© Thomas Plümper 2017 - 2018 360

The unconditional (exact) likelihood function, which retains the first time-period observations as non-predetermined, is more complicated:

where

When T is small, the first observation contributes greatly to the overall likelihood, and scholars should use the unconditional likelihood to estimate the model. In other cases, the more compact conditional likelihood is acceptable for estimation purposes.

One easy way to ameliorate or even eliminate the simultaneity problem with S-OLS is to temporally lag the spatial lag. To the extent that time-lagging renders the spatial lag pre-determined – that is, to the extent that spatial interdependence does not occur instantaneously, where instantaneous here means within an observation period, given the model – the S-OLS bias disappears.

© Thomas Plümper 2017 - 2018 361

Provided that the spatial-interdependence process does not operate within an observational period but only with a time lag, and also that spatial and temporal dynamics are sufficiently modeled to prevent that problem arising via measurement/specification error, OLS with a temporally lagged spatial-lag on the RHS is a simple and effective estimation strategy without simultaneity bias.
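A minimal sketch of this strategy, assuming panel data with a unit identifier id, a year variable, and an already generated row-standardized spatial lag splag_y (all names illustrative):

xtset id year
gen L_splag_y = L.splag_y                  // temporally lagged spatial lag
reg y L.y L_splag_y x, vce(cluster id)     // or xtpcse / xtreg, fe, as appropriate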

© Thomas Plümper 2017 - 2018 362

Implicit Assumptions of the Anselin Approach

1. By row-standardizing W – each w_{ikt} is divided by \sum_k w_{ikt}, the row sum of connectivities – the assumption of homogeneous total exposure to spatial stimulus is imposed across all recipient subjects.

2. Equation (1) assumes that spatial dependence is uni-dimensional.

3. By requiring w_{ikt} to take on strictly non-negative values only and by estimating one coefficient for one single spatial lag variable, specification (1) assumes that spatial dependence is uni-directional.

4. By assuming the weight is either contiguity or inverse distance, the standard model lacks a theoretical underpinning of the causal mechanism. Distance is not a causal mechanism, but may at best be correlated with the true causal mechanism.

© Thomas Plümper 2017 - 2018 363

a) Row-Standardization

Spatial econometricians find it convenient to ‘row-standardize’ the weighting matrix. It is a convention that is ‘typically’ (Anselin 2002: 257), ‘commonly’ (Franzese & Hays 2006: 174), ‘generally’ (Darmofal 2006: 8), or ‘usually’ (Beck et al. 2006: 28) followed.

© Thomas Plümper 2017 - 2018 364

Row-Standardization

For each row of the matrix, each cell is divided by its row sum – weights in each row add up to one – the spatial lag becomes a weighted average of the spatially lagged dependent variable in other units. (spatial lag = same metric as the DV)

Stability: sum of coefficients of spatial LDV and temporal LDV < 1

No row standardization – spatial lag is a weighted sum of the spatially lagged dependent variable

Row standardization changes the relative influence of other units on the spatial effect – changes estimation results

Functional form of the W-Matrix? – changes estimation results

Claim: Row-standardization ‘normalizes’ the spatial weight for all observations.

© Thomas Plümper 2017 - 2018 365

Consider a contiguity matrix for two countries, the Netherlands and Germany.

The Netherlands has two neighbouring countries: Germany and Belgium. Germany has nine neighbouring countries: Denmark, the Netherlands, Belgium, Luxembourg, France, Switzerland, Austria, the Czech Republic, and Poland.

Row-standardization gives each neighbour a weight of 0.5 in the case of the Netherlands and of 0.11 in the case of Germany – so that the sum of all contiguities is 1.0 in both cases.
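A minimal sketch of row-standardization by hand, assuming dyadic (long) data with one row per recipient i, sender k and year, a raw connectivity variable w, and the sender's outcome y_k (all names illustrative):

bysort i year: egen rowsum = total(w)      // row sum of connectivities
gen w_std = w / rowsum                     // weights in each row sum to one
gen wy = w_std * y_k                       // weighted contribution of each sender
collapse (sum) splag_y = wy, by(i year)    // row-standardized spatial lag of y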

© Thomas Plümper 2017 - 2018 366

Consequences of Row-Standardization

Row-standardization takes out all level effects from the connectivity matrix – for each recipient i the sum of connectivities to all sources k equals 1. Row-standardization thus imposes the assumption that the total exposure to the spatial stimulus is equal for all units i. It implies that if two different recipients are linked to the same senders but one has barely any connectivity to senders and the other is strongly connected to them, they will end up with the exact same row-standardized spatial stimulus (same value of the spatial effect variable). We call this homogeneity of total exposure to spatial stimulus.

© Thomas Plümper 2017 - 2018 367

Issues of Row-Standardization

Table 1: The Homogeneous Total Exposure Assumption of Row-Standardization

         raw connectivities w_ik                 row-standardized w_ik
      k1    k2    k3    k4    k5            k1    k2    k3    k4    k5
i1    0.7   1.1   0.8   1.4   1.0           0.14  0.22  0.16  0.28  0.20
i2    7     11    8     14    10            0.14  0.22  0.16  0.28  0.20

© Thomas Plümper 2017 - 2018 368

Table 2: Adding Further Contacts Reduces the Spatial Weight of Each One

         raw connectivities w_ik                 row-standardized w_ik
      k1    k2    k3    k4    k5            k1    k2    k3    k4    k5
i1    0     0     0     1     1             0.00  0.00  0.00  0.50  0.50
i2    1     0     0     1     1             0.33  0.00  0.00  0.33  0.33

© Thomas Plümper 2017 - 2018 369

Kelejian and Prucha (2010):

“… [I]n row-normalizing a matrix one does not use a single normalization factor, but rather a different factor for the elements of each row. Therefore, in general, there exists no corresponding re-scaling factor for the autoregressive parameter that would lead to a specification that is equivalent to that corresponding to the un-normalized weight matrix. Consequently, unless theoretical issues suggest a row-normalized weight matrix, this approach will in general lead to a misspecified model.”

© Thomas Plümper 2017 - 2018 370

Let the Data do the Talking

Neumayer and Plümper (2012) demonstrate that one can test whether a row-standardized spatial effect becomes stronger as the total exposure to spatial stimuli increases across subjects. This is possible with a model in which the row-standardized spatial effect variable is interacted with a measure of exposure z_{it}:

y_{it} = \rho_1 \sum_k \frac{w_{ikt}}{\sum_k w_{ikt}}\, y_{kt} + \rho_2 \left( \sum_k \frac{w_{ikt}}{\sum_k w_{ikt}}\, y_{kt} \right) z_{it} + \rho_3 z_{it} + \beta X_{it} + \varepsilon_{it}   (3)
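A minimal sketch of specification (3), assuming the row-standardized spatial lag splag_y and an exposure measure z (e.g. the row sum of raw connectivities) already exist at the unit-year level (names illustrative):

gen splag_y_x_z = splag_y * z
reg y splag_y splag_y_x_z z x, vce(cluster id)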

© Thomas Plümper 2017 - 2018 371

b) Relative Relevance

What determines a) which other cases have an influence? b) the relative influence of cases?

Determining the relative relevance of sources is a broader specification issue, not only influenced by whether or not to row-standardize W. Its starting point is considering whether any of the potentially sending subjects k are entirely irrelevant for recipient subject i under observation. If so, this results in the value of zero for the cell in W representing the link between subject i and subject k.1

1 Units of observation i that are not linked to any other units k create a problem for row-standardized spatial effect variables since one cannot divide by zero.

© Thomas Plümper 2017 - 2018 372

Uncertainty

If researchers are uncertain whether the spatial effect coming from the group deemed to be irrelevant is actually zero, they can estimate the following specification (we show all specifications without row-standardization):

y_{it} = \rho_1 \sum_k w^1_{ikt}\, y_{kt} + \rho_2 \sum_k w^2_{ikt}\, y_{kt} + \beta X_{it} + \varepsilon_{it},   (4)

where w^2_{ikt} = 1 if w^1_{ikt} = 0 and w^2_{ikt} = 0 if w^1_{ikt} > 0. For the case in which w^1_{ikt} is a dichotomous variable, this simplifies to

w^2_{ikt} = 1 - w^1_{ikt}.

This does not solve the ‘functional form’ issue…

© Thomas Plümper 2017 - 2018 373

Semi-parametric Estimation of Unknown Functional Forms

Given this under-specification problem, a semi-parametric approach represents a promising alternative. One divides one’s connectivity variable into several categories, creating separate dummy variables for each category. For example, for distance one would create separate dummies for bands of distance, e.g., from 0 to 1,000 kilometres, 1,001 to 2,000 kilometres, etc. One then creates separate spatial effect variables, one for each of the categories. This will allow the strength of spatial stimulus to vary flexibly across the range of the connectivity variable rather than imposing a particular functional form.
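A minimal sketch of the distance-band approach, assuming dyadic (long) data with a distance variable dist and the sender's outcome y_k (names illustrative; extend the bands as needed):

gen band1 = (dist <= 1000)
gen band2 = (dist > 1000 & dist <= 2000)
gen band3 = (dist > 2000 & dist <= 3000)
forvalues b = 1/3 {
    gen wy`b' = band`b' * y_k              // contribution of senders within band `b'
}
collapse (sum) splag1 = wy1 splag2 = wy2 splag3 = wy3, by(i year)
* merge splag1-splag3 back into the estimation data and enter them as separate regressors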

© Thomas Plümper 2017 - 2018 374

c) Dimensionality

Anselin-type spatial models assume that connectivity is uni-dimensional (e.g. inverse distance).

Yet, connectivity can be multi-dimensional. Sometimes, theory will require multi-dimensional connectivity if several causal mechanisms exist that transmit spatial stimulus from sources to recipients.

Multiple dimensions of connectivity can represent links between i and k that are independent of each other, substitutive for each other or conditional on each other. Multiple dimensions of connectivity that are truly independent of each other – that is, neither substitutive for each other nor conditional on each other – are probably rare since even different causal mechanisms may not be entirely independent.

© Thomas Plümper 2017 - 2018 375

Options

Perfect substitutes

y_{it} = \rho \sum_k \left( w^1_{ikt} + w^2_{ikt} + w^3_{ikt} \right) y_{kt} + \beta X_{it} + \varepsilon_{it}.

Perfect independence

y_{it} = \rho_1 \sum_k w^1_{ikt}\, y_{kt} + \rho_2 \sum_k w^2_{ikt}\, y_{kt} + \rho_3 \sum_k w^3_{ikt}\, y_{kt} + \beta X_{it} + \varepsilon_{it},

Imperfect substitutes

y_{it} = \rho \sum_k \tilde{w}_{ikt}\, y_{kt} + \beta X_{it} + \varepsilon_{it}, where \tilde{w}_{ikt} is a principal component of w^1_{ikt}, w^2_{ikt} and w^3_{ikt}.
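A minimal sketch of the 'imperfect substitutes' option, assuming dyadic data with three connectivity variables w1, w2, w3 and the sender's outcome y_k (names illustrative):

pca w1 w2 w3
predict w_pc1, score                        // first principal component of the connectivities
gen wy = w_pc1 * y_k
collapse (sum) splag_pc = wy, by(i year)    // spatial lag built from the principal component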

© Thomas Plümper 2017 - 2018 376

Multiplicative 1 12 yit w ikt  w ikt y kt   X it   it k

Multiplicative 2

1 1 2 2 3 1 2 yit w iktkt y    w iktkt y    w iktkt y   w iktkt y   X it   it k k k k

© Thomas Plümper 2017 - 2018 377

d) Directionality

With few exceptions (see, for example, Brooks and Kurtz 2012), analyses of spatial dependence assume that spatial effects are uni-directional.

For all senders and all recipients, the spatial stimulus that emanates from relevant senders k onto the recipient i is assumed to be in the same direction – either consistently positive or consistently negative – for relevant senders and zero for irrelevant senders.

In reality, however, the stimulus from sub-group k1 of relevant senders can be in the opposite direction of the stimulus coming from sub-group k2 of relevant senders. Moreover, the sub-groups k1 and k2 can be different for different groups of recipients and, in the extreme case, even be different for each recipient i.

© Thomas Plümper 2017 - 2018 378

Examples

- pollution

- military spending

- taxation?

© Thomas Plümper 2017 - 2018 379

Estimation

y1 w 1 y   2 w 2 y   X   itik12 t kt ik t kt it it with strictly positive connectivity variables kk12

yit w ikty kt    X it   it with positive and negative connectivity variables. k

© Thomas Plümper 2017 - 2018 380

Example: Military Spending in Alliances

- analyses the free-riding-in-alliances hypothesis from a spatial perspective
- usually, it is assumed that allies ‘free-ride’ if they do not spend the same share of GDP on the military as the USA does
- this implies that Denmark has the same geostrategic interests as the USA

- the spatial approach avoids this flawed assumption
- however, it merely analyses elasticities: if free-riding occurs already in year 1, then perfect correlation of elasticities (which suggests no free-riding) would merely reflect constant free-riding

- thus, difficult to interpret…

© Thomas Plümper 2017 - 2018 381

Table 2. Estimation results for entire period 1956 to 1988.

Country-specific response of:                 to US growth           to Soviet growth (if in excess of US growth)
Canada                                        0.121** (0.0477)       -0.274** (0.115)
Great Britain                                 -0.0659 (0.0601)       -0.296*** (0.0820)
Netherlands                                   -0.183*** (0.0581)     0.181** (0.0704)
Belgium                                       -0.281*** (0.0446)     0.119* (0.0626)
France                                        -0.128*** (0.0285)     -0.121 (0.0897)
Portugal                                      0.336*** (0.0411)      0.782*** (0.0718)
West Germany                                  -0.0786 (0.0524)       0.712*** (0.0708)
Italy                                         0.0583* (0.0293)       0.103 (0.0942)
Greece                                        -0.0669 (0.0918)       0.192* (0.0946)
Norway                                        0.0233 (0.0250)        0.484*** (0.0699)
Denmark                                       -0.0568 (0.0479)       0.431*** (0.0697)
Turkey                                        -0.402*** (0.0437)     0.210 (0.162)

Lagged dependent variable                     0.0635 (0.0469)
GDP growth                                    0.723* (0.386)
Intensity of armed conflict involvement       0.000650 (0.00613)
Initial level of military spending to GDP     0.00963 (0.00651)
Linear year trend                             0.000263 (0.000348)
Spatial error term                            0.555*** (0.0981)
Constant                                      -0.542 (0.686)

Observations                                  395
R-squared                                     0.305

© Thomas Plümper 2017 - 2018 382

[Scatter plot: country-specific response to a change in Soviet military spending (vertical axis) against the response to a change in US military spending (horizontal axis); countries labelled POR, GER, NOR, DNK, BEL, FRA, TUR, ITA, GBR, CAN; the lower-left region is marked ‘free-riding’.]

© Thomas Plümper 2017 - 2018 383

Summary

- spatial analysis is the quintessence of social science: the assumption that units of analysis are independent of each other does not make much sense in the social sciences

- still, spatial analyses remain relatively rare, and where they are conducted, they often model spatial dependence as a function of inverse distance or contiguity

- this is a good proxy at best and a really bad one at worst (e.g. in research on tax competition)

© Thomas Plümper 2017 - 2018 384

Chapter 13: The Analysis of Dyadic Data

© Thomas Plümper 2017 - 2018 385

Chapter 14: Effect Strengths and Cases in Quantitative Research

© Thomas Plümper 2017 - 2018 386

What do Social Scientists want to know about ‘Cases’?

Case studies are often viewed with some circumspection: A work that focuses its attention on a single example of a broader phenomenon is apt to be described as a “mere” case study.

At the same time, social scientists continue to produce a vast number of case studies. Judging by recent scholarly output, the case study method retains considerable appeal, even among scholars in research communities not traditionally associated with this style of research—e.g., among political economists and quantitatively inclined political scientists.

© Thomas Plümper 2017 - 2018 387

Gerring on Case Studies

A case study is an intensive study of a single unit for the purpose of understanding a larger class of (similar) units.

© Thomas Plümper 2017 - 2018 388

What is a Case?

In quantitative research, a case is one unit of analysis. Cases are sometimes distinguished from observations, which are cases observed at a particular point in time.

In qualitative research, a case is usually an event: e.g. the Cuban Missile Crisis, the Second World War, a political reform.

© Thomas Plümper 2017 - 2018 389

What are Case Studies interested in?

Descriptive Inference

- historical accuracy - process tracing

Causal Inference

- causal mechanisms (?)

(Pseudo-) Statistical Inference

- is the case representative? - if not, why not?

© Thomas Plümper 2017 - 2018 390

Are Case Studies always Descriptive or Inductive?

Not necessarily, but most qualitative methodologists think so…

Gerring: It should be clear that the affinity between case study research and descriptive inference does not denigrate the possibility of causal analysis through case study research, of which one might cite many illustrious examples. Indeed, the discussion that follows is primarily concerned with propositions of a causal nature. My point is simply that it is easier to conduct descriptive work than to investigate causal propositions while working in a case study mode.

© Thomas Plümper 2017 - 2018 391

Gerring on Qualitative versus Quantitative Approaches

Note that Gerring’s perception of quantitative research is limited. And the above table is inconsistent…

© Thomas Plümper 2017 - 2018 392

Qualitative Research Designs

- single case study
- outlier analysis (but how do we know whether a case is an outlier or not?)

- comparative case studies
- process-tracing

QCA (Ragin)

© Thomas Plümper 2017 - 2018 393

Process-Tracing

(qualitative time-series analysis)

The method works by extracting all of the observable implications of a theory, rather than merely the observable implications regarding the dependent variable. Once these observable implications are extracted (particularly with reference to the microfoundations of how a theory's independent variable causes the predicated change in the dependent variable) they are then tested empirically, often through the method of elite interviews but also often through other rigorous forms of data analysis.

It is often used to complement comparative case study methods. By tracing the causal process from the independent variable of interest to the dependent variable, it may be possible to rule out potentially intervening variables in imperfectly matched cases. This can create a stronger basis for attributing causal significance to the remaining independent variables.

© Thomas Plümper 2017 - 2018 394

QCA

In statistics, qualitative comparative analysis (QCA) is a data analysis technique for determining which logical conclusions a data set supports. The analysis begins with listing and counting all the combinations of variables observed in the data set, followed by applying the rules of logical inference to determine which descriptive inferences or implications the data supports. The technique was originally developed by Charles Ragin in 1987.

The technique is based on the binary logic of Boolean algebra, and attempts to maximize the number of comparisons that can be made across the cases under investigation, in terms of the presence or absence of characteristics (variables) of analytical interest. Thus, for example, 18 cases (say, nation-states) involving 7 independent variables (presence or absence of economic recession, of an external threat to state security, and so on) might be examined in order to identify the causal factors involved in the emergence of revolutions, in this example yielding no fewer than 128 (2^7) different combinations of causal conditions. Ragin claims that the technique combines the strengths of case-oriented (qualitative) and variable-oriented (quantitative) approaches to comparative sociology. Critics argue that it allows only for logical rather than statistical representativeness; makes no allowance for missing variables or error in the data; that not all variables of interest have only two values; and that the method is therefore highly sensitive to the way in which each case must be coded in a binary fashion.

© Thomas Plümper 2017 - 2018 395

The Case in Quantitative Research

Regression analysis studies the effect of one or many variables on an outcome.

Effects are ‘mean effects’.

The sample is assumed to be ‘conditionally homogeneous’, that is, homogeneous after conditioning on the control variables.

Generalizations are from the mean case to the population, which on average is assumed to be very similar to the mean case included in the sample.

© Thomas Plümper 2017 - 2018 396

Outliers

In statistics, an outlier is an observation point that is distant from other observations.

Here is what statisticians believe:

An outlier may be due to variability in the measurement or it may indicate experimental error; the latter are sometimes excluded from the data set.

Hmm.

© Thomas Plümper 2017 - 2018 397

Causes for Outliers

- the case does not belong to the population
- other cases may not belong to the population
- the model simplifies away factors relevant to the case
- stochastic processes
- misspecified functional form
- misspecified conditionality
- misspecified temporal and spatial dependence
- measurement error
- and many more...

© Thomas Plümper 2017 - 2018 398

Does the existence of outliers indicate that the model is wrong?

© Thomas Plümper 2017 - 2018 399

Does the existence of outliers indicate that the model is wrong?

No, outliers could be cases that do not belong to the population.

In this case, dropping the case from the sample would be a good idea. In all other cases, it is not.

© Thomas Plümper 2017 - 2018 400

Outlier Detection

Definitions of what constitutes an outlier are arbitrary (like all definitions). Determining whether or not an observation is an outlier is ultimately a subjective exercise.

Detection:
- graphs, e.g. normal probability plots; others are model-based; box plots are a hybrid
- residual plots
- tests (but why would we?): Chauvenet's criterion, Grubbs' test for outliers, Dixon's Q test
- Mahalanobis distance and leverage are often used to detect outliers

© Thomas Plümper 2017 - 2018 401

The Leverage of a Case: Jackknife

Leverage is defined as the influence of a case on the estimates.

High-leverage points are those observations, if any, made at extreme or outlying values of the independent variables such that the lack of neighboring observations means that the fitted regression model will pass close to that particular observation.

Modern computer packages for statistical analysis include, as part of their facilities for regression analysis, various quantitative measures for identifying influential observations: among these measures is partial leverage, a measure of how a variable contributes to the leverage of a datum.
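A minimal sketch of standard influence diagnostics after OLS in Stata (variable names illustrative):

reg y x z
predict lev, leverage                                 // hat values
predict cook, cooksd                                  // Cook's distance
predict rstud, rstudent                               // studentized residuals
dfbeta                                                // DFBETAs for each regressor
list id lev cook if cook > 4/e(N) & !missing(cook)    // flag influential observations
jackknife: reg y x z                                  // jackknife (drop-one-observation) standard errors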

© Thomas Plümper 2017 - 2018 402

Leverage Plot: EU and Counterterrorist Regulation

© Thomas Plümper 2017 - 2018 403

Stata’s Margins

Margins are statistics calculated from predictions of a previously fit model at fixed values of some covariates and averaging or otherwise integrating over the remaining covariates.

The margins command estimates margins of responses for specified values of covariates and presents the results as a table.

Capabilities include estimated marginal means, least-squares means, average and conditional marginal and partial effects (which may be reported as derivatives or as elasticities), average and conditional adjusted predictions, and predictive margins.

© Thomas Plümper 2017 - 2018 404

What are ‘Margins’?

Cameron & Trivedi note (p. 333): “An ME [marginal effect], or partial effect, most often measures the effect on the conditional mean of y of a change in one of the regressors, say Xk . In the linear regression model, the ME equals the relevant slope coefficient, greatly simplifying analysis. For nonlinear models, this is no longer the case, leading to remarkably many different methods for calculating MEs.”

© Thomas Plümper 2017 - 2018 405

Stata on Margins Plot

© Thomas Plümper 2017 - 2018 406

Example

© Thomas Plümper 2017 - 2018 407

© Thomas Plümper 2017 - 2018 408

© Thomas Plümper 2017 - 2018 409

Stata Margins Command

predict(pred_opt)      estimate margins for predict, pred_opt
expression(pnl_exp)    estimate margins for pnl_exp
dydx(varlist)          estimate marginal effects of variables in varlist
eyex(varlist)          estimate elasticities of variables in varlist
dyex(varlist)          estimate semielasticity: d(y)/d(ln x)
eydx(varlist)          estimate semielasticity: d(ln y)/d(x)
continuous             treat factor-level indicators as continuous

At
at(atspec)             estimate margins at specified values of covariates
atmeans                estimate margins at the means of covariates
asbalanced             treat all factor variables as balanced

© Thomas Plümper 2017 - 2018 410

Examples

at(atspec) specifies values for covariates to be treated as fixed.

at(age=20) fixes covariate age to the value specified. at() may be used to fix continuous or factor covariates.

at(age=20 sex=1) simultaneously fixes covariates age and sex at the values specified.

at(age=(20 30 40 50)) fixes age first at 20, then at 30, .... margins produces separate results for each specified value.

at(age=(20(10)50)) does the same as at(age=(20 30 40 50)); that is, you may specify a numlist.

at((mean) age (median) distance) fixes the covariates at the summary statistics specified. at((p25) _all) fixes all covariates at their 25th percentile values. See Syntax of at() for the full list of summary-statistic modifiers.

at((mean) _all (median) x x2=1.2 z=(1 2 3)) is read from left to right, with latter specifiers overriding former ones. Thus all covariates are fixed at their means except for x (fixed at its median), x2 (fixed at 1.2), and z (fixed first at 1, then at 2, and finally at 3).

at((means) _all (asobserved) x2) is a convenient way to set all covariates except x2 to the mean.

© Thomas Plümper 2017 - 2018 411

© Thomas Plümper 2017 - 2018 412

© Thomas Plümper 2017 - 2018 413

Effect Strengths Analysis and the Delta Method

Plümper, T. and Neumayer, E., 2009. Famine mortality, rational political inactivity, and international food aid. World Development, 37(1), pp.50-61.

© Thomas Plümper 2017 - 2018 414

Abstract

Famine mortality is preventable by government action and yet some famines kill. We develop a political theory of famine mortality based on the selectorate theory of Bueno de Mesquita et al. (2002, 2003). We argue that it can be politically rational for a government, democratic or not, to remain inactive in the face of severe famine threat. We derive the testable hypotheses that famine mortality is possible in democracies, but likely to be lower than in autocracies. Moreover, a larger share of people being affected by famine relative to population size together with large quantities of international food aid being available will lower mortality in both regime types, but more so in democracies.

© Thomas Plümper 2017 - 2018 415

Summary Stats of Used Variables

summarize faminekill2 civilconflictcorr fhpolrightsreversed gdpcwb popwb_log foodlevel famineratio3 demfhpolrightsfoodratio autocfhpolrightsfoodratio populationdensity_interpol ann wateravailabletowithdrawal

Variable       |   Obs    Mean       Std. Dev.   Min        Max
---------------+------------------------------------------------------
faminekill2    |   2553   986.385    30958.57    0          1500000
civilconfl~d   |   2433   .5532265   1.043731    0          3
fhpolright~d   |   2553   3.513122   1.927548    1          7
gdpcwb         |   2553   1342.755   1616.713    49.32309   14966.05
popwb_log      |   2553   15.57016   1.782878    10.62133   20.73906
foodlevel      |   2553   95.34919   239.1446    0          2484.82
famineratio3   |   2553   .0132152   .0706952    0          .9935824
demfhpolri~o   |   2553   .4328263   5.515633    0          154.4546
autocfhpol~o   |   2553   2.353005   28.23845    0          1070.147
population~l   |   2553   .8892949   1.402885    .01        11.1
ann            |   2553   1265.95    824.778     32.2       4231
wateravail~l   |   2423   393.833    1942.599    .5121621   17002.67

© Thomas Plümper 2017 - 2018 416

Density Plot

[Histogram of (mean) faminekill2: frequency on the vertical axis against famine deaths (0 to 1,500,000) on the horizontal axis; the distribution is heavily concentrated at zero.]

© Thomas Plümper 2017 - 2018 417

Tab

tabstat faminekill2 if faminekill2>0, by(year) stats(mean v n)

year    |   mean        variance    N
--------+---------------------------------
1972    |   65000       .           1
1973    |   65000       .           1
1974    |   1500000     .           1
1978    |   63          .           1
1979    |   18          .           1
1980    |   15          .           1
1982    |   280         .           1
1983    |   260         115200      2
1984    |   83410       5.81e+09    3
1985    |   151500      4.41e+10    2
1986    |   84          .           1
1987    |   230         64800       2
1988    |   125183.5    3.12e+10    2
1989    |   2718.5      1.23e+07    2
1991    |   316         67712       2
1992    |   857         1442043     3
1997    |   460         .           1
1998    |   14769.4     9.56e+08    5
1999    |   29          .           1
2000    |   114         1682        2
--------+---------------------------------
Total   |   71949.74    6.67e+10    35

© Thomas Plümper 2017 - 2018 418

Results

nbreg faminekill2 civilconflictcorr fhpolrightsreversed gdpcwb popwb_log foodlevel famineratio3 demfhpolrightsfoodratio autocfhpolrightsfoodratio populationdensity_interpol ann wateravailabletowithdrawal, robust nolog

Negative binomial regression          Number of obs = 2399
Dispersion = mean                     Wald chi2(11) = 787.98
Log pseudolikelihood = -472.96941     Prob > chi2 = 0.0000
                                      Pseudo R2 = 0.0752

                             |              Robust
faminekill2                  |      Coef.   Std. Err.       z    P>|z|    [95% Conf. Interval]
-----------------------------+-----------------------------------------------------------------
civilconflictcorrected       |   1.490012   .4052419      3.68   0.000     .6957528   2.284272
fhpolrightsreversed          |  -.8264418   .228656      -3.61   0.000    -1.274599  -.3782843
gdpcwb                       |  -.0036521   .0004099     -8.91   0.000    -.0044554  -.0028488
popwb_log                    |   1.460211   .2813867      5.19   0.000     .9087028   2.011718
foodlevel                    |    -.00031   .0011931     -0.26   0.795    -.0026484   .0020284
famineratio3                 |   72.58015   13.53947      5.36   0.000     46.04328   99.11701
demfhpolrightsfoodratio      |  -.1180083   .0284593     -4.15   0.000    -.1737875   -.062229
autocfhpolrightsfoodratio    |  -.0220031   .0068696     -3.20   0.001    -.0354672   -.008539
populationdensity_interpol   |  -.1762986   .3954648     -0.45   0.656    -.9513954   .5987982
ann                          |  -.0016039   .0004254     -3.77   0.000    -.0024377  -.0007701
wateravailabletowithdrawal   |  -.0156225   .0057209     -2.73   0.006    -.0268353  -9.179818e-03
_cons                        |   -17.8548   4.426095     -4.03   0.000    -26.52979  -9.179818
-----------------------------+-----------------------------------------------------------------
/lnalpha                     |   5.672947   .1792588                       5.321606   6.024287
alpha                        |   290.8904   52.14468                       204.7123    413.347

© Thomas Plümper 2017 - 2018 419

nbreg faminekill2 civilconflictcorr polity2 gdpcwb popwb_log foodlevel famineratio3 polity2d_famineratio3_foodlevel polity2a_famineratio3_foodlevel populationdensity_interpol ann wateravailabletowithdrawal, robust cluster(id) nolog

Negative binomial regression          Number of obs = 2304
Dispersion = mean                     Wald chi2(11) = 507.22
Log pseudolikelihood = -474.73906     Prob > chi2 = 0.0000
                                      Pseudo R2 = 0.0691

(Std. Err. adjusted for 110 clusters in id)

                                  |              Robust
faminekill2                       |      Coef.   Std. Err.       z    P>|z|    [95% Conf. Interval]
----------------------------------+------------------------------------------------------------------
civilconflictcorrected            |   1.276352   .7677006      1.66   0.096    -.2283136   2.781018
polity2                           |  -.0888768   .0892914     -1.00   0.320    -.2638848   .0861311
gdpcwb                            |  -.0041983   .0010902     -3.85   0.000     -.006335  -.0020616
popwb_log                         |   1.379723   .8422211      1.64   0.101    -.2710001   3.030446
foodlevel                         |  -.0003834   .0007686     -0.50   0.618    -.0018897    .001123
famineratio3                      |     68.187   24.1439       2.82   0.005     20.86582   115.5082
polity2d_famineratio3_foodlevel   |  -.1189438   .0608508     -1.95   0.051    -.2382092   .0003216
polity2a_famineratio3_foodlevel   |  -.0199404   .0122186     -1.63   0.103    -.0438885   .0040076
populationdensity_interpol        |  -.4792009   .451658      -1.06   0.289    -1.364434   .4060325
ann                               |  -.0015313   .0012218     -1.25   0.210     -.003926   .0008634
wateravailabletowithdrawal        |  -.0161682   .0072527     -2.23   0.026    -.0303831  -.0019532
_cons                             |  -18.05726   11.87186     -1.52   0.128    -41.32568   5.211158
----------------------------------+------------------------------------------------------------------
/lnalpha                          |   5.724045   .3041161                       5.127989   6.320102
alpha                             |   306.1409   93.10239                       168.6775   555.6297

© Thomas Plümper 2017 - 2018 420

Simplification

nbreg faminekill2 polity2 famineratio3 foodlevel, robust cluster(id) nolog

Negative binomial regression          Number of obs = 2332
Dispersion = mean                     Wald chi2(3) = 23.39
Log pseudolikelihood = -495.41363     Prob > chi2 = 0.0000
                                      Pseudo R2 = 0.0294

(Std. Err. adjusted for 116 clusters in id)

               |              Robust
faminekill2    |      Coef.   Std. Err.       z    P>|z|    [95% Conf. Interval]
---------------+-------------------------------------------------------------------
polity2        |  -.1402906   .1697868     -0.83   0.409    -.4730666   .1924853
famineratio3   |   47.53168   24.4019       1.95   0.051    -.2951594   95.35852
foodlevel      |   .0260482   .0091054      2.86   0.004     .0082019   .0438945
_cons          |  -1.277426   1.557467     -0.82   0.412    -4.330005   1.775152
---------------+-------------------------------------------------------------------
/lnalpha       |   6.343212   .2810376                       5.792389   6.894036
alpha          |     568.62   159.8036                       327.7952   986.3743

© Thomas Plümper 2017 - 2018 421

What would happen if all countries had a democracy score of 7 (borderline democracy)?

[Scatter plot: observed polity score on the vertical axis against the change in the predicted number of deaths on the horizontal axis.]

© Thomas Plümper 2017 - 2018 422

What would happen if the polity score increased by 1? (quasi-marginal-effects plot)

[Scatter plot: observed polity score on the vertical axis against the change in predicted deaths on the horizontal axis.]

© Thomas Plümper 2017 - 2018 423

Useful Options for Effect Strengths Calculation

Delta method by hand

* generate a copy of the variable of interest
gen x_original = x
* estimate the model
reg y x z
* post-estimation prediction
predict est1, xb
* change the value of the variable of interest
replace x = x_original - 1        // or: replace x = 0, or: replace x = 0.9*x_original
* post-estimation counterfactual prediction
predict est2, xb
* compute the difference
gen effect = est1 - est2
* restore the data
replace x = x_original
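The same counterfactual contrast can be obtained with delta-method standard errors via margins; a minimal sketch (values and names illustrative):

reg y x z
margins, at(x=0) at(x=1) post        // predictions at two counterfactual values of x
lincom _b[2._at] - _b[1._at]         // difference in predictions with a delta-method standard error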

© Thomas Plümper 2017 - 2018 424

Marginal Effects

- minor absolute change in x_it
- minor relative change in x_it
- change x_it to 0 (if 0 is meaningful)

Scenarios:

What if unit i had the covariates of unit j?

What if unit i had been treated? (replace treatment=0 by treatment=1)

Out-of-sample prediction (leverage test), as sketched below:
- set y_i to missing
- estimate the model
- predict y_i from the model that excludes i
- compare to y_i from the model that includes i

plots:
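A minimal sketch of the out-of-sample (leave-one-out) prediction check, with a hypothetical unit identifier id:

reg y x z
predict yhat_full                    // prediction from the model that includes all units
reg y x z if id != 17                // re-estimate without unit 17 (illustrative)
predict yhat_loo if id == 17         // out-of-sample prediction for the excluded unit
list yhat_full yhat_loo if id == 17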

© Thomas Plümper 2017 - 2018 425

Adding “if e(sample)” to the predict command eliminates out-of-sample predictions.

© Thomas Plümper 2017 - 2018 426

Interaction Effects in MLE

© Thomas Plümper 2017 - 2018 427

The Conclusion

© Thomas Plümper 2017 - 2018 428

An Example

© Thomas Plümper 2017 - 2018 429

© Thomas Plümper 2017 - 2018 430